BravenNow
Tucano 2 Cool: Better Open Source LLMs for Portuguese
| USA | technology | ✓ Verified - arxiv.org



Original Source
arXiv:2603.03543 [Submitted on 3 Mar 2026]

Title: Tucano 2 Cool: Better Open Source LLMs for Portuguese

Authors: Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner, Lucie Flek

Abstract: We present Tucano 2, a fully open suite of large language models with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains like retrieval augmented generation, coding, tool use, chain-of-thought reasoning, and many other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.03543 [cs.CL] (or arXiv:2603.03543v1 [cs....

Source

arxiv.org
