EuroLLM-22B: Technical Report
#EuroLLM-22B #Large Language Model #European Union #Multilingual AI #arXiv #Tokenizer #Open Source AI
📌 Key Takeaways
- EuroLLM-22B is a new 22-billion-parameter model trained from scratch to support 35 languages.
- The model covers all 24 official EU languages plus 11 additional languages.
- The project addresses the historical underrepresentation and poor performance of non-English languages in open LLMs.
- Technical innovations include a specialized tokenizer designed to improve efficiency for diverse scripts and linguistic structures.
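Tokenizer efficiency across scripts is often summarized as "fertility", the average number of tokens produced per word. The sketch below is a minimal illustration of that metric and does not use EuroLLM's actual tokenizer: `byte_tokens` and `char_tokens` are hypothetical toy tokenizers, and the two sample phrases are made up for the example. It shows why a tokenizer that falls back to raw bytes costs more for Cyrillic text than for Latin text.

```python
from typing import Callable, Sequence


def byte_tokens(text: str) -> Sequence[int]:
    """Toy tokenizer (assumption, not EuroLLM's): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def char_tokens(text: str) -> Sequence[str]:
    """Toy tokenizer (assumption, not EuroLLM's): one token per character."""
    return list(text)


def fertility(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


# Illustrative sample phrases, not taken from the report.
samples = {
    "English": "good morning",
    "Bulgarian": "добро утро",
}

for language, phrase in samples.items():
    print(
        language,
        "bytes/word:", fertility(phrase, byte_tokens),
        "chars/word:", fertility(phrase, char_tokens),
    )
```

Lower fertility means fewer tokens for the same text, which in turn means more content fits in the context window and less compute is spent per sentence. Cyrillic characters take two bytes each in UTF-8, so the byte-level toy tokenizer doubles the cost of the Bulgarian phrase relative to the character-level one.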
🐦 Character Reactions (Tweets)
Linguistic Libertarian: "EuroLLM-22B: Because even AI deserves a Schengen visa for languages! 🇪🇺🗣️ #AILocalization"
Tech Sovereign: "Finally, an AI that speaks Hungarian better than my Hungarian cousin! 🇭🇺 #EuroLLM22B #AISovereignty"
AI Polyglot: "EuroLLM-22B: Making sure your AI can order coffee in Bulgarian without sounding like a robot from the future. ☕🤖 #AILanguages"
EU Tech Whisperer: "EuroLLM-22B: Because even AI needs to understand why 'the' is pronounced 'det' in Danish. 🇩🇰🤔 #AILocalization"
🏷️ Themes
Artificial Intelligence, Digital Sovereignty, Linguistics
📚 Related People & Topics
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...
Lexical analysis
Conversion of character sequences into token sequences in computer science
Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the...
European Union
Supranational political and economic union
The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe. The union has a total area of 4,233,255 km² (1,634,469 sq mi) and an estimated population of more than 450 million as of 2025. The EU is often described as a sui generis ...
📄 Original Source Content
arXiv:2602.05879v1 (announce type: cross)
Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specificat