Ena
Georgian is one of the most underserved languages in modern large language model development, and the reasons are structural rather than incidental. The language uses the Mkhedruli script, a unique writing system unrelated to Latin, Cyrillic, or any other widely represented alphabet in current LLM training corpora. This creates a compounding problem that begins at the lowest level of text representation and propagates upward through every layer of model architecture and training.
The most fundamental issue is encoding. UTF-8, the universal character encoding used across virtually all modern NLP pipelines, assigns a variable number of bytes to each character depending on its Unicode code point. Latin characters fall within the ASCII range and occupy exactly one byte per character. Georgian Mkhedruli characters, however, reside in the range U+10D0 to U+10FF of the Unicode Georgian block, which places them in UTF-8's three-byte encoding range (U+0800 to U+FFFF). This means that before any tokenization or model processing occurs, a Georgian text already consumes three times the raw byte budget of an equivalent English text for the same number of characters.
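The three-byte cost is easy to verify directly; the sample words below are illustrative:

```python
# Byte widths of sample characters under UTF-8:
# "a" (ASCII) takes one byte, "é" (U+00E9) two, "ა" (U+10D0) three.
for ch in ["a", "é", "ა"]:
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")))

# Two words of equal character length differ threefold in raw bytes.
geo = "გამარჯობა"  # "hello", 9 characters
eng = "greetings"   # 9 characters
assert len(geo.encode("utf-8")) == 27
assert len(eng.encode("utf-8")) == 9
```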

This byte-level disparity cascades into tokenization. The byte-pair encoding (BPE) tokenizers used by virtually all major LLMs were trained predominantly on English and other Latin-script languages, with Georgian comprising a negligible fraction of the training corpus. Because BPE learns merge operations from frequency statistics, it builds rich, multi-character tokens for English — common words like "the" or "information" are encoded as single tokens. Georgian characters, being rare in the training data, were never merged into meaningful subword units. The result is that a single Georgian word is fragmented into five to eight tokens on average, while the equivalent English word is typically encoded in one to two tokens.
This is not merely an aesthetic problem. It means that for any given context window, a Georgian-language prompt consumes roughly four to six times more tokens than an English prompt conveying the same semantic content. Training costs scale linearly with token count. Inference latency scales with sequence length. The effective context window shrinks proportionally. A model with a 128,000-token context window can process approximately 96,000 English words but only 16,000 to 20,000 Georgian words within the same budget.
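The context-window figures follow from back-of-envelope per-word token rates; the rates themselves (roughly 1.33 tokens per English word, 6.4 to 8 per Georgian word) are illustrative assumptions consistent with the fragmentation described above:

```python
context = 128_000                 # model context window, in tokens
eng_tokens_per_word = 4 / 3       # ~1.33 tokens per English word (assumed)
geo_tokens_per_word_hi = 8.0      # worst-case Georgian fragmentation (assumed)
geo_tokens_per_word_lo = 6.4      # best-case Georgian fragmentation (assumed)

print(int(context / eng_tokens_per_word))      # ~96,000 English words
print(int(context / geo_tokens_per_word_hi))   # ~16,000 Georgian words
print(int(context / geo_tokens_per_word_lo))   # ~20,000 Georgian words
```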

The second dimension of the problem is data scarcity. Georgian is spoken by approximately 3.7 million people, and its digital footprint is correspondingly small. The CulturaX dataset on Hugging Face represents what we believe to be near-total coverage of publicly available Georgian web text, containing approximately two billion words. To put this in perspective, English training corpora for frontier models exceed ten trillion words. Georgian has roughly 0.02% of that volume. We have observed that models such as Gemini, which currently perform best on Georgian, appear to have reached a performance ceiling — likely because all gatherable web data has already been consumed by their training pipelines. Further improvement through web-scale data alone is not feasible.
The third dimension is the translation asymmetry. Existing LLMs demonstrate a consistent directional bias: they translate competently from Georgian to English but perform poorly in the reverse direction. This is a direct consequence of training data composition. The models have seen enough Georgian text to extract meaning from it, but they have not internalized enough Georgian linguistic structure — morphology, syntax, case agreement, verb conjugation — to generate fluent Georgian output. Georgian is a morphologically rich, agglutinative language with a complex verb system, and generating correct Georgian requires substantially more language-specific knowledge than comprehending it.
Our Plan
Our approach proceeds through six sequential phases, each building on the outputs of the previous one. The overall architecture is designed to address all three dimensions of the problem simultaneously: tokenizer inefficiency, data scarcity, and generation quality.

Phase 1: Data Collection and Processing
The foundation of this project is a large-scale data acquisition effort targeting Georgian text that exists outside the publicly crawled web. Over the course of several months, we have identified and begun acquiring more than 300,000 non-copyrighted works spanning books, newspapers, academic journals, short stories, and poetry. These materials represent a substantial body of Georgian literary and journalistic output that has never been digitized into machine-readable plain text.
Approximately 60% of these documents exist only as scanned images or PDFs without embedded text layers, requiring optical character recognition. We use Google Vision OCR for this task, which handles Georgian script with high accuracy due to the regularity and distinctiveness of the Mkhedruli alphabet — Georgian characters have minimal ambiguity in their glyph shapes compared to, for example, degraded Latin-script prints. The remaining 40% of documents have been exported, chunked into training-appropriate segments, and cleaned of formatting artifacts, page numbers, headers, and other non-content elements.
In addition to these publicly sourced works, several Georgian companies have voluntarily contributed internal documentation, correspondence, and domain-specific text in Georgian. This corporate data adds both volume and genre diversity to the corpus, covering registers of language — technical, business, administrative — that are underrepresented in literary and journalistic sources.
Our projected total corpus size from all these sources is approximately eight billion words. Combined with the roughly two billion words available from CulturaX, this gives us a working corpus of approximately ten billion Georgian words — a five-fold increase over what any existing model has likely been trained on.

Phase 2: Tokenizer Adaptation
This phase is technically the most critical, and it draws directly on recent research in efficient tokenizer modification for pre-trained models. We begin with a small pre-trained model in the 800-million to 2-billion parameter range — models such as Llama 3.2 1B or a comparable architecture. The first operation is vocabulary pruning, followed by vocabulary extension with Georgian tokens.
The pruning step removes tokens from the existing vocabulary that correspond to scripts and languages irrelevant to our use case. A model like Llama 3 has a 128,000-token vocabulary, but a large fraction of those tokens encode Chinese, Japanese, Korean, Arabic, Hindi, and other scripts that we will never use in a Georgian-English bilingual system. These tokens occupy embedding parameters that contribute nothing to our task — and in small models, embedding parameters constitute a disproportionate share of total parameters. For reference, Gemma 3's smallest model has 270 million parameters, of which 170 million are vocabulary embeddings.
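The embedding share is simple to reconstruct. Assuming the published Gemma 3 tokenizer vocabulary of about 262,144 tokens and the 270M model's 640-dimensional embeddings, the arithmetic lands at roughly the 170 million figure cited above:

```python
vocab_size = 262_144   # Gemma 3 tokenizer vocabulary size (assumed)
embed_dim = 640        # Gemma 3 270M embedding dimension (assumed)

embedding_params = vocab_size * embed_dim
print(embedding_params)  # 167,772,160 — roughly 170M of 270M total parameters
```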
We do not perform naive pruning by simply removing the last N tokens or the least frequent tokens globally. As demonstrated in recent work by Purason et al. (2025), naive frequency-based pruning breaks BPE merge paths and creates unreachable tokens — tokens that exist in the vocabulary but can never be produced during tokenization because the intermediate merge operations that lead to them have been removed. Instead, we employ leaf-based frequency pruning, which iteratively removes only tokens that are leaves in the BPE merge graph — tokens that do not participate as components in any other merge operation. This preserves the structural integrity of the tokenizer. Research shows that up to 62.5% of tokens can be removed this way without measurable degradation in downstream performance on the retained languages.
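The core of leaf-based pruning can be sketched in a few lines. This is a toy illustration over a bare merge list, not a production BPE implementation; the helper name and data shapes are our own:

```python
def leaf_prune(merges, freqs, n_remove):
    """Iteratively drop the lowest-frequency *leaf* tokens from a BPE merge list.

    merges: list of (left, right) pairs; each pair produces the token left+right.
    freqs:  dict mapping token -> corpus frequency.
    A leaf is a produced token that is never used as a component of another
    merge, so removing it cannot orphan any downstream token.
    """
    merges = list(merges)
    removed = []
    for _ in range(n_remove):
        components = {t for l, r in merges for t in (l, r)}
        produced = [l + r for l, r in merges]
        leaves = [t for t in produced if t not in components]
        if not leaves:
            break  # nothing left that is safe to remove
        victim = min(leaves, key=lambda t: freqs.get(t, 0))
        merges = [(l, r) for l, r in merges if l + r != victim]
        removed.append(victim)
    return merges, removed

# "th" is a component of ("th","e"), so it is never a removal candidate;
# the rare leaf "in" goes first, then "the".
merges, removed = leaf_prune(
    [("t", "h"), ("th", "e"), ("i", "n")],
    {"the": 100, "th": 50, "in": 5},
    n_remove=2,
)
print(removed)  # ['in', 'the']
```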
After pruning, we extend the vocabulary with Georgian tokens. The target composition is approximately 40% Georgian tokens in the vocabulary. Rather than training a separate Georgian tokenizer and appending its non-overlapping tokens — the naive approach that has been standard practice — we use continued BPE training. This method resumes the BPE merge learning process directly on Georgian text, starting from the existing tokenizer's merge state. The key advantage is that every new merge operation is guaranteed to be compatible with the existing merge sequence. The naive approach, by contrast, introduces tokens that look useful in isolation but are unreachable through the actual merge operations of the combined tokenizer. Across 70 languages tested, continued BPE training produced zero unreachable tokens, while the naive method left 4% to 10% of added tokens unreachable.
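The unreachable-token failure mode of naive appending can be checked structurally. Real BPE reachability also depends on merge priority during tokenization, so this simplified fixed-point check (our own sketch) only captures the case described above: a token in the vocabulary with no merge path at all:

```python
def unreachable(vocab, merges, base_chars):
    """Simplified structural reachability: a token is reachable if it is a
    base character, or some merge builds it from two reachable tokens.
    Naively appended tokens often have no such merge path at all."""
    reachable = set(base_chars)
    changed = True
    while changed:
        changed = False
        for l, r in merges:
            tok = l + r
            if tok in vocab and l in reachable and r in reachable and tok not in reachable:
                reachable.add(tok)
                changed = True
    return [t for t in vocab if t not in reachable]

# "abc" was appended to the vocabulary without the merges that would
# produce it, so no tokenization can ever emit it.
print(unreachable(["a", "b", "ab", "abc"], [("a", "b")], {"a", "b"}))  # ['abc']
```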

The practical effect of this tokenizer adaptation on Georgian text processing is substantial. Consider what happens at the byte level. With the default Llama 3 tokenizer, Georgian text achieves a compression ratio of approximately 2.64 bytes per token — meaning each token encodes on average only 2.64 bytes of Georgian text. Since each Georgian character is three bytes, this means the tokenizer is essentially splitting individual characters across multiple tokens. After pruning and extending with 16,000 Georgian tokens via continued BPE training, compression improves to approximately 4.46 bytes per token — a 69% improvement. This directly translates to shorter sequences, faster training, faster inference, and a larger effective context window for Georgian text.
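The two ratios quoted above pin down both the headline improvement and the sequence-length saving:

```python
before, after = 2.64, 4.46  # bytes per token, before and after adaptation

# Relative improvement in compression:
improvement = (after - before) / before
print(f"{improvement:.0%}")  # 69%

# A document of N bytes needs N/before tokens before and N/after tokens
# after, so sequences shrink to this fraction of their original length:
seq_ratio = before / after
print(f"{seq_ratio:.0%}")  # 59% of the original sequence length
```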
After extending the vocabulary, the model's embedding matrices must be resized. New token embeddings are initialized using Fast Vocabulary Transfer, which copies embeddings directly for tokens that existed in the original vocabulary and initializes new tokens by averaging the embeddings of their constituent sub-tokens under the original tokenizer. This provides a warm start that substantially reduces the amount of continued pre-training needed to bring the new tokens to parity.
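The initialization rule is mechanical enough to sketch. The function name and data shapes below are our own; a real implementation would operate on the model's embedding matrix rather than a dict of lists:

```python
def fvt_init(new_vocab, old_embeds, old_tokenize):
    """Fast Vocabulary Transfer-style initialization (sketch).

    old_embeds:   dict mapping old-vocabulary token -> embedding vector.
    old_tokenize: function mapping a string to its old-tokenizer tokens.
    Retained tokens copy their embedding; new tokens take the mean of the
    embeddings of their decomposition under the original tokenizer.
    """
    new_embeds = {}
    for tok in new_vocab:
        if tok in old_embeds:
            new_embeds[tok] = list(old_embeds[tok])  # direct copy
        else:
            parts = [old_embeds[p] for p in old_tokenize(tok)]
            dim = len(parts[0])
            new_embeds[tok] = [sum(v[i] for v in parts) / len(parts)
                               for i in range(dim)]
    return new_embeds

# A new token "აბ" starts at the mean of its old character embeddings.
embeds = fvt_init(["ა", "აბ"], {"ა": [1.0, 3.0], "ბ": [3.0, 5.0]}, list)
print(embeds["აბ"])  # [2.0, 4.0]
```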
Phase 3: Continued Pre-training
With the adapted tokenizer in place, we perform continued pre-training on our full bilingual corpus. The training data consists of our approximately ten billion words of Georgian text and a matched volume of English text from Fineweb, maintaining a balanced bilingual mix. This balance is essential: if Georgian dominates the training mix, the model's English capabilities degrade. If English dominates, the model fails to internalize Georgian structure deeply enough.
The model architecture remains unchanged — we use the original pre-trained weights with the modified embedding layer. Training proceeds with standard continued pre-training hyperparameters: a cosine learning rate schedule with warmup, AdamW optimizer, and sequence length of 4096 tokens. Because our adapted tokenizer compresses Georgian text more efficiently, the same ten billion words of Georgian are encoded into fewer tokens than they would have been with the original tokenizer. Research on Llama 3.2 1B with a similar setup showed a 26% reduction in total training tokens after tokenizer adaptation, translating directly to reduced GPU-hours.
Phase 4: Translation Fine-tuning
This phase exploits a specific asymmetry in existing LLM capabilities. Models like Gemini translate Georgian to English competently. This means that high-quality Georgian-to-English parallel sentence pairs can be generated at scale using existing systems. The critical insight is that a parallel corpus of Georgian-to-English translations is structurally identical to an English-to-Georgian corpus — the only difference is which side is treated as the source and which as the target.
We generate a large parallel dataset by having Gemini translate Georgian texts into English. These translations are high quality because Georgian-to-English is the direction in which current models are strong. We then take this corpus and reverse it: the English translations become the source, and the original Georgian texts become the target. We fine-tune our continued pre-trained model on this reversed corpus for English-to-Georgian translation.
This approach bypasses the fundamental bottleneck. We do not need a model that is already good at generating Georgian — we only need a model that is good at understanding Georgian, which existing models already are. The supervised fine-tuning teaches our model to produce fluent Georgian by training it on authentic Georgian text as the target output, paired with clean English input that provides unambiguous semantic grounding.
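The corpus-reversal step reduces to a direction swap. In this sketch, `translate_ka_to_en` stands in for whatever strong Georgian-to-English system is used (e.g. a Gemini API call); the function and field names are our own:

```python
def build_en_to_ka_pairs(ka_texts, translate_ka_to_en):
    """Build English->Georgian training pairs from authentic Georgian text.

    Translate each Georgian text with a strong ka->en system, then swap
    direction: the machine translation becomes the source and the authentic
    Georgian becomes the target the model learns to generate.
    """
    pairs = []
    for ka in ka_texts:
        en = translate_ka_to_en(ka)                  # strong direction
        pairs.append({"source": en, "target": ka})   # reversed for training
    return pairs

# The target side is always human-written Georgian, never model output.
pairs = build_en_to_ka_pairs(["გამარჯობა"], lambda ka: "hello")
print(pairs)  # [{'source': 'hello', 'target': 'გამარჯობა'}]
```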
Phase 5: Synthetic Data Generation
Once we have a competent English-to-Georgian translation system, we use it to scale our Georgian corpus by a factor of five to ten. The source material for this expansion is English text — high-quality books, articles, technical documents, and educational content — which exists in virtually unlimited supply. We translate this English material into Georgian using our fine-tuned model from Phase 4.
The distinction between this approach and generic synthetic data generation is important. When an LLM generates text from scratch — inventing content, reasoning, and structure entirely on its own — the output tends to converge on a narrow distribution of patterns, exhibiting repetitive phrasings, limited stylistic variety, and a well-documented tendency toward self-reinforcing errors known as model collapse. Translation, however, is fundamentally constrained by its source. The ideas, structure, argumentation, narrative, and factual content all originate from genuine human-written English text. The model's task is limited to rendering that content in Georgian. It is not inventing; it is transposing. This constraint means that the linguistic diversity of the output is bounded by the diversity of the English source material, which is vast, rather than by the model's own generative tendencies.
The resulting synthetic Georgian corpus of 50 to 100 billion words would represent an unprecedented volume of Georgian-language training data, covering domains and registers that barely exist in the natural Georgian web corpus.

Phase 6: Large Model Training
With a corpus of 60 to 110 billion Georgian words available — authentic and translated combined — alongside matched English data, we are positioned to train a substantially larger model. The specific architecture and parameter count will be determined by the quality metrics achieved in the preceding phases and the available compute budget, but the target range is a model large enough to exhibit genuine reasoning and generation capabilities in Georgian, rather than merely pattern-matching from a limited training signal.
The tokenizer adapted in Phase 2 carries forward to this stage, meaning the training efficiency gains compound: more data, better tokenized, feeding a larger model. The synthetic data from Phase 5 also provides domain coverage that Georgian has never had at scale — scientific text, technical documentation, legal language, medical terminology — all rendered in Georgian from high-quality English sources.
The Path Forward
The Georgian language faces a structural disadvantage in the current landscape of large language model development. The three-byte UTF-8 encoding of its script, combined with the near-total absence of Georgian tokens in existing BPE vocabularies, means that Georgian text is tokenized at roughly five to six times the cost of English text — consuming more context, more compute, and more training budget per word of actual content. Compounding this, the total supply of digitally available Georgian text is approximately two billion words, which appears to have already been exhausted by frontier models.
Our plan addresses this through a sequence of interventions that each independently improve the situation and collectively transform it. Tokenizer adaptation through leaf-based pruning and continued BPE training reduces the tokenization cost of Georgian by approximately 69%, bringing it closer to parity with English. A large-scale data acquisition effort targeting offline Georgian texts — books, newspapers, journals, and corporate documents — expands the available corpus by a factor of five. A translation pipeline that exploits the existing strength of LLMs in the Georgian-to-English direction, then reverses the parallel data for English-to-Georgian fine-tuning, creates a competent Georgian generator without requiring one to exist first. And constrained synthetic data generation through translation scales the corpus by another order of magnitude without the quality degradation associated with unconstrained generation.
The end result is a path from two billion words of web-crawled Georgian and a tokenizer that fragments every word into six tokens, to over sixty billion words of diverse Georgian text and a tokenizer that processes the language at close to English-level efficiency. This is what is needed to train a model that does not merely tolerate Georgian but is genuinely fluent in it.
Ena
Georgian is one of the most underserved languages in modern large language model development, and the reasons are structural rather than incidental. The language uses the Mkhedruli script, a unique writing system unrelated to Latin, Cyrillic, or any other widely represented alphabet in current LLM training corpora. This creates a compounding problem that begins at the lowest level of text representation and propagates upward through every layer of model architecture and training.
The most fundamental issue is encoding. UTF-8, the universal character encoding used across virtually all modern NLP pipelines, assigns a variable number of bytes to each character depending on its Unicode code point. Latin characters fall within the ASCII range and occupy exactly one byte per character. Georgian characters, however, reside in the Unicode block U+10D0 to U+10FF, which places them in the three-byte encoding range. This means that before any tokenization or model processing occurs, a Georgian text already consumes three times the raw byte budget of an equivalent English text for the same number of characters.

This byte-level disparity cascades into tokenization. BPE tokenizers used by all major LLMs were trained predominantly on English and other Latin-script languages, with Georgian comprising a negligible fraction of the training corpus. Because BPE learns merge operations from frequency statistics, it builds rich, multi-character tokens for English — common words like "the" or "information" are encoded as single tokens. Georgian characters, being rare in the training data, were never merged into meaningful subword units. The result is that a single Georgian word is fragmented into five to eight tokens on average, while the equivalent English word is typically encoded in one to two tokens.
This is not merely an aesthetic problem. It means that for any given context window, a Georgian-language prompt consumes roughly four to six times more tokens than an English prompt conveying the same semantic content. Training costs scale linearly with token count. Inference latency scales with sequence length. The effective context window shrinks proportionally. A model with a 128,000-token context window can process approximately 96,000 English words but only 16,000 to 20,000 Georgian words within the same budget.

The second dimension of the problem is data scarcity. Georgian is spoken by approximately 3.7 million people, and its digital footprint is correspondingly small. The CulturaX dataset on Hugging Face represents what we believe to be near-total coverage of publicly available Georgian web text, containing approximately two billion words. To put this in perspective, English training corpora for frontier models exceed ten trillion words. Georgian has roughly 0.02% of that volume. We have observed that models such as Gemini, which currently perform best on Georgian, appear to have reached a performance ceiling — likely because all gatherable web data has already been consumed by their training pipelines. Further improvement through web-scale data alone is not feasible.
The third dimension is the translation asymmetry. Existing LLMs demonstrate a consistent directional bias: they translate competently from Georgian to English but perform poorly in the reverse direction. This is a direct consequence of training data composition. The models have seen enough Georgian text to extract meaning from it, but they have not internalized enough Georgian linguistic structure — morphology, syntax, case agreement, verb conjugation — to generate fluent Georgian output. Georgian is a morphologically rich, agglutinative language with a complex verb system, and generating correct Georgian requires substantially more language-specific knowledge than comprehending it.
Our Plan
Our approach proceeds through six sequential phases, each building on the outputs of the previous one. The overall architecture is designed to address all three dimensions of the problem simultaneously: tokenizer inefficiency, data scarcity, and generation quality.

Phase 1: Data Collection and Processing
The foundation of this project is a large-scale data acquisition effort targeting Georgian text that exists outside the publicly crawled web. Over the course of several months, we have identified and begun acquiring more than 300,000 non-copyrighted works spanning books, newspapers, academic journals, short stories, and poetry. These materials represent a substantial body of Georgian literary and journalistic output that has never been digitized into machine-readable plain text.
Approximately 60% of these documents exist only as scanned images or PDFs without embedded text layers, requiring optical character recognition. We use Google Vision OCR for this task, which handles Georgian script with high accuracy due to the regularity and distinctiveness of the Mkhedruli alphabet — Georgian characters have minimal ambiguity in their glyph shapes compared to, for example, degraded Latin-script prints. The remaining 40% of documents have been exported, chunked into training-appropriate segments, and cleaned of formatting artifacts, page numbers, headers, and other non-content elements.
In addition to these publicly sourced works, several Georgian companies have voluntarily contributed internal documentation, correspondence, and domain-specific text in Georgian. This corporate data adds both volume and genre diversity to the corpus, covering registers of language — technical, business, administrative — that are underrepresented in literary and journalistic sources.
Our projected total corpus size from all these sources is approximately eight billion words. Combined with the roughly two billion words available from CulturaX, this gives us a working corpus of approximately ten billion Georgian words — a five-fold increase over what any existing model has likely been trained on.

Phase 2: Tokenizer Adaptation
This phase is technically the most critical, and it draws directly on recent research in efficient tokenizer modification for pre-trained models. We begin with a small pre-trained model in the 800-million to 2-billion parameter range — models such as Llama 3.2 1B or a comparable architecture. The first operation is vocabulary pruning, followed by vocabulary extension with Georgian tokens.
The pruning step removes tokens from the existing vocabulary that correspond to scripts and languages irrelevant to our use case. A model like Llama 3 has a 128,000-token vocabulary, but a large fraction of those tokens encode Chinese, Japanese, Korean, Arabic, Hindi, and other scripts that we will never use in a Georgian-English bilingual system. These tokens occupy embedding parameters that contribute nothing to our task — and in small models, embedding parameters constitute a disproportionate share of total parameters. For reference, Gemma 3's smallest model has 270 million parameters, of which 170 million are vocabulary embeddings.
We do not perform naive pruning by simply removing the last N tokens or the least frequent tokens globally. As demonstrated in recent work by Purason et al. (2025), naive frequency-based pruning breaks BPE merge paths and creates unreachable tokens — tokens that exist in the vocabulary but can never be produced during tokenization because the intermediate merge operations that lead to them have been removed. Instead, we employ leaf-based frequency pruning, which iteratively removes only tokens that are leaves in the BPE merge graph — tokens that do not participate as components in any other merge operation. This preserves the structural integrity of the tokenizer. Research shows that up to 62.5% of tokens can be removed this way without measurable degradation in downstream performance on the retained languages.
After pruning, we extend the vocabulary with Georgian tokens. The target composition is approximately 40% Georgian tokens in the vocabulary. Rather than training a separate Georgian tokenizer and appending its non-overlapping tokens — the naive approach that has been standard practice — we use continued BPE training. This method resumes the BPE merge learning process directly on Georgian text, starting from the existing tokenizer's merge state. The key advantage is that every new merge operation is guaranteed to be compatible with the existing merge sequence. The naive approach, by contrast, introduces tokens that look useful in isolation but are unreachable through the actual merge operations of the combined tokenizer. Across 70 languages tested, continued BPE training produced zero unreachable tokens, while the naive method left 4% to 10% of added tokens unreachable.

The practical effect of this tokenizer adaptation on Georgian text processing is substantial. Consider what happens at the byte level. With the default Llama 3 tokenizer, Georgian text achieves a compression ratio of approximately 2.64 bytes per token — meaning each token encodes on average only 2.64 bytes of Georgian text. Since each Georgian character is three bytes, this means the tokenizer is essentially splitting individual characters across multiple tokens. After pruning and extending with 16,000 Georgian tokens via continued BPE training, compression improves to approximately 4.46 bytes per token — a 69% improvement. This directly translates to shorter sequences, faster training, faster inference, and a larger effective context window for Georgian text.
After extending the vocabulary, the model's embedding matrices must be resized. New token embeddings are initialized using Fast Vocabulary Transfer, which copies embeddings directly for tokens that existed in the original vocabulary and initializes new tokens by averaging the embeddings of their constituent sub-tokens under the original tokenizer. This provides a warm start that substantially reduces the amount of continued pre-training needed to bring the new tokens to parity.
Phase 3: Continued Pre-training
With the adapted tokenizer in place, we perform continued pre-training on our full bilingual corpus. The training data consists of our approximately ten billion words of Georgian text and a matched volume of English text from Fineweb, maintaining a balanced bilingual mix. This balance is essential: if Georgian dominates the training mix, the model's English capabilities degrade. If English dominates, the model fails to internalize Georgian structure deeply enough.
The model architecture remains unchanged — we use the original pre-trained weights with the modified embedding layer. Training proceeds with standard continued pre-training hyperparameters: a cosine learning rate schedule with warmup, AdamW optimizer, and sequence length of 4096 tokens. Because our adapted tokenizer compresses Georgian text more efficiently, the same ten billion words of Georgian are encoded into fewer tokens than they would have been with the original tokenizer. Research on Llama 3.2 1B with a similar setup showed a 26% reduction in total training tokens after tokenizer adaptation, translating directly to reduced GPU-hours.
Phase 4: Translation Fine-tuning
This phase exploits a specific asymmetry in existing LLM capabilities. Models like Gemini translate Georgian to English competently. This means that high-quality Georgian-to-English parallel sentence pairs can be generated at scale using existing systems. The critical insight is that a parallel corpus of Georgian-to-English translations is structurally identical to an English-to-Georgian corpus — the only difference is which side is treated as the source and which as the target.
We generate a large parallel dataset by having Gemini translate Georgian texts into English. These translations are high quality because Georgian-to-English is the direction in which current models are strong. We then take this corpus and reverse it: the English translations become the source, and the original Georgian texts become the target. We fine-tune our continued pre-trained model on this reversed corpus for English-to-Georgian translation.
This approach bypasses the fundamental bottleneck. We do not need a model that is already good at generating Georgian — we only need a model that is good at understanding Georgian, which existing models already are. The supervised fine-tuning teaches our model to produce fluent Georgian by training it on authentic Georgian text as the target output, paired with clean English input that provides unambiguous semantic grounding.
Phase 5: Synthetic Data Generation
Once we have a competent English-to-Georgian translation system, we use it to scale our Georgian corpus by a factor of five to ten. The source material for this expansion is English text — high-quality books, articles, technical documents, and educational content — which exists in virtually unlimited supply. We translate this English material into Georgian using our fine-tuned model from Phase 4.

The second dimension of the problem is data scarcity. Georgian is spoken by approximately 3.7 million people, and its digital footprint is correspondingly small. The CulturaX dataset on Hugging Face represents what we believe to be near-total coverage of publicly available Georgian web text, containing approximately two billion words. To put this in perspective, English training corpora for frontier models exceed ten trillion words. Georgian has roughly 0.02% of that volume. We have observed that models such as Gemini, which currently perform best on Georgian, appear to have reached a performance ceiling — likely because all gatherable web data has already been consumed by their training pipelines. Further improvement through web-scale data alone is not feasible.
The third dimension is the translation asymmetry. Existing LLMs demonstrate a consistent directional bias: they translate competently from Georgian to English but perform poorly in the reverse direction. This is a direct consequence of training data composition. The models have seen enough Georgian text to extract meaning from it, but they have not internalized enough Georgian linguistic structure — morphology, syntax, case agreement, verb conjugation — to generate fluent Georgian output. Georgian is a morphologically rich, agglutinative language with a complex verb system, and generating correct Georgian requires substantially more language-specific knowledge than comprehending it.
Our Plan
Our approach proceeds through six sequential phases, each building on the outputs of the previous one. The overall architecture is designed to address all three dimensions of the problem simultaneously: tokenizer inefficiency, data scarcity, and generation quality.

Phase 1: Data Collection and Processing
The foundation of this project is a large-scale data acquisition effort targeting Georgian text that exists outside the publicly crawled web. Over the course of several months, we have identified and begun acquiring more than 300,000 non-copyrighted works spanning books, newspapers, academic journals, short stories, and poetry. These materials represent a substantial body of Georgian literary and journalistic output that has never been digitized into machine-readable plain text.
Approximately 60% of these documents exist only as scanned images or PDFs without embedded text layers, requiring optical character recognition. We use Google Vision OCR for this task, which handles Georgian script with high accuracy due to the regularity and distinctiveness of the Mkhedruli alphabet — Georgian characters have minimal ambiguity in their glyph shapes compared to, for example, degraded Latin-script prints. The remaining 40% of documents have been exported, chunked into training-appropriate segments, and cleaned of formatting artifacts, page numbers, headers, and other non-content elements.
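The post-OCR and post-export cleanup described above can be sketched as a few regular-expression passes. The patterns below are illustrative of the artifact classes we target (page numbers, hyphenation across line breaks, blank-line runs), not our exact production rules:

```python
import re

def clean_page(text: str) -> str:
    """Strip common OCR/export artifacts from a page of extracted text.
    Patterns are illustrative examples of the artifact classes described
    in the text, not a complete production cleaning pipeline."""
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)   # bare page numbers on their own line
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"\n{3,}", "\n\n", text)        # collapse runs of blank lines
    return text.strip()
```

Python's `re` module treats `\w` as Unicode-aware by default, so the hyphenation rule applies to Mkhedruli characters just as it does to Latin ones.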
In addition to these publicly sourced works, several Georgian companies have voluntarily contributed internal documentation, correspondence, and domain-specific text in Georgian. This corporate data adds both volume and genre diversity to the corpus, covering registers of language — technical, business, administrative — that are underrepresented in literary and journalistic sources.
Our projected total corpus size from all these sources is approximately eight billion words. Combined with the roughly two billion words available from CulturaX, this gives us a working corpus of approximately ten billion Georgian words — a five-fold increase over what any existing model has likely been trained on.

Phase 2: Tokenizer Adaptation
This is the most technically demanding phase of the project, and it draws directly on recent research in efficient tokenizer modification for pre-trained models. We begin with a small pre-trained model in the 800-million to 2-billion parameter range — models such as Llama 3.2 1B or a comparable architecture. The first operation is vocabulary pruning, followed by vocabulary extension with Georgian tokens.
The pruning step removes tokens from the existing vocabulary that correspond to scripts and languages irrelevant to our use case. A model like Llama 3 has a 128,000-token vocabulary, but a large fraction of those tokens encode Chinese, Japanese, Korean, Arabic, Hindi, and other scripts that we will never use in a Georgian-English bilingual system. These tokens occupy embedding parameters that contribute nothing to our task — and in small models, embedding parameters constitute a disproportionate share of total parameters. For reference, Gemma 3's smallest model has 270 million parameters, of which 170 million are vocabulary embeddings.
We do not perform naive pruning by simply removing the last N tokens or the least frequent tokens globally. As demonstrated in recent work by Purason et al. (2025), naive frequency-based pruning breaks BPE merge paths and creates unreachable tokens — tokens that exist in the vocabulary but can never be produced during tokenization because the intermediate merge operations that lead to them have been removed. Instead, we employ leaf-based frequency pruning, which iteratively removes only tokens that are leaves in the BPE merge graph — tokens that do not participate as components in any other merge operation. This preserves the structural integrity of the tokenizer. Research shows that up to 62.5% of tokens can be removed this way without measurable degradation in downstream performance on the retained languages.
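A minimal sketch of leaf-based pruning over a BPE merge list makes the mechanism concrete. The toy merges and frequencies below are illustrative; a real tokenizer's merge table is far larger:

```python
def find_leaves(merges):
    """Tokens that no other merge uses as a component.
    Base alphabet characters are never merge products, so they
    can never appear here and are never pruned."""
    vocab = {a + b for a, b in merges}
    used = {t for pair in merges for t in pair}
    return vocab - used

def prune_leaves(merges, freq, n_remove):
    """Iteratively drop the lowest-frequency leaf token. Removing a
    leaf's merge may expose its components as new leaves, so the
    leaf set is recomputed on every iteration."""
    merges = list(merges)
    for _ in range(n_remove):
        leaves = find_leaves(merges)
        if not leaves:
            break
        victim = min(leaves, key=lambda t: freq.get(t, 0))
        merges = [(a, b) for a, b in merges if a + b != victim]
    return merges
```

Because only leaves are ever removed, every surviving token's merge path stays intact, which is the structural guarantee naive frequency pruning lacks.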
After pruning, we extend the vocabulary with Georgian tokens. The target composition is approximately 40% Georgian tokens in the vocabulary. Rather than training a separate Georgian tokenizer and appending its non-overlapping tokens — the naive approach that has been standard practice — we use continued BPE training. This method resumes the BPE merge learning process directly on Georgian text, starting from the existing tokenizer's merge state. The key advantage is that every new merge operation is guaranteed to be compatible with the existing merge sequence. The naive approach, by contrast, introduces tokens that look useful in isolation but are unreachable through the actual merge operations of the combined tokenizer. Across 70 languages tested, continued BPE training produced zero unreachable tokens, while the naive method left 4% to 10% of added tokens unreachable.
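The unreachable-token failure mode is easy to demonstrate with a toy merge list. Under the simplifying assumption that merges apply in their learned order, a token is producible only if both of its components are producible first:

```python
def reachable_tokens(merges, alphabet):
    """Tokens producible by applying merges in their learned order:
    a merge can only fire if both of its components are producible."""
    avail = set(alphabet)
    for a, b in merges:
        if a in avail and b in avail:
            avail.add(a + b)
    return avail

alphabet = {"g", "a", "m"}
# Naive append: ("ga", "ma") is added, but no merge ever forms "ma",
# so the new token "gama" exists in the vocabulary yet can never be produced.
naive = [("g", "a"), ("ga", "ma")]
# Continued BPE training extends the merge sequence consistently, so every
# new token's components are produced by earlier merges.
continued = [("g", "a"), ("m", "a"), ("ga", "ma")]
```

This is exactly the compatibility property that continued BPE training guarantees and the naive append method does not.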

The practical effect of this tokenizer adaptation on Georgian text processing is substantial. Consider what happens at the byte level. With the default Llama 3 tokenizer, Georgian text achieves a compression ratio of approximately 2.64 bytes per token — meaning each token encodes on average only 2.64 bytes of Georgian text. Since each Georgian character is three bytes, this means the tokenizer is essentially splitting individual characters across multiple tokens. After pruning and extending with 16,000 Georgian tokens via continued BPE training, compression improves to approximately 4.46 bytes per token — a 69% improvement. This directly translates to shorter sequences, faster training, faster inference, and a larger effective context window for Georgian text.
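The bytes-per-token metric is straightforward to compute for any tokenizer output; a minimal sketch, with the 2.64 and 4.46 figures taken from the measurements above:

```python
def bytes_per_token(text: str, n_tokens: int) -> float:
    """Average number of UTF-8 bytes each token encodes."""
    return len(text.encode("utf-8")) / n_tokens

ka = "გამარჯობა"   # 9 Mkhedruli characters, 3 bytes each = 27 UTF-8 bytes

# Token count for a fixed byte budget scales as 1 / (bytes per token),
# so the measured 2.64 -> 4.46 improvement shortens Georgian sequences
# by roughly 1 - 2.64 / 4.46, i.e. about 41%.
token_reduction = 1 - 2.64 / 4.46
```

The ~41% sequence-length reduction is what the headline 69% bytes-per-token improvement translates into for actual training and inference workloads.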
After extending the vocabulary, the model's embedding matrices must be resized. New token embeddings are initialized using Fast Vocabulary Transfer, which copies embeddings directly for tokens that existed in the original vocabulary and initializes new tokens by averaging the embeddings of their constituent sub-tokens under the original tokenizer. This provides a warm start that substantially reduces the amount of continued pre-training needed to bring the new tokens to parity.
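A sketch of this initialization, under the assumption that the old vocabulary is a token-to-index dict and `old_tokenize` splits a string into old-vocabulary sub-tokens:

```python
import numpy as np

def fvt_init(new_vocab, old_vocab, old_emb, old_tokenize):
    """Fast Vocabulary Transfer-style warm start: copy embeddings for
    tokens that survive from the old vocabulary, and initialize new
    tokens as the mean of their constituent sub-token embeddings."""
    new_emb = np.empty((len(new_vocab), old_emb.shape[1]), dtype=old_emb.dtype)
    for i, tok in enumerate(new_vocab):
        if tok in old_vocab:
            new_emb[i] = old_emb[old_vocab[tok]]
        else:
            rows = [old_vocab[p] for p in old_tokenize(tok)]
            new_emb[i] = old_emb[rows].mean(axis=0)
    return new_emb
```

The averaged initialization places each new Georgian token near the region of embedding space its characters already occupied, which is why far less continued pre-training is needed than with random initialization.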
Phase 3: Continued Pre-training
With the adapted tokenizer in place, we perform continued pre-training on our full bilingual corpus. The training data consists of our approximately ten billion words of Georgian text and a matched volume of English text from FineWeb, maintaining a balanced bilingual mix. This balance is essential: if Georgian dominates the training mix, the model's English capabilities degrade. If English dominates, the model fails to internalize Georgian structure deeply enough.
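One simple way to enforce such a balance is a ratio-driven sampler over the two document pools. This is an illustrative sketch, not a description of our actual data loader:

```python
import random

def mix_stream(ka_docs, en_docs, ka_ratio=0.5, seed=0):
    """Yield (lang, doc) pairs from two pools at a target language ratio,
    stopping when either pool is exhausted."""
    rng = random.Random(seed)
    ka, en = list(ka_docs), list(en_docs)
    i = j = 0
    while i < len(ka) and j < len(en):
        if rng.random() < ka_ratio:
            yield "ka", ka[i]
            i += 1
        else:
            yield "en", en[j]
            j += 1
```

In practice the ratio would be applied at the token level rather than the document level, since Georgian and English documents encode to very different sequence lengths.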
The model architecture remains unchanged — we use the original pre-trained weights with the modified embedding layer. Training proceeds with standard continued pre-training hyperparameters: a cosine learning rate schedule with warmup, AdamW optimizer, and sequence length of 4096 tokens. Because our adapted tokenizer compresses Georgian text more efficiently, the same ten billion words of Georgian are encoded into fewer tokens than they would have been with the original tokenizer. Research on Llama 3.2 1B with a similar setup showed a 26% reduction in total training tokens after tokenizer adaptation, translating directly to reduced GPU-hours.
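The schedule referenced here is standard linear warmup followed by cosine decay; a self-contained sketch, with peak and floor values as illustrative placeholders:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak, floor=0.0):
    """Linear warmup to `peak` over `warmup_steps`, then cosine decay
    to `floor` over the remaining steps."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```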
Phase 4: Translation Fine-tuning
This phase exploits a specific asymmetry in existing LLM capabilities. Models like Gemini translate Georgian to English competently. This means that high-quality Georgian-to-English parallel sentence pairs can be generated at scale using existing systems. The critical insight is that a parallel corpus of Georgian-to-English translations is structurally identical to an English-to-Georgian corpus — the only difference is which side is treated as the source and which as the target.
We generate a large parallel dataset by having Gemini translate Georgian texts into English. These translations are high quality because Georgian-to-English is the direction in which current models are strong. We then take this corpus and reverse it: the English translations become the source, and the original Georgian texts become the target. We fine-tune our continued pre-trained model on this reversed corpus for English-to-Georgian translation.
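The corpus reversal itself is mechanically trivial; a sketch with a stubbed translator standing in for the external translation call:

```python
def build_reversal_pairs(ka_texts, translate_ka_to_en):
    """Translate authentic Georgian into English, then emit each pair
    reversed: the English output becomes the fine-tuning source, and
    the original human-written Georgian becomes the target."""
    pairs = []
    for ka in ka_texts:
        en = translate_ka_to_en(ka)   # the direction existing models are strong at
        pairs.append({"source": en, "target": ka})
    return pairs
```

The key property is that the target side is always authentic human-written Georgian, which is what lets the fine-tuned model learn fluent generation rather than imitating machine output.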
This approach bypasses the fundamental bottleneck. We do not need a model that is already good at generating Georgian — we only need a model that is good at understanding Georgian, which existing models already are. The supervised fine-tuning teaches our model to produce fluent Georgian by training it on authentic Georgian text as the target output, paired with clean English input that provides unambiguous semantic grounding.
Phase 5: Synthetic Data Generation
Once we have a competent English-to-Georgian translation system, we use it to scale our Georgian corpus by a factor of five to ten. The source material for this expansion is English text — high-quality books, articles, technical documents, and educational content — which exists in virtually unlimited supply. We translate this English material into Georgian using our fine-tuned model from Phase 4.
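A sketch of the translation loop, paired with a crude length-ratio sanity filter — a common machine-translation pipeline heuristic we assume here for illustration; the production pipeline's quality checks may differ:

```python
def translate_corpus(en_docs, translate_en_to_ka, min_ratio=0.5, max_ratio=3.0):
    """Render English source documents into Georgian, keeping only outputs
    whose character length is plausible relative to the source."""
    kept = []
    for en in en_docs:
        ka = translate_en_to_ka(en)
        ratio = len(ka) / max(len(en), 1)
        if min_ratio <= ratio <= max_ratio:   # drop truncated or runaway outputs
            kept.append(ka)
    return kept
```

The ratio bounds are hypothetical defaults; in practice they would be calibrated on a held-out set of known-good Georgian translations.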
The distinction between this approach and generic synthetic data generation is important. When an LLM generates text from scratch — inventing content, reasoning, and structure entirely on its own — the output tends to converge on a narrow distribution of patterns, exhibiting repetitive phrasings, limited stylistic variety, and a well-documented tendency toward self-reinforcing errors known as model collapse. Translation, however, is fundamentally constrained by its source. The ideas, structure, argumentation, narrative, and factual content all originate from genuine human-written English text. The model's task is limited to rendering that content in Georgian. It is not inventing; it is transposing. This constraint means that the linguistic diversity of the output is bounded by the diversity of the English source material, which is vast, rather than by the model's own generative tendencies.
The resulting synthetic Georgian corpus of 50 to 100 billion words would represent an unprecedented volume of Georgian-language training data, covering domains and registers that barely exist in the natural Georgian web corpus.

Phase 6: Large Model Training
With a corpus of 60 to 110 billion Georgian words available — authentic and translated combined — alongside matched English data, we are positioned to train a substantially larger model. The specific architecture and parameter count will be determined by the quality metrics achieved in the preceding phases and the available compute budget, but the target range is a model large enough to exhibit genuine reasoning and generation capabilities in Georgian, rather than merely pattern-matching from a limited training signal.
The tokenizer adapted in Phase 2 carries forward to this stage, meaning the training efficiency gains compound: more data, better tokenized, feeding a larger model. The synthetic data from Phase 5 also provides domain coverage that Georgian has never had at scale — scientific text, technical documentation, legal language, medical terminology — all rendered in Georgian from high-quality English sources.
The path forward
The Georgian language faces a structural disadvantage in the current landscape of large language model development. The three-byte UTF-8 encoding of its script, combined with the near-total absence of Georgian tokens in existing BPE vocabularies, means that Georgian text is tokenized at roughly five to six times the cost of English text — consuming more context, more compute, and more training budget per word of actual content. Compounding this, the total supply of digitally available Georgian text is approximately two billion words, which appears to have already been exhausted by frontier models.
Our plan addresses this through a sequence of interventions that each independently improve the situation and collectively transform it. Tokenizer adaptation through leaf-based pruning and continued BPE training reduces the tokenization cost of Georgian by approximately 69%, bringing it closer to parity with English. A large-scale data acquisition effort targeting offline Georgian texts — books, newspapers, journals, and corporate documents — expands the available corpus by a factor of five. A translation pipeline that exploits the existing strength of LLMs in the Georgian-to-English direction, then reverses the parallel data for English-to-Georgian fine-tuning, creates a competent Georgian generator without requiring one to exist first. And constrained synthetic data generation through translation scales the corpus by another order of magnitude without the quality degradation associated with unconstrained generation.
The end result is a path from two billion words of web-crawled Georgian and a tokenizer that fragments every word into six tokens, to over sixty billion words of diverse Georgian text and a tokenizer that processes the language at close to English-level efficiency. This is what is needed to train a model that does not merely tolerate Georgian but is genuinely fluent in it.

