Morphology in NLP: From Language Families to Subword Algorithms
Morphology is the branch of linguistics that studies the internal structure of words: how they’re formed, how they change, and what those changes mean. For NLP, it’s not an academic curiosity. It shapes every step of the pipeline — from tokenization to generation — and understanding it deeply is one of the cleaner ways to separate systems that generalize from those that merely memorize.
This post starts with the typological map of world languages (why not all languages are equally “hard”), moves through classical and statistical morphological analysis, then lands on the algorithms powering modern tokenizers — BPE, WordPiece, Unigram — and closes on where the field is actively bleeding.1
1. The Morphological Spectrum of Human Languages
Linguists have long organized languages along a spectrum defined by how they encode grammatical information. The extremes are clean; most languages live somewhere in between.
1.1 Analytic (Isolating) Languages
In analytic languages, words tend to be invariant. Grammatical relationships are expressed through word order and separate function words, not through word-internal changes.
Examples: Mandarin Chinese, Vietnamese, Modern English (partially).
In Mandarin, the same word serves as noun, verb, or adjective depending on context; there are no verb conjugations and no plural suffixes. A sentence like wǒ chī fàn (“I eat rice”) uses three completely invariant units.
For NLP, analytic languages are in many ways the easiest case: a word is a word. Vocabulary coverage is high, and simple whitespace tokenization gets you surprisingly far.
1.2 Agglutinative Languages
Agglutinative languages build words by stacking morphemes together, where each morpheme carries a single, discrete meaning. The morpheme boundaries are relatively transparent — you can read a word like a compound molecule.
Examples: Turkish, Finnish, Swahili, Hungarian, Basque.
The canonical demonstration is Turkish:
| Form | Gloss |
|---|---|
ev | house |
ev-ler | houses (plural) |
ev-ler-im | my houses |
ev-ler-im-de | in my houses |
ev-ler-im-de-ki | that which is in my houses |
ev-ler-im-de-ki-ler-den | from those that are in my houses |
The word evlerindekilerden is a single orthographic token but encodes what English would express with an entire prepositional phrase. A naïve word-level vocabulary over a Turkish corpus is an explosion — the same root generates thousands of surface forms.
1.3 Fusional (Inflectional) Languages
Fusional languages also modify words to express grammar, but unlike agglutinative languages, a single morpheme can carry multiple grammatical features simultaneously — they are fused together and not cleanly separable.
Examples: Arabic, Russian, Latin, Greek, German, Spanish, French.
Consider the Latin ending -ō in amō (“I love”). That single suffix simultaneously encodes: first person, singular number, present tense, indicative mood, active voice. There is no way to decompose it into five independent morphemes.
Arabic is a particularly rich case. It uses a root-and-pattern (templatic) morphology: a consonantal root (usually three consonants) is intercalated with vowel patterns to produce a family of semantically related words:
| Root | Pattern | Form | Meaning |
|---|---|---|---|
| ك-ت-ب (k-t-b) | CāCaC | كَتَبَ (kataba) | he wrote |
| ك-ت-ب | CāCiC | كَاتِب (kātib) | writer |
| ك-ت-ب | maCCaCa | مَكْتَبَة (maktaba) | library |
| ك-ت-ب | CiCāC | كِتَاب (kitāb) | book |
| ك-ت-ب | CuCuC | كُتُب (kutub) | books |
This non-concatenative structure breaks the assumption of every standard subword tokenizer, which models morphology as linear concatenation.
1.4 Polysynthetic Languages
At the extreme end, polysynthetic languages can pack what amounts to a full sentence’s worth of information into a single word.
Examples: Inuktitut, Mohawk, West Greenlandic.
The Inuktitut word tusaatsiarunnanngittualuujunga translates roughly as “I can’t hear very well.” It is one word. This extreme degree of morphological complexity makes even the definition of “word” slippery, and vocabulary-based NLP approaches essentially don’t work without morphological decomposition.
2. Why Morphology Matters for NLP Systems
The practical impact of morphological richness is felt everywhere:
Vocabulary explosion. For a fixed-size corpus, morphologically rich languages produce far more unique surface forms. A word-level model for Finnish or Arabic will suffer from severe data sparsity even on large corpora.
Lemmatization difficulty. Mapping wrote, written, writes, writing to write is trivial in English; doing the same for Arabic or Russian requires genuine morphological analysis.
Out-of-vocabulary (OOV) generalization. If a model has never seen antiestablishmentarianism but knows anti-, establish, -ment, -arian, -ism separately, it should be able to handle it. Pure word-level models cannot.
Transfer learning. When fine-tuning a pretrained language model on a morphologically rich low-resource language, the tokenizer’s segmentation quality directly bounds model performance. A bad tokenizer cuts off morpheme boundaries mid-unit, producing meaningless subwords.
3. Classical Morphological Analysis
Before neural methods, morphological analyzers were carefully engineered, linguistically grounded systems.
3.1 Finite-State Transducers (FSTs)
The workhorse of classical computational morphology is the Finite-State Transducer — a finite automaton that maps one string to another. A morphological analyzer implemented as an FST maps a surface form (what you see in text) to a lemma+feature string (what the word means morphologically).
Formally, an FST is a 7-tuple:
where is a finite set of states, the input alphabet, the output alphabet, the transition function, the initial state, and the set of accepting states.
The elegance of FSTs is that they are invertible: the same transducer compiled in reverse generates all valid surface forms from a lemma+features specification. This gives you both analysis and generation for free.
Systems like the Buckwalter Arabic Morphological Analyzer (BAMA) and the IRSTLM-based tools for Turkish were FST-based and remained competitive for years. Their weakness: they require deep linguistic expertise to build and are brittle in the face of neologisms, foreign borrowings, and informal text (social media, dialects).
3.2 Rule-Based Stemming
Simpler than FSTs, stemmers apply heuristic suffix-stripping rules without attempting to produce linguistically accurate analyses. The Porter Stemmer for English is the classic example — it reduces running → run, happily → happi (not a real word, but consistent). Fast and language-agnostic in spirit, but linguistically crude.
4. Statistical Morphological Segmentation
The middle ground between full linguistic FSTs and neural black boxes is unsupervised statistical segmentation: learning morpheme boundaries from raw text, without linguistic annotation.
4.1 Morfessor
Morfessor 2 is the canonical system in this space. It models the lexicon as a generative model over morpheme sequences and learns to segment words by optimizing the Minimum Description Length (MDL) criterion:
The first term penalizes a large morpheme lexicon; the second penalizes poor compression of the corpus. The system finds segmentations that are compact without being trivial (i.e., treating every character as a morpheme or every word as an atom).
Morfessor works reasonably well on agglutinative languages and was for years the recommended baseline for morpheme segmentation in the CoNLL shared tasks.
5. Subword Tokenization: The Neural Era
Modern NLP systems don’t perform explicit morphological analysis — they use subword tokenization algorithms that learn a vocabulary of frequent character sequences from data, implicitly capturing morphological structure as a byproduct.
The three algorithms that matter are BPE, WordPiece, and Unigram Language Model.
5.1 Byte-Pair Encoding (BPE)
BPE3 was originally a data compression algorithm, adapted for NLP by Sennrich et al. in 2016 for Neural Machine Translation. It is now the backbone of GPT-2, GPT-4, LLaMA, and many others.
The algorithm:
- Start with a vocabulary of individual characters (and a special end-of-word token
</w>). - Count all adjacent symbol pair frequencies in the training corpus.
- Merge the most frequent pair into a new symbol.
- Repeat until the vocabulary reaches a target size .
from collections import Counter, defaultdict
def get_vocab(corpus: list[str]) -> dict[str, int]:
"""Convert corpus to character-level vocabulary with word counts."""
vocab = Counter()
for word in corpus:
# Represent each word as space-separated characters with end marker
vocab[' '.join(list(word)) + ' </w>'] += 1
return vocab
def get_pairs(vocab: dict[str, int]) -> dict[tuple, int]:
"""Count all adjacent character-pair frequencies."""
pairs = defaultdict(int)
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return pairs
def merge_vocab(pair: tuple, vocab: dict[str, int]) -> dict[str, int]:
"""Merge the best pair across the entire vocabulary."""
new_vocab = {}
bigram = ' '.join(pair)
replacement = ''.join(pair)
for word in vocab:
new_word = word.replace(bigram, replacement)
new_vocab[new_word] = vocab[word]
return new_vocab
def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple]:
vocab = get_vocab(corpus)
merges = []
for _ in range(num_merges):
pairs = get_pairs(vocab)
if not pairs:
break
best_pair = max(pairs, key=pairs.get)
vocab = merge_vocab(best_pair, vocab)
merges.append(best_pair)
return merges
# Example
corpus = ["low", "lower", "lowest", "new", "newer", "newest", "wider", "wide"]
merges = learn_bpe(corpus, num_merges=10)
for merge in merges:
print(f"Merged: {merge[0]} + {merge[1]} → {''.join(merge)}")
After training, a lookup table of merges is applied greedily at inference time. The vocabulary size is a hyperparameter controlling the granularity/coverage trade-off: small → more characters, more OOV robustness; large → more whole words, faster inference.
A key property: BPE naturally tends to keep common words intact (they appear as full tokens) while splitting rare words into recognizable pieces. The word unhappiness might become un, happiness or un, happy, ness depending on the training corpus.
Complexity. Each merge step is pair counting; with merges and corpus size , the full training run is . In practice, efficient implementations (Hugging Face tokenizers) run in seconds on large corpora.
5.2 WordPiece
WordPiece is used by BERT, DistilBERT, and most of the Google family. The key difference from BPE: instead of merging the most frequent pair, WordPiece merges the pair that maximally increases the likelihood of the training data under a language model:
This is essentially a pointwise mutual information criterion. Pairs that are frequent but only because both parts are individually frequent (like th in English) score lower than pairs that co-occur specifically together. The result is a vocabulary that captures more linguistically meaningful units.
WordPiece also introduces the ## prefix convention: subwords that appear word-internally are prefixed with ## to signal they are continuations, not word-initial tokens. So embedding → em, ##bed, ##ding.
5.3 Unigram Language Model
Unigram LM tokenization4, used in SentencePiece (which powers T5, XLNet, mBART, and NLLB), takes a fundamentally different approach: instead of a greedy merge procedure, it starts with a large candidate vocabulary and prunes it.
The objective is to find the vocabulary and segmentation probabilities for each subword that maximize the likelihood of the corpus. For a word with candidate segmentations :
The EM algorithm alternates between:
- E-step: Find the best segmentation for each word given current using the Viterbi algorithm.
- M-step: Update from the expected counts.
Pruning then removes the subwords whose removal causes the smallest drop in total corpus log-likelihood, iterating until the target vocabulary size is reached.
The advantage: Unigram LM supports probabilistic segmentation — at inference, you can sample from the distribution over segmentations rather than always taking the argmax. This acts as a data augmentation technique during training and improves robustness.
5.4 Comparison Summary
| Property | BPE | WordPiece | Unigram LM |
|---|---|---|---|
| Direction | Bottom-up merges | Bottom-up merges | Top-down pruning |
| Merge criterion | Frequency | PMI / likelihood gain | Log-likelihood loss |
| Segmentation | Deterministic | Deterministic | Probabilistic |
| OOV handling | Character fallback | Character fallback | Character fallback |
| Used by | GPT family, LLaMA | BERT family | T5, mBART, NLLB |
| Speed (training) | Fast | Fast | Slower (EM) |
6. The Problem These Algorithms Don’t Fully Solve
Subword tokenization is a pragmatic engineering solution, not a linguistic one. It works well for English and other analytic/mildly inflected languages. But it has known failure modes that become acute on morphologically rich languages.
The Arabic problem. Arabic words are often represented by a single orthographic token that encodes a conjunction, a preposition, a definite article, a stem, a subject agreement marker, and an object agreement marker. BPE will split this token, but the split is determined by corpus statistics, not morpheme boundaries. The resulting subwords may cross morpheme lines, fragmenting the semantic units the model needs to learn.
The cliticization problem. In many languages, clitics (words that attach phonologically to a host) are written without spaces. French je l’ai vu (I saw him/it) already works fine because of the apostrophe — but Arabic and Hebrew clitic writing is spaced differently, and BPE has no way to know that the first character of a token is a clitic that should arguably be treated as a separate word.
Vocabulary allocation. A multilingual model trained on 100 languages with a 250K vocabulary (like NLLB-200) must partition that vocabulary across all languages. Morphologically rich low-resource languages end up with a tiny allocation, causing severe over-segmentation and consequently longer, more expensive sequences.
Empirically, the relationship between tokenization granularity and downstream task performance has been studied extensively. For low-resource MT, morpheme-aware tokenization consistently outperforms pure BPE when training data is below approximately 100K sentence pairs.
7. Current Research Directions
7.1 Morphology-Aware Tokenization
Several recent works attempt to inject linguistic knowledge back into subword tokenization:
- MorphPiece constrains BPE merges to respect morpheme boundaries detected by a morphological analyzer, producing subwords that align with morphemes rather than merely frequent n-grams.
- Linguistically Motivated Vocabulary Reduction (LMVR) uses morphological analysis to seed the initial vocabulary before running BPE, improving segmentation quality for Turkish and Arabic.
7.2 Character and Byte-Level Models
An alternative to subword tokenization is to operate directly at the character or byte level, eliminating the tokenization step entirely:
- ByT55 (Byte-level T5) processes raw UTF-8 bytes. It is naturally robust to noise, spelling variation, and all morphological complexity. The cost: sequences are 3–5x longer, and attention over long sequences is expensive.
- Charformer introduces a Gradient-Based Subword Tokenization (GBST) module that learns soft subword boundaries end-to-end as part of training, rather than as a preprocessing step.
7.3 Morphological Probing of LLMs
A growing body of work probes pretrained LLMs for morphological knowledge using probing classifiers — lightweight linear models trained on top of frozen representations to predict morphological features (case, number, tense, etc.).
Results are nuanced: large models encode substantial morphological information in their representations even when trained on raw subword sequences. But the encoding is inconsistent across layers and languages, and it correlates with tokenization quality — models that receive better-segmented tokens learn better morphological representations.
7.4 Low-Resource Morphological Analysis with Neural Models
For languages where FSTs exist (Arabic, Turkish, Finnish), hybrid systems combining neural encoders with FST constraints are now state of the art. The neural component handles ambiguity resolution and dialectal variation; the FST constrains the output to the space of linguistically valid analyses.
For languages without FSTs — the vast majority of the world’s 7,000+ languages — cross-lingual transfer is the primary strategy: training on related, higher-resource languages and fine-tuning on small annotated sets. The Universal Dependencies (UD) project has enabled this by providing morphologically annotated treebanks in a consistent schema across 100+ languages.
8. A Practical Takeaway
When you’re building an NLP system for a new language, the first question to ask is not “which model architecture?” but “what is the morphological profile of this language?”
If it’s analytic, standard tokenization with a modest vocabulary is probably fine. If it’s agglutinative or fusional, you need to take morphology seriously — and that means thinking carefully about: vocabulary size, segmentation algorithm choice, whether a morphological analyzer exists and is worth integrating, and whether your evaluation data appropriately reflects morphological diversity (i.e., includes rare inflectional forms, not just frequent lemmas).
Morphology is where the gap between “works on the benchmark” and “works in the wild” is largest. Building systems that respect it isn’t academic purism — it’s good engineering.
References
Footnotes
-
Nikolaev, D. et al. (2023). A Survey of Morphological Analysis for Low-Resource Languages. ACL Anthology. ↩
-
Creutz, M., & Lagus, K. (2002). Unsupervised Discovery of Morphemes. Morphological and Phonological Learning Workshop, ACL 2002. ↩
-
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. arXiv:1508.07909 ↩
-
Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018. arXiv:1804.10959 ↩
-
Xue, L. et al. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL 2022. arXiv:2105.13626 ↩