Building Multilingual Dictionaries for E-Readers

If you read books on an e-reader in a foreign language, you know the built-in dictionary is a lifesaver. Tap a word, get a definition, keep reading. But for less common language pairs, you're on your own — no e-reader manufacturer offers a Dutch-Hungarian dictionary, and neither does anyone else. The alternative is switching to a phone app every time you encounter an unfamiliar word, which breaks the reading flow entirely.

The problem goes deeper than just finding one translation. As a language learner, seeing a word translated into multiple languages you already speak helps you memorize faster. If I look up a Dutch word and see both the Hungarian and the English translation side by side, the meaning sticks better. That's a feature you won't find on any e-reader today.

That's the problem this project solves. It's a pipeline for collecting words, translations, and example sentences using modern tools — including LLMs and open databases — to build custom multilingual dictionaries that work on Kindle, Kobo, and other e-readers. Not just source-to-target, but source-to-multiple-targets. The pipeline is currently being developed and refined on a Dutch-Hungarian-English dictionary — a combination that matters to me personally — but it's designed to work with any language combination you choose.

What Makes a Good Dictionary

A dictionary isn't built in one pass — it grows gradually, entry by entry, source by source. Each entry should aim to include what you'd expect from any proper dictionary: the word itself, its translations, pronunciation (IPA), part of speech (noun, verb, adjective), and example sentences showing how the word is used in context.
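
To make the shape of an entry concrete, here is a minimal sketch of how it could be modelled. The class and field names (Entry, Translation, translations_for) are my own assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Translation:
    language: str                 # target language code, e.g. "hu" or "en"
    text: str
    quality: int = 1              # 1 (least trusted) to 5 (most trusted)
    sources: set = field(default_factory=set)


@dataclass
class Entry:
    headword: str
    language: str                         # source language code, e.g. "nl"
    ipa: Optional[str] = None             # pronunciation, if known
    part_of_speech: Optional[str] = None  # "noun", "verb", "adjective", ...
    translations: list = field(default_factory=list)
    examples: list = field(default_factory=list)

    def translations_for(self, language: str) -> list:
        """Return translations into one target language, best first."""
        matches = [t for t in self.translations if t.language == language]
        return sorted(matches, key=lambda t: t.quality, reverse=True)


# One source word with two target languages side by side.
entry = Entry(headword="huis", language="nl", part_of_speech="noun")
entry.translations.append(Translation("en", "house", quality=4, sources={"wiktionary"}))
entry.translations.append(Translation("hu", "ház", quality=3, sources={"wiktionary"}))
entry.examples.append("Het huis is groot.")
```

Keeping every translation in one list, tagged with its target language, is what lets a single entry serve two, three, or four languages without a schema change.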

Since entries come from many different sources — curated databases, text corpora, LLMs — reliability varies. Every translation carries a quality score from 1 to 5. When the same translation is confirmed by multiple sources, the score increases. If Wiktionary says "huis" means "house" and OpenSubtitles agrees, that translation is more trustworthy than one generated by an LLM alone.
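
One way such a score could be computed is sketched below. The base values per source and the "one point per extra confirmation" rule are assumptions chosen for illustration; the text above only fixes the 1 to 5 range and the idea that confirmation raises the score.

```python
# Hypothetical base scores per source; the real weights are not specified.
BASE_SCORE = {"manual": 5, "wiktionary": 3, "opensubtitles": 2, "llm": 1}


def quality_score(sources: set) -> int:
    """Score a translation from 1 to 5 based on which sources confirm it."""
    if not sources:
        raise ValueError("a translation needs at least one source")
    best = max(BASE_SCORE.get(s, 1) for s in sources)
    confirmations = len(sources) - 1   # each additional source adds one point
    return min(5, best + confirmations)


print(quality_score({"llm"}))                          # 1
print(quality_score({"wiktionary", "opensubtitles"}))  # 4
```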

When a new source is imported, the system doesn't create duplicate entries. It merges intelligently: new translations are added to existing words, existing translations get their quality boosted if confirmed, and manual edits are never overwritten by automated imports. This way the dictionary gets richer and more reliable over time, without losing work that's already been done.
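
The three merge rules can be sketched roughly as follows. The Record class, the merge_import function, and the tuple key are hypothetical names for this example, and the +1 boost is an assumed increment.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    quality: int
    sources: set = field(default_factory=set)
    manual: bool = False   # True once a human has edited this translation


def merge_import(store: dict, rows: list, source: str, base_quality: int) -> None:
    """Merge (headword, language, translation) rows from one source into store."""
    for headword, language, text in rows:
        key = (headword, language, text)
        record = store.get(key)
        if record is None:
            # New translation: add it alongside whatever already exists.
            store[key] = Record(quality=base_quality, sources={source})
        elif record.manual:
            # Manual edits are never overwritten by automated imports.
            continue
        elif source not in record.sources:
            # Confirmed by a new source: boost quality, capped at 5.
            record.sources.add(source)
            record.quality = min(5, record.quality + 1)
        # Re-importing the same source changes nothing, so imports are repeatable.


store = {}
merge_import(store, [("huis", "en", "house")], source="wiktionary", base_quality=3)
merge_import(
    store,
    [("huis", "en", "house"), ("huis", "hu", "ház")],
    source="opensubtitles",
    base_quality=2,
)
```

After the second import, "house" has been confirmed by two sources and its score has gone up, while "ház" appears as a new, lower-scored translation of the same word.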

The Pipeline

Building a dictionary is a four-step process. Each step adds a layer of depth and quality to the final result.

Step 1: Collect Words

The foundation is a word list. Open databases like Wiktionary provide a rich, human-curated starting point — with definitions, pronunciation (IPA), and parts of speech already structured. A parser reads the Wiktionary data dump and extracts entries for the source language. These form the authoritative base of the dictionary.
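
As a rough sketch of the extraction step, assume the dump has already been converted to JSON Lines, one entry per line. The field names used here (word, lang_code, pos, sounds, senses, glosses) are assumptions about that intermediate format, not something the project prescribes, and parse_entries is a hypothetical helper.

```python
import io
import json


def parse_entries(lines, source_lang: str):
    """Yield simplified entries for one source language from JSON Lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        if raw.get("lang_code") != source_lang:
            continue   # skip entries for other languages
        # Take the first pronunciation that carries an IPA string, if any.
        ipa = next((s["ipa"] for s in raw.get("sounds", []) if "ipa" in s), None)
        glosses = [g for sense in raw.get("senses", []) for g in sense.get("glosses", [])]
        yield {
            "headword": raw["word"],
            "pos": raw.get("pos"),
            "ipa": ipa,
            "glosses": glosses,
        }


# Two hand-written sample lines standing in for a real dump.
sample = io.StringIO(
    '{"word": "huis", "lang_code": "nl", "pos": "noun", "senses": [{"glosses": ["house"]}]}\n'
    '{"word": "house", "lang_code": "en", "pos": "noun", "senses": [{"glosses": ["a building"]}]}\n'
)
entries = list(parse_entries(sample, "nl"))
```

Reading line by line keeps memory use flat, which matters because full Wiktionary dumps are large.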

Step 2: Collect Sentences

A dictionary entry without context is hard to learn from. The next step is collecting real-world sentences where each word appears. Sources like OpenSubtitles — which aligns movie and TV subtitle files across languages — provide thousands of natural sentence pairs. Book corpora, news archives, and parallel text datasets all contribute too, each with different strengths: subtitles give colloquial language, news gives formal vocabulary, books give literary context. The system is designed to plug in new corpus sources as they become available.
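
A pluggable source could be as small as an object with a name and a method that yields sentence pairs. The sketch below assumes exactly that; CorpusSource, InMemoryCorpus, and collect_examples are illustrative names, and a real source would read subtitle or news files instead of an in-memory list.

```python
import re
from typing import Iterable, Iterator, Protocol, Tuple


class CorpusSource(Protocol):
    """Anything with a name that can yield (sentence, translation) pairs."""
    name: str

    def sentence_pairs(self) -> Iterator[Tuple[str, str]]: ...


class InMemoryCorpus:
    """Toy source for the example; stands in for a real corpus reader."""

    def __init__(self, name: str, pairs):
        self.name = name
        self._pairs = list(pairs)

    def sentence_pairs(self):
        return iter(self._pairs)


def collect_examples(word: str, sources: Iterable[CorpusSource], limit: int = 3):
    """Return up to `limit` (source name, sentence, translation) examples."""
    found = []
    for source in sources:
        for sentence, translation in source.sentence_pairs():
            # Match whole words only, ignoring case and punctuation.
            if word.lower() in re.findall(r"\w+", sentence.lower()):
                found.append((source.name, sentence, translation))
                if len(found) == limit:
                    return found
    return found


subtitles = InMemoryCorpus(
    "subtitles",
    [("Ik ga naar huis.", "I am going home."), ("Dat is een boek.", "That is a book.")],
)
news = InMemoryCorpus("news", [("Het huis is verkocht.", "The house has been sold.")])
examples = collect_examples("huis", [subtitles, news])
```

Recording the source name with each example keeps the register visible, so a reader can tell a colloquial subtitle line from a formal news sentence.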

Step 3: Analyse and Structure

Raw sentences and word lists need processing before they become a dictionary. This is where text mining comes in: analysing the grammar of each sentence, identifying whether a word is a noun, verb, or adjective, and mapping inflected forms back to their base form (lemmatization). Transformer-based models help here too: they can disambiguate words based on context, so a word like "bank" gets the right translation depending on whether the sentence is about finance or a riverbank. Hungarian, for example, adds many suffixes to words, so a single Dutch word might line up with dozens of inflected Hungarian forms in the corpus. The pipeline detects these and filters them out, so the entry shows the base form rather than every inflected variant. Translations found across multiple sources get a higher quality score, giving the reader a clear signal of reliability.
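
Here is a small sketch of the idea of mapping inflected forms back to a base form before counting translation candidates. The lookup table is a toy stand-in for a real morphological analyser, the counts are made up for the example, and the function names are hypothetical.

```python
from collections import Counter

# Toy stand-in for a morphological analyser: surface form -> base form.
TOY_LEMMAS = {"házak": "ház", "házat": "ház", "házban": "ház", "házból": "ház"}


def lemmatize(form: str) -> str:
    """Return the base form if known, else the lowercased form itself."""
    return TOY_LEMMAS.get(form.lower(), form.lower())


def collapse_inflections(candidates: Counter) -> Counter:
    """Merge the counts of inflected forms into their base forms."""
    merged = Counter()
    for form, count in candidates.items():
        merged[lemmatize(form)] += count
    return merged


# Made-up counts of Hungarian words aligned with Dutch "huis" in a corpus.
aligned = Counter({"ház": 40, "házban": 25, "házak": 10, "otthon": 12})
print(collapse_inflections(aligned).most_common())
# [('ház', 75), ('otthon', 12)]
```

Without this step the inflected variants would compete with each other as separate candidates, and the base form would look weaker than it really is.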

Step 4: Fill the Gaps with LLMs

After curated sources and corpus analysis, there are still words without translations in every target language. This is where LLMs come in. They can generate translations, suggest example sentences, and fill in missing entries — all without being tied to a single translation provider. LLM-generated content gets a lower quality score than human-curated or corpus-derived translations, making it a useful fallback while keeping transparency about the source.
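
The gap-filling step might look roughly like this. The translate argument is a placeholder for whatever LLM client is used (no specific provider or API is assumed), fake_translate is a canned stub so the example runs offline, and the score of 1 for LLM output is an assumption.

```python
from typing import Callable, Optional

LLM_QUALITY = 1  # assumed: lowest score, since no other source confirms it


def fill_gaps(
    entries: list,
    targets: list,
    translate: Callable[[str, str, str], Optional[str]],
) -> int:
    """Add LLM translations only where a target language has none yet."""
    added = 0
    for entry in entries:
        for target in targets:
            if entry["translations"].get(target):
                continue   # curated or corpus data exists, leave it alone
            text = translate(entry["headword"], entry["language"], target)
            if text:
                entry["translations"][target] = [
                    {"text": text, "quality": LLM_QUALITY, "source": "llm"}
                ]
                added += 1
    return added


def fake_translate(word: str, source_lang: str, target_lang: str) -> Optional[str]:
    """Canned stub standing in for a real LLM call."""
    return {("huis", "hu"): "ház"}.get((word, target_lang))


entries = [
    {
        "headword": "huis",
        "language": "nl",
        "translations": {"en": [{"text": "house", "quality": 4, "source": "wiktionary"}]},
    }
]
added = fill_gaps(entries, ["en", "hu"], fake_translate)
```

Passing the translator in as a function is what keeps the pipeline independent of any single provider: swapping models means swapping one callable.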

Web Interface

The project includes a web UI that serves two purposes. First, it works as an online dictionary: you can search for any word just as you would on any dictionary website. Second, it allows human curation: you can review, correct, and enrich entries generated by the pipeline. You can add words by hand as simple word-translation pairs, add usage examples, and adjust quality scores. This human-in-the-loop step is what turns a raw data pipeline into a polished dictionary.

Once you're happy with the result, you can export the dictionary to e-reader formats — Kindle (.mobi), Kobo (dicthtml.zip), and StarDict (.ifo/.idx/.dict) for PocketBook, KOReader, and GoldenDict — and sideload it onto your device. All three formats are generated from the same underlying data. Or simply keep using the web interface as your go-to online dictionary.
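
To give a feel for what an exporter does, here is a minimal sketch that builds the contents of the three StarDict files in memory. It reflects my understanding of the StarDict layout (a sorted index of null-terminated headwords, each followed by a big-endian 32-bit offset and size, and plain-text definitions marked with sametypesequence=m). Treat it as an assumption to check against the format documentation and a real reader app, not as the project's exporter; build_stardict is a hypothetical name.

```python
import struct


def build_stardict(entries: dict, bookname: str):
    """Return (ifo_text, idx_bytes, dict_bytes) for a headword -> definition map."""

    def sort_key(word: str):
        raw = word.encode("utf-8")
        # ASCII case-insensitive comparison first, then plain bytewise order.
        return (raw.lower(), raw)

    idx = bytearray()
    data = bytearray()
    for word in sorted(entries, key=sort_key):
        definition = entries[word].encode("utf-8")
        idx += word.encode("utf-8") + b"\x00"
        # Offset and size of the definition inside the .dict file.
        idx += struct.pack(">II", len(data), len(definition))
        data += definition

    ifo = (
        "StarDict's dict ifo file\n"
        "version=2.4.2\n"
        f"wordcount={len(entries)}\n"
        f"idxfilesize={len(idx)}\n"
        f"bookname={bookname}\n"
        "sametypesequence=m\n"
    )
    return ifo, bytes(idx), bytes(data)


ifo, idx, data = build_stardict(
    {"huis": "house; ház", "boek": "book; könyv"},
    bookname="Dutch-English-Hungarian",
)
```

Writing the three values to files with the same base name and the extensions .ifo, .idx, and .dict would give the file set mentioned above. Putting both target languages in one definition is how a single lookup can show the translations side by side.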

Are You Interested?

The architecture is language-agnostic — adding a new language combination means providing the right import sources and running the pipeline. Whether it's 2, 3, or 4 languages, the web interface, quality scoring, and export formats all work without changes.

If you want this project to be ready faster, give it a thumbs up!

The more votes this project gets, the higher its priority. Your vote helps decide what we build next.