Extractors

extractors

Defines the extractors that turn each upstream source into database rows

Sources To Rows
  • Every extractor is a generator that yields DatabaseRow pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the Schema

  • Extractors know a single source format and nothing about SQLite. The Builder knows how to insert rows and nothing about parsing. They meet only at the DatabaseRow tuple and at the arguments that each extractor might take

Data Flow Map
  • JMdict.gz -> extract_jmdict -> Fills (jmdict_entry, jmdict_kanji, jmdict_kana, jmdict_sense, jmdict_gloss, and tag)

  • JMnedict.gz -> extract_jmnedict -> Fills (jmnedict_entry, jmnedict_kanji, jmnedict_kana, jmnedict_translation, jmnedict_gloss, and tag)

  • KanjiDic2.gz -> extract_kanjidic -> Fills (kanji, kanji_reading, kanji_meaning, kanji_nanori, kanji_dic_ref, kanji_query_code, kanji_variant, and kanji_codepoint)

  • kradzip -> extract_krad -> Fills (radical and kanji_radical)

  • JmdictFurigana -> extract_furigana -> Fills (furigana)

  • KanjiVG.gz -> extract_kanjivg -> Fills (kanji_strokes)

  • Tanos JLPT JSON -> extract_jlpt -> Fills (jlpt_vocab, jlpt_kanji, and jlpt_grammar)

  • Tatoeba bz2/tar -> extract_tatoeba -> Fills (sentence and sentence_link)

  • Kanji alive zip -> extract_audio -> Fills (audio), which lives in the separate audio pack database rather than the core one

DatabaseRow module-attribute

DatabaseRow = tuple[str, dict[str, Any]]

A single database row produced by an extractor

A tuple whose first item is the target table name and whose second item is a dictionary mapping that table's column names to their row values
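
The contract above can be sketched as a generator on one side and a consumer on the other. This is a minimal illustration, not the project's code: extract_example, group_by_table and the column names (id, entry_id, text) are hypothetical stand-ins for a real extractor, the Builder and the Schema.

```python
from collections import defaultdict
from collections.abc import Iterable, Iterator
from typing import Any

DatabaseRow = tuple[str, dict[str, Any]]


def extract_example() -> Iterator[DatabaseRow]:
    """Hypothetical extractor: yields (table, row) pairs and knows nothing about SQLite."""
    yield "jmdict_entry", {"id": 1000000}
    yield "jmdict_kana", {"entry_id": 1000000, "text": "ヽ"}


def group_by_table(rows: Iterable[DatabaseRow]) -> dict[str, list[dict[str, Any]]]:
    """Hypothetical consumer: buckets rows by table name without parsing anything."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for table, row in rows:
        grouped[table].append(row)
    return dict(grouped)


grouped = group_by_table(extract_example())
```

The two sides share only the tuple shape, so either can change without touching the other.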

EXTRACTORS module-attribute

EXTRACTORS = {
    "jmdict": Extractor(
        "jmdict",
        (
            "jmdict_entry",
            "jmdict_kanji",
            "jmdict_kana",
            "jmdict_sense",
            "jmdict_gloss",
            "tag",
        ),
        extract_jmdict,
    ),
    "jmnedict": Extractor(
        "jmnedict",
        (
            "jmnedict_entry",
            "jmnedict_kanji",
            "jmnedict_kana",
            "jmnedict_translation",
            "jmnedict_gloss",
            "tag",
        ),
        extract_jmnedict,
    ),
    "kanjidic": Extractor(
        "kanjidic",
        (
            "kanji",
            "kanji_reading",
            "kanji_meaning",
            "kanji_nanori",
            "kanji_dic_ref",
            "kanji_query_code",
            "kanji_variant",
            "kanji_codepoint",
        ),
        extract_kanjidic,
    ),
    "krad": Extractor(
        "krad", ("radical", "kanji_radical"), extract_krad
    ),
    "furigana": Extractor(
        "furigana", ("furigana",), extract_furigana
    ),
    "kanjivg": Extractor(
        "kanjivg", ("kanji_strokes",), extract_kanjivg
    ),
    "jlpt": Extractor(
        "jlpt",
        ("jlpt_vocab", "jlpt_kanji", "jlpt_grammar"),
        extract_jlpt,
    ),
    "tatoeba": Extractor(
        "tatoeba",
        ("sentence", "sentence_link"),
        extract_tatoeba,
    ),
    "audio": Extractor("audio", ("audio",), extract_audio),
}

Represents all extractors available to the builder

A dictionary mapping extractor names to their respective Extractor object

ExtractorFunction module-attribute

ExtractorFunction = Callable[..., Iterable[DatabaseRow]]

The signature of an extractor function

Every extractor is a generator that yields DatabaseRow pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the Schema

Extractor dataclass

Represents a data extractor for a single upstream source

Attribute Breakdown
  • name → A string used to identify the extractor's run output in build logs

  • tables → The database table names which this extractor yields rows for, always a subset of the Schema

  • run → The function which parses the upstream source data and yields the database rows

Attributes:

Name Type Description
name str

Short identifier used in build logs

tables tuple[str, ...]

The tables this extractor fills, which makes the data flow readable from the registry alone

run ExtractorFunction

Callable that takes an arbitrary number of arguments and yields DatabaseRow pairs
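
A dataclass with these three attributes might look like the sketch below. The field order follows the positional calls in EXTRACTORS. The frozen flag, extract_demo and undeclared_tables are assumptions added for illustration and are not part of the documented module.

```python
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass
from typing import Any

DatabaseRow = tuple[str, dict[str, Any]]
ExtractorFunction = Callable[..., Iterable[DatabaseRow]]


@dataclass(frozen=True)  # frozen is an assumption, the source does not state it
class Extractor:
    name: str
    tables: tuple[str, ...]
    run: ExtractorFunction


def extract_demo() -> Iterator[DatabaseRow]:
    """Hypothetical extractor standing in for a real one."""
    yield "furigana", {"text": "漢字", "reading": "かんじ"}


def undeclared_tables(extractor: Extractor, *args: Any) -> set[str]:
    """Hypothetical check: tables the extractor yields but does not declare."""
    return {table for table, _ in extractor.run(*args)} - set(extractor.tables)


demo = Extractor("furigana", ("furigana",), extract_demo)
```

A check like undeclared_tables is one way to keep the tables attribute honest, since the registry is only readable as a data flow map if each extractor stays within the tables it declares.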

TagResolver

Recovers stable tag codes from JMdict and JMnedict entity descriptions

Document Type Definition
  • JMdict writes tags as XML entities such as <pos>&n;</pos>, and the XML's DTD (Document Type Definition) at the top of the file maps each code to a description, for example n to noun (common) (futsuumeishi)

  • The file is parsed with entities resolved, so an element arrives already expanded to the long description

  • To store the short, stable code while keeping the information in the long description, this resolver reverses the DTD map so that a long description can be turned back into its code, and the description itself is emitted into the tag table of the database
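
The reversal itself is small. The sketch below assumes a two entry mapping chosen for illustration, and to_code is a hypothetical helper rather than a method of TagResolver.

```python
# Example values only, a real mapping comes from the DTD header of the file
code_to_desc = {
    "n": "noun (common) (futsuumeishi)",
    "sl": "slang",
}

# Reverse the map so an expanded description leads back to its short code
desc_to_code = {desc: code for code, desc in code_to_desc.items()}


def to_code(text: str) -> str:
    """Return the stable code for a description, or the text itself if it is unknown."""
    return desc_to_code.get(text, text)


resolved = [to_code("noun (common) (futsuumeishi)"), to_code("not a tag")]
```

Falling back to the input text means a value that was never an entity passes through unchanged.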

Attributes:

Name Type Description
code_to_desc dict[str, str]

Mapping of each tag code to its description

__init__

__init__(code_to_desc)

Build a resolver from a code to description mapping

Parameters:

Name Type Description Default
code_to_desc dict[str, str]

Mapping of tag code to description

required

codes

codes(elements)

Map resolved element texts back to their stable codes

An element whose text is not a known description is kept verbatim, which leaves any non-entity value untouched

Parameters:

Name Type Description Default
elements Iterable[_Element]

Elements whose text is a resolved entity description

required

Returns:

Type Description
list[str]

The stable code for each element, in order

from_dtd classmethod

from_dtd(path, *, stop)

Read the DTD (Document Type Definition) entity table from the top of a gzipped XML file

The scan stops at the first line containing the stop sentinel, which is the start of the document body

Parameters:

Name Type Description Default
path Path

Path of the gzipped XML file

required
stop str

Substring that marks the end of the DTD, such as <JMdict for JMdict or ]> for JMnedict

required

Returns:

Type Description
TagResolver

A resolver populated from the file's entity table
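
A scan of this kind could be written as follows. This is a sketch under assumptions: read_entities and ENTITY_RE are hypothetical names, the regular expression assumes one double quoted ENTITY declaration per line, and the sample file is made up for the demonstration.

```python
import gzip
import re
import tempfile
from pathlib import Path

# Assumes declarations of the form <!ENTITY code "description"> on a single line
ENTITY_RE = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)">')


def read_entities(path: Path, *, stop: str) -> dict[str, str]:
    """Collect code -> description pairs until a line contains the stop sentinel."""
    code_to_desc: dict[str, str] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if stop in line:
                break  # the document body starts here, so the rest is never read
            match = ENTITY_RE.search(line)
            if match:
                code_to_desc[match.group(1)] = match.group(2)
    return code_to_desc


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY sl "slang">
]>
<JMdict>
<entry><ent_seq>1</ent_seq></entry>
</JMdict>
"""

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.xml.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    entities = read_entities(sample_path, stop="<JMdict")
```

Stopping at the sentinel keeps the scan to the header, so only a small prefix of the file is decompressed.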

tag_rows

tag_rows(elements, category)

Emit tag rows for every element that resolves to a known code

This is the additional extractor for the tag table of the database, shared by both JMdict and JMnedict

Parameters:

Name Type Description Default
elements Iterable[_Element]

Elements whose text is a resolved entity description

required
category str

The tag category to record, such as pos or misc

required

Yields:

Type Description
DatabaseRow

One tag row per element with a known code

extract_audio

extract_audio(audio_path, ka_data_path)

Stream Kanji Alive audio clips into database rows

The raw mp3 bytes are stored in the row alongside their license metadata

File Format
  • audio-mp3.zip is a ZIP of mp3 clips named {kname}_{index}_{variant}.mp3, such as jutsu-no(beru)_1_a.mp3, where the leading kname prefix is a romanized reading

  • ka_data.csv is a spreadsheet with a kanji column and a kname column, where kname matches the filename prefix and maps each clip to its kanji

  • Clips whose prefix is not found in the spreadsheet are skipped

Parameters:

Name Type Description Default
audio_path Path

Path of the audio-mp3.zip archive

required
ka_data_path Path

Path of the ka_data.csv spreadsheet

required

Yields:

Type Description
DatabaseRow

Rows for the audio table

extract_furigana

extract_furigana(path)

Stream JmdictFurigana into database rows

File Format
  • A gzipped tar archive holding a single .json file, which is one large JSON array of records

  • Each record is an object with text (the written spelling), reading (its full kana reading) and furigana (the segmentation)

  • furigana is a list of {"ruby": span, "rt": kana} objects that align each span of the spelling to the kana it is read as, and is stored verbatim as JSON in the segments column
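
Turning one record into a row might look like this sketch. The segments column is named in the text above, while the text and reading column names and the furigana_row helper are assumptions, and the record is an illustrative example.

```python
import json
from typing import Any

DatabaseRow = tuple[str, dict[str, Any]]

# Illustrative record in the documented shape
record = {
    "text": "漢字",
    "reading": "かんじ",
    "furigana": [{"ruby": "漢", "rt": "かん"}, {"ruby": "字", "rt": "じ"}],
}


def furigana_row(item: dict[str, Any]) -> DatabaseRow:
    """Hypothetical mapping: keep the segmentation as a JSON string."""
    return "furigana", {
        "text": item["text"],        # assumed column name
        "reading": item["reading"],  # assumed column name
        # ensure_ascii=False keeps the kana readable in the stored string
        "segments": json.dumps(item["furigana"], ensure_ascii=False),
    }


table, row = furigana_row(record)
```

Storing the list as JSON preserves the alignment between each span and its kana without needing a second table.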

Parameters:

Name Type Description Default
path Path

Path of the gzipped tar archive that holds the JSON

required

Yields:

Type Description
DatabaseRow

Rows for the furigana table

Raises:

Type Description
FileNotFoundError

If the archive has no JSON member

extract_jlpt

extract_jlpt()

Stream the Tanos JLPT lists into database rows

File Format
  • The lists come from Tanos and ship inside the package rather than being downloaded, as the files {kind}_n{level}.json for the three kinds across the five levels

  • Each file is a JSON array of records

  • A vocab record has kanji, hiragana and english, where the stored word falls back to hiragana when kanji is empty

  • A kanji record has kanji, its on and kun readings (each a space joined string) and english

  • A grammar record has grammar, formation and examples, though formation and examples are empty across the current dataset
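
The file naming and the vocab fallback can be sketched as below. The spellings of the three kinds are an assumption inferred from the table names, vocab_word is a hypothetical helper, and the sample records are made up.

```python
KINDS = ("vocab", "kanji", "grammar")  # assumed spellings of the three kinds
LEVELS = (5, 4, 3, 2, 1)

# Three kinds across five levels gives fifteen file names
file_names = [f"{kind}_n{level}.json" for kind in KINDS for level in LEVELS]


def vocab_word(record: dict[str, str]) -> str:
    """Use the kanji spelling, falling back to hiragana when kanji is empty."""
    return record["kanji"] or record["hiragana"]


# Made up records in the documented shape
kana_only = {"kanji": "", "hiragana": "ああ", "english": "like that"}
with_kanji = {"kanji": "会う", "hiragana": "あう", "english": "to meet"}
words = [vocab_word(kana_only), vocab_word(with_kanji)]
```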

Yields:

Type Description
DatabaseRow

Rows for the jlpt_vocab, jlpt_kanji and jlpt_grammar tables

extract_jmdict

extract_jmdict(path)

Stream JMdict into database rows

File Format
  • A gzipped XML file with a root <JMdict> element and one <entry> per dictionary entry, streamed one entry at a time

  • <ent_seq> holds the unique sequence number used as the entry id

  • Each written form is a <k_ele> with the kanji spelling in <keb>, spelling-info tags in <ke_inf> and priority codes in <ke_pri>

  • Each reading is an <r_ele> with the kana in <reb>, info tags in <re_inf>, priority codes in <re_pri>, optional <re_restr> entries that limit the reading to specific <keb>, and a <re_nokanji/> flag for readings that pair with no kanji

  • Each <sense> is one meaning, holding part of speech <pos>, field of use <field>, register <misc> (slang, vulgar, ...), <dial> dialect, free notes <s_inf>, <xref> and <ant> cross references, <stagk> and <stagr> form restrictions, loanword origin <lsource> and the translations in <gloss>

  • The tag elements (<pos>, <ke_inf>, <misc>, ...) are written as XML entities such as &n; defined in the DTD header, which TagResolver turns back into stable codes

  • A <sense> with no <pos> reuses the previous sense's, so the last seen value is carried forward

  • Priority codes include corpus frequency bands written as nfXX, where a lower number is more frequent

Parameters:

Name Type Description Default
path Path

Path of the gzipped JMdict XML

required

Yields:

Type Description
DatabaseRow

Rows for the jmdict_entry, jmdict_kanji, jmdict_kana, jmdict_sense, jmdict_gloss and tag tables

extract_jmnedict

extract_jmnedict(path)

Stream JMnedict into database rows

File Format
  • A gzipped XML file with the same shape as JMdict but for proper names, streamed one <entry> at a time

  • <ent_seq> holds the unique sequence number used as the entry id

  • Written forms are <k_ele> with the spelling in <keb>, readings are <r_ele> with the kana in <reb>

  • Each <trans> block is one name reading with its type in <name_type> (surname, place, given, ...), <xref> cross references and the actual translated names in <trans_det>

  • <name_type> is written as an XML entity defined in the DTD header, which TagResolver turns back into a stable code

Parameters:

Name Type Description Default
path Path

Path of the gzipped JMnedict XML

required

Yields:

Type Description
DatabaseRow

Rows for the jmnedict_entry, jmnedict_kanji, jmnedict_kana, jmnedict_translation, jmnedict_gloss and tag tables

extract_kanjidic

extract_kanjidic(path)

Stream KanjiDic2 into database rows

A single character fans out across the kanji table and its seven detail tables

File Format
  • A gzipped XML file with a root <kanjidic2> element and one <character> per kanji, streamed one at a time. Unlike JMdict it uses no XML entities, so values are read directly

  • <literal> holds the kanji character itself

  • <codepoint> lists <cp_value cp_type=...> encodings (Unicode, JIS)

  • <radical> lists <rad_value rad_type=...> radical numbers, where classical and nelson_c are kept

  • <misc> holds the school <grade>, one or more <stroke_count> values (the first is accepted, the rest are common miscounts), newspaper <freq>, the old <jlpt> level and <variant> forms

  • <dic_number> lists <dic_ref dr_type=...> references into print dictionaries, with optional m_vol and m_page attributes

  • <query_code> lists <q_code qc_type=...> lookup codes such as SKIP, where a skip_misclass attribute flags common misclassifications

  • <reading_meaning> groups readings and meanings in <rmgroup> blocks, where <reading r_type=...> covers on, kun, pinyin and korean, and <meaning m_lang=...> defaults to English, while <nanori> name readings sit outside the groups
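
The stroke count and meaning rules can be sketched on a single <character>. The snippet below is illustrative, the second <stroke_count> is made up to show the miscount case, and the standard library parser is used only for the demonstration.

```python
import xml.etree.ElementTree as ET

# Illustrative character, the second stroke_count is made up for the example
CHARACTER = ET.fromstring(
    "<character>"
    "<literal>亜</literal>"
    "<misc><grade>8</grade>"
    "<stroke_count>7</stroke_count><stroke_count>8</stroke_count></misc>"
    "<reading_meaning><rmgroup>"
    '<reading r_type="ja_on">ア</reading>'
    "<meaning>Asia</meaning>"
    '<meaning m_lang="fr">Asie</meaning>'
    "</rmgroup><nanori>や</nanori></reading_meaning>"
    "</character>"
)

# The first stroke_count is the accepted one, later ones are common miscounts
stroke_counts = [int(node.text) for node in CHARACTER.findall("misc/stroke_count")]
accepted_strokes = stroke_counts[0]

# A meaning without m_lang is English
meanings = [(node.get("m_lang", "en"), node.text) for node in CHARACTER.iter("meaning")]

# nanori sits beside the rmgroup blocks, not inside them
nanori = [node.text for node in CHARACTER.findall("reading_meaning/nanori")]
```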

Parameters:

Name Type Description Default
path Path

Path of the gzipped KanjiDic2 XML

required

Yields:

Type Description
DatabaseRow

Rows for the kanji, kanji_reading, kanji_meaning, kanji_nanori, kanji_dic_ref, kanji_query_code, kanji_variant and kanji_codepoint tables

extract_kanjivg

extract_kanjivg(path)

Stream KanjiVG stroke order data into database rows

File Format
  • A gzipped XML file whose root holds one <kanji> element per character, streamed one at a time

  • The id attribute is kvg:kanji_XXXXX, where XXXXX is the kanji's lowercase hex Unicode codepoint, decoded back into the literal

  • Variant forms carry an extra suffix on the id and are skipped so each literal appears once

  • Inside, nested <g> groups hold the <path> stroke elements, where each <path> is one stroke. The whole element is stored as SVG markup and the stroke count is the number of <path> descendants

Parameters:

Name Type Description Default
path Path

Path of the gzipped KanjiVG XML

required

Yields:

Type Description
DatabaseRow

Rows for the kanji_strokes table

extract_krad

extract_krad(path)

Stream KRADFILE and RADKFILE into database rows

File Format
  • The kradzip archive is a ZIP bundling several EUC-JP encoded text files, where lines starting with # are comments

  • RADKFILE is grouped by radical, where a line of the form $ radical strokes introduces a radical and is followed by the kanji that contain it. Only the $ lines are read here, for the radical stroke counts

  • KRADFILE lists one kanji per line as kanji : rad1 rad2 ..., giving the radical components that make up that kanji

  • The two members with a 2 suffix (kradfile2, radkfile2) are newer supplements read the same way

Parameters:

Name Type Description Default
path Path

Path of the kradzip archive

required

Yields:

Type Description
DatabaseRow

Rows for the radical and kanji_radical tables

extract_tatoeba

extract_tatoeba(jpn_path, links_path=None, eng_path=None)

Stream Tatoeba sentences and their alignments into database rows

The global links file is huge, so it is filtered in a single pass to the links that touch a Japanese sentence rather than held in full, and only the English sentences that those links reach are stored

File Format
  • jpn_sentences.tsv.bz2 and eng_sentences.tsv.bz2 are bzip2 compressed TSV files with the columns id, lang and text

  • links.tar.bz2 is a tar archive holding links.csv, which is actually tab separated source_id and target_id pairs, listing both directions of every translation link

  • The id columns are integers and are matched across the three files to align Japanese sentences with their English translations
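
The single pass filter can be sketched as a generator over the link lines. filter_links is a hypothetical helper and the ids are made up. Because the file lists both directions of every link, keeping only the rows whose source is Japanese is assumed here to be enough to cover every pair.

```python
from collections.abc import Iterable, Iterator


def filter_links(
    jpn_ids: set[int], link_lines: Iterable[str]
) -> Iterator[tuple[int, int]]:
    """Yield (source_id, target_id) for links whose source is a Japanese sentence."""
    for line in link_lines:
        source, _, target = line.rstrip("\n").partition("\t")
        if not target:
            continue  # skip malformed or empty lines
        source_id, target_id = int(source), int(target)
        if source_id in jpn_ids:
            yield source_id, target_id


# Made up ids: sentences 1 and 2 are Japanese, 3 is some other language
lines = ["1\t10\n", "10\t1\n", "3\t11\n", "2\t12\n"]
kept = list(filter_links({1, 2}, lines))

# Only these target ids need to be looked up in the English file
wanted_targets = {target for _, target in kept}
```

Only the set of Japanese ids and the kept pairs live in memory, so the size of the full links file does not matter.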

Parameters:

Name Type Description Default
jpn_path Path

Path of the Japanese jpn_sentences.tsv.bz2

required
links_path Path | None

Path of the global links.tar.bz2, or None to skip alignment

None
eng_path Path | None

Path of the English eng_sentences.tsv.bz2, or None to skip alignment

None

Yields:

Type Description
DatabaseRow

Rows for the sentence and sentence_link tables

stream_elements

stream_elements(path, tag, *, resolve_entities=True)

Stream one tag from a gzipped XML file with flat memory usage

How It Works
  • iterparse runs on libxml2 and fires once per finished element, so the whole document is never built into a tree

  • After each element is yielded, it is cleared and its already processed previous siblings are deleted, which is what keeps memory usage flat across a multi-hundred megabyte file

  • The element is fully populated only while the caller holds it, so all reads must happen inside the loop body before the next iteration
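
The same streaming pattern can be sketched with the standard library. This is not the documented implementation: it uses xml.etree.ElementTree instead of the libxml2 backed parser, so deleting previous siblings is approximated by clearing the root, and it assumes the requested elements are direct children of the root. stream_elements_sketch and the sample file are hypothetical.

```python
import gzip
import tempfile
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from pathlib import Path


def stream_elements_sketch(path: Path, tag: str) -> Iterator[ET.Element]:
    """Yield each finished element named tag, dropping it once the caller moves on."""
    with gzip.open(path, "rb") as handle:
        context = ET.iterparse(handle, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, element in context:
            if event == "end" and element.tag == tag:
                yield element
                # Runs when the caller asks for the next element: detach
                # everything parsed so far so the tree never grows
                root.clear()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xml.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write(
            "<JMdict>"
            "<entry><ent_seq>1</ent_seq></entry>"
            "<entry><ent_seq>2</ent_seq></entry>"
            "</JMdict>"
        )
    # Each read happens while the element is still held by the loop
    sequences = [
        entry.findtext("ent_seq") for entry in stream_elements_sketch(sample, "entry")
    ]
```

Because cleanup runs after the yield, copying out values inside the loop is safe, while keeping a reference to the element for later is not.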

Parameters:

Name Type Description Default
path Path

Path of the gzipped XML file

required
tag str

The element tag to emit, such as entry or character

required
resolve_entities bool

When True, expand XML entities such as &n; into their description text while parsing

True

Yields:

Type Description
_Element

Each finished element of the requested tag, cleared once the caller moves on