Extractors

extractors ¶

Defines the extractors that turn each upstream source into database rows

Sources To Rows

Every extractor is a generator that yields DatabaseRow pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the Schema
Extractors know a single source format and nothing about SQLite. The Builder knows how to insert rows and nothing about parsing. They meet only at the DatabaseRow tuple and at the arguments that each extractor might take
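The split described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `extract_demo`, `insert_rows` and the `demo_entry` table are hypothetical names, and the real Builder may insert rows differently.

```python
import sqlite3
from typing import Any, Iterable, Iterator

DatabaseRow = tuple[str, dict[str, Any]]


def extract_demo(lines: Iterable[str]) -> Iterator[DatabaseRow]:
    """Hypothetical extractor: parses 'id<TAB>text' lines and knows nothing about SQLite."""
    for line in lines:
        row_id, text = line.rstrip("\n").split("\t", 1)
        yield "demo_entry", {"id": int(row_id), "text": text}


def insert_rows(conn: sqlite3.Connection, rows: Iterable[DatabaseRow]) -> None:
    """Hypothetical builder step: inserts rows and knows nothing about parsing."""
    for table, row in rows:
        # Table and column names come from trusted extractor code, not user input
        columns = ", ".join(row)
        placeholders = ", ".join(f":{name}" for name in row)
        conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", row)
```

The two functions share only the `DatabaseRow` tuple, so either side can change its internals without touching the other.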

Data Flow Map

JMdict.gz -> extract_jmdict -> Fills (jmdict_entry, jmdict_kanji, jmdict_kana, jmdict_sense, jmdict_gloss, and tag)
JMnedict.gz -> extract_jmnedict -> Fills (jmnedict_entry, jmnedict_kanji, jmnedict_kana, jmnedict_translation, jmnedict_gloss, and tag)
KanjiDic2.gz -> extract_kanjidic -> Fills (kanji, kanji_reading, kanji_meaning, kanji_nanori, kanji_dic_ref, kanji_query_code, kanji_variant, and kanji_codepoint)
kradzip -> extract_krad -> Fills (radical and kanji_radical)
JmdictFurigana -> extract_furigana -> Fills (furigana)
KanjiVG.gz -> extract_kanjivg -> Fills (kanji_strokes)
Tanos JLPT JSON -> extract_jlpt -> Fills (jlpt_vocab, jlpt_kanji, and jlpt_grammar)
Tatoeba bz2/tar -> extract_tatoeba -> Fills (sentence and sentence_link)
Kanji alive zip -> extract_audio -> Fills (audio), which lives in the separate audio pack database rather than the core one

DatabaseRow `module-attribute` ¶

DatabaseRow = tuple[str, dict[str, Any]]

A single database row produced by an extractor

A tuple whose first item is the target table name and whose second item is a dictionary mapping that table's column names to their row values

EXTRACTORS `module-attribute` ¶

EXTRACTORS = {
    "jmdict": Extractor(
        "jmdict",
        (
            "jmdict_entry",
            "jmdict_kanji",
            "jmdict_kana",
            "jmdict_sense",
            "jmdict_gloss",
            "tag",
        ),
        extract_jmdict,
    ),
    "jmnedict": Extractor(
        "jmnedict",
        (
            "jmnedict_entry",
            "jmnedict_kanji",
            "jmnedict_kana",
            "jmnedict_translation",
            "jmnedict_gloss",
            "tag",
        ),
        extract_jmnedict,
    ),
    "kanjidic": Extractor(
        "kanjidic",
        (
            "kanji",
            "kanji_reading",
            "kanji_meaning",
            "kanji_nanori",
            "kanji_dic_ref",
            "kanji_query_code",
            "kanji_variant",
            "kanji_codepoint",
        ),
        extract_kanjidic,
    ),
    "krad": Extractor(
        "krad", ("radical", "kanji_radical"), extract_krad
    ),
    "furigana": Extractor(
        "furigana", ("furigana",), extract_furigana
    ),
    "kanjivg": Extractor(
        "kanjivg", ("kanji_strokes",), extract_kanjivg
    ),
    "jlpt": Extractor(
        "jlpt",
        ("jlpt_vocab", "jlpt_kanji", "jlpt_grammar"),
        extract_jlpt,
    ),
    "tatoeba": Extractor(
        "tatoeba",
        ("sentence", "sentence_link"),
        extract_tatoeba,
    ),
    "audio": Extractor("audio", ("audio",), extract_audio),
}

Represents all extractors available to the builder

A dictionary mapping extractor names to their respective Extractor object

ExtractorFunction `module-attribute` ¶

ExtractorFunction = Callable[..., Iterable[DatabaseRow]]

The signature of an extractor function

Every extractor is a generator that yields DatabaseRow pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the Schema

Extractor `dataclass` ¶

Represents a data extractor for a single upstream source

Attribute Breakdown

name → A string used to identify the extractor's run output in build logs
tables → The database table names which this extractor yields rows for, always a subset of the Schema
run → The function which parses the upstream source data and yields the database rows

Attributes:

Name	Type	Description
`name`	`str`	Short identifier used in build logs
`tables`	`tuple[str, ...]`	The tables this extractor fills, which makes the data flow readable from the registry alone
`run`	`ExtractorFunction`	Callable that takes an arbitrary number of arguments and yields `DatabaseRow` pairs
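As a rough sketch of why `tables` makes the data flow readable from the registry alone, the snippet below restates the three attributes as a plain dataclass and inverts a small registry into a table-to-sources map. `DEMO_REGISTRY`, `_no_rows` and `sources_by_table` are hypothetical names for illustration, and the real dataclass may carry options not shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

DatabaseRow = tuple[str, dict[str, Any]]
ExtractorFunction = Callable[..., Iterable[DatabaseRow]]


@dataclass
class Extractor:
    """Illustrative restatement of the documented attributes."""
    name: str
    tables: tuple[str, ...]
    run: ExtractorFunction


def _no_rows() -> Iterable[DatabaseRow]:
    """Stand-in run function that yields nothing."""
    return iter(())


# Trimmed, hypothetical registry with only two entries
DEMO_REGISTRY = {
    "jmdict": Extractor("jmdict", ("jmdict_entry", "tag"), _no_rows),
    "jmnedict": Extractor("jmnedict", ("jmnedict_entry", "tag"), _no_rows),
}


def sources_by_table(registry: dict[str, Extractor]) -> dict[str, list[str]]:
    """Invert the registry: which extractors fill each table."""
    owners: dict[str, list[str]] = {}
    for extractor in registry.values():
        for table in extractor.tables:
            owners.setdefault(table, []).append(extractor.name)
    return owners
```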

TagResolver ¶

Recovers stable tag codes from JMdict and JMnedict entity descriptions

Document Type Definition

JMdict writes tags as XML entities such as <pos>&n;</pos>, and the XML's DTD (Document Type Definition) at the top of the file maps each code to a description, for example n to noun (common) (futsuumeishi)
The file is parsed with entities resolved, so an element arrives already expanded to the long description
To store the short, stable code while keeping the information in the long description, this resolver reverses the DTD map so that each description can be turned back into its code, and emits the description into the tag table of the database
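The reversal can be sketched in a few lines. This is an assumption about the approach rather than the class's actual code: `to_code` is a hypothetical helper that works on plain strings instead of XML elements, and the second mapping entry is illustrative.

```python
# Forward map as read from the DTD header: code -> description
code_to_desc = {
    "n": "noun (common) (futsuumeishi)",
    "adj-i": "adjective (keiyoushi)",  # illustrative entry
}

# Reverse map used to recover the code from an already expanded entity
desc_to_code = {desc: code for code, desc in code_to_desc.items()}


def to_code(text: str) -> str:
    """Return the stable code for a description, or the text unchanged if it is not a known description."""
    return desc_to_code.get(text, text)
```

Falling back to the input text mirrors the documented behavior of keeping non entity values verbatim.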

Attributes:

Name	Type	Description
`code_to_desc`	`dict[str, str]`	Mapping of each tag code to its description

__init__ ¶

__init__(code_to_desc)

Build a resolver from a code to description mapping

Parameters:

Name	Type	Description	Default
`code_to_desc`	`dict[str, str]`	Mapping of tag code to description	required

codes ¶

codes(elements)

Map resolved element texts back to their stable codes

An element whose text is not a known description is kept verbatim, which leaves any non entity value untouched

Parameters:

Name	Type	Description	Default
`elements`	`Iterable[_Element]`	Elements whose text is a resolved entity description	required

Returns:

Type	Description
`list[str]`	The stable code for each element, in order

from_dtd `classmethod` ¶

from_dtd(path, *, stop)

Read the DTD (Document Type Definition) entity table from the top of a gzipped XML file

The scan stops at the first line containing the stop sentinel, which is the start of the document body

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped XML file	required
`stop`	`str`	Substring that marks the end of the DTD, such as `<JMdict` for JMdict or `]>` for JMnedict	required

Returns:

Type	Description
`TagResolver`	A resolver populated from the file's entity table
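One way to read such an entity table is sketched below, assuming each declaration sits on its own line in the form `<!ENTITY code "description">`. `read_entities` and `ENTITY_PATTERN` are hypothetical names, and the real classmethod may parse the header differently.

```python
import gzip
import re
from pathlib import Path

# Assumed declaration shape: <!ENTITY code "description">
ENTITY_PATTERN = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)">')


def read_entities(path: Path, *, stop: str) -> dict[str, str]:
    """Collect code -> description pairs until a line contains the stop sentinel."""
    code_to_desc: dict[str, str] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if stop in line:
                break  # document body reached, nothing more to scan
            match = ENTITY_PATTERN.search(line)
            if match:
                code_to_desc[match.group(1)] = match.group(2)
    return code_to_desc
```

Stopping at the sentinel means only the header is decompressed, not the whole file.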

tag_rows ¶

tag_rows(elements, category)

Emit tag rows for every element that resolves to a known code

This is the additional extractor for the tag table of the database, shared by both JMdict and JMnedict

Parameters:

Name	Type	Description	Default
`elements`	`Iterable[_Element]`	Elements whose text is a resolved entity description	required
`category`	`str`	The tag category to record, such as `pos` or `misc`	required

Yields:

Type	Description
`DatabaseRow`	One `tag` row per element with a known code

extract_audio ¶

extract_audio(audio_path, ka_data_path)

Stream Kanji Alive audio clips into database rows

The raw mp3 bytes are stored in the row alongside their license metadata

File Format

audio-mp3.zip is a ZIP of mp3 clips named {kname}_{index}_{variant}.mp3, such as jutsu-no(beru)_1_a.mp3, where the leading kname prefix is a romanized reading
ka_data.csv is a spreadsheet with a kanji column and a kname column, where kname matches the filename prefix and maps each clip to its kanji
Clips whose prefix is not found in the spreadsheet are skipped
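The filename convention and the spreadsheet lookup can be sketched like this. `parse_clip_name` and `load_kname_map` are hypothetical helpers, the sketch reads the CSV from a string rather than a file, and the kanji paired with the example kname in the usage note is illustrative data.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Optional


def load_kname_map(csv_text: str) -> dict[str, str]:
    """Map each kname to its kanji using the two documented columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {record["kname"]: record["kanji"] for record in reader}


def parse_clip_name(member_name: str) -> Optional[tuple[str, str, str]]:
    """Split '{kname}_{index}_{variant}.mp3' into its three parts, or return None."""
    path = PurePosixPath(member_name)
    if path.suffix != ".mp3":
        return None
    # Split from the right, since the kname itself could contain underscores
    parts = path.stem.rsplit("_", 2)
    if len(parts) != 3:
        return None
    kname, index, variant = parts
    return kname, index, variant
```

With a map built from `"kanji,kname\n述,jutsu-no(beru)\n"`, the clip `jutsu-no(beru)_1_a.mp3` resolves to its kanji through its kname, while a clip whose kname is missing from the map would be skipped.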

Parameters:

Name	Type	Description	Default
`audio_path`	`Path`	Path of the `audio-mp3.zip` archive	required
`ka_data_path`	`Path`	Path of the `ka_data.csv` spreadsheet	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `audio` table

extract_furigana ¶

extract_furigana(path)

Stream JmdictFurigana into database rows

File Format

A gzipped tar archive holding a single .json file, which is one large JSON array of records
Each record is an object with text (the written spelling), reading (its full kana reading) and furigana (the segmentation)
furigana is a list of {"ruby": span, "rt": kana} objects that align each span of the spelling to the kana it is read as, and is stored verbatim as JSON in the segments column
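A rough sketch of reading the archive and shaping one row follows. `load_records` and `furigana_row` are hypothetical names, the `text` and `reading` column names are assumed to mirror the record keys, and this sketch loads the whole array at once, which the real extractor may avoid.

```python
import json
import tarfile
from pathlib import Path
from typing import Any

DatabaseRow = tuple[str, dict[str, Any]]


def load_records(path: Path) -> list[dict[str, Any]]:
    """Return the JSON array from the first .json member of a gzipped tar."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".json"):
                handle = archive.extractfile(member)
                return json.load(handle)
    raise FileNotFoundError(f"no .json member in {path}")


def furigana_row(record: dict[str, Any]) -> DatabaseRow:
    """Shape one record into a row, keeping the segmentation as JSON text."""
    return "furigana", {
        "text": record["text"],
        "reading": record["reading"],
        # Stored verbatim so consumers can decode the ruby/rt pairs themselves
        "segments": json.dumps(record["furigana"], ensure_ascii=False),
    }
```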

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped tar archive that holds the JSON	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `furigana` table

Raises:

Type	Description
`FileNotFoundError`	If the archive has no `.json` member

extract_jlpt ¶

extract_jlpt()

Stream the Tanos JLPT lists into database rows

File Format

The lists come from Tanos and ship inside the package rather than being downloaded, as the files {kind}_n{level}.json for the three kinds across the five levels
Each file is a JSON array of records
A vocab record has kanji, hiragana and english, where the stored word falls back to hiragana when kanji is empty
A kanji record has kanji, on and kun readings (space joined) and english
A grammar record has grammar, formation and examples, though formation and examples are empty across the current dataset
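The file naming and the vocab fallback can be sketched as below. The kind names `vocab`, `kanji` and `grammar` are an assumption taken from the table names, and `list_filenames` and `vocab_word` are hypothetical helpers.

```python
from typing import Any

KINDS = ("vocab", "kanji", "grammar")  # assumed kind names, inferred from the table names
LEVELS = (5, 4, 3, 2, 1)


def list_filenames() -> list[str]:
    """Every {kind}_n{level}.json name: three kinds across five levels."""
    return [f"{kind}_n{level}.json" for kind in KINDS for level in LEVELS]


def vocab_word(record: dict[str, Any]) -> str:
    """Use the kanji spelling when present, otherwise fall back to hiragana."""
    return record["kanji"] or record["hiragana"]
```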

Yields:

Type	Description
`DatabaseRow`	Rows for the `jlpt_vocab`, `jlpt_kanji` and `jlpt_grammar` tables

extract_jmdict ¶

extract_jmdict(path)

Stream JMdict into database rows

File Format

A gzipped XML file with a root <JMdict> element and one <entry> per dictionary entry, streamed one entry at a time
<ent_seq> holds the unique sequence number used as the entry id
Each written form is a <k_ele> with the kanji spelling in <keb>, spelling-info tags in <ke_inf> and priority codes in <ke_pri>
Each reading is an <r_ele> with the kana in <reb>, info tags in <re_inf>, priority codes in <re_pri>, optional <re_restr> entries that limit the reading to specific <keb>, and a <re_nokanji/> flag for readings that pair with no kanji
Each <sense> is one meaning, holding part of speech <pos>, field of use <field>, register <misc> (slang, vulgar, ...), <dial> dialect, free notes <s_inf>, <xref> and <ant> cross references, <stagk> and <stagr> form restrictions, loanword origin <lsource> and the translations in <gloss>
The tag elements (<pos>, <ke_inf>, <misc>, ...) are written as XML entities such as &n; defined in the DTD header, which TagResolver turns back into stable codes
A <sense> with no <pos> reuses the previous sense's, so the last seen value is carried forward
Priority codes include corpus frequency bands written as nfXX, where a lower number is more frequent
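Two of the rules above lend themselves to a short sketch: carrying `<pos>` forward across senses and reading the nfXX band. Both helpers are hypothetical and work on plain lists of codes rather than XML elements.

```python
import re
from typing import Iterable, Optional

NF_PATTERN = re.compile(r"nf(\d{2})")


def carry_forward_pos(senses: Iterable[list[str]]) -> list[list[str]]:
    """Give each sense its own pos codes, or the last seen ones when it has none."""
    last_seen: list[str] = []
    resolved = []
    for pos_codes in senses:
        if pos_codes:
            last_seen = pos_codes
        resolved.append(list(last_seen))
    return resolved


def frequency_band(priority_codes: Iterable[str]) -> Optional[int]:
    """Return the number from the first nfXX code, where lower means more frequent."""
    for code in priority_codes:
        match = NF_PATTERN.fullmatch(code)
        if match:
            return int(match.group(1))
    return None
```

For example, senses with pos lists `[["n"], [], ["v1", "vt"], []]` resolve so that the second sense reuses `["n"]` and the fourth reuses `["v1", "vt"]`.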

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped JMdict XML	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `jmdict_entry`, `jmdict_kanji`, `jmdict_kana`, `jmdict_sense`, `jmdict_gloss` and `tag` tables

extract_jmnedict ¶

extract_jmnedict(path)

Stream JMnedict into database rows

File Format

A gzipped XML file with the same shape as JMdict but for proper names, streamed one <entry> at a time
<ent_seq> holds the unique sequence number used as the entry id
Written forms are <k_ele> with the spelling in <keb>, readings are <r_ele> with the kana in <reb>
Each <trans> block is one name reading with its type in <name_type> (surname, place, given, ...), <xref> cross references and the actual translated names in <trans_det>
<name_type> is written as an XML entity defined in the DTD header, which TagResolver turns back into a stable code

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped JMnedict XML	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `jmnedict_entry`, `jmnedict_kanji`, `jmnedict_kana`, `jmnedict_translation`, `jmnedict_gloss` and `tag` tables

extract_kanjidic ¶

extract_kanjidic(path)

Stream KanjiDic2 into database rows

A single character fans out across the kanji table and its seven detail tables

File Format

A gzipped XML file with a root <kanjidic2> element and one <character> per kanji, streamed one at a time. Unlike JMdict it uses no XML entities, so values are read directly
<literal> holds the kanji character itself
<codepoint> lists <cp_value cp_type=...> encodings (Unicode, JIS)
<radical> lists <rad_value rad_type=...> radical numbers, where classical and nelson_c are kept
<misc> holds the school <grade>, one or more <stroke_count> values (the first is accepted, the rest are common miscounts), newspaper <freq>, the old <jlpt> level and <variant> forms
<dic_number> lists <dic_ref dr_type=...> references into print dictionaries, with optional m_vol and m_page attributes
<query_code> lists <q_code qc_type=...> lookup codes such as SKIP, where a skip_misclass attribute flags common misclassifications
<reading_meaning> groups readings and meanings in <rmgroup> blocks, where <reading r_type=...> covers on, kun, pinyin and korean, and <meaning m_lang=...> defaults to English, while <nanori> name readings sit outside the groups
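A few of these rules can be sketched against a hand-written `<character>` fragment. The sample is illustrative rather than copied from the file, `summarize_character` is a hypothetical helper, `en` as the default language code is an assumption, and the sketch uses the standard library's ElementTree instead of the streaming parser.

```python
import xml.etree.ElementTree as ET
from typing import Any

# Hand-written sample in the documented shape, not real file content
SAMPLE = (
    "<character><literal>X</literal>"
    "<misc><grade>8</grade><stroke_count>7</stroke_count><stroke_count>8</stroke_count></misc>"
    "<reading_meaning><rmgroup>"
    '<reading r_type="ja_on">a</reading>'
    "<meaning>Asia</meaning>"
    '<meaning m_lang="fr">Asie</meaning>'
    "</rmgroup><nanori>ya</nanori></reading_meaning>"
    "</character>"
)


def summarize_character(character: ET.Element) -> dict[str, Any]:
    """Pull out the fields whose rules are described above."""
    stroke_text = character.findtext("misc/stroke_count")  # first value only
    return {
        "literal": character.findtext("literal"),
        "stroke_count": int(stroke_text) if stroke_text else None,
        # A missing m_lang attribute is treated as English
        "meanings": [(m.get("m_lang", "en"), m.text) for m in character.iter("meaning")],
        # nanori sit beside the rmgroup blocks, not inside them
        "nanori": [n.text for n in character.findall("reading_meaning/nanori")],
    }
```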

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped KanjiDic2 XML	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `kanji`, `kanji_reading`, `kanji_meaning`, `kanji_nanori`, `kanji_dic_ref`, `kanji_query_code`, `kanji_variant` and `kanji_codepoint` tables

extract_kanjivg ¶

extract_kanjivg(path)

Stream KanjiVG stroke order data into database rows

File Format

A gzipped XML file whose root holds one <kanji> element per character, streamed one at a time
The id attribute is kvg:kanji_XXXXX, where XXXXX is the kanji's lowercase hex Unicode codepoint, decoded back into the literal
Variant forms carry an extra suffix on the id and are skipped so each literal appears once
Inside, nested <g> groups hold the <path> stroke elements, where each <path> is one stroke. The whole element is stored as SVG markup and the stroke count is the number of <path> descendants
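The id decoding and stroke counting can be sketched as follows, assuming ids are exactly five lowercase hex digits after the `kvg:kanji_` prefix. `literal_from_id` and `count_strokes` are hypothetical helpers and the sample element is hand-written.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

ID_PATTERN = re.compile(r"kvg:kanji_([0-9a-f]{5})")

# Hand-written sample with two strokes in nested groups
SAMPLE = (
    '<kanji id="kvg:kanji_04e8c"><g id="kvg:04e8c">'
    '<path d="M1,1"/><g><path d="M2,2"/></g>'
    "</g></kanji>"
)


def literal_from_id(kanji_id: str) -> Optional[str]:
    """Decode the hex codepoint into its character, or None for variant ids with a suffix."""
    match = ID_PATTERN.fullmatch(kanji_id)
    return chr(int(match.group(1), 16)) if match else None


def count_strokes(kanji: ET.Element) -> int:
    """Count every <path> descendant, however deeply nested."""
    return sum(1 for _ in kanji.iter("path"))
```

Using `fullmatch` is what drops the variants here: any suffix after the five hex digits makes the match fail.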

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped `KanjiVG` XML	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `kanji_strokes` table

extract_krad ¶

extract_krad(path)

Stream KRADFILE and RADKFILE into database rows

File Format

The kradzip archive is a ZIP bundling several EUC-JP encoded text files, where lines starting with # are comments
RADKFILE is grouped by radical, where a $ radical strokes line introduces a radical and is followed by the kanji that contain it. Only the $ lines are read here, for the radical stroke counts
KRADFILE lists one kanji per line as kanji : rad1 rad2 ..., giving the radical components that make up that kanji
The two members with a 2 suffix (kradfile2, radkfile2) are newer supplements read the same way
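The two line formats can be sketched with a pair of small parsers. They assume the text has already been decoded from EUC-JP, and `parse_krad_line` and `parse_radk_header` are hypothetical names.

```python
from typing import Optional


def parse_krad_line(line: str) -> Optional[tuple[str, list[str]]]:
    """Parse 'kanji : rad1 rad2 ...' into the kanji and its radical components."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    kanji, _, components = line.partition(" : ")
    return kanji, components.split()


def parse_radk_header(line: str) -> Optional[tuple[str, int]]:
    """Parse a '$ radical strokes' line, ignoring any trailing fields."""
    if not line.startswith("$"):
        return None  # kanji list lines and comments are not read here
    parts = line.split()
    return parts[1], int(parts[2])
```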

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the kradzip archive	required

Yields:

Type	Description
`DatabaseRow`	Rows for the `radical` and `kanji_radical` tables

extract_tatoeba ¶

extract_tatoeba(jpn_path, links_path=None, eng_path=None)

Stream Tatoeba sentences and their alignments into database rows

The global links file is huge, so it is filtered in a single pass to the links that touch a Japanese sentence rather than held in full, and only the English sentences that those links reach are stored

File Format

jpn_sentences.tsv.bz2 and eng_sentences.tsv.bz2 are bzip2 compressed TSV files with the columns id, lang and text
links.tar.bz2 is a tar archive holding links.csv, which despite its name is tab separated, with one source_id and target_id pair per line, listing both directions of every translation link
The id columns are integers and are matched across the three files to align Japanese sentences with their English translations
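The single pass filter can be sketched as a generator over the link lines. Keeping only links whose source is Japanese is an assumption of this sketch: since both directions are listed, it should still see every pair once. `filter_links` is a hypothetical name.

```python
from typing import Iterable, Iterator


def filter_links(link_lines: Iterable[str], japanese_ids: set[int]) -> Iterator[tuple[int, int]]:
    """Yield (source_id, target_id) for links that start at a Japanese sentence."""
    for line in link_lines:
        source, target = line.rstrip("\n").split("\t")
        source_id, target_id = int(source), int(target)
        if source_id in japanese_ids:
            yield source_id, target_id
```

Only the set of Japanese ids is held in memory, and the target ids from the surviving links decide which English sentences are worth storing.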

Parameters:

Name	Type	Description	Default
`jpn_path`	`Path`	Path of the Japanese `jpn_sentences.tsv.bz2`	required
`links_path`	`Path | None`	Path of the global `links.tar.bz2`, or None to skip alignment	`None`
`eng_path`	`Path | None`	Path of the English `eng_sentences.tsv.bz2`, or None to skip alignment	`None`

Yields:

Type	Description
`DatabaseRow`	Rows for the `sentence` and `sentence_link` tables

stream_elements ¶

stream_elements(path, tag, *, resolve_entities=True)

Stream one tag from a gzipped XML file with flat memory usage

How It Works

iterparse runs on libxml2 and fires once per finished element, so the whole document is never built into a tree
After each element is yielded, it is cleared and its already processed previous siblings are deleted, which is what keeps memory usage flat across a multi-hundred megabyte file
The element is still fully populated while the caller holds it, so all reads must happen inside the loop body before the next iteration
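The clear-as-you-go pattern can be sketched with the standard library alone. This is a substitute, not the documented implementation: it uses `xml.etree.ElementTree` rather than the libxml2-backed parser, drops processed siblings by clearing the root instead of deleting them one by one, and assumes the requested tag is a direct child of the root. `stream_elements_sketch` is a hypothetical name.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator


def stream_elements_sketch(path: Path, tag: str) -> Iterator[ET.Element]:
    """Yield each finished element of one tag without keeping the whole tree."""
    with gzip.open(path, "rb") as handle:
        context = ET.iterparse(handle, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, element in context:
            if event == "end" and element.tag == tag:
                yield element  # fully populated while the caller holds it
                element.clear()  # free the element's own children and text
                root.clear()  # drop references to already processed siblings
```

As with the function documented here, every read has to happen inside the loop body, because the element is emptied as soon as the caller asks for the next one.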

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path of the gzipped XML file	required
`tag`	`str`	The element tag to emit, such as `entry` or `character`	required
`resolve_entities`	`bool`	When True, expand XML entities such as `&n;` into their description text while parsing	`True`

Yields:

Type	Description
`_Element`	Each finished element of the requested tag, cleared once the caller moves on
