Extractors
extractors
Defines the extractors that turn each upstream source into database rows
Sources To Rows
- Every extractor is a generator that yields `DatabaseRow` pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the `Schema`
- Extractors know a single source format and nothing about `SQLite`. The `Builder` knows how to insert rows and nothing about parsing. They meet only at the `DatabaseRow` tuple and at the arguments that each extractor might take
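The separation described above can be sketched as follows. This is only an illustration under stated assumptions: `extract_demo`, `insert_rows` and the `demo_entry` table are hypothetical names, and the real `Builder` may insert rows quite differently.

```python
import sqlite3
from collections.abc import Iterator
from typing import Any

# Assumed shape of the row pair: (table name, column -> value mapping)
DatabaseRow = tuple[str, dict[str, Any]]


def extract_demo(lines: list[str]) -> Iterator[DatabaseRow]:
    """Hypothetical extractor: parses 'word<TAB>meaning' lines and knows no SQL."""
    for line in lines:
        word, meaning = line.split("\t")
        yield "demo_entry", {"word": word, "meaning": meaning}


def insert_rows(conn: sqlite3.Connection, rows: Iterator[DatabaseRow]) -> None:
    """Hypothetical builder step: inserts rows and knows nothing about parsing."""
    for table, row in rows:
        columns = ", ".join(row)
        placeholders = ", ".join(f":{name}" for name in row)
        conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_entry (word TEXT, meaning TEXT)")
insert_rows(conn, extract_demo(["猫\tcat", "犬\tdog"]))
stored = conn.execute("SELECT word, meaning FROM demo_entry ORDER BY meaning").fetchall()
```

The two functions share nothing except the row pair, so either side can be swapped without touching the other.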
Data Flow Map
- `JMdict.gz` -> `extract_jmdict` -> Fills (`jmdict_entry`, `jmdict_kanji`, `jmdict_kana`, `jmdict_sense`, `jmdict_gloss`, and `tag`)
- `JMnedict.gz` -> `extract_jmnedict` -> Fills (`jmnedict_entry`, `jmnedict_kanji`, `jmnedict_kana`, `jmnedict_translation`, `jmnedict_gloss`, and `tag`)
- `KanjiDic2.gz` -> `extract_kanjidic` -> Fills (`kanji`, `kanji_reading`, `kanji_meaning`, `kanji_nanori`, `kanji_dic_ref`, `kanji_query_code`, `kanji_variant`, and `kanji_codepoint`)
- `kradzip` -> `extract_krad` -> Fills (`radical` and `kanji_radical`)
- `JmdictFurigana` -> `extract_furigana` -> Fills (`furigana`)
- `KanjiVG.gz` -> `extract_kanjivg` -> Fills (`kanji_strokes`)
- `Tanos JLPT JSON` -> `extract_jlpt` -> Fills (`jlpt_vocab`, `jlpt_kanji`, and `jlpt_grammar`)
- `Tatoeba bz2/tar` -> `extract_tatoeba` -> Fills (`sentence` and `sentence_link`)
- `Kanji alive zip` -> `extract_audio` -> Fills (`audio`), which lives in the separate audio pack database rather than the core one
DatabaseRow
module-attribute
A single database row produced by an extractor
A tuple whose first item is the target table name and whose second item is a dictionary mapping that table's column names to their row values
EXTRACTORS
module-attribute
EXTRACTORS = {
    "jmdict": Extractor(
        "jmdict",
        (
            "jmdict_entry",
            "jmdict_kanji",
            "jmdict_kana",
            "jmdict_sense",
            "jmdict_gloss",
            "tag",
        ),
        extract_jmdict,
    ),
    "jmnedict": Extractor(
        "jmnedict",
        (
            "jmnedict_entry",
            "jmnedict_kanji",
            "jmnedict_kana",
            "jmnedict_translation",
            "jmnedict_gloss",
            "tag",
        ),
        extract_jmnedict,
    ),
    "kanjidic": Extractor(
        "kanjidic",
        (
            "kanji",
            "kanji_reading",
            "kanji_meaning",
            "kanji_nanori",
            "kanji_dic_ref",
            "kanji_query_code",
            "kanji_variant",
            "kanji_codepoint",
        ),
        extract_kanjidic,
    ),
    "krad": Extractor(
        "krad", ("radical", "kanji_radical"), extract_krad
    ),
    "furigana": Extractor(
        "furigana", ("furigana",), extract_furigana
    ),
    "kanjivg": Extractor(
        "kanjivg", ("kanji_strokes",), extract_kanjivg
    ),
    "jlpt": Extractor(
        "jlpt",
        ("jlpt_vocab", "jlpt_kanji", "jlpt_grammar"),
        extract_jlpt,
    ),
    "tatoeba": Extractor(
        "tatoeba",
        ("sentence", "sentence_link"),
        extract_tatoeba,
    ),
    "audio": Extractor("audio", ("audio",), extract_audio),
}
Represents all extractors available to the builder
A dictionary mapping extractor names to their respective Extractor object
ExtractorFunction
module-attribute
The signature of an extractor function
Every extractor is a generator that yields `DatabaseRow` pairs, where the first item is a table name and the second is a row (dict) whose keys match that table's columns in the `Schema`
Extractor
dataclass
Represents a data extractor for a single upstream source
Attribute Breakdown
- `name` → A string used to identify the extractor's `run` output in build logs
- `tables` → The database table names which this extractor yields rows for, always a subset of the `Schema`
- `run` → The function which parses the upstream source data and yields the database rows
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Short identifier used in build logs |
| `tables` | `tuple[str, ...]` | The tables this extractor fills, which makes the data flow readable from the registry alone |
| `run` | `ExtractorFunction` | Callable that takes an arbitrary number of arguments and yields |
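A minimal sketch of how such a dataclass and a registry built from it might look. The field names come from the attribute table above; the type aliases, the `frozen=True` flag, `extract_demo` and `REGISTRY` are assumptions made for illustration only.

```python
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from typing import Any

# Assumed aliases mirroring the descriptions of DatabaseRow and ExtractorFunction
DatabaseRow = tuple[str, dict[str, Any]]
ExtractorFunction = Callable[..., Iterator[DatabaseRow]]


@dataclass(frozen=True)  # frozen is an assumption, not confirmed by the source
class Extractor:
    name: str
    tables: tuple[str, ...]
    run: ExtractorFunction


def extract_demo(words: list[str]) -> Iterator[DatabaseRow]:
    """Hypothetical extractor used only for this sketch."""
    for word in words:
        yield "demo", {"word": word}


# Hypothetical registry with the same shape as EXTRACTORS
REGISTRY = {"demo": Extractor("demo", ("demo",), extract_demo)}

extractor = REGISTRY["demo"]
rows = list(extractor.run(["猫"]))

# Every yielded table name should be one the extractor declared in `tables`
undeclared = {table for table, _ in rows} - set(extractor.tables)
```

Because `tables` is declared up front, a check like `undeclared` can flag an extractor that yields rows for a table it never announced.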
TagResolver
Recovers stable tag codes from JMdict and JMnedict entity descriptions
Document Type Definition
- `JMdict` writes tags as XML entities such as `<pos>&n;</pos>`, and the XML's `DTD` (Document Type Definition) at the top of the file maps each code to a description, for example `n` to `noun (common) (futsuumeishi)`
- The file is parsed with entities resolved, so an element arrives already expanded to the long description
- In order to store the short, stable code of the tag, but keep the information that the long descriptions provide, this resolver reverses the `DTD` map so that the long description can be turned back into the code, while the description is emitted into the `tag` table of the database
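The reversal can be sketched with plain dictionaries. This is an illustration rather than the resolver's actual code: it works on strings instead of XML elements, and `desc_to_code` and `to_codes` are hypothetical names.

```python
# Sample entity table in the code -> description direction, as read from a DTD
code_to_desc = {
    "n": "noun (common) (futsuumeishi)",
    "adj-i": "adjective (keiyoushi)",
}

# Reverse the map so an expanded description leads back to its short code
desc_to_code = {desc: code for code, desc in code_to_desc.items()}


def to_codes(texts: list[str]) -> list[str]:
    """Map resolved texts to codes, keeping unknown texts verbatim."""
    return [desc_to_code.get(text, text) for text in texts]


codes = to_codes(["noun (common) (futsuumeishi)", "free text"])
```

The `dict.get` fallback is what leaves a value that is not a known description untouched.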
Attributes:
| Name | Type | Description |
|---|---|---|
| `code_to_desc` | `dict[str, str]` | Mapping of each tag code to its description |
__init__
codes
Map resolved element texts back to their stable codes
An element whose text is not a known description is kept verbatim, which leaves any non-entity value untouched
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `Iterable[_Element]` | Elements whose text is a resolved entity description | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | The stable code for each element, in order |
from_dtd
classmethod
Read the DTD (Document Type Definition) entity table from the top of a gzipped XML file
The scan stops at the first line containing the stop sentinel, which is the start of the document body
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped XML file | required |
| `stop` | `str` | Substring that marks the end of the DTD, such as | required |

Returns:

| Type | Description |
|---|---|
| `TagResolver` | A resolver populated from the file's entity table |
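A rough sketch of the header scan, assuming each entity sits on its own `<!ENTITY code "description">` line. The function name `read_entities`, the regular expression and the `<JMdict>` sentinel used in the sample are assumptions for illustration, not the library's confirmed behaviour.

```python
import gzip
import re
import tempfile
from pathlib import Path

# Assumed entity line shape: <!ENTITY code "description">
ENTITY = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)">')


def read_entities(path: Path, stop: str) -> dict[str, str]:
    """Collect code -> description pairs until a line contains the sentinel."""
    code_to_desc: dict[str, str] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if stop in line:
                break  # the document body starts here, so stop scanning
            match = ENTITY.search(line)
            if match:
                code_to_desc[match.group(1)] = match.group(2)
    return code_to_desc


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE JMdict [\n"
    '<!ENTITY n "noun (common) (futsuumeishi)">\n'
    '<!ENTITY v1 "Ichidan verb">\n'
    "]>\n"
    "<JMdict>\n"
    '<!ENTITY ignored "after the sentinel">\n'
    "</JMdict>\n"
)
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.xml.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    entities = read_entities(sample_path, "<JMdict>")
```

Stopping at the sentinel means only the small header is read, never the document body.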
tag_rows
Emit tag rows for every element that resolves to a known code
This is the additional extractor for the `tag` table of the database, shared by both JMdict and JMnedict
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `Iterable[_Element]` | Elements whose text is a resolved entity description | required |
| `category` | `str` | The tag category to record, such as | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | One |
extract_audio
Stream Kanji Alive audio clips into database rows
The raw mp3 bytes are stored in the row alongside their license metadata
File Format
- `audio-mp3.zip` is a ZIP of mp3 clips named `{kname}_{index}_{variant}.mp3`, such as `jutsu-no(beru)_1_a.mp3`, where the leading `kname` prefix is a romanized reading
- `ka_data.csv` is a spreadsheet with a `kanji` column and a `kname` column, where `kname` matches the filename prefix and maps each clip to its kanji
- Clips whose prefix is not found in the spreadsheet are skipped
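The filename matching described above might look like the sketch below. The helper `kname_of`, the in-memory CSV and the kanji paired with the sample `kname` are illustrative assumptions; only the naming pattern and the column names come from the format description.

```python
import csv
import io


def kname_of(filename: str) -> str:
    """Return the kname prefix of a '{kname}_{index}_{variant}.mp3' clip name."""
    stem = filename.removesuffix(".mp3")
    # Split from the right so any underscore inside kname itself survives
    kname, _index, _variant = stem.rsplit("_", 2)
    return kname


# Stand-in for ka_data.csv; the kanji value is sample data, not verified
ka_data = io.StringIO("kanji,kname\n述,jutsu-no(beru)\n")
kname_to_kanji = {row["kname"]: row["kanji"] for row in csv.DictReader(ka_data)}

clips = ["jutsu-no(beru)_1_a.mp3", "unknown_1_a.mp3"]
matched = [
    (kname_to_kanji[kname_of(clip)], clip)
    for clip in clips
    if kname_of(clip) in kname_to_kanji  # clips with an unknown prefix are skipped
]
```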
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `audio_path` | `Path` | Path of the | required |
| `ka_data_path` | `Path` | Path of the | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_furigana
Stream JmdictFurigana into database rows
File Format
- A gzipped tar archive holding a single `.json` file, which is one large `JSON` array of records
- Each record is an object with `text` (the written spelling), `reading` (its full kana reading) and `furigana` (the segmentation)
- `furigana` is a list of `{"ruby": span, "rt": kana}` objects that align each span of the spelling to the kana it is read as, and is stored verbatim as JSON in the `segments` column
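As a small illustration of the record shape, the sketch below turns one sample record into a row. The `segments` column is named in the description above, while the `text` and `reading` column names and the helper `furigana_row` are assumptions.

```python
import json
from typing import Any

# One sample record in the shape described above
record = {
    "text": "漢字",
    "reading": "かんじ",
    "furigana": [{"ruby": "漢", "rt": "かん"}, {"ruby": "字", "rt": "じ"}],
}


def furigana_row(item: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Build a row whose segments column holds the segmentation as JSON text."""
    return "furigana", {
        "text": item["text"],  # assumed column name
        "reading": item["reading"],  # assumed column name
        "segments": json.dumps(item["furigana"], ensure_ascii=False),
    }


row = furigana_row(record)
```

Passing `ensure_ascii=False` keeps the kana readable in the stored text instead of escaping it.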
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped tar archive that holds the JSON | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the archive has no |
extract_jlpt
Stream the Tanos JLPT lists into database rows
File Format
- The lists come from Tanos and ship inside the package rather than being downloaded, as the files `{kind}_n{level}.json` for the three kinds across the five levels
- Each file is a `JSON` array of records
- A `vocab` record has `kanji`, `hiragana` and `english`, where the stored word falls back to `hiragana` when `kanji` is empty
- A `kanji` record has `kanji`, space joined `on` and `kun` readings and `english`
- A `grammar` record has `grammar`, `formation` and `examples`, though `formation` and `examples` are empty across the current dataset
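The file naming and the `vocab` fallback can be sketched as below. The kind names are taken from the record descriptions above and the helper `vocab_word` is hypothetical, so treat this as an illustration rather than the extractor's actual code.

```python
# Three kinds across five levels, following the {kind}_n{level}.json pattern
KINDS = ("vocab", "kanji", "grammar")  # assumed to match the filename prefixes
filenames = [f"{kind}_n{level}.json" for kind in KINDS for level in range(1, 6)]


def vocab_word(record: dict[str, str]) -> str:
    """Prefer the kanji spelling and fall back to hiragana when it is empty."""
    return record["kanji"] or record["hiragana"]


with_kanji = vocab_word({"kanji": "猫", "hiragana": "ねこ", "english": "cat"})
kana_only = vocab_word({"kanji": "", "hiragana": "ありがとう", "english": "thank you"})
```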
Yields:
| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_jmdict
Stream JMdict into database rows
File Format
- A gzipped XML file with a root `<JMdict>` element and one `<entry>` per dictionary entry, streamed one entry at a time
- `<ent_seq>` holds the unique sequence number used as the entry id
- Each written form is a `<k_ele>` with the kanji spelling in `<keb>`, spelling-info tags in `<ke_inf>` and priority codes in `<ke_pri>`
- Each reading is an `<r_ele>` with the kana in `<reb>`, info tags in `<re_inf>`, priority codes in `<re_pri>`, optional `<re_restr>` entries that limit the reading to specific `<keb>`, and a `<re_nokanji/>` flag for readings that pair with no kanji
- Each `<sense>` is one meaning, holding part of speech `<pos>`, field of use `<field>`, register `<misc>` (slang, vulgar, ...), `<dial>` dialect, free notes `<s_inf>`, `<xref>` and `<ant>` cross references, `<stagk>` and `<stagr>` form restrictions, loanword origin `<lsource>` and the translations in `<gloss>`
- The tag elements (`<pos>`, `<ke_inf>`, `<misc>`, ...) are written as XML entities such as `&n;` defined in the DTD header, which `TagResolver` turns back into stable codes
- A `<sense>` with no `<pos>` reuses the previous sense's, so the last seen value is carried forward
- Priority codes include corpus frequency bands written as `nfXX`, where a lower number is more frequent
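Two of the rules above, the carried-forward `<pos>` and the `nfXX` bands, can be sketched on a hand-written entry. The sketch uses the standard library's `xml.etree.ElementTree` for brevity, already-resolved codes as element text, and a hypothetical helper `frequency_band`; it is not the extractor's actual code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written sample entry; the second sense has no <pos> of its own
entry = ET.fromstring(
    "<entry><ent_seq>1</ent_seq>"
    "<sense><pos>n</pos><gloss>cat</gloss></sense>"
    "<sense><gloss>feline</gloss></sense>"
    "<sense><pos>v1</pos><gloss>to eat</gloss></sense>"
    "</entry>"
)

last_pos: list = []
senses = []
for sense in entry.iter("sense"):
    pos = [element.text for element in sense.findall("pos")]
    if pos:
        last_pos = pos  # remember the most recent explicit part of speech
    glosses = [gloss.text for gloss in sense.findall("gloss")]
    senses.append((last_pos, glosses))


def frequency_band(priority: str) -> Optional[int]:
    """Return the numeric band of an nfXX code, or None for other codes."""
    if priority.startswith("nf") and priority[2:].isdigit():
        return int(priority[2:])
    return None
```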
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped JMdict XML | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_jmnedict
Stream JMnedict into database rows
File Format
- A gzipped XML file with the same shape as `JMdict` but for proper names, streamed one `<entry>` at a time
- `<ent_seq>` holds the unique sequence number used as the entry id
- Written forms are `<k_ele>` with the spelling in `<keb>`, readings are `<r_ele>` with the kana in `<reb>`
- Each `<trans>` block is one name reading with its type in `<name_type>` (surname, place, given, ...), `<xref>` cross references and the actual translated names in `<trans_det>`
- `<name_type>` is written as an XML entity defined in the DTD header, which `TagResolver` turns back into a stable code
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped JMnedict XML | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_kanjidic
Stream KanjiDic2 into database rows
A single character fans out across the kanji table and its seven detail tables
File Format
- A gzipped XML file with a root `<kanjidic2>` element and one `<character>` per kanji, streamed one at a time. Unlike `JMdict` it uses no XML entities, so values are read directly
- `<literal>` holds the kanji character itself
- `<codepoint>` lists `<cp_value cp_type=...>` encodings (Unicode, JIS)
- `<radical>` lists `<rad_value rad_type=...>` radical numbers, where `classical` and `nelson_c` are kept
- `<misc>` holds the school `<grade>`, one or more `<stroke_count>` values (the first is accepted, the rest are common miscounts), newspaper `<freq>`, the old `<jlpt>` level and `<variant>` forms
- `<dic_number>` lists `<dic_ref dr_type=...>` references into print dictionaries, with optional `m_vol` and `m_page` attributes
- `<query_code>` lists `<q_code qc_type=...>` lookup codes such as SKIP, where a `skip_misclass` attribute flags common misclassifications
- `<reading_meaning>` groups readings and meanings in `<rmgroup>` blocks, where `<reading r_type=...>` covers on, kun, pinyin and korean, and `<meaning m_lang=...>` defaults to English, while `<nanori>` name readings sit outside the groups
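A few of these rules can be tried on a hand-written `<character>` element. The sample values are illustrative only, the `"en"` default label is an assumption about how the missing `m_lang` might be recorded, and the standard library's `xml.etree.ElementTree` stands in for whatever parser the extractor really uses.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape described above; values are illustrative
character = ET.fromstring(
    "<character><literal>亜</literal>"
    "<misc><grade>8</grade>"
    "<stroke_count>7</stroke_count><stroke_count>8</stroke_count></misc>"
    "<reading_meaning><rmgroup>"
    '<reading r_type="ja_on">ア</reading>'
    "<meaning>Asia</meaning>"
    '<meaning m_lang="fr">Asie</meaning>'
    "</rmgroup><nanori>や</nanori></reading_meaning>"
    "</character>"
)

# findtext returns the first match, which is the accepted stroke count
strokes = int(character.findtext("misc/stroke_count"))

# A meaning without m_lang is treated as English
meanings = [(m.get("m_lang", "en"), m.text) for m in character.iter("meaning")]

# nanori readings sit directly under reading_meaning, outside any rmgroup
nanori = [n.text for n in character.findall("reading_meaning/nanori")]
```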
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped KanjiDic2 XML | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_kanjivg
Stream KanjiVG stroke order data into database rows
File Format
- A gzipped XML file whose root holds one `<kanji>` element per character, streamed one at a time
- The `id` attribute is `kvg:kanji_XXXXX`, where `XXXXX` is the kanji's lowercase hex Unicode codepoint, decoded back into the literal
- Variant forms carry an extra suffix on the `id` and are skipped so each literal appears once
- Inside, nested `<g>` groups hold the `<path>` stroke elements, where each `<path>` is one stroke. The whole element is stored as SVG markup and the stroke count is the number of `<path>` descendants
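The id decoding and the stroke count can be sketched as follows. The regular expression, the helper `literal_from_id` and the `-Kaisho` suffix in the usage are assumptions chosen for illustration, and the sample element omits the namespaces a real file carries.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Exactly five lowercase hex digits; anything after them marks a variant form
KANJI_ID = re.compile(r"kvg:kanji_([0-9a-f]{5})")


def literal_from_id(element_id: str) -> Optional[str]:
    """Decode the hex codepoint in the id, or return None for variant ids."""
    match = KANJI_ID.fullmatch(element_id)
    return chr(int(match.group(1), 16)) if match else None


# Simplified sample element with nested groups
kanji = ET.fromstring(
    '<kanji id="kvg:kanji_04e8c">'
    '<g><path d="M1,1"/><g><path d="M2,2"/></g></g>'
    "</kanji>"
)
literal = literal_from_id(kanji.get("id"))
variant = literal_from_id("kvg:kanji_04e8c-Kaisho")

# iter() walks all descendants, so nested groups are counted as well
stroke_count = sum(1 for _ in kanji.iter("path"))
```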
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_krad
Stream KRADFILE and RADKFILE into database rows
File Format
- The `kradzip` archive is a ZIP bundling several `EUC-JP` encoded text files, where lines starting with `#` are comments
- `RADKFILE` is grouped by radical, where a `$ radical strokes` line introduces a radical and is followed by the kanji that contain it. Only the `$` lines are read here, for the radical stroke counts
- `KRADFILE` lists one kanji per line as `kanji : rad1 rad2 ...`, giving the radical components that make up that kanji
- The two `2` suffixed members (`kradfile2`, `radkfile2`) are newer supplements read the same way
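The two line formats can be parsed with a few string operations, as in the sketch below. The helper names and the sample lines are hypothetical, and the real files may hold extra fields that this sketch simply ignores.

```python
from typing import Optional


def parse_krad_line(line: str) -> Optional[tuple[str, list[str]]]:
    """Parse 'kanji : rad1 rad2 ...', skipping comments and other lines."""
    if line.startswith("#") or " : " not in line:
        return None
    kanji, radicals = line.split(" : ", 1)
    return kanji.strip(), radicals.split()


def parse_radk_header(line: str) -> Optional[tuple[str, int]]:
    """Parse a '$ radical strokes' header, ignoring any trailing fields."""
    if not line.startswith("$"):
        return None
    _marker, radical, strokes, *_rest = line.split()
    return radical, int(strokes)


# The archive members are EUC-JP encoded, so bytes are decoded before parsing
raw = "亜 : 一 口\n".encode("euc_jp")
krad = parse_krad_line(raw.decode("euc_jp").rstrip("\n"))
header = parse_radk_header("$ 一 1")
```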
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the kradzip archive | required |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
extract_tatoeba
Stream Tatoeba sentences and their alignments into database rows
The global links file is huge, so it is filtered in a single pass to the links that touch a Japanese sentence rather than held in full, and only the English sentences that those links reach are stored
File Format
- `jpn_sentences.tsv.bz2` and `eng_sentences.tsv.bz2` are bzip2 compressed TSV files with the columns `id`, `lang` and `text`
- `links.tar.bz2` is a tar archive holding `links.csv`, which is actually tab separated `source_id` and `target_id` pairs, listing both directions of every translation link
- The id columns are integers and are matched across the three files to align Japanese sentences with their English translations
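The single-pass filtering can be sketched on in-memory stand-ins for the three files. All ids and sentences below are made up, and the variable names are hypothetical; the sketch only shows one way the matching could work.

```python
import csv
import io

# Ids of the Japanese sentences, as read from the Japanese TSV
jpn_ids = {1, 2}

# Stand-in for links.csv: tab separated pairs, both directions listed
links_file = io.StringIO("1\t10\n10\t1\n2\t11\n3\t12\n12\t3\n")

# Single pass: keep only links whose source is a Japanese sentence.
# Both directions are listed, so checking the source side is enough.
links = []
for source, target in csv.reader(links_file, delimiter="\t"):
    source_id, target_id = int(source), int(target)
    if source_id in jpn_ids:
        links.append((source_id, target_id))
reachable = {target_id for _, target_id in links}

# Stand-in for the English TSV: keep only sentences a kept link reaches
eng_file = io.StringIO("10\teng\tHello.\n12\teng\tNot linked to Japanese.\n")
english = {
    int(sentence_id): text
    for sentence_id, _lang, text in csv.reader(
        eng_file, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if int(sentence_id) in reachable
}

# Drop links whose target turned out not to be an English sentence
kept_links = [(s, t) for s, t in links if t in english]
```

Only the filtered pairs are held in memory, which is the point of not loading the whole links file.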
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `jpn_path` | `Path` | Path of the Japanese | required |
| `links_path` | `Path \| None` | Path of the global | `None` |
| `eng_path` | `Path \| None` | Path of the English | `None` |

Yields:

| Type | Description |
|---|---|
| `DatabaseRow` | Rows for the |
stream_elements
Stream one tag from a gzipped XML file with flat memory usage
How It Works
- `iterparse` runs on `libxml2` and fires once per finished element, so the whole document is never built into a tree
- After each element is yielded, it is cleared and its already processed previous siblings are deleted, which is what keeps memory usage flat across a multi-hundred megabyte file
- The element is still fully populated while the caller holds it, so all reads must happen inside the loop body before the next iteration
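The same streaming idea can be sketched with the standard library's `xml.etree.ElementTree.iterparse` instead of the `libxml2` backed parser. This is a swapped-in technique for illustration: it clears the root rather than deleting previous siblings, it has no `resolve_entities` switch, and `stream_elements_sketch` is a hypothetical name.

```python
import gzip
import tempfile
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from pathlib import Path


def stream_elements_sketch(path: Path, tag: str) -> Iterator[ET.Element]:
    """Yield each finished element with the given tag, then free it."""
    with gzip.open(path, "rb") as handle:
        context = ET.iterparse(handle, events=("start", "end"))
        _, root = next(context)  # the first event is the root's start
        for event, element in context:
            if event == "end" and element.tag == tag:
                yield element
                # Runs only once the caller asks for the next element
                element.clear()
                root.clear()  # drop references to finished children


with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.xml.gz"
    with gzip.open(sample_path, "wb") as handle:
        handle.write(b"<root><entry><k>a</k></entry><entry><k>b</k></entry></root>")
    seen = []
    held = []
    for entry in stream_elements_sketch(sample_path, "entry"):
        seen.append(entry.findtext("k"))  # read inside the loop body
        held.append(entry)  # kept only to show it is emptied afterwards
```

Reading `held[0]` after the loop finds no children left, which is why all reads belong inside the loop body.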
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path of the gzipped XML file | required |
| `tag` | `str` | The element tag to emit, such as | required |
| `resolve_entities` | `bool` | When True, expand XML entities such as | `True` |

Yields:

| Type | Description |
|---|---|
| `_Element` | Each finished element of the requested tag, cleared once the caller moves on |