Models

models

Defines the SQLAlchemy 2.0 typed ORM schema for the kotobase database

Schema
  • Preserves every field from the upstream sources

  • Includes the JMdict part of speech, register, field, dialect, sense information and priority tags, plus furigana segmentation from jmdict_furigana

  • Includes the JMnedict kanji, kana, and translation information, plus furigana segmentation from jmnedict_furigana

  • Includes the full KanjiDic2 reading, meaning and reference set, radical decompositions from KRADFILE, stroke orders from KanjiVG, and pronunciation audio from kanjialive (separate database, attached when present)

  • Includes Japanese example sentence / translation pairs and audio provenance from Tatoeba

  • Includes grammar, kanji, and vocabulary information extracted from the Tanos JLPT Lists

Data Format
  • List shaped, read-only values such as tag code lists, cross references and furigana segments are stored in JSON columns using the SQLite json1 extension

  • Anything that is searched or joined is normalized into its own table

  • Full text search is provided by FTS5 virtual tables that the Build Pipeline creates at build time, so they are not declared here
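The split between JSON columns and normalized tables can be sketched with the standard library alone. This is a minimal illustration, not the kotobase schema: the table names sense and gloss and their column layout are assumptions, and the JSON value is decoded in Python rather than with json1 functions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables: pos is a read-only list, so it is kept as JSON text
    CREATE TABLE sense (id INTEGER PRIMARY KEY, pos TEXT NOT NULL);
    -- Glosses are searched and joined, so they get their own table
    CREATE TABLE gloss (
        id INTEGER PRIMARY KEY,
        sense_id INTEGER NOT NULL REFERENCES sense(id),
        position INTEGER NOT NULL,
        text TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO sense (id, pos) VALUES (?, ?)", (1, json.dumps(["n", "vs"])))
conn.executemany(
    "INSERT INTO gloss (sense_id, position, text) VALUES (?, ?, ?)",
    [(1, 0, "study"), (1, 1, "learning")],
)

# Search the normalized table, then decode the JSON list of the owning sense
row = conn.execute(
    "SELECT s.pos FROM gloss AS g JOIN sense AS s ON s.id = g.sense_id WHERE g.text = ?",
    ("study",),
).fetchone()
pos_tags = json.loads(row[0])
print(pos_tags)
```

The searched value stays in an ordinary column that can be indexed, while the tag list travels with its row and needs no extra join.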

Versioning
  • The schema is versioned by db_meta['schema_version']

  • Bump SCHEMA_VERSION whenever the table layout changes so that a stale database can be detected

SCHEMA_VERSION = 1 module-attribute

Layout version stored in db_meta and checked by the read layer
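A stale database check along these lines might look as follows. This is a sketch under assumptions: the function name check_schema_version is hypothetical, and the db_meta table is modeled only from the key and value columns documented on this page, not from the actual read layer.

```python
import sqlite3

SCHEMA_VERSION = 1  # mirrors the module attribute documented above

def check_schema_version(conn: sqlite3.Connection, expected: int = SCHEMA_VERSION) -> int:
    """Return the stored schema version, or raise if it is missing or stale."""
    row = conn.execute(
        "SELECT value FROM db_meta WHERE key = ?", ("schema_version",)
    ).fetchone()
    if row is None or row[0] is None:
        raise RuntimeError("db_meta has no schema_version entry")
    found = int(row[0])  # values are stored as text
    if found != expected:
        raise RuntimeError(f"stale database: schema {found}, expected {expected}")
    return found

# Demo against a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_meta (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO db_meta (key, value) VALUES ('schema_version', '1')")
version = check_schema_version(conn)
```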

Audio

Bases: Base

A pronunciation audio clip together with its provenance

Optional Audio Pack
  • The data column is filled only in the optional audio pack

  • In the core database, a row may instead carry a url that points at a remote clip, for example a Tatoeba recording
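A consumer therefore has to handle both cases. The sketch below shows one way to decide where a clip comes from; AudioRow is a hypothetical stand-in for the ORM class and clip_source is an illustrative helper, not part of kotobase.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

@dataclass
class AudioRow:
    """Hypothetical stand-in carrying only the columns used here."""
    kind: str
    key: str
    data: Optional[bytes] = None
    url: Optional[str] = None

def clip_source(row: AudioRow) -> Tuple[str, Union[bytes, str, None]]:
    """Prefer bundled bytes, fall back to the remote url, else report missing."""
    if row.data is not None:
        return ("bytes", row.data)
    if row.url is not None:
        return ("url", row.url)
    return ("missing", None)

bundled = AudioRow(kind="kanji_word", key="example", data=b"\x00\x01")
remote = AudioRow(kind="sentence", key="123", url="https://example.invalid/clip.mp3")
print(clip_source(bundled)[0], clip_source(remote)[0])
```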

Attributes:

Name Type Description
id int

Primary key row identifier

kind str

What the clip pronounces, such as kanji_word, kanji_reading or sentence

key str

Lookup key for the clip, such as a kanji, a word or a sentence identifier

reading str | None

The reading the clip pronounces when relevant

fmt str | None

Audio container or codec such as ogg or mp3

sample_rate int | None

Sample rate of the clip in hertz

data bytes | None

Raw audio bytes when bundled in the audio pack

url str | None

Remote location of the clip when it is not bundled

source str

Name of the upstream source, such as kanjialive

license str | None

License identifier for the clip

attribution str | None

Required attribution text or link

Base

Bases: DeclarativeBase

Declarative base class shared by every kotobase ORM model

DbMeta

Bases: Base

A build metadata key and value pair

Contains
  • Schema Version
  • Build Date
  • Build Duration
  • Database Size
  • Version / Date Of Each Data Source

Attributes:

Name Type Description
key str

Primary key, the metadata name

value str | None

The metadata value, stored as text

Furigana

Bases: Base

Furigana segmentation for a spelling and reading pair

Attributes:

Name Type Description
id int

Primary key row identifier

text str

The written spelling, usually containing kanji

reading str

The full kana reading of the spelling

segments list[dict]

The JmdictFurigana segmentation, a list of {"ruby": ..., "rt": ...} records that align spans of the spelling with their readings. The pair of text and reading is unique
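As an illustration of how the segments column could be consumed, the sketch below turns a segment list into HTML ruby markup. The helper name to_ruby_html is hypothetical, and it assumes that a segment with a missing or empty rt is plain kana that needs no annotation.

```python
from html import escape

def to_ruby_html(segments):
    """Render furigana segments as HTML, annotating only segments that have an rt."""
    parts = []
    for seg in segments:
        base = escape(seg["ruby"])
        rt = seg.get("rt")
        if rt:
            parts.append(f"<ruby>{base}<rt>{escape(rt)}</rt></ruby>")
        else:
            parts.append(base)  # kana span, emitted without annotation
    return "".join(parts)

# Illustrative segmentation of 食べる (たべる)
segments = [{"ruby": "食", "rt": "た"}, {"ruby": "べる"}]
markup = to_ruby_html(segments)
print(markup)  # <ruby>食<rt>た</rt></ruby>べる
```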

JMDictEntry

Bases: Base

A single JMdict dictionary entry

Represents one ent_seq record from JMdict, which is the root of a Japanese to English entry. Its surface forms, readings and senses are attached through relationships

Attributes:

Name Type Description
id int

Primary key, the JMdict ent_seq sequence number

is_common bool

True when any form of the entry carries a priority marker that classifies it as common

freq_rank int | None

Frequency rank where a lower value is more frequent, or None when the entry has no priority information

kanji list[JMDictKanji]

Ordered kanji surface forms of the entry

kana list[JMDictKana]

Ordered kana readings of the entry

senses list[JMDictSense]

Ordered senses, each holding its glosses
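Because freq_rank may be None, code that orders entries by frequency has to place unranked entries explicitly. A minimal sketch, using plain dictionaries with made-up ids in place of ORM objects:

```python
def frequency_key(entry):
    """Sort ranked entries by ascending rank and push unranked entries to the end."""
    rank = entry["freq_rank"]
    return (rank is None, rank if rank is not None else 0)

# Made-up rows, not real JMdict data
entries = [
    {"id": 1, "freq_rank": None},
    {"id": 2, "freq_rank": 40},
    {"id": 3, "freq_rank": 7},
]
ordered_ids = [e["id"] for e in sorted(entries, key=frequency_key)]
print(ordered_ids)  # [3, 2, 1]
```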

JMDictGloss

Bases: Base

A single gloss (translation) belonging to a JMdict sense

Attributes:

Name Type Description
id int

Primary key row identifier

sense_id int

Foreign key to JMDictSense

position int

Zero based order of the gloss within the sense

lang str

ISO 639 language code of the gloss, defaults to eng

text str

The translated text

gender str | None

Grammatical gender of the gloss when given

gtype str | None

Gloss type such as lit, fig, expl or tm

sense JMDictSense

The owning sense

JMDictKana

Bases: Base

A kana (reading) form of a JMdict entry

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMDictEntry

position int

Zero based order of the reading within the entry

text str

The kana reading

is_common bool

True when this reading carries a common priority marker

no_kanji bool

True when the reading is not a reading of any kanji form

restrictions list[str]

Kanji forms this reading is restricted to, empty when the reading applies to all kanji forms

info list[str]

Reading information tag codes such as ik or ok

priority list[str]

Priority code list such as news1 or ichi1

entry JMDictEntry

The owning entry
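The no_kanji and restrictions columns together determine which kanji forms a reading may be paired with. The sketch below shows one reading of those rules; valid_pairs is a hypothetical helper and the forms are placeholders rather than real JMdict data.

```python
def valid_pairs(kanji_forms, kana_rows):
    """List (kanji, kana) pairs allowed by no_kanji and restrictions."""
    pairs = []
    for kana in kana_rows:
        if kana["no_kanji"]:
            continue  # this reading is not paired with any kanji form
        # An empty restriction list means the reading applies to every kanji form
        allowed = kana["restrictions"] or kanji_forms
        pairs.extend((k, kana["text"]) for k in kanji_forms if k in allowed)
    return pairs

# Placeholder forms for illustration only
kanji_forms = ["表記A", "表記B"]
kana_rows = [
    {"text": "よみ1", "no_kanji": False, "restrictions": []},
    {"text": "よみ2", "no_kanji": False, "restrictions": ["表記B"]},
    {"text": "ヨミ3", "no_kanji": True, "restrictions": []},
]
pairs = valid_pairs(kanji_forms, kana_rows)
```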

JMDictKanji

Bases: Base

A kanji (written) surface form of a JMdict entry

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMDictEntry

position int

Zero based order of the form within the entry

text str

The kanji spelling

is_common bool

True when this form carries a common priority marker

info list[str]

Spelling information tag codes such as iK or ateji

priority list[str]

Priority code list such as news1 or ichi1

entry JMDictEntry

The owning entry

JMDictSense

Bases: Base

A sense (one meaning) of a JMdict entry

A sense groups one or more glosses that share the same part of speech and usage information

Misc Tags

The misc list carries register and slang markers such as sl (slang), net-sl (internet slang), col (colloquial) and vulg (vulgar)

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMDictEntry

position int

Zero based order of the sense within the entry

pos list[str]

Part of speech tag codes such as n or v5r

field list[str]

Field of application tag codes such as comp or med

misc list[str]

Miscellaneous register tag codes, see the note above

dialect list[str]

Dialect tag codes such as ksb for the Kansai dialect

info list[str]

Free text sense information notes

xref list[str]

Cross reference targets to related entries

antonym list[str]

Antonym references for the sense

applies_to_kanji list[str]

Kanji forms the sense is restricted to, empty when it applies to all kanji forms

applies_to_kana list[str]

Kana forms the sense is restricted to, empty when it applies to all kana forms

lsource list[dict]

Source language records, each holding the language, text, type and a wasei flag for loanwords

entry JMDictEntry

The owning entry

glosses list[JMDictGloss]

Ordered glosses belonging to the sense

JMnedictEntry

Bases: Base

A JMnedict proper name entry

Attributes:

Name Type Description
id int

Primary key, the JMnedict sequence number

kanji list[JMnedictKanji]

Kanji forms of the name

kana list[JMnedictKana]

Kana forms of the name

translations list[JMnedictTranslation]

Ordered translation blocks

JMnedictGloss

Bases: Base

A single translated name belonging to a JMnedict translation block

Attributes:

Name Type Description
id int

Primary key row identifier

translation_id int

Foreign key to JMnedictTranslation

position int

Zero based order of the gloss within the block

lang str

ISO 639 language code of the gloss, defaults to eng

text str

The translated name text

translation JMnedictTranslation

The owning translation block

JMnedictKana

Bases: Base

A kana form of a JMnedict entry

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMnedictEntry

position int

Zero based order of the form within the entry

text str

The kana reading of the name

entry JMnedictEntry

The owning entry

JMnedictKanji

Bases: Base

A kanji form of a JMnedict entry

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMnedictEntry

position int

Zero based order of the form within the entry

text str

The kanji spelling of the name

entry JMnedictEntry

The owning entry

JMnedictTranslation

Bases: Base

A translation block of a JMnedict entry

Each block records the kind of name and holds one or more translated glosses

Attributes:

Name Type Description
id int

Primary key row identifier

entry_id int

Foreign key to JMnedictEntry

position int

Zero based order of the block within the entry

name_type list[str]

Name type tag codes such as place, surname, given or company

xref list[str]

Cross reference targets to related entries

entry JMnedictEntry

The owning entry

glosses list[JMnedictGloss]

Ordered translated names in this block

JlptGrammar

Bases: Base

A Tanos JLPT grammar point

Note

The formation and examples columns are kept for forward compatibility. The current Tanos data does not populate them, so they are normally empty

Attributes:

Name Type Description
id int

Primary key row identifier

level int

JLPT level from 1 (hardest) to 5 (easiest)

grammar str

The grammar point itself

formation str | None

How the grammar point is formed when known

examples list[str]

Example sentences for the grammar point

JlptKanji

Bases: Base

A Tanos JLPT kanji item

Attributes:

Name Type Description
id int

Primary key row identifier

level int

JLPT level from 1 (hardest) to 5 (easiest)

kanji str

The kanji character

on_yomi str | None

On readings, space separated

kun_yomi str | None

Kun readings, space separated

meaning str | None

The English meaning, with senses comma separated

JlptVocab

Bases: Base

A Tanos JLPT vocabulary item

Attributes:

Name Type Description
id int

Primary key row identifier

level int

JLPT level from 1 (hardest) to 5 (easiest)

word str | None

The headword, written with kanji when one exists and falling back to the kana reading otherwise

reading str | None

The kana reading of the headword

meaning str | None

The English meaning, with senses comma separated

Kanji

Bases: Base

A KanjiDic2 character and its scalar attributes

The repeating attributes of a character such as readings, meanings and references live in dedicated child tables that are reachable through the relationships below

Attributes:

Name Type Description
literal str

Primary key, the kanji character itself

grade int | None

School grade in which the kanji is taught

stroke_count int | None

Accepted stroke count

freq int | None

Newspaper frequency rank where a lower value is more frequent

jlpt_old int | None

Pre 2010 four level JLPT class from KanjiDic2

rad_classical int | None

Classical (Kangxi) radical number

rad_nelson int | None

Nelson radical number when it differs from the classical one

stroke_miscounts list[int]

Common stroke count miscounts, listed in addition to the accepted count

readings list[KanjiReading]

On, kun and foreign readings

meanings list[KanjiMeaning]

Meanings keyed by language

nanori list[KanjiNanori]

Name only readings

dic_refs list[KanjiDicRef]

External dictionary references

query_codes list[KanjiQueryCode]

Lookup codes such as SKIP

variants list[KanjiVariant]

Variant form references

codepoints list[KanjiCodepoint]

Character encoding codepoints

strokes KanjiStrokes | None

KanjiVG stroke order data when present

KanjiCodepoint

Bases: Base

A character encoding codepoint of a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

type str

Codepoint type such as ucs or jis208

value str

The codepoint value in that encoding

kanji Kanji

The owning kanji

KanjiDicRef

Bases: Base

An external dictionary reference for a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

type str

Reference type such as nelson_c, heisig or moro

value str

The reference value within that dictionary

extra dict | None

Extra metadata, for example volume and page for Morohashi references

kanji Kanji

The owning kanji

KanjiMeaning

Bases: Base

A meaning of a kanji in a given language

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

lang str

ISO 639 language code of the meaning, defaults to en

value str

The meaning text

position int

Zero based order of the meaning for its language

kanji Kanji

The owning kanji

KanjiNanori

Bases: Base

A nanori (name only) reading of a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

value str

The nanori reading text

position int

Zero based order of the nanori for the kanji

kanji Kanji

The owning kanji

KanjiQueryCode

Bases: Base

A lookup query code for a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

type str

Code type such as skip, four_corner or deroo

value str

The code value

skip_misclass str | None

SKIP misclassification kind when present

kanji Kanji

The owning kanji

KanjiRadical

Bases: Base

A kanji to radical decomposition edge, taken from KRADFILE

Each row records that a kanji contains a given radical component. The pair of kanji and radical is unique

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

The kanji that contains the radical

radical str

The radical component contained in the kanji
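These edges make multi-radical lookup a simple grouped query. The sketch below assumes a table named kanji_radical with the columns listed above; the table name, the helper kanji_with_all and the sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kanji_radical ("
    " id INTEGER PRIMARY KEY, literal TEXT NOT NULL, radical TEXT NOT NULL,"
    " UNIQUE (literal, radical))"
)
# Illustrative decomposition edges
conn.executemany(
    "INSERT INTO kanji_radical (literal, radical) VALUES (?, ?)",
    [("明", "日"), ("明", "月"), ("時", "日"), ("時", "土"), ("時", "寸")],
)

def kanji_with_all(conn, radicals):
    """Return kanji that contain every radical in the given collection."""
    wanted = sorted(set(radicals))
    if not wanted:
        return []
    marks = ", ".join("?" for _ in wanted)
    # The unique (literal, radical) pair makes COUNT(*) equal the number of matches
    sql = (
        f"SELECT literal FROM kanji_radical WHERE radical IN ({marks}) "
        "GROUP BY literal HAVING COUNT(*) = ?"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(kanji_with_all(conn, ["日", "月"]))
```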

KanjiReading

Bases: Base

A reading of a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

type str

Reading type such as ja_on, ja_kun, pinyin or korean_r

value str

The reading text

position int

Zero based order of the reading for its type

kanji Kanji

The owning kanji
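Since position orders readings within their type, a consumer will typically group by type first. A small sketch, with plain dictionaries standing in for KanjiReading rows and illustrative values:

```python
from collections import defaultdict

def group_readings(rows):
    """Group reading values by type, keeping each group in position order."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["type"], r["position"])):
        grouped[row["type"]].append(row["value"])
    return dict(grouped)

# Illustrative rows in arbitrary order
rows = [
    {"type": "ja_on", "value": "ジツ", "position": 1},
    {"type": "ja_kun", "value": "ひ", "position": 0},
    {"type": "ja_on", "value": "ニチ", "position": 0},
]
readings_by_type = group_readings(rows)
```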

KanjiStrokes

Bases: Base

KanjiVG stroke order data for a kanji

Licensing

Provenance and licensing for KanjiVG are recorded once in db_meta rather than on every row to keep the table small

Attributes:

Name Type Description
literal str

Primary key and foreign key to Kanji

stroke_count int | None

Number of strokes in the diagram

svg str

Serialized KanjiVG <kanji> group markup with the stroke paths, from which a consumer can render stroke order

kanji Kanji

The owning kanji

KanjiVariant

Bases: Base

A variant form reference for a kanji

Attributes:

Name Type Description
id int

Primary key row identifier

literal str

Foreign key to Kanji

type str

Encoding of the variant value such as jis208

value str

The variant reference value

kanji Kanji

The owning kanji

Radical

Bases: Base

A search radical and its stroke count, taken from RADKFILE

Attributes:

Name Type Description
radical str

Primary key, the radical character

stroke_count int | None

Number of strokes in the radical

Sentence

Bases: Base

A Tatoeba sentence in a single language

A row is either a Japanese sentence or an English translation of one. The lang column distinguishes them

Attributes:

Name Type Description
id int

Primary key, the Tatoeba sentence identifier

lang str

ISO 639 language code of the sentence

text str

The sentence text

Bases: Base

A translation link from a Japanese sentence to another sentence

Attributes:

Name Type Description
id int

Primary key row identifier

source_id int

Foreign key to the Japanese Sentence

target_id int

Foreign key to the translated Sentence
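Pairing a Japanese sentence with its translations is then a double join through the link rows. In the sketch below the table names sentence and sentence_link, the language codes jpn and eng, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table names modeled on the attributes documented above
    CREATE TABLE sentence (id INTEGER PRIMARY KEY, lang TEXT NOT NULL, text TEXT NOT NULL);
    CREATE TABLE sentence_link (
        id INTEGER PRIMARY KEY,
        source_id INTEGER NOT NULL REFERENCES sentence(id),
        target_id INTEGER NOT NULL REFERENCES sentence(id)
    );
    """
)
# Made-up rows, not actual Tatoeba records
conn.executemany(
    "INSERT INTO sentence (id, lang, text) VALUES (?, ?, ?)",
    [(1, "jpn", "これは本です。"), (2, "eng", "This is a book.")],
)
conn.execute("INSERT INTO sentence_link (source_id, target_id) VALUES (1, 2)")

# Follow each link from the Japanese source to its translated target
pairs = conn.execute(
    "SELECT ja.text, tr.text FROM sentence_link AS l "
    "JOIN sentence AS ja ON ja.id = l.source_id "
    "JOIN sentence AS tr ON tr.id = l.target_id "
    "ORDER BY l.id"
).fetchall()
print(pairs)
```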

Tag

Bases: Base

An entity tag code and its human readable description

Populated from the JMdict and JMnedict <!ENTITY> definitions so that codes such as sl (slang) or ksb (Kansai dialect) can be expanded to text

Attributes:

Name Type Description
code str

Primary key part, the tag code as it appears in the source

category str

Primary key part, the tag family such as pos, misc, field, dialect or name_type

description str

Human readable description of the tag
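Expanding a code list from a JSON column therefore needs both the code and its category. The sketch below assumes a table named tag with a composite primary key; the table name, the helper expand_tags and the description strings are placeholders based on the examples on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag ("
    " code TEXT NOT NULL, category TEXT NOT NULL, description TEXT NOT NULL,"
    " PRIMARY KEY (code, category))"
)
conn.executemany(
    "INSERT INTO tag (code, category, description) VALUES (?, ?, ?)",
    [("sl", "misc", "slang"), ("ksb", "dialect", "Kansai dialect")],
)

def expand_tags(conn, codes, category):
    """Map tag codes to descriptions, keeping unknown codes unchanged."""
    expanded = []
    for code in codes:
        row = conn.execute(
            "SELECT description FROM tag WHERE code = ? AND category = ?",
            (code, category),
        ).fetchone()
        expanded.append(row[0] if row else code)
    return expanded

print(expand_tags(conn, ["sl"], "misc"))
```

Looking up by the pair rather than the code alone matters because the same code may appear in more than one tag family.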