Jpdict

jpdict

BundleMode

Bases: str, Enum

How aggressively the stitch function groups UniDic short-unit tokens into words

The modes run from coarse to fine, from whole dictionary words down to raw morphemes, so the granularity can be matched to what the learner needs

words
  • The coarsest grouping

  • Merges compound nouns and the whole inflected tail of a predicate, including the connecting て, bound auxiliary verbs, and politeness, into single dictionary words (e.g. 図書館, 食べてみたかった)

  • Best for looking words up

grammar (default)
  • The learning view

  • Keeps compound nouns and a predicate's inflectional auxiliaries together, but breaks off the pieces a learner parses separately, like the connecting て, bound auxiliary verbs (みる/いる/出す), and the politeness stems (ます/です) (e.g. 食べ | て | みたかった, 読み | ました)

morphemes
  • The finest grouping

  • No stitching at all, one word per UniDic short unit (e.g. 図書 | 館, 読み | まし | た)
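
A minimal sketch of how a `str`-based enum like this behaves, assuming the member names and string values shown here (the page lists only the mode names, so `WORDS`, `GRAMMAR`, and `MORPHEMES` are illustrative). The example segmentations are the ones quoted in the mode descriptions above.

```python
from enum import Enum


class BundleMode(str, Enum):
    # Member names and values are assumed for illustration only.
    WORDS = "words"
    GRAMMAR = "grammar"
    MORPHEMES = "morphemes"


# Segmentations quoted from the mode descriptions, keyed by mode.
EXAMPLES = {
    BundleMode.WORDS: ["食べてみたかった"],
    BundleMode.GRAMMAR: ["食べ", "て", "みたかった"],
}

# A plain string (e.g. from a query parameter) resolves to the enum member,
# and because the enum subclasses str, members compare equal to raw strings.
mode = BundleMode("grammar")
print(mode is BundleMode.GRAMMAR, mode == "grammar")
print(" | ".join(EXAMPLES[mode]))
```

Subclassing `str` is what lets the mode be passed around and compared as an ordinary string while still restricting it to the three known values.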

EnrichedJapaneseWord

Bases: BaseModel

A JapaneseWord paired with its dictionary data

Returned by the enrichment endpoint (one kotobase lookup per stitched word), as opposed to the fast tokenize endpoint which returns bare JapaneseWord models

Parameters:

Name Type Description Default
word JapaneseWord

The stitched word

required
kotobase_data KotobaseData

The dictionary data for the word's lemma

required

JMEntry

Bases: BaseModel

Represents a single word entry in the Japanese-Multilingual Dictionary

rank

Kotobase calculates the rank attribute based on JMDict's <pri> tags. The following are the possible values and their meanings

  • 0 → High-frequency words found across standard textbooks (ichi1) and newspapers (news1)

  • 1-48 → The specific 500-word corpus interval the word belongs to (e.g., rank 5 means the word is among the 2001st-2500th most common words)

  • 99 → Low-priority or niche words containing auxiliary tags
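
The mapping above can be sketched as a small helper. `describe_rank` is a hypothetical function written for this example, not part of the documented API; it only restates the three cases listed above.

```python
def describe_rank(rank: int) -> str:
    """Hypothetical helper: turn a JMEntry rank into a readable label."""
    if rank == 0:
        return "high-frequency (ichi1 / news1)"
    if 1 <= rank <= 48:
        # Each rank covers one 500-word interval of the frequency list.
        first = (rank - 1) * 500 + 1
        last = rank * 500
        return f"words {first}-{last} by frequency"
    if rank == 99:
        return "low-priority or niche"
    # Values outside the documented set are treated as an error here.
    raise ValueError(f"unexpected rank: {rank}")


print(describe_rank(5))
```

For rank 5 the interval works out to 2001-2500, matching the example in the list.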

Parameters:

Name Type Description Default
rank int

A categorized numerical value mapping the word's real-world popularity

required
kana list[str] | None

List of kana readings

required
kanji list[str] | None

List of kanji spellings

required
senses list[WordSense] | None

List of JMWordSense models

required

JMNEntry

Bases: BaseModel

Represents a single name entry in the Japanese-Multilingual Dictionary

Parameters:

Name Type Description Default
kana list[str] | None

List of kana readings

required
kanji list[str] | None

List of kanji spellings

required
translation_type str | None

Type of name

required
gloss list[str] | None

List of translation strings

required

JMWordSense

Bases: BaseModel

Represents a single sense (a distinct meaning, translation, or nuance) of a word in the Japanese-Multilingual Dictionary

order
  • Represents the sequential arrangement of word meanings based on lexicographical hierarchy

  • Senses progress logically from primary, literal definitions to secondary, figurative, or technical nuances

  • This order is editorially curated and does not reflect mathematical usage frequency

Parameters:

Name Type Description Default
order int

Editorial priority order of the sense

required
pos str

Grammatical classifications like verb (v5u), noun (n), or adjective (adj-no) that apply to this specific meaning

required
gloss str

The English equivalent of the word

required

JapaneseWord

Bases: BaseModel

Represents a single useful word stitched from one or more UniDic short-unit tokens

Stitching
  • UniDic segments at the short-unit level, which is often too granular to be useful (e.g. 図書館 -> 図書 + 館, or a verb split from its auxiliaries)

  • A JapaneseWord re-bundles those short units into the word a learner actually wants to click

  • The original short-unit Token models are kept in tokens so no morphological detail is lost

Example: the word built from 読み + まし + た
  • surface = "読みました" (the pieces joined as written)
  • reading = "ヨミマシタ" (their katakana readings joined)
  • lemma = "読む" (the dictionary form, for look-ups)
  • pos = "動詞" (verb -- the head piece's part of speech)
  • tokens = [読み, まし, た] (the three original short units)
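
The example above can be reproduced with a minimal sketch. It uses plain dataclasses as simplified stand-ins for the real Token and JapaneseWord models, and `bundle` is a hypothetical helper, not the library's stitch function. It assumes the first token is the head and covers only the inflected-word case, where the lemma comes from the head's `orth_base`; the noun-compound rule is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Simplified stand-in: only the fields this sketch needs.
    surface: str
    reading: str
    pos: str
    orth_base: str


@dataclass
class JapaneseWord:
    surface: str
    reading: str
    lemma: str
    pos: str
    tokens: list[Token]


def bundle(tokens: list[Token]) -> JapaneseWord:
    """Hypothetical helper: join short units into one word."""
    head = tokens[0]  # assumption: the first token is the head
    return JapaneseWord(
        surface="".join(t.surface for t in tokens),  # pieces as written
        reading="".join(t.reading for t in tokens),  # katakana readings joined
        lemma=head.orth_base,                        # dictionary form of the head
        pos=head.pos,                                # head's part of speech
        tokens=tokens,                               # original short units kept
    )


word = bundle([
    Token("読み", "ヨミ", "動詞", "読む"),
    Token("まし", "マシ", "助動詞", "ます"),
    Token("た", "タ", "助動詞", "た"),
])
print(word.surface, word.reading, word.lemma, word.pos)
```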

Parameters:

Name Type Description Default
surface str

The bundle's combined surface form

required
reading str

The combined katakana reading of the component tokens

required
lemma str

Dictionary-lookup form (the head token's UniDic orthBase for inflected words, or the combined surface for noun compounds)

required
pos str

The head token's top-level part of speech

required
tokens list[Token]

The component short-unit tokens, in order

required

KanjiInfo

Bases: BaseModel

Represents a single Kanji entry in KANJIDIC2

Parameters:

Name Type Description Default
literal str

Kanji literal

required
grade int | None

Optional Japanese grade in which Kanji is learned

required
stroke_count int | None

Number of strokes in handwriting

required
meanings list[str] | None

List of known meanings

required
onyomi list[str] | None

List of on readings

required
kunyomi list[str] | None

List of kun readings

required
jlpt_kanjidic int | None

Optional JLPT level present in KANJIDIC2

required
jlpt_tanos int | None

Optional JLPT level in Tanos list

required

KotobaseData

Bases: BaseModel

Represents all information extracted from kotobase for a single query (either a single Japanese word, or a wildcard pattern matching multiple words)

meanings
  • Exposes the gloss attributes (English equivalent of the word) of all JMWordSense models contained inside the first Japanese-Multilingual Dictionary entry for the query

  • If the query has only JMNEntry entries, the first entry's gloss attribute is used
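
The fallback described above can be sketched as follows. The dataclasses are simplified stand-ins for the real models, and `derive_meanings` is a hypothetical helper; returning an empty list when neither kind of entry exists is an assumption, since the page does not describe that case.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sense:
    gloss: str


@dataclass
class Entry:
    # Stand-in for JMEntry; senses may be None, as in the documented type.
    senses: Optional[list[Sense]] = None


@dataclass
class NameEntry:
    # Stand-in for JMNEntry; gloss may be None, as in the documented type.
    gloss: Optional[list[str]] = field(default=None)


def derive_meanings(jmentries: list[Entry], jmnentries: list[NameEntry]) -> list[str]:
    """Hypothetical helper mirroring the documented fallback."""
    if jmentries:
        # Glosses of every sense in the first word entry.
        return [s.gloss for s in (jmentries[0].senses or [])]
    if jmnentries:
        # Only name entries exist: use the first one's gloss list.
        return list(jmnentries[0].gloss or [])
    return []  # assumption: no entries means no meanings


print(derive_meanings([Entry([Sense("to read"), Sense("to recite")])], []))
print(derive_meanings([], [NameEntry(["Tokyo"])]))
```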

Parameters:

Name Type Description Default
query str

Query literal (either a single Japanese word or a wildcard pattern)

required
jmentries list[JMEntry]

All Japanese-Multilingual Dictionary entries for the query

required
jmnentries list[JMNEntry]

All Japanese-Multilingual Dictionary name entries for the query

required
kanji list[KanjiInfo]

KANJIDIC2 entries for all Kanji present in the query

required
meanings list[str]

All English equivalents contained in the first JMEntry or, if the query has only name entries, in the first JMNEntry

required
jlpt str

JLPT vocabulary level for the word extracted from the Tanos list. Defaults to Unknown when it's a wildcard query or the word is not in the list

required
examples list[str]

List of example sentences containing the single word or any words matched by the wildcard query

required

Token

Bases: BaseModel

Represents morphological data extracted for a single Japanese token

Maps all core token features and detailed UniDic morphological data produced by Fugashi. Converts internal dictionary placeholders (such as asterisks) into clean Pythonic types.

Attributes:

Name Type Description
surface str

The raw string exactly as it appears in the text.

lemma str

The dictionary base form (語彙素) of the word.

reading str

The standard reading of the token in Katakana.

pos str

The broad, top-level part of speech (品詞).

pos2 str

Sub-category level 2 part of speech.

pos3 str

Sub-category level 3 part of speech.

pos4 str

Sub-category level 4 part of speech.

c_type str

Conjugation type (活用型) if applicable.

c_form str

Conjugation form (活用形) if applicable.

l_form str

Lemma reading in Katakana.

orth str

Orthographic surface representation.

pron str

Actual pronunciation including long vowels.

orth_base str

Base form using current orthography.

pron_base str

Pronunciation of the base form.

goshu str

Word origin type (語種) e.g., Native, Sino-Japanese.

i_type str

Word-initial transformation type.

i_form str

Word-initial transformation form.

f_type str

Word-final transformation type.

f_form str

Word-final transformation form.

clear_asterisks(data) classmethod

Cleans incoming dictionary fields by converting UniDic's "*" sentinel and any missing (None) feature to an empty string, since unknown / out-of-vocabulary tokens leave some features unset

Parameters:

Name Type Description Default
data dict

Raw dictionary data containing morphological fields

required

Returns:

Name Type Description
dict dict[str, Any]

The modified dictionary with "*"/None values replaced by ""
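
A minimal sketch of the cleaning step as a standalone function, assuming a flat dictionary of feature values. The documented method is a classmethod on Token and may modify the dictionary in place; this version returns a new dictionary to stay self-contained.

```python
from typing import Any


def clear_asterisks(data: dict[str, Any]) -> dict[str, Any]:
    """Replace UniDic's "*" sentinel and None values with empty strings."""
    return {
        key: "" if value is None or value == "*" else value
        for key, value in data.items()
    }


raw = {"surface": "読み", "c_type": "*", "goshu": None, "pos": "動詞"}
print(clear_asterisks(raw))
```

Normalizing both cases to `""` means every feature can be declared as a plain `str`, with no need to special-case unknown or out-of-vocabulary tokens downstream.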