Text

text

Defines functions that use fugashi and kotobase to tokenize Japanese sentences and build pydantic models containing relevant dictionary data

enrich(sentence, mode=BundleMode.grammar)

Segments a sentence and enriches each stitched word with dictionary data

Runs one kotobase lookup per stitched word, keyed on the word's lemma. Because stitching happens first, a compound like 図書館 (library) is looked up as a single word and gets a real dictionary entry. Looking up the fragments 図書 and 館 separately would be both slower and less useful
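The one-lookup-per-stitched-word idea can be sketched roughly as follows. This is only an illustration: `Word`, `EnrichedWord`, `TOY_DICTIONARY` and `enrich_words` are hypothetical stand-ins, with dataclasses in place of the module's pydantic models and an in-memory dict in place of kotobase.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """Simplified stand-in for a stitched word (hypothetical, not the module's model)."""
    surface: str
    lemma: str


@dataclass
class EnrichedWord:
    """Stand-in for a stitched word plus its dictionary data."""
    word: Word
    glosses: list[str] = field(default_factory=list)


# Toy dictionary standing in for a kotobase lookup (assumption for illustration).
TOY_DICTIONARY = {"図書館": ["library"], "本": ["book"]}


def enrich_words(words: list[Word]) -> list[EnrichedWord]:
    """Run exactly one lookup per stitched word, keyed on its lemma."""
    return [EnrichedWord(w, TOY_DICTIONARY.get(w.lemma, [])) for w in words]


stitched = [Word("図書館", "図書館"), Word("で", "で"), Word("本", "本")]
enriched = enrich_words(stitched)
print([(e.word.surface, e.glosses) for e in enriched])
```

In this sketch the stitched 図書館 resolves to one entry, whereas the fragments 図書 and 館 have no entry in the toy dictionary at all.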

Parameters:

  • sentence (str, required): The Japanese sentence to process
  • mode (BundleMode, default: grammar): How aggressively to group the tokens

Returns:

  • list[EnrichedJapaneseWord]: A list of EnrichedJapaneseWord (stitched word + dictionary data)

Raises:

  • FugashiError: If tokenisation fails
  • KotobaseError: If a dictionary lookup fails

ensure_fugashi()

Performs a simple tokenisation operation using fugashi to ensure that it's functional, raising an exception on any failures
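A health check of this kind might look roughly like the sketch below. It is not the module's implementation: `ensure_tokenizer` and the injected `tokenize_fn` are hypothetical, the `FugashiError` class is a local stand-in, and treating an empty result as a failure is an assumption made here.

```python
class FugashiError(Exception):
    """Local stand-in for the module's FugashiError (the real class is not shown here)."""


def ensure_tokenizer(tokenize_fn) -> None:
    """Smoke-test a tokenizer callable, converting any failure into FugashiError."""
    try:
        tokens = tokenize_fn("テスト")
        if not tokens:
            # Assumption of this sketch: an empty result also counts as a failure.
            raise ValueError("tokenizer returned no tokens")
    except Exception as exc:
        raise FugashiError(f"tokenizer check failed: {exc}") from exc


def broken_tokenizer(text: str):
    """Simulates a tokenizer whose dictionary is missing."""
    raise RuntimeError("dictionary not installed")


ensure_tokenizer(lambda text: list(text))  # a working tokenizer passes silently

try:
    ensure_tokenizer(broken_tokenizer)
except FugashiError as err:
    check_message = str(err)
    print(check_message)
```

Wrapping every failure in a single exception type lets a caller run the check once at startup and handle one error class.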

Raises:

  • FugashiError: If any error occurs during tokenisation

ensure_kotobase()

Performs a simple lookup operation using kotobase to ensure that it's functional, raising an exception on any failures

Raises:

  • KotobaseError: If any error occurs during the lookup

query_kotobase(query, wildcard=False, include_names=True, sentence_limit=5, entry_limit=None) cached

Wraps kotobase.Kotobase.lookup to provide an LRU cache for queries and builds a pydantic model from the results
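The caching behaviour can be illustrated with the standard library's `functools.lru_cache`. This is a sketch under stated assumptions: `slow_lookup` is a hypothetical stand-in for the real dictionary call, `cached_query` takes only a subset of the parameters, and `maxsize=1024` is an arbitrary value chosen for the example.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the (fake) backend is actually hit


def slow_lookup(query: str) -> tuple[str, ...]:
    """Hypothetical stand-in for the real dictionary lookup."""
    global lookup_calls
    lookup_calls += 1
    return (f"entry for {query}",)


@lru_cache(maxsize=1024)  # maxsize is an arbitrary choice for this sketch
def cached_query(query: str, wildcard: bool = False, sentence_limit: int = 5) -> tuple[str, ...]:
    """Cache results per unique combination of (hashable) arguments."""
    return slow_lookup(query)


cached_query("図書館")
cached_query("図書館")                    # served from the cache
cached_query("図書館", sentence_limit=1)  # different arguments, so a new lookup
print(lookup_calls, cached_query.cache_info().hits)
```

Because the cache key is built from the call's arguments, the same word queried with different limits is cached as a separate entry.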

Parameters:

  • query (str, required): Word or wildcard pattern to query
  • wildcard (bool, default: False): When True, allows wildcards to be passed to query in order to match multiple words
  • include_names (bool, default: True): When True, also includes proper-name entries from JMnedict
  • sentence_limit (int, default: 5): Defines how many Tatoeba example sentences to fetch
  • entry_limit (int | None, default: None): Defines the maximum number of combined entries (JMdict + JMnedict) to fetch. Fetches all entries when set to None

Returns:

  • KotobaseData: Pydantic model containing all information extracted from kotobase for the query word

Raises:

  • KotobaseError: If the kotobase lookup fails

segment(sentence, mode=BundleMode.grammar)

Tokenizes and stitches a sentence into useful words (no dictionary lookups)

Usage
  • This is the fast path used to render clickable text

  • Since it skips the (relatively slow) dictionary lookups, it is suited to tokenising whole subtitles/transcripts

  • The dictionary data is fetched later by enrich or on a word click

Parameters:

  • sentence (str, required): The Japanese sentence to segment
  • mode (BundleMode, default: grammar): How aggressively to group the tokens

Returns:

  • list[JapaneseWord]: The stitched JapaneseWord bundles

Raises:

  • FugashiError: If tokenisation fails

segment_batch(sentences, mode=BundleMode.grammar)

Segments many sentences in one call (see segment)

Used to tokenize a whole subtitle file up front in a single request, so the player never tokenizes per-cue mid-playback
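The contract of one result list per input sentence, in the same order, can be sketched as below. `segment_stub` is a hypothetical stand-in that splits on spaces instead of calling a real tokenizer, and keeping an empty list for an empty cue is an assumption of this sketch.

```python
def segment_stub(sentence: str) -> list[str]:
    """Hypothetical single-sentence segmenter; here it just splits on spaces."""
    return sentence.split()


def segment_batch_sketch(sentences: list[str]) -> list[list[str]]:
    """Segment every sentence in one call, preserving input order."""
    return [segment_stub(s) for s in sentences]


cues = ["私 は 学生 です", "", "本 を 読み ました"]
result = segment_batch_sketch(cues)
print(result)
```

Keeping a (possibly empty) list for every input means `result[i]` always corresponds to `cues[i]`, so a caller can pair results with subtitle cues by index.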

Parameters:

  • sentences (list[str], required): The sentences to segment, in order
  • mode (BundleMode, default: grammar): How aggressively to group the tokens

Returns:

  • list[list[JapaneseWord]]: One stitched-word list per input sentence, in the same order

Raises:

  • FugashiError: If tokenisation fails

stitch(tokens, mode=BundleMode.grammar)

Stitches UniDic short-unit tokens into useful, learner-facing words

The grouping is controlled by mode (see BundleMode)

私は図書館で本を読みました (grammar mode)
UniDic short units (10):
    私 | は | 図書 | 館 | で | 本 | を | 読み | まし | た

Stitched words (8):
    私 | は | 図書館 | で | 本 | を | 読み | ました

図書 + 館 -> 図書館 (library)
読み stays on its own; the polite auxiliaries まし + た merge into ました -> 読み | ました
Particles は/で/を stay on their own

Reliability
  • Splitting verbs, auxiliaries and particles follows directly from UniDic's grammatical labels, so it is essentially deterministic

  • Noun compounding is a heuristic. UniDic's short-unit output yields a run of nouns but does not say whether they form one word or several (that information lives in its separate "long unit word" layer, which the short-unit output does not expose). Consecutive nouns are therefore merged by rule and may over- or under-merge
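To make the grouping concrete, here is a toy stitcher that reproduces the example above. It is a simplified sketch, not the module's algorithm: the `Tok` dataclass, the coarse part-of-speech labels (assigned by hand rather than taken from UniDic) and the two merge rules are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Simplified token: surface form plus a coarse part-of-speech label."""
    surface: str
    pos: str  # hand-assigned labels for this sketch, not real UniDic output


def stitch_sketch(tokens: list[Tok]) -> list[str]:
    """Merge noun runs (including noun suffixes) and auxiliary runs; keep everything else apart."""
    words: list[str] = []
    prev_group = None
    for tok in tokens:
        if tok.pos in ("noun", "suffix"):
            group = "nominal"
        elif tok.pos == "aux":
            group = "aux"
        else:
            group = None  # particles, verbs and pronouns never merge in this sketch
        if group is not None and group == prev_group:
            words[-1] += tok.surface  # extend the word currently being built
        else:
            words.append(tok.surface)  # start a new word
        prev_group = group
    return words


tokens = [
    Tok("私", "pronoun"), Tok("は", "particle"),
    Tok("図書", "noun"), Tok("館", "suffix"), Tok("で", "particle"),
    Tok("本", "noun"), Tok("を", "particle"),
    Tok("読み", "verb"), Tok("まし", "aux"), Tok("た", "aux"),
]
stitched = stitch_sketch(tokens)
print(" | ".join(stitched))
```

The nominal rule here merges every adjacent noun unconditionally, so two independent nouns that happen to be adjacent would also be joined, which is the kind of over-merging described above.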

Parameters:

  • tokens (list[Token], required): Short-unit tokens, in order
  • mode (BundleMode, default: grammar): How aggressively to group the tokens

Returns:

  • list[JapaneseWord]: The stitched JapaneseWord bundles, in order

tokenize(sentence)

Tokenizes a Japanese sentence using fugashi and extracts all token information into a pydantic model
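The extraction step might look something like the sketch below. `TokenSketch` and `to_token` are hypothetical, and a dataclass stands in for the pydantic model. The nodes are faked with `SimpleNamespace` so the example runs without fugashi installed. The attribute names `surface`, `feature.lemma` and `feature.pos1` mirror what fugashi exposes with a UniDic dictionary, and the fallback to the surface form is an assumption of this sketch.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenSketch:
    """Simplified stand-in for the module's Token model."""
    surface: str
    lemma: str
    pos: str


def to_token(node) -> TokenSketch:
    """Copy the fields of interest from a fugashi-style node."""
    feature = node.feature
    # A missing lemma falls back to the surface form (assumption of this sketch).
    lemma = getattr(feature, "lemma", None) or node.surface
    return TokenSketch(node.surface, lemma, feature.pos1)


# Fake nodes shaped like tagger output, used here instead of a real tagger.
fake_node = SimpleNamespace(surface="読み", feature=SimpleNamespace(lemma="読む", pos1="動詞"))
unknown_node = SimpleNamespace(surface="ぴよ", feature=SimpleNamespace(lemma=None, pos1="名詞"))

print(to_token(fake_node))
print(to_token(unknown_node))
```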

Parameters:

  • sentence (str, required): Sentence to tokenize

Returns:

  • list[Token]: List of Token models containing extracted token information

Raises:

  • FugashiError: If the tagger can't be initialised or tokenisation fails