Text

`text` ¶

Defines functions that use fugashi and kotobase to tokenize Japanese sentences and build pydantic models containing relevant dictionary data

`enrich(sentence, mode=BundleMode.grammar)` ¶

Segments a sentence and enriches each stitched word with dictionary data

Runs one kotobase lookup per stitched word, keyed on the word's lemma. Because stitching happens first, a compound such as 図書館 (library) is looked up as a single word and gets a real dictionary entry, rather than as the separate fragments 図書 and 館, which would be both slower and less useful
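
As a rough illustration of the "one lookup per stitched word, keyed on the lemma" idea, here is a minimal sketch. It does not use the real module: `Word`, `EnrichedWord`, `TOY_DICT` and `enrich_sketch` are hypothetical stand-ins, and the two-entry dictionary replaces the actual kotobase lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a stitched word."""
    surface: str  # text as it appears in the sentence
    lemma: str    # dictionary form used as the lookup key


@dataclass(frozen=True)
class EnrichedWord:
    """Hypothetical stand-in for a stitched word plus dictionary data."""
    word: Word
    gloss: Optional[str]


# Toy dictionary standing in for the kotobase lookup (assumption).
TOY_DICT = {"図書館": "library", "読む": "to read"}


def enrich_sketch(words):
    # Exactly one lookup per stitched word, keyed on the lemma.
    return [EnrichedWord(w, TOY_DICT.get(w.lemma)) for w in words]


stitched = [Word("図書館", "図書館"), Word("読み", "読む")]
enriched = enrich_sketch(stitched)
for item in enriched:
    print(item.word.surface, item.gloss)
```

Keying on the lemma rather than the surface form is what lets the conjugated 読み find the entry for 読む in this sketch.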

Parameters:

Name	Type	Description	Default
`sentence`	`str`	The Japanese sentence to process	required
`mode`	`BundleMode`	How aggressively to group the tokens	`grammar`

Returns:

Type	Description
`list[EnrichedJapaneseWord]`	A list of `EnrichedJapaneseWord` (stitched word + dictionary data)

Raises:

Type	Description
`FugashiError`	If tokenisation fails
`KotobaseError`	If a dictionary lookup fails

`ensure_fugashi()` ¶

Performs a simple tokenisation operation using fugashi to ensure that it's functional, raising an exception on any failures
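
The self-check pattern described above (run a small operation, convert any failure into a single error type) could look roughly like the sketch below. `FugashiErrorSketch`, `ensure_sketch` and both probes are hypothetical names used only for illustration; the real function runs an actual fugashi tokenisation instead of the placeholder probe.

```python
class FugashiErrorSketch(RuntimeError):
    """Hypothetical stand-in for the module's FugashiError."""


def ensure_sketch(probe, error_cls=FugashiErrorSketch):
    """Run a tiny probe operation and re-raise any failure as error_cls."""
    try:
        probe()
    except Exception as exc:  # any failure counts as "not functional"
        raise error_cls(f"self-check failed: {exc}") from exc


def broken_probe():
    # Simulates a backend whose dictionary files are missing.
    raise OSError("dictionary not found")


# A working probe passes silently (placeholder operation, not fugashi).
ensure_sketch(lambda: list("私は学生です"))

try:
    ensure_sketch(broken_probe)
except FugashiErrorSketch as err:
    caught = err
    print(type(caught).__name__, "caused by", type(caught.__cause__).__name__)
```

Chaining with `from exc` keeps the original exception available as `__cause__`, so callers see one error type without losing the underlying reason.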

Raises:

Type	Description
`FugashiError`	If any error occurs during tokenisation

`ensure_kotobase()` ¶

Performs a simple lookup operation using kotobase to ensure that it's functional, raising an exception on any failures

Raises:

Type	Description
`KotobaseError`	If any error occurs during the lookup

`query_kotobase(query, wildcard=False, include_names=True, sentence_limit=5, entry_limit=None)` `cached` ¶

Wraps `kotobase.Kotobase.lookup` to provide an LRU cache for queries and to build a pydantic model from the results
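
The caching idea can be sketched with the standard library's `functools.lru_cache`. This is an assumption about the general approach, not the module's actual code: `slow_lookup`, `cached_lookup`, the `maxsize` value and the returned tuple are all hypothetical, and the real wrapper returns a `KotobaseData` model.

```python
from functools import lru_cache

CALLS = []  # records every time the (stand-in) backend is actually hit


def slow_lookup(query, wildcard=False):
    """Hypothetical stand-in for the real dictionary lookup."""
    CALLS.append(query)
    return (query, wildcard)


@lru_cache(maxsize=1024)  # maxsize chosen arbitrarily for this sketch
def cached_lookup(query, wildcard=False):
    # Arguments must be hashable, since they form the cache key.
    return slow_lookup(query, wildcard)


first = cached_lookup("図書館")
second = cached_lookup("図書館")  # served from the cache, no backend call
print(cached_lookup.cache_info())
```

One consequence of this design is that repeated queries return the same cached object, so callers should treat the result as read-only.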

Parameters:

Name	Type	Description	Default
`query`	`str`	Word or wildcard pattern to query	required
`wildcard`	`bool`	When `True`, allows wildcards to be passed to `query` in order to match multiple words	`False`
`include_names`	`bool`	When `True`, also includes proper-name entries from the JMnedict dictionary	`True`
`sentence_limit`	`int`	Defines how many `Tatoeba` example sentences to fetch	`5`
`entry_limit`	`int \| None`	Defines the maximum number of combined entries (JMdict + JMnedict) to fetch. Fetches all entries when set to `None`	`None`

Returns:

Type	Description
`KotobaseData`	Pydantic model containing all information extracted from `kotobase` for the query word

Raises:

Type	Description
`KotobaseError`	If the `kotobase` lookup fails

`segment(sentence, mode=BundleMode.grammar)` ¶

Tokenizes and stitches a sentence into useful words (no dictionary lookups)

Usage

This is the fast path used to render clickable text.
Because it skips the (relatively slow) dictionary lookups, it is suited to tokenising whole subtitles or transcripts.
The dictionary data is fetched later, either by `enrich` or when a word is clicked.

Parameters:

Name	Type	Description	Default
`sentence`	`str`	The Japanese sentence to segment	required
`mode`	`BundleMode`	How aggressively to group the tokens	`grammar`

Returns:

Type	Description
`list[JapaneseWord]`	The stitched `JapaneseWord` bundles

Raises:

Type	Description
`FugashiError`	If tokenisation fails

`segment_batch(sentences, mode=BundleMode.grammar)` ¶

Segments many sentences in one call (see `segment`)

Used to tokenize a whole subtitle file up front in a single request, so the player never tokenizes per-cue mid-playback
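
A minimal sketch of the batch contract (one result list per input sentence, in the same order) follows. `segment_stub` is a placeholder that splits into single characters and is not how the real `segment` works; `segment_batch_sketch` and the sample cues are likewise hypothetical.

```python
def segment_stub(sentence):
    # Placeholder segmentation: one "word" per character (assumption,
    # standing in for the real tokenise-and-stitch step).
    return list(sentence)


def segment_batch_sketch(sentences):
    # One output list per input sentence; a list comprehension keeps
    # the results in the same order as the inputs.
    return [segment_stub(s) for s in sentences]


cues = ["本を読む", "", "はい"]
batches = segment_batch_sketch(cues)
print(len(batches), batches[2])
```

Keeping an entry even for an empty cue means the result can be indexed by cue position without any extra bookkeeping.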

Parameters:

Name	Type	Description	Default
`sentences`	`list[str]`	The sentences to segment, in order	required
`mode`	`BundleMode`	How aggressively to group the tokens	`grammar`

Returns:

Type	Description
`list[list[JapaneseWord]]`	One stitched-word list per input sentence, in the same order

Raises:

Type	Description
`FugashiError`	If tokenisation fails

`stitch(tokens, mode=BundleMode.grammar)` ¶

Stitches UniDic short-unit tokens into useful, learner-facing words

The grouping is controlled by `mode` (see `BundleMode`)

私は図書館で本を読みました (grammar mode)

UniDic short units (10):
    私 | は | 図書 | 館 | で | 本 | を | 読み | まし | た

Stitched words (8):
    私 | は | 図書館 | で | 本 | を | 読み | ました

図書 + 館 -> 図書館 (library)
読み stays on its own; the polite auxiliaries まし + た are merged into ました and kept separate from the verb -> 読み | ました
Particles は/で/を stay on their own

Reliability

Splitting verbs, auxiliaries and particles follows directly from UniDic's grammatical labels, so it is essentially deterministic.
Noun compounding is a heuristic. UniDic's short-unit output gives a run of nouns but does not say whether they form one word or several (that information lives in its separate "long unit word" layer, which the short-unit output does not expose), so consecutive nouns are merged by rule and may over- or under-merge.
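
To make the example concrete, here is a minimal sketch of a rule-based stitcher that reproduces the grouping shown above. It is not the module's implementation: `Tok`, `MERGEABLE` and `stitch_sketch` are hypothetical, the part-of-speech tags are hand-assigned simplifications rather than real UniDic labels, and the only rules are "merge consecutive nouns" and "merge consecutive auxiliaries".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical short-unit token with a simplified POS tag."""
    surface: str
    pos: str


# Tags whose consecutive runs are glued into one word (assumed rules).
MERGEABLE = {"noun", "aux"}


def stitch_sketch(tokens):
    words = []  # list of (surface, pos) pairs built left to right
    for tok in tokens:
        if words and tok.pos in MERGEABLE and words[-1][1] == tok.pos:
            # Same mergeable tag as the previous word: extend it.
            words[-1] = (words[-1][0] + tok.surface, tok.pos)
        else:
            words.append((tok.surface, tok.pos))
    return [surface for surface, _ in words]


EXAMPLE = [
    Tok("私", "pronoun"), Tok("は", "particle"),
    Tok("図書", "noun"), Tok("館", "noun"), Tok("で", "particle"),
    Tok("本", "noun"), Tok("を", "particle"),
    Tok("読み", "verb"), Tok("まし", "aux"), Tok("た", "aux"),
]

stitched = stitch_sketch(EXAMPLE)
print(" | ".join(stitched))
```

The noun rule here merges every adjacent pair of nouns unconditionally, which is exactly the kind of heuristic that can over-merge two independent nouns that happen to sit next to each other.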

Parameters:

Name	Type	Description	Default
`tokens`	`list[Token]`	Short-unit tokens, in order	required
`mode`	`BundleMode`	How aggressively to group the tokens	`grammar`

Returns:

Type	Description
`list[JapaneseWord]`	The stitched `JapaneseWord` bundles, in order

`tokenize(sentence)` ¶

Tokenizes a Japanese sentence using fugashi and extracts all token information into a pydantic model

Parameters:

Name	Type	Description	Default
`sentence`	`str`	Sentence to tokenize	required

Returns:

Type	Description
`list[Token]`	List of `Token` models containing the extracted token information

Raises:

Type	Description
`FugashiError`	If the tagger can't be initialised or tokenisation fails
