Text
text
Defines functions that use fugashi and kotobase to tokenize Japanese
sentences and build pydantic models containing relevant dictionary data
enrich(sentence, mode=BundleMode.grammar)
Segments a sentence and enriches each stitched word with dictionary data
Runs one kotobase lookup per stitched word, keyed on the word's lemma.
Stitching first means a compound like 図書館 (library) is looked up as one
word and gets a real dictionary entry, instead of looking up the fragments
図書 and 館 separately (which is both slower and less useful)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sentence` | `str` | The Japanese sentence to process | required |
| `mode` | `BundleMode` | How aggressively to group the tokens | `grammar` |

Returns:
| Type | Description |
|---|---|
| `list[EnrichedJapaneseWord]` | A list of |

Raises:
| Type | Description |
|---|---|
| `FugashiError` | If tokenisation fails |
| `KotobaseError` | If a dictionary lookup fails |
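
The lemma-keyed lookup described above can be sketched roughly as follows. This is an illustration only: `Word`, `FAKE_DICTIONARY` and `enrich_words` are hypothetical stand-ins and are not the module's `JapaneseWord`, kotobase, or `enrich`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stitched word: what the text shows and its dictionary form."""
    surface: str
    lemma: str


# Hypothetical stand-in for the dictionary backend, keyed on lemmas.
FAKE_DICTIONARY = {"図書館": ["library"], "読む": ["to read"]}


def enrich_words(words):
    """Do one lookup per stitched word, keyed on the lemma (not the surface)."""
    return [(word, FAKE_DICTIONARY.get(word.lemma, [])) for word in words]


words = [Word("図書館", "図書館"), Word("読みました", "読む")]
enriched = enrich_words(words)

for word, glosses in enriched:
    print(word.surface, glosses)
```

Keying on the lemma is what lets an inflected surface form such as 読みました still resolve to the entry for 読む in this sketch.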
ensure_fugashi()
Performs a simple tokenisation operation using fugashi to ensure that
it's functional, raising an exception on any failures
Raises:
| Type | Description |
|---|---|
| `FugashiError` | If any error occurs during tokenisation |
ensure_kotobase()
Performs a simple lookup operation using kotobase to ensure that
it's functional, raising an exception on any failures
Raises:
| Type | Description |
|---|---|
| `KotobaseError` | If any error occurs during the lookup |
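
Both `ensure_*` helpers follow the same self-check pattern: run one small operation and turn any failure into a single domain error. A minimal sketch of that pattern, assuming nothing about the real implementation (`BackendError` and `ensure_backend` are hypothetical names standing in for `FugashiError`/`KotobaseError` and the helpers above):

```python
from typing import Callable


class BackendError(Exception):
    """Hypothetical stand-in for FugashiError or KotobaseError."""


def ensure_backend(smoke_test: Callable[[], object]) -> None:
    """Run a tiny operation and convert any failure into BackendError."""
    try:
        smoke_test()
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise BackendError(f"self-check failed: {exc!r}") from exc


# A check that passes returns silently.
ensure_backend(lambda: "テスト".encode("utf-8"))

# A check that fails: the ZeroDivisionError is wrapped in BackendError.
failed = False
cause = None
try:
    ensure_backend(lambda: 1 / 0)
except BackendError as err:
    failed = True
    cause = err.__cause__
```

Callers then only need to handle one exception type per backend, for example at application start-up.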
query_kotobase(query, wildcard=False, include_names=True, sentence_limit=5, entry_limit=None)
cached
Wraps kotobase.Kotobase.lookup to provide an LRU cache for queries and
to build a pydantic model from the results
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Word or wildcard pattern to query | required |
| `wildcard` | `bool` | When | `False` |
| `include_names` | `bool` | When | `True` |
| `sentence_limit` | `int` | Defines how many | `5` |
| `entry_limit` | `int \| None` | Defines the maximum number of combined entries (JMDict + JMNeDict) to fetch. Fetches all entries when set to `None` | `None` |

Returns:
| Type | Description |
|---|---|
| `KotobaseData` | Pydantic model containing all information extracted from |

Raises:
| Type | Description |
|---|---|
| `KotobaseError` | If the |
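
The caching behaviour can be illustrated with the standard library's `functools.lru_cache`. This is a sketch under assumptions: `slow_lookup`, `cached_lookup`, the fake data and the cache size are all hypothetical, and the real function may configure its cache differently.

```python
from functools import lru_cache

CALLS = {"count": 0}


def slow_lookup(query: str, wildcard: bool = False) -> tuple:
    """Hypothetical stand-in for the real (relatively slow) dictionary lookup."""
    CALLS["count"] += 1
    fake_db = {"図書館": ("library",), "本": ("book",)}
    return fake_db.get(query, ())


@lru_cache(maxsize=1024)  # cache size chosen arbitrarily for this sketch
def cached_lookup(query: str, wildcard: bool = False) -> tuple:
    """Cache results per argument combination; all arguments must be hashable."""
    return slow_lookup(query, wildcard)


first = cached_lookup("図書館")
second = cached_lookup("図書館")  # served from the cache, no second lookup
info = cached_lookup.cache_info()
print(CALLS["count"], info.hits, info.misses)
```

Because a cache hit returns the same object that was stored, returning an immutable value (a tuple here) avoids one caller accidentally mutating the result seen by another.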
segment(sentence, mode=BundleMode.grammar)
Tokenizes and stitches a sentence into useful words (no dictionary lookups)
Usage
- This is the fast path used to render clickable text
- Since it skips the (relatively slow) dictionary lookups, it is suited to tokenising whole subtitles/transcripts
- The dictionary data is fetched later by `enrich` or on a word click
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sentence` | `str` | The Japanese sentence to segment | required |
| `mode` | `BundleMode` | How aggressively to group the tokens | `grammar` |

Returns:
| Type | Description |
|---|---|
| `list[JapaneseWord]` | The stitched |

Raises:
| Type | Description |
|---|---|
| `FugashiError` | If tokenisation fails |
segment_batch(sentences, mode=BundleMode.grammar)
Segments many sentences in one call (see segment)
Used to tokenize a whole subtitle file up front in a single request, so the player never tokenizes per-cue mid-playback
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sentences` | `list[str]` | The sentences to segment, in order | required |
| `mode` | `BundleMode` | How aggressively to group the tokens | `grammar` |

Returns:
| Type | Description |
|---|---|
| `list[list[JapaneseWord]]` | One stitched-word list per input sentence, in the same order |

Raises:
| Type | Description |
|---|---|
| `FugashiError` | If tokenisation fails |
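
The batch contract (one result list per input sentence, in input order) can be sketched like this. `fake_segment` and `fake_segment_batch` are hypothetical; the stand-in segmenter simply splits into characters and does no real tokenisation.

```python
def fake_segment(sentence: str) -> list:
    """Hypothetical stand-in for segment: one 'word' per character."""
    return list(sentence)


def fake_segment_batch(sentences: list) -> list:
    """Segment every sentence, keeping results aligned with the inputs."""
    return [fake_segment(sentence) for sentence in sentences]


cues = ["本を読む", "", "図書館"]
batch = fake_segment_batch(cues)

# Result i always belongs to cue i, so a player can index by cue number.
for cue, words in zip(cues, batch):
    print(repr(cue), words)
```

In this sketch an empty cue yields an empty list rather than being dropped, which keeps the output aligned with the input; whether the real function treats empty input the same way is an assumption.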
stitch(tokens, mode=BundleMode.grammar)
Stitches UniDic short-unit tokens into useful, learner-facing words
The grouping is controlled by mode (see BundleMode)
私は図書館で本を読みました (grammar mode)
Reliability
- Splitting verbs, auxiliaries and particles follows directly from UniDic's grammatical labels, so it is essentially deterministic
- Noun compounding is a heuristic. UniDic returns a run of nouns, not whether they form one word or several (that lives in its separate "long unit word" layer, which the short-unit output does not expose), so consecutive nouns are merged by rule and may over- or under-merge
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[Token]` | Short-unit tokens, in order | required |
| `mode` | `BundleMode` | How aggressively to group the tokens | `grammar` |

Returns:
| Type | Description |
|---|---|
| `list[JapaneseWord]` | The stitched |
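
The noun-compounding heuristic can be sketched as a single pass that merges runs of consecutive nouns. This is not the module's rule set: `Tok`, `merge_noun_runs` and the simplified English part-of-speech tags are hypothetical (real UniDic labels are more fine-grained), and all other token types are left untouched here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical short-unit token with a simplified part-of-speech tag."""
    surface: str
    lemma: str
    pos: str


def merge_noun_runs(tokens):
    """Merge each run of consecutive nouns into one token; keep the rest as-is."""
    merged = []
    for tok in tokens:
        if merged and tok.pos == "noun" and merged[-1].pos == "noun":
            prev = merged.pop()
            merged.append(Tok(prev.surface + tok.surface, prev.lemma + tok.lemma, "noun"))
        else:
            merged.append(tok)
    return merged


tokens = [
    Tok("私", "私", "pronoun"),
    Tok("は", "は", "particle"),
    Tok("図書", "図書", "noun"),
    Tok("館", "館", "noun"),  # tagged as a plain noun for simplicity
    Tok("で", "で", "particle"),
    Tok("本", "本", "noun"),
    Tok("を", "を", "particle"),
    Tok("読み", "読む", "verb"),
    Tok("まし", "ます", "auxiliary"),
    Tok("た", "た", "auxiliary"),
]

surfaces = [tok.surface for tok in merge_noun_runs(tokens)]
print(surfaces)
```

The same rule that joins 図書 and 館 would also join two adjacent nouns that are really separate words, which is the over-merging risk noted under Reliability.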
tokenize(sentence)
Tokenizes a Japanese sentence using fugashi and extracts
all token information into a pydantic model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sentence` | `str` | Sentence to tokenize | required |

Returns:
| Type | Description |
|---|---|
| `list[Token]` | list of |

Raises:
| Type | Description |
|---|---|
| `FugashiError` | If the tagger can't be initialised or tokenisation fails |