Build

`build` ¶

Defines the builder that compiles upstream sources into one SQLite database

Builder and Loader

Builder is a coordinator with a single purpose: turning a stream of DatabaseRow pairs into one finished database. It owns the connection, the build PRAGMAs, the loader and the post load steps, and it delegates parsing to extractors, source paths to config and downloading to download
Loader is the batched insert engine. It is kept as a separate collaborator so the builder reads as a recipe and the executemany buffering stays in one place

The Recipe

`build_core` runs Download -> Create Schema -> Load -> Build Index -> Write Metadata -> Optimize
The read model schema is created from the SQLAlchemy metadata, while the bulk load runs on a raw sqlite3 connection for speed

`AUDIO_SOURCES = ('kanjialive', 'kanjialive_data')` `module-attribute` ¶

Extra source keys downloaded to build the optional audio pack database

`CORE_SOURCES = ('jmdict', 'jmnedict', 'kanjidic2', 'kradzip', 'kanjivg', 'jmdict_furigana', 'tatoeba_jpn')` `module-attribute` ¶

All source keys downloaded for every core build, see `SOURCES`

`LINK_SOURCES = ('tatoeba_links', 'tatoeba_eng')` `module-attribute` ¶

Extra source keys downloaded only when Tatoeba sentence alignment is requested

`Builder` ¶

Coordinates building one SQLite database from a stream of database rows

Scope

The builder owns only what is intrinsic to writing the database efficiently, which includes the following

The connection
The build PRAGMAs
A Loader
The post load steps
Any arguments which the Extractors might receive

Attributes:

Name	Type	Description
`path`	`Path`	The database file being written
`conn`	`Connection`	The open connection used for the bulk load
`loader`	`Loader`	The batched insert helper bound to the connection

`__enter__()` ¶

Enter the builder context

Returns:

Type	Description
`Builder`	The builder itself

`__exit__(*exc)` ¶

Close the connection on exit

Parameters:

Name	Type	Description	Default
`*exc`	`object`	The 3 exception arguments, ignored	`()`

`__init__(path)` ¶

Open a connection to the target database and prepare it for loading

Parameters:

Name	Type	Description	Default
`path`	`Path`	The database file to write, with its schema already created	required
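
The three methods above make up a standard context manager. A minimal sketch of that protocol follows, assuming nothing about the real class beyond what is documented: `MiniBuilder` and `demo` are hypothetical names, and the build PRAGMAs and loader are left out

```python
import sqlite3
import tempfile
from pathlib import Path


class MiniBuilder:
    """Hypothetical stand-in for Builder, showing only the context protocol."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Open the connection up front so the bulk load can reuse it
        self.conn = sqlite3.connect(path)

    def __enter__(self) -> "MiniBuilder":
        # Return the builder itself so `with ... as builder` binds it
        return self

    def __exit__(self, *exc: object) -> None:
        # The three exception arguments are ignored, only the connection is closed
        self.conn.close()


def demo() -> bool:
    """Return True when the connection is closed after the with block."""
    with tempfile.TemporaryDirectory() as tmp:
        with MiniBuilder(Path(tmp) / "demo.db") as builder:
            builder.conn.execute("CREATE TABLE t (x INTEGER)")
            builder.conn.commit()
        try:
            builder.conn.execute("SELECT 1")
        except sqlite3.ProgrammingError:
            return True  # operating on a closed connection raises
        return False
```

Because `__exit__` returns None, an exception raised inside the block still propagates after the connection is closed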

`build_fts()` ¶

Create the gloss full text search index after the bulk load

`finish_load()` ¶

Flush the remaining buffered rows and commit the bulk load

The streamed inserts accumulate in a single implicit transaction, so this commits them once at the end
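
The single implicit transaction can be observed with the standard library alone. This sketch assumes the default transaction handling of Python's `sqlite3` module, where the first INSERT opens a transaction that stays open until `commit()`. The `word` table and its rows are made up for illustration

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (id INTEGER PRIMARY KEY, text TEXT)")
conn.commit()

rows = [(1, "猫"), (2, "犬"), (3, "鳥")]

# Two separate batches, as a batched loader would send them
conn.executemany("INSERT INTO word VALUES (?, ?)", rows[:2])
conn.executemany("INSERT INTO word VALUES (?, ?)", rows[2:])

# Both batches sit in the same open transaction
pending = conn.in_transaction

# One commit at the end makes every batch durable at once
conn.commit()
committed = not conn.in_transaction
count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
```

Committing once avoids paying the transaction overhead per batch, which is the usual reason to defer the commit during a bulk load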

`optimize(*, analyze=True)` ¶

Restore a normal journal, update statistics and compact the file

Parameters:

Name	Type	Description	Default
`analyze`	`bool`	When True, run `ANALYZE` so the query planner has statistics, which the small audio pack does not need	`True`
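
A rough sketch of what such an optimize step might look like with plain `sqlite3`. The source does not say which journal mode the build uses or which one counts as normal, so `OFF` during the load and `DELETE` afterwards are assumptions, as are the function body, the `gloss` table and the `demo` helper

```python
import sqlite3
import tempfile
from pathlib import Path


def optimize(conn: sqlite3.Connection, *, analyze: bool = True) -> str:
    """Sketch of a post load optimize step, returns the journal mode in effect."""
    # The journal mode cannot change and VACUUM cannot run inside a transaction
    conn.commit()
    # Assumed "normal" journal mode, the real builder may choose another
    mode = conn.execute("PRAGMA journal_mode = DELETE").fetchone()[0]
    if analyze:
        # Gather statistics for the query planner
        conn.execute("ANALYZE")
    # Rebuild the file to reclaim free pages
    conn.execute("VACUUM")
    return mode


def demo() -> tuple[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(Path(tmp) / "demo.db")
        # Assumed build setting, journaling disabled during the bulk load
        conn.execute("PRAGMA journal_mode = OFF")
        conn.execute("CREATE TABLE gloss (id INTEGER PRIMARY KEY, text TEXT)")
        conn.execute("CREATE INDEX gloss_text ON gloss (text)")
        conn.executemany("INSERT INTO gloss (text) VALUES (?)", [("cat",), ("dog",)])
        mode = optimize(conn)
        # ANALYZE stores its statistics in the sqlite_stat1 table
        has_stats = conn.execute(
            "SELECT COUNT(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
        ).fetchone()[0] == 1
        conn.close()
        return mode, has_stats
```

Passing `analyze=False` skips the statistics pass, which matches the note that a small database such as the audio pack does not need it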

`report_counts()` ¶

Print the number of rows inserted into each table

`run(name, *args)` ¶

Stream one registered extractor through the loader

The extractor is looked up by name in EXTRACTORS and called with whatever positional arguments it declares, since each extractor owns its own signature

Parameters:

Name	Type	Description	Default
`name`	`str`	Registry key of the extractor to run	required
`*args`	`Any`	Positional arguments forwarded to the extractor, such as the downloaded source paths it parses	`()`
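
The lookup and forwarding described above can be sketched as a small registry dispatch. Everything here is illustrative: the `register` decorator, the two extractors and the list used as a sink are hypothetical, and the `(table, row)` pair shape is an assumption about what a DatabaseRow pair looks like

```python
from collections.abc import Callable, Iterator
from typing import Any

# Assumed row shape: (table name, mapping of column name to value)
Row = tuple[str, dict[str, Any]]

# Hypothetical registry standing in for the real EXTRACTORS mapping
EXTRACTORS: dict[str, Callable[..., Iterator[Row]]] = {}


def register(name: str):
    """Register an extractor under a key."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("words")
def extract_words(lines: list[str]) -> Iterator[Row]:
    # This extractor declares one positional argument
    for i, text in enumerate(lines, start=1):
        yield "word", {"id": i, "text": text}


@register("links")
def extract_links(left: list[int], right: list[int]) -> Iterator[Row]:
    # This one declares two, the caller never needs to know
    for a, b in zip(left, right):
        yield "link", {"src": a, "dst": b}


def run(name: str, *args: Any, sink: list[Row]) -> None:
    """Look the extractor up by name and stream its rows into the sink."""
    for table, row in EXTRACTORS[name](*args):
        sink.append((table, row))
```

Forwarding `*args` untouched keeps the dispatcher free of per-source knowledge, so adding a source only means registering another extractor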

`write_meta(paths, seconds)` ¶

Record build metadata into the db_meta table

Parameters:

Name	Type	Description	Default
`paths`	`dict[str, Path]`	Mapping of source key to downloaded file	required
`seconds`	`float`	Wall clock build duration in seconds	required
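
For illustration only, a metadata writer with this signature could look like the sketch below. The layout of `db_meta` is not documented here, so the key/value columns, the `source:` key prefix, the `build_seconds` key and the stored values are all assumptions

```python
import sqlite3
from pathlib import Path


def write_meta(conn: sqlite3.Connection, paths: dict[str, Path], seconds: float) -> None:
    """Sketch: store one row per source plus the build duration."""
    # Assumed schema, the real db_meta table may differ
    conn.execute(
        "CREATE TABLE IF NOT EXISTS db_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    # Record which file each source key was built from
    entries = [(f"source:{key}", path.name) for key, path in paths.items()]
    entries.append(("build_seconds", f"{seconds:.1f}"))
    conn.executemany("INSERT OR REPLACE INTO db_meta VALUES (?, ?)", entries)
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical download location for one source
write_meta(conn, {"jmdict": Path("downloads") / "jmdict.xml"}, 12.34)
meta = dict(conn.execute("SELECT key, value FROM db_meta"))
```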

`Loader` ¶

Batched multi-table insert helper for a sqlite3 connection

How It Works

Rows are buffered per table and flushed with executemany once a batch fills, which is far faster than inserting one row at a time
The insert statement for a table is derived from the keys of its first row, so every row for a table must carry the same keys
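
The buffering scheme described above can be sketched in a few lines. `MiniLoader` is a hypothetical reduction, not the real class: it builds a plain `INSERT` and ignores `_OR_IGNORE`, it flushes once the buffer reaches `batch` rows (the exact comparison is an assumption), and it trusts the table and column names it is given

```python
import sqlite3
from typing import Any


class MiniLoader:
    """Hypothetical reduced loader: per-table buffers flushed with executemany."""

    def __init__(self, conn: sqlite3.Connection, *, batch: int = 2) -> None:
        self.conn = conn
        self.batch = batch
        self.counts: dict[str, int] = {}
        self._buffers: dict[str, list[tuple[Any, ...]]] = {}
        self._columns: dict[str, list[str]] = {}
        self._statements: dict[str, str] = {}

    def add(self, table: str, row: dict[str, Any]) -> None:
        if table not in self._columns:
            # The first row of a table fixes the column order and the statement
            columns = list(row.keys())
            marks = ", ".join("?" for _ in columns)
            self._columns[table] = columns
            self._statements[table] = (
                f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({marks})"
            )
            self._buffers[table] = []
            self.counts[table] = 0
        buffer = self._buffers[table]
        # A later row missing one of the first row's keys raises KeyError here
        buffer.append(tuple(row[c] for c in self._columns[table]))
        self.counts[table] += 1
        if len(buffer) >= self.batch:
            self._flush(table)

    def _flush(self, table: str) -> None:
        buffer = self._buffers[table]
        if buffer:
            self.conn.executemany(self._statements[table], buffer)
            buffer.clear()

    def flush_all(self) -> None:
        for table in self._buffers:
            self._flush(table)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (id INTEGER, text TEXT)")
loader = MiniLoader(conn, batch=2)
for i, text in enumerate(["猫", "犬", "鳥"], start=1):
    loader.add("word", {"id": i, "text": text})
buffered = len(loader._buffers["word"])  # rows still waiting for a flush
loader.flush_all()
conn.commit()
stored = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
```

Caching the statement per table is what makes the same-keys requirement matter: the column list is decided once, and every later row is packed into a tuple in that order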

Attributes:

Name	Type	Description
`conn`	`Connection`	The open database connection
`batch`	`int`	Number of rows to buffer before a flush
`counts`	`dict[str, int]`	Mapping of table names to the number of rows inserted into each by this instance through `add`
`_buffers`	`dict[str, list[tuple[Any, ...]]]`	Mapping of table names to their individual row buffers. Each buffer accumulates rows added by `add` until its length is greater than `batch`, at which point the rows are inserted into the database with `executemany`
`_columns`	`dict[str, list[str]]`	Mapping of table names to a list containing their column names. The order is derived from the `add` function's `row` argument (`list(row.keys())`) when it is first called on a `table` and is used to build a consistent insert statement for subsequent rows of that same `table`
`_statements`	`dict[str, str]`	Mapping of table names to their `INSERT` SQL statement derived from `rows` and `_OR_IGNORE`

`__init__(conn, *, batch=_BATCH)` ¶

Create a loader bound to a connection

Parameters:

Name	Type	Description	Default
`conn`	`Connection`	The open database connection	required
`batch`	`int`	Number of rows to buffer before a flush	`_BATCH`

`add(table, row)` ¶

Buffer a single row for a table and flush when the batch is full

Parameters:

Name	Type	Description	Default
`table`	`str`	Target table name	required
`row`	`dict`	Row whose keys are column names	required

`flush_all()` ¶

Flush every buffered table

Used to insert the remaining rows still in the buffer

`build_audio(*, force=False)` ¶

Build the optional audio pack from the Kanji alive media

The audio pack is a separate database holding only the audio table, which the read layer attaches when it is present. Keeping it out of the core database keeps the default download small

Parameters:

Name	Type	Description	Default
`force`	`bool`	When True, rebuild even if the pack already exists	`False`

Returns:

Type	Description
`Path`	The path of the compiled audio pack

Raises:

Type	Description
`FileExistsError`	If the pack already exists and `force` is False

`build_core(*, force=False, include_links=True)` ¶

Build the kotobase core database from its sources

Parameters:

Name	Type	Description	Default
`force`	`bool`	When True, rebuild even if a database already exists	`False`
`include_links`	`bool`	When True, download and align the Tatoeba links and English sentences, which is the heaviest part of the build	`True`

Returns:

Type	Description
`Path`	The path of the compiled database

Raises:

Type	Description
`FileExistsError`	If a database already exists and `force` is False
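
The `force` contract shared by `build_core` and `build_audio` can be sketched as a small guard. `prepare_target` and `demo` are hypothetical names, and deleting the old file before a forced rebuild is an assumption about how the rebuild starts

```python
import tempfile
from pathlib import Path


def prepare_target(target: Path, *, force: bool = False) -> Path:
    """Refuse to overwrite an existing database unless force is set."""
    if target.exists():
        if not force:
            raise FileExistsError(f"{target} already exists, pass force=True to rebuild")
        # Assumed behaviour: a forced rebuild starts from an empty file
        target.unlink()
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


def demo() -> tuple[bool, bool]:
    """Return (refused without force, file still present after force)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "core.db"
        target.write_bytes(b"old build")
        try:
            prepare_target(target)
            refused = False
        except FileExistsError:
            refused = True
        prepare_target(target, force=True)
        return refused, target.exists()
```

Raising instead of silently reusing the file makes an accidental second build visible to the caller, who can then decide whether a rebuild is wanted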

`compress(database=None)` ¶

Compress a built database to a zstandard archive for publishing

Parameters:

Name	Type	Description	Default
`database`	`Path | None`	Database to compress, or None for the default cache location	`None`

Returns:

Type	Description
`Path`	The path of the written zstandard archive
