Build
build
Defines the builder that compiles upstream sources into one SQLite database
Builder and Loader
- `Builder` is a coordinator with a single purpose, turning a stream of `DatabaseRow` pairs into one finished database. It owns the connection, the build PRAGMAs, the loader and the post-load steps, and it delegates parsing to `extractors`, source paths to `config` and downloading to `download`
- `Loader` is the batched insert engine. It is kept as a separate collaborator so the builder reads as a recipe and the `executemany` buffering stays in one place
The Recipe
- `build_core` runs Download -> Create Schema -> Load -> Build Index -> Write Metadata -> Optimize
- The read model schema is created from the `SQLAlchemy` metadata, while the bulk load runs on a raw `sqlite3` connection for speed
AUDIO_SOURCES = ('kanjialive', 'kanjialive_data')
module-attribute
Extra source keys downloaded to build the optional audio pack database
CORE_SOURCES = ('jmdict', 'jmnedict', 'kanjidic2', 'kradzip', 'kanjivg', 'jmdict_furigana', 'tatoeba_jpn')
module-attribute
All Source keys downloaded for every core build, see `SOURCES`
LINK_SOURCES = ('tatoeba_links', 'tatoeba_eng')
module-attribute
Extra source keys downloaded only when Tatoeba sentence alignment is requested
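As a rough illustration of how these tuples might combine, the sketch below picks the source keys for a core build. The three constants are copied from this page; `select_sources` is a hypothetical helper invented for the example and is not part of the module.

```python
from __future__ import annotations

# Constants copied from this page.
CORE_SOURCES = ('jmdict', 'jmnedict', 'kanjidic2', 'kradzip', 'kanjivg',
                'jmdict_furigana', 'tatoeba_jpn')
LINK_SOURCES = ('tatoeba_links', 'tatoeba_eng')
AUDIO_SOURCES = ('kanjialive', 'kanjialive_data')


def select_sources(*, include_links: bool = True) -> tuple[str, ...]:
    """Hypothetical helper: return the keys a core build would download."""
    if include_links:
        # The link sources are only added when sentence alignment is wanted.
        return CORE_SOURCES + LINK_SOURCES
    return CORE_SOURCES


keys_without_links = select_sources(include_links=False)
keys_with_links = select_sources()
```

The audio keys are left out here because the audio pack is described as a separate, optional build.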
Builder
Coordinates building one SQLite database from a stream of database rows
Scope
The builder owns only what is intrinsic to writing the database efficiently, which includes the following
- The connection
- The build PRAGMAs
- A `Loader`
- The post-load steps
- Any arguments which the `Extractors` might receive
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path` | The database file being written |
| `conn` | `Connection` | The open connection used for the bulk load |
| `loader` | `Loader` | The batched insert helper bound to the connection |
__exit__(*exc)
Close the connection on exit
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*exc` | `object` | The three exception arguments, ignored | `()` |
__init__(path)
Open a connection to the target database and prepare it for loading
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | The database file to write, with its schema already created | required |
build_fts()
Create the gloss full text search index after the bulk load
finish_load()
Flush the remaining buffered rows and commit the bulk load
The streamed inserts accumulate in a single implicit transaction, so this commits them once at the end
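The single-transaction behaviour can be seen with the standard library alone. This is a minimal sketch that assumes the connection uses `sqlite3`'s default transaction handling; the `entry` table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, text TEXT)")
conn.commit()

# In sqlite3's default mode the first INSERT implicitly opens a transaction,
# and later batches join it until commit() is called.
conn.executemany("INSERT INTO entry (id, text) VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO entry (id, text) VALUES (?, ?)", [(3, "c")])
opened = conn.in_transaction  # both batches are pending in one transaction

conn.commit()  # one commit at the end, as finish_load is described to do
closed = not conn.in_transaction
total = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

Committing once avoids paying the per-transaction cost for every batch.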
optimize(*, analyze=True)
Restore a normal journal, update statistics and compact the file
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `analyze` | `bool` | When True, run | `True` |
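The page does not list the statements `optimize` issues, so the sketch below is only one plausible reading of "restore a normal journal, update statistics and compact the file". The choice of `journal_mode = DELETE`, `ANALYZE` and `VACUUM`, the build-time `journal_mode = OFF`, and the free function `optimize_sketch` are all assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def optimize_sketch(conn: sqlite3.Connection, *, analyze: bool = True) -> None:
    """Hypothetical post-load tidy-up; the exact statements are assumptions."""
    conn.commit()  # VACUUM cannot run inside an open transaction
    # Assumed meaning of "normal journal": SQLite's default rollback journal.
    conn.execute("PRAGMA journal_mode = DELETE").fetchall()
    if analyze:
        conn.execute("ANALYZE")  # refresh the query planner statistics
    conn.execute("VACUUM")  # rebuild the file to reclaim free pages


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(Path(tmp) / "demo.db")
    conn.execute("PRAGMA journal_mode = OFF").fetchall()  # assumed build setting
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("CREATE INDEX t_x ON t (x)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
    optimize_sketch(conn)
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    has_stats = conn.execute("SELECT COUNT(*) FROM sqlite_stat1").fetchone()[0] > 0
    conn.close()
```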
report_counts()
Print the number of rows inserted into each table
run(name, *args)
Stream one registered extractor through the loader
The extractor is looked up by name in `EXTRACTORS` and called with whatever positional arguments it declares, since each extractor owns its own signature
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Registry key of the extractor to run | required |
| `*args` | `Any` | Positional arguments forwarded to the extractor, such as the downloaded source paths it parses | `()` |
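The lookup-and-forward pattern described for `run` might look roughly like the sketch below. Everything in it is illustrative: the `register` decorator, the `ListLoader` stand-in, the `words` extractor and the assumed `(table, row)` shape of a `DatabaseRow` pair are not taken from the project.

```python
from collections.abc import Callable, Iterator
from typing import Any

# Assumed shape of a DatabaseRow pair: (table name, row mapping).
DatabaseRow = tuple[str, dict[str, Any]]

# Hypothetical registry mapping a name to a row generator.
EXTRACTORS: dict[str, Callable[..., Iterator[DatabaseRow]]] = {}


def register(name: str):
    """Hypothetical decorator that files an extractor under a registry key."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("words")
def extract_words(path: str, lang: str = "eng") -> Iterator[DatabaseRow]:
    # Each extractor declares its own positional arguments.
    for i, word in enumerate(["neko", "inu"]):
        yield "entry", {"id": i, "word": word, "source": path, "lang": lang}


class ListLoader:
    """Stand-in for the real Loader; it only records what it is given."""

    def __init__(self) -> None:
        self.rows: list[DatabaseRow] = []

    def add(self, table: str, row: dict[str, Any]) -> None:
        self.rows.append((table, row))


def run(loader: ListLoader, name: str, *args: Any) -> None:
    extractor = EXTRACTORS[name]  # raises KeyError for an unknown name
    for table, row in extractor(*args):  # forward the arguments untouched
        loader.add(table, row)


loader = ListLoader()
run(loader, "words", "jmdict.xml")
```

Forwarding `*args` unchanged keeps the coordinator ignorant of what each extractor needs, which matches the statement that each extractor owns its own signature.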
Loader
Batched multi-table insert helper for a sqlite3 connection
How It Works
- Rows are buffered per table and flushed with `executemany` once a batch fills, which is far faster than inserting one row at a time
- The insert statement for a table is derived from the keys of its first row, so every row for a table must carry the same keys
Attributes:
| Name | Type | Description |
|---|---|---|
| `conn` | `Connection` | The open database connection |
| `batch` | `int` | Number of rows to buffer before a flush |
| `counts` | `dict[str, int]` | Mapping of table names to the number of rows inserted into it by the instance using |
| `_buffers` | `dict[str, list[tuple[Any, ...]]]` | Mapping of table names to their individual row buffers, each one accumulates rows added by |
| `_columns` | `dict[str, list[str]]` | Mapping of table names to a list containing their column names. The order is derived from the |
| `_statements` | `dict[str, str]` | Mapping of table names to their |
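The buffering described under How It Works can be sketched with the attributes listed above. `MiniLoader` is an illustrative stand-in, not the project's `Loader`: the per-table `flush` method, the moment at which `counts` is updated and the small batch size are all assumptions.

```python
import sqlite3
from typing import Any


class MiniLoader:
    """Illustrative batched loader; not the project's actual Loader."""

    def __init__(self, conn: sqlite3.Connection, *, batch: int = 3) -> None:
        self.conn = conn
        self.batch = batch
        self.counts: dict[str, int] = {}
        self._buffers: dict[str, list[tuple[Any, ...]]] = {}
        self._columns: dict[str, list[str]] = {}
        self._statements: dict[str, str] = {}

    def add(self, table: str, row: dict[str, Any]) -> None:
        if table not in self._statements:
            # Derive the column order and INSERT statement from the first row.
            # Table and column names are interpolated, so they must be trusted.
            columns = list(row)
            placeholders = ", ".join("?" for _ in columns)
            self._columns[table] = columns
            self._statements[table] = (
                f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
            )
            self._buffers[table] = []
            self.counts[table] = 0
        # A row missing one of the first row's keys raises KeyError here.
        self._buffers[table].append(tuple(row[c] for c in self._columns[table]))
        if len(self._buffers[table]) >= self.batch:
            self.flush(table)

    def flush(self, table: str) -> None:
        rows = self._buffers[table]
        if rows:
            self.conn.executemany(self._statements[table], rows)
            self.counts[table] += len(rows)
            rows.clear()

    def flush_all(self) -> None:
        for table in self._buffers:
            self.flush(table)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER, word TEXT)")
loader = MiniLoader(conn, batch=3)
for i in range(7):
    loader.add("entry", {"id": i, "word": f"w{i}"})
flushed_by_batches = loader.counts["entry"]  # two full batches so far
loader.flush_all()  # pushes the one row still buffered
conn.commit()
total = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

Converting each row to a tuple in a fixed column order is what lets one prepared statement serve every row of a table.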
__init__(conn, *, batch=_BATCH)
Create a loader bound to a connection
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `conn` | `Connection` | The open database connection | required |
| `batch` | `int` | Number of rows to buffer before a flush | `_BATCH` |
add(table, row)
flush_all()
Flush every buffered table
Used to insert the remaining rows still in the buffer
build_audio(*, force=False)
Build the optional audio pack from the Kanji alive media
The audio pack is a separate database holding only the audio table, which the read layer attaches when it is present. Keeping it out of the core database keeps the default download small
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `force` | `bool` | When True, rebuild even if the pack already exists | `False` |
Returns:
| Type | Description |
|---|---|
| `Path` | The path of the compiled audio pack |
Raises:
| Type | Description |
|---|---|
| `FileExistsError` | If the pack already exists and |
build_core(*, force=False, include_links=True)
Build the kotobase core database from its sources
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `force` | `bool` | When True, rebuild even if a database already exists | `False` |
| `include_links` | `bool` | When True, download and align the Tatoeba links and English sentences, which is the heaviest part of the build | `True` |
Returns:
| Type | Description |
|---|---|
| `Path` | The path of the compiled database |
Raises:
| Type | Description |
|---|---|
| `FileExistsError` | If a database already exists and |
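The interaction between `force` and `FileExistsError` is not fully spelled out on this page, so the following is a sketch under the assumption that the error is raised when the target exists and `force` is left at `False`. `ensure_buildable` is a hypothetical helper, and deleting the old file before a forced rebuild is likewise an assumption.

```python
import tempfile
from pathlib import Path


def ensure_buildable(path: Path, *, force: bool = False) -> Path:
    """Hypothetical guard run before a build writes to path."""
    if path.exists():
        if not force:
            # Assumed condition for the documented FileExistsError.
            raise FileExistsError(f"{path} already exists; pass force=True to rebuild")
        path.unlink()  # assumed: a forced rebuild starts from an empty file
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "core.db"
    target.write_bytes(b"old")
    try:
        ensure_buildable(target)
        refused = False
    except FileExistsError:
        refused = True
    ensure_buildable(target, force=True)
    removed = not target.exists()
```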