Build

build

Defines the builder that compiles upstream sources into one SQLite database

Builder and Loader
  • Builder is a coordinator with a single purpose: turning a stream of DatabaseRow pairs into one finished database. It owns the connection, the build PRAGMAs, the loader and the post-load steps. It delegates parsing to the extractors, source paths to config, and downloading to the download module

  • Loader is the batched insert engine. It is kept as a separate collaborator so the builder reads as a recipe and the executemany buffering stays in one place

The Recipe
  • build_core runs the steps in order: Download -> Create Schema -> Load -> Build Index -> Write Metadata -> Optimize

  • The read model schema is created from the SQLAlchemy metadata, while the bulk load runs on a raw sqlite3 connection for speed

AUDIO_SOURCES = ('kanjialive', 'kanjialive_data') module-attribute

Extra source keys downloaded to build the optional audio pack database

CORE_SOURCES = ('jmdict', 'jmnedict', 'kanjidic2', 'kradzip', 'kanjivg', 'jmdict_furigana', 'tatoeba_jpn') module-attribute

All source keys downloaded for every core build, see SOURCES

Extra source keys downloaded only when Tatoeba sentence alignment is requested

Builder

Coordinates building one SQLite database from a stream of database rows

Scope

The builder owns only what is intrinsic to writing the database efficiently, which includes the following

  • The connection

  • The build PRAGMAs

  • A Loader

  • The post load steps

  • Any arguments which the Extractors might receive

Attributes:

Name Type Description
path Path

The database file being written

conn Connection

The open connection used for the bulk load

loader Loader

The batched insert helper bound to the connection

__enter__()

Enter the builder context

Returns:

Type Description
Builder

The builder itself

__exit__(*exc)

Close the connection on exit

Parameters:

Name Type Description Default
*exc object

The 3 exception arguments, ignored

()
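The context manager protocol described above can be sketched with a small stand-in. MiniBuilder and the demo table are hypothetical names used only for illustration; the sketch assumes nothing about the real Builder beyond what is documented here (a connection opened in __init__ and closed in __exit__).

```python
import sqlite3
import tempfile
from pathlib import Path


class MiniBuilder:
    """Illustrative stand-in showing only the context manager protocol."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.conn = sqlite3.connect(path)

    def __enter__(self) -> "MiniBuilder":
        return self

    def __exit__(self, *exc: object) -> None:
        # The exception triple is ignored and None is returned,
        # so any error raised inside the with block still propagates.
        self.conn.close()


path = Path(tempfile.mkdtemp()) / "demo.db"
with MiniBuilder(path) as builder:
    builder.conn.execute("CREATE TABLE t (x INTEGER)")

# After the block the connection is closed, so further use fails.
try:
    builder.conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
```

Closing in __exit__ means the connection is released whether or not the build steps succeed.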

__init__(path)

Open a connection to the target database and prepare it for loading

Parameters:

Name Type Description Default
path Path

The database file to write, with its schema already created

required

build_fts()

Create the gloss full text search index after the bulk load

finish_load()

Flush the remaining buffered rows and commit the bulk load

The streamed inserts accumulate in a single implicit transaction, so this commits them once at the end
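The single implicit transaction can be observed directly with the standard sqlite3 module. This is a minimal sketch under the module's default transaction handling; the gloss table and its columns are placeholders, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gloss (id INTEGER, text TEXT)")

# The first INSERT opens a transaction implicitly (sqlite3's default mode).
conn.executemany("INSERT INTO gloss VALUES (?, ?)", [(1, "dog"), (2, "cat")])
open_after_first_batch = conn.in_transaction

# Later batches join the same open transaction, nothing is committed yet.
conn.executemany("INSERT INTO gloss VALUES (?, ?)", [(3, "bird")])

# One commit at the end makes every batch durable at once.
conn.commit()
open_after_commit = conn.in_transaction
row_count = conn.execute("SELECT COUNT(*) FROM gloss").fetchone()[0]
```

Committing once avoids paying the per-transaction cost for every batch, which is why the flush and the commit are separate steps.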

optimize(*, analyze=True)

Restore a normal journal, update statistics and compact the file

Parameters:

Name Type Description Default
analyze bool

When True, run ANALYZE so the query planner has statistics, which the small audio pack does not need

True
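A possible shape for this step, using only the standard sqlite3 module, is sketched below. The exact statements are assumptions: the documentation only says that a normal journal is restored, statistics are updated and the file is compacted, so the journal_mode values and the demo schema here are illustrative.

```python
import sqlite3
import tempfile
from pathlib import Path


def optimize(conn: sqlite3.Connection, *, analyze: bool = True) -> None:
    """Sketch of the post-load cleanup; the exact statements are assumptions."""
    conn.commit()  # VACUUM cannot run inside an open transaction
    # Back to the default rollback journal; fetchall() finishes the statement.
    conn.execute("PRAGMA journal_mode = DELETE").fetchall()
    if analyze:
        conn.execute("ANALYZE")  # fills sqlite_stat1 for the query planner
    conn.execute("VACUUM")  # rewrites the file without free pages


path = Path(tempfile.mkdtemp()) / "demo.db"
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode = OFF").fetchall()  # assumed build-time setting
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, word TEXT)")
conn.execute("CREATE INDEX entry_word ON entry (word)")
conn.executemany("INSERT INTO entry (word) VALUES (?)", [("a",), ("b",)])
optimize(conn)

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
has_stats = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
).fetchone()[0]
conn.close()
```

Passing analyze=False skips only the ANALYZE statement, which matches the note that a small database such as the audio pack gains little from planner statistics.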

report_counts()

Print the number of rows inserted into each table

run(name, *args)

Stream one registered extractor through the loader

The extractor is looked up by name in EXTRACTORS and called with whatever positional arguments it declares, since each extractor owns its own signature

Parameters:

Name Type Description Default
name str

Registry key of the extractor to run

required
*args Any

Positional arguments forwarded to the extractor, such as the downloaded source paths it parses

()
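The lookup-and-forward pattern can be sketched as follows. Everything here is hypothetical: the registration decorator, the two example extractors, the (table, row) pair shape and the direct INSERT are stand-ins chosen for illustration, and the real run streams rows through the Loader rather than inserting one at a time.

```python
import sqlite3
from collections.abc import Callable, Iterator
from typing import Any

# Assumed row shape: (table name, mapping of column name to value).
Row = tuple[str, dict[str, Any]]

# Hypothetical registry: each extractor declares its own positional arguments.
EXTRACTORS: dict[str, Callable[..., Iterator[Row]]] = {}


def extractor(name: str):
    def register(fn):
        EXTRACTORS[name] = fn
        return fn
    return register


@extractor("words")
def extract_words(lines: list[str]) -> Iterator[Row]:
    for i, line in enumerate(lines):
        yield "word", {"id": i, "text": line.strip()}


@extractor("pairs")
def extract_pairs(left: list[str], right: list[str]) -> Iterator[Row]:
    for a, b in zip(left, right):
        yield "pair", {"a": a, "b": b}


def run(conn: sqlite3.Connection, name: str, *args: Any) -> int:
    """Look up one extractor by name and insert every row it yields."""
    inserted = 0
    for table, row in EXTRACTORS[name](*args):
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        sql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
        conn.execute(sql, tuple(row.values()))
        inserted += 1
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (id INTEGER, text TEXT)")
conn.execute("CREATE TABLE pair (a TEXT, b TEXT)")
n_words = run(conn, "words", ["neko\n", "inu\n"])
n_pairs = run(conn, "pairs", ["a", "b", "c"], ["x", "y", "z"])
conn.commit()
```

Because the arguments are forwarded untouched, an extractor that needs two source files and one that needs a single file share the same entry point.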

write_meta(paths, seconds)

Record build metadata into the db_meta table

Parameters:

Name Type Description Default
paths dict[str, Path]

Mapping of source key to downloaded file

required
seconds float

Wall clock build duration in seconds

required
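One way such metadata could be stored is as key/value rows. The layout below is an assumption made for illustration: the real db_meta columns and key names are not documented here, and the table is created inside the function only so the sketch is self-contained.

```python
import sqlite3
from pathlib import Path


def write_meta(
    conn: sqlite3.Connection, paths: dict[str, Path], seconds: float
) -> None:
    """Store one row per source file plus the build duration."""
    # Assumed layout: a simple key/value table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS db_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    rows = [(f"source:{key}", path.name) for key, path in paths.items()]
    rows.append(("build_seconds", f"{seconds:.1f}"))
    conn.executemany("INSERT OR REPLACE INTO db_meta VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
write_meta(conn, {"jmdict": Path("/tmp/cache/JMdict_e.gz")}, 12.5)
meta = dict(conn.execute("SELECT key, value FROM db_meta"))
```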

Loader

Batched multi-table insert helper for a sqlite3 connection

How It Works
  • Rows are buffered per table and flushed with executemany once a batch fills, which is far faster than inserting one row at a time

  • The insert statement for a table is derived from the keys of its first row, so every row for a table must carry the same keys
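The buffering described above can be sketched with a small stand-in class. MiniLoader is not the library's Loader: the class name, the demo table and the >= flush threshold are assumptions, and the _OR_IGNORE handling mentioned in the attribute list is left out.

```python
import sqlite3
from typing import Any


class MiniLoader:
    """Illustrative stand-in for a batched multi-table inserter."""

    def __init__(self, conn: sqlite3.Connection, *, batch: int = 1000) -> None:
        self.conn = conn
        self.batch = batch
        self.counts: dict[str, int] = {}
        self._buffers: dict[str, list[tuple[Any, ...]]] = {}
        self._columns: dict[str, list[str]] = {}
        self._statements: dict[str, str] = {}

    def add(self, table: str, row: dict[str, Any]) -> None:
        if table not in self._columns:
            # The first row for a table fixes the column order and the SQL.
            cols = list(row.keys())
            marks = ", ".join("?" for _ in cols)
            self._columns[table] = cols
            self._statements[table] = (
                f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
            )
            self._buffers[table] = []
            self.counts[table] = 0
        # A later row missing one of the recorded keys raises KeyError here.
        values = tuple(row[col] for col in self._columns[table])
        self._buffers[table].append(values)
        self.counts[table] += 1
        if len(self._buffers[table]) >= self.batch:
            self._flush(table)

    def _flush(self, table: str) -> None:
        rows = self._buffers[table]
        if rows:
            self.conn.executemany(self._statements[table], rows)
            rows.clear()

    def flush_all(self) -> None:
        for table in self._buffers:
            self._flush(table)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER, word TEXT)")
loader = MiniLoader(conn, batch=2)
for i, word in enumerate(["a", "b", "c"]):
    loader.add("entry", {"id": i, "word": word})
loader.flush_all()  # inserts the one row still buffered
conn.commit()
stored = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

With batch=2 the first two rows are written by the automatic flush and the third only by flush_all, which is why a final flush is needed before committing.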

Attributes:

Name Type Description
conn Connection

The open database connection

batch int

Number of rows to buffer before a flush

counts dict[str, int]

Mapping of table names to the number of rows this instance has inserted into each table through add

_buffers dict[str, list[tuple[Any, ...]]]

Mapping of table names to their individual row buffers. Each buffer accumulates rows added by add until its length is greater than batch, at which point the rows are inserted into the database with executemany

_columns dict[str, list[str]]

Mapping of table names to a list containing their column names. The order is derived from the add function's row argument (list(row.keys())) when it is first called on a table and is used to build a consistent insert statement for subsequent rows of that same table

_statements dict[str, str]

Mapping of table names to their INSERT SQL statement, derived from rows and _OR_IGNORE

__init__(conn, *, batch=_BATCH)

Create a loader bound to a connection

Parameters:

Name Type Description Default
conn Connection

The open database connection

required
batch int

Number of rows to buffer before a flush

_BATCH

add(table, row)

Buffer a single row for a table and flush when the batch is full

Parameters:

Name Type Description Default
table str

Target table name

required
row dict

Row whose keys are column names

required

flush_all()

Flush every buffered table

Used to insert the remaining rows still in the buffer

build_audio(*, force=False)

Build the optional audio pack from the Kanji alive media

The audio pack is a separate database holding only the audio table, which the read layer attaches when it is present. Keeping it out of the core database keeps the default download small

Parameters:

Name Type Description Default
force bool

When True, rebuild even if the pack already exists

False

Returns:

Type Description
Path

The path of the compiled audio pack

Raises:

Type Description
FileExistsError

If the pack already exists and force is False
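Attaching an optional second database when it is present can be done with SQLite's ATTACH DATABASE statement. The sketch below is an assumption about how a read layer might do this: open_read_connection, the audio_pack schema name, the file names and the table columns are all hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_db(path: Path, ddl: str) -> None:
    """Create a database file holding a single table."""
    conn = sqlite3.connect(path)
    conn.execute(ddl)
    conn.commit()
    conn.close()


def open_read_connection(core: Path, audio: Path) -> sqlite3.Connection:
    """Open the core database and attach the audio pack if the file exists."""
    conn = sqlite3.connect(core)
    if audio.exists():
        conn.execute("ATTACH DATABASE ? AS audio_pack", (str(audio),))
    return conn


workdir = Path(tempfile.mkdtemp())
core, audio = workdir / "core.db", workdir / "audio.db"
make_db(core, "CREATE TABLE entry (id INTEGER)")

# Without the pack only the main schema is visible.
plain = open_read_connection(core, audio)
names_without = [row[1] for row in plain.execute("PRAGMA database_list")]
plain.close()

# Once the pack file exists, its table is reachable under the attached name.
make_db(audio, "CREATE TABLE audio (kanji TEXT, file TEXT)")
full = open_read_connection(core, audio)
names_with = [row[1] for row in full.execute("PRAGMA database_list")]
audio_rows = full.execute("SELECT COUNT(*) FROM audio_pack.audio").fetchone()[0]
full.close()
```

Keeping the audio table in its own file means the core database works unchanged for users who never download the pack.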

build_core(*, force=False, include_links=True)

Build the kotobase core database from its sources

Parameters:

Name Type Description Default
force bool

When True, rebuild even if a database already exists

False
include_links bool

When True, download and align the Tatoeba links and English sentences, which is the heaviest part of the build

True

Returns:

Type Description
Path

The path of the compiled database

Raises:

Type Description
FileExistsError

If a database already exists and force is False
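The force guard can be sketched in isolation. prepare_target is a hypothetical helper, and deleting the old file before a rebuild is an assumption; only the FileExistsError behaviour comes from the documentation above.

```python
import tempfile
from pathlib import Path


def prepare_target(path: Path, *, force: bool = False) -> Path:
    """Refuse to overwrite an existing database unless force is set."""
    if path.exists():
        if not force:
            raise FileExistsError(
                f"{path} already exists, pass force=True to rebuild"
            )
        path.unlink()  # assumed: start the rebuild from an empty file
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


target = Path(tempfile.mkdtemp()) / "cache" / "core.db"
prepare_target(target)  # nothing there yet, so this succeeds
target.write_bytes(b"old build")

try:
    prepare_target(target)
    refused = False
except FileExistsError:
    refused = True

prepare_target(target, force=True)  # removes the stale file
exists_after_force = target.exists()
```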

compress(database=None)

Compress a built database to a zstandard archive for publishing

Parameters:

Name Type Description Default
database Path | None

Database to compress, or None for the default cache location

None

Returns:

Type Description
Path

The path of the written zstandard archive