Skip to main content
The database is the persistence layer for kiro-learn. Every event the Collector ingests and every memory record Extraction produces ends up here. It is a single SQLite file on disk, accessed through a small, well-defined interface. Nothing ever leaves your machine from this layer. The database file lives at:
~/.kiro-learn/kiro-learn.db
No server, no sidecar process, no network port. SQLite runs in-process inside the collector daemon and writes to a single file (plus its WAL and SHM siblings).

What it stores

Two things, in two tables:
TableWhat it storesWritten by
eventsRaw events exactly as the shims posted them (after cleaning)Ingestion pipeline
memory_recordsStructured memories extracted from batches of eventsExtraction worker
Plus two internal tables that support them:
TableWhat it stores
memory_records_ftsFull-text search index over memory record titles, summaries, and facts
_migrationsOne row per applied schema migration — the bookkeeping the runner uses to detect drift

The events table

Every incoming event lands here as one row. The schema mirrors the wire format — event ID, session ID, actor ID, namespace, kind, body, timestamps — with the body and source block stored as JSON blobs. The primary key is event_id (a ULID), so duplicate POSTs from a retrying shim are a no-op instead of an error. The table is indexed for the two common access patterns: listing events for a project in reverse chronological order, and looking up an event by its session or parent. Two timestamp columns track different notions of “when”:
  • valid_time — when the event actually happened (set by the shim).
  • transaction_time — when the database recorded it (stamped on insert).
These almost always agree, but the split exists so future versions can support bi-temporal queries — “what did memory look like as of last Tuesday?” — without a schema change.

The memory_records table

Each row is one memory record: a title, a summary, a list of facts, a list of concepts, a list of files touched, an observation type, and the IDs of the source events that produced it. The source_event_ids field links every memory back to the raw events it came from, so you can always trace a memory to its provenance. Alongside that content, each record carries an optional embedding column — a dense vector of the record, stored as a BLOB. It is NULL on fresh rows until the embedder fills it in, and is never exposed on the wire. The Embeddings section below covers the encoding. The primary key is record_id (mr_ + ULID). Unlike events, memory records are never supposed to collide — a duplicate record ID is an upstream bug, so the database rejects the insert loudly instead of silently ignoring it.

STRICT tables

Both user-facing tables are declared STRICT. This is SQLite’s opt-in type-safety mode. In a non-strict table, inserting a string into an INTEGER column silently succeeds and stores the string. In a strict table, the same insert fails loudly. For a persistence layer that serializes wire data validated by Zod on the way in, silent type coercion is a liability, not a feature. STRICT closes that gap — if a code change accidentally passes a number where a string is expected, the database catches it at insert time instead of letting a malformed row drift into storage. Memory records need to be searchable by content. Retrieval pulls relevant memories into the agent’s prompt, and the match quality directly shapes how useful the feature is. kiro-learn uses SQLite’s built-in FTS5 full-text search extension. When a memory record is inserted into memory_records, a companion row is also inserted into the memory_records_fts virtual table. Both inserts happen inside a single transaction — either both rows land or neither does. The FTS row indexes three fields:
  • title
  • summary
  • facts_text — the record’s facts joined into a single searchable blob
The record_id and namespace are carried as UNINDEXED columns so the search path can join back to the primary table and filter results by project without a separate lookup.

Tokenizer

The FTS5 index uses a three-stage tokenizer: porter unicode61 remove_diacritics 2. Each stage does one thing:
StageWhat it does
porterStems tokens to their root form — “migrating”, “migrates”, “migration” all collapse to “migrat”
unicode61Unicode-aware word segmentation and case folding
remove_diacritics 2Strips diacritical marks so “café” matches “cafe”
The combination means a search for “migrations” hits memories that mention “migrate”, “migrating”, or “migration” — behavior closer to what a developer types into a code search box than strict substring match.

Availability over ranking

Retrieval builds a sanitized FTS5 query by tokenizing the prompt and emitting an OR of quoted phrases. If FTS5 rejects the query anyway — its grammar has corners that are hard to fully account for — the storage layer catches the error and falls back to a LIKE '%query%' search against the original string. The LIKE path is slower and unranked (results come back by creation date), but it always works. The explicit tradeoff is availability over ranking quality: a worse-ordered result set is better than an error.

Embeddings

FTS5 catches records that share the user’s words. Hybrid retrieval also needs to catch records that share the user’s meaning — different words, same idea. For that, every memory record gets a dense semantic vector stored alongside its content.

The embedding column

The vector lives directly on the memory record. Migration 0005_memory_record_embedding adds a single column:
ALTER TABLE memory_records ADD COLUMN embedding BLOB DEFAULT NULL;
That is the whole schema change. No new tables, no new indexes, no FTS5 changes. The column is nullable and defaults to NULL, which is the truthful state for a record whose vector has not yet been computed — either because it predates this migration or because the embedder was busy when it was written. Readers that need the vector must handle NULL explicitly; hybrid search treats it as “no vector contribution” and falls through to the lexical ranking for that record. Storing the vector inline on the row (rather than in a sidecar table) means the record and its embedding live or die together. A record delete takes the vector with it. There is no second table to keep consistent.

BLOB encoding

The vector is a 384-dimensional Float32Array produced by the all-MiniLM-L6-v2 sentence-transformer model. It is encoded as exactly 1,536 bytes — 384 float32 values × 4 bytes each — written in little-endian IEEE-754 order.
byte 0–3    : f32 component 0   (little-endian)
byte 4–7    : f32 component 1
...
byte 1532–1535 : f32 component 383
The encoder and decoder go through DataView.setFloat32(offset, value, /* littleEndian */ true) and DataView.getFloat32(offset, true) explicitly rather than relying on the host’s byte order. Node runs little-endian on every supported platform today, but making it explicit in code means the format survives a platform we have not tested on, and it makes the invariant unambiguous: the BLOB is portable bytes, not a host-dependent memory dump. The round-trip preserves every IEEE-754 bit pattern — NaN, ±Infinity, ±0, subnormals all survive — which a property test verifies by comparing the underlying Uint32Array views after a re-decode. The decoder also validates length: a BLOB that is not exactly 1,536 bytes raises an error annotated with the offending record_id, so a corrupted row is visible at the record level rather than silently returning bogus similarity scores. No sqlite-vec, no virtual table, no ANN index. Cosine similarity is computed in JavaScript over the in-memory per-namespace vector index; SQLite’s job is to durably store the bytes, nothing more. See the retrieval page for how the math runs at query time.

Migration is additive

ALTER TABLE … ADD COLUMN on SQLite is an O(1) metadata operation when the column is nullable with a NULL default. Migration 0005_memory_record_embedding does not rewrite a single existing row — every pre-existing record ends up with embedding = NULL and keeps its rowid, its content, and every other column byte-identical. The FTS5 companion table is not touched. Populating the embeddings on existing records is a separate, asynchronous concern handled by the backfill worker, which scans for embedding IS NULL and fills rows in batches between ingestions. The migration does not wait for that to complete — the daemon is usable the moment the schema is current, and hybrid search falls back to lexical-only for any record whose vector has not arrived yet.

Migrations

The schema evolves over time. New features need new columns, new indexes, or new constraints. Rather than handwriting upgrade scripts, kiro-learn uses a small migration runner that applies code-embedded DDL in order and tracks what has been applied. Rules the runner enforces:
  1. Append-only. New migrations are added at the end with the next integer version. Released migrations are never reordered, renamed, or edited in place.
  2. Transactional. Each migration runs inside a BEGIN/COMMIT. If the DDL throws, the transaction rolls back and _migrations is left untouched — the next run tries again from the same point.
  3. Idempotent. Running the runner twice with the same list is a no-op. After every migration is applied, subsequent opens of the database do no DDL work.
  4. Drift-fatal. If _migrations records a version whose name disagrees with the code, the runner throws MigrationDriftError and refuses to proceed. This catches the case where migrations were renamed or reordered after being applied.
  5. Forward-only. There are no down-migrations. A broken migration is rolled back at the transaction level by SQLite itself; the developer fixes the DDL and re-runs.

The migration history

The schema has evolved through five migrations so far:
VersionNameWhat it does
0001initCreates events, memory_records, memory_records_fts, and _migrations, plus the initial indexes
0002xml_extraction_fieldsAdds concepts_json, files_touched_json, and observation_type columns to memory_records for the XML extraction pipeline
0003project_pathAdds a nullable project_path column to events plus a compound index on (namespace, project_path) for project listing
0004session_summary_typeWidens the observation_type CHECK constraint to include 'session_summary' (needed because SQLite has no ALTER CONSTRAINT, so the table is rebuilt)
0005memory_record_embeddingAdds a nullable embedding BLOB column to memory_records for local semantic vectors. Additive — no existing row is rewritten.
New migrations appear at the end of the list. The five existing ones are frozen.

Privacy

Everything in the database is local. There is no remote replica, no cloud sync, no background upload. The only path out of your machine is extraction, where batches of events are sent to Amazon Bedrock via kiro-cli for LLM processing — and even that traffic goes out of your AWS account, using your credentials.

<private> is stripped before it reaches the database

Users can wrap sensitive content in <private>...</private> tags, and those spans are replaced with [REDACTED] before storage. The redaction happens in the collector’s cleaning pipeline, not in the database. By the time an event arrives at putEvent, the private spans are already gone. This is a deliberate layering choice. Storage is a sink: if scrubbing lived here, a caller that forgot to scrub would still land scrubbed data in the database but would never see the private spans again and could build mistaken upstream assumptions (“storage protects us”). Centralizing the scrub in the pipeline puts the guarantee in one place and makes it visible. A guard test enforces that the storage layer never references <private> at all.

File-system permissions

The installer creates ~/.kiro-learn/ with mode 0700 — owner-read/write/execute only. Other users on a shared machine cannot read your memory database. The storage layer does not widen these permissions; it inherits whatever the parent directory grants.

The storage interface

Callers do not import from the SQLite module directly. The collector’s pipeline code, retrieval code, and read API all see a single pluggable interface:
interface StorageBackend {
  putEvent(event: KiroMemEvent): Promise<void>;
  getEventById(eventId: string): Promise<KiroMemEvent | null>;
  putMemoryRecord(record: MemoryRecord): Promise<void>;
  deleteMemoryRecord(recordIds: readonly string[]): Promise<void>;
  withTransaction<T>(fn: (tx: StorageTransaction) => Promise<T> | T): Promise<T>;
  searchMemoryRecords(params: SearchParams): Promise<MemoryRecord[]>;
  close(): Promise<void>;

  // Read API for the viewer UI
  getStats(namespace?: string): Promise<StatsResult>;
  listProjects(): Promise<ProjectInfo[]>;
  listMemoryRecords(params: { namespace?; limit; offset }): Promise<{ items; total }>;
  listEvents(params: { namespace?; limit }): Promise<{ items; total }>;
}

interface StorageTransaction {
  putMemoryRecord(record: MemoryRecord): Promise<void>;
  putEmbedding(recordId: string, embedding: Float32Array): Promise<void>;
  deleteMemoryRecord(recordIds: readonly string[]): Promise<void>;
}
The SQLite implementation is one backend behind that interface. Only the collector’s bootstrap code knows it exists — everything else (pipeline stages, retrieval assembler, read API handlers) receives a StorageBackend as a dependency and never asks which backend it is. A guard test enforces this: any file outside src/collector/storage/sqlite/ that tries to import from that directory fails CI. The point of this split is future-proofing without disruption. Later versions will ship alternative backends (Postgres with pgvector, Bedrock AgentCore Memory) and swapping them in becomes a matter of choosing a different opener at startup. The pipeline, retrieval, and UI code changes not at all.

Method contracts

The interface is small because the contracts are strict. A few worth calling out:
  • putEvent is idempotent. The insert uses INSERT OR IGNORE on the event ID primary key, so a retry from a confused shim is a silent no-op. The original transaction_time is preserved on collision — a second insert does not overwrite it.
  • putMemoryRecord rejects collisions. Memory record IDs are generated fresh on every extraction. A collision is an upstream bug, so the database surfaces it as an error instead of silently dropping the write.
  • deleteMemoryRecord cascades and is idempotent. See Merge deletion is destructive for the full cascade semantics — in short, one call drops the row from memory_records, its embedding from embeddings, and its FTS5 entry from memory_records_fts, all atomically. Unknown record_ids are silent no-ops so retries after a partial crash do not raise.
  • withTransaction runs one atomic unit of work. The callback receives a restricted StorageTransaction handle exposing only putMemoryRecord, putEmbedding, and deleteMemoryRecord. All three calls commit together or roll back together. Nested transactions are not supported. This is the seam the extraction worker’s push phase uses to commit a summary record and delete its merged originals in one go.
  • getEventById returns null for misses. Not an exception. The caller decides whether a missing event is a problem.
  • searchMemoryRecords is availability-biased. When FTS5 rejects a query, it falls back to LIKE instead of failing. The cost is worse ranking on the fallback path; the benefit is that retrieval never errors out on a weird prompt.
  • close is idempotent. A second close is a silent no-op, not an error.
Every method is async even though better-sqlite3 is synchronous. The async surface keeps the interface compatible with future backends that genuinely are async (remote databases, HTTP APIs) without forcing callers to change shape.

Prepared statements, parameterized queries

Every SQL statement is prepared once when the backend opens and reused for the life of the handle. Every user-controlled value — event body, FTS5 query string, namespace prefix — is bound positionally with ? placeholders. SQL and data are never concatenated. This is the primary defense against SQL injection. It also happens to be faster: preparing a statement once and binding values on every call avoids re-parsing and re-planning the query on each insert or search.

Merge deletion is destructive

The extraction worker is the only caller that deletes memory records. When the judge model rules that a cluster of candidates and some subset of existing neighbors describe the same thing, the worker commits a new summary record and deletes the merged originals in the same transaction. There is no soft delete, no tombstone, no deleted_at column. The merged rows are gone. The cascade is explicit. deleteMemoryRecord(recordIds) touches three things for every id, all inside one db.transaction(...) block:
for each record_id in recordIds:
  DELETE FROM memory_records_fts WHERE record_id = ?
  UPDATE memory_records SET embedding = NULL WHERE record_id = ?
  DELETE FROM memory_records WHERE record_id = ?
The embedding is a BLOB column on memory_records itself (no separate embeddings table in the v1 schema), so “deleting the embedding” is an UPDATE that nulls the column before the row itself is dropped. Ordering it this way keeps the semantics “child state before parent” — the NULL transition is observable inside the transaction, and if a future schema splits embeddings into a separate table the same ordering still holds. All three statements are explicit prepared statements. SQLite’s FTS5 here is a plain virtual table (no content-linked INSERT … INTO … (content) rowid coupling), and there are no ON DELETE triggers in any migration — the FTS5 companion row is managed in application code the same way it is written in putMemoryRecord. A missing cascade statement would leak a ghost FTS row; the property tests catch that by asserting the FTS index is empty for every deleted id. Because DELETE (and UPDATE) on a nonexistent row affects zero rows and does not raise, deleteMemoryRecord is idempotent. An id that was deleted in an earlier attempt is a silent no-op on retry, which is what lets a partially-crashed reconciliation run safely retry the same merge on the next ingestion trigger.
Merge commits have no undo. A <merge> decision deletes the original rows outright. Nothing in the database remembers that record_id mr_abc123 used to exist, and nothing in the retrieval path can surface a merged-away record. The extraction worker tunes its intra-batch and neighbor similarity thresholds conservatively by default — a more lenient configuration paired with a lenient judge can silently collapse rows that should have stayed distinct, and the only recovery path is to re-learn from fresh events.

Reconciliation has no schema footprint

Reconciliation has no schema footprint — no columns, no indexes, no tables, no triggers. It operates on the memory_records schema as it stands. The storage surface exposes two methods that reconciliation relies on:
  • deleteMemoryRecord(recordIds) on the StorageBackend interface (and its SQLite implementation).
  • withTransaction(fn) — the restricted handle the reconciler uses to commit a summary-plus-deletes atomically.
Reconciliation only touches a row when it comes up as a neighbor of a new candidate cluster and the judge decides to merge it away — no proactive rewrites or migrations of historical data.

Key design decisions

SQLite first. SQLite is the simplest thing that covers the v1 requirements: durable local storage, full-text search, multi-process safety (WAL mode), and zero operational overhead. No server to run, no credentials to manage, no port to expose. The trade-off is that it does not scale to multiple machines — which is fine, because kiro-learn is explicitly a local tool until a future version decides otherwise. One file. Everything lives in ~/.kiro-learn/kiro-learn.db. Backing up your memory is cp. Starting fresh is rm. There is no migration tool, no schema dump format, no cross-file consistency to worry about. STRICT tables everywhere. SQLite’s default laxity about types is convenient for scripting and hostile for a persistence layer. Opting in to STRICT catches type errors at the boundary instead of letting them propagate. FTS5 over building lexical search by hand. FTS5 is production-tested, maintained by the SQLite project, and handles tokenization, stemming, and ranking in one extension. Rebuilding any of that would be wasted effort. Semantic search is layered on top as an independent ranking (hybrid retrieval) rather than replacing FTS5 — Bedrock Knowledge Bases uses the same hybrid shape, so the algorithm is portable. Migrations are code, not files. Each migration’s DDL is a string constant inside a TypeScript module, compiled into the package. There is nothing to read from disk at runtime and nothing extra to ship. The compiled dist/ directory is self-contained. Interface-first. The StorageBackend interface is the contract every backend satisfies. Writing the SQLite backend against the interface (rather than exposing it directly) forces the contract to stay small and backend-agnostic. Every capability that is hard to express in the interface is a capability a future backend would struggle to provide. Privacy is a pipeline property, not a storage property. Storage stores what it is given. The pipeline is responsible for redaction. Putting the scrub in the right layer keeps the guarantee auditable and prevents a “defense in depth” that silently swallows bugs.

Code pointers

  • src/collector/storage/index.ts — the StorageBackend interface and method contracts
  • src/collector/storage/sqlite/index.ts — the SQLite backend implementation
  • src/collector/storage/sqlite/statements.ts — every prepared SQL statement in one file, including the three DELETE statements that implement the merge cascade
  • src/collector/storage/sqlite/fts5.ts — FTS5 query sanitization and LIKE escaping
  • src/collector/storage/sqlite/migrations/runner.ts — migration runner with drift detection
  • src/collector/storage/sqlite/migrations/0001_init.ts — initial schema
  • src/collector/storage/sqlite/migrations/0002_xml_extraction_fields.ts — extraction fields
  • src/collector/storage/sqlite/migrations/0003_project_path.ts — project path column
  • src/collector/storage/sqlite/migrations/0004_session_summary_type.ts — widened observation type constraint
  • src/collector/storage/sqlite/migrations/0005_memory_record_embedding.ts — embedding BLOB column
  • src/collector/embedding/blob.ts — Float32Array ↔ 1,536-byte BLOB encode/decode

Viewer

The dashboard that reads from the database via the read API

Retrieval

How memory records are searched via hybrid FTS5 + vector

Extraction

The worker that drives putMemoryRecord and deleteMemoryRecord

Compaction

What happens when buffers grow faster than extraction drains them

Collector

The daemon that writes events to the database