keplordb / architecture · v0.1.0 · typed derive · intern cache
system design · sharded WAL · append-only · time-ordered · bitmap + zone indexes · intern cache

A single write path, a single read path, and three index structures in the middle.

KeplorDB is three moving parts — an in-memory columnar buffer, a write-ahead log, and a set of immutable column-oriented segment files. There are no background threads and no compactions. Segment reads are lock-free via ArcSwap. Rollups are deferred to rotation time. Aggregation across segments is parallelised with rayon. Status bitmap indexes and zone maps enable row- and chunk-level pruning before any column data is touched.
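As a rough illustration of aggregation fanned out across segments, the sketch below computes one partial result per segment and merges them. It is an assumption-laden stand-in: `Partial`, `aggregate_segment` and `aggregate_all` are hypothetical names, each segment is modelled as a single `u32` counter column, and `std::thread::scope` is used in place of rayon so the example needs no external crate.

```rust
use std::thread;

/// Hypothetical per-segment partial: row count and sum of one counter column.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Partial {
    pub rows: u64,
    pub sum: u64,
}

impl Partial {
    /// Combines two partials; order does not matter.
    fn merge(self, other: Partial) -> Partial {
        Partial {
            rows: self.rows + other.rows,
            sum: self.sum + other.sum,
        }
    }
}

/// Aggregates one segment's counter column.
fn aggregate_segment(counter: &[u32]) -> Partial {
    Partial {
        rows: counter.len() as u64,
        sum: counter.iter().map(|&v| u64::from(v)).sum(),
    }
}

/// Spawns one scoped thread per segment, then merges the partials.
/// Assumption: a rayon pool would bound the worker count instead.
pub fn aggregate_all(segments: &[Vec<u32>]) -> Partial {
    thread::scope(|s| {
        let handles: Vec<_> = segments
            .iter()
            .map(|seg| s.spawn(move || aggregate_segment(seg)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("segment worker panicked"))
            .fold(Partial::default(), Partial::merge)
    })
}
```

Because segments are immutable, each worker only needs a shared reference, and the merge step is the only place results meet.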

[ diagram 01 ]

The full system — write path down, read path up, three indexes in between.

fig. 1 · engine topology
KeplorDB engine topology (diagram labels)
application: your service → append( &event ) · batch producer → append_batch( &[…] ) · dashboard → query_recent · aggregate · ops → gc( cutoff_ts )
engine · in-memory: WAL writer (len-prefixed · reusable buf · fsync every N events, configurable) · columnar buffer (in-memory · 20 columns · arena-interned · ts_ns, metric, status, dim[0], dim[2], counters, ⋯) · query engine (simd · bloom · bitmap · zone) · manifest (in-memory · ArcSwap)
disk: active.wal (append-only · len-prefixed) · 0001.kseg (min 02:00 · max 02:14 · 100k) · 0002.kseg (min 02:14 · max 02:29 · 100k) · 0003.kseg (min 02:29 · max 02:43 · 43k, open) · tombstones.txt (deleted event ids)
edges: wal · buffer · rotate · live · fsync · rotate → .kseg · mmap · unlink
legend: write path · read path · rotate / gc (dashed) · application / engine node · query engine (read kernels) · immutable segment file · durability path (fsync)
[ diagram 02 ]

Write pipeline — five steps, one calling thread, indexes built at rotation.

01 · intern · ~ 40 ns
Dims resolved through a hashbrown arena. No String allocations.

02 · push columns · ~ 120 ns
Twenty column buffers receive unchecked vec pushes. Bounds verified at open.

03 · frame WAL · ~ 240 ns
Serialise to a reusable buffer, length-prefixed, then write().

04 · fsync / 64 · ~ amortised
Every 64th event durably syncs the WAL. Configurable per engine.

05 · rotate? · ~ lazy
When buffer hits wal_max_events, flush columns to a new .kseg. Build status bitmap + zone maps.
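The intern, frame and fsync steps above can be sketched as follows. This is a minimal sketch, not the crate's code: `Interner`, `Durable` and `WalWriter` are hypothetical names, a std `HashMap` stands in for the hashbrown arena, the length prefix is assumed to be a little-endian `u32`, and the interner assumes fewer than 65,536 distinct values.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical interner: maps each distinct dim string to a small id.
/// Assumption: std HashMap replaces the hashbrown arena named in the text.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u16>,
    names: Vec<String>,
}

impl Interner {
    /// Returns the id for `s`; allocates only the first time a value is seen.
    pub fn intern(&mut self, s: &str) -> u16 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.names.len() as u16; // assumes < 65,536 distinct values
        self.names.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    /// Looks an id back up, e.g. when rendering query results.
    pub fn resolve(&self, id: u16) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}

/// Anything the WAL can write to and make durable.
pub trait Durable: Write {
    fn sync(&mut self) -> io::Result<()>;
}

impl Durable for std::fs::File {
    fn sync(&mut self) -> io::Result<()> {
        self.sync_data()
    }
}

pub struct WalWriter<W: Durable> {
    out: W,
    frame: Vec<u8>, // reused across appends to avoid per-event allocation
    sync_every: u32,
    unsynced: u32,
}

impl<W: Durable> WalWriter<W> {
    pub fn new(out: W, sync_every: u32) -> Self {
        Self { out, frame: Vec::new(), sync_every: sync_every.max(1), unsynced: 0 }
    }

    /// Frames `payload` as [u32 LE length][bytes], writes it, and syncs
    /// once every `sync_every` appends.
    pub fn append(&mut self, payload: &[u8]) -> io::Result<()> {
        self.frame.clear();
        self.frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        self.frame.extend_from_slice(payload);
        self.out.write_all(&self.frame)?;
        self.unsynced += 1;
        if self.unsynced >= self.sync_every {
            self.out.sync()?;
            self.unsynced = 0;
        }
        Ok(())
    }

    pub fn into_inner(self) -> W {
        self.out
    }
}
```

With `sync_every` set to 64 this matches the cadence in step 04; events appended since the last sync are the ones at risk if the process dies.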
[ diagram 03 ]

Segment file anatomy — columns first, indexes last.

region          size       contents
header          256 B      magic · version · N · min/max ts · col offsets · bitmap/zone offsets
i64 block       variable   zstd · delta-encoded ts_ns + metric · sorted
u32 cols        variable   latency_ms · latency_detail · counters[0..4]
status · flags  4·N B      u16 × N · u16 × N · bitmap idx
dim indices     ≈10·N B    u16/u8 × N × 5 · zone-mapped
ext indices     8·N B      u16 × N × 4 · id + labels
bloom filter    128 B      1024 bits · primary dim
status bitmap   variable   zstd · per-value compressed bitmaps · O(1) lookup
zone maps       variable   min/max per 256-row chunk × D dims · chunk pruning
intern table    variable   zstd compressed · lazy-load · skippable
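To make the zone-map region concrete, here is a minimal sketch of building min/max pairs per 256-row chunk for one dim column and using them to skip chunks. `Zone`, `build_zone_map` and `candidate_ranges` are hypothetical names, and the on-disk encoding is not modelled; only the 256-row chunk size comes from the layout above.

```rust
use std::ops::Range;

/// Rows per zone-map chunk, taken from the "256-row chunk" in the layout.
pub const CHUNK_ROWS: usize = 256;

/// Min and max interned dim id seen inside one chunk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Zone {
    pub min: u16,
    pub max: u16,
}

/// Builds one zone per chunk of a dim-index column.
pub fn build_zone_map(dim: &[u16]) -> Vec<Zone> {
    dim.chunks(CHUNK_ROWS)
        .map(|c| Zone {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *c.iter().min().unwrap(),
            max: *c.iter().max().unwrap(),
        })
        .collect()
}

/// Row ranges whose chunk could contain `want`; all other chunks are
/// skipped without reading any column data.
pub fn candidate_ranges(zones: &[Zone], rows: usize, want: u16) -> Vec<Range<usize>> {
    zones
        .iter()
        .enumerate()
        .filter(|(_, z)| z.min <= want && want <= z.max)
        .map(|(i, _)| i * CHUNK_ROWS..((i + 1) * CHUNK_ROWS).min(rows))
        .collect()
}
```

A zone match only says the value may be present in the chunk, so surviving ranges still need a row-level scan.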
[ diagram 04 ]

Module map — what the crate actually contains.

src/write · ingest

  • mod.rs · buffer + rotation
  • wal.rs · len-prefixed log, sharded
  • recovery.rs · replay on open

src/storage · on disk

  • segment.rs · kseg writer + reader
  • mmap.rs · read-only mappings
  • intern.rs · arena table
  • bloom.rs · bit filter
  • compress.rs · zstd bridge

src/read · query

  • query.rs · recent + aggregate
  • rollup.rs · BTreeMap per-day
  • simd.rs · avx2 + prefetch kernels

src/ops · admin

  • gc.rs · cutoff unlink, in-memory manifest
  • archive.rs · cold move
  • meta.rs · catalog
[ notes ]

The four invariants that make this simple. Plus three indexes.

i · write

One writer per engine. The write path holds only a Mutex on the WAL — rollups are deferred to rotation. Reads are lock-free via ArcSwap — concurrent with writes, zero contention.
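The snapshot pattern behind this invariant can be sketched with std alone. Note the substitution: `RwLock<Arc<…>>` stands in for ArcSwap here, so this version is not lock-free; the lock is held only long enough to clone or replace one pointer. `SegmentMeta` and `Manifest` are hypothetical names, and a single writer is assumed.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical manifest entry for one closed segment.
#[derive(Debug, Clone, PartialEq)]
pub struct SegmentMeta {
    pub id: u32,
    pub min_ts: i64,
    pub max_ts: i64,
}

/// Holds the current segment list behind a swappable pointer.
/// Assumption: arc_swap::ArcSwap would replace the RwLock in a real build.
pub struct Manifest {
    current: RwLock<Arc<Vec<SegmentMeta>>>,
}

impl Manifest {
    pub fn new() -> Self {
        Self { current: RwLock::new(Arc::new(Vec::new())) }
    }

    /// Reader side: take a snapshot. It stays valid and unchanged even if
    /// the writer publishes a newer list afterwards.
    pub fn snapshot(&self) -> Arc<Vec<SegmentMeta>> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Writer side (single writer assumed): copy the list, add the new
    /// segment, then publish the new list in one pointer replacement.
    pub fn publish(&self, seg: SegmentMeta) {
        let mut next = (*self.snapshot()).clone();
        next.push(seg);
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

Readers that took a snapshot before a rotation keep working against the old list; the old `Vec` is freed when the last such reader drops its `Arc`.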

ii · time

Events arrive in non-decreasing ts_ns order, so each segment's ts column is already sorted and readers can binary-search it. The column is stored delta-encoded and zstd-compressed.
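A small sketch of what the time invariant buys: delta encoding of a sorted timestamp column and a binary-searched time range. The function names are hypothetical, the half-open range convention is an assumption, and the zstd pass is left out.

```rust
use std::ops::Range;

/// Delta-encodes a non-decreasing column: the first value, then the gaps.
pub fn delta_encode(ts: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    ts.iter()
        .map(|&t| {
            let d = t - prev;
            prev = t;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running sum restores the timestamps.
pub fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

/// Rows with `from <= ts < to`, found by two binary searches on the
/// sorted column. Returns an empty range when `to <= from`.
pub fn time_range(ts: &[i64], from: i64, to: i64) -> Range<usize> {
    let lo = ts.partition_point(|&t| t < from);
    let hi = ts.partition_point(|&t| t < to);
    lo..hi.max(lo)
}
```

Since arrivals are non-decreasing, every delta after the first is zero or positive and usually small, which is what makes the column compress well.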

iii · immutable

A closed .kseg is never modified. All updates are tombstones in a side index. Status bitmap + zone maps built at rotation time.
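Because a segment never changes after rotation, its status index can be built once and never maintained. The sketch below builds one uncompressed bitset per distinct status value; `StatusBitmap` and its methods are hypothetical names, and the zstd compression of each bitmap is not modelled.

```rust
use std::collections::HashMap;

/// One bitset per distinct status value; bit i is set when row i of the
/// segment has that status.
pub struct StatusBitmap {
    rows: usize,
    by_value: HashMap<u16, Vec<u64>>,
}

impl StatusBitmap {
    /// Built once, at rotation, from the segment's status column.
    pub fn build(status: &[u16]) -> Self {
        let words = (status.len() + 63) / 64;
        let mut by_value: HashMap<u16, Vec<u64>> = HashMap::new();
        for (row, &s) in status.iter().enumerate() {
            let bits = by_value.entry(s).or_insert_with(|| vec![0u64; words]);
            bits[row / 64] |= 1u64 << (row % 64);
        }
        Self { rows: status.len(), by_value }
    }

    /// Row indices with the given status, in ascending order.
    pub fn rows_with(&self, value: u16) -> Vec<usize> {
        match self.by_value.get(&value) {
            Some(bits) => (0..self.rows)
                .filter(|&r| (bits[r / 64] & (1u64 << (r % 64))) != 0)
                .collect(),
            None => Vec::new(),
        }
    }

    /// Number of rows with the given status, without touching the column.
    pub fn count(&self, value: u16) -> u32 {
        self.by_value
            .get(&value)
            .map_or(0, |bits| bits.iter().map(|w| w.count_ones()).sum())
    }
}
```

A count-by-status query can be answered from the bitmaps alone, which is the row-level pruning the overview refers to.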

iv · partition by time

Retention is unlink(). GC works from the in-memory manifest, so deciding what to drop needs zero disk reads. There is no compaction because there is nothing to compact into.
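Retention by unlink can be sketched as a pure decision over the manifest followed by file removal. `SegmentEntry`, `plan_gc` and `gc` are hypothetical names, and the rule that a segment expires when its `max_ts` is below the cutoff is an assumption about what `gc( cutoff_ts )` means.

```rust
use std::fs;
use std::io;

/// Hypothetical manifest entry: file path plus the segment's time bounds.
#[derive(Debug, Clone, PartialEq)]
pub struct SegmentEntry {
    pub path: String,
    pub min_ts: i64,
    pub max_ts: i64,
}

/// Splits the manifest into (kept, expired) using only in-memory bounds.
/// Assumption: a segment expires when all of its rows precede the cutoff.
pub fn plan_gc(
    manifest: Vec<SegmentEntry>,
    cutoff_ts: i64,
) -> (Vec<SegmentEntry>, Vec<SegmentEntry>) {
    manifest.into_iter().partition(|s| s.max_ts >= cutoff_ts)
}

/// Unlinks every expired segment file and returns the surviving manifest.
pub fn gc(manifest: Vec<SegmentEntry>, cutoff_ts: i64) -> io::Result<Vec<SegmentEntry>> {
    let (kept, expired) = plan_gc(manifest, cutoff_ts);
    for seg in &expired {
        fs::remove_file(&seg.path)?;
    }
    Ok(kept)
}
```

A segment that straddles the cutoff is kept whole, which is consistent with closed segments never being rewritten.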