keplordb / architecture · v0.1.0
system design · one writer · append-only · time-ordered

A single write path, a single read path, and the disk in the middle.

KeplorDB is three moving parts — an in-memory columnar buffer, a write-ahead log, and a set of immutable column-oriented segment files. There are no background threads and no compactions. Segment mmap handles are cached and incrementally invalidated on rotation and GC. Aggregation across segments is parallelised with rayon.
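The cross-segment aggregation fan-out can be sketched with `std::thread::scope`; the real engine uses rayon, and the per-segment sum here stands in for the actual read kernels:

```rust
// One task per segment, results combined on the calling thread.
// Illustrative only: segments are modelled as Vec<i64> metric columns.
fn aggregate(segments: &[Vec<i64>]) -> i64 {
    std::thread::scope(|s| {
        let handles: Vec<_> = segments
            .iter()
            .map(|seg| s.spawn(move || seg.iter().sum::<i64>())) // fan out
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<i64>() // fan in
    })
}
```

Because segments are immutable, each task reads its column without synchronisation.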

[ diagram 01 ]

The full system — write path down, read path up.

fig. 1 · engine topology
application: your service · append( &event ) · batch producer · append_batch( &[…] ) · dashboard · query_recent / aggregate · ops · gc( cutoff_ts )
engine (in-memory): WAL writer · len-prefixed, reusable buf, fsync every N events (configurable) · columnar buffer · 20 columns, arena-interned (ts_ns, metric, status, dim[0], dim[2], payload, ⋯) · query engine · simd, zerocopy, bloom
disk: active.wal · append-only, len-prefixed · rotate → .kseg, then unlink · 0001.kseg min 02:00 · max 02:14 · 100k · 0002.kseg min 02:14 · max 02:29 · 100k · 0003.kseg min 02:29 · max 02:43 · 43k (open) · tombstones.txt · deleted event ids · manifest · rebuilt from .kseg headers
legend: solid · write path / read path · dashed · rotate, gc · segments are immutable, mmapped, read by the query kernels · fsync marks the durability path
[ diagram 02 ]

Write pipeline — five steps, one calling thread.

01 · intern · ~ 40 ns
Dims resolved through a hashbrown arena. No String allocations.

02 · push columns · ~ 120 ns
Twenty column buffers receive unchecked vec pushes. Bounds verified at open.

03 · frame WAL · ~ 240 ns
Serialised into a reusable buffer, length-prefixed, then write().

04 · fsync / 64 · amortised
Every 64th event durably syncs the WAL. Configurable per engine.

05 · rotate? · lazy
When the buffer hits wal_max_events, flush the columns to a new .kseg.
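The five steps above can be sketched in memory. Every name here (`Engine`, `append`, `wal_max_events`) is illustrative, the WAL file is stood in by a `Vec<u8>`, and the stand-in `HashMap` allocates where the real hashbrown arena would not:

```rust
const FSYNC_EVERY: u32 = 64;

struct Engine {
    ts_ns: Vec<i64>,
    metric: Vec<i64>,
    dim0: Vec<u16>,
    intern: std::collections::HashMap<String, u16>, // stand-in for the arena
    wal_frame: Vec<u8>, // reusable, length-prefixed frame buffer
    wal: Vec<u8>,       // stand-in for the WAL file
    events_since_fsync: u32,
    wal_max_events: usize,
    rotations: u32,
}

impl Engine {
    fn append(&mut self, ts_ns: i64, metric: i64, dim0: &str) {
        // 01 · intern: resolve the dim to a small id
        let next = self.intern.len() as u16;
        let id = *self.intern.entry(dim0.to_owned()).or_insert(next);

        // 02 · push columns (checked pushes here; the engine pre-verifies bounds)
        self.ts_ns.push(ts_ns);
        self.metric.push(metric);
        self.dim0.push(id);

        // 03 · frame WAL: serialise into the reusable buffer, length-prefixed
        self.wal_frame.clear();
        self.wal_frame.extend_from_slice(&ts_ns.to_le_bytes());
        self.wal_frame.extend_from_slice(&metric.to_le_bytes());
        self.wal_frame.extend_from_slice(&id.to_le_bytes());
        let len = self.wal_frame.len() as u32;
        self.wal.extend_from_slice(&len.to_le_bytes());
        self.wal.extend_from_slice(&self.wal_frame);

        // 04 · fsync / 64: every Nth event the real engine calls sync_data()
        self.events_since_fsync += 1;
        if self.events_since_fsync == FSYNC_EVERY {
            self.events_since_fsync = 0;
        }

        // 05 · rotate?: flush columns to a new .kseg when the buffer is full
        if self.ts_ns.len() >= self.wal_max_events {
            self.rotations += 1;
            self.ts_ns.clear();
            self.metric.clear();
            self.dim0.clear();
            self.wal.clear(); // the real engine unlinks active.wal after flush
        }
    }
}
```

All five steps run on the calling thread; there is no hand-off and no queue.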
[ diagram 03 ]

Segment file anatomy — columns first, metadata last.

section         contents                                        size
header          magic · version · N · min/max ts · col offsets  256 B
ts_ns           i64 × N · sorted                                8·N B
metric          i64 × N · simd sum                              8·N B
counters        u32 × N × 5                                     20·N B
latencies       u32 × N × 2                                     8·N B
status · flags  u16 × N + u16 × N · count_eq                    4·N B
dim indices     u16/u8 × N × 5 · interned                       ≈ 10·N B
ext indices     u16 × N × 4 · labels                            8·N B
bloom filter    1024 bits · primary dim                         128 B
intern table    zstd compressed · lazy-load skippable           variable
variable data   zstd · labels + payload                         variable
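The fixed-width part of this layout is pure arithmetic, so the header's column offsets are just a running sum over the table above. The function names and exact header packing below are assumptions; only the per-section sizes follow the table:

```rust
/// Byte size of each fixed-width section for a segment holding `n` events.
fn fixed_section_bytes(n: u64) -> [(&'static str, u64); 8] {
    [
        ("header", 256),
        ("ts_ns", 8 * n),        // i64 × N, sorted
        ("metric", 8 * n),       // i64 × N
        ("counters", 20 * n),    // u32 × N × 5
        ("latencies", 8 * n),    // u32 × N × 2
        ("status_flags", 4 * n), // u16 × N + u16 × N
        ("dim_indices", 10 * n), // u16/u8 × N × 5, interned
        ("ext_indices", 8 * n),  // u16 × N × 4
    ]
}

/// Column offsets as the header would record them: a running sum.
fn offsets(n: u64) -> Vec<(&'static str, u64)> {
    let mut off = 0;
    fixed_section_bytes(n)
        .into_iter()
        .map(|(name, sz)| { let o = off; off += sz; (name, o) })
        .collect()
}
```

For a full 100k-event segment the fixed region is 256 + 66·N bytes; the bloom filter, intern table, and variable data follow it.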
[ diagram 04 ]

Module map — what the crate actually contains.

src/write · ingest
  • mod.rs · buffer + rotation
  • wal.rs · len-prefixed log
  • recovery.rs · replay on open

src/storage · on disk
  • segment.rs · kseg writer
  • mmap.rs · read-only mappings
  • intern.rs · arena table
  • bloom.rs · bit filter
  • compress.rs · zstd bridge

src/read · query
  • query.rs · recent + point
  • rollup.rs · per-day fxmap
  • simd.rs · avx2 kernels

src/ops · admin
  • gc.rs · cutoff unlink
  • archive.rs · cold move
  • meta.rs · catalog
[ notes ]

The four invariants that make this simple.

i · write

One writer per engine. The write path holds a RwLock on rollups and a Mutex on the WAL. Reads acquire only read locks — concurrent with writes.
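A minimal sketch of that locking discipline, assuming placeholder `Rollups` and `Wal` types rather than the real ones:

```rust
use std::sync::{Mutex, RwLock};

struct Rollups { per_day: Vec<(i64, i64)> }
struct Wal { bytes: Vec<u8> }

struct Engine {
    rollups: RwLock<Rollups>, // writer takes .write() briefly, readers .read()
    wal: Mutex<Wal>,          // the single writer serialises WAL appends
}

impl Engine {
    fn append(&self, day: i64, metric: i64, frame: &[u8]) {
        self.wal.lock().unwrap().bytes.extend_from_slice(frame);
        self.rollups.write().unwrap().per_day.push((day, metric));
    }

    fn read_rollups(&self) -> usize {
        // Readers never touch the Mutex, so they contend only on the RwLock.
        self.rollups.read().unwrap().per_day.len()
    }
}
```

With one writer, the Mutex is uncontended; the RwLock exists so many readers can share the rollups between the writer's short write-lock windows.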

ii · time

Events arrive in non-decreasing ts_ns order. Each segment's ts column is sorted; readers binary-search.
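Because the ts column is sorted, a time-range scan is two binary searches rather than a full scan; a sketch using the standard library's `partition_point`:

```rust
/// Half-open index range of events with `from <= ts_ns <= to`,
/// given a sorted ts_ns column.
fn time_range(ts_ns: &[i64], from: i64, to: i64) -> std::ops::Range<usize> {
    let lo = ts_ns.partition_point(|&t| t < from);  // first index >= from
    let hi = ts_ns.partition_point(|&t| t <= to);   // first index > to
    lo..hi
}
```

The returned range maps directly onto the other columns, since all twenty share the same row order.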

iii · immutable

A closed .kseg is never modified. All updates are tombstones in a side index.
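On the read side this means subtracting the side index, not rewriting the segment; a sketch with illustrative names:

```rust
use std::collections::HashSet;

/// Event ids from a segment minus the tombstoned ones.
fn visible_ids(segment_ids: &[u64], tombstones: &HashSet<u64>) -> Vec<u64> {
    segment_ids
        .iter()
        .copied()
        .filter(|id| !tombstones.contains(id)) // deleted ids are skipped
        .collect()
}
```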

iv · partition by time

Retention is unlink(). There is no compaction because there is nothing to compact into.
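A sketch of that retention path, assuming a hypothetical `Segment` record and `gc` entry point (the real signatures may differ):

```rust
use std::fs;
use std::path::PathBuf;

struct Segment { path: PathBuf, min_ts: i64, max_ts: i64 }

/// Unlink every segment wholly older than the cutoff; nothing is rewritten.
fn gc(segments: &mut Vec<Segment>, cutoff_ts: i64) -> usize {
    let mut removed = 0;
    segments.retain(|s| {
        if s.max_ts < cutoff_ts {
            let _ = fs::remove_file(&s.path); // unlink(); ignore already-gone files
            removed += 1;
            false
        } else {
            true
        }
    });
    removed
}
```

A segment whose range straddles the cutoff is kept whole; per-event precision comes from the tombstone index, not from rewriting files.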