keplordb / docs · v0.1.0 · typed derive · intern cache
docs/getting started/overview

KeplorDB, in one page.

A columnar, append-only log engine written in Rust — purpose-built for high-throughput structured event ingestion. Declare the event shape with a derive macro, open the engine, append events, query columns. Bitmap indexes for status, zone maps for dimensions, AVX2 SIMD with hardware prefetching. No server, no SQL, no background threads.

Overview

KeplorDB is an embeddable library. It is not a database server. You link it into your Rust binary, call Engine::open(), and write events. Every append() goes to a sharded WAL on disk and to an in-memory columnar buffer. When the buffer fills, it rotates into an immutable .kseg segment file.

Reads mmap those segment files and scan the relevant columns directly — no deserialization, no row reconstruction, no query planner. Aggregations run over contiguous i64 and u32 arrays using AVX2 SIMD with hardware prefetching, with a scalar fallback. Cross-segment queries fan out across rayon workers and merge globally.

Query performance is boosted by three index structures — per-segment bloom filters on the primary dimension, per-value compressed status bitmap indexes, and per-chunk zone maps on all dimensions — plus a per-segment OnceLock cache for the string intern resolve table. Together these skip entire segments, rows, and 256-row chunks before any column data is touched.

not forMutable rows, joins, ad-hoc secondary indexes, SQL, or externally-sourced untrusted segment files. KeplorDB assumes append-only, time-ordered events. See PRODUCTION.md for the full list of pre-1.0 gaps.

Install

KeplorDB ships as a Cargo workspace — the keplordb crate (engine) plus keplordb-macros (the #[derive(Schema)] proc-macro, re-exported from the main crate). One git dep gets both:

Cargo.toml[dependencies]
keplordb = { git = "https://github.com/themankindproject/keplordb" }

Or via cargo:

shell$ cargo add keplordb --git https://github.com/themankindproject/keplordb

Requires Rust 1.82 or newer. The crate pulls in zstd, zerocopy, memmap2, thiserror, rustc-hash, hashbrown, mimalloc, rayon, crc32fast, and arc-swap. The macro crate adds syn 2, quote, proc-macro2 at build time only.

Schema derive

The recommended entry point. #[derive(keplordb::Schema)] turns a named-field struct into a typed event with fluent getters/setters plus an auto-generated Schema impl — no positional dims[0] indexing anywhere in user code.

schema — named dims, counters, labelsuse keplordb::Schema;

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user:    String,
    #[dim(rollup)]        pub org:     String,
    #[dim]                pub model:   String,
    #[counter]            pub tokens:  u32,
    #[label]              pub region:  String,
}

The macro emits four things:

  1. impl Schema for AiCallNUM_DIMS, NUM_COUNTERS, NUM_LABELS, BLOOM_DIM, ROLLUP_DIMS, and name-lookup match arms wired from source order.
  2. AiCallEvent — a typed wrapper with fluent setters (.user(…), .tokens(…)) plus scalar setters for ts_ns, metric, status, flags, latency_ms, and id. Getters are named get_user() etc. .into_log_event() materialises a positional LogEvent<D, C, L>.
  3. AiCallFilter — a newtype over QueryFilter<D> with the same named setters plus from_ts/to_ts/status_eq/cursor pass-throughs. .into_filter() unwraps for engine.query_recent / aggregate.
  4. Inherent methods on the structAiCall::new(ts_ns), AiCall::filter(), and AiCall::default_config(data_dir) which builds an EngineConfig with bloom_dim and rollup_dims prefilled.

Validation enforced by the macro, all with spans pointing at your source:

positional apiThe derive is purely additive. The raw LogEvent<D, C, L> and QueryFilter<D> API stays available for dynamic / runtime-decided schemas. See examples/raw_positional.rs.

Quickstart

src/main.rs — derive · open · append · aggregateuse keplordb::{Engine, Schema};

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user:   String,
    #[dim]                pub model:  String,
    #[counter]            pub tokens: u32,
}

fn main() -> Result<(), keplordb::DbError> {
    // `default_config` prefills bloom_dim + rollup_dims from the schema.
    let engine: Engine<2,1,0> = Engine::open(
        AiCall::default_config("/tmp/my_logs".into())
    )?;

    // Fluent typed builder — no positional indexing.
    engine.append(&AiCall::new(ts_ns())
        .user("alice").model("gpt-4o")
        .tokens(1000).metric(5_000_000).status(200)
        .into_log_event())?;

    // Named filter. `into_filter()` unwraps for the engine API.
    let page = engine.query_recent(
        &AiCall::filter().user("alice").into_filter(),
        50,
    )?;

    // Aggregate — SIMD scan + zone map pruning + intern cache.
    let totals = engine.aggregate(&AiCall::filter().into_filter())?;
    println!("events: {}, metric sum: {}", totals.event_count, totals.metric);

    engine.flush()?;
    Ok(())
}
tipThe engine is Send + Sync; share it behind an Arc. append_batch is the throughput path — each batch lands on one WAL shard, so N concurrent writer threads see zero contention when wal_shard_count ≥ N.

Data model

Every record is a LogEvent<D, C, L> — a flat, fixed-shape struct parameterised by three const generics. D is the dimension count, C the counter count, L the label count. The #[derive(Schema)] macro chooses these for you based on the field tags; you rarely spell them out manually.

There is no schema migration: every event in a given data directory has the same shape. The header's schema_id is checked on open and a mismatch is rejected as DbError::Corrupt. Unused columns are cheap (dims / labels intern to the empty string; counters default to zero).

Dimensions vs. labels vs. counters

LogEvent schema

fieldtypedescription
idStringUnique event identifier. Interned for fast point lookups.
ts_nsi64Nanosecond timestamp. Sorted, binary-searchable per segment. Delta-encoded + zstd.
metrici64Primary signed metric — cost, duration, scalar value. Delta-encoded + zstd.
counters[0..5]u32Five unsigned counters — tokens, bytes, retries.
latency_msu32Primary latency (ms).
latency_detail_msu32Detailed latency breakdown (ms).
latency_detail_msu32Detailed latency breakdown (ms).
statusu16Status code — HTTP, gRPC, or application-defined. Bitmap-indexed.
flagsu1616 boolean bitflags.
dims[0..5]StringFive indexed, filterable dimensions. Interned per segment. Zone-mapped for chunk pruning.
labels[0..3]StringThree free-form string labels.

Segments & WAL

A KeplorDB data directory contains:

Writes go to both the WAL and an in-memory columnar buffer. When the buffer hits wal_max_events, it serialises to a new segment file and the WAL is truncated. Segments are never modified after creation — only deleted.

Each segment file (KSG3 v3) contains: a 256-byte header, zstd-compressed delta-encoded i64 block (ts_ns + metric), raw u32/u16 columns, a 128-byte bloom filter, compressed status bitmap index, raw zone maps, and a zstd-compressed intern table.

Durability

Engine::append / append_batch writes the event into the on-disk WAL and returns once the bytes hit the kernel's page cache. The WAL is fsync'd in batches, not on every append. Two knobs on EngineConfig control when the fsync fires:

fielddefaultmeaning
wal_sync_interval64fsync after this many events (per shard).
wal_sync_bytes262144fsync when buffered bytes crosses this (256 KB).

The consequence: on a hard crash, you can lose up to one full sync interval per shard — at defaults that is up to wal_sync_interval × wal_shard_count events, or roughly 256 KB of writes that had made it to the OS page cache but not to the disk platter. Three profiles:

On clean shutdown call Engine::flush() to rotate the WAL into a segment and fsync everything.

Recovery: Engine::open() replays wal.log files frame-by-frame with CRC32 validation; partial / truncated frames at the end are detected and recovered up to the last complete frame. Crash-mid-rotation is also handled — orphaned *.wal.rotating files are replayed too.

caveatfsync only protects against process + OS crashes. For power-loss durability on consumer SSDs, also ensure your filesystem is mounted with data=journal or equivalent. KeplorDB does not flush hardware write caches.

Schema id

Every segment header records EngineConfig.schema_id (a single u8). On open, Engine::open() verifies every segment's schema_id matches the config; a mismatch returns DbError::Corrupt naming both ids. This protects against mounting a data directory written under a different schema — the column layout is strictly positional, so a silent mismatch would corrupt every read.

Pick a schema_id for a given data directory and never change it. If your deployment topology has multiple event shapes, give each its own data_dir and its own schema_id.

migrationEvolving a schema in place — reordering fields, adding a dim, changing a counter — is not supported today. See the pre-1.0 gaps in PRODUCTION.md.

Garbage collection

Retention is segment-level. engine.gc(cutoff_ts_ns) deletes every segment whose max_ts < cutoff. There is no compaction, no background merge, and no write amplification — GC is a few unlink() calls.

GC uses the in-memory manifest (ArcSwap index) for segment metadata — zero disk reads during GC. No syscalls beyond the actual unlink() calls. This is O(files) in-memory operations vs O(files) syscalls in the old approach.

// drop segments older than 7 days
engine.gc(ts_ns() - 7 * 86_400 * 1_000_000_000)?;

API reference

The full surface of the Engine struct.

Lifecycle

methoddescription
Engine::open(config)Open (or create) a data directory. Replays the WAL on start.
engine.flush()Flush in-memory buffer + WAL to disk. Always called in Drop.

Write

methoddescription
engine.append(&event)Append a single event. WAL-durable.
engine.append_batch(&events)Append a slice of events. Single WAL frame, bulk column writes.

Read

methoddescription
engine.query_recent(&filter, limit)Most recent events matching filter, strictly newest-first across all segments. Each candidate segment contributes up to limit matches; the pool is merged by ts_ns descending and truncated. max_ts-descending scan order with early termination keeps the hot path at ~1 segment visit for time-partitioned layouts.
engine.aggregate(&filter)SIMD-scanned totals with zone map chunk pruning and bloom segment skip. Event count, metric sum, counter totals. Rayon fan-out across segments; cached intern resolve table per segment.
engine.query_rollups(from_day, to_day, &dim_filters)Per-day rollups for the configured rollup_dims across the selected range.
engine.get_event("id")Point lookup by event id. Intern lookup skips non-matching segments; matching uses u16 index comparison, not string comparison. Respects tombstones.

Admin

methoddescription
engine.delete_event("id")Tombstone a single event by id. O(1) append to tombstones file. Excluded from subsequent reads.
engine.gc(cutoff_ts_ns)Drop every segment with max_ts < cutoff. Uses in-memory manifest — zero disk reads. Returns stats.

Errors

All fallible calls return Result<T, DbError>. Notable variants:

DbError::Io(io::Error) io
Underlying filesystem or mmap failure. Engine state is typically preserved; retry after diagnosing.
DbError::WalCorrupt { offset } recovery
WAL frame failed CRC check. The engine truncates at offset and surfaces this once on open.
DbError::Corrupt(msg) recovery
Segment header or index data is corrupt — bad magic, truncated bitmap, invalid zone map, or truncated intern data. File is skipped and a warning is logged.
DbError::InternTableFull write
A single segment accumulated more than 65,535 unique strings in one dim. Rotate the segment or reduce cardinality.

Segment format

Every .kseg file is self-describing and read-only. Columns are written in fixed-width blocks so mmap'd slices can be reinterpreted as typed arrays via zerocopy::FromBytes.

kseg — on-disk layout (KSG3 v3)┌────────────────────────────────┐
│  header            256 B       │  magic · version · N · offsets
├────────────────────────────────┤
│  i64 block         zstd+d      │  delta-encoded ts_ns + metric
│  u32 cols          latency+cnt │  latency_ms, latency_detail, counters
│  u16 cols          status+dim  │  status, flags, dims, id, labels
├────────────────────────────────┤
│  bloom filter      128 B       │  primary dim skip
│  status bitmap     zstd        │  per-value compressed bitmaps
│  zone maps         raw         │  min/max per 256-row chunk × D
│  intern table      zstd        │  lazy-decompressed
└────────────────────────────────┘

Write path

Read path

SIMD & scan

Hot scan kernels compile to AVX2 when the target supports it and fall back to scalar code otherwise. Every SIMD loop includes hardware prefetch hints (_mm_prefetch with _MM_HINT_T0) to hide memory latency.

sum_i64(col: &[i64]) -> i64 avx2 + prefetch
Horizontal sum over the metric column. 4 lanes × 256-bit accumulators. Prefetches 8 iterations (128 bytes) ahead.
sum_u32_as_u64(col: &[u32]) -> u64 avx2 + prefetch
Widening sum for counter columns — avoids overflow on long segments. Prefetches 8 iterations ahead.
count_eq_u16(col: &[u16], needle: u16) -> u64 avx2 + prefetch
Vectorised equality count for status and flag columns. Prefetches 4 iterations (128 bytes) ahead.
filtered_aggregate(…) avx2 + prefetch
Combined mask + sum pass: filter by dim index, sum metric in a single linear scan. Prefetches 4 iterations ahead.
filtered_aggregate_generic<C>(…) avx2 + prefetch
Generic filtered aggregate with arbitrary counter columns. SIMD dim match + scalar counter accumulation.
count_range_i64(…) avx2 + prefetch
Count i64 elements in [lo, hi] range. Prefetches 8 iterations ahead.
sum_range_i64(…) avx2 + prefetch
Sum i64 elements in [lo, hi] range. Early-skips chunks with no matches. Prefetches 8 iterations ahead.
filtered_sum_u32(…) avx2 + prefetch
Sum u32 elements where corresponding dim_col == needle. Prefetches 4 iterations ahead.

Status bitmap index

Each segment stores a compressed bitmap per unique status value. Bit i is set if row i has that status. This enables O(1) status lookups without scanning the status column.

formatSerialized as [num_entries: u16][value: u16, len: u32, bytes...]*, then zstd-compressed. Decompression is lazy — only the needed bitmap is decompressed. The StatusBitmapIter iterates set bits efficiently.

Zone maps

Per-dimension min/max values are stored for each 256-row chunk. When a dimension filter is present, chunks whose min/max range doesn't contain the filter value are skipped entirely.

formatSerialized as [num_chunks: u16][min: u16, max: u16]*D per chunk. Stored raw (uncompressed) between the status bitmap and intern table. ZONE_MAP_CHUNK_SIZE = 256 rows.

Configuration

AiCall::default_config(data_dir) (generated by the derive) returns an EngineConfig with bloom_dim and rollup_dims prefilled from the schema. Everything else follows EngineConfig::default():

fieldtypedefaultdescription
data_dirPathBufDirectory to hold WAL + segments. Created if missing.
wal_max_eventsu32500_000Events per WAL shard before rotation into a segment.
wal_sync_intervalu3264fsync after this many events per shard. Set to 1 for zero data loss.
wal_sync_bytesu64262_144fsync when a shard has buffered this many bytes (256 KB).
wal_shard_countusize4Number of WAL shards. Each append_batch lands on one shard (round-robin at batch level).
schema_idu80Written into every segment header. Verified on open — a mismatch rejects the data directory.
bloom_dimusize0Which dim drives the per-segment bloom filter. Must be < NUM_DIMS.
rollup_dimsVec<usize>[0, 1]Dim indices composing the daily rollup key.
mmap_cache_capacityusize256Max simultaneously mmap'd segment files. Older entries evicted LRU.
rollup_replay_daysu327Days of historical segments to replay into the rollup store on startup.

Crash recovery

Engine::open() scans the data directory in this order:

  1. Load in-memory manifest — if missing, rebuild from segment headers.
  2. Validate each .kseg header magic; move corrupt files aside.
  3. Replay wal.log frame-by-frame, CRC-checked; truncate at first bad frame. Handles truncated files gracefully.
  4. Write replayed events into a new segment; update manifest.

Recovery is single-threaded and proportional to WAL size. For a default 64-event sync interval, recovery processes tens of thousands of events per second. The WAL uses sharded buffers for concurrent append safety.

Sizing & limits

limitvaluewhy
events / segment2³¹u32 row indices throughout the column layout.
unique strings / dim / segment65_535u16 intern index. Exceeding triggers early rotation.
zone map chunk size256 rowsBalance between pruning granularity and index size. SIMD-built during rotation.
status bitmap entries≤ 65,536One bitmap per unique status value. Compressed with zstd.
concurrent writers1Single-writer by design. Wrap Engine in an Arc<Mutex> for multi-producer.

Use cases

The fixed LogEvent schema maps to any append-only, time-ordered workload. Below is how the generic fields adapt per domain.

All use cases benefit from the same performance optimizations: bitmap indexes for status filtering, zone maps for dimension chunk pruning, and AVX2 SIMD with hardware prefetching for aggregation.

Schema mapping

fieldLLMHTTP logsIoTpaymentsCI/CDCDN
metriccost (nanodollars)response timesensor readingamount (cents)build durationbytes xfr
counters[0]input tokensreq bytessample countitem counttest countcache hits
counters[1]output tokensresp byteserror counttaxfailurescache miss
latency_mstotal latencyTTFBpoll intervalprocessingqueue waitedge latency
statusHTTP statusHTTP statusdevice statustxn statusexit codeHTTP status
dims[0]user_idclient IPdevice_idmerchantrepoedge POP
dims[1]api_keymethodlocationcustomerbranchcountry
dims[2]modelroutesensor typecurrencyworkflowasset type
dims[3]providerupstreamfirmwarepay methodrunner OSorigin
labels[0]error typeuser agentalert levelerror codecommit SHAreferer
idtrace idreq idevent idtxn idrun idreq id
flagsstreaming, toolskeep-alivebattery_lowrefundmanual, retrycompressed

Payment ledger

let mut e = LogEvent::new(ts);
e.dims[0] = "merchant-99".into();
e.dims[1] = "cust-abc".into();
e.dims[2] = "USD".into();
e.dims[3] = "stripe".into();
e.counters[0] = 3;  // items
e.counters[1] = 450; // tax
e.metric = 4999;  // $49.99 in cents
e.status = 200;
engine.append(&e)?;

CI/CD pipeline events

let mut e = LogEvent::new(ts);
e.dims[0] = "myorg/api-server".into();
e.dims[1] = "main".into();
e.dims[2] = "deploy-prod".into();
e.dims[3] = "ubuntu-latest".into();
e.counters[0] = 342; // tests
e.labels[0] = "abc1234".into(); // SHA
e.metric = 184_000;
e.status = 0;
engine.append(&e)?;

CDN edge logs

let mut e = LogEvent::new(ts);
e.dims[0] = "IAD".into();
e.dims[1] = "US".into();
e.dims[2] = "image/webp".into();
e.dims[4] = "HIT".into();
e.counters[0] = 1; // cache hit
e.metric     = 284_000;
e.latency_ms = 12;
e.status     = 200;
engine.append(&e)?;
patternEvery use case follows the same shape: metric = primary value, counters = secondary tallies, dims = filterable axes (zone-mapped), labels = free-form context, status = bitmap-indexed. No schema migration needed.