docs/getting started/overview

KeplorDB, in one page.

A columnar, append-only log engine written in Rust — purpose-built for high-throughput structured event ingestion. Declare the event shape with a derive macro, open the engine, append events, query columns. Bitmap indexes for status, zone maps for dimensions, AVX2 SIMD with hardware prefetching. No server, no SQL, no background threads.

Overview

KeplorDB is an embeddable library. It is not a database server. You link it into your Rust binary, call Engine::open(), and write events. Every append() goes to a sharded WAL on disk and to an in-memory columnar buffer. When the buffer fills, it rotates into an immutable .kseg segment file.

Reads mmap those segment files and scan the relevant columns directly — no deserialization, no row reconstruction, no query planner. Aggregations run over contiguous i64 and u32 arrays using AVX2 SIMD with hardware prefetching, with a scalar fallback. Cross-segment queries fan out across rayon workers and merge globally.

Query performance is boosted by three index structures — per-segment bloom filters on the primary dimension, per-value compressed status bitmap indexes, and per-chunk zone maps on all dimensions — plus a per-segment OnceLock cache for the string intern resolve table. Together these skip entire segments, rows, and 256-row chunks before any column data is touched.

not forMutable rows, joins, ad-hoc secondary indexes, SQL, or externally-sourced untrusted segment files. KeplorDB assumes append-only, time-ordered events. See PRODUCTION.md for the full list of pre-1.0 gaps.

Install

KeplorDB ships as a Cargo workspace — the keplordb crate (engine) plus keplordb-macros (the #[derive(Schema)] proc-macro, re-exported from the main crate). One git dep gets both:

Cargo.toml[dependencies]
keplordb = { git = "https://github.com/themankindproject/keplordb" }

Or via cargo:

shell$ cargo add keplordb --git https://github.com/themankindproject/keplordb

Requires Rust 1.82 or newer. The crate pulls in zstd, zerocopy, memmap2, thiserror, rustc-hash, hashbrown, mimalloc, rayon, crc32fast, and arc-swap. The macro crate adds syn 2, quote, proc-macro2 at build time only.

Schema derive

The recommended entry point. #[derive(keplordb::Schema)] turns a named-field struct into a typed event with fluent getters/setters plus an auto-generated Schema impl — no positional dims[0] indexing anywhere in user code.

schema — named dims, counters, labelsuse keplordb::Schema;

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user:    String,
    #[dim(rollup)]        pub org:     String,
    #[dim]                pub model:   String,
    #[counter]            pub tokens:  u32,
    #[label]              pub region:  String,
}

The macro emits four things:

impl Schema for AiCall — NUM_DIMS, NUM_COUNTERS, NUM_LABELS, BLOOM_DIM, ROLLUP_DIMS, and name-lookup match arms wired from source order.
AiCallEvent — a typed wrapper with fluent setters (.user(…), .tokens(…)) plus scalar setters for ts_ns, metric, status, flags, latency_ms, and id. Getters are named get_user() etc. .into_log_event() materialises a positional LogEvent<D, C, L>.
AiCallFilter — a newtype over QueryFilter<D> with the same named setters plus from_ts/to_ts/status_eq/cursor pass-throughs. .into_filter() unwraps for engine.query_recent / aggregate.
Inherent methods on the struct — AiCall::new(ts_ns), AiCall::filter(), and AiCall::default_config(data_dir) which builds an EngineConfig with bloom_dim and rollup_dims prefilled.

Validation enforced by the macro, all with spans pointing at your source:

Every field must carry exactly one of #[dim], #[counter], #[label].
#[dim] and #[label] fields must be typed String; #[counter] must be u32.
At most one #[dim(bloom)].
#[keplordb(id = N)] required exactly once with N: u8.
Caps: D ≤ 256, C ≤ 64, L ≤ 64 — enforced via a const { assert!(...) } block.

positional apiThe derive is purely additive. The raw LogEvent<D, C, L> and QueryFilter<D> API stays available for dynamic / runtime-decided schemas. See examples/raw_positional.rs.

Quickstart

src/main.rs — derive · open · append · aggregateuse keplordb::{Engine, Schema};

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user:   String,
    #[dim]                pub model:  String,
    #[counter]            pub tokens: u32,
}

fn main() -> Result<(), keplordb::DbError> {
    // `default_config` prefills bloom_dim + rollup_dims from the schema.
    let engine: Engine<2,1,0> = Engine::open(
        AiCall::default_config("/tmp/my_logs".into())
    )?;

    // Fluent typed builder — no positional indexing.
    engine.append(&AiCall::new(ts_ns())
        .user("alice").model("gpt-4o")
        .tokens(1000).metric(5_000_000).status(200)
        .into_log_event())?;

    // Named filter. `into_filter()` unwraps for the engine API.
    let page = engine.query_recent(
        &AiCall::filter().user("alice").into_filter(),
        50,
    )?;

    // Aggregate — SIMD scan + zone map pruning + intern cache.
    let totals = engine.aggregate(&AiCall::filter().into_filter())?;
    println!("events: {}, metric sum: {}", totals.event_count, totals.metric);

    engine.flush()?;
    Ok(())
}

tipThe engine is Send + Sync; share it behind an Arc. append_batch is the throughput path — each batch lands on one WAL shard, so N concurrent writer threads see zero contention when wal_shard_count ≥ N.

Data model

Every record is a LogEvent<D, C, L> — a flat, fixed-shape struct parameterised by three const generics. D is the dimension count, C the counter count, L the label count. The #[derive(Schema)] macro chooses these for you based on the field tags; you rarely spell them out manually.

There is no schema migration: every event in a given data directory has the same shape. The header's schema_id is checked on open and a mismatch is rejected as DbError::Corrupt. Unused columns are cheap (dims / labels intern to the empty string; counters default to zero).

Dimensions vs. labels vs. counters

dims — indexed and filterable. Tagged #[dim]. Backed by per-segment intern table (String → u16); compared as u16 during query. Optional #[dim(bloom)] picks the primary dim for the per-segment bloom filter. Optional #[dim(rollup)] includes the dim in the daily rollup key. Zone-mapped per 256-row chunk for pruning.
counters — unsigned u32 tallies. Tagged #[counter]. Summed in aggregate queries via AVX2.
labels — free-form strings. Tagged #[label]. Stored but not indexed. Returned with query_recent, invisible to aggregate.
status — HTTP / gRPC / application status code (u16). Bitmap-indexed per unique value for O(1) lookups.

LogEvent schema

field	type	description
id	String	Unique event identifier. Interned for fast point lookups.
ts_ns	i64	Nanosecond timestamp. Sorted, binary-searchable per segment. Delta-encoded + zstd.
metric	i64	Primary signed metric — cost, duration, scalar value. Delta-encoded + zstd.
counters[0..5]	u32	Five unsigned counters — tokens, bytes, retries.
latency_ms	u32	Primary latency (ms).
latency_detail_ms	u32	Detailed latency breakdown (ms).
latency_detail_ms	u32	Detailed latency breakdown (ms).
status	u16	Status code — HTTP, gRPC, or application-defined. Bitmap-indexed.
flags	u16	16 boolean bitflags.
dims[0..5]	String	Five indexed, filterable dimensions. Interned per segment. Zone-mapped for chunk pruning.
labels[0..3]	String	Three free-form string labels.

Segments & WAL

A KeplorDB data directory contains:

An active wal.log — append-only framed log of every event.
Zero or more immutable *.kseg segments — columnar, compressed, mmap-friendly. Format v3 (KSG3) includes status bitmap index and zone maps.
An in-memory manifest rebuilt from segment headers on open — no persistent catalog file.

Writes go to both the WAL and an in-memory columnar buffer. When the buffer hits wal_max_events, it serialises to a new segment file and the WAL is truncated. Segments are never modified after creation — only deleted.

Each segment file (KSG3 v3) contains: a 256-byte header, zstd-compressed delta-encoded i64 block (ts_ns + metric), raw u32/u16 columns, a 128-byte bloom filter, compressed status bitmap index, raw zone maps, and a zstd-compressed intern table.

Durability

Engine::append / append_batch writes the event into the on-disk WAL and returns once the bytes hit the kernel's page cache. The WAL is fsync'd in batches, not on every append. Two knobs on EngineConfig control when the fsync fires:

field	default	meaning
wal_sync_interval	64	fsync after this many events (per shard).
wal_sync_bytes	262144	fsync when buffered bytes crosses this (256 KB).

The consequence: on a hard crash, you can lose up to one full sync interval per shard — at defaults that is up to wal_sync_interval × wal_shard_count events, or roughly 256 KB of writes that had made it to the OS page cache but not to the disk platter. Three profiles:

Zero-loss — wal_sync_interval = 1, wal_sync_bytes = 1. Every append fsyncs. Throughput drops to fsync rate.
Balanced (default) — 64 events / 256 KB. Losing the trailing batch on crash is acceptable; replay recovers everything else.
Best-effort — wal_sync_interval = u32::MAX, wal_sync_bytes = u64::MAX. fsync only at segment rotation. Max throughput; worst crash window.

On clean shutdown call Engine::flush() to rotate the WAL into a segment and fsync everything.

Recovery: Engine::open() replays wal.log files frame-by-frame with CRC32 validation; partial / truncated frames at the end are detected and recovered up to the last complete frame. Crash-mid-rotation is also handled — orphaned *.wal.rotating files are replayed too.

caveatfsync only protects against process + OS crashes. For power-loss durability on consumer SSDs, also ensure your filesystem is mounted with data=journal or equivalent. KeplorDB does not flush hardware write caches.

Schema id

Every segment header records EngineConfig.schema_id (a single u8). On open, Engine::open() verifies every segment's schema_id matches the config; a mismatch returns DbError::Corrupt naming both ids. This protects against mounting a data directory written under a different schema — the column layout is strictly positional, so a silent mismatch would corrupt every read.

Pick a schema_id for a given data directory and never change it. If your deployment topology has multiple event shapes, give each its own data_dir and its own schema_id.

migrationEvolving a schema in place — reordering fields, adding a dim, changing a counter — is not supported today. See the pre-1.0 gaps in PRODUCTION.md.

Garbage collection

Retention is segment-level. engine.gc(cutoff_ts_ns) deletes every segment whose max_ts < cutoff. There is no compaction, no background merge, and no write amplification — GC is a few unlink() calls.

GC uses the in-memory manifest (ArcSwap index) for segment metadata — zero disk reads during GC. No syscalls beyond the actual unlink() calls. This is O(files) in-memory operations vs O(files) syscalls in the old approach.

// drop segments older than 7 days
engine.gc(ts_ns() - 7 * 86_400 * 1_000_000_000)?;

API reference

The full surface of the Engine struct.

Lifecycle

method	description
Engine::open(config)	Open (or create) a data directory. Replays the WAL on start.
engine.flush()	Flush in-memory buffer + WAL to disk. Always called in `Drop`.

Write

method	description
engine.append(&event)	Append a single event. WAL-durable.
engine.append_batch(&events)	Append a slice of events. Single WAL frame, bulk column writes.

Read

method	description
engine.query_recent(&filter, limit)	Most recent events matching filter, strictly newest-first across all segments. Each candidate segment contributes up to `limit` matches; the pool is merged by ts_ns descending and truncated. max_ts-descending scan order with early termination keeps the hot path at ~1 segment visit for time-partitioned layouts.
engine.aggregate(&filter)	SIMD-scanned totals with zone map chunk pruning and bloom segment skip. Event count, metric sum, counter totals. Rayon fan-out across segments; cached intern resolve table per segment.
engine.query_rollups(from_day, to_day, &dim_filters)	Per-day rollups for the configured `rollup_dims` across the selected range.
engine.get_event("id")	Point lookup by event id. Intern lookup skips non-matching segments; matching uses u16 index comparison, not string comparison. Respects tombstones.

Admin

method	description
engine.delete_event("id")	Tombstone a single event by id. O(1) append to tombstones file. Excluded from subsequent reads.
engine.gc(cutoff_ts_ns)	Drop every segment with `max_ts < cutoff`. Uses in-memory manifest — zero disk reads. Returns stats.

Errors

All fallible calls return Result<T, DbError>. Notable variants:

DbError::Io(io::Error) io: Underlying filesystem or mmap failure. Engine state is typically preserved; retry after diagnosing.
DbError::WalCorrupt { offset } recovery: WAL frame failed CRC check. The engine truncates at offset and surfaces this once on open.
DbError::Corrupt(msg) recovery: Segment header or index data is corrupt — bad magic, truncated bitmap, invalid zone map, or truncated intern data. File is skipped and a warning is logged.
DbError::InternTableFull write: A single segment accumulated more than 65,535 unique strings in one dim. Rotate the segment or reduce cardinality.

Segment format

Every .kseg file is self-describing and read-only. Columns are written in fixed-width blocks so mmap'd slices can be reinterpreted as typed arrays via zerocopy::FromBytes.

kseg — on-disk layout (KSG3 v3)┌────────────────────────────────┐
│  header            256 B       │  magic · version · N · offsets
├────────────────────────────────┤
│  i64 block         zstd+d      │  delta-encoded ts_ns + metric
│  u32 cols          latency+cnt │  latency_ms, latency_detail, counters
│  u16 cols          status+dim  │  status, flags, dims, id, labels
├────────────────────────────────┤
│  bloom filter      128 B       │  primary dim skip
│  status bitmap     zstd        │  per-value compressed bitmaps
│  zone maps         raw         │  min/max per 256-row chunk × D
│  intern table      zstd        │  lazy-decompressed
└────────────────────────────────┘

Write path

Sharded WAL, batch-routed — each append_batch claims a shard with a single fetch_add and takes one Mutex. Concurrent writers land on different shards; no per-event lock fan-out. Per-event append still rotates shards round-robin.
Arena-backed InternTable — hashbrown::HashTable with usize-width FxHash (8 bytes/iter, no Hasher trait overhead). Zero String allocations per event on the intern hot path.
Deferred rollup accumulation — rollups are built from segment column data during rotation, not per-event. merge_from lets thread-local rollups merge under a short write lock.
Reusable WAL serialisation buffer — no per-event heap allocation. CRC32-framed records. Crash-safe three-phase rotation: rename → write segment → unlink, with *.wal.rotating files recovered on next open.
O(1) tombstone append — single line appended + sync_data. Full rewrite deferred to GC.
Status bitmap + zone map build — during rotation, SIMD min/max reduction (_mm256_min_epu16/_mm256_max_epu16) builds per-dim zone maps; per-status bitmaps are compressed with zstd.
Column-by-column sort reorder — SegmentWriter::sort_by_ts reorders with one scratch Vec per element type instead of cloning the full column buffer; peak memory drops from 2× to ~1.1×.
Delta-encoded i64 block — ts_ns and metric stored as deltas, compressed with zstd. Small positive deltas compress very well.
GC uses in-memory manifest — segment metadata from the ArcSwap index, no disk reads during gc; only the actual unlink() syscalls.

Read path

Lock-free segment access — segment manifest + mmap cache behind ArcSwap. Readers load with one atomic pointer op, zero contention with writers.
Lock-free tombstone check — is_tombstoned() is a single atomic load via ArcSwap<HashSet>.
Global newest-first merge — query_recent sorts candidate segments by max_ts desc, pulls up to limit matches per segment, pools them, sorts by ts_ns desc, and truncates. Early-terminates when the next segment's max_ts is below the kth-best pool entry.
Rayon parallel aggregate — per-segment SIMD scans fan out across cores; results summed at the end. One atomic for rollup reads; lock-free index access.
Cached intern resolve table — MmapSegment::intern_resolve_table() decompresses + builds the HashMap<String,u16> once via OnceLock; subsequent calls return &cached. 170× faster filtered aggregate, 407× faster query_recent_100 after warmup.
Pre-decompressed i64 cache — MmapSegment caches the delta-decoded ts_ns + metric block at open time.
Zero-copy u32/u16 columns via zerocopy::FromBytes — pointer arithmetic into the mmap'd region.
Segment-level time skip using min_ts/max_ts from segment metadata.
Bloom filter check operates directly on mmap'd bytes — no struct copy.
Status bitmap index — per-value compressed bitmaps enable O(1) status lookups, skipping non-matching rows entirely. Falls back to full scan if unavailable.
Zone map chunk pruning — min/max per 256-row chunk per dimension. Chunks that can't match are skipped before any column access. Used by both query_recent and filtered aggregates.
Late materialization — query_recent only loads full columns (metric, latency, counters, labels, id) for rows that pass all filters.

SIMD & scan

Hot scan kernels compile to AVX2 when the target supports it and fall back to scalar code otherwise. Every SIMD loop includes hardware prefetch hints (_mm_prefetch with _MM_HINT_T0) to hide memory latency.

sum_i64(col: &[i64]) -> i64 avx2 + prefetch: Horizontal sum over the metric column. 4 lanes × 256-bit accumulators. Prefetches 8 iterations (128 bytes) ahead.
sum_u32_as_u64(col: &[u32]) -> u64 avx2 + prefetch: Widening sum for counter columns — avoids overflow on long segments. Prefetches 8 iterations ahead.
count_eq_u16(col: &[u16], needle: u16) -> u64 avx2 + prefetch: Vectorised equality count for status and flag columns. Prefetches 4 iterations (128 bytes) ahead.
filtered_aggregate(…) avx2 + prefetch: Combined mask + sum pass: filter by dim index, sum metric in a single linear scan. Prefetches 4 iterations ahead.
filtered_aggregate_generic<C>(…) avx2 + prefetch: Generic filtered aggregate with arbitrary counter columns. SIMD dim match + scalar counter accumulation.
count_range_i64(…) avx2 + prefetch: Count i64 elements in [lo, hi] range. Prefetches 8 iterations ahead.
sum_range_i64(…) avx2 + prefetch: Sum i64 elements in [lo, hi] range. Early-skips chunks with no matches. Prefetches 8 iterations ahead.
filtered_sum_u32(…) avx2 + prefetch: Sum u32 elements where corresponding dim_col == needle. Prefetches 4 iterations ahead.

Status bitmap index

Each segment stores a compressed bitmap per unique status value. Bit i is set if row i has that status. This enables O(1) status lookups without scanning the status column.

Build — during segment rotation, unique status values are collected and bitmaps built per value, then zstd-compressed. Format: [num_entries: u16][value: u16, len: u32, bytes...]*.
Query — status_range_indices(min, max) decompresses matching bitmaps and returns sorted row indices. status_bitmap_iter(value) returns an iterator for single-value lookups.
Fallback — if the bitmap index is unavailable or empty, the query falls back to a full status column scan.
Integration — query_recent uses the bitmap index when a status filter is present, iterating only matching rows in reverse (newest-first).

formatSerialized as [num_entries: u16][value: u16, len: u32, bytes...]*, then zstd-compressed. Decompression is lazy — only the needed bitmap is decompressed. The StatusBitmapIter iterates set bits efficiently.

Zone maps

Per-dimension min/max values are stored for each 256-row chunk. When a dimension filter is present, chunks whose min/max range doesn't contain the filter value are skipped entirely.

Build — during segment rotation, SIMD min/max reduction (_mm256_min_epu16 / _mm256_max_epu16) computes chunk boundaries for all D dimensions in parallel.
Query — zone_map_pruned_chunks() returns a list of (start, end) ranges for chunks that might match. Used by both query_recent (with dim filters) and aggregate_filtered.
Integration — when zone maps are available and dimension filters exist, the query scans only matching chunks in reverse order (newest-first), skipping entire 256-row blocks.

formatSerialized as [num_chunks: u16][min: u16, max: u16]*D per chunk. Stored raw (uncompressed) between the status bitmap and intern table. ZONE_MAP_CHUNK_SIZE = 256 rows.

Configuration

AiCall::default_config(data_dir) (generated by the derive) returns an EngineConfig with bloom_dim and rollup_dims prefilled from the schema. Everything else follows EngineConfig::default():

field	type	default	description
data_dir	PathBuf	—	Directory to hold WAL + segments. Created if missing.
wal_max_events	u32	500_000	Events per WAL shard before rotation into a segment.
wal_sync_interval	u32	64	fsync after this many events per shard. Set to 1 for zero data loss.
wal_sync_bytes	u64	262_144	fsync when a shard has buffered this many bytes (256 KB).
wal_shard_count	usize	4	Number of WAL shards. Each `append_batch` lands on one shard (round-robin at batch level).
schema_id	u8	0	Written into every segment header. Verified on open — a mismatch rejects the data directory.
bloom_dim	usize	0	Which dim drives the per-segment bloom filter. Must be < `NUM_DIMS`.
rollup_dims	Vec<usize>	[0, 1]	Dim indices composing the daily rollup key.
mmap_cache_capacity	usize	256	Max simultaneously mmap'd segment files. Older entries evicted LRU.
rollup_replay_days	u32	7	Days of historical segments to replay into the rollup store on startup.

Crash recovery

Engine::open() scans the data directory in this order:

Load in-memory manifest — if missing, rebuild from segment headers.
Validate each .kseg header magic; move corrupt files aside.
Replay wal.log frame-by-frame, CRC-checked; truncate at first bad frame. Handles truncated files gracefully.
Write replayed events into a new segment; update manifest.

Recovery is single-threaded and proportional to WAL size. For a default 64-event sync interval, recovery processes tens of thousands of events per second. The WAL uses sharded buffers for concurrent append safety.

Sizing & limits

limit	value	why
events / segment	2³¹	`u32` row indices throughout the column layout.
unique strings / dim / segment	65_535	`u16` intern index. Exceeding triggers early rotation.
zone map chunk size	256 rows	Balance between pruning granularity and index size. SIMD-built during rotation.
status bitmap entries	≤ 65,536	One bitmap per unique status value. Compressed with zstd.
concurrent writers	1	Single-writer by design. Wrap `Engine` in an `Arc<Mutex>` for multi-producer.

Use cases

The fixed LogEvent schema maps to any append-only, time-ordered workload. Below is how the generic fields adapt per domain.

All use cases benefit from the same performance optimizations: bitmap indexes for status filtering, zone maps for dimension chunk pruning, and AVX2 SIMD with hardware prefetching for aggregation.

Schema mapping

field	LLM	HTTP logs	IoT	payments	CI/CD	CDN
metric	cost (nanodollars)	response time	sensor reading	amount (cents)	build duration	bytes xfr
counters[0]	input tokens	req bytes	sample count	item count	test count	cache hits
counters[1]	output tokens	resp bytes	error count	tax	failures	cache miss
latency_ms	total latency	TTFB	poll interval	processing	queue wait	edge latency
status	HTTP status	HTTP status	device status	txn status	exit code	HTTP status
dims[0]	user_id	client IP	device_id	merchant	repo	edge POP
dims[1]	api_key	method	location	customer	branch	country
dims[2]	model	route	sensor type	currency	workflow	asset type
dims[3]	provider	upstream	firmware	pay method	runner OS	origin
labels[0]	error type	user agent	alert level	error code	commit SHA	referer
id	trace id	req id	event id	txn id	run id	req id
flags	streaming, tools	keep-alive	battery_low	refund	manual, retry	compressed

Payment ledger

let mut e = LogEvent::new(ts);
e.dims[0] = "merchant-99".into();
e.dims[1] = "cust-abc".into();
e.dims[2] = "USD".into();
e.dims[3] = "stripe".into();
e.counters[0] = 3;  // items
e.counters[1] = 450; // tax
e.metric = 4999;  // $49.99 in cents
e.status = 200;
engine.append(&e)?;

CI/CD pipeline events

let mut e = LogEvent::new(ts);
e.dims[0] = "myorg/api-server".into();
e.dims[1] = "main".into();
e.dims[2] = "deploy-prod".into();
e.dims[3] = "ubuntu-latest".into();
e.counters[0] = 342; // tests
e.labels[0] = "abc1234".into(); // SHA
e.metric = 184_000;
e.status = 0;
engine.append(&e)?;

CDN edge logs

let mut e = LogEvent::new(ts);
e.dims[0] = "IAD".into();
e.dims[1] = "US".into();
e.dims[2] = "image/webp".into();
e.dims[4] = "HIT".into();
e.counters[0] = 1; // cache hit
e.metric     = 284_000;
e.latency_ms = 12;
e.status     = 200;
engine.append(&e)?;

patternEvery use case follows the same shape: metric = primary value, counters = secondary tallies, dims = filterable axes (zone-mapped), labels = free-form context, status = bitmap-indexed. No schema migration needed.

← previous

Overview

API reference