KeplorDB, in one page.
A columnar, append-only log engine written in Rust — purpose-built for high-throughput structured event ingestion. Declare the event shape with a derive macro, open the engine, append events, query columns. Bitmap indexes for status, zone maps for dimensions, AVX2 SIMD with hardware prefetching. No server, no SQL, no background threads.
Overview
KeplorDB is an embeddable library. It is not a database server. You link it into your Rust binary, call Engine::open(), and write events. Every append() goes to a sharded WAL on disk and to an in-memory columnar buffer. When the buffer fills, it rotates into an immutable .kseg segment file.
Reads mmap those segment files and scan the relevant columns directly — no deserialization, no row reconstruction, no query planner. Aggregations run over contiguous i64 and u32 arrays using AVX2 SIMD with hardware prefetching, with a scalar fallback. Cross-segment queries fan out across rayon workers and merge globally.
Query performance is boosted by three index structures — per-segment bloom filters on the primary dimension, per-value compressed status bitmap indexes, and per-chunk zone maps on all dimensions — plus a per-segment OnceLock cache for the string intern resolve table. Together these skip entire segments, rows, and 256-row chunks before any column data is touched.
Install
KeplorDB ships as a Cargo workspace — the keplordb crate (engine) plus keplordb-macros (the #[derive(Schema)] proc-macro, re-exported from the main crate). One git dep gets both:
Cargo.toml[dependencies] keplordb = { git = "https://github.com/themankindproject/keplordb" }
Or via cargo:
shell$ cargo add keplordb --git https://github.com/themankindproject/keplordb
Requires Rust 1.82 or newer. The crate pulls in zstd, zerocopy, memmap2, thiserror, rustc-hash, hashbrown, mimalloc, rayon, crc32fast, and arc-swap. The macro crate adds syn 2, quote, proc-macro2 at build time only.
Schema derive
The recommended entry point. #[derive(keplordb::Schema)] turns a named-field struct into a typed event with fluent getters/setters plus an auto-generated Schema impl — no positional dims[0] indexing anywhere in user code.
schema — named dims, counters, labelsuse keplordb::Schema; #[derive(Schema)] #[keplordb(id = 1)] pub struct AiCall { #[dim(bloom, rollup)] pub user: String, #[dim(rollup)] pub org: String, #[dim] pub model: String, #[counter] pub tokens: u32, #[label] pub region: String, }
The macro emits four things:
impl Schema for AiCall—NUM_DIMS,NUM_COUNTERS,NUM_LABELS,BLOOM_DIM,ROLLUP_DIMS, and name-lookup match arms wired from source order.AiCallEvent— a typed wrapper with fluent setters (.user(…),.tokens(…)) plus scalar setters forts_ns,metric,status,flags,latency_ms, andid. Getters are namedget_user()etc..into_log_event()materialises a positionalLogEvent<D, C, L>.AiCallFilter— a newtype overQueryFilter<D>with the same named setters plusfrom_ts/to_ts/status_eq/cursorpass-throughs..into_filter()unwraps forengine.query_recent/aggregate.- Inherent methods on the struct —
AiCall::new(ts_ns),AiCall::filter(), andAiCall::default_config(data_dir)which builds anEngineConfigwithbloom_dimandrollup_dimsprefilled.
Validation enforced by the macro, all with spans pointing at your source:
- Every field must carry exactly one of
#[dim],#[counter],#[label]. #[dim]and#[label]fields must be typedString;#[counter]must beu32.- At most one
#[dim(bloom)]. #[keplordb(id = N)]required exactly once withN: u8.- Caps:
D ≤ 256,C ≤ 64,L ≤ 64— enforced via aconst { assert!(...) }block.
LogEvent<D, C, L> and QueryFilter<D> API stays available for dynamic / runtime-decided schemas. See examples/raw_positional.rs.Quickstart
src/main.rs — derive · open · append · aggregateuse keplordb::{Engine, Schema}; #[derive(Schema)] #[keplordb(id = 1)] pub struct AiCall { #[dim(bloom, rollup)] pub user: String, #[dim] pub model: String, #[counter] pub tokens: u32, } fn main() -> Result<(), keplordb::DbError> { // `default_config` prefills bloom_dim + rollup_dims from the schema. let engine: Engine<2,1,0> = Engine::open( AiCall::default_config("/tmp/my_logs".into()) )?; // Fluent typed builder — no positional indexing. engine.append(&AiCall::new(ts_ns()) .user("alice").model("gpt-4o") .tokens(1000).metric(5_000_000).status(200) .into_log_event())?; // Named filter. `into_filter()` unwraps for the engine API. let page = engine.query_recent( &AiCall::filter().user("alice").into_filter(), 50, )?; // Aggregate — SIMD scan + zone map pruning + intern cache. let totals = engine.aggregate(&AiCall::filter().into_filter())?; println!("events: {}, metric sum: {}", totals.event_count, totals.metric); engine.flush()?; Ok(()) }
Send + Sync; share it behind an Arc. append_batch is the throughput path — each batch lands on one WAL shard, so N concurrent writer threads see zero contention when wal_shard_count ≥ N.Data model
Every record is a LogEvent<D, C, L> — a flat, fixed-shape struct parameterised by three const generics. D is the dimension count, C the counter count, L the label count. The #[derive(Schema)] macro chooses these for you based on the field tags; you rarely spell them out manually.
There is no schema migration: every event in a given data directory has the same shape. The header's schema_id is checked on open and a mismatch is rejected as DbError::Corrupt. Unused columns are cheap (dims / labels intern to the empty string; counters default to zero).
Dimensions vs. labels vs. counters
- dims — indexed and filterable. Tagged
#[dim]. Backed by per-segment intern table (String → u16); compared asu16during query. Optional#[dim(bloom)]picks the primary dim for the per-segment bloom filter. Optional#[dim(rollup)]includes the dim in the daily rollup key. Zone-mapped per 256-row chunk for pruning. - counters — unsigned
u32tallies. Tagged#[counter]. Summed in aggregate queries via AVX2. - labels — free-form strings. Tagged
#[label]. Stored but not indexed. Returned withquery_recent, invisible toaggregate. - status — HTTP / gRPC / application status code (
u16). Bitmap-indexed per unique value for O(1) lookups.
LogEvent schema
| field | type | description |
|---|---|---|
| id | String | Unique event identifier. Interned for fast point lookups. |
| ts_ns | i64 | Nanosecond timestamp. Sorted, binary-searchable per segment. Delta-encoded + zstd. |
| metric | i64 | Primary signed metric — cost, duration, scalar value. Delta-encoded + zstd. |
| counters[0..5] | u32 | Five unsigned counters — tokens, bytes, retries. |
| latency_ms | u32 | Primary latency (ms). |
| latency_detail_ms | u32 | Detailed latency breakdown (ms). |
| latency_detail_ms | u32 | Detailed latency breakdown (ms). |
| status | u16 | Status code — HTTP, gRPC, or application-defined. Bitmap-indexed. |
| flags | u16 | 16 boolean bitflags. |
| dims[0..5] | String | Five indexed, filterable dimensions. Interned per segment. Zone-mapped for chunk pruning. |
| labels[0..3] | String | Three free-form string labels. |
Segments & WAL
A KeplorDB data directory contains:
- An active
wal.log— append-only framed log of every event. - Zero or more immutable
*.ksegsegments — columnar, compressed, mmap-friendly. Format v3 (KSG3) includes status bitmap index and zone maps. - An in-memory manifest rebuilt from segment headers on open — no persistent catalog file.
Writes go to both the WAL and an in-memory columnar buffer. When the buffer hits wal_max_events, it serialises to a new segment file and the WAL is truncated. Segments are never modified after creation — only deleted.
Each segment file (KSG3 v3) contains: a 256-byte header, zstd-compressed delta-encoded i64 block (ts_ns + metric), raw u32/u16 columns, a 128-byte bloom filter, compressed status bitmap index, raw zone maps, and a zstd-compressed intern table.
Durability
Engine::append / append_batch writes the event into the on-disk WAL and returns once the bytes hit the kernel's page cache. The WAL is fsync'd in batches, not on every append. Two knobs on EngineConfig control when the fsync fires:
| field | default | meaning |
|---|---|---|
| wal_sync_interval | 64 | fsync after this many events (per shard). |
| wal_sync_bytes | 262144 | fsync when buffered bytes crosses this (256 KB). |
The consequence: on a hard crash, you can lose up to one full sync interval per shard — at defaults that is up to wal_sync_interval × wal_shard_count events, or roughly 256 KB of writes that had made it to the OS page cache but not to the disk platter. Three profiles:
- Zero-loss —
wal_sync_interval = 1,wal_sync_bytes = 1. Every append fsyncs. Throughput drops to fsync rate. - Balanced (default) — 64 events / 256 KB. Losing the trailing batch on crash is acceptable; replay recovers everything else.
- Best-effort —
wal_sync_interval = u32::MAX,wal_sync_bytes = u64::MAX. fsync only at segment rotation. Max throughput; worst crash window.
On clean shutdown call Engine::flush() to rotate the WAL into a segment and fsync everything.
Recovery: Engine::open() replays wal.log files frame-by-frame with CRC32 validation; partial / truncated frames at the end are detected and recovered up to the last complete frame. Crash-mid-rotation is also handled — orphaned *.wal.rotating files are replayed too.
fsync only protects against process + OS crashes. For power-loss durability on consumer SSDs, also ensure your filesystem is mounted with data=journal or equivalent. KeplorDB does not flush hardware write caches.Schema id
Every segment header records EngineConfig.schema_id (a single u8). On open, Engine::open() verifies every segment's schema_id matches the config; a mismatch returns DbError::Corrupt naming both ids. This protects against mounting a data directory written under a different schema — the column layout is strictly positional, so a silent mismatch would corrupt every read.
Pick a schema_id for a given data directory and never change it. If your deployment topology has multiple event shapes, give each its own data_dir and its own schema_id.
Garbage collection
Retention is segment-level. engine.gc(cutoff_ts_ns) deletes every segment whose max_ts < cutoff. There is no compaction, no background merge, and no write amplification — GC is a few unlink() calls.
GC uses the in-memory manifest (ArcSwap index) for segment metadata — zero disk reads during GC. No syscalls beyond the actual unlink() calls. This is O(files) in-memory operations vs O(files) syscalls in the old approach.
// drop segments older than 7 days engine.gc(ts_ns() - 7 * 86_400 * 1_000_000_000)?;
API reference
The full surface of the Engine struct.
Lifecycle
| method | description |
|---|---|
| Engine::open(config) | Open (or create) a data directory. Replays the WAL on start. |
| engine.flush() | Flush in-memory buffer + WAL to disk. Always called in Drop. |
Write
| method | description |
|---|---|
| engine.append(&event) | Append a single event. WAL-durable. |
| engine.append_batch(&events) | Append a slice of events. Single WAL frame, bulk column writes. |
Read
| method | description |
|---|---|
| engine.query_recent(&filter, limit) | Most recent events matching filter, strictly newest-first across all segments. Each candidate segment contributes up to limit matches; the pool is merged by ts_ns descending and truncated. max_ts-descending scan order with early termination keeps the hot path at ~1 segment visit for time-partitioned layouts. |
| engine.aggregate(&filter) | SIMD-scanned totals with zone map chunk pruning and bloom segment skip. Event count, metric sum, counter totals. Rayon fan-out across segments; cached intern resolve table per segment. |
| engine.query_rollups(from_day, to_day, &dim_filters) | Per-day rollups for the configured rollup_dims across the selected range. |
| engine.get_event("id") | Point lookup by event id. Intern lookup skips non-matching segments; matching uses u16 index comparison, not string comparison. Respects tombstones. |
Admin
| method | description |
|---|---|
| engine.delete_event("id") | Tombstone a single event by id. O(1) append to tombstones file. Excluded from subsequent reads. |
| engine.gc(cutoff_ts_ns) | Drop every segment with max_ts < cutoff. Uses in-memory manifest — zero disk reads. Returns stats. |
Errors
All fallible calls return Result<T, DbError>. Notable variants:
- DbError::Io(io::Error) io
- Underlying filesystem or mmap failure. Engine state is typically preserved; retry after diagnosing.
- DbError::WalCorrupt { offset } recovery
- WAL frame failed CRC check. The engine truncates at
offsetand surfaces this once on open. - DbError::Corrupt(msg) recovery
- Segment header or index data is corrupt — bad magic, truncated bitmap, invalid zone map, or truncated intern data. File is skipped and a warning is logged.
- DbError::InternTableFull write
- A single segment accumulated more than 65,535 unique strings in one dim. Rotate the segment or reduce cardinality.
Segment format
Every .kseg file is self-describing and read-only. Columns are written in fixed-width blocks so mmap'd slices can be reinterpreted as typed arrays via zerocopy::FromBytes.
kseg — on-disk layout (KSG3 v3)┌────────────────────────────────┐ │ header 256 B │ magic · version · N · offsets ├────────────────────────────────┤ │ i64 block zstd+d │ delta-encoded ts_ns + metric │ u32 cols latency+cnt │ latency_ms, latency_detail, counters │ u16 cols status+dim │ status, flags, dims, id, labels ├────────────────────────────────┤ │ bloom filter 128 B │ primary dim skip │ status bitmap zstd │ per-value compressed bitmaps │ zone maps raw │ min/max per 256-row chunk × D │ intern table zstd │ lazy-decompressed └────────────────────────────────┘
Write path
- Sharded WAL, batch-routed — each
append_batchclaims a shard with a singlefetch_addand takes oneMutex. Concurrent writers land on different shards; no per-event lock fan-out. Per-event append still rotates shards round-robin. - Arena-backed
InternTable—hashbrown::HashTablewith usize-width FxHash (8 bytes/iter, no Hasher trait overhead). ZeroStringallocations per event on the intern hot path. - Deferred rollup accumulation — rollups are built from segment column data during rotation, not per-event.
merge_fromlets thread-local rollups merge under a short write lock. - Reusable WAL serialisation buffer — no per-event heap allocation. CRC32-framed records. Crash-safe three-phase rotation:
rename → write segment → unlink, with*.wal.rotatingfiles recovered on next open. - O(1) tombstone append — single line appended +
sync_data. Full rewrite deferred to GC. - Status bitmap + zone map build — during rotation, SIMD min/max reduction (
_mm256_min_epu16/_mm256_max_epu16) builds per-dim zone maps; per-status bitmaps are compressed with zstd. - Column-by-column sort reorder —
SegmentWriter::sort_by_tsreorders with one scratch Vec per element type instead of cloning the full column buffer; peak memory drops from 2× to ~1.1×. - Delta-encoded i64 block — ts_ns and metric stored as deltas, compressed with zstd. Small positive deltas compress very well.
- GC uses in-memory manifest — segment metadata from the ArcSwap index, no disk reads during gc; only the actual
unlink()syscalls.
Read path
- Lock-free segment access — segment manifest + mmap cache behind
ArcSwap. Readers load with one atomic pointer op, zero contention with writers. - Lock-free tombstone check —
is_tombstoned()is a single atomic load viaArcSwap<HashSet>. - Global newest-first merge —
query_recentsorts candidate segments bymax_tsdesc, pulls up tolimitmatches per segment, pools them, sorts by ts_ns desc, and truncates. Early-terminates when the next segment's max_ts is below the kth-best pool entry. - Rayon parallel aggregate — per-segment SIMD scans fan out across cores; results summed at the end. One atomic for rollup reads; lock-free index access.
- Cached intern resolve table —
MmapSegment::intern_resolve_table()decompresses + builds theHashMap<String,u16>once viaOnceLock; subsequent calls return&cached. 170× faster filtered aggregate, 407× fasterquery_recent_100after warmup. - Pre-decompressed i64 cache —
MmapSegmentcaches the delta-decoded ts_ns + metric block at open time. - Zero-copy u32/u16 columns via
zerocopy::FromBytes— pointer arithmetic into the mmap'd region. - Segment-level time skip using
min_ts/max_tsfrom segment metadata. - Bloom filter check operates directly on mmap'd bytes — no struct copy.
- Status bitmap index — per-value compressed bitmaps enable O(1) status lookups, skipping non-matching rows entirely. Falls back to full scan if unavailable.
- Zone map chunk pruning — min/max per 256-row chunk per dimension. Chunks that can't match are skipped before any column access. Used by both
query_recentand filtered aggregates. - Late materialization —
query_recentonly loads full columns (metric, latency, counters, labels, id) for rows that pass all filters.
SIMD & scan
Hot scan kernels compile to AVX2 when the target supports it and fall back to scalar code otherwise. Every SIMD loop includes hardware prefetch hints (_mm_prefetch with _MM_HINT_T0) to hide memory latency.
- sum_i64(col: &[i64]) -> i64 avx2 + prefetch
- Horizontal sum over the metric column. 4 lanes × 256-bit accumulators. Prefetches 8 iterations (128 bytes) ahead.
- sum_u32_as_u64(col: &[u32]) -> u64 avx2 + prefetch
- Widening sum for counter columns — avoids overflow on long segments. Prefetches 8 iterations ahead.
- count_eq_u16(col: &[u16], needle: u16) -> u64 avx2 + prefetch
- Vectorised equality count for status and flag columns. Prefetches 4 iterations (128 bytes) ahead.
- filtered_aggregate(…) avx2 + prefetch
- Combined mask + sum pass: filter by dim index, sum metric in a single linear scan. Prefetches 4 iterations ahead.
- filtered_aggregate_generic<C>(…) avx2 + prefetch
- Generic filtered aggregate with arbitrary counter columns. SIMD dim match + scalar counter accumulation.
- count_range_i64(…) avx2 + prefetch
- Count i64 elements in [lo, hi] range. Prefetches 8 iterations ahead.
- sum_range_i64(…) avx2 + prefetch
- Sum i64 elements in [lo, hi] range. Early-skips chunks with no matches. Prefetches 8 iterations ahead.
- filtered_sum_u32(…) avx2 + prefetch
- Sum u32 elements where corresponding dim_col == needle. Prefetches 4 iterations ahead.
Status bitmap index
Each segment stores a compressed bitmap per unique status value. Bit i is set if row i has that status. This enables O(1) status lookups without scanning the status column.
- Build — during segment rotation, unique status values are collected and bitmaps built per value, then zstd-compressed. Format:
[num_entries: u16][value: u16, len: u32, bytes...]*. - Query —
status_range_indices(min, max)decompresses matching bitmaps and returns sorted row indices.status_bitmap_iter(value)returns an iterator for single-value lookups. - Fallback — if the bitmap index is unavailable or empty, the query falls back to a full status column scan.
- Integration —
query_recentuses the bitmap index when a status filter is present, iterating only matching rows in reverse (newest-first).
[num_entries: u16][value: u16, len: u32, bytes...]*, then zstd-compressed. Decompression is lazy — only the needed bitmap is decompressed. The StatusBitmapIter iterates set bits efficiently.Zone maps
Per-dimension min/max values are stored for each 256-row chunk. When a dimension filter is present, chunks whose min/max range doesn't contain the filter value are skipped entirely.
- Build — during segment rotation, SIMD min/max reduction (
_mm256_min_epu16/_mm256_max_epu16) computes chunk boundaries for all D dimensions in parallel. - Query —
zone_map_pruned_chunks()returns a list of (start, end) ranges for chunks that might match. Used by bothquery_recent(with dim filters) andaggregate_filtered. - Integration — when zone maps are available and dimension filters exist, the query scans only matching chunks in reverse order (newest-first), skipping entire 256-row blocks.
[num_chunks: u16][min: u16, max: u16]*D per chunk. Stored raw (uncompressed) between the status bitmap and intern table. ZONE_MAP_CHUNK_SIZE = 256 rows.Configuration
AiCall::default_config(data_dir) (generated by the derive) returns an EngineConfig with bloom_dim and rollup_dims prefilled from the schema. Everything else follows EngineConfig::default():
| field | type | default | description |
|---|---|---|---|
| data_dir | PathBuf | — | Directory to hold WAL + segments. Created if missing. |
| wal_max_events | u32 | 500_000 | Events per WAL shard before rotation into a segment. |
| wal_sync_interval | u32 | 64 | fsync after this many events per shard. Set to 1 for zero data loss. |
| wal_sync_bytes | u64 | 262_144 | fsync when a shard has buffered this many bytes (256 KB). |
| wal_shard_count | usize | 4 | Number of WAL shards. Each append_batch lands on one shard (round-robin at batch level). |
| schema_id | u8 | 0 | Written into every segment header. Verified on open — a mismatch rejects the data directory. |
| bloom_dim | usize | 0 | Which dim drives the per-segment bloom filter. Must be < NUM_DIMS. |
| rollup_dims | Vec<usize> | [0, 1] | Dim indices composing the daily rollup key. |
| mmap_cache_capacity | usize | 256 | Max simultaneously mmap'd segment files. Older entries evicted LRU. |
| rollup_replay_days | u32 | 7 | Days of historical segments to replay into the rollup store on startup. |
Crash recovery
Engine::open() scans the data directory in this order:
- Load in-memory manifest — if missing, rebuild from segment headers.
- Validate each
.ksegheader magic; move corrupt files aside. - Replay
wal.logframe-by-frame, CRC-checked; truncate at first bad frame. Handles truncated files gracefully. - Write replayed events into a new segment; update manifest.
Recovery is single-threaded and proportional to WAL size. For a default 64-event sync interval, recovery processes tens of thousands of events per second. The WAL uses sharded buffers for concurrent append safety.
Sizing & limits
| limit | value | why |
|---|---|---|
| events / segment | 2³¹ | u32 row indices throughout the column layout. |
| unique strings / dim / segment | 65_535 | u16 intern index. Exceeding triggers early rotation. |
| zone map chunk size | 256 rows | Balance between pruning granularity and index size. SIMD-built during rotation. |
| status bitmap entries | ≤ 65,536 | One bitmap per unique status value. Compressed with zstd. |
| concurrent writers | 1 | Single-writer by design. Wrap Engine in an Arc<Mutex> for multi-producer. |
Use cases
The fixed LogEvent schema maps to any append-only, time-ordered workload. Below is how the generic fields adapt per domain.
All use cases benefit from the same performance optimizations: bitmap indexes for status filtering, zone maps for dimension chunk pruning, and AVX2 SIMD with hardware prefetching for aggregation.
Schema mapping
| field | LLM | HTTP logs | IoT | payments | CI/CD | CDN |
|---|---|---|---|---|---|---|
| metric | cost (nanodollars) | response time | sensor reading | amount (cents) | build duration | bytes xfr |
| counters[0] | input tokens | req bytes | sample count | item count | test count | cache hits |
| counters[1] | output tokens | resp bytes | error count | tax | failures | cache miss |
| latency_ms | total latency | TTFB | poll interval | processing | queue wait | edge latency |
| status | HTTP status | HTTP status | device status | txn status | exit code | HTTP status |
| dims[0] | user_id | client IP | device_id | merchant | repo | edge POP |
| dims[1] | api_key | method | location | customer | branch | country |
| dims[2] | model | route | sensor type | currency | workflow | asset type |
| dims[3] | provider | upstream | firmware | pay method | runner OS | origin |
| labels[0] | error type | user agent | alert level | error code | commit SHA | referer |
| id | trace id | req id | event id | txn id | run id | req id |
| flags | streaming, tools | keep-alive | battery_low | refund | manual, retry | compressed |
Payment ledger
let mut e = LogEvent::new(ts); e.dims[0] = "merchant-99".into(); e.dims[1] = "cust-abc".into(); e.dims[2] = "USD".into(); e.dims[3] = "stripe".into(); e.counters[0] = 3; // items e.counters[1] = 450; // tax e.metric = 4999; // $49.99 in cents e.status = 200; engine.append(&e)?;
CI/CD pipeline events
let mut e = LogEvent::new(ts); e.dims[0] = "myorg/api-server".into(); e.dims[1] = "main".into(); e.dims[2] = "deploy-prod".into(); e.dims[3] = "ubuntu-latest".into(); e.counters[0] = 342; // tests e.labels[0] = "abc1234".into(); // SHA e.metric = 184_000; e.status = 0; engine.append(&e)?;
CDN edge logs
let mut e = LogEvent::new(ts); e.dims[0] = "IAD".into(); e.dims[1] = "US".into(); e.dims[2] = "image/webp".into(); e.dims[4] = "HIT".into(); e.counters[0] = 1; // cache hit e.metric = 284_000; e.latency_ms = 12; e.status = 200; engine.append(&e)?;
metric = primary value, counters = secondary tallies, dims = filterable axes (zone-mapped), labels = free-form context, status = bitmap-indexed. No schema migration needed.