KeplorDB, in one page.
A columnar, append-only log engine written in Rust — purpose-built for high-throughput structured event ingestion. No server, no SQL, no threads. Open the engine, append events, query columns.
Overview
KeplorDB is an embeddable library. It is not a database server. You link it into your Rust binary, call Engine::open(), and write events. Every append() goes to a WAL on disk and to an in-memory columnar buffer. When the buffer fills, it rotates into an immutable .kseg segment file.
Reads mmap those segment files and scan the relevant columns directly — no deserialization, no row reconstruction, no query planner. Aggregations run over contiguous i64 and u32 arrays using AVX2 SIMD, with a scalar fallback.
Install
Add the crate to your Cargo.toml:
```toml
# Cargo.toml
[dependencies]
keplordb = { git = "https://github.com/themankindproject/keplordb" }
```
Or via cargo:
```shell
$ cargo add keplordb --git https://github.com/themankindproject/keplordb
```
Requires Rust 1.82 or newer. The crate pulls in zstd, zerocopy, memmap2, thiserror, rustc-hash, hashbrown, mimalloc, and rayon as dependencies.
Quickstart
src/main.rs — open · append · aggregate

```rust
use keplordb::{Engine, EngineConfig, LogEvent, QueryFilter};

fn main() -> Result<(), keplordb::DbError> {
    let engine = Engine::open(EngineConfig {
        data_dir: "/tmp/my_logs".into(),
        wal_max_events: 100_000,
        ..Default::default()
    })?;

    // Write a single event.
    let mut e = LogEvent::new(ts_ns());
    e.dims[0] = "alice".into();
    e.dims[2] = "gpt-4o".into();
    e.metric = 5_000_000;
    e.counters[0] = 1000;
    e.status = 200;
    engine.append(&e)?;

    // Last 50 events for a user.
    let results = engine.query_recent(&QueryFilter {
        user_id: Some("alice".into()),
        ..Default::default()
    }, 50)?;

    // Full-segment aggregate (SIMD scan).
    let totals = engine.aggregate(&QueryFilter::default())?;
    println!("events: {}, metric sum: {}", totals.event_count, totals.metric);

    engine.flush()?;
    Ok(())
}
```
Use append_batch(&events) when you have more than a handful of events. It bypasses per-event WAL framing and reaches ~973K ev/s versus ~843K ev/s for single appends.

Data model
Every record is a LogEvent — a flat, fixed-shape struct. There is no schema migration; every event has the same columns. Unused columns are cheap (they intern to the empty string or zero).
Dimensions vs. labels
- dims[0..5] — indexed and filterable. Use these for the axes you query by. In LLM workloads, dim[0]=user, dim[1]=api_key, dim[2]=model, etc.
- labels[0..3] — free-form strings, stored but not indexed. Returned with query_recent, invisible to aggregate.
- payload — opaque JSON string, zstd-compressed alongside variable data.
LogEvent schema
| field | type | description |
|---|---|---|
| id | String | Unique event identifier. |
| ts_ns | i64 | Nanosecond timestamp. Sorted, binary-searchable per segment. |
| metric | i64 | Primary signed metric — cost, duration, scalar value. |
| counters[0..5] | u32 | Five unsigned counters — tokens, bytes, retries. |
| latency_ms | u32 | Primary latency (ms). Second latency lives in counters. |
| status | u16 | Status code — HTTP, gRPC, or application-defined. |
| flags | u16 | 16 boolean bitflags. |
| dims[0..5] | String | Five indexed, filterable dimensions. Interned per segment. |
| labels[0..3] | String | Three free-form string labels. |
| payload | String | JSON metadata — opaque to the engine. |
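The table above implies a flat struct along these lines. This is a sketch reconstructed from the schema table for illustration only; the crate's actual definition (field order, derives, constructors) may differ:

```rust
/// Illustrative sketch of the LogEvent shape implied by the schema table.
#[derive(Debug, Default, Clone)]
pub struct LogEvent {
    pub id: String,          // unique event identifier
    pub ts_ns: i64,          // nanosecond timestamp
    pub metric: i64,         // primary signed metric
    pub counters: [u32; 5],  // five unsigned counters
    pub latency_ms: u32,     // primary latency (ms)
    pub status: u16,         // status code
    pub flags: u16,          // 16 boolean bitflags
    pub dims: [String; 5],   // indexed, filterable dimensions
    pub labels: [String; 3], // free-form labels, not indexed
    pub payload: String,     // opaque JSON metadata
}

fn main() {
    let mut e = LogEvent::default();
    e.ts_ns = 1_700_000_000_000_000_000;
    e.dims[0] = "alice".into();
    e.status = 200;
    println!("{} {}", e.dims[0], e.status);
}
```

Because every field defaults to zero or the empty string, an event touching only a few columns stays cheap, matching the "unused columns are cheap" note above.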
Segments & WAL
A KeplorDB data directory contains:
- An active wal.log — append-only framed log of every event.
- Zero or more immutable *.kseg segments — columnar, compressed, mmap-friendly.
- A small meta.json with min_ts / max_ts / event count per segment.
Writes go to both the WAL and an in-memory columnar buffer. When the buffer hits wal_max_events, it serialises to a new segment file and the WAL is truncated. Segments are never modified after creation — only deleted.
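One way to picture the WAL's framed layout is with a length-prefixed encoding. This is a hypothetical format for illustration (a u32 little-endian length prefix plus payload per frame); the real format also carries a CRC per frame:

```rust
// Hypothetical WAL framing sketch: each frame is a u32 little-endian
// length prefix followed by the serialized event bytes. A batch append
// can share one frame for many events, amortizing the per-frame header.
fn write_frame(wal: &mut Vec<u8>, payload: &[u8]) {
    wal.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    wal.extend_from_slice(payload);
}

fn read_frames(wal: &[u8]) -> Vec<&[u8]> {
    let mut frames = Vec::new();
    let mut pos = 0;
    while pos + 4 <= wal.len() {
        let len = u32::from_le_bytes(wal[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        if pos + len > wal.len() {
            break; // truncated tail: stop here, as recovery does
        }
        frames.push(&wal[pos..pos + len]);
        pos += len;
    }
    frames
}

fn main() {
    let mut wal = Vec::new();
    write_frame(&mut wal, b"event-1");
    write_frame(&mut wal, b"event-2");
    assert_eq!(read_frames(&wal).len(), 2);
}
```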
Durability
Every append():
- Writes to the in-memory columnar buffer.
- Writes to the on-disk WAL file.
- fsyncs the WAL every 64 events by default (configurable via wal_sync_interval).
On crash, Engine::open() replays the WAL and writes recovered events into a segment. Maximum data loss is one sync interval — with defaults, up to 63 events. Set wal_sync_interval: 1 for strict fsync per append, at a significant throughput cost.
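A maximally conservative configuration, using only the fields documented on this page, might look like:

```rust
let engine = Engine::open(EngineConfig {
    data_dir: "/var/lib/myapp/logs".into(),
    wal_sync_interval: 1, // fsync on every append — slowest, most durable
    ..Default::default()
})?;
```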
fsync only protects against process crashes. For power-loss durability on consumer SSDs, also ensure your filesystem is mounted with data=journal or equivalent — KeplorDB does not attempt to flush hardware write caches.

Garbage collection
Retention is segment-level. engine.gc(cutoff_ts_ns) deletes every segment whose max_ts < cutoff. There is no compaction, no background merge, and no write amplification — GC is a few unlink() calls.
```rust
// drop segments older than 7 days
engine.gc(ts_ns() - 7 * 86_400 * 1_000_000_000)?;
```
API reference
The full surface of the Engine struct.
Lifecycle
| method | description |
|---|---|
| Engine::open(config) | Open (or create) a data directory. Replays the WAL on start. |
| engine.flush() | Flush in-memory buffer + WAL to disk. Always called in Drop. |
Write
| method | description |
|---|---|
| engine.append(&event) | Append a single event. WAL-durable. |
| engine.append_batch(&events) | Append a slice of events. Single WAL frame, bulk column writes. |
Read
| method | description |
|---|---|
| engine.query_recent(&filter, limit) | Return the most recent events matching filter, newest first. |
| engine.aggregate(&filter) | SIMD-scanned totals: event count, metric sum, per-status tallies. |
| engine.query_rollups(from_day, to_day, user, api_key) | Per-day, per-user, per-key rollups across the selected range. |
| engine.get_event("id") | Point lookup by event id. Uses bloom filters to skip segments. |
Admin
| method | description |
|---|---|
| engine.delete_event("id") | Tombstone a single event by id. Excluded from subsequent reads. |
| engine.gc(cutoff_ts_ns) | Drop every segment with max_ts < cutoff. Returns stats. |
Errors
All fallible calls return Result<T, DbError>. Notable variants:
- DbError::Io(io::Error) (io) — underlying filesystem or mmap failure. Engine state is typically preserved; retry after diagnosing.
- DbError::WalCorrupt { offset } (recovery) — a WAL frame failed its CRC check. The engine truncates at offset and surfaces this once on open.
- DbError::SegmentBadMagic { path } (recovery) — segment header missing the expected magic bytes. The file is moved to corrupt/.
- DbError::InternTableFull (write) — a single segment accumulated more than 65,535 unique strings in one dim. Rotate the segment or reduce cardinality.
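A caller might triage these variants along these lines. The enum below is a stand-in that mirrors the documented variants, not the crate's actual definition:

```rust
use std::path::PathBuf;

// Stand-in mirroring the documented DbError variants.
#[derive(Debug)]
enum DbError {
    Io(std::io::Error),
    WalCorrupt { offset: u64 },
    SegmentBadMagic { path: PathBuf },
    InternTableFull,
}

/// Illustrative triage: which errors are worth retrying vs. acting on.
fn is_retryable(err: &DbError) -> bool {
    match err {
        DbError::Io(_) => true,                   // possibly transient FS/mmap failure
        DbError::WalCorrupt { .. } => false,      // already truncated on open
        DbError::SegmentBadMagic { .. } => false, // file quarantined to corrupt/
        DbError::InternTableFull => false,        // rotate segment / reduce cardinality
    }
}

fn main() {
    let io_err = DbError::Io(std::io::Error::new(std::io::ErrorKind::Other, "disk full"));
    assert!(is_retryable(&io_err));
    assert!(!is_retryable(&DbError::InternTableFull));
}
```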
Segment format
Every .kseg file is self-describing and read-only. Columns are written in fixed-width blocks so mmap'd slices can be reinterpreted as typed arrays via zerocopy::FromBytes.
kseg — on-disk layout

```
┌─────────────────────────────────┐
│ header           256 B          │  magic · version · N · bloom offset
├─────────────────────────────────┤
│ ts_ns            i64 × N        │  sorted, binary-searchable
│ metric           i64 × N        │  contiguous for SIMD SUM
│ counters         u32 × N × 5    │
│ latencies        u32 × N × 2    │
│ status · flags   u16 × N        │
│ dim indices      u16/u8 × 5     │  interned string refs
│ ext indices      u16 × N × 4    │
├─────────────────────────────────┤
│ bloom filter     128 B          │  primary dim skip
│ intern table     zstd           │  lazy-decompressed
│ variable data    zstd           │  labels · payload
└─────────────────────────────────┘
```
Write path
- Arena-backed InternTable — a hashbrown::HashTable whose hashes and equality resolve through a contiguous Vec<u8>. Zero String allocations per event.
- FxHashMap (rustc-hash) for rollup accumulation.
- Unchecked Vec push across all 20 column buffers — bounds verified by capacity invariant at buffer-open time.
- Bulk column writes via zerocopy::IntoBytes — one write_all per column per segment flush.
- Reusable WAL serialisation buffer — no per-event heap allocation.
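The interning idea can be sketched with a std HashMap in place of the arena-backed hashbrown table described above; the 65,535-entry cap matches DbError::InternTableFull:

```rust
use std::collections::HashMap;

/// Simplified per-segment string interner: each distinct dim value maps
/// to a stable u16 index. The real engine resolves hashes through a
/// contiguous byte arena; a HashMap<String, u16> shows the same contract.
struct Interner {
    map: HashMap<String, u16>,
    values: Vec<String>,
}

impl Interner {
    fn new() -> Self {
        Self { map: HashMap::new(), values: Vec::new() }
    }

    /// Returns the index for `s`, or None once the segment's u16 intern
    /// space (65,535 entries) is exhausted.
    fn intern(&mut self, s: &str) -> Option<u16> {
        if let Some(&idx) = self.map.get(s) {
            return Some(idx);
        }
        if self.values.len() >= u16::MAX as usize {
            return None; // would surface as DbError::InternTableFull
        }
        let idx = self.values.len() as u16;
        self.map.insert(s.to_owned(), idx);
        self.values.push(s.to_owned());
        Some(idx)
    }
}

fn main() {
    let mut t = Interner::new();
    let a = t.intern("alice").unwrap();
    let b = t.intern("bob").unwrap();
    assert_eq!(t.intern("alice").unwrap(), a); // repeated values reuse the index
    assert_ne!(a, b);
}
```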
Read path
- Segment-level time skip using min_ts / max_ts from meta.json before any mmap is opened.
- Bloom filter check operates directly on mmap'd bytes — no struct copy.
- Zero-copy column access via zerocopy::FromBytes.
- Lazy intern decompression — filterless aggregates skip the intern table entirely.
- Unchecked indexing in scan loops; bounds guaranteed by column slice length.
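Because each segment's ts_ns column is sorted, a time-range read can binary-search its window instead of scanning. A minimal sketch of that step (the function name is illustrative, not the crate's API):

```rust
/// Find the half-open row range whose timestamps fall in [from_ns, to_ns],
/// using two binary searches over the sorted ts_ns column.
fn time_window(ts: &[i64], from_ns: i64, to_ns: i64) -> std::ops::Range<usize> {
    let start = ts.partition_point(|&t| t < from_ns); // first row >= from_ns
    let end = ts.partition_point(|&t| t <= to_ns);    // first row > to_ns
    start..end
}

fn main() {
    let ts = [10, 20, 30, 40, 50];
    assert_eq!(time_window(&ts, 15, 45), 1..4); // rows with ts in [15, 45]
}
```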
SIMD & scan
Hot scan kernels compile to AVX2 when the target supports it and fall back to scalar code otherwise.
- sum_i64(col: &[i64]) -> i128 (avx2) — horizontal sum over the metric column. 4 lanes × 256-bit accumulators.
- sum_u32_as_u64(col: &[u32]) -> u64 (avx2) — widening sum for counter columns; avoids overflow on long segments.
- count_eq_u16(col: &[u16], needle: u16) -> usize (avx2) — vectorised equality count for status and flag columns.
- filtered_aggregate(…) (avx2) — combined mask + sum pass: filter by dim index, sum metric in a single linear scan.
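The scalar fallbacks for the first three kernels are straightforward; sketches consistent with the signatures above (the AVX2 paths must produce identical results):

```rust
/// Scalar fallback for sum_i64: widen to i128 so no input can overflow.
fn sum_i64(col: &[i64]) -> i128 {
    col.iter().map(|&v| v as i128).sum()
}

/// Scalar fallback for sum_u32_as_u64: widening sum over a counter column.
fn sum_u32_as_u64(col: &[u32]) -> u64 {
    col.iter().map(|&v| v as u64).sum()
}

/// Scalar fallback for count_eq_u16: equality count for status/flag columns.
fn count_eq_u16(col: &[u16], needle: u16) -> usize {
    col.iter().filter(|&&v| v == needle).count()
}

fn main() {
    // Widening means even extreme inputs cannot overflow the accumulator.
    assert_eq!(sum_i64(&[i64::MAX, i64::MAX]), 2 * i64::MAX as i128);
    assert_eq!(sum_u32_as_u64(&[u32::MAX; 3]), 3 * u32::MAX as u64);
    assert_eq!(count_eq_u16(&[200, 404, 200, 500], 200), 2);
}
```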
Configuration
| field | type | default | description |
|---|---|---|---|
| data_dir | PathBuf | — | Directory to hold WAL + segments. Created if missing. |
| wal_max_events | u32 | 500_000 | Events per segment before rotation. |
| wal_sync_interval | u32 | 64 | WAL fsync interval, in events. Set to 1 to fsync on every append. |
| bloom_bits | u32 | 1024 | Bits of bloom per segment. Higher = fewer false positives. |
| compress_level | i32 | 3 | zstd level for intern table + variable data. |
Crash recovery
Engine::open() scans the data directory in this order:
1. Load meta.json — if missing, rebuild from segment headers.
2. Validate each .kseg header magic; move corrupt files aside.
3. Replay wal.log frame-by-frame, CRC-checked; truncate at the first bad frame.
4. Write replayed events into a new segment; rewrite meta.json.
Recovery is single-threaded and proportional to WAL size. For a default 64-event sync interval, recovery processes tens of thousands of events per second.
Sizing & limits
| limit | value | why |
|---|---|---|
| events / segment | 2³¹ | u32 row indices throughout the column layout. |
| unique strings / dim / segment | 65_535 | u16 intern index. Exceeding triggers early rotation. |
| payload size | — | Unbounded, but compressed together; aim for < 4 KB typical. |
| concurrent writers | 1 | Single-writer by design. Wrap Engine in an Arc<Mutex> for multi-producer. |
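The multi-producer pattern from the last row, sketched with a stand-in buffer in place of the (single-writer) Engine:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Multiple producer threads serialize their appends through one Mutex,
/// as the sizing table suggests for multi-producer use. A Vec stands in
/// for the Engine here; in real code each push would be engine.append(&e).
fn parallel_append(producers: usize, events_each: usize) -> usize {
    let engine = Arc::new(Mutex::new(Vec::<usize>::new())); // stand-in buffer
    let mut handles = Vec::new();
    for t in 0..producers {
        let engine = Arc::clone(&engine);
        handles.push(thread::spawn(move || {
            for i in 0..events_each {
                engine.lock().unwrap().push(t * events_each + i);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let n = engine.lock().unwrap().len();
    n
}

fn main() {
    assert_eq!(parallel_append(4, 1000), 4000);
}
```

Lock contention bounds throughput here; producers that buffer locally and call append_batch under the lock will hold it far less often.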