Chapter 06

Performance

Real numbers from a benchmark you can run yourself. ~100M docs per node, effectively unlimited with sharding: 1B and 1T are real, intended scale points. Limits, bottlenecks, and the fix for each.

Real numbers (250,000 documents, modern laptop)

From the included benchmark suite. Workload: airline order documents (~500 bytes each) with 5 indexes.

| Operation | Throughput / Latency |
| --- | --- |
| Bulk insert | ~70,000 docs/sec sustained |
| Point lookup (unique indexed PNR) | ~210,000 QPS, avg <1ms |
| Status = 'CONFIRMED' LIMIT 100 | ~3,300 QPS, avg 0.3ms |
| Departure airport LIMIT 50 | ~6,400 QPS, avg 0.15ms |
| Range query (fare > 1500) LIMIT 100 | ~144 QPS, avg 7ms |
| First name scan (no index) LIMIT 10 | ~830 QPS, avg 1.2ms |
| Collection scan (no LIMIT, ~8K results) | ~3 QPS, avg 360ms |

2.6 million queries executed in 5 minutes during the query benchmark. On one laptop. In C#.

Scale: single node

Based on the 250K benchmark, here's what to expect per node, i.e. one .dfdb file served by one dfdb serve process:

| Scale | File size | RAM usage | Insert rate | Indexed lookup | Status |
| --- | --- | --- | --- | --- | --- |
| 1M docs | ~500 MB | ~500 MB | 60–70K/sec | 200K+ QPS | 🚀 Trivial |
| 10M docs | ~5 GB | ~2–3 GB | 40–60K/sec | 150–200K QPS | 💪 Comfortable |
| 50M docs | ~25 GB | ~12 GB | 20–30K/sec | 80–120K QPS | ⚡ Sweet spot ends |
| 100M docs | ~50 GB | ~25 GB | 10–20K/sec | 50–80K QPS | ⚠️ Needs 32GB+ |
| 500M docs | ~250 GB | >100 GB | limited | cache-bound | ❌ Shard it |

The cliff at ~100M on a single node comes from three in-RAM structures: the location map (~24 bytes/doc), the in-memory B-tree (~32–48 bytes/entry × number of indexes), and the working-set page cache. Once your indexes don't fit in RAM, performance tanks. See Bottlenecks below for the fixes.
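
A rough back-of-envelope using those figures and the benchmark's 5 indexes: at 100M documents, the location map alone is 100M × 24 B ≈ 2.4 GB, and the in-memory B-trees add roughly 100M × 5 × 40 B ≈ 20 GB, which already accounts for most of the ~25 GB in the table before the page cache holds a single page.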

Per node, not per database. The 100M figure is what one process handles comfortably. Your database can be much bigger; that's what sharding is for. Keep reading.

Scale: with sharding

Consistent-hash sharding splits a collection across N nodes. Each node only carries 1/N of the documents, 1/N of each index, and 1/N of the hot set. There is no architectural cap on total document count; scale is bounded by how many nodes you want to run.

| Shards | Docs (rough) | Aggregate data | Aggregate inserts | Indexed lookup | Hardware profile |
| --- | --- | --- | --- | --- | --- |
| 3 | 300 M | ~150 GB | ~150 K/sec | ~600 K QPS | 3 × 32 GB / 500 GB NVMe |
| 10 | 1 B | ~500 GB | ~500 K/sec | ~1.5–2 M QPS | 10 × 32 GB / 500 GB NVMe |
| 100 | 10 B | ~5 TB | ~5 M/sec | ~15–20 M QPS | 100 × 64 GB / 1 TB NVMe |
| 1,000 | 1 T | ~500 TB | ~10 M/sec (wire-bound) | ~50–100 M QPS | 1,000 × 128 GB / 2 TB NVMe |

A few things hold across all rows: a lookup by the shard key always hits exactly one node, each node is still commodity hardware, and aggregate insert and lookup throughput grow roughly with the shard count until the network itself becomes the limit (the wire-bound row).
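
The routing idea itself is small; the sketch below shows it in illustrative C# (ShardRing, GetNode, and the MD5 choice are hypothetical, not the actual client API). Each node is hashed onto a ring at several virtual points, and a document goes to the first node clockwise from the hash of its shard key.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative consistent-hash routing; names and details are hypothetical,
// not DocumentForge's actual routing code.
public sealed class ShardRing
{
    private readonly SortedDictionary<uint, string> _ring = new();

    public ShardRing(IEnumerable<string> nodes, int virtualPointsPerNode = 100)
    {
        // Many virtual points per node smooth the split out to roughly 1/N each.
        foreach (var node in nodes)
            for (int i = 0; i < virtualPointsPerNode; i++)
                _ring[Hash($"{node}#{i}")] = node;
    }

    // Route a shard-key value (a PNR, deviceId, customerId, ...) to its owning node.
    public string GetNode(string shardKey)
    {
        uint h = Hash(shardKey);
        foreach (var point in _ring)
            if (point.Key >= h) return point.Value;   // first point clockwise from the key's hash
        foreach (var point in _ring) return point.Value; // wrap around the ring
        throw new InvalidOperationException("empty ring");
    }

    private static uint Hash(string s) =>
        BitConverter.ToUInt32(MD5.HashData(Encoding.UTF8.GetBytes(s)), 0);
}

The property that matters operationally: adding or removing a node only remaps the keys that hashed to that node's points, roughly 1/N of the collection, rather than reshuffling everything.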

What 1 billion documents looks like

10 shards × ~100M each. This is the comfortable "big but not absurd" tier, plausibly a single team's production workload.

AIR

Global airline reservations

Every PNR booked worldwide in ~2 years. Lookup by PNR is a single-shard hit. Passenger-name search fans out to 10 nodes and is still sub-second. Exactly the Sabre/Amadeus workload this design was inspired by.

IoT

Connected-device telemetry

1B events works out to only ~16 events/sec sustained for 24 months, or ~32/sec for 1 year. Shard on deviceId: per-device history is one node; fleet-wide queries fan out.

ORD

E-commerce order history

Amazon-scale order history is around this mark. Shard on customerId: a customer's entire order history lives on one node. Order-by-date global queries use the fan-out path.

LOG

Application audit log

Every API call to a mid-sized SaaS for 5–10 years. Shard on tenantId for GDPR-clean per-customer data residency.

GAM

Game match history

Every match ever played in a popular online game. Shard on playerId: the player's personal history is always one-node. Leaderboards use a replicated collection.

FIN

Transaction ledger

A mid-sized bank's full transaction log for a decade. Shard on accountId. Point queries by transaction ID are one-shard; daily-settlement aggregates are batch fan-out.

Infrastructure reality check at 1B: 10 boxes with ~32–64 GB RAM and a terabyte of fast NVMe each. Fits in a single rack. Runs across three AZs with a follower per shard for read scale and auto-failover. Replication secret on, TLS on, API keys on. Boring, dependable, ~$2–5K/month in commodity cloud.

What 1 trillion documents looks like

1,000 shards × ~1B each, or 10,000 × ~100M. At this scale you are firmly in "serious operational commitment" territory, but the architecture does not break, and nothing in the engine cares whether N is 10 or 10,000.

CAR

Every connected vehicle on Earth

~1.5B connected cars, ~700 events per car per year ≈ 1T events/year. Shard on vin. Per-car history is one-node (for the insurance adjuster, the OTA update planner, the fleet manager). Aggregate fleet metrics are batch analytical jobs.

FEED

Social-network timeline events

Every post, like, share, and view for a billion-user network over ~2 years. Shard on userId. The "show me my timeline" path is a one-shard-per-follower fan-out; trending topics are a separate, derived pipeline.

PAY

Global payment processor ledger

Every card swipe, every transfer, for ~5 years. Shard on merchantId or cardHash. Point lookups for dispute/refund workflows stay sub-millisecond even as the ledger grows.

SENS

Smart-city sensor mesh

Every traffic, air-quality, and utility sensor in a major metro, every minute, for a decade. Shard on sensorId + time bucket. Historical per-sensor queries hit one shard; heat-maps are batch aggregations.

GEN

Genomics variant store

Every human SNP Γ— every patient in a research biobank. Shard on patientId. Per-patient studies are one-shard; population-wide cohorts are deliberate fan-out analytical workloads.

DOC

Archival document / email store

Every email sent through a Tier-1 email provider for 7 years. Shard on mailboxId. The user's "find an old email" path is one-shard; legal-hold exports are fan-out.

Infrastructure reality check at 1T: 500–2,000 nodes depending on shape. Multi-DC replication with planned handover for maintenance windows. Tiered storage (hot NVMe, warm SATA, cold object store) becomes meaningful at this scale. Expect to invest seriously in observability, per-shard dashboards, and a rebalance schedule. The point is not that this is easy; it's that the database doesn't introduce any ceiling below here. You will hit wire, disk, budget, or ops ceilings before you hit a DocumentForge one.

Bottlenecks

Each bottleneck has a known fix. Some are already implemented, some are roadmap.

| Limit | Hits at | Fix | Status |
| --- | --- | --- | --- |
| BSON allocation overhead | Bulk insert ceiling | ArrayPool + stackalloc | ✓ Partial |
| Page cache churn | Data > cache size | Increase CacheSizeInPages | ✓ Configurable |
| Location map in RAM | ~50M docs | Move to persistent structure | 🗺️ Roadmap |
| In-memory B-tree | ~50M docs | True on-disk B-tree (SQLite style) | 🗺️ Roadmap |
| Single global write lock | Hundreds of concurrent writers | Page-level locking / MVCC | 🗺️ Roadmap |
| Single-file per node | ~100M docs / ~50 GB | Consistent-hash sharding across nodes | ✅ Shipped |

Tuning

Cache size

The single most impactful knob. Set CacheSizeInPages so that ~20% of your data fits in cache. Each page is 8KB.

var options = new DatabaseOptions { CacheSizeInPages = 100_000 }; // 800 MB cache (100,000 pages × 8 KB)
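
For example, at the 50M-doc tier (~25 GB of data in the table above), 20% is about 5 GB, which is a CacheSizeInPages of roughly 650,000 (5 GB ÷ 8 KB).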

Indexes

Add indexes for every field you filter on. Remove indexes you don't use; they slow down inserts. Check query plans during development.

LIMIT early

A LIMIT without an ORDER BY is pushed into the scan; we stop as soon as we've found enough. Adding LIMIT 100 can turn a 3-second query into a 0.3ms one.

Disable WAL for bulk load

If you're doing a one-time bulk import and can restart from the beginning on failure, disabling the WAL during that load doubles throughput:

var options = new DatabaseOptions { EnableWal = false }; // NO crash safety during load

How it gets this fast

A few specific decisions do most of the heavy lifting.

Direct addressing via location map

Indexes store DocumentId → (PageId, SlotIndex). Once the index finds your document's ID, we jump straight to the exact page and slot. No scan. This is the same trick the 1960s-era mainframe reservation systems (TPF, Sabre) used, and it's still unbeatable.
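
The shape of that structure, as a stripped-down illustration rather than the engine's real types:

using System.Collections.Generic;

// Illustrative only; the real location map lives inside the engine
// (and, per the Bottlenecks table, is slated to move to a persistent structure).
public readonly record struct DocLocation(long PageId, int SlotIndex);

public sealed class LocationMap
{
    // DocumentId -> (PageId, SlotIndex): one dictionary probe, then one page read.
    private readonly Dictionary<long, DocLocation> _map = new();

    public void Put(long documentId, DocLocation location) => _map[documentId] = location;

    public bool TryGet(long documentId, out DocLocation location) =>
        _map.TryGetValue(documentId, out location);
}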

Page-based storage

Data lives in 8KB pages. A slotted layout lets us insert, update, and delete within a page without rewriting the whole file. Fixed sizes mean the OS page cache and our own LRU cache play nicely together.
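
Conceptually, a slotted page looks like this (an illustrative sketch, not the exact on-disk format):

// Conceptual layout of one slotted 8 KB page (illustrative; the actual
// header and slot encoding may differ).
//
//   +--------+---------+---------+----- free space -----+--------+--------+
//   | header | slot[0] | slot[1] |                       | rec 1  | rec 0  |
//   +--------+---------+---------+-----------------------+--------+--------+
//
// The slot directory grows forward, record payloads grow backward from the end.
// A slot is a tiny (offset, length) pair, so a record can move or grow inside the
// page without changing its slot index, which keeps (PageId, SlotIndex) references
// in the location map stable.
public static class SlottedPage
{
    public const int PageSize = 8 * 1024;

    public readonly record struct Slot(ushort Offset, ushort Length);
}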

Persistent indexes

Indexes are stored as append-only page chains on disk. On startup, we stream them back into memory without rebuilding from scratch. What used to take 7 seconds for 250K docs now takes a few hundred milliseconds.

LIMIT pushdown

The executor stops as soon as it's collected LIMIT + OFFSET matching documents. This turns queries that would fetch millions of rows into queries that fetch exactly what you asked for.
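
The early exit is only a few lines of logic; the sketch below shows the idea with illustrative types, not the actual executor code:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of LIMIT/OFFSET pushdown: stop scanning the moment we have enough.
static class LimitPushdown
{
    public static List<TDoc> ScanWithLimit<TDoc>(
        IEnumerable<TDoc> scan, Func<TDoc, bool> predicate, int offset, int limit)
    {
        var matches = new List<TDoc>();
        foreach (var doc in scan)
        {
            if (!predicate(doc)) continue;
            matches.Add(doc);
            if (matches.Count >= offset + limit) break;  // early exit: the rest of the collection is never touched
        }
        return matches.Skip(offset).Take(limit).ToList(); // drop the OFFSET rows, keep LIMIT
    }
}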

Hash joins

JOINs build a hash map of the smaller collection, then stream the larger one through it. O(n + m) instead of O(n × m).
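
The classic build-and-probe shape, sketched here for illustration (not the engine's actual join operator):

using System;
using System.Collections.Generic;

// Hash join sketch: build a hash table over the smaller side,
// then stream the larger side through it once.
static class HashJoin
{
    public static IEnumerable<(TLeft, TRight)> Join<TLeft, TRight, TKey>(
        IEnumerable<TLeft> smaller, Func<TLeft, TKey> leftKey,
        IEnumerable<TRight> larger, Func<TRight, TKey> rightKey)
        where TKey : notnull
    {
        // Build phase: one pass over the smaller collection.
        var table = new Dictionary<TKey, List<TLeft>>();
        foreach (var l in smaller)
        {
            if (!table.TryGetValue(leftKey(l), out var bucket))
                table[leftKey(l)] = bucket = new List<TLeft>();
            bucket.Add(l);
        }

        // Probe phase: stream the larger collection, emit matching pairs.
        foreach (var r in larger)
            if (table.TryGetValue(rightKey(r), out var bucket))
                foreach (var l in bucket)
                    yield return (l, r);
    }
}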

Run it yourself

The repository includes a full benchmark. It creates a fresh database, inserts 250K documents with 5 indexes, runs 5 different query types for a minute each, then tests the reopen-with-persistent-indexes flow.

dotnet run --project samples/DocumentForge.Benchmark --configuration Release

Expect total runtime of about 5–7 minutes. The output shows insert rate, query throughput per pattern, and reopen performance.

Your numbers will vary. CPU, disk, RAM, and OS all matter. A fast NVMe SSD and 32GB of RAM will produce materially better results than a 5-year-old laptop with an HDD. The relative ordering (PNR > airport > status > range > scan) will be the same.