Real numbers (250,000 documents, modern laptop)
From the included benchmark suite. Workload: airline order documents (~500 bytes each) with 5 indexes.
| Operation | Throughput / Latency |
|---|---|
| Bulk insert | ~70,000 docs/sec sustained |
| Point lookup (unique indexed PNR) | ~210,000 QPS, avg <1ms |
| Status = 'CONFIRMED' LIMIT 100 | ~3,300 QPS, avg 0.3ms |
| Departure airport LIMIT 50 | ~6,400 QPS, avg 0.15ms |
| Range query (fare > 1500) LIMIT 100 | ~144 QPS, avg 7ms |
| First name scan (no index) LIMIT 10 | ~830 QPS, avg 1.2ms |
| Collection scan (no LIMIT, ~8K results) | ~3 QPS, avg 360ms |
2.6 million queries executed in 5 minutes during the query benchmark. On one laptop. In C#.
Scale: single node
Based on the 250K benchmark, here's what to expect per node, i.e. one .dfdb file served by one dfdb serve process:
| Scale | File size | RAM usage | Insert rate | Indexed lookup | Status |
|---|---|---|---|---|---|
| 1M docs | ~500 MB | ~500 MB | 60-70K/sec | 200K+ QPS | Trivial |
| 10M docs | ~5 GB | ~2-3 GB | 40-60K/sec | 150-200K QPS | Comfortable |
| 50M docs | ~25 GB | ~12 GB | 20-30K/sec | 80-120K QPS | Sweet spot ends |
| 100M docs | ~50 GB | ~25 GB | 10-20K/sec | 50-80K QPS | Needs 32GB+ |
| 500M docs | ~250 GB | >100 GB | limited | cache-bound | Shard it |
The cliff at ~100M on a single node comes from three in-RAM structures: the location map (~24 bytes/doc), the in-memory B-tree (~32-48 bytes/entry × number of indexes), and the working-set page cache. Once your indexes don't fit in RAM, performance tanks. See Bottlenecks below for the fixes.
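As a sanity check, plugging those per-structure estimates into a quick calculation (the constants below are the rough figures quoted above, so treat the result as an order-of-magnitude estimate, not a measurement):

```csharp
// Back-of-the-envelope RAM estimate for a 100M-doc node with 5 indexes.
long docs        = 100_000_000;
int  indexes     = 5;
long locationMap = docs * 24;                 // ~24 bytes per document
long indexTrees  = docs * indexes * 40;       // ~32-48 bytes per entry, call it 40
long pageCache   = 100_000L * 8192;           // CacheSizeInPages x 8KB pages
long total       = locationMap + indexTrees + pageCache;

Console.WriteLine($"{total / 1e9:F1} GB resident");   // ~23 GB -> hence "Needs 32GB+"
```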
Scale: with sharding
Consistent-hash sharding splits a collection across N nodes. Each node only carries 1/N of the documents, 1/N of each index, and 1/N of the hot set. There is no architectural cap on total document count; scale is bounded by how many nodes you want to run.
| Shards | Docs (rough) | Aggregate data | Aggregate inserts | Indexed lookup | Hardware profile |
|---|---|---|---|---|---|
| 3 | 300M | ~150 GB | ~150K/sec | ~600K QPS | 3 × 32 GB / 500 GB NVMe |
| 10 | 1B | ~500 GB | ~500K/sec | ~1.5-2M QPS | 10 × 32 GB / 500 GB NVMe |
| 100 | 10B | ~5 TB | ~5M/sec | ~15-20M QPS | 100 × 64 GB / 1 TB NVMe |
| 1,000 | 1T | ~500 TB | ~10M/sec (wire-bound) | ~50-100M QPS | 1,000 × 128 GB / 2 TB NVMe |
A few things hold across all rows:
- Point lookups on the shard key route to one node. Latency stays sub-millisecond regardless of cluster size; the consistent-hash ring lookup is O(log N) (see the routing sketch after this list).
- Writes route to one shard. Aggregate write throughput scales linearly with shard count, bounded by how much write traffic you can put on the wire.
- Queries without the shard key fan out to every shard. Still useful at 10 shards, painful at 1,000; keep hot paths on the shard key.
- Cross-shard JOINs re-gather at the coordinator. Think of them as analytical queries, not OLTP.
- Rebalance is online. Adding shards moves data in the background; reads and writes continue throughout. See the Sharding guide.
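To make the routing concrete, here is a minimal consistent-hash ring sketch. The HashRing class, the MD5-derived hash, and the virtual-point count are illustrative choices rather than DocumentForge's actual implementation; the point is that routing a key is a binary search over a sorted array of ring positions.

```csharp
using System.Security.Cryptography;
using System.Text;

// Illustrative consistent-hash ring: each node owns many virtual points; a shard key
// routes to the first point at or after its hash (wrapping around), found in O(log N).
class HashRing
{
    private readonly List<uint> _points = new();
    private readonly Dictionary<uint, string> _owner = new();

    public HashRing(IEnumerable<string> nodes, int virtualPoints = 128)
    {
        foreach (var node in nodes)
            for (int i = 0; i < virtualPoints; i++)
            {
                uint h = Hash($"{node}#{i}");
                _points.Add(h);
                _owner[h] = node;
            }
        _points.Sort();
    }

    public string NodeFor(string shardKey)
    {
        uint h = Hash(shardKey);
        int idx = _points.BinarySearch(h);
        if (idx < 0) idx = ~idx;                 // first point >= h
        if (idx == _points.Count) idx = 0;       // wrap around the ring
        return _owner[_points[idx]];
    }

    private static uint Hash(string s) =>
        BitConverter.ToUInt32(MD5.HashData(Encoding.UTF8.GetBytes(s)), 0);
}

// Usage: new HashRing(new[] { "node-a", "node-b", "node-c" }).NodeFor("PNR-X1Y2Z3")
```

Virtual points keep keys spread evenly across nodes, which is also why adding a shard only has to move roughly 1/N of the data during rebalance.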
What 1 billion documents looks like
10 shards × ~100M each. This is the comfortable "big but not absurd" tier: plausibly a single team's production workload.
Global airline reservations
Every PNR booked worldwide in ~2 years. Lookup by PNR is a single-shard hit. Passenger-name search fans out to 10 nodes and is still sub-second. Exactly the Sabre/Amadeus workload this design was inspired by.
Connected-device telemetry
1B events is ~11,500 events/sec sustained for a day, or ~32 events/sec for a year. Shard on deviceId: per-device history is one node; fleet-wide queries fan out.
E-commerce order history
Amazon-scale order history is around this mark. Shard on customerId so a customer's entire order history lives on one node. Order-by-date global queries use the fan-out path.
Application audit log
Every API call to a mid-sized SaaS for 5β10 years. Shard on tenantId for GDPR-clean per-customer data residency.
Game match history
Every match ever played in a popular online game. Shard on playerId so a player's personal history is always a single-node query. Leaderboards use a replicated collection.
Transaction ledger
A mid-sized bank's full transaction log for a decade. Shard on accountId. Point queries by transaction ID are one-shard; daily-settlement aggregates are batch fan-out.
What 1 trillion documents looks like
1,000 shards × ~1B each, or 10,000 × ~100M. At this scale you are firmly in "serious operational commitment" territory, but the architecture does not break, and nothing in the engine cares whether N is 10 or 10,000.
Every connected vehicle on Earth
~1.5B connected cars at a couple of events per day each is ~1T events/year. Shard on vin. Per-car history is one node (for the insurance adjuster, the OTA update planner, the fleet manager). Aggregate fleet metrics are batch analytical jobs.
Social-network timeline events
Every post, like, share, and view for a billion-user network over ~2 years. Shard on userId. The "show me my timeline" path is a one-shard-per-follower fan-out; trending topics is a separate, derived pipeline.
Global payment processor ledger
Every card swipe, every transfer, for ~5 years. Shard on merchantId or cardHash. Point lookups for dispute/refund workflows stay sub-millisecond even as the ledger grows.
Smart-city sensor mesh
Every traffic, air-quality, and utility sensor in a major metro, every minute, for a decade. Shard on sensorId + time bucket. Historical per-sensor queries hit one shard; heat-maps are batch aggregations.
Genomics variant store
Every human SNP × every patient in a research biobank. Shard on patientId. Per-patient studies are one-shard; population-wide cohorts are deliberate fan-out analytical workloads.
Archival document / email store
Every email sent through a Tier-1 email provider for 7 years. Shard on mailboxId. The user's "find an old email" path is one-shard; legal-hold exports are fan-out.
Bottlenecks
Each bottleneck has a known fix. Some are already implemented, some are roadmap.
| Limit | Hits at | Fix | Status |
|---|---|---|---|
| BSON allocation overhead | Bulk insert ceiling | ArrayPool + stackalloc | Partial |
| Page cache churn | Data > cache size | Increase CacheSizeInPages | Configurable |
| Location map in RAM | ~50M docs | Move to persistent structure | Roadmap |
| In-memory B-tree | ~50M docs | True on-disk B-tree (SQLite style) | Roadmap |
| Single global write lock | Hundreds of concurrent writers | Page-level locking / MVCC | Roadmap |
| Single-file per node | ~100M docs / ~50 GB | Consistent-hash sharding across nodes | Shipped |
Tuning
Cache size
The single most impactful knob. Set CacheSizeInPages so that ~20% of your data fits in cache. Each page is 8KB.
new DatabaseOptions { CacheSizeInPages = 100_000 } // 800MB cache
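If you'd rather derive the number than pick one, a rough sizing calculation (the 20% target comes from above; the data-file size and the helper arithmetic are assumptions for illustration):

```csharp
// Illustrative sizing: point the 8KB-page cache at ~20% of the data file.
long dataBytes = 5L * 1024 * 1024 * 1024;               // e.g. a ~5 GB .dfdb file
int  pages     = (int)(dataBytes * 0.20 / 8192);        // 131,072 pages, ~1 GB of cache
var  options   = new DatabaseOptions { CacheSizeInPages = pages };
```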
Indexes
Add indexes for every field you filter on. Remove indexes you don't use; they slow down inserts. Check query plans during development.
LIMIT early
A LIMIT without an ORDER BY is pushed into the scan; we stop as soon as we've found enough. Adding LIMIT 100 can turn a 3-second query into a 0.3ms one.
Disable WAL for bulk load
If you're doing a one-time bulk import and can restart from the beginning on failure, disabling the WAL during that load doubles throughput:
new DatabaseOptions { EnableWal = false } // NO crash safety during load
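A sketch of the full load-then-reopen flow. Only DatabaseOptions.EnableWal is taken from above; Database.Open, GetCollection, InsertMany, and ReadSourceBatches are hypothetical names standing in for your actual open and ingestion code:

```csharp
// Bulk load with the WAL off (fast, NOT crash-safe), then reopen with it on.
var loadOptions = new DatabaseOptions { EnableWal = false };
using (var db = Database.Open("airline.dfdb", loadOptions))         // hypothetical API
{
    foreach (var batch in ReadSourceBatches(batchSize: 10_000))      // hypothetical reader
        db.GetCollection("orders").InsertMany(batch);                // hypothetical API
}

// Reopen with the WAL enabled before serving real traffic.
using var live = Database.Open("airline.dfdb", new DatabaseOptions { EnableWal = true });
```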
How it gets this fast
A few specific decisions do most of the heavy lifting.
Direct addressing via location map
Indexes store DocumentId → (PageId, SlotIndex). Once the index finds your document's ID, we jump straight to the exact page and slot. No scan. This is the same trick the 1960s-era mainframe reservation systems (TPF, Sabre) used, and it's still unbeatable.
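In code, the idea reduces to roughly this (types and names are illustrative, not the engine's):

```csharp
// An index lookup yields a document ID; the in-RAM location map turns that ID into the
// exact page and slot to read, so there is no scanning between "found" and "fetched".
readonly record struct DocLocation(int PageId, int SlotIndex);

class LocationMap
{
    private readonly Dictionary<long, DocLocation> _map = new();    // ~24 bytes per entry

    public void Put(long docId, int pageId, int slotIndex) =>
        _map[docId] = new DocLocation(pageId, slotIndex);

    public DocLocation Resolve(long docId) => _map[docId];          // O(1) jump to the page
}
```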
Page-based storage
Data lives in 8KB pages. A slotted layout lets us insert, update, and delete within a page without rewriting the whole file. Fixed sizes mean the OS page cache and our own LRU cache play nicely together.
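A toy version of the slotted layout (no page header, no free-space compaction; the real on-disk format is certainly richer), just to show that inserts and reads stay inside one fixed 8KB buffer:

```csharp
class SlottedPage
{
    public const int PageSize = 8192;
    private readonly byte[] _data = new byte[PageSize];
    private readonly List<(int Offset, int Length)> _slots = new();  // slot directory
    private int _freeEnd = PageSize;                                 // records grow downward

    public int Insert(ReadOnlySpan<byte> record)
    {
        if (record.Length > _freeEnd)
            throw new InvalidOperationException("page full");        // caller picks a new page
        _freeEnd -= record.Length;
        record.CopyTo(_data.AsSpan(_freeEnd, record.Length));
        _slots.Add((_freeEnd, record.Length));
        return _slots.Count - 1;                                     // this is the SlotIndex
    }

    public ReadOnlySpan<byte> Read(int slotIndex)
    {
        var (offset, length) = _slots[slotIndex];
        return _data.AsSpan(offset, length);
    }
}
```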
Persistent indexes
Indexes are stored as append-only page chains on disk. On startup, we stream them back into memory without rebuilding from scratch. What used to take 7 seconds for 250K docs now takes a few hundred milliseconds.
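Conceptually, reload is a walk along a page chain that refills the in-memory structure. This sketch uses an invented IndexPage record and a SortedDictionary as a stand-in for the B-tree, purely to show the streaming shape:

```csharp
// Follow the chain of index pages; -1 marks the end. No documents are touched.
void LoadIndex(Func<int, IndexPage> readPage, int firstPageId,
               SortedDictionary<string, long> tree)
{
    for (int pageId = firstPageId; pageId != -1; )
    {
        IndexPage page = readPage(pageId);
        foreach (var (key, docId) in page.Entries)
            tree[key] = docId;                  // refill the in-memory index
        pageId = page.NextPageId;
    }
}

record IndexPage(IReadOnlyList<(string Key, long DocId)> Entries, int NextPageId);
```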
LIMIT pushdown
The executor stops as soon as it's collected LIMIT + OFFSET matching documents. This turns queries that would fetch millions of rows into queries that fetch exactly what you asked for.
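The executor logic reduces to something like this generic sketch (not the actual executor code):

```csharp
// Stop scanning the moment OFFSET + LIMIT matches have been seen.
IEnumerable<T> ScanWithLimit<T>(IEnumerable<T> source, Func<T, bool> predicate,
                                int offset, int limit)
{
    int matched = 0;
    foreach (var doc in source)
    {
        if (!predicate(doc)) continue;
        if (matched++ < offset) continue;             // skip the OFFSET window
        yield return doc;
        if (matched >= offset + limit) yield break;   // LIMIT reached: stop the scan
    }
}
```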
Hash joins
JOINs build a hash map of the smaller collection, then stream the larger one through it. O(n + m) instead of O(n × m).
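This is the textbook two-phase shape; the generic sketch below is not the engine's join operator, but it shows where the O(n + m) comes from:

```csharp
IEnumerable<(TLeft, TRight)> HashJoin<TLeft, TRight, TKey>(
    IEnumerable<TLeft> smaller, IEnumerable<TRight> larger,
    Func<TLeft, TKey> leftKey, Func<TRight, TKey> rightKey) where TKey : notnull
{
    // Build phase: O(n) - hash the smaller collection by join key.
    var buckets = smaller.GroupBy(leftKey).ToDictionary(g => g.Key, g => g.ToList());

    // Probe phase: O(m) - one streaming pass over the larger collection.
    foreach (var right in larger)
        if (buckets.TryGetValue(rightKey(right), out var matches))
            foreach (var left in matches)
                yield return (left, right);
}
```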
Run it yourself
The repository includes a full benchmark. It creates a fresh database, inserts 250K documents with 5 indexes, runs 5 different query types for a minute each, then tests the reopen-with-persistent-indexes flow.
dotnet run --project samples/DocumentForge.Benchmark --configuration Release
Expect total runtime of about 5-7 minutes. The output shows insert rate, query throughput per pattern, and reopen performance.