Concepts
DocumentForge supports two replication modes. They solve different problems and can be used together.
| | Logical | Physical |
|---|---|---|
| Stream contents | Operations (Insert / Delete / Index) | Raw page bytes |
| Follower can serve queries | ✓ — indexes stay coherent | ✗ — indexes are stale |
| Follower file byte-identical | No (layout may differ) | Yes (byte-for-byte) |
| Best for | Read scale, planned handover | Hot backup, disaster recovery |
| Sequence numbers | Yes — guaranteed ordering | No |
| Catchup on reconnect | Yes — from last-seen seq | No (manual resync) |
Logical replication
On a write, the leader broadcasts an operation record to every connected follower. The follower runs the operation through its own engine, so indexes, location maps, and cached state all stay consistent with the data file.
```
         Leader                              Follower
      ┌──────────┐                        ┌──────────┐
write │  Insert  │ ──(seq 42, Insert)──▶  │  Insert  │
      │    ↓     │ ──(seq 43, Insert)──▶  │    ↓     │
      │  engine  │ ──(seq 44, CreIdx)──▶  │  engine  │
      │    ↓     │ ───(heartbeat)──────▶  │          │
      │   disk   │                        │   disk   │
      └──────────┘                        └──────────┘
```
Each op carries a monotonic seq. Followers persist the last-applied seq to a sidecar file. On reconnect, they report their position to the leader, which replays everything after it.
Physical replication
Page-level byte streaming: the follower ends up with a byte-identical data file. Use this when you want a hot backup, or when you plan to manually promote the follower to leader without any engine warmup.
Physical replication doesn't solve the "can the follower serve reads" problem — its in-memory structures (indexes, location map) were built from the original data and aren't updated by page writes. Close and reopen the follower to rebuild them.
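If you later promote a physically replicated follower, that reopen is the step that matters. A minimal sketch of just that step, assuming `follower` is an already-open handle whose data file has been kept current by the page stream and the stream itself has been stopped (starting and stopping it isn't shown here):

```csharp
// The on-disk file is byte-identical, but this handle's indexes and location
// map predate the page stream, so rebuild them by closing and reopening.
follower.Dispose();

using var promoted = DocumentForgeDb.OpenOrCreate("replica.dfdb");
var r = promoted.Execute("SELECT * FROM orders WHERE pnr = 'ABC123'"); // served from fresh indexes
```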
Setting it up (C# API)
Replication runs over a dedicated TCP port, separate from the HTTP API. At the moment it's wired in code — you write a tiny bootstrap program that opens the database and calls the replication methods. See From the CLI below for the current CLI story and the roadmap.
Leader
```csharp
using var leader = DocumentForgeDb.OpenOrCreate("prod.dfdb");
leader.StartLogicalReplicationServer(port: 5500);

// Normal writes broadcast automatically
leader.Insert("orders", @"{""pnr"": ""ABC123""}");
```
Follower (read replica)
```csharp
using var follower = DocumentForgeDb.OpenOrCreate("replica.dfdb");
follower.StartLogicalReplicationFollower("leader-host", 5500);

// Reads work correctly, with live indexes
var r = follower.Execute("SELECT * FROM orders WHERE pnr = 'ABC123'");
```
Observability
```csharp
leader.LeaderCurrentSeq            // latest seq assigned
leader.GetLogicalFollowerCount()   // how many followers are connected

follower.FollowerLastSeq           // last applied op
follower.LogicallyReplicatedOps    // running total
follower.GapsDetected              // non-zero means lost ops; investigate!
```
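For example, a minimal health check built only on the counters above; the thresholds are illustrative, and in a real deployment the leader and follower handles live in separate processes:

```csharp
// Leader side: make sure the expected number of followers is attached.
if (leader.GetLogicalFollowerCount() < 1)
    Console.Error.WriteLine("WARNING: no followers connected");

// Follower side: alert on gaps, log progress.
if (follower.GapsDetected > 0)
    Console.Error.WriteLine($"WARNING: {follower.GapsDetected} gaps detected, ops were lost");

Console.WriteLine(
    $"last applied seq: {follower.FollowerLastSeq}, ops replicated: {follower.LogicallyReplicatedOps}");
```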
From the CLI
dfdb serve can stand up a leader or follower directly — no custom code. Pass the role via flag, env var, or a replication block in node.json, and the node will bring up both the HTTP API and the replication wire when it starts.
Leader
```
dfdb serve --port 5000 --data-dir ./data/leader \
  --replication-role leader \
  --replication-port 5500 \
  --replication-secret repl_shared_xyz
```
The HTTP API listens on :5000; the replication wire listens on :5500. Followers connect to the replication port, not the HTTP port.
Follower
```
dfdb serve --port 5010 --data-dir ./data/replica \
  --replication-role follower \
  --leader-host localhost --leader-port 5500 \
  --replication-secret repl_shared_xyz
```
The follower opens its own HTTP API on :5010 (so your app can read from it), and connects outbound to leader-host:5500 for the replication stream.
Follower with auto-failover
```
dfdb serve --port 5010 --data-dir ./data/replica \
  --replication-role follower \
  --leader-host localhost --leader-port 5500 \
  --auto-failover-seconds 10
```
If the leader goes silent for 10 seconds, this follower promotes itself and begins accepting writes on the leader's replication port. Add --auto-failover-new-port N if you want it to take over on a different port. See Auto-failover for the trade-offs.
All-in-one via node.json
For production, put everything in a file. The replication block is honored by dfdb serve out of the box:
```json
{
  "nodeName": "replica-1",
  "port": 5010,
  "dataDir": "/var/lib/dfdb/replica",
  "security": {
    "replicationSecret": "repl_shared_xyz"
  },
  "replication": {
    "role": "follower",
    "leaderHost": "leader.internal",
    "leaderPort": 5500,
    "autoFailover": {
      "silenceSeconds": 10,
      "newLeaderPort": 5500
    }
  }
}
```
dfdb serve --config ./node.json
Env-var equivalents
Every flag has an env var, handy in containers:
| Flag | Env var |
|---|---|
| --replication-role | DFDB_REPLICATION_ROLE |
| --replication-port | DFDB_REPLICATION_PORT |
| --leader-host | DFDB_LEADER_HOST |
| --leader-port | DFDB_LEADER_PORT |
| --auto-failover-seconds | DFDB_AUTO_FAILOVER_SECONDS |
| --replication-secret | DFDB_REPLICATION_SECRET |
replicationSecret on the leader must match on every follower or the handshake is rejected. See Security → Replication secret.
Verifying it's working
On startup, the banner shows the active role:
```
dfdb serve
node: replica-1
data dir: /var/lib/dfdb/replica
listening: http://localhost:5010
security: API key OFF | TLS OFF | replication-secret ON
replication: FOLLOWER following leader.internal:5500 | auto-failover after 10s (new port :5500)
```
Then query /stats on both nodes — document counts should match once catchup completes.
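For example, a quick side-by-side check from C#; the node addresses come from the examples above, and since the /stats response shape isn't documented here the sketch just prints both payloads for comparison:

```csharp
using System;
using System.Net.Http;

using var http = new HttpClient();

// Fetch /stats from leader and follower and compare (or diff the JSON).
var leaderStats   = await http.GetStringAsync("http://localhost:5000/stats");
var followerStats = await http.GetStringAsync("http://localhost:5010/stats");

Console.WriteLine($"leader:   {leaderStats}");
Console.WriteLine($"follower: {followerStats}");
```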
Runtime control over HTTP
Everything above is also exposed via REST, so a management tool (or the Postman collection) can drive it. See the REST API reference for the full list:
```
GET  /replication/status
POST /replication/start-leader            { "port": 5500 }
POST /replication/start-follower          { "host": "leader.internal", "port": 5500 }
POST /replication/read-only               # planned handover step 1
POST /replication/promote                 { "port": 5500 }   # step 2
POST /replication/auto-failover/enable    { "silenceSeconds": 10, "newLeaderPort": 5500 }
POST /replication/auto-failover/disable
```
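A minimal sketch of driving these endpoints from C# with HttpClient; only the paths and bodies listed above are assumed, the base address is a placeholder, and any API-key or replication-secret headers your deployment requires are omitted:

```csharp
using System;
using System.Net.Http;
using System.Text;

using var http = new HttpClient { BaseAddress = new Uri("http://replica-1.internal:5010") };

// Inspect the node's current replication role and position.
Console.WriteLine(await http.GetStringAsync("/replication/status"));

// Attach this node to the leader's replication wire.
var body = new StringContent(
    "{ \"host\": \"leader.internal\", \"port\": 5500 }",
    Encoding.UTF8, "application/json");
var resp = await http.PostAsync("/replication/start-follower", body);
resp.EnsureSuccessStatusCode();
```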
Direct C# API
The CLI flags and HTTP endpoints map 1:1 to the C# methods, which remain available if you're embedding DocumentForge in your own process instead of running dfdb serve:
```csharp
db.StartLogicalReplicationServer(port: 5500, sharedSecret: "...");
db.StartLogicalReplicationFollower("leader-host", 5500, sharedSecret: "...");
db.EnableAutoFailover(newLeaderPort: 5500, silenceTimeout: TimeSpan.FromSeconds(10));
```
Catchup on reconnect
When a follower disconnects and reconnects:
- Follower loads its last-applied seq from disk (the .followerseq sidecar file)
- Follower sends a handshake to the leader: "I'm at seq N, please bring me up to date"
- Leader replays all ops with seq > N from its in-memory ring buffer
- Leader adds the follower to its live broadcast list
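To confirm that catchup has finished after a restart, compare the follower's applied position against the leader's; a short sketch using only the counters shown under Observability (the target seq and poll interval are illustrative):

```csharp
using System;
using System.Threading.Tasks;

// Restarted follower: reconnect, then wait until the replayed backlog is applied.
using var follower = DocumentForgeDb.OpenOrCreate("replica.dfdb");
follower.StartLogicalReplicationFollower("leader-host", 5500);

ulong targetSeq = 4400; // e.g. the leader's LeaderCurrentSeq, obtained out of band
while (follower.FollowerLastSeq < targetSeq)
    await Task.Delay(TimeSpan.FromMilliseconds(200));

Console.WriteLine($"caught up to seq {follower.FollowerLastSeq}");
```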
Planned handover
When you need to move the leader role — typically for maintenance or a datacenter move — use the planned-handover flow. It guarantees zero data loss because the new leader catches up fully before the old one releases the role.
```csharp
// 1. Old leader enters read-only and waits for new leader to catch up
ulong finalSeq = oldLeader.BeginPlannedHandover(
    followerLastSeqProbe: () => newLeader.FollowerLastSeq,
    timeout: TimeSpan.FromMinutes(5));

// 2. Old leader is now read-only. New leader is caught up.
//    Promote new leader.
newLeader.PromoteToLeader(port: 5500);

// 3. Direct clients to new leader's address.
//    Old leader can be shut down or repurposed as a follower.
```
Between steps 1 and 2, writes to oldLeader throw DocumentForgeException("Database is in read-only mode..."). After step 2, newLeader accepts writes.
Auto-failover
Planned handover is the safe path — zero data loss, no split-brain risk. For crashes (leader process died, host went dark, network partition), you can opt in to automatic promotion. A follower watches the leader's heartbeat stream and, if nothing arrives within a silence threshold, promotes itself to leader.
```csharp
using var follower = DocumentForgeDb.OpenOrCreate("replica.dfdb");
follower.StartLogicalReplicationFollower("leader-host", 5500);

// Enable auto-promotion if the leader goes silent for 10 seconds.
follower.EnableAutoFailover(
    newLeaderPort: 5500,
    silenceTimeout: TimeSpan.FromSeconds(10),
    onPromoted: port => Console.WriteLine($"Promoted to leader on :{port}"));
```
The watcher has a 3-second grace period after startup (so normal connect-time silence doesn't trigger promotion). Once promoted, the follower stops consuming from the old leader, opens its own replication server on newLeaderPort, and begins accepting writes.
Useful properties while running:
```csharp
follower.WasAutoFailoverPromoted   // true once it has taken over
follower.DisableAutoFailover()     // stop watching (safe to call any time)
```
Datacenter move runbook
A concrete recipe for moving a production database from Datacenter A to Datacenter B:
- Week before: provision the new DocumentForge instance in DC-B. Have it run as a logical follower of the DC-A leader. Verify FollowerLastSeq matches LeaderCurrentSeq (or is within seconds).
- Maintenance window (15 min): inform clients of brief write unavailability.
- T+0: call oldLeader.BeginPlannedHandover(...). Writes on DC-A now fail with a clear error.
- T+seconds: the call returns the finalSeq once DC-B is fully caught up.
- T+immediately: call newLeader.PromoteToLeader(port). DC-B is now the writable leader.
- T+now: flip your load-balancer / DNS / connection string to point at DC-B.
- Optional: reconfigure DC-A as a follower of DC-B for ongoing disaster recovery.
If something goes wrong:

1) If DC-B doesn't catch up within the timeout, BeginPlannedHandover throws and re-enables writes on DC-A. Abort the handover, investigate the network, retry.
2) If DC-B promotes while DC-A is still accepting writes, you'd have split-brain. The protocol prevents this: DC-A is read-only before DC-B is promoted. Never bypass that order.
3) Clients writing to DC-A during the window see an error; make sure your client code has retry logic that can fail over to DC-B (see the sketch below).
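A retry sketch for embedded (C# API) clients, under these assumptions: all writes go through one helper, the read-only rejection surfaces as the DocumentForgeException shown earlier, and resolving the current leader (config flip, DNS) is supplied by your own code:

```csharp
// Hypothetical helper: keeps retrying a write against whichever node the
// resolver currently reports as the leader.
async Task InsertWithFailover(Func<DocumentForgeDb> resolveLeader, string collection, string json)
{
    for (var attempt = 0; attempt < 5; attempt++)
    {
        try
        {
            resolveLeader().Insert(collection, json);
            return;
        }
        catch (DocumentForgeException) // e.g. "Database is in read-only mode..."
        {
            await Task.Delay(TimeSpan.FromSeconds(2)); // wait out promotion / config flip
        }
    }
    throw new TimeoutException("No writable leader within the retry budget.");
}
```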
Caveats and current limits
- Auto-failover has a small data-loss window. Ops the leader committed but hadn't yet broadcast when it died are lost. Planned handover avoids this — use it for maintenance.
- No built-in split-brain protection. Auto-failover has no witness/quorum — enable it on one follower only, or front it with your own fencing mechanism. Planned handover prevents split-brain by design (the old leader is read-only before the new one is promoted).
- No multi-master. Writes go to the leader only. You cannot write to two nodes simultaneously.
- Ring-buffered catchup. The leader keeps the last 10,000 ops for reconnecting followers. A follower that's been offline longer than that needs a fresh snapshot (manual file copy) to catch up.
- Client-side failover. DocumentForge doesn't include a client-side load balancer or discovery service. Your application needs to know which host is the current leader — typically via a config flip after promotion, DNS, or a small service-discovery layer in front.