Setting up replication

Run a leader, point one or more followers at it, and they stream the leader’s changes by sequence number — catching up automatically after a disconnect. dfdb serve can stand up a leader or follower directly, no custom code: pass the role via flag, env var, or a replication block in node.json, and the node brings up both the HTTP API and the replication wire when it starts.

Replication runs over a dedicated TCP port, separate from the HTTP API. For the underlying model — sequence numbers, catchup, failover trade-offs — see the replication concept page.

One shared secret on every node. Whatever value you set for replicationSecret on the leader must match on every follower or the handshake is rejected.
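
This page does not say how the secret should be produced. Assuming dfdb treats the value as an opaque string (the example value repl_shared_xyz suggests so, but that is an assumption), a long random value is a sensible choice. A minimal sketch using Python's standard library; the function name is illustrative only:

```python
import secrets


def new_replication_secret(num_bytes: int = 32) -> str:
    """Return a random hex string suitable for use as a shared secret."""
    # token_hex yields two hex characters per byte of randomness.
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Generate once, then give the same value to the leader and every follower.
    print(new_replication_secret())
```

Generate the value once and distribute it; generating a fresh value per node would make every handshake fail.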

Start a leader

The HTTP API listens on :5000; the replication wire listens on :5500. Followers connect to the replication port, not the HTTP port.


dfdb serve --port 5000 --data-dir ./data/leader \
           --replication-role leader \
           --replication-port 5500 \
           --replication-secret repl_shared_xyz

Attach a follower

The follower opens its own HTTP API on :5010 (so your app can read from it), and connects outbound to leader-host:5500 for the replication stream.


dfdb serve --port 5010 --data-dir ./data/replica \
           --replication-role follower \
           --leader-host localhost --leader-port 5500 \
           --replication-secret repl_shared_xyz

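On startup, the banner shows the active role. The example below comes from a follower started with the node.json configuration shown later on this page, with auto-failover enabled: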


dfdb serve
  node:        replica-1
  data dir:    /var/lib/dfdb/replica
  listening:   http://localhost:5010
  security:    API key OFF  |  TLS OFF  |  replication-secret ON
  replication: FOLLOWER  following leader.internal:5500  |  auto-failover after 10s (new port :5500)

Verify catchup

Query /stats on both nodes; document counts should match once catchup completes. Then write to the leader and confirm the new document appears on the follower:


curl http://localhost:5000/stats   # leader
curl http://localhost:5010/stats   # follower — counts converge after catchup
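
The manual check above can be automated by polling both nodes until the counts agree. The sketch below is a rough illustration, not part of dfdb. It assumes /stats returns a JSON object and that the document count lives in a top-level field; the field name documentCount is hypothetical, so check a real response and pass the right name.

```python
import json
import time
import urllib.request


def fetch_stats(base_url):
    """GET <base_url>/stats and parse the body as JSON (assumed format)."""
    with urllib.request.urlopen(base_url + "/stats", timeout=5) as resp:
        return json.load(resp)


def wait_for_catchup(leader_url, follower_url, count_field="documentCount",
                     timeout_s=30.0, poll_s=0.5, fetch=fetch_stats):
    """Poll both nodes until their counts match; False if the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        # count_field is a hypothetical name; adjust to the real /stats shape.
        leader_count = fetch(leader_url)[count_field]
        follower_count = fetch(follower_url)[count_field]
        if leader_count == follower_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling wait_for_catchup("http://localhost:5000", "http://localhost:5010") against the two nodes started earlier returns True once the follower has converged. Matching counts are only a coarse signal: they can agree while the leader is still taking writes, so pause writes first if you need a strict check.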

Test failover

Bring up the follower with auto-failover enabled. If the leader goes silent for 10 seconds, this follower promotes itself, begins accepting writes, and takes over replication on the same port number the leader used:


dfdb serve --port 5010 --data-dir ./data/replica \
           --replication-role follower \
           --leader-host localhost --leader-port 5500 \
           --replication-secret repl_shared_xyz \
           --auto-failover-seconds 10

Stop the leader process. After the silence window elapses, the follower promotes; confirm by writing to it. Add --auto-failover-new-port N to take over on a different port.

Auto-failover accepts a small data-loss window and has no built-in split-brain protection — enable it on one follower only. For maintenance moves, prefer planned handover (see below), which guarantees zero data loss.

All-in-one via node.json

For production, put everything in a file. The replication block is honored by dfdb serve out of the box:


{
  "nodeName": "replica-1",
  "port": 5010,
  "dataDir": "/var/lib/dfdb/replica",
  "security": {
    "replicationSecret": "repl_shared_xyz"
  },
  "replication": {
    "role": "follower",
    "leaderHost": "leader.internal",
    "leaderPort": 5500,
    "autoFailover": {
      "silenceSeconds": 10,
      "newLeaderPort": 5500
    }
  }
}


dfdb serve --config ./node.json
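
Because a mismatched replicationSecret causes the handshake to be rejected, it can be worth checking the config files before deploying them. The sketch below is a hypothetical pre-deploy check, not a dfdb feature. It reads only the security.replicationSecret key shown in the example above and ignores any value supplied by flag or env var.

```python
import json
from pathlib import Path


def read_replication_secret(config_path):
    """Return security.replicationSecret from a node.json file, or None."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("security", {}).get("replicationSecret")


def secrets_match(config_paths):
    """True when every file sets the same non-empty replicationSecret."""
    values = {read_replication_secret(p) for p in config_paths}
    # Exactly one distinct value, and it must not be missing or empty.
    return len(values) == 1 and None not in values and "" not in values
```

For example, secrets_match(["leader/node.json", "replica-1/node.json"]) returns False if either file omits the key or the two values differ. The file paths here are placeholders.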

Env-var equivalents

The replication flags have env-var equivalents, handy in containers:

Flag	Env var
`--replication-role`	`DFDB_REPLICATION_ROLE`
`--replication-port`	`DFDB_REPLICATION_PORT`
`--leader-host`	`DFDB_LEADER_HOST`
`--leader-port`	`DFDB_LEADER_PORT`
`--auto-failover-seconds`	`DFDB_AUTO_FAILOVER_SECONDS`
`--replication-secret`	`DFDB_REPLICATION_SECRET`
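
As an illustration of the table, the sketch below builds a follower's environment from those variables and launches dfdb serve with it. It assumes each env var accepts the same string its flag would, which this page implies but does not state. The helper names are made up for the example, and --port and --data-dir stay as flags because the table lists no env vars for them.

```python
import os
import subprocess


def follower_env(leader_host, leader_port, secret,
                 auto_failover_seconds=None, base_env=None):
    """Return an environment dict that configures a follower via env vars."""
    env = dict(os.environ if base_env is None else base_env)
    env["DFDB_REPLICATION_ROLE"] = "follower"
    env["DFDB_LEADER_HOST"] = leader_host
    env["DFDB_LEADER_PORT"] = str(leader_port)
    env["DFDB_REPLICATION_SECRET"] = secret
    if auto_failover_seconds is not None:
        env["DFDB_AUTO_FAILOVER_SECONDS"] = str(auto_failover_seconds)
    return env


def launch_follower(env, http_port=5010, data_dir="./data/replica"):
    """Start dfdb serve as a child process with the given environment."""
    return subprocess.Popen(
        ["dfdb", "serve", "--port", str(http_port), "--data-dir", data_dir],
        env=env,
    )
```

Passing the secret through the environment keeps it out of the process command line, which is often the reason to prefer env vars in containers.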

Runtime control over HTTP

Everything above is also exposed via REST, so a management tool can drive it. See the CLI / REST reference for the full list:


GET  /replication/status
POST /replication/start-leader       { "port": 5500 }
POST /replication/start-follower     { "host": "leader.internal", "port": 5500 }
POST /replication/read-only          # planned handover step 1
POST /replication/promote            { "port": 5500 }  # step 2
POST /replication/auto-failover/enable   { "silenceSeconds": 10, "newLeaderPort": 5500 }
POST /replication/auto-failover/disable
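
A management tool could call these endpoints with any HTTP client. The sketch below uses Python's standard library and makes several assumptions not confirmed on this page: that the endpoints live on the node's HTTP API port, that request bodies are sent as JSON with a Content-Type of application/json, and that no API key is required (matching "API key OFF" in the banner above). Response bodies are returned as raw text because their format is not documented here.

```python
import json
import urllib.request


def build_request(base_url, method, path, body=None):
    """Build a urllib Request for one of the /replication endpoints."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(base_url, method, path, body=None, timeout=5):
    """Send the request; return (status, body text). Raises HTTPError on 4xx/5xx."""
    req = build_request(base_url, method, path, body)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

For instance, call("http://localhost:5010", "GET", "/replication/status") reads the follower's status, and call("http://localhost:5010", "POST", "/replication/auto-failover/enable", {"silenceSeconds": 10, "newLeaderPort": 5500}) enables auto-failover on it.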

Direct C# API

The CLI flags and HTTP endpoints map 1:1 to the C# methods, which remain available if you’re embedding DocumentForge in your own process instead of running dfdb serve:


db.StartLogicalReplicationServer(port: 5500, sharedSecret: "...");
db.StartLogicalReplicationFollower("leader-host", 5500, sharedSecret: "...");
db.EnableAutoFailover(newLeaderPort: 5500, silenceTimeout: TimeSpan.FromSeconds(10));

Observability properties and methods:


leader.LeaderCurrentSeq          // latest seq assigned
leader.GetLogicalFollowerCount() // how many followers are connected
 
follower.FollowerLastSeq         // last applied op
follower.LogicallyReplicatedOps  // running total
follower.GapsDetected            // non-zero means lost ops — investigate!
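
If these values are exported to a monitoring script, a simple health classification might look like the sketch below. It assumes sequence numbers increase by one per operation, so the difference between LeaderCurrentSeq and FollowerLastSeq approximates the number of unapplied ops. The type, the status labels and the max_lag threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationSample:
    leader_current_seq: int  # value read from leader.LeaderCurrentSeq
    follower_last_seq: int   # value read from follower.FollowerLastSeq
    gaps_detected: int       # value read from follower.GapsDetected


def assess(sample, max_lag=1000):
    """Classify a sample as 'gaps', 'caught-up', 'behind' or 'far-behind'."""
    # Gaps take priority: a non-zero count means ops were lost.
    if sample.gaps_detected > 0:
        return "gaps"
    lag = max(0, sample.leader_current_seq - sample.follower_last_seq)
    if lag == 0:
        return "caught-up"
    return "behind" if lag <= max_lag else "far-behind"
```

A follower that reports "behind" briefly after a reconnect is expected during catchup; "gaps" is the state that warrants investigation.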

Planned handover (datacenter move)

For maintenance or a datacenter move, planned handover guarantees zero data loss — the new leader catches up fully before the old one releases the role. A concrete recipe for moving a production database from Datacenter A to Datacenter B:

Week before: provision the new DocumentForge instance in DC-B and run it as a logical follower of the DC-A leader. Verify that FollowerLastSeq matches LeaderCurrentSeq, or trails it by no more than a few seconds of writes.
Maintenance window (15 min): inform clients of brief write unavailability.
T+0: call oldLeader.BeginPlannedHandover(...). Writes on DC-A now fail with a clear error.
T+seconds: the call returns finalSeq once DC-B has fully caught up.
Immediately after: call newLeader.PromoteToLeader(port). DC-B is now the writable leader.
Then: flip your load balancer, DNS, or connection string to point at DC-B.
Optional: reconfigure DC-A as a follower of DC-B for ongoing disaster recovery.

What could go wrong, and how you’d detect it:

If DC-B doesn’t catch up within the timeout, BeginPlannedHandover throws and re-enables writes on DC-A. Abort the handover, investigate the network, retry.
If DC-B promotes while DC-A is still accepting writes, you’d have split-brain. The protocol prevents this: DC-A is read-only before DC-B is promoted. Never bypass that order.
Clients writing to DC-A during the window see an error — make sure your client code has retry logic that can fail over to DC-B.
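
The ordering rule above (read-only first, promote second) can also be enforced in a script that drives the two REST endpoints listed under "Runtime control over HTTP". The sketch below is one possible shape, not the library's implementation. The post and is_caught_up callables are supplied by the caller and are hypothetical: post(base_url, path, body) should raise on failure, and is_caught_up() should return True once DC-B has applied everything. This page does not show a REST call that lifts read-only mode, so on timeout the sketch stops and leaves DC-A read-only for the operator to handle.

```python
import time


class HandoverAborted(Exception):
    """Raised when the follower does not catch up before the timeout."""


def planned_handover(post, is_caught_up, old_leader_url, new_leader_url,
                     replication_port=5500, timeout_s=60.0, poll_s=0.5):
    """Run read-only, wait for catchup, then promote, strictly in that order."""
    # Step 1: stop writes on the old leader before anything else.
    post(old_leader_url, "/replication/read-only", None)

    # Wait until the new leader has applied every op, or give up.
    deadline = time.monotonic() + timeout_s
    while not is_caught_up():
        if time.monotonic() >= deadline:
            raise HandoverAborted("follower did not catch up; not promoting")
        time.sleep(poll_s)

    # Step 2: promote only after step 1 succeeded and catchup is confirmed.
    post(new_leader_url, "/replication/promote", {"port": replication_port})
```

Because the promote call is unreachable unless the read-only call returned without raising, the script cannot produce the split-brain ordering described above.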