Chapter 06

Transactions

Multi-document atomicity on a single node and across shards via two-phase commit. Crash-safe with automatic recovery.

What transactions guarantee

A DocumentForge transaction stages a batch of writes — inserts, replaces, deletes — and either applies them all atomically or applies none of them. Within a single transaction:

Two flavours, same API shape:

Single-nodeCluster (sharded)
Entry pointdb.BeginTransaction()cluster.BeginTransaction()
Atomicity guaranteeLocal write lockTwo-phase commit across shards
Crash recoveryWAL replay on Opencluster.Recover() sweep
Reads during a txBlock on shard write lockBlock on each touched shard
Isolation levelRead-committed (via blocking)Read-committed (via blocking)

Single-node multi-document transactions

The simplest case: one DocumentForgeDb, several writes that have to land or fail together. Open a transaction with BeginTransaction(), stage your writes via the returned handle, then call Commit():

using var db = DocumentForgeDb.OpenOrCreate("airline.dfdb");
db.CreateIndex("users", "email", "idx_email", unique: true);

using var tx = db.BeginTransaction();
tx.DeleteByField("users", "email", "alice@example.com");
tx.Insert("users", @"{""email"":""alice@example.com"",""tier"":""gold""}");
tx.Commit();

The classic upsert pattern: drop the existing row by business key, insert the new version. Outside a transaction, the unique-index check on the insert would fire before the delete completes — leaving a window where the row briefly doesn't exist. Inside one, the validator considers the pending delete when checking the pending insert, so the round-trip is atomic at the BSON-document level.

What you can stage

tx.Insert(coll, doc)         // generates _id or honours one in the body
tx.Replace(coll, id, newDoc) // returns false if id is not visible
tx.Delete(coll, id)          // drops staged inserts of same id, or stages a delete
tx.DeleteByField(coll, field, value)
tx.Find(coll, id)            // read-your-writes: staged + committed

Failure modes

If validation fails at Commit() (e.g., the staged inserts would create a duplicate unique-index entry), the call throws and nothing is persisted. The transaction enters the RolledBack state and the caller can retry with a fresh handle.

If the process crashes mid-commit, the WAL replay on next Open restores the database to a consistent state. Multi-document atomicity at the document level is preserved across the crash.

Cluster transactions across shards

The same idea, but the documents live on different shards. cluster.BeginTransaction() returns a ClusterTransaction with the same staging API. Under the hood, the cluster:

  1. Routes each staged op to its target shard at staging time using the consistent-hash ring.
  2. If every staged op happens to route to one shard, takes the single-shard fast path — calls that shard's local BeginTransaction().Commit(). Same atomicity, same WAL fsync, no extra round-trip.
  3. Otherwise, runs a two-phase commit across the participating shards.
var shards = new[] {
    new InProcessShardTransport("A", dbA),
    new InProcessShardTransport("B", dbB),
    new InProcessShardTransport("C", dbC),
};
using var cluster = new DocumentForgeCluster()
    .AddShard(shards[0])
    .AddShard(shards[1])
    .AddShard(shards[2])
    .ShardCollection("orders", shardKeyPath: "pnr");

using var tx = cluster.BeginTransaction();
tx.Insert("orders", @"{""pnr"":""ABC123"",""leg"":1}");  // shard B (say)
tx.Insert("orders", @"{""pnr"":""XYZ789"",""leg"":1}");  // shard A
tx.Commit();  // 2PC: both shards commit, or neither does
Same shard-key value? Single-shard fast path. If every op in your transaction targets the same shard-key value (e.g. two flights for the same PNR), they all route to one shard and the cluster skips the 2PC round-trip entirely. Multi-shard transactions have a real coordination cost — same-shard ones don't.

Cluster transaction API

The handle exposes the operations you'd expect:

tx.Insert(coll, doc)
tx.Replace(coll, id, newDoc)
tx.DeleteByField(coll, field, value)
tx.Find(coll, id)            // staged inserts/replaces from this tx
tx.Commit()
tx.Rollback()                // or just dispose without committing

tx.ParticipantCount          // how many shards this tx touches
tx.StagedOperationCount      // total ops staged
tx.State                     // Active / Committed / RolledBack

Routing rules

Aborts and the failing-op message

If any participant votes ABORT during PREPARE — typically because applying the staged ops would violate a unique-index constraint — the coordinator broadcasts ROLLBACK to every PREPARED participant and the call throws:

try {
    tx.Commit();
} catch (TransactionException ex) {
    // "ClusterTransaction {Id} aborted on shard 'B': <reason>"
    Console.Error.WriteLine(ex.Message);
}

The exception names the shard that aborted and quotes the participant's reason — useful when debugging which shard couldn't accept the staged batch.

Under the hood: two-phase commit

For multi-shard transactions, the cluster picks a coordinator shard deterministically (lowest-index participant) and drives the protocol:

           Coordinator (cluster client)            Coordinator shard       Other participants
                                                  ┌──────────────┐       ┌──────────────┐
   Phase 1:  ──── PREPARE(txId, ops) ──────────▶  │  validate    │       │              │
             ◀─── PREPARED ───────────────────    │  hold lock   │       │              │
             ──── PREPARE(txId, ops) ────────────────────────────────▶   │  validate    │
             ◀─── PREPARED ─────────────────────────────────────────     │  hold lock   │
                                                  │              │       │              │
   Decision: ──── COMMIT_DECISION(txId) ────────▶ │  fsync log   │       │              │   ← point of no return
                                                  │              │       │              │
   Phase 2:  ──── COMMIT(txId) ────────────────▶  │  apply       │       │              │
             ──── COMMIT(txId) ──────────────────────────────────────▶   │  apply       │
             ──── DONE(txId) ─────────────────▶   │  fsync log   │       │              │
                                                  └──────────────┘       └──────────────┘

What each phase does

  1. PREPARE. The coordinator sends the staged op slice to each participant. Each participant validates the slice against its current state, persists it to a {db}.prepared.log file (fsync'd before responding), and holds its write lock until COMMIT or ROLLBACK arrives. Returns PREPARED on success or ABORTED with a reason on conflict.
  2. COMMIT_DECISION. If every participant voted PREPARED, the coordinator durably records the decision in the coordinator shard's {db}.coord.log — fsync before broadcast. This is the point of no return: if the coordinator dies after this record exists, recovery resolves to COMMIT.
  3. COMMIT. The coordinator broadcasts to each participant. Each one applies the prepared ops to storage, releases its write lock, and writes a RESOLVED record to its prepared-tx log. After every participant ACKs, the coordinator writes a DONE record so future recovery sweeps know the broadcast is complete.
  4. ROLLBACK (the abort path). If any participant voted ABORTED, the coordinator broadcasts ROLLBACK to every PREPARED participant. Each releases its lock without applying the ops and writes a RESOLVED record. No COMMIT_DECISION ever lands in the coord log — by convention, absence is abort.

Why the lock is held continuously

A participant that goes PREPARED → released-lock → waiting-for-resolve would let other writers slip in between PREPARE and COMMIT. Those writers could invalidate the staged validation — for instance, by inserting a doc that conflicts with one of the prepared inserts. The coordinator can't reject the COMMIT at that point (it already decided based on the PREPARED votes), so the participant would have to apply ops it knows will fail. Holding the write lock continuously avoids the problem entirely.

.NET's ReaderWriterLockSlim can't be released from a thread other than the one that acquired it, so each participant runs a small actor-pattern worker that owns the lock through the PREPARE→resolve window. The non-transactional hot path is unchanged — same lock primitive, no semaphore on top — and the worker thread doesn't even spin up until the first PREPARE arrives.

Crash recovery

2PC's value is that every failure mode resolves to a consistent end state. After any kind of crash — coordinator process died, participant rebooted, network blip during the broadcast — call cluster.Recover() on a freshly-built cluster:

using var cluster = DocumentForgeCluster.FromConfig(config, transportFactory);

// Resolve any in-flight 2PC tx left over from a prior crash.
var summary = cluster.Recover();
Console.WriteLine($"Recovery: {summary.Committed} committed, {summary.Aborted} aborted, {summary.Skipped} skipped");

The sweep walks every shard's prepared.log; for each PREPARED record without a matching RESOLVED, it asks the named coordinator shard for the decision and finalizes accordingly. The matrix:

Crash pointCoord log stateRecovery action
After COMMIT_DECISION, before broadcastDecided, no DONEReplay COMMIT to participants still PREPARED
Mid-broadcast (some committed, some PREPARED)Decided, no DONEThe PREPARED ones pull the decision and commit
Before deciding (all participants PREPARED)No record for this txIdROLLBACK every PREPARED participant
Coordinator shard absent (rebalance)UnknownSurfaced as Skipped for operator review

Recover() is idempotent — calling it on a clean cluster returns zero counts. It's safe to wire into your application's startup hook.

PREPARE timeout

Without a deadline, a coordinator that died before broadcasting COMMIT/ROLLBACK would leave participants PREPARED with their write lock held — wedging that shard's writers until someone runs Recover() manually. To keep the cluster live in the face of a coordinator failure, every PREPARE carries a deadline. When it elapses, the participant self-aborts: it releases the write lock, writes a RESOLVED-aborted record, and any late COMMIT for that tx then throws.

cluster.PrepareTimeout = TimeSpan.FromSeconds(30);  // default

The right value depends on your network and disk — long enough that healthy cross-shard commits never trip it, short enough that a stuck coordinator doesn't freeze writers for ages. 30 seconds is the default; reduce it for low-latency networks where you'd rather fail fast.

Performance

Transactions add new code paths but don't change the existing ones. The four hot paths called out as bytewise unchanged:

Only multi-shard cluster transactions pay the 2PC cost. Verified against the post-feature quick benchmark — every metric matches or beats the pre-2PC baseline.

Production operations

Metrics

Each shard exposes participant-side 2PC counters as JSON. Scrape them into your monitoring stack — they're cheap to read and Interlocked-incremented on the worker thread.

GET /tx/stats
{
  "prepareTotal": 142,            // every PREPARE accepted (PREPARED + ABORTED)
  "prepareAbortedTotal": 3,         // vote-ABORT (validation conflict, busy)
  "committedTotal": 137,
  "rolledBackTotal": 2,
  "timedOutTotal": 0,             // self-aborted by Phase D timeout
  "inFlightPrepared": 0           // gauge — 0 or 1, alert if >0 for >30s
}

From C#:

var stats = db.GetPreparedTxStats();
Console.WriteLine($"In-flight: {stats.InFlightPrepared}, committed: {stats.CommittedTotal}");

Recommended alerts:

Operator manual abort

When the timeout policy isn't aggressive enough or has been disabled, an operator can force-resolve a stuck PREPARED tx:

POST /tx/{txId}/abort
// 200 OK on success
// 409 Conflict if THIS shard recorded COMMIT_DECISION for this txId

The 409 response is the safety guard: a coordinator shard that already decided COMMIT must not be force-aborted, because the cluster has irrevocably committed. The right move there is to let recovery complete the broadcast.

From C# (via the HTTP transport, equivalent to db.RollbackPreparedTransaction):

var transport = new HttpShardTransport("shard-A", "http://shardA:5000");
transport.OperatorAbort("tx-stuck-12345");

Stuck-tx runbook

  1. Identify. A participant shard's inFlightPrepared stays at 1. Reads on collections it touched are blocked.
  2. Check the coordinator. Find the coordinator shard (it's in the participant's prepared.log; GET /tx/in-flight returns the records). Query its decision: GET /tx/{txId}/coord-state.
  3. If decided COMMIT (and not yet DONE): the cluster has committed. Run cluster.Recover() from any client; the sweep replays the broadcast. Don't force-abort.
  4. If decided is unknown (404) or undecided: the tx is in limbo. cluster.Recover() will treat it as ABORT (no decision = abort). If you can't run Recover for some reason, force-abort directly: POST /tx/{txId}/abort on each PREPARED participant.
  5. Verify. The participant's inFlightPrepared drops to 0 and reads on the touched collections unblock.

SLA & failure modes

What the 2PC machinery promises in writing.

Atomicity (post-decision)

If cluster.BeginTransaction().Commit() returns successfully, every staged op is durable on every participating shard. The coordinator records COMMIT_DECISION before broadcasting, and a successful return means every participant ACK'd COMMIT. Crash before the return is handled by recovery; crash during the broadcast resolves to COMMIT on the next Recover().

If Commit() throws TransactionException, no participating shard has any of the staged ops applied. Either an early PREPARE returned ABORT and the coordinator broadcast ROLLBACK before throwing, or no COMMIT_DECISION was ever written and recovery treats every PREPARED participant as aborted.

Isolation

Read-committed via blocking. Reads on a shard with an in-flight PREPARED tx wait until the resolve completes. With the default 30-second prepare timeout, the worst-case read delay is bounded at 30s per shard. There is no MVCC and no snapshot isolation; concurrent readers and writers serialize on the participant's write lock during PREPARED.

Durability

Three durability fences, all fsync'd before the next step:

Liveness

The PREPARE timeout guarantees liveness in the face of a coordinator that crashes between sending PREPARE and broadcasting the decision. Default 30s; configurable via cluster.PrepareTimeout. After timeout the participant self-aborts and releases its write lock — readers and writers can proceed.

Network partition between PREPARE and the broadcast on a specific participant: that participant times out and aborts. If the coordinator decided COMMIT but couldn't reach this participant, the coordinator's COMMIT broadcast eventually fails with an error to the client — but the participant has already aborted. The cluster is briefly inconsistent until either (a) the operator runs cluster.Recover() with the partition healed, or (b) the participant restarts and pulls the decision from the coord log on the next sweep.

Out of scope

Not in the current implementation:

Caveats and current limits

None of these affect single-node multi-doc transactions, which have the same atomicity story they've had since DocumentForge's first release — well-tested and unchanged by this work.