Transactions — DocumentForge

What transactions guarantee

A DocumentForge transaction stages a batch of writes — inserts, replaces, deletes — and either applies them all atomically or applies none of them. Within a single transaction:

Atomicity. Commit either succeeds completely or fails completely. There is no partially-applied state visible to other readers.
Consistency. Unique-index constraints are validated against the simulated post-commit state — including pending deletes — so a delete-then-insert with the same business key (the canonical upsert) round-trips cleanly.
Read-your-writes. Reads through the transaction handle see staged inserts/replaces and skip staged deletes; reads outside the transaction don't see any of it until commit succeeds.
Durability. Once Commit() returns, the writes are fsync'd to the WAL on every participating node.

Two flavours, same API shape:

	Single-node	Cluster (sharded)
Entry point	`db.BeginTransaction()`	`cluster.BeginTransaction()`
Atomicity guarantee	Local write lock	Two-phase commit across shards
Crash recovery	WAL replay on Open	`cluster.Recover()` sweep
Reads during a tx	Block on shard write lock	Block on each touched shard
Isolation level	Read-committed (via blocking)	Read-committed (via blocking)

Single-node multi-document transactions

The simplest case: one DocumentForgeDb, several writes that have to land or fail together. Open a transaction with BeginTransaction(), stage your writes via the returned handle, then call Commit():

using var db = DocumentForgeDb.OpenOrCreate("airline.dfdb");
db.CreateIndex("users", "email", "idx_email", unique: true);

using var tx = db.BeginTransaction();
tx.DeleteByField("users", "email", "alice@example.com");
tx.Insert("users", @"{""email"":""alice@example.com"",""tier"":""gold""}");
tx.Commit();

The classic upsert pattern: drop the existing row by business key, insert the new version. Outside a transaction, the unique-index check on the insert would fire before the delete completes — leaving a window where the row briefly doesn't exist. Inside one, the validator considers the pending delete when checking the pending insert, so the round-trip is atomic at the BSON-document level.

What you can stage

tx.Insert(coll, doc)         // generates _id or honours one in the body
tx.Replace(coll, id, newDoc) // returns false if id is not visible
tx.Delete(coll, id)          // drops staged inserts of same id, or stages a delete
tx.DeleteByField(coll, field, value)
tx.Find(coll, id)            // read-your-writes: staged + committed

Failure modes

If validation fails at Commit() (e.g., the staged inserts would create a duplicate unique-index entry), the call throws and nothing is persisted. The transaction enters the RolledBack state and the caller can retry with a fresh handle.

If the process crashes mid-commit, the WAL replay on next Open restores the database to a consistent state. Multi-document atomicity at the document level is preserved across the crash.

Cluster transactions across shards

The same idea, but the documents live on different shards. cluster.BeginTransaction() returns a ClusterTransaction with the same staging API. Under the hood, the cluster:

Routes each staged op to its target shard at staging time using the consistent-hash ring.
If every staged op happens to route to one shard, takes the single-shard fast path — calls that shard's local BeginTransaction().Commit(). Same atomicity, same WAL fsync, no extra round-trip.
Otherwise, runs a two-phase commit across the participating shards.

var shards = new[] {
    new InProcessShardTransport("A", dbA),
    new InProcessShardTransport("B", dbB),
    new InProcessShardTransport("C", dbC),
};
using var cluster = new DocumentForgeCluster()
    .AddShard(shards[0])
    .AddShard(shards[1])
    .AddShard(shards[2])
    .ShardCollection("orders", shardKeyPath: "pnr");

using var tx = cluster.BeginTransaction();
tx.Insert("orders", @"{""pnr"":""ABC123"",""leg"":1}");  // shard B (say)
tx.Insert("orders", @"{""pnr"":""XYZ789"",""leg"":1}");  // shard A
tx.Commit();  // 2PC: both shards commit, or neither does

Same shard-key value? Single-shard fast path. If every op in your transaction targets the same shard-key value (e.g. two flights for the same PNR), they all route to one shard and the cluster skips the 2PC round-trip entirely. Multi-shard transactions have a real coordination cost — same-shard ones don't.

Cluster transaction API

The handle exposes the operations you'd expect:

tx.Insert(coll, doc)
tx.Replace(coll, id, newDoc)
tx.DeleteByField(coll, field, value)
tx.Find(coll, id)            // staged inserts/replaces from this tx
tx.Commit()
tx.Rollback()                // or just dispose without committing

tx.ParticipantCount          // how many shards this tx touches
tx.StagedOperationCount      // total ops staged
tx.State                     // Active / Committed / RolledBack

Routing rules

Insert: shard key extracted from the doc; routed to that one shard.
Replace: shard key extracted from the new doc; routed to that one shard. The replacement doc must hash to the same shard as the existing doc — changing the shard-key value isn't supported.
DeleteByField on the shard key: single-shard. The cluster knows exactly which shard owns the matching docs.
DeleteByField on a non-shard-key field: matching docs could be anywhere, so the op fans out to every shard. Multi-shard 2PC; participants with no matches no-op cleanly.

Aborts and the failing-op message

If any participant votes ABORT during PREPARE — typically because applying the staged ops would violate a unique-index constraint — the coordinator broadcasts ROLLBACK to every PREPARED participant and the call throws:

try {
    tx.Commit();
} catch (TransactionException ex) {
    // "ClusterTransaction {Id} aborted on shard 'B': <reason>"
    Console.Error.WriteLine(ex.Message);
}

The exception names the shard that aborted and quotes the participant's reason — useful when debugging which shard couldn't accept the staged batch.

Under the hood: two-phase commit

For multi-shard transactions, the cluster picks a coordinator shard deterministically (lowest-index participant) and drives the protocol:

           Coordinator (cluster client)            Coordinator shard       Other participants
                                                  ┌──────────────┐       ┌──────────────┐
   Phase 1:  ──── PREPARE(txId, ops) ──────────▶  │  validate    │       │              │
             ◀─── PREPARED ───────────────────    │  hold lock   │       │              │
             ──── PREPARE(txId, ops) ────────────────────────────────▶   │  validate    │
             ◀─── PREPARED ─────────────────────────────────────────     │  hold lock   │
                                                  │              │       │              │
   Decision: ──── COMMIT_DECISION(txId) ────────▶ │  fsync log   │       │              │   ← point of no return
                                                  │              │       │              │
   Phase 2:  ──── COMMIT(txId) ────────────────▶  │  apply       │       │              │
             ──── COMMIT(txId) ──────────────────────────────────────▶   │  apply       │
             ──── DONE(txId) ─────────────────▶   │  fsync log   │       │              │
                                                  └──────────────┘       └──────────────┘

What each phase does

PREPARE. The coordinator sends the staged op slice to each participant. Each participant validates the slice against its current state, persists it to a {db}.prepared.log file (fsync'd before responding), and holds its write lock until COMMIT or ROLLBACK arrives. Returns PREPARED on success or ABORTED with a reason on conflict.
COMMIT_DECISION. If every participant voted PREPARED, the coordinator durably records the decision in the coordinator shard's {db}.coord.log — fsync before broadcast. This is the point of no return: if the coordinator dies after this record exists, recovery resolves to COMMIT.
COMMIT. The coordinator broadcasts to each participant. Each one applies the prepared ops to storage, releases its write lock, and writes a RESOLVED record to its prepared-tx log. After every participant ACKs, the coordinator writes a DONE record so future recovery sweeps know the broadcast is complete.
ROLLBACK (the abort path). If any participant voted ABORTED, the coordinator broadcasts ROLLBACK to every PREPARED participant. Each releases its lock without applying the ops and writes a RESOLVED record. No COMMIT_DECISION ever lands in the coord log — by convention, absence is abort.

Why the lock is held continuously

A participant that goes PREPARED → released-lock → waiting-for-resolve would let other writers slip in between PREPARE and COMMIT. Those writers could invalidate the staged validation — for instance, by inserting a doc that conflicts with one of the prepared inserts. The coordinator can't reject the COMMIT at that point (it already decided based on the PREPARED votes), so the participant would have to apply ops it knows will fail. Holding the write lock continuously avoids the problem entirely.

.NET's ReaderWriterLockSlim can't be released from a thread other than the one that acquired it, so each participant runs a small actor-pattern worker that owns the lock through the PREPARE→resolve window. The non-transactional hot path is unchanged — same lock primitive, no semaphore on top — and the worker thread doesn't even spin up until the first PREPARE arrives.

Crash recovery

2PC's value is that every failure mode resolves to a consistent end state. After any kind of crash — coordinator process died, participant rebooted, network blip during the broadcast — call cluster.Recover() on a freshly-built cluster:

using var cluster = DocumentForgeCluster.FromConfig(config, transportFactory);

// Resolve any in-flight 2PC tx left over from a prior crash.
var summary = cluster.Recover();
Console.WriteLine($"Recovery: {summary.Committed} committed, {summary.Aborted} aborted, {summary.Skipped} skipped");

The sweep walks every shard's prepared.log; for each PREPARED record without a matching RESOLVED, it asks the named coordinator shard for the decision and finalizes accordingly. The matrix:

Crash point	Coord log state	Recovery action
After COMMIT_DECISION, before broadcast	Decided, no DONE	Replay COMMIT to participants still PREPARED
Mid-broadcast (some committed, some PREPARED)	Decided, no DONE	The PREPARED ones pull the decision and commit
Before deciding (all participants PREPARED)	No record for this txId	ROLLBACK every PREPARED participant
Coordinator shard absent (rebalance)	Unknown	Surfaced as `Skipped` for operator review

Recover() is idempotent — calling it on a clean cluster returns zero counts. It's safe to wire into your application's startup hook.

PREPARE timeout

Without a deadline, a coordinator that died before broadcasting COMMIT/ROLLBACK would leave participants PREPARED with their write lock held — wedging that shard's writers until someone runs Recover() manually. To keep the cluster live in the face of a coordinator failure, every PREPARE carries a deadline. When it elapses, the participant self-aborts: it releases the write lock, writes a RESOLVED-aborted record, and any late COMMIT for that tx then throws.

cluster.PrepareTimeout = TimeSpan.FromSeconds(30);  // default

The right value depends on your network and disk — long enough that healthy cross-shard commits never trip it, short enough that a stuck coordinator doesn't freeze writers for ages. 30 seconds is the default; reduce it for low-latency networks where you'd rather fail fast.

Performance

Transactions add new code paths but don't change the existing ones. The four hot paths called out as bytewise unchanged:

Single-doc db.Insert/Update/Delete on a DocumentForgeDb.
Single-node db.BeginTransaction().Commit().
Non-transactional cluster.Insert/Execute (the cluster routing layer).
cluster.BeginTransaction() when every staged op routes to exactly one shard (the single-shard fast path).

Only multi-shard cluster transactions pay the 2PC cost. Verified against the post-feature quick benchmark — every metric matches or beats the pre-2PC baseline.

Production operations

Metrics

Each shard exposes participant-side 2PC counters as JSON. Scrape them into your monitoring stack — they're cheap to read and Interlocked-incremented on the worker thread.

GET /tx/stats
{
  "prepareTotal": 142,            // every PREPARE accepted (PREPARED + ABORTED)
  "prepareAbortedTotal": 3,         // vote-ABORT (validation conflict, busy)
  "committedTotal": 137,
  "rolledBackTotal": 2,
  "timedOutTotal": 0,             // self-aborted by Phase D timeout
  "inFlightPrepared": 0           // gauge — 0 or 1, alert if >0 for >30s
}

From C#:

var stats = db.GetPreparedTxStats();
Console.WriteLine($"In-flight: {stats.InFlightPrepared}, committed: {stats.CommittedTotal}");

Recommended alerts:

inFlightPrepared at 1 for longer than the prepare timeout — a participant is stuck and the timeout hasn't kicked in for some reason.
timedOutTotal growing — coordinator-side problems leading to participants self-aborting. Often paired with elevated request error rates on the cluster client.
prepareAbortedTotal growing relative to prepareTotal — application-level conflicts (unique-index races) or busy slots; not always a problem, but worth a glance.

Operator manual abort

When the timeout policy isn't aggressive enough or has been disabled, an operator can force-resolve a stuck PREPARED tx:

POST /tx/{txId}/abort
// 200 OK on success
// 409 Conflict if THIS shard recorded COMMIT_DECISION for this txId

The 409 response is the safety guard: a coordinator shard that already decided COMMIT must not be force-aborted, because the cluster has irrevocably committed. The right move there is to let recovery complete the broadcast.

From C# (via the HTTP transport, equivalent to db.RollbackPreparedTransaction):

var transport = new HttpShardTransport("shard-A", "http://shardA:5000");
transport.OperatorAbort("tx-stuck-12345");

Stuck-tx runbook

Identify. A participant shard's inFlightPrepared stays at 1. Reads on collections it touched are blocked.
Check the coordinator. Find the coordinator shard (it's in the participant's prepared.log; GET /tx/in-flight returns the records). Query its decision: GET /tx/{txId}/coord-state.
If decided COMMIT (and not yet DONE): the cluster has committed. Run cluster.Recover() from any client; the sweep replays the broadcast. Don't force-abort.
If decided is unknown (404) or undecided: the tx is in limbo. cluster.Recover() will treat it as ABORT (no decision = abort). If you can't run Recover for some reason, force-abort directly: POST /tx/{txId}/abort on each PREPARED participant.
Verify. The participant's inFlightPrepared drops to 0 and reads on the touched collections unblock.

SLA & failure modes

What the 2PC machinery promises in writing.

Atomicity (post-decision)

If cluster.BeginTransaction().Commit() returns successfully, every staged op is durable on every participating shard. The coordinator records COMMIT_DECISION before broadcasting, and a successful return means every participant ACK'd COMMIT. Crash before the return is handled by recovery; crash during the broadcast resolves to COMMIT on the next Recover().

If Commit() throws TransactionException, no participating shard has any of the staged ops applied. Either an early PREPARE returned ABORT and the coordinator broadcast ROLLBACK before throwing, or no COMMIT_DECISION was ever written and recovery treats every PREPARED participant as aborted.

Isolation

Read-committed via blocking. Reads on a shard with an in-flight PREPARED tx wait until the resolve completes. With the default 30-second prepare timeout, the worst-case read delay is bounded at 30s per shard. There is no MVCC and no snapshot isolation; concurrent readers and writers serialize on the participant's write lock during PREPARED.

Durability

Three durability fences, all fsync'd before the next step:

Participant prepared.log — fsync'd before responding PREPARED. Survives any crash short of disk failure.
Coordinator coord.log — fsync'd before broadcasting COMMIT. Point of no return.
Per-participant WAL — fsync'd as part of the existing single-node commit path during the COMMIT-side apply.

Liveness

The PREPARE timeout guarantees liveness in the face of a coordinator that crashes between sending PREPARE and broadcasting the decision. Default 30s; configurable via cluster.PrepareTimeout. After timeout the participant self-aborts and releases its write lock — readers and writers can proceed.

Network partition between PREPARE and the broadcast on a specific participant: that participant times out and aborts. If the coordinator decided COMMIT but couldn't reach this participant, the coordinator's COMMIT broadcast eventually fails with an error to the client — but the participant has already aborted. The cluster is briefly inconsistent until either (a) the operator runs cluster.Recover() with the partition healed, or (b) the participant restarts and pulls the decision from the coord log on the next sweep.

Out of scope

Not in the current implementation:

Snapshot or serializable isolation. Read-committed-via-blocking is what we ship.
Multi-region / WAN-aware 2PC. Latency-tolerant prepare and decision protocols are a different design.
Heuristic abort during partition. A stuck PREPARED-without-decision case currently waits for the operator (or the next recovery sweep). No automatic resolution path.
Quorum-based replicated coord log. The decision lives on a single coordinator shard. If that shard's disk is destroyed mid-flight, recovery can't determine the decision — operator intervention required.

Caveats and current limits

Reads block during a multi-shard PREPARED. The participant holds the write lock; readers on that shard wait until COMMIT/ROLLBACK arrives. With a 30-second prepare timeout this caps the worst case at 30s per shard. MVCC / snapshot isolation isn't on the roadmap right now.
One PREPARED tx per shard at a time. A second PREPARE while one's already in flight gets ABORTED("busy") — a clean retry signal for the coordinator. Concurrent multi-shard transactions on the same shard are serialized.
Replace can't change the shard key. The new doc must hash to the same shard as the existing doc. Doing this as a cross-shard delete-then-insert would need a different operation that doesn't exist yet.
Delete-by-id without a shard key isn't supported. Use DeleteByField on the shard-key field for single-shard deletes, or on any other field for cluster-wide scatter deletes.
Replicated collections aren't supported in cluster transactions. Inserts / replaces on a replicated collection from inside a ClusterTransaction throw — every shard would be a participant and the semantics need more thought.
Coordinator-driven re-broadcast not yet wired. If the coordinator dies after deciding but before broadcasting, the participants pull the decision themselves on the next Recover() call. There's no proactive "coordinator wakes up and re-broadcasts" path yet — that's a refinement for later production hardening.
No quorum on the coordinator log. The decision lives on a single coordinator shard. If that shard's disk is destroyed during a broadcast (decided, not yet DONE), recovery can't determine the decision and operator intervention is required. Replicate the coordinator shard for production.

None of these affect single-node multi-doc transactions, which have the same atomicity story they've had since DocumentForge's first release — well-tested and unchanged by this work.