What transactions guarantee
A DocumentForge transaction stages a batch of writes — inserts, replaces, deletes — and either applies them all atomically or applies none of them. Within a single transaction:
- Atomicity. Commit either succeeds completely or fails completely. There is no partially-applied state visible to other readers.
- Consistency. Unique-index constraints are validated against the simulated post-commit state — including pending deletes — so a delete-then-insert with the same business key (the canonical upsert) round-trips cleanly.
- Read-your-writes. Reads through the transaction handle see staged inserts/replaces and skip staged deletes; reads outside the transaction don't see any of it until commit succeeds.
- Durability. Once
Commit()returns, the writes are fsync'd to the WAL on every participating node.
Two flavours, same API shape:
| Single-node | Cluster (sharded) | |
|---|---|---|
| Entry point | db.BeginTransaction() | cluster.BeginTransaction() |
| Atomicity guarantee | Local write lock | Two-phase commit across shards |
| Crash recovery | WAL replay on Open | cluster.Recover() sweep |
| Reads during a tx | Block on shard write lock | Block on each touched shard |
| Isolation level | Read-committed (via blocking) | Read-committed (via blocking) |
Single-node multi-document transactions
The simplest case: one DocumentForgeDb, several writes that have to land or fail together. Open a transaction with BeginTransaction(), stage your writes via the returned handle, then call Commit():
using var db = DocumentForgeDb.OpenOrCreate("airline.dfdb"); db.CreateIndex("users", "email", "idx_email", unique: true); using var tx = db.BeginTransaction(); tx.DeleteByField("users", "email", "alice@example.com"); tx.Insert("users", @"{""email"":""alice@example.com"",""tier"":""gold""}"); tx.Commit();
The classic upsert pattern: drop the existing row by business key, insert the new version. Outside a transaction, the unique-index check on the insert would fire before the delete completes — leaving a window where the row briefly doesn't exist. Inside one, the validator considers the pending delete when checking the pending insert, so the round-trip is atomic at the BSON-document level.
What you can stage
tx.Insert(coll, doc) // generates _id or honours one in the body tx.Replace(coll, id, newDoc) // returns false if id is not visible tx.Delete(coll, id) // drops staged inserts of same id, or stages a delete tx.DeleteByField(coll, field, value) tx.Find(coll, id) // read-your-writes: staged + committed
Failure modes
If validation fails at Commit() (e.g., the staged inserts would create a duplicate unique-index entry), the call throws and nothing is persisted. The transaction enters the RolledBack state and the caller can retry with a fresh handle.
If the process crashes mid-commit, the WAL replay on next Open restores the database to a consistent state. Multi-document atomicity at the document level is preserved across the crash.
Cluster transactions across shards
The same idea, but the documents live on different shards. cluster.BeginTransaction() returns a ClusterTransaction with the same staging API. Under the hood, the cluster:
- Routes each staged op to its target shard at staging time using the consistent-hash ring.
- If every staged op happens to route to one shard, takes the single-shard fast path — calls that shard's local
BeginTransaction().Commit(). Same atomicity, same WAL fsync, no extra round-trip. - Otherwise, runs a two-phase commit across the participating shards.
var shards = new[] { new InProcessShardTransport("A", dbA), new InProcessShardTransport("B", dbB), new InProcessShardTransport("C", dbC), }; using var cluster = new DocumentForgeCluster() .AddShard(shards[0]) .AddShard(shards[1]) .AddShard(shards[2]) .ShardCollection("orders", shardKeyPath: "pnr"); using var tx = cluster.BeginTransaction(); tx.Insert("orders", @"{""pnr"":""ABC123"",""leg"":1}"); // shard B (say) tx.Insert("orders", @"{""pnr"":""XYZ789"",""leg"":1}"); // shard A tx.Commit(); // 2PC: both shards commit, or neither does
Cluster transaction API
The handle exposes the operations you'd expect:
tx.Insert(coll, doc) tx.Replace(coll, id, newDoc) tx.DeleteByField(coll, field, value) tx.Find(coll, id) // staged inserts/replaces from this tx tx.Commit() tx.Rollback() // or just dispose without committing tx.ParticipantCount // how many shards this tx touches tx.StagedOperationCount // total ops staged tx.State // Active / Committed / RolledBack
Routing rules
- Insert: shard key extracted from the doc; routed to that one shard.
- Replace: shard key extracted from the new doc; routed to that one shard. The replacement doc must hash to the same shard as the existing doc — changing the shard-key value isn't supported.
- DeleteByField on the shard key: single-shard. The cluster knows exactly which shard owns the matching docs.
- DeleteByField on a non-shard-key field: matching docs could be anywhere, so the op fans out to every shard. Multi-shard 2PC; participants with no matches no-op cleanly.
Aborts and the failing-op message
If any participant votes ABORT during PREPARE — typically because applying the staged ops would violate a unique-index constraint — the coordinator broadcasts ROLLBACK to every PREPARED participant and the call throws:
try { tx.Commit(); } catch (TransactionException ex) { // "ClusterTransaction {Id} aborted on shard 'B': <reason>" Console.Error.WriteLine(ex.Message); }
The exception names the shard that aborted and quotes the participant's reason — useful when debugging which shard couldn't accept the staged batch.
Under the hood: two-phase commit
For multi-shard transactions, the cluster picks a coordinator shard deterministically (lowest-index participant) and drives the protocol:
Coordinator (cluster client) Coordinator shard Other participants
┌──────────────┐ ┌──────────────┐
Phase 1: ──── PREPARE(txId, ops) ──────────▶ │ validate │ │ │
◀─── PREPARED ─────────────────── │ hold lock │ │ │
──── PREPARE(txId, ops) ────────────────────────────────▶ │ validate │
◀─── PREPARED ───────────────────────────────────────── │ hold lock │
│ │ │ │
Decision: ──── COMMIT_DECISION(txId) ────────▶ │ fsync log │ │ │ ← point of no return
│ │ │ │
Phase 2: ──── COMMIT(txId) ────────────────▶ │ apply │ │ │
──── COMMIT(txId) ──────────────────────────────────────▶ │ apply │
──── DONE(txId) ─────────────────▶ │ fsync log │ │ │
└──────────────┘ └──────────────┘
What each phase does
- PREPARE. The coordinator sends the staged op slice to each participant. Each participant validates the slice against its current state, persists it to a
{db}.prepared.logfile (fsync'd before responding), and holds its write lock until COMMIT or ROLLBACK arrives. ReturnsPREPAREDon success orABORTEDwith a reason on conflict. - COMMIT_DECISION. If every participant voted
PREPARED, the coordinator durably records the decision in the coordinator shard's{db}.coord.log— fsync before broadcast. This is the point of no return: if the coordinator dies after this record exists, recovery resolves to COMMIT. - COMMIT. The coordinator broadcasts to each participant. Each one applies the prepared ops to storage, releases its write lock, and writes a
RESOLVEDrecord to its prepared-tx log. After every participant ACKs, the coordinator writes aDONErecord so future recovery sweeps know the broadcast is complete. - ROLLBACK (the abort path). If any participant voted
ABORTED, the coordinator broadcasts ROLLBACK to every PREPARED participant. Each releases its lock without applying the ops and writes aRESOLVEDrecord. NoCOMMIT_DECISIONever lands in the coord log — by convention, absence is abort.
Why the lock is held continuously
A participant that goes PREPARED → released-lock → waiting-for-resolve would let other writers slip in between PREPARE and COMMIT. Those writers could invalidate the staged validation — for instance, by inserting a doc that conflicts with one of the prepared inserts. The coordinator can't reject the COMMIT at that point (it already decided based on the PREPARED votes), so the participant would have to apply ops it knows will fail. Holding the write lock continuously avoids the problem entirely.
.NET's ReaderWriterLockSlim can't be released from a thread other than the one that acquired it, so each participant runs a small actor-pattern worker that owns the lock through the PREPARE→resolve window. The non-transactional hot path is unchanged — same lock primitive, no semaphore on top — and the worker thread doesn't even spin up until the first PREPARE arrives.
Crash recovery
2PC's value is that every failure mode resolves to a consistent end state. After any kind of crash — coordinator process died, participant rebooted, network blip during the broadcast — call cluster.Recover() on a freshly-built cluster:
using var cluster = DocumentForgeCluster.FromConfig(config, transportFactory); // Resolve any in-flight 2PC tx left over from a prior crash. var summary = cluster.Recover(); Console.WriteLine($"Recovery: {summary.Committed} committed, {summary.Aborted} aborted, {summary.Skipped} skipped");
The sweep walks every shard's prepared.log; for each PREPARED record without a matching RESOLVED, it asks the named coordinator shard for the decision and finalizes accordingly. The matrix:
| Crash point | Coord log state | Recovery action |
|---|---|---|
| After COMMIT_DECISION, before broadcast | Decided, no DONE | Replay COMMIT to participants still PREPARED |
| Mid-broadcast (some committed, some PREPARED) | Decided, no DONE | The PREPARED ones pull the decision and commit |
| Before deciding (all participants PREPARED) | No record for this txId | ROLLBACK every PREPARED participant |
| Coordinator shard absent (rebalance) | Unknown | Surfaced as Skipped for operator review |
Recover() is idempotent — calling it on a clean cluster returns zero counts. It's safe to wire into your application's startup hook.
PREPARE timeout
Without a deadline, a coordinator that died before broadcasting COMMIT/ROLLBACK would leave participants PREPARED with their write lock held — wedging that shard's writers until someone runs Recover() manually. To keep the cluster live in the face of a coordinator failure, every PREPARE carries a deadline. When it elapses, the participant self-aborts: it releases the write lock, writes a RESOLVED-aborted record, and any late COMMIT for that tx then throws.
cluster.PrepareTimeout = TimeSpan.FromSeconds(30); // default
The right value depends on your network and disk — long enough that healthy cross-shard commits never trip it, short enough that a stuck coordinator doesn't freeze writers for ages. 30 seconds is the default; reduce it for low-latency networks where you'd rather fail fast.
Performance
Transactions add new code paths but don't change the existing ones. The four hot paths called out as bytewise unchanged:
- Single-doc
db.Insert/Update/Deleteon aDocumentForgeDb. - Single-node
db.BeginTransaction().Commit(). - Non-transactional
cluster.Insert/Execute(the cluster routing layer). cluster.BeginTransaction()when every staged op routes to exactly one shard (the single-shard fast path).
Only multi-shard cluster transactions pay the 2PC cost. Verified against the post-feature quick benchmark — every metric matches or beats the pre-2PC baseline.
Production operations
Metrics
Each shard exposes participant-side 2PC counters as JSON. Scrape them into your monitoring stack — they're cheap to read and Interlocked-incremented on the worker thread.
GET /tx/stats
{
"prepareTotal": 142, // every PREPARE accepted (PREPARED + ABORTED)
"prepareAbortedTotal": 3, // vote-ABORT (validation conflict, busy)
"committedTotal": 137,
"rolledBackTotal": 2,
"timedOutTotal": 0, // self-aborted by Phase D timeout
"inFlightPrepared": 0 // gauge — 0 or 1, alert if >0 for >30s
}
From C#:
var stats = db.GetPreparedTxStats(); Console.WriteLine($"In-flight: {stats.InFlightPrepared}, committed: {stats.CommittedTotal}");
Recommended alerts:
inFlightPreparedat1for longer than the prepare timeout — a participant is stuck and the timeout hasn't kicked in for some reason.timedOutTotalgrowing — coordinator-side problems leading to participants self-aborting. Often paired with elevated request error rates on the cluster client.prepareAbortedTotalgrowing relative toprepareTotal— application-level conflicts (unique-index races) or busy slots; not always a problem, but worth a glance.
Operator manual abort
When the timeout policy isn't aggressive enough or has been disabled, an operator can force-resolve a stuck PREPARED tx:
POST /tx/{txId}/abort
// 200 OK on success
// 409 Conflict if THIS shard recorded COMMIT_DECISION for this txId
The 409 response is the safety guard: a coordinator shard that already decided COMMIT must not be force-aborted, because the cluster has irrevocably committed. The right move there is to let recovery complete the broadcast.
From C# (via the HTTP transport, equivalent to db.RollbackPreparedTransaction):
var transport = new HttpShardTransport("shard-A", "http://shardA:5000"); transport.OperatorAbort("tx-stuck-12345");
Stuck-tx runbook
- Identify. A participant shard's
inFlightPreparedstays at 1. Reads on collections it touched are blocked. - Check the coordinator. Find the coordinator shard (it's in the participant's
prepared.log;GET /tx/in-flightreturns the records). Query its decision:GET /tx/{txId}/coord-state. - If decided COMMIT (and not yet DONE): the cluster has committed. Run
cluster.Recover()from any client; the sweep replays the broadcast. Don't force-abort. - If decided is unknown (404) or undecided: the tx is in limbo.
cluster.Recover()will treat it as ABORT (no decision = abort). If you can't run Recover for some reason, force-abort directly:POST /tx/{txId}/aborton each PREPARED participant. - Verify. The participant's
inFlightPrepareddrops to 0 and reads on the touched collections unblock.
SLA & failure modes
What the 2PC machinery promises in writing.
Atomicity (post-decision)
If cluster.BeginTransaction().Commit() returns successfully, every staged op is durable on every participating shard. The coordinator records COMMIT_DECISION before broadcasting, and a successful return means every participant ACK'd COMMIT. Crash before the return is handled by recovery; crash during the broadcast resolves to COMMIT on the next Recover().
If Commit() throws TransactionException, no participating shard has any of the staged ops applied. Either an early PREPARE returned ABORT and the coordinator broadcast ROLLBACK before throwing, or no COMMIT_DECISION was ever written and recovery treats every PREPARED participant as aborted.
Isolation
Read-committed via blocking. Reads on a shard with an in-flight PREPARED tx wait until the resolve completes. With the default 30-second prepare timeout, the worst-case read delay is bounded at 30s per shard. There is no MVCC and no snapshot isolation; concurrent readers and writers serialize on the participant's write lock during PREPARED.
Durability
Three durability fences, all fsync'd before the next step:
- Participant prepared.log — fsync'd before responding PREPARED. Survives any crash short of disk failure.
- Coordinator coord.log — fsync'd before broadcasting COMMIT. Point of no return.
- Per-participant WAL — fsync'd as part of the existing single-node commit path during the COMMIT-side apply.
Liveness
The PREPARE timeout guarantees liveness in the face of a coordinator that crashes between sending PREPARE and broadcasting the decision. Default 30s; configurable via cluster.PrepareTimeout. After timeout the participant self-aborts and releases its write lock — readers and writers can proceed.
Network partition between PREPARE and the broadcast on a specific participant: that participant times out and aborts. If the coordinator decided COMMIT but couldn't reach this participant, the coordinator's COMMIT broadcast eventually fails with an error to the client — but the participant has already aborted. The cluster is briefly inconsistent until either (a) the operator runs cluster.Recover() with the partition healed, or (b) the participant restarts and pulls the decision from the coord log on the next sweep.
Out of scope
Not in the current implementation:
- Snapshot or serializable isolation. Read-committed-via-blocking is what we ship.
- Multi-region / WAN-aware 2PC. Latency-tolerant prepare and decision protocols are a different design.
- Heuristic abort during partition. A stuck PREPARED-without-decision case currently waits for the operator (or the next recovery sweep). No automatic resolution path.
- Quorum-based replicated coord log. The decision lives on a single coordinator shard. If that shard's disk is destroyed mid-flight, recovery can't determine the decision — operator intervention required.
Caveats and current limits
- Reads block during a multi-shard PREPARED. The participant holds the write lock; readers on that shard wait until COMMIT/ROLLBACK arrives. With a 30-second prepare timeout this caps the worst case at 30s per shard. MVCC / snapshot isolation isn't on the roadmap right now.
- One PREPARED tx per shard at a time. A second PREPARE while one's already in flight gets
ABORTED("busy")— a clean retry signal for the coordinator. Concurrent multi-shard transactions on the same shard are serialized. - Replace can't change the shard key. The new doc must hash to the same shard as the existing doc. Doing this as a cross-shard delete-then-insert would need a different operation that doesn't exist yet.
- Delete-by-id without a shard key isn't supported. Use
DeleteByFieldon the shard-key field for single-shard deletes, or on any other field for cluster-wide scatter deletes. - Replicated collections aren't supported in cluster transactions. Inserts / replaces on a replicated collection from inside a
ClusterTransactionthrow — every shard would be a participant and the semantics need more thought. - Coordinator-driven re-broadcast not yet wired. If the coordinator dies after deciding but before broadcasting, the participants pull the decision themselves on the next
Recover()call. There's no proactive "coordinator wakes up and re-broadcasts" path yet — that's a refinement for later production hardening. - No quorum on the coordinator log. The decision lives on a single coordinator shard. If that shard's disk is destroyed during a broadcast (decided, not yet DONE), recovery can't determine the decision and operator intervention is required. Replicate the coordinator shard for production.
None of these affect single-node multi-doc transactions, which have the same atomicity story they've had since DocumentForge's first release — well-tested and unchanged by this work.