Chapter 03

Data Modeling

When to embed, when to reference. The decision that shapes every query you'll ever write.

Two patterns

Every document database gives you two ways to relate entities. Both are valid. The mistake is picking one dogmatically.

Embed — nest the related data inside the parent

{
  "pnr": "ABC123",
  "passenger": { "firstName": "John", "lastName": "Smith" },
  "flights": [
    { "flightNumber": "AA100", "cabin": "BUSINESS", "fare": 450 }
  ]
}

Reference — store an ID; use JOIN when needed

// orders collection
{ "pnr": "ABC123", "flightIds": ["AA100_2024-06-15"] }

// flights collection
{ "flightId": "AA100_2024-06-15", "status": "DELAYED", "gate": "B12" }

The rule of thumb

FactorEmbedReference
LifetimeDies with parentLives independently
Access patternAlways read togetherSometimes alone
Shared by manyNo — ownedYes
Update frequencyRarely / as a wholeOften, independently
Bounded sizeYes (small N)Can be large
Atomic readSingle queryNeeds JOIN
The rule: embed what the parent owns. Reference what exists independently.

Case study: Airline PNR

This is where the two patterns really shine together. Real production airline systems combine them — and it's elegant.

Embed inside the order (snapshot at booking time)

{
  "pnr": "ABC123",
  "createdAt": "2024-06-01T10:30:00Z",

  "passenger": {              // embed — owned by this order
    "firstName": "John",
    "lastName": "Smith",
    "frequentFlyerNumber": "AA123456"
  },

  "bookedFlights": [          // embed — SNAPSHOT of what was booked
    {
      "flightNumber": "AA100",
      "cabin": "BUSINESS",
      "fareAtBooking": 450,
      "flightRef": "AA100_2024-06-15"  // pointer to current state
    }
  ],

  "payment": {                // embed — belongs to this order
    "method": "CREDIT_CARD",
    "last4": "4242",
    "amount": 450
  }
}

Reference separately (current state)

// flights — current operational state, changes independently
{
  "flightId": "AA100_2024-06-15",
  "status": "DELAYED",
  "actualDepartureTime": "...",
  "gate": "B23",
  "aircraftRegistration": "N12345"
}

// airports — master data
{ "code": "JFK", "name": "John F. Kennedy", "timezone": "America/New_York" }

// aircraft — fleet data
{ "registration": "N12345", "type": "B738", "status": "IN_SERVICE" }

Why this is the right design

Joining them

-- Get order with current flight status
SELECT * FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE orders.pnr = 'ABC123'

-- Find orders affected by a cancelled flight
SELECT orders.pnr, orders.passenger
FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE flights.status = 'CANCELLED'

IDs & references

Every document has an _id field. If you don't supply one on insert, a sequential time-ordered 128-bit ID is generated. Time-ordering keeps recently-written documents near each other on disk, improving cache locality.

For references, you can:

Tip: Domain-meaningful IDs are often better than auto-generated ones for references. They're easier to read in logs, easier to construct in tests, and more stable across environments.

Index patterns

Indexing decisions should follow your query patterns. Here are the common ones:

Primary lookup

CREATE UNIQUE INDEX idx_pnr ON orders (pnr)

Sub-field lookup (search by passenger)

CREATE INDEX idx_lastname ON orders (passenger.lastName)

First array element (common in reservations)

CREATE INDEX idx_first_flight ON orders (flights[0].flightNumber)

Composite for multi-field filters

CREATE INDEX idx_status_created ON orders (status, createdAt)

Range queries

-- Index on a numeric/date field, then use range predicates
CREATE INDEX idx_fare ON orders (fare.total)
SELECT * FROM orders WHERE fare.total > 1000 ORDER BY fare.total LIMIT 100

Schema evolution

DocumentForge is schemaless. You can add new fields to documents at any time without migrating existing data. Old documents simply don't have the new field.

When you query a field that some documents don't have, they evaluate as null. This is usually what you want:

-- Find documents that DO have the new field
SELECT * FROM orders WHERE loyaltyTier IS NOT NULL

-- Find orders from before the loyaltyTier field existed
SELECT * FROM orders WHERE loyaltyTier IS NULL
Best practice: When adding a new field, either default it to a sensible value in your application code, or make your queries tolerate the absence. Avoid "migrations" that rewrite every document — it's rarely necessary.

History as a feature

The embed-the-snapshot pattern gives you something surprisingly valuable: a historical record. An order booked in 2023 remembers the 2023 fare, the 2023 cabin class, the 2023 passenger name. Even if the airline re-fared the flight or the passenger changed their email.

Don't fight this — embrace it. Call the embedded bit bookedFlights or snapshotAtBooking to make the intent explicit.

When you need current information, query the referenced collection. When you need history, read the embedded snapshot. Both are right.