Data Modeling

When to embed, when to reference — the decision that shapes every query you will ever write.

Two patterns

Every document database gives you two ways to relate entities. Both are valid. The mistake is picking one dogmatically.


{
  "pnr": "ABC123",
  "passenger": { "firstName": "John", "lastName": "Smith" },
  "flights": [
    { "flightNumber": "AA100", "cabin": "BUSINESS", "fare": 450 }
  ]
}

Reference — store an ID; use JOIN when needed


// orders collection
{ "pnr": "ABC123", "flightIds": ["AA100_2024-06-15"] }
 
// flights collection
{ "flightId": "AA100_2024-06-15", "status": "DELAYED", "gate": "B12" }

The rule of thumb

Factor	Embed	Reference
Lifetime	Dies with parent	Lives independently
Access pattern	Always read together	Sometimes alone
Shared by many	No — owned	Yes
Update frequency	Rarely / as a whole	Often, independently
Bounded size	Yes (small N)	Can be large
Atomic read	Single query	Needs JOIN

The rule: embed what the parent owns. Reference what exists independently.

Case study: Airline PNR

This is where the two patterns really shine together. Real production airline systems combine them — and it is elegant.

Embed inside the order (snapshot at booking time)


{
  "pnr": "ABC123",
  "createdAt": "2024-06-01T10:30:00Z",
 
  "passenger": {              // embed — owned by this order
    "firstName": "John",
    "lastName": "Smith",
    "frequentFlyerNumber": "AA123456"
  },
 
  "bookedFlights": [          // embed — SNAPSHOT of what was booked
    {
      "flightNumber": "AA100",
      "cabin": "BUSINESS",
      "fareAtBooking": 450,
      "flightRef": "AA100_2024-06-15"  // pointer to current state
    }
  ],
 
  "payment": {                // embed — belongs to this order
    "method": "CREDIT_CARD",
    "last4": "4242",
    "amount": 450
  }
}

Reference separately (current state)


// flights — current operational state, changes independently
{
  "flightId": "AA100_2024-06-15",
  "status": "DELAYED",
  "actualDepartureTime": "...",
  "gate": "B23",
  "aircraftRegistration": "N12345"
}
 
// airports — master data
{ "code": "JFK", "name": "John F. Kennedy", "timezone": "America/New_York" }
 
// aircraft — fleet data
{ "registration": "N12345", "type": "B738", "status": "IN_SERVICE" }

Why this is the right design

One query reads an order — all customer-facing info is embedded.
Flight status updates do not rewrite every order — one row in flights.
History is preserved — “I booked business class for $450” even if the flight is cancelled.
JOIN when you need live status — see below.

Joining them


-- Get order with current flight status
SELECT * FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE orders.pnr = 'ABC123'
 
-- Find orders affected by a cancelled flight
SELECT orders.pnr, orders.passenger
FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE flights.status = 'CANCELLED'

IDs and references

Every document has an _id field. If you do not supply one on insert, a sequential time-ordered 128-bit ID is generated. Time-ordering keeps recently-written documents near each other on disk, improving cache locality.

For references, you can:

Use the auto-generated _id from the target document.
Use a domain-meaningful ID (like flightId: "AA100_2024-06-15").
Index both sides of the reference for fast JOIN performance.

Tip: domain-meaningful IDs are often better than auto-generated ones for references. They are easier to read in logs, easier to construct in tests, and more stable across environments.

Index patterns

Indexing decisions should follow your query patterns. Here are the common ones.

Primary lookup


CREATE UNIQUE INDEX idx_pnr ON orders (pnr)

Sub-field lookup (search by passenger)


CREATE INDEX idx_lastname ON orders (passenger.lastName)

First array element (common in reservations)


CREATE INDEX idx_first_flight ON orders (flights[0].flightNumber)

Composite for multi-field filters


CREATE INDEX idx_status_created ON orders (status, createdAt)

Range queries


-- Index on a numeric/date field, then use range predicates
CREATE INDEX idx_fare ON orders (fare.total)
SELECT * FROM orders WHERE fare.total > 1000 ORDER BY fare.total LIMIT 100

Schema evolution

DocumentForge is schemaless. You can add new fields to documents at any time without migrating existing data. Old documents simply do not have the new field.

When you query a field that some documents do not have, they evaluate as null. This is usually what you want:


-- Find documents that DO have the new field
SELECT * FROM orders WHERE loyaltyTier IS NOT NULL
 
-- Find orders from before the loyaltyTier field existed
SELECT * FROM orders WHERE loyaltyTier IS NULL

Best practice: when adding a new field, either default it to a sensible value in your application code, or make your queries tolerate its absence. Avoid “migrations” that rewrite every document — it is rarely necessary.

History as a feature

The embed-the-snapshot pattern gives you something surprisingly valuable: a historical record. An order booked in 2023 remembers the 2023 fare, the 2023 cabin class, the 2023 passenger name — even if the airline re-fared the flight or the passenger changed their email.

Do not fight this — embrace it. Call the embedded bit bookedFlights or snapshotAtBooking to make the intent explicit.

When you need current information, query the referenced collection. When you need history, read the embedded snapshot. Both are right.