Data Modeling
When to embed, when to reference — the decision that shapes every query you will ever write.
Two patterns
Every document database gives you two ways to relate entities. Both are valid. The mistake is picking one dogmatically.
Embed — nest the related data inside the parent
{
"pnr": "ABC123",
"passenger": { "firstName": "John", "lastName": "Smith" },
"flights": [
{ "flightNumber": "AA100", "cabin": "BUSINESS", "fare": 450 }
]
}Reference — store an ID; use JOIN when needed
// orders collection
{ "pnr": "ABC123", "flightIds": ["AA100_2024-06-15"] }
// flights collection
{ "flightId": "AA100_2024-06-15", "status": "DELAYED", "gate": "B12" }The rule of thumb
| Factor | Embed | Reference |
|---|---|---|
| Lifetime | Dies with parent | Lives independently |
| Access pattern | Always read together | Sometimes alone |
| Shared by many | No — owned | Yes |
| Update frequency | Rarely / as a whole | Often, independently |
| Bounded size | Yes (small N) | Can be large |
| Atomic read | Single query | Needs JOIN |
The rule: embed what the parent owns. Reference what exists independently.
Case study: Airline PNR
This is where the two patterns really shine together. Real production airline systems combine them — and it is elegant.
Embed inside the order (snapshot at booking time)
{
"pnr": "ABC123",
"createdAt": "2024-06-01T10:30:00Z",
"passenger": { // embed — owned by this order
"firstName": "John",
"lastName": "Smith",
"frequentFlyerNumber": "AA123456"
},
"bookedFlights": [ // embed — SNAPSHOT of what was booked
{
"flightNumber": "AA100",
"cabin": "BUSINESS",
"fareAtBooking": 450,
"flightRef": "AA100_2024-06-15" // pointer to current state
}
],
"payment": { // embed — belongs to this order
"method": "CREDIT_CARD",
"last4": "4242",
"amount": 450
}
}Reference separately (current state)
// flights — current operational state, changes independently
{
"flightId": "AA100_2024-06-15",
"status": "DELAYED",
"actualDepartureTime": "...",
"gate": "B23",
"aircraftRegistration": "N12345"
}
// airports — master data
{ "code": "JFK", "name": "John F. Kennedy", "timezone": "America/New_York" }
// aircraft — fleet data
{ "registration": "N12345", "type": "B738", "status": "IN_SERVICE" }Why this is the right design
- One query reads an order — all customer-facing info is embedded.
- Flight status updates do not rewrite every order — one row in
flights. - History is preserved — “I booked business class for $450” even if the flight is cancelled.
- JOIN when you need live status — see below.
Joining them
-- Get order with current flight status
SELECT * FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE orders.pnr = 'ABC123'
-- Find orders affected by a cancelled flight
SELECT orders.pnr, orders.passenger
FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE flights.status = 'CANCELLED'IDs and references
Every document has an _id field. If you do not supply one on insert, a
sequential time-ordered 128-bit ID is generated. Time-ordering keeps
recently-written documents near each other on disk, improving cache locality.
For references, you can:
- Use the auto-generated
_idfrom the target document. - Use a domain-meaningful ID (like
flightId: "AA100_2024-06-15"). - Index both sides of the reference for fast JOIN performance.
Tip: domain-meaningful IDs are often better than auto-generated ones for references. They are easier to read in logs, easier to construct in tests, and more stable across environments.
Index patterns
Indexing decisions should follow your query patterns. Here are the common ones.
Primary lookup
CREATE UNIQUE INDEX idx_pnr ON orders (pnr)Sub-field lookup (search by passenger)
CREATE INDEX idx_lastname ON orders (passenger.lastName)First array element (common in reservations)
CREATE INDEX idx_first_flight ON orders (flights[0].flightNumber)Composite for multi-field filters
CREATE INDEX idx_status_created ON orders (status, createdAt)Range queries
-- Index on a numeric/date field, then use range predicates
CREATE INDEX idx_fare ON orders (fare.total)
SELECT * FROM orders WHERE fare.total > 1000 ORDER BY fare.total LIMIT 100Schema evolution
DocumentForge is schemaless. You can add new fields to documents at any time without migrating existing data. Old documents simply do not have the new field.
When you query a field that some documents do not have, they evaluate as
null. This is usually what you want:
-- Find documents that DO have the new field
SELECT * FROM orders WHERE loyaltyTier IS NOT NULL
-- Find orders from before the loyaltyTier field existed
SELECT * FROM orders WHERE loyaltyTier IS NULLBest practice: when adding a new field, either default it to a sensible value in your application code, or make your queries tolerate its absence. Avoid “migrations” that rewrite every document — it is rarely necessary.
History as a feature
The embed-the-snapshot pattern gives you something surprisingly valuable: a historical record. An order booked in 2023 remembers the 2023 fare, the 2023 cabin class, the 2023 passenger name — even if the airline re-fared the flight or the passenger changed their email.
Do not fight this — embrace it. Call the embedded bit bookedFlights or
snapshotAtBooking to make the intent explicit.
When you need current information, query the referenced collection. When you need history, read the embedded snapshot. Both are right.