Skip to Content

SQL Reference

DocumentForge speaks a familiar SQL dialect extended with JSON paths: dots navigate into objects, brackets index arrays. Everything below runs through the dfdb CLI (query, repl) and over the REST API POST /query endpoint.

SELECT

SELECT * FROM orders SELECT pnr, passenger.lastName FROM orders SELECT * FROM orders WHERE pnr = 'ABC123' SELECT * FROM orders ORDER BY createdAt DESC LIMIT 100 OFFSET 50

Returns matching documents. The * selector returns the full document, including all nested fields. A comma-separated list projects specific fields, which may be nested paths or array elements.

Collection names

Names are case-insensitive and may contain letters, digits, dots, underscores, and hyphens. Dotted names like aerotoys.tax.users or v2.audit.events work in every position a collection name appears (FROM, JOIN, INSERT INTO, UPDATE, DELETE, CREATE INDEX) and through the REST API path:

-- All four refer to the SAME collection SELECT * FROM aerotoys.tax.users SELECT * FROM AeroToys.Tax.Users POST /collections/aerotoys.tax.users GET /collections/aerotoys.tax.users

Internally the catalog folds names to lowercase on store; lookups go through the same normalisation, so case-mismatched calls resolve to the same physical collection. The IndexDefinition.Name on indexes preserves the original casing supplied at creation time, but lookups are also case-insensitive — see CREATE / DROP INDEX.

Unknown collection

SQL against a collection that doesn’t exist returns a clear error rather than an empty result. The error message lists known collections and, when the only difference is case, suggests the canonical name:

{ "error": "Collection 'aerotoys.tax.environments' not found. Known collections: orders, flights, users, aerotoys.tax.rules." }
{ "error": "Collection 'TaxVersions' not found. Did you mean 'taxversions'? Names are stored case-insensitively." }

The HTTP status is 400 for these cases — no more silently-empty result sets to debug.

DISTINCT

Strip duplicate rows from the result. Place DISTINCT right after SELECT and before the field list. The dedupe key is the projected row, not the underlying document, so SELECT DISTINCT airline FROM flights returns one row per unique airline (the internal _id is omitted from the projection so duplicates collapse).

SELECT DISTINCT airline FROM flights SELECT DISTINCT airline, origin, destination FROM flights SELECT DISTINCT passenger.lastName FROM orders WHERE status = 'CONFIRMED' SELECT DISTINCT flights[0].departureAirport FROM orders LIMIT 10

Plays nicely with WHERE, ORDER BY, LIMIT, and OFFSET. With SELECT DISTINCT * every row is unique by definition (the _id field is per-document), so the modifier is effectively a no-op.

Query plans gain a + DISTINCT suffix. LIMIT is not pushed down to the scan when DISTINCT is present (otherwise the scan would short-circuit before all duplicates are seen).

JSON paths

Any place you can reference a field, you can use a JSON path. Dots navigate into objects; brackets index arrays.

PathMeaning
pnrTop-level field
passenger.lastNameNested object field
flights[0].flightNumberFirst array element’s field
flights[2].fare.totalNested inside array
flights[*].flightNumberAny array element (index-only)

The [*] wildcard is supported when creating indexes (index every element’s value), but in WHERE clauses you need a specific index. Use arrays with known positions for WHERE predicates, or combine with aggregations to filter-by-contents.

WHERE clauses

Supported operators: =, !=, >, <, >=, <=, LIKE, IS NULL, IS NOT NULL. Combine with AND / OR / NOT and parentheses.

SELECT * FROM orders WHERE status = 'CONFIRMED' AND passenger.age >= 18 SELECT * FROM users WHERE email LIKE '%@example.com' SELECT * FROM orders WHERE cancelledAt IS NULL SELECT * FROM orders WHERE (status = 'CONFIRMED' OR status = 'TICKETED') AND total > 1000

Equality uses indexes

A = predicate on an indexed path uses INDEX_SCAN. Range predicates (>, <, >=, <=) use INDEX_RANGE_SCAN when an index exists.

Composite indexes

An index on (status, createdAt) fires when both columns appear with equality in an AND clause. See CREATE INDEX.

OR of equalities on the same indexed column

WHERE col = a OR col = b OR col = c on an indexed col uses INDEX_SCAN_MULTI(idx_name, N keys) — one indexed lookup per branch, results unioned with dedupe. No more accidental collection scans for set lookups.

IN clause

WHERE col IN (v1, v2, …, vN) is the natural shape for “any of these”. Internally it lowers to col = v1 OR col = v2 OR …, so it fires the same multi-key index scan when an index covers the column.

SELECT id FROM rules WHERE id IN ('rule-bag-policy', 'rule-tier-bonus') SELECT * FROM orders WHERE status IN ('CONFIRMED', 'TICKETED') SELECT * FROM users WHERE email IN ('a@example.com', 'b@example.com', 'c@example.com') SELECT * FROM orders WHERE qty IN (1, 5, 10, 100)
  • Empty IN () is a parse error — at least one value is required.
  • Duplicate values (IN ('x', 'y', 'x')) are deduped at the result level, not the input level — fine to pass them in.
  • Numeric, string, boolean, and null literals all work.
  • Plan: INDEX_SCAN_MULTI(idx_name, N keys) with an index, COLLECTION_SCAN without.

ORDER BY · LIMIT · OFFSET

SELECT * FROM orders ORDER BY createdAt DESC LIMIT 50 OFFSET 100

LIMIT without ORDER BY is pushed down into the scan, meaning only N documents are materialized. With ORDER BY, the full matching set is materialized before sorting.

JOIN

A hash join between two collections on an equality predicate. Useful when you’ve referenced a shared entity (like a flight) rather than embedding it.

SELECT * FROM orders JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId WHERE orders.pnr = 'ABC123'

The result document has both collections nested under their name:

{ "orders": { "pnr": "ABC123", "bookedFlights": [] }, "flights": { "flightId": "AA100_2024-06-15", "status": "DELAYED" } }

In WHERE and SELECT, prefix paths with the collection name: orders.pnr, flights.status.

Aggregations

Five aggregate functions: COUNT, SUM, AVG, MIN, MAX. Combine freely in the SELECT list.

SELECT COUNT(*) FROM orders SELECT SUM(tickets[0].fare.total) FROM orders WHERE status = 'CONFIRMED' SELECT MIN(fare.total), MAX(fare.total), AVG(fare.total) FROM tickets

The result key in the output document matches the aggregate expression: COUNT(*), SUM(tickets[0].fare.total), etc.

GROUP BY

Bucket documents by one or more fields; aggregates run per bucket.

SELECT status, COUNT(*), SUM(fare.total) FROM orders GROUP BY status
SELECT airline, cabin, COUNT(*), AVG(fare.total) FROM orders GROUP BY airline, cabin

Each result row includes the grouping columns plus all aggregates.

INSERT

INSERT INTO orders VALUES { "pnr": "XYZ789", "status": "PENDING" }

The document is parsed as JSON. Missing _id fields are filled with a sequential time-ordered ID.

UPDATE

UPDATE orders SET status = 'CANCELLED' WHERE pnr = 'ABC123' UPDATE orders SET passenger.vip = true WHERE passenger.lastName = 'Smith'

Paths in SET can be nested. All matching documents are updated; the affected-count is returned.

DELETE

DELETE FROM orders WHERE status = 'CANCELLED' AND createdAt < '2020-01-01'

Deletes all matching documents. A DELETE without WHERE is supported but should be used with care.

CREATE / DROP INDEX

CREATE INDEX idx_pnr ON orders (pnr) CREATE UNIQUE INDEX idx_pnr ON orders (pnr) CREATE INDEX idx_status_created ON orders (status, createdAt) DROP INDEX idx_pnr ON orders

Indexes are built from existing data synchronously, then incrementally maintained. Unique indexes reject duplicates on insert. See Concepts → Indexes for index strategy.

Query plans

Every QueryResult returns a QueryPlan string describing how the query was executed.

PlanMeaning
INDEX_SCAN(name)Equality lookup via single-field index
INDEX_RANGE_SCAN(name)Range lookup via B-tree walk
INDEX_SCAN_COMPOSITE(name)Multi-field equality via composite index
COLLECTION_SCANFull scan, no matching index
... + HASH_JOIN(coll)Hash join was applied to the result
... + AGGREGATEAggregation / GROUP BY was applied

Check the plan during development — if you see COLLECTION_SCAN on a query you run often, add an index.

Last updated on