SQL Reference

DocumentForge speaks a familiar SQL dialect extended with JSON paths: dots navigate into objects, brackets index arrays. Everything below runs through the dfdb CLI (query, repl) and over the REST API POST /query endpoint.

SELECT


SELECT * FROM orders
SELECT pnr, passenger.lastName FROM orders
SELECT * FROM orders WHERE pnr = 'ABC123'
SELECT * FROM orders ORDER BY createdAt DESC LIMIT 100 OFFSET 50

Returns matching documents. The * selector returns the full document, including all nested fields. A comma-separated list projects specific fields, which may be nested paths or array elements.

Collection names

Names are case-insensitive and may contain letters, digits, dots, underscores, and hyphens. Dotted names like aerotoys.tax.users or v2.audit.events work in every position a collection name appears (FROM, JOIN, INSERT INTO, UPDATE, DELETE, CREATE INDEX) and through the REST API path:


-- All four refer to the SAME collection
SELECT * FROM aerotoys.tax.users
SELECT * FROM AeroToys.Tax.Users
POST /collections/aerotoys.tax.users
GET  /collections/aerotoys.tax.users

Internally the catalog folds names to lowercase on store; lookups go through the same normalisation, so case-mismatched calls resolve to the same physical collection. The IndexDefinition.Name on indexes preserves the original casing supplied at creation time, but lookups are also case-insensitive — see CREATE / DROP INDEX.

Unknown collection

SQL against a collection that doesn’t exist returns a clear error rather than an empty result. The error message lists known collections and, when the only difference is case, suggests the canonical name:


{
  "error": "Collection 'aerotoys.tax.environments' not found.
            Known collections: orders, flights, users, aerotoys.tax.rules."
}


{
  "error": "Collection 'TaxVersions' not found.
            Did you mean 'taxversions'? Names are stored case-insensitively."
}

The HTTP status is 400 for these cases — no more silently-empty result sets to debug.

DISTINCT

Strip duplicate rows from the result. Place DISTINCT right after SELECT and before the field list. The dedupe key is the projected row, not the underlying document, so SELECT DISTINCT airline FROM flights returns one row per unique airline (the internal _id is omitted from the projection so duplicates collapse).


SELECT DISTINCT airline FROM flights
SELECT DISTINCT airline, origin, destination FROM flights
SELECT DISTINCT passenger.lastName FROM orders WHERE status = 'CONFIRMED'
SELECT DISTINCT flights[0].departureAirport FROM orders LIMIT 10

Plays nicely with WHERE, ORDER BY, LIMIT, and OFFSET. With SELECT DISTINCT * every row is unique by definition (the _id field is per-document), so the modifier is effectively a no-op.

Query plans gain a + DISTINCT suffix. LIMIT is not pushed down to the scan when DISTINCT is present (otherwise the scan would short-circuit before all duplicates are seen).

JSON paths

Any place you can reference a field, you can use a JSON path. Dots navigate into objects; brackets index arrays.

Path	Meaning
`pnr`	Top-level field
`passenger.lastName`	Nested object field
`flights[0].flightNumber`	First array element’s field
`flights[2].fare.total`	Nested inside array
`flights[*].flightNumber`	Any array element (index-only)

The [*] wildcard is supported when creating indexes (index every element’s value), but in WHERE clauses you need a specific index. Use arrays with known positions for WHERE predicates, or combine with aggregations to filter-by-contents.

WHERE clauses

Supported operators: =, !=, >, <, >=, <=, LIKE, IS NULL, IS NOT NULL. Combine with AND / OR / NOT and parentheses.


SELECT * FROM orders WHERE status = 'CONFIRMED' AND passenger.age >= 18
SELECT * FROM users  WHERE email LIKE '%@example.com'
SELECT * FROM orders WHERE cancelledAt IS NULL
SELECT * FROM orders WHERE (status = 'CONFIRMED' OR status = 'TICKETED') AND total > 1000

Equality uses indexes

A = predicate on an indexed path uses INDEX_SCAN. Range predicates (>, <, >=, <=) use INDEX_RANGE_SCAN when an index exists.

Composite indexes

An index on (status, createdAt) fires when both columns appear with equality in an AND clause. See CREATE INDEX.

OR of equalities on the same indexed column

WHERE col = a OR col = b OR col = c on an indexed col uses INDEX_SCAN_MULTI(idx_name, N keys) — one indexed lookup per branch, results unioned with dedupe. No more accidental collection scans for set lookups.

IN clause

WHERE col IN (v1, v2, …, vN) is the natural shape for “any of these”. Internally it lowers to col = v1 OR col = v2 OR …, so it fires the same multi-key index scan when an index covers the column.


SELECT id FROM rules WHERE id IN ('rule-bag-policy', 'rule-tier-bonus')
SELECT * FROM orders WHERE status IN ('CONFIRMED', 'TICKETED')
SELECT * FROM users WHERE email IN ('a@example.com', 'b@example.com', 'c@example.com')
SELECT * FROM orders WHERE qty IN (1, 5, 10, 100)

Empty IN () is a parse error — at least one value is required.
Duplicate values (IN ('x', 'y', 'x')) are deduped at the result level, not the input level — fine to pass them in.
Numeric, string, boolean, and null literals all work.
Plan: INDEX_SCAN_MULTI(idx_name, N keys) with an index, COLLECTION_SCAN without.

ORDER BY · LIMIT · OFFSET


SELECT * FROM orders ORDER BY createdAt DESC LIMIT 50 OFFSET 100

LIMIT without ORDER BY is pushed down into the scan, meaning only N documents are materialized. With ORDER BY, the full matching set is materialized before sorting.

JOIN

A hash join between two collections on an equality predicate. Useful when you’ve referenced a shared entity (like a flight) rather than embedding it.


SELECT * FROM orders
JOIN flights ON orders.bookedFlights[0].flightRef = flights.flightId
WHERE orders.pnr = 'ABC123'

The result document has both collections nested under their name:


{
  "orders":  { "pnr": "ABC123", "bookedFlights": [] },
  "flights": { "flightId": "AA100_2024-06-15", "status": "DELAYED" }
}

In WHERE and SELECT, prefix paths with the collection name: orders.pnr, flights.status.

Aggregations

Five aggregate functions: COUNT, SUM, AVG, MIN, MAX. Combine freely in the SELECT list.


SELECT COUNT(*) FROM orders
SELECT SUM(tickets[0].fare.total) FROM orders WHERE status = 'CONFIRMED'
SELECT MIN(fare.total), MAX(fare.total), AVG(fare.total) FROM tickets

The result key in the output document matches the aggregate expression: COUNT(*), SUM(tickets[0].fare.total), etc.

GROUP BY

Bucket documents by one or more fields; aggregates run per bucket.


SELECT status, COUNT(*), SUM(fare.total)
FROM orders
GROUP BY status


SELECT airline, cabin, COUNT(*), AVG(fare.total)
FROM orders
GROUP BY airline, cabin

Each result row includes the grouping columns plus all aggregates.

INSERT


INSERT INTO orders VALUES { "pnr": "XYZ789", "status": "PENDING" }

The document is parsed as JSON. Missing _id fields are filled with a sequential time-ordered ID.

UPDATE


UPDATE orders SET status = 'CANCELLED' WHERE pnr = 'ABC123'
UPDATE orders SET passenger.vip = true WHERE passenger.lastName = 'Smith'

Paths in SET can be nested. All matching documents are updated; the affected-count is returned.

DELETE


DELETE FROM orders WHERE status = 'CANCELLED' AND createdAt < '2020-01-01'

Deletes all matching documents. A DELETE without WHERE is supported but should be used with care.

CREATE / DROP INDEX


CREATE INDEX idx_pnr ON orders (pnr)
CREATE UNIQUE INDEX idx_pnr ON orders (pnr)
CREATE INDEX idx_status_created ON orders (status, createdAt)
DROP INDEX idx_pnr ON orders

Indexes are built from existing data synchronously, then incrementally maintained. Unique indexes reject duplicates on insert. See Concepts → Indexes for index strategy.

Query plans

Every QueryResult returns a QueryPlan string describing how the query was executed.

Plan	Meaning
`INDEX_SCAN(name)`	Equality lookup via single-field index
`INDEX_RANGE_SCAN(name)`	Range lookup via B-tree walk
`INDEX_SCAN_COMPOSITE(name)`	Multi-field equality via composite index
`COLLECTION_SCAN`	Full scan, no matching index
`... + HASH_JOIN(coll)`	Hash join was applied to the result
`... + AGGREGATE`	Aggregation / GROUP BY was applied

Check the plan during development — if you see COLLECTION_SCAN on a query you run often, add an index.