Storage and query engine
All pack data is stored as Parquet files. The query engine is DuckDB with
the httpfs extension, which allows DuckDB to read Parquet
directly from object storage (R2 / S3) without staging a full file locally.
Predicate pushdown applies at the Parquet reader level — row groups that
don't match the query filter are skipped before data is transferred.
For a typical filtered query (location, time range, metric), only the relevant row groups are read. This is why local mode is 2x to 6x faster than hosted for the same query: the Parquet files are on local disk and there is no network round trip for each row group fetch.
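The row-group skipping idea can be sketched in a few lines. This is an illustration of the pruning logic, not DuckDB's implementation: each Parquet row group carries per-column min/max statistics, and the stats below are made up for the example.

```python
# Each Parquet row group carries min/max statistics per column; groups
# whose stats cannot match the filter are skipped before any row data
# is transferred. Hypothetical stats, for illustration only.
row_groups = [
    {"id": 0, "ts_min": "2024-01-01", "ts_max": "2024-03-31"},
    {"id": 1, "ts_min": "2024-04-01", "ts_max": "2024-06-30"},
    {"id": 2, "ts_min": "2024-07-01", "ts_max": "2024-09-30"},
]

def groups_to_read(groups, ts_from, ts_to):
    """Keep only row groups whose [min, max] range overlaps the query window."""
    return [g["id"] for g in groups
            if g["ts_max"] >= ts_from and g["ts_min"] <= ts_to]

print(groups_to_read(row_groups, "2024-05-15", "2024-08-01"))  # [1, 2]
```

Over object storage, each skipped group is a range request that never happens, which is exactly where the local-disk advantage comes from.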
Catalog loading
On startup, the runtime reads a catalog file that maps pack identifiers to their Parquet paths, schema version, coverage metadata, and access tier. The active catalog determines what the runtime can see. Hosted, local, and self-hosted deployments all use the same catalog format — the difference is which catalog is loaded and what paths it points to.
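A catalog entry can be pictured as a small record per pack. The field names below are illustrative, not the runtime's actual schema; the point is that one loaded mapping defines everything the engine can see.

```python
import json
from dataclasses import dataclass

# Sketch of a catalog entry as described above; field names are
# illustrative, not the runtime's actual schema.
@dataclass
class CatalogEntry:
    pack_id: str
    parquet_path: str   # local disk path or object-storage URL
    schema_version: int
    coverage: dict      # e.g. time range, geography
    tier: str           # access tier, e.g. "free" or "paid"

def load_catalog(raw: str) -> dict:
    """Parse a catalog file into a pack_id -> entry mapping."""
    return {e["pack_id"]: CatalogEntry(**e) for e in json.loads(raw)}

catalog = load_catalog(json.dumps([{
    "pack_id": "earthquakes",
    "parquet_path": "s3://packs/earthquakes/part-0.parquet",
    "schema_version": 2,
    "coverage": {"from": "1970-01-01"},
    "tier": "free",
}]))
print(catalog["earthquakes"].parquet_path)
```

Because hosted, local, and self-hosted deployments share this format, switching deployments is a matter of pointing the loader at a different file.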
Discovery endpoints (GET /api/v1/catalog and
GET /api/v1/packs/{pack_id}) read from the loaded catalog and
return structured metadata. No query execution happens at discovery time.
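The discovery handlers can be sketched as pure reads over the loaded catalog. Handler names and response fields here are illustrative; the only claim carried over from the text is that no DuckDB query runs at discovery time.

```python
# Both discovery endpoints read the loaded catalog and return metadata
# only; no query execution happens here. Fields are illustrative.
CATALOG = {
    "earthquakes": {"schema_version": 2, "tier": "free"},
    "currency": {"schema_version": 1, "tier": "paid"},
}

def get_catalog():
    """GET /api/v1/catalog: list every visible pack."""
    return {"packs": sorted(CATALOG)}

def get_pack(pack_id):
    """GET /api/v1/packs/{pack_id}: metadata for one pack, or a 404."""
    entry = CATALOG.get(pack_id)
    return {"status": 404} if entry is None else {"status": 200, **entry}

print(get_catalog())  # {'packs': ['currency', 'earthquakes']}
```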
Query execution pipeline
A query arrives at POST /api/v1/query/dataset with a pack
identifier, filters (location, time range, metrics), and a row limit.
The pipeline:
- Validate the request against the catalog entry for the named pack.
- Check the access tier. Free packs proceed; paid packs return HTTP 402 with a payment challenge before execution.
- Translate the filters to a DuckDB SQL query with predicate pushdown on the relevant Parquet partition.
- Execute against the Parquet path in the catalog (local disk, or object storage via httpfs).
- Apply the row limit and return structured rows.
Requests that exceed the source maximum are rejected before the payment challenge — the engine does not charge for a query it will not execute.
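The pipeline above can be sketched as a planning function. Everything here is illustrative: the cap, the 422 rejection code, and the helper name are assumptions, and the real runtime executes the generated SQL through DuckDB rather than returning it. The sketch shows the ordering that matters: limit check, then tier check, then SQL generation.

```python
SOURCE_MAX_ROWS = 10_000  # hypothetical per-source cap

def plan_query(entry, loc_id, ts_from, ts_to, metrics, limit):
    # Reject over-limit requests before any payment challenge is issued.
    if limit > SOURCE_MAX_ROWS:
        return {"status": 422, "error": "row limit exceeds source maximum"}
    # Paid packs return a payment challenge before execution.
    if entry["tier"] == "paid":
        return {"status": 402, "challenge": "payment required"}
    # Filters become WHERE clauses DuckDB can push down to the Parquet reader.
    sql = (
        f"SELECT {', '.join(metrics)} "
        f"FROM read_parquet('{entry['parquet_path']}') "
        f"WHERE loc_id = ? AND ts BETWEEN ? AND ? "
        f"LIMIT {limit}"
    )
    return {"status": 200, "sql": sql, "params": [loc_id, ts_from, ts_to]}

plan = plan_query({"tier": "free", "parquet_path": "packs/earthquakes.parquet"},
                  "US-CA", "2024-01-01", "2024-02-01", ["magnitude"], 100)
print(plan["sql"])
```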
MCP routing
The MCP server at /mcp is a thin wrapper over the same
discovery and query pipeline. tools/list returns one tool per
pack from the active catalog. Tool calls translate to the same internal
query path as direct HTTP calls. There is no separate data layer behind MCP
— the same Parquet files, the same DuckDB engine, the same access tier
checks.
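The tools/list derivation can be sketched as a projection of the active catalog. The tool-naming convention and fields below are made up; the point from the text is that the tool list is generated from the same catalog that backs the HTTP endpoints, with no separate data layer.

```python
# tools/list derives one tool per pack from the active catalog; a tool
# call funnels into the same internal query path as a direct HTTP call.
# Tool names and fields here are illustrative.
def tools_list(catalog):
    return [
        {"name": f"query_{pack_id}", "description": entry["description"]}
        for pack_id, entry in sorted(catalog.items())
    ]

catalog = {
    "currency": {"description": "Daily FX rates"},
    "earthquakes": {"description": "Global seismic events"},
}
print([t["name"] for t in tools_list(catalog)])  # ['query_currency', 'query_earthquakes']
```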
Pack-specific facades (/mcp/currency,
/mcp/earthquakes, etc.) expose a single-pack tool list. These
exist for registry discoverability; they do not change the runtime
architecture.
Geography layer
Every pack normalizes location to loc_id, a shared geography
key. This key is hierarchical: country, region, and sub-region identifiers
share a prefix scheme so queries can match at any level without joining
separate geometry tables. See the loc_id guide
for the full reference.
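The hierarchical matching property can be sketched with string prefixes. The identifiers below are invented for the example; the real key format is defined in the loc_id guide.

```python
# Sketch of hierarchical prefix matching on loc_id: a query at any
# level matches itself and everything beneath it, with no join against
# a separate geometry table. Identifiers here are made up.
LOCS = ["US", "US-CA", "US-CA-SF", "US-NY", "FR", "FR-IDF"]

def match(loc_ids, query):
    return [l for l in loc_ids if l == query or l.startswith(query + "-")]

print(match(LOCS, "US-CA"))  # ['US-CA', 'US-CA-SF']
print(match(LOCS, "FR"))     # ['FR', 'FR-IDF']
```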
Pack-specific region identifiers (ocean basin codes for tsunamis, XOO for international waters) extend the base scheme rather than replacing it.
Pack release pipeline
A source data sheet specifies the raw source, schema mapping, field normalization, and QA requirements for a pack. A pack builder converts the source to Parquet against that spec, validates the output, and produces a catalog entry. The catalog entry is what the runtime loads — the pack does not exist to the engine until it appears in the catalog at or above the minimum release tier.
Release tiers (core, standard, experimental, private) signal quality and availability. The runtime applies the same query path regardless of tier; tier controls access and discovery visibility, not execution behavior.
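Tier gating at catalog load time can be sketched as a simple filter. The numeric ordering below is an assumption inferred from the listed tier order; the tier names come from the text, and the claim carried over is that tier affects visibility, never execution.

```python
# Sketch of release-tier gating: a pack exists to the engine only if
# its tier meets the minimum release tier. The rank values are an
# assumed ordering, not the runtime's actual configuration.
TIER_RANK = {"core": 3, "standard": 2, "experimental": 1, "private": 0}

def visible_packs(entries, minimum="standard"):
    floor = TIER_RANK[minimum]
    return [e["pack_id"] for e in entries if TIER_RANK[e["tier"]] >= floor]

entries = [
    {"pack_id": "currency", "tier": "core"},
    {"pack_id": "tsunamis", "tier": "experimental"},
]
print(visible_packs(entries))                  # ['currency']
print(visible_packs(entries, "experimental"))  # ['currency', 'tsunamis']
```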