Skip to main content

How Stream Data Archives Work

This page is the architectural foundation for stream data: the mental model and vocabulary you need before reading either the full Stream Data Archives reference or the Rollups & recompute deep-dive. It explains how the pieces fit together — the two-store split, the three archive types, how rows are stored, how an archive moves through its lifecycle, and what happens on the write and read paths — and then hands off to the reference and deep-dive pages for the exhaustive detail.

Where the detail lives

This page deliberately stays at architecture altitude. For the field-by-field reference (the full CK model YAML, column tables, exception catalog, metrics, API surface) read Stream Data Archives. For the rollup recompute mechanics read Rollups & recompute; the stream-data terminology glossary defines tick, bucket, watermark, generation and the rest.

An archive lives in two stores

An archive is not a single object in a single database. It is split across the two stores OctoMesh already uses, by what each store is good at:

StoreWhat it holds for an archive
MongoDB (runtime graph)The archive's definition (target CK type, column list, bucket/aggregation config) and its runtime-state attributes (Status, the rollup watermark, dirty-window ledger, recompute health). The archive is a runtime entity — an instance of the System.StreamData CK model.
CrateDB (time-series plane)The bulk time-series rows — one per-tenant, per-archive table holding potentially billions of datapoints.

This is the single most important thing to internalise: the runtime/Mongo side is the source of truth for the archive's identity and state, while the time-series rows live in CrateDB. The lifecycle service keeps the two reconcilable without a distributed transaction (see Reconciliation in the reference).

Contrast this with runtime queries, which read the current state of entities straight from MongoDB. Stream data queries read history from CrateDB, keyed back to the entity that produced it by its rtId.

The three archive types

The abstract base System.StreamData/Archive has three concrete subtypes. They differ in their time axis and, crucially, in where their rows come from:

Type (CK type)Time axisRows produced by
System.StreamData/RawArchiveone timestamp per rowexternal producers (pipelines, SDK) — instant measurements
System.StreamData/TimeRangeArchiveexplicit [from, to) per rowexternal producers — already pre-aggregated data (smart-meter intervals, weather APIs)
System.StreamData/RollupArchiveone bucket [bucketStart, bucketEnd) per rowthe system rollup orchestrator only — never written externally

A rollup aggregates from a source (base) archive. The source can be a raw archive, a time-range archive, or another rollup — so rollups form chains (rollup-of-rollup). A daily rollup might read 15-minute raw datapoints; a yearly rollup might read that daily rollup in turn. Raw and time-range archives accept external data; rollup columns are generated server-side and direct writes are rejected.

See Archive Subtypes for the per-type ingestion paths and use cases.

How rows are stored

Each archive maps to exactly one CrateDB table in the tenant's schema. The table's columns are derived differently depending on the type:

  • Raw / time-range archives — columns are derived from the configured CK attribute paths, e.g. Amount.Value, ObisCode. Each captured path becomes a real, typed CrateDB column; nested paths and arrays map to OBJECT/ARRAY columns.
  • Rollup archives — columns are derived from the configured aggregations by RollupColumnGenerator, e.g. amountvalue_sum, dataquality_max. There is exactly one row per bucket per entity.

Every table also carries the standard bookkeeping columns: TargetCkTypeId (the concrete CK type, relevant under inheritance), and rows are keyed back to the producing entity by its source rtId. Columns can be marked indexed or non-indexed per column. Rollup tables additionally carry a generation column — part of the primary key — which is what makes reader-safe recompute possible (see read path below).

Schema is frozen after activation

Once an archive is Activated, its TargetCkTypeId and column set are immutable — a schema change means a new archive. This is what lets the data plane assume a stable shape for every insert and query. See Storage Layout for the full column-mapping rules.

The lifecycle

An archive moves through the CkArchiveStatus states. The status is what gates the whole data plane — inserts and queries are accepted only while an archive is Activated:

StatusCrateDB tableReads / writes
Creatednot provisioned yetrejected — definition exists, no table
Activatedprovisioned, schema frozenaccepted
Disabledpreserved (kept)rejected — re-enable to resume
Failedmay be partialrejected — activation DDL failed; retry required

Activation is where the CrateDB DDL runs (CREATE TABLE) and the schema freezes. It deliberately does not backfill history: a rollup's watermark starts at activation time, so only data from activation forward is rolled up automatically. Populating history is an explicit operator action (a recompute over the historical range). An implicit full-history scan on activation would be an unbounded, surprising operation.

See Lifecycle for the full state diagram (including delete and retry transitions) and the three-tier Activation Layers gate (instance → tenant → archive).

The write path

External data reaches a raw or time-range archive through the mesh-adapter pipeline nodes:

  • SaveStreamDataInArchive@1 → a RawArchive
  • SaveTimeRangeStreamDataInArchive@1 → a TimeRangeArchive

(The same data plane is reachable from the SDK via IStreamDataRepository.) Two things happen on every write that matter for the architecture:

  1. A source-rtId integrity guard ensures each row is attributed to a real producing entity before it is persisted.
  2. The retroactive-write detection hook runs. Every write is checked against the consumed watermark of any dependent rollups: a datapoint that lands before a window a rollup has already aggregated is flagged as a dirty window so it can be recomputed later. This is purely timestamp-vs-watermark and value-agnostic.

That detection step is the entry point to the entire recompute machinery — see How a change in a base archive is detected for exactly how the criterion works.

The read path

History is read through stream-data queries, available in four shapes — simple, aggregation, grouping, and downsampling — each as a persisted (saved) or transient (ad-hoc) variant. The Stream Data Access overview covers the query surface in full.

For rollup archives there is one extra step the reader never sees: the read path injects the active generation filter into every query, so a reader always resolves a single, consistent snapshot of each window — either the old values or the recomputed ones, never a half-written mix. This is what makes recompute reader-safe: a correction is computed into a new generation alongside the live one and committed with a single pointer flip. See Optimistic atomic swap for the mechanism.

Putting it together

The definition and state stay in Mongo; the rows stay in CrateDB; writes flow into the source table and cascade into rollups; reads resolve a consistent generation.

Where this leads

See Also