Glossary
Plain-English definitions of the data and database terms that come up around columnar analytics, time-series, observability, and infrastructure.
A
ACID stands for Atomicity, Consistency, Isolation, and Durability, the four guarantees that make database transactions reliable.
Anomaly detection finds data points that deviate from normal patterns, used for fraud detection, equipment monitoring, and security.
Apache Parquet is an open columnar file format for analytics. It compresses well, reads fast, and is supported by almost every data tool.
C
Cardinality is the number of unique values in a dataset or column. High cardinality can slow queries and drive up costs in time-series and observability systems.
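As a minimal sketch of the idea, cardinality can be measured by counting distinct values. The column names and sample data below are hypothetical, invented only for illustration.

```python
def cardinality(values):
    """Return the number of distinct values in a column."""
    return len(set(values))

# Hypothetical sample columns, invented for illustration.
status = ["ok", "ok", "error", "ok", "timeout"]
request_id = ["r-1", "r-2", "r-3", "r-4", "r-5"]

print(cardinality(status))      # 3 distinct values: low cardinality
print(cardinality(request_id))  # 5 distinct values: one per row, high cardinality
```

A column like the hypothetical request_id grows one new value per row, which is the pattern that tends to cause trouble when it is used as a tag or label.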
A columnar database stores data by column instead of by row, making large-scale analytics and aggregations dramatically faster.
Row storage keeps each record together and suits transactions. Columnar storage keeps each column together and suits analytics.
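The two layouts can be sketched with plain Python structures. This is an in-memory illustration only, not how any particular database lays out bytes on disk, and the records are hypothetical.

```python
# Hypothetical records, invented for illustration.
rows = [
    {"id": 1, "city": "Oslo", "temp": 4.0},
    {"id": 2, "city": "Lima", "temp": 19.5},
    {"id": 3, "city": "Oslo", "temp": 5.5},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Transaction-style access: fetch one whole record.
# In the row layout this is a single lookup.
record_2 = rows[1]

# Analytics-style access: average one field.
# In the column layout this touches only the "temp" list.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```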
Compaction merges many small data files into fewer large ones, cutting storage cost and making queries faster.
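A rough sketch of the merge step, assuming each small file is already sorted by timestamp and is represented here as a Python list. The function name, the target_rows parameter, and the sample data are hypothetical. Real systems also handle metadata, deletes, and concurrency, which this leaves out.

```python
import heapq

def compact(small_files, target_rows):
    """Merge sorted small files into fewer files of up to target_rows rows each."""
    merged = heapq.merge(*small_files)  # yields rows from all inputs in sorted order
    compacted, current = [], []
    for row in merged:
        current.append(row)
        if len(current) == target_rows:
            compacted.append(current)
            current = []
    if current:
        compacted.append(current)  # final, possibly smaller, file
    return compacted

# Hypothetical "files": each is a small sorted list of (timestamp, value) rows.
small_files = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (5, "e")],
    [(3, "c")],
    [(6, "f"), (7, "g")],
]
large_files = compact(small_files, target_rows=4)
print(len(small_files), "files ->", len(large_files), "files")
```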
D
Data compression shrinks stored data to save space and speed up queries by reducing how much must be read from disk. Columnar data compresses especially well.
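One reason columnar data compresses well is that a single column often holds long runs of repeated values. Run-length encoding is one simple technique that exploits this. The sketch below is an illustration of that one technique, not a description of what any specific format does, and the column contents are hypothetical.

```python
def rle_encode(values):
    """Run-length encode a list into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original list."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

# Hypothetical "region" column with repeated neighbouring values.
column = ["eu"] * 4 + ["us"] * 3 + ["eu"] * 2
encoded = rle_encode(column)
print(len(column), "values stored as", len(encoded), "runs")
```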
A data historian is a system that records time-series data from industrial equipment and processes. Traditional historians are proprietary and costly.
Data ingestion is the process of importing data from sources into a storage or analytics system. It can be batch or streaming.
A data lake is a central repository that stores raw data of any type at scale, usually as files on cheap object storage.
A data retention policy defines how long data is kept before it is deleted or archived. It balances cost, compliance, and usefulness.
Downsampling reduces the resolution of time-series data by aggregating it into larger time buckets, saving storage at the cost of detail.
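A minimal sketch of downsampling, assuming timestamps are Unix seconds and the aggregate is a mean. The function name and the sample readings are hypothetical. Other aggregates such as min, max, or last would follow the same shape.

```python
from collections import defaultdict

def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# Hypothetical readings: (unix_seconds, value), one every 20 seconds.
points = [(0, 10.0), (20, 12.0), (40, 14.0), (60, 20.0), (80, 22.0), (100, 24.0)]
per_minute = downsample(points, bucket_seconds=60)
print(per_minute)  # [(0, 12.0), (60, 22.0)]
```

Six points become two, which is the storage saving. The individual readings inside each minute are no longer recoverable, which is the lost detail.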
E
Edge computing processes data close to where it is generated, instead of sending everything to a central cloud. It cuts latency and bandwidth.
Event sourcing stores every change to application state as an immutable sequence of events, rather than just the current state.
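To illustrate, here is a small sketch in which the current state is never stored directly and is instead rebuilt by replaying events. The account example, event names, and function names are all hypothetical.

```python
def apply_event(balance, event):
    """Return the new balance after applying one (kind, amount) event."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")

def replay(events, initial=0):
    """Rebuild state by applying every event in order."""
    state = initial
    for event in events:
        state = apply_event(state, event)
    return state

# The event log is an immutable tuple: events are appended, never edited.
events = (("deposited", 100), ("withdrawn", 30), ("deposited", 5))

current_balance = replay(events)        # state now
earlier_balance = replay(events[:2])    # state as of the second event
print(current_balance, earlier_balance)  # 75 70
```

Because the full history is kept, any past state can be reconstructed by replaying a prefix of the log.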
Eventual consistency means all copies of data become consistent over time, but not instantly. It trades immediate consistency for availability.
H
High cardinality means a very large number of unique values. It is a common cause of slow queries and runaway costs in monitoring systems.
Hot storage is fast but expensive and holds frequently used data. Cold storage is slow but cheap and holds rarely used data. Tiering balances the two.
L
A lakehouse combines the low cost and openness of a data lake with the performance and structure of a data warehouse.
Line protocol is a simple text format for writing time-series data points, popularized by InfluxDB. Each line is one measurement with tags and fields.
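The sketch below builds one line in the general shape of InfluxDB's line protocol: a measurement, comma-separated tags, a space, comma-separated fields, a space, and a timestamp. It is a simplified illustration, not a complete implementation of the format. The function names are hypothetical, the timestamp is assumed to be in nanoseconds, and the escaping covers only the common cases, so check the official reference before relying on it.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text

def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, or quoted strings."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_line(measurement, tags, fields, timestamp_ns):
    """Format one data point as a single line (simplified sketch)."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"

# Hypothetical data point, invented for illustration.
line = to_line(
    "cpu",
    {"host": "server 01", "region": "eu"},
    {"usage": 0.64, "cores": 8},
    1700000000000000000,
)
print(line)
# cpu,host=server\ 01,region=eu usage=0.64,cores=8i 1700000000000000000
```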
M
A materialized view stores the precomputed result of a query, so repeated reads are fast. It trades storage and freshness for query speed.
MELT stands for Metrics, Events, Logs, and Traces, the four core data types of observability. Unifying them is a major challenge.
O
Object storage, such as Amazon S3, keeps data as objects in flat buckets. It is cheap, scalable, and the foundation of modern data lakes.
Observability is the ability to understand a system's internal state from its outputs, like metrics, logs, and traces. It is key to running reliable software.
OLAP is a class of database workload focused on fast analytical queries over large datasets, like aggregations, rollups, and reporting.
OLAP handles analytical queries over large datasets. OLTP handles many small transactions.
An open format database stores your data in a non-proprietary format like Parquet, so you can read it with other tools and avoid lock-in.
OpenTelemetry is an open standard for collecting metrics, logs, and traces from software. It frees telemetry from any single vendor.
An order book is the real-time list of buy and sell orders for an asset, organized by price. Reconstructing it from history is a hard data problem.
P
Partition pruning skips entire partitions of data that a query does not need, dramatically reducing how much data must be scanned.
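A minimal sketch of the idea, assuming a table partitioned by day and held here as a Python dict keyed by date. The scan function and the sample rows are hypothetical. The point is that the skip decision uses only the partition key, so the rows in a pruned partition are never read.

```python
from datetime import date

# Hypothetical table partitioned by day.
partitions = {
    date(2024, 1, 1): [("a", 1), ("b", 2)],
    date(2024, 1, 2): [("c", 3)],
    date(2024, 1, 3): [("d", 4), ("e", 5)],
}

def scan(partitions, start, end):
    """Return rows from partitions whose day is in [start, end], and how many were read."""
    rows, scanned = [], 0
    for day, part_rows in partitions.items():
        if day < start or day > end:
            continue  # pruned: decided from the partition key alone
        scanned += 1
        rows.extend(part_rows)
    return rows, scanned

rows, scanned = scan(partitions, date(2024, 1, 2), date(2024, 1, 3))
print(f"read {scanned} of {len(partitions)} partitions")  # read 2 of 3 partitions
```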
Predicate pushdown moves query filters as close to the data as possible, so less data is read and processed. It is key to fast analytics.
Predictive maintenance uses sensor data to predict equipment failure before it happens, reducing downtime and avoiding unnecessary repairs.
R
Real-time analytics means analyzing data as it arrives, so insights are available in seconds rather than hours. It needs both fast ingestion and fast queries.
A relational database organizes data into tables of rows and columns with defined relationships, queried using SQL. It is the classic OLTP design.
S
Schema on read applies structure to data when you query it, not when you store it. It offers flexibility for evolving and varied data.
Sensor data is the stream of measurements produced by physical sensors, like temperature, vibration, and pressure. At fleet scale it is high volume.
A SQL query engine parses, plans, and executes SQL queries against data. Modern engines can run directly on open files like Parquet.
Storage tiering automatically moves data between fast expensive storage and cheap slow storage based on how often it is accessed.
Stream processing handles data continuously as it arrives, rather than in scheduled batches. It powers real-time pipelines and analytics.
T
Telemetry data is automatically collected measurements sent from remote sources like sensors, servers, and devices for monitoring and analysis.
Tick data is the record of every individual trade and quote in a market, typically timestamped to the microsecond or finer. It is the rawest form of market data.
Time bucketing groups time-series data into fixed intervals, like per minute or per hour, to summarize and chart trends over time.
Time-based partitioning splits a table into segments by time, so time-range queries read only the relevant segments and run faster.
Time-series data is a sequence of data points indexed by time. Examples include metrics, sensor readings, stock prices, and application events.
A time-series database is built to store and query data points indexed by time, like metrics, sensor readings, events, and financial ticks.
V
Vectorized execution processes data in batches of values at once instead of row by row, making analytical queries much faster.
Vendor lock-in occurs when switching away from a product is so costly or difficult that you stay even though you would rather leave. Open formats are the main defense.
W
A window function performs a calculation, such as a running total or moving average, across a set of rows related to the current row, without collapsing those rows into one.
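The behavior can be sketched in plain Python: each input row gets its own output value, computed over a frame of neighbouring rows. The function names and sample data are hypothetical, and the moving average here assumes a frame of the current row plus the preceding rows.

```python
from itertools import accumulate

def running_total(values):
    """Cumulative sum: one output per input row."""
    return list(accumulate(values))

def moving_average(values, window):
    """Average of the current value and up to window - 1 preceding values."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

# Hypothetical daily sales, invented for illustration.
sales = [10, 20, 30, 40]
totals = running_total(sales)               # [10, 30, 60, 100]
averages = moving_average(sales, window=2)  # [10.0, 15.0, 25.0, 35.0]

# Unlike a plain aggregate, the row count is unchanged.
print(len(sales), len(totals), len(averages))  # 4 4 4
```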
A write-ahead log records changes before they are applied, so a database can recover without data loss after a crash.
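Here is a toy sketch of the pattern: every write is appended to a log file and flushed to disk before the in-memory state changes, and a restart rebuilds state by replaying the log. The TinyStore class, the JSON log format, and the file name are hypothetical. Real databases add checkpoints, checksums, and log truncation, which this omits.

```python
import json
import os
import tempfile

class TinyStore:
    """Hypothetical key-value store that logs each write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild in-memory state from the log, if one exists."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # Step 1: record the change durably in the log.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Step 2: only then apply it to the live state.
        self.data[key] = value

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(log_path)
store.put("a", 1)
store.put("b", 2)

# Simulate a crash and restart: a fresh instance recovers from the log alone.
recovered = TinyStore(log_path)
print(recovered.data)  # {'a': 1, 'b': 2}
```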