Data Engineering Terms Explained
A guide to key terms used in data engineering. Some entries include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.
For a complete list of the terms every data engineer should know, please check out the terms index.
Aggregate
Combine data from multiple sources into a single dataset.
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Anomaly Detection
Identify data points or events that deviate significantly from expected patterns or behaviors.
Anonymize
Remove personal or identifying information from data.
Append
Add or attach new records or data items to the end of an existing dataset, database table, file, or list.
Archive
Move rarely accessed data to a low-cost, long-term storage solution to reduce costs. Store data for long-term retention and compliance.
AsyncIO
Speed up execution with asynchronous I/O.
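A minimal sketch using Python's built-in asyncio; the fetch function and its one-second delays are stand-ins for real I/O such as API requests.

```python
import asyncio

async def fetch(source: str, delay: float) -> str:
    # Simulate a slow I/O call (e.g., an API request) with a sleep.
    await asyncio.sleep(delay)
    return f"data from {source}"

async def main() -> None:
    # Run both "requests" concurrently instead of sequentially.
    results = await asyncio.gather(fetch("api-a", 1.0), fetch("api-b", 1.0))
    print(results)  # finishes in ~1 second, not ~2

asyncio.run(main())
```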
Augment
Add new data or information to an existing dataset to enhance its value.
Auto-materialize
The automatic execution of computations and the persistence of their results.
Backpressure
A mechanism to handle situations where data is produced faster than it can be consumed.
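A minimal sketch of backpressure using a bounded queue from the standard library; the item counts and sleep durations are illustrative.

```python
import queue
import threading
import time

# A bounded queue applies backpressure: put() blocks once the buffer is
# full, throttling the producer to the consumer's pace.
buffer = queue.Queue(maxsize=5)

def producer() -> None:
    for i in range(20):
        buffer.put(i)  # blocks while five items sit unconsumed
    buffer.put(None)  # sentinel: production is done

def consumer() -> None:
    while (item := buffer.get()) is not None:
        time.sleep(0.05)  # the consumer is deliberately slower
        print("consumed", item)

threading.Thread(target=producer, daemon=True).start()
consumer()
```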
Backup
Create a copy of data to protect against loss or corruption.
Batch Processing
Process large volumes of data all at once in a single operation or batch.
Big Data Processing
Process large volumes of data in parallel and distributed computing environments to improve performance.
Cache
Store expensive computation results so they can be reused, not recomputed.
Categorize
Organize and classify data into categories, groups, or segments.
Checkpointing
Saving the state of a process at certain points so that it can be restarted from that point in case of failure.
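A minimal sketch of file-based checkpointing; the progress.json path and the process function are hypothetical stand-ins for real pipeline state and work.

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def process(record: int) -> None:
    pass  # stand-in for real per-record work

def load_checkpoint() -> int:
    # Resume from the last saved position, or start from the beginning.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_index"]
    return 0

def save_checkpoint(index: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_index": index}, f)

records = list(range(100))
for i in range(load_checkpoint(), len(records)):
    process(records[i])
    save_checkpoint(i + 1)  # after a crash, the run resumes here, not at zero
```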
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Compact
Reduce the size of data while preserving its essential information.
Compress
Reduce the size of data to save storage space and improve processing performance.
Consolidate
Combine multiple datasets into one to create a more comprehensive view of the data.
Cosine Similarity
A measure of the similarity between two vectors, used in text analysis, natural language processing, and related fields.
Curate
Select, organize, and annotate data to make it more useful for analysis and modeling.
De-identify
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Deduplicate
Identify and remove duplicate records or entries to improve data quality.
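A minimal example using pandas (see the packages page for installation); the user_id and email columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 2, 3], "email": ["a@x.io", "b@x.io", "b@x.io", "c@x.io"]}
)

# Keep the first occurrence of each fully duplicated row.
deduped = df.drop_duplicates()

# Or deduplicate on a subset of columns, e.g., one row per user_id.
one_per_user = df.drop_duplicates(subset=["user_id"], keep="first")
print(deduped)
```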
Denoise
Remove noise or artifacts from data to improve its accuracy and quality.
Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Derive
Extract, transform, and generate new data from existing datasets.
Deserialize
Deserialization is essentially the reverse process of serialization. See: 'Serialize'.
Dimensionality
The number of features or attributes in a dataset; reducing it can improve performance and simplify analysis.
Discretize
Transform continuous data into discrete categories or bins to simplify analysis.
Downsample
Reduce the amount of data for analysis, storage, or processing.
ETL
Extract, transform, and load data between different systems.
Encapsulate
The bundling of data with the methods that operate on that data.
Encode
Convert categorical variables into numerical representations for ML algorithms.
Enrich
Enhance data with additional information from external sources.
Explore
Understand the data, identify patterns, and gain insights.
Export
Extract data from a system for use in another system or application.
Extrapolate
Predict values outside a known range, based on the trends or patterns identified within the available data.
Fan-Out
A pipeline design in which one operation is broken into, or results in, many parallel downstream tasks.
Feature Extraction
Identify and extract relevant features from raw data for use in analysis or modeling.
Feature Selection
Identify and select the most relevant and informative features for analysis or modeling.
Filter
Extract a subset of data based on specific criteria or conditions.
Fragment
Break data down into smaller chunks for storage and management purposes.
Geospatial Analysis
Analyze data that has geographic or spatial components to identify patterns and relationships.
Graph Theory
The study of graphs (nodes connected by edges), a powerful way to model and understand intricate relationships within data systems.
Hash
Convert data into a fixed-length code to improve data security and integrity.
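A minimal example using Python's built-in hashlib; SHA-256 is one of several hash algorithms you might choose.

```python
import hashlib

record = "user@example.com"

# SHA-256 maps input of any length to a fixed 64-character hex digest.
digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
print(digest)

# The same input always yields the same digest, which is useful for
# integrity checks and stable record keys.
assert digest == hashlib.sha256(record.encode("utf-8")).hexdigest()
```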
Homogenize
Make data uniform, consistent, and comparable.
Idempotent
An operation that can be applied multiple times without changing the result beyond the initial application.
Impute
Fill in missing data values with estimated or substituted values to facilitate analysis.
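A minimal example using pandas; mean imputation is shown here, though medians or model-based estimates are common alternatives.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40]})

# Replace missing ages with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```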
Index
Create an optimized data structure for fast search and retrieval.
Ingest
The initial collection and import of data from various sources into your processing environment.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Interpolate
Use known data values to estimate unknown data values.
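A minimal example using pandas; linear interpolation is the default, and other methods (polynomial, time-based) are available.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Linear interpolation estimates the missing values from their neighbors.
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0
```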
Lineage
Understand how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Linearizability
A consistency guarantee that each individual operation on a distributed system appears to take effect instantaneously at a single point in time.
Linearize
Transform the relationship between variables to make a dataset approximately linear.
Load
Insert data into a database, data warehouse, or pipeline for processing.
Mask
Obfuscate sensitive data to protect its privacy and security.
Materialize
Execute a computation and persist the results to storage.
Memoize
Store the results of expensive function calls and reuse them when the same inputs occur again.
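A minimal example using functools.lru_cache from the standard library, shown on a deliberately expensive recursive function.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Without memoization this recursion is exponential; with it,
    # each fib(n) is computed once and then served from the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))
```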
Merge
Combine data from multiple datasets into a single dataset.
Mine
Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Model
Create a conceptual representation of data objects.
Monitor
Track data processing metrics and system health to ensure high availability and performance.
Multiprocessing
Optimize execution time with multiple parallel processes.
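A minimal sketch using the standard library's multiprocessing.Pool; the square function stands in for a real CPU-bound computation.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    # A stand-in for a CPU-bound computation.
    return n * n

if __name__ == "__main__":
    # Distribute the work across four worker processes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))
```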
Munge
See 'Wrangle'.
Named Entity Recognition
Locate and classify named entities in text into pre-defined categories.
NoSQL
Non-relational databases designed for scalability, schema flexibility, and optimized performance in specific use-cases.
Normality Testing
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Normalize
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
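A minimal sketch of min-max normalization in plain Python; the sample values are illustrative.

```python
values = [3.0, 10.0, 5.0, 8.0]

# Min-max normalization rescales values to the [0, 1] range so that
# features measured on different scales become comparable.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 1.0, ~0.286, ~0.714]
```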
Obfuscate
Make data unintelligible or difficult to understand.
Parallelize
Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.
Parse
Interpret and convert data from one format to another.
Partition
Divide data into smaller subsets to improve performance and manageability.
Pickle
Convert a Python object into a byte stream for efficient storage.
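A minimal example using the standard library's pickle module; the config dictionary is illustrative.

```python
import pickle

config = {"source": "s3://bucket/raw", "retries": 3}

# Serialize the object to a byte stream...
blob = pickle.dumps(config)

# ...and reconstruct an equivalent object from it later.
restored = pickle.loads(blob)
assert restored == config

# Caution: only unpickle data you trust; pickle can execute arbitrary code.
```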
Pre-aggregate
See 'Aggregate'.
Prep
Transform your data so it is fit-for-purpose.
Preprocess
Transform raw data before data analysis or machine learning modeling.
Primary Key
A unique identifier for a record in a database table that helps maintain data integrity.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Purge
Delete data that is no longer needed or relevant to free up storage space.
Race Condition
A flaw in which the outcome of an operation depends on the unpredictable timing of concurrent access to a shared resource.
Rebalance
Redistribute data across nodes or partitions for optimal performance.
Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.
Repartition
Redistribute data across multiple partitions for improved parallelism and performance.
Replicate
Create a copy of data for redundancy or distributed processing.
Reshape
Change the structure of data to better fit specific analysis or modeling requirements.
Sample
Extract a subset of data for exploratory analysis or to reduce computational complexity.
Scaling
Increasing the capacity or performance of a system to handle more data or traffic.
Schema Inference
Automatically identify the structure of a dataset.
Schema Mapping
Translate data from one schema or structure to another to facilitate data integration.
Scrape
Extract data from a website or another source.
Secondary Index
An additional index on non-primary-key columns that improves the efficiency of data retrieval in a database or storage system.
Secure
Protect data from unauthorized access, modification, or destruction.
Sentiment Analysis
Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Serialize
Convert data into a linear format for efficient storage and processing.
Shard
Partition a database into smaller, more manageable pieces.
Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Shuffle
Randomize the order of data records to improve analysis and prevent bias.
Skew
An imbalance in the distribution or representation of data.
Software-defined Asset
A declarative design pattern that represents a data asset through code.
Spill
Temporarily transfer data that exceeds available memory to disk.
Split
Divide a dataset into training, validation, and testing sets for machine learning model training.
Standardize
Transform data to a common unit or format to facilitate comparison and analysis.
Stored Procedure
Precompiled and stored SQL statements and procedural logic for easy database operations and complex data manipulations.
Synchronize
Ensure that data in different systems or databases are in sync and up-to-date.
Thread
Enable concurrent execution in Python by decoupling tasks that are not sequentially dependent.
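A minimal sketch using concurrent.futures.ThreadPoolExecutor; the download function simulates I/O-bound work with a sleep.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def download(url: str) -> str:
    # Simulate an I/O-bound call; threads shine when tasks wait on I/O.
    time.sleep(1.0)
    return f"contents of {url}"

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

# The three downloads overlap, finishing in ~1 second instead of ~3.
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(download, urls):
        print(result)
```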
Time Series Analysis
Analyze data over time to identify trends, patterns, and relationships.
Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.
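A minimal sketch of word tokenization with a regular expression; real NLP pipelines often use dedicated tokenizers.

```python
import re

text = "Data engineering: turning raw data into usable data!"

# A simple word tokenizer: extract runs of letters and digits.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['data', 'engineering', 'turning', 'raw', 'data', 'into', 'usable', 'data']
```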
Transform
Convert data from one format or structure to another.
Unstructured Data Analysis
Analyze unstructured data, such as text or images, to extract insights and meaning.
Upsert
Update a record or insert a new record if it does not yet exist.
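A minimal example using SQLite's INSERT ... ON CONFLICT syntax (available in SQLite 3.24+); the users table is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def upsert(user_id: int, name: str) -> None:
    # Insert the row, or update it if a row with this id already exists.
    conn.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (user_id, name),
    )

upsert(1, "Ada")
upsert(1, "Ada Lovelace")  # updates instead of failing on a duplicate id
print(conn.execute("SELECT * FROM users").fetchall())  # [(1, 'Ada Lovelace')]
```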
Validate
Check data for completeness, accuracy, and consistency.
Vectorize
Execute a single operation on multiple data points simultaneously.
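A minimal example using NumPy (see the packages page for installation); one array expression replaces an explicit Python loop over the elements.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 3, 4])

# One vectorized expression computes all revenues at once.
revenue = prices * quantities
print(revenue)  # [ 20.  60. 120.]
```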
Version
Maintain a history of changes to data for auditing and tracking purposes.
Wrangle
Convert unstructured data into a structured format.
About the artwork.
The art you see throughout the glossary was generated thanks to Midjourney and curated by the Dagster Labs team. It was inspired by some of the great artists of the 20th century (and some from earlier periods). See if you can recognize the ‘work’ of Marcel Duchamp, Frederic Remington, Keith Haring, Claes Oldenburg, Roy Lichtenstein, Wassily Kandinsky, and others.
Left: Daggy, as seen by René Magritte.