Data Engineering Terms Explained
A guide to key terms used in data engineering. Some entries include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.
For a complete list of the terms every data engineer should know, please check out the terms index.
Aggregate
Combine data from multiple sources into a single dataset.
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Anomaly Detection
Identify data points or events that deviate significantly from expected patterns or behaviors.
Anonymize
Remove personal or identifying information from data.
Append
Add or attach new records or data items to the end of an existing dataset, database table, file, or list.
Archive
Move rarely accessed data to a low-cost, long-term storage solution to reduce costs. Store data for long-term retention and compliance.
AsyncIO
Speed up execution with asynchronous I/O.
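A minimal sketch using Python's built-in asyncio; the fetch function and its one-second delays are stand-ins for real I/O such as API requests.

```python
import asyncio

async def fetch(source: str, delay: float) -> str:
    # Simulate a slow I/O call (e.g., an API request) with a sleep.
    await asyncio.sleep(delay)
    return f"data from {source}"

async def main() -> None:
    # Run both "requests" concurrently instead of sequentially.
    results = await asyncio.gather(fetch("api-a", 1.0), fetch("api-b", 1.0))
    print(results)  # finishes in ~1 second, not ~2

asyncio.run(main())
```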
Augment
Add new data or information to an existing dataset to enhance its value.
Auto-materialize
The automatic execution of computations and the persistence of their results.
Backpressure
A mechanism to handle situations where data is produced faster than it can be consumed.
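A minimal sketch of backpressure using a bounded queue from the standard library; the item counts and sleep durations are illustrative.

```python
import queue
import threading
import time

# A bounded queue applies backpressure: put() blocks once the buffer is
# full, throttling the producer to the consumer's pace.
buffer = queue.Queue(maxsize=5)

def producer() -> None:
    for i in range(20):
        buffer.put(i)  # blocks while five items sit unconsumed
    buffer.put(None)  # sentinel: production is done

def consumer() -> None:
    while (item := buffer.get()) is not None:
        time.sleep(0.05)  # the consumer is deliberately slower
        print("consumed", item)

threading.Thread(target=producer, daemon=True).start()
consumer()
```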
Backup
Create a copy of data to protect against loss or corruption.
Batch Processing
Process large volumes of data all at once in a single operation or batch.
Big Data Processing
Process large volumes of data in parallel and distributed computing environments to improve performance.
Cache
Store expensive computation results so they can be reused, not recomputed.
Categorize
Organize and classify data into categories, groups, or segments.
Checkpointing
Saving the state of a process at certain points so that it can be restarted from that point in case of failure.
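A minimal sketch of file-based checkpointing; the progress.json path and the process function are hypothetical stand-ins for real pipeline state and work.

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def process(record: int) -> None:
    pass  # stand-in for real per-record work

def load_checkpoint() -> int:
    # Resume from the last saved position, or start from the beginning.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_index"]
    return 0

def save_checkpoint(index: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_index": index}, f)

records = list(range(100))
for i in range(load_checkpoint(), len(records)):
    process(records[i])
    save_checkpoint(i + 1)  # after a crash, the run resumes here, not at zero
```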
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Compact
Reduce the size of data while preserving its essential information.
Compress
Reduce the size of data to save storage space and improve processing performance.
Consolidate
Combine multiple datasets into one to create a more comprehensive view of the data.
Cosine Similarity
A measure of the similarity between two vectors, used in text analysis, natural language processing, and related fields.
Curate
Select, organize, and annotate data to make it more useful for analysis and modeling.
De-identify
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Deduplicate
Identify and remove duplicate records or entries to improve data quality.
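A minimal example using pandas (see the packages page for installation); the user_id and email columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 2, 3], "email": ["a@x.io", "b@x.io", "b@x.io", "c@x.io"]}
)

# Keep the first occurrence of each fully duplicated row.
deduped = df.drop_duplicates()

# Or deduplicate on a subset of columns, e.g., one row per user_id.
one_per_user = df.drop_duplicates(subset=["user_id"], keep="first")
print(deduped)
```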
Denoise
Remove noise or artifacts from data to improve its accuracy and quality.
Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Derive
Extract, transform, and generate new data from existing datasets.
Deserialize
Deserialization is essentially the reverse process of serialization. See: 'Serialize'.
Dimensionality
The number of features or attributes in a dataset; reducing it can improve performance and simplify analysis.
Discretize
Transform continuous data into discrete categories or bins to simplify analysis.
Downsample
Reduce the amount of data for analysis, storage, or processing.
ETL
Extract, transform, and load data between different systems.
Encapsulate
The bundling of data with the methods that operate on that data.
Encode
Convert categorical variables into numerical representations for ML algorithms.
Enrich
Enhance data with additional information from external sources.
Explore
Understand the data, identify patterns, and gain insights.
Export
Extract data from a system for use in another system or application.
Extrapolate
Predict values outside a known range, based on the trends or patterns identified within the available data.
Fan-Out
A pipeline design in which one operation is broken into, or results in, many parallel downstream tasks.
Feature Extraction
Identify and extract relevant features from raw data for use in analysis or modeling.
Feature Selection
Identify and select the most relevant and informative features for analysis or modeling.
Filter
Extract a subset of data based on specific criteria or conditions.
Fragment
Break data down into smaller chunks for storage and management purposes.
Geospatial Analysis
Analyze data that has geographic or spatial components to identify patterns and relationships.
Graph Theory
The study of graphs (nodes connected by edges), a powerful way to model and understand intricate relationships within data systems.
Hash
Convert data into a fixed-length code to improve data security and integrity.
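A minimal example using Python's built-in hashlib; SHA-256 is one of several hash algorithms you might choose.

```python
import hashlib

record = "user@example.com"

# SHA-256 maps input of any length to a fixed 64-character hex digest.
digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
print(digest)

# The same input always yields the same digest, which is useful for
# integrity checks and stable record keys.
assert digest == hashlib.sha256(record.encode("utf-8")).hexdigest()
```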
Homogenize
Make data uniform, consistent, and comparable.
Idempotent
An operation that can be applied multiple times without changing the result beyond the initial application.
Impute
Fill in missing data values with estimated or substituted values to facilitate analysis.
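A minimal example using pandas; mean imputation is shown here, though medians or model-based estimates are common alternatives.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40]})

# Replace missing ages with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```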
Index
Create an optimized data structure for fast search and retrieval.
Ingest
The initial collection and import of data from various sources into your processing environment.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Interpolate
Use known data values to estimate unknown data values.
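A minimal example using pandas; linear interpolation is the default, and other methods (polynomial, time-based) are available.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Linear interpolation estimates the missing values from their neighbors.
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0
```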
Lineage
Understand how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Linearizability
A consistency guarantee that each individual operation on a distributed system appears to take effect instantaneously at a single point in time.
Linearize
Transform the relationship between variables to make a dataset approximately linear.
Load
Insert data into a database, data warehouse, or pipeline for processing.
Mask
Obfuscate sensitive data to protect its privacy and security.
Materialize
Execute a computation and persist the results to storage.
Memoize
Store the results of expensive function calls and reuse them when the same inputs occur again.
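A minimal example using functools.lru_cache from the standard library, shown on a deliberately expensive recursive function.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Without memoization this recursion is exponential; with it,
    # each fib(n) is computed once and then served from the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))
```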
Merge
Combine data from multiple datasets into a single dataset.
Mine
Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Model
Create a conceptual representation of data objects.
Monitor
Track data processing metrics and system health to ensure high availability and performance.
Multiprocessing
Optimize execution time with multiple parallel processes.
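A minimal sketch using the standard library's multiprocessing.Pool; the square function stands in for a real CPU-bound computation.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    # A stand-in for a CPU-bound computation.
    return n * n

if __name__ == "__main__":
    # Distribute the work across four worker processes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))
```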
Munge
See 'Wrangle'.
Named Entity Recognition
Locate and classify named entities in text into pre-defined categories.
NoSQL
Non-relational databases designed for scalability, schema flexibility, and optimized performance in specific use-cases.
Normality Testing
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Normalize
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
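A minimal sketch of min-max normalization in plain Python; the sample values are illustrative.

```python
values = [3.0, 10.0, 5.0, 8.0]

# Min-max normalization rescales values to the [0, 1] range so that
# features measured on different scales become comparable.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 1.0, ~0.286, ~0.714]
```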
Obfuscate
Make data unintelligible or difficult to understand.
Parallelize
Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.
Parse
Interpret and convert data from one format to another.
Partition
Divide data into smaller subsets to improve performance and manageability.
Pickle
Convert a Python object into a byte stream for efficient storage.
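A minimal example using the standard library's pickle module; the config dictionary is illustrative.

```python
import pickle

config = {"source": "s3://bucket/raw", "retries": 3}

# Serialize the object to a byte stream...
blob = pickle.dumps(config)

# ...and reconstruct an equivalent object from it later.
restored = pickle.loads(blob)
assert restored == config

# Caution: only unpickle data you trust; pickle can execute arbitrary code.
```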
Pre-aggregate
See 'Aggregate'.
Prep
Transform your data so it is fit-for-purpose.
Preprocess
Transform raw data before data analysis or machine learning modeling.
Primary Key
A unique identifier for a record in a database table that helps maintain data integrity.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Purge
Delete data that is no longer needed or relevant to free up storage space.
Race Condition
A flaw in which the outcome of an operation depends on the unpredictable timing of concurrent access to a shared resource.
Rebalance
Redistribute data across nodes or partitions for optimal performance.
Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.
Repartition
Redistribute data across multiple partitions for improved parallelism and performance.
Replicate
Create a copy of data for redundancy or distributed processing.
Reshape
Change the structure of data to better fit specific analysis or modeling requirements.
Sample
Extract a subset of data for exploratory analysis or to reduce computational complexity.
Scaling
Increasing the capacity or performance of a system to handle more data or traffic.
Schema Inference
Automatically identify the structure of a dataset.
Schema Mapping
Translate data from one schema or structure to another to facilitate data integration.
Scrape
Extract data from a website or another source.
Secondary Index
An additional index on non-primary-key columns that improves the efficiency of data retrieval in a database or storage system.
Secure
Protect data from unauthorized access, modification, or destruction.
Sentiment Analysis
Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Serialize
Convert data into a linear format for efficient storage and processing.
Shard
Partition a database into smaller, more manageable pieces.
Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Shuffle
Randomize the order of data records to improve analysis and prevent bias.
Skew
An imbalance in the distribution or representation of data.
Software-defined Asset
A declarative design pattern that represents a data asset through code.
Spill
Temporarily transfer data that exceeds available memory to disk.
Split
Divide a dataset into training, validation, and testing sets for machine learning model training.
Standardize
Transform data to a common unit or format to facilitate comparison and analysis.
Stored Procedure
Precompiled and stored SQL statements and procedural logic for easy database operations and complex data manipulations.
Synchronize
Ensure that data in different systems or databases are in sync and up-to-date.
Thread
Enable concurrent execution in Python by decoupling tasks that are not sequentially dependent.
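A minimal sketch using concurrent.futures.ThreadPoolExecutor; the download function simulates I/O-bound work with a sleep.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def download(url: str) -> str:
    # Simulate an I/O-bound call; threads shine when tasks wait on I/O.
    time.sleep(1.0)
    return f"contents of {url}"

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

# The three downloads overlap, finishing in ~1 second instead of ~3.
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(download, urls):
        print(result)
```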
Time Series Analysis
Analyze data over time to identify trends, patterns, and relationships.
Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.
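A minimal sketch of word tokenization with a regular expression; real NLP pipelines often use dedicated tokenizers.

```python
import re

text = "Data engineering: turning raw data into usable data!"

# A simple word tokenizer: extract runs of letters and digits.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['data', 'engineering', 'turning', 'raw', 'data', 'into', 'usable', 'data']
```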
Transform
Convert data from one format or structure to another.
Unstructured Data Analysis
Analyze unstructured data, such as text or images, to extract insights and meaning.
Upsert
Update a record or insert a new record if it does not yet exist.
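A minimal example using SQLite's INSERT ... ON CONFLICT syntax (available in SQLite 3.24+); the users table is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def upsert(user_id: int, name: str) -> None:
    # Insert the row, or update it if a row with this id already exists.
    conn.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (user_id, name),
    )

upsert(1, "Ada")
upsert(1, "Ada Lovelace")  # updates instead of failing on a duplicate id
print(conn.execute("SELECT * FROM users").fetchall())  # [(1, 'Ada Lovelace')]
```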
Validate
Check data for completeness, accuracy, and consistency.
Vectorize
Execute a single operation on multiple data points simultaneously.
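A minimal example using NumPy (see the packages page for installation); one array expression replaces an explicit Python loop over the elements.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 3, 4])

# One vectorized expression computes all revenues at once.
revenue = prices * quantities
print(revenue)  # [ 20.  60. 120.]
```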
Version
Maintain a history of changes to data for auditing and tracking purposes.
Wrangle
Convert unstructured data into a structured format.
About the artwork.
The art you see throughout the glossary was generated thanks to Midjourney and curated by the Dagster Labs team. It was inspired by some of the great artists of the 20th century (and some from earlier periods). See if you can recognize the ‘work’ of Marcel Duchamp, Frederic Remington, Keith Haring, Claes Oldenburg, Roy Lichtenstein, Wassily Kandinsky, and others.
Left: Daggy, as seen by René Magritte.