Releases · OpenLineage/OpenLineage
OpenLineage 1.26.0
1.26.0 - 2024-12-20
Added
- dbt: Consume dbt structured logs and report progress in real time. #3314 @MassyB
  If the `--consume-structured-logs` flag is set, the dbt integration consumes dbt structured logs and reports execution progress in real time.
- Java: Add `transform` transport to allow event modification. #3301 @pawel-big-lebowski
  The new transport type modifies events using a specified transformer class (see the sketch after this list).
- Java: Parallel event emitting for composite transport. #3305 @pawel-big-lebowski
  The composite transport now emits events in parallel. Running in parallel is the default behaviour, with `continueOnFailure` set to `true`; the default value of `continueOnFailure` changed from `false` to `true`.
- Spark: Collect `ScanReport` and `CommitReport` in OpenLineage events when dealing with Iceberg tables. #3256 @pawel-big-lebowski
  Collects additional Iceberg metrics for datasets read or written through the library. Visit the Dataset Metrics docs for more details.
- dbt: Add support for the duckdb adapter. #3280 @mobuchowski
  Adds support for the duckdb adapter in the dbt integration.
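
For the `transform` transport above, here is a minimal sketch of the transformer concept, assuming a user-supplied class that receives each event before emission; the interface shape shown is illustrative, not the exact OpenLineage API:

```java
import io.openlineage.client.OpenLineage;

// Hypothetical transformer for the `transform` transport. The real contract
// lives in openlineage-java and may differ in name and signature; this only
// illustrates the concept of modifying events before they are emitted.
public class RedactingTransformer {
  public OpenLineage.RunEvent transform(OpenLineage.RunEvent event) {
    // Inspect and, e.g., redact facets or rename jobs, then return the event.
    return event;
  }
}
```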
Changed
- Spark: Add `DatasetFactory` to support `Dataset` creation. #3207 @pawel-big-lebowski
  Adds a `DatasetFactory` class used to create `Dataset` instances.
Fixed
- Spark: Fix inconsistent dataset naming. #3285 @pawel-big-lebowski
  The leading slash is now correctly stripped from GCS paths.
OpenLineage 1.25.0
Added
- dbt: Add support for column-level lineage in the dbt integration. #3264 @mayurmadnani
  The dbt integration now uses the SQL parser to add information about collected column-level lineage.
- Spark: Add input and output statistics about datasets read and written. #3240 #3263 @pawel-big-lebowski
  Fixes issues in the existing output statistics collection mechanism and adds input statistics. Output statistics now contain the number of files written, the byte size, and the records written. Input statistics contain the byte size and number of files read, while record count is collected only for DataSourceV2 sources.
- Introduced `InputStatisticsInputDatasetFacet`. #3238 @pawel-big-lebowski
  Extends the spec with a new facet, `InputStatisticsInputDatasetFacet`, modelled after the similar `OutputStatisticsOutputDatasetFacet`, to contain statistics about the input dataset read by a job (see the sketch after this list).
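
As a rough illustration of the new facet from the Java client, a hedged sketch follows; the builder method and field names (`rowCount`, `fileCount`, `size`) are assumptions drawn from the statistics described above, so verify them against the spec and the generated client for your version:

```java
import io.openlineage.client.OpenLineage;
import java.net.URI;

class InputStatisticsExample {
  // Assumed builder/field names; check the spec and generated client classes.
  static OpenLineage.InputStatisticsInputDatasetFacet inputStats() {
    OpenLineage ol = new OpenLineage(URI.create("https://example.com/producer"));
    return ol.newInputStatisticsInputDatasetFacetBuilder()
        .rowCount(1_000L)  // records read (collected for DataSourceV2 sources)
        .fileCount(4L)     // number of files read
        .size(123_456L)    // bytes read
        .build();
  }
}
```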
Changed
- Spark: Exclude `META-INF/*TransportBuilder` from Spark extension interfaces. #3244 @tnazarew
  Excludes `META-INF/*TransportBuilder` to avoid version conflicts.
- Spark: Enable building input/output facets through `DatasetFactory`. #3207 @pawel-big-lebowski
  Adds extra capabilities to the `DatasetFactory` class and marks some public developer API methods as deprecated.
OpenLineage 1.24.2
Added
- Spark: Add Dataproc run facet to include the jobType property. #3167 @codelixir
  Updates the GCP Dataproc run facet to include the jobType property.
- Add `EnvironmentVariablesRunFacet` to the core spec. #3186 @JDarDagran
  Uses `EnvironmentVariablesRunFacet` in the Python client.
- Add assertions for format in test events. #3221 @JDarDagran
- Spark: Add integration tests for EMR. #3142 @arturowczarek
  The Spark integration now has integration tests for EMR.
Changed
- Move Kinesis to a separate module and migrate the HTTP transport to httpclient5. #3205 @mobuchowski
  Moves the Kinesis integration to a separate module and updates the HTTP transport to use HttpClient 5.x.
- Docs: Upgrade Docusaurus to 3.6. #3219 @arturowczarek
- Spark: Limit the Seq size in `RddPathUtils::extract()`. #3148 @codelixir
  Adds a flag to limit the logs in `RddPathUtils::extract()` to avoid OutOfMemoryError for large jobs.
Fixed
- Docs: Fix outdated Spark-related docs. #3215 @mobuchowski
- Fix docusaurus-mdx-checker errors. #3217 @arturowczarek
- [Integration/dbt] Parse dbt source tests. #3208 @MassyB
  dbt sources are now considered when looking for test results.
- Avoid tests in configurable test. #3141 @pawel-leszczynski
OpenLineage 1.23.0
Added
- Java: Added `CompositeTransport`. #3039 @JDarDagran
  Allows users to specify multiple targets to which OpenLineage events will be emitted.
- Spark extension interfaces: Support table extended sources. #3062 @Imbruced
  Interfaces can now extract lineage from the Table interface, not only RelationProvider.
- Java: Added GCP Dataplex transport. #3043 @ddebowczyk92
  The Dataplex transport is now available as a separate Maven package for users who want to send OL events to GCP Dataplex.
- Java: Added Google Cloud Storage transport. #3077 @ddebowczyk92
  The GCS transport is now available as a separate Maven package for users who want to send OL events to Google Cloud Storage.
- Java: Added S3 transport. #3129 @arturowczarek
  The S3 transport is now available as a separate Maven package for users who want to send OL events to S3.
- Java: Add option to configure the client via environment variables. #3094 @JDarDagran
  Specified variables are now automatically translated to configuration values.
- Python: Add option to configure the client via environment variables. #3114 @JDarDagran
  Specified variables are now automatically translated to configuration values.
- Python: Add option to add custom headers in the HTTP transport. #3116 @JDarDagran
  Allows users to add custom headers, for example for auth purposes.
- Column-level lineage: Add full dataset dependencies. #3097 #3098 @arturowczarek
  Now, if `datasetLineageEnabled` is enabled and column-level lineage depends on the whole dataset, a dataset dependency is added instead of listing all the column fields of that dataset.
- Java: `OpenLineageClient` and `Transport`s are now `AutoCloseable`. #3122 @ddebowczyk92
  This prevents a number of issues that might be caused by not closing the underlying transports (see the sketch after this list).
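
Since `OpenLineageClient` is now `AutoCloseable`, it can be managed with try-with-resources so the underlying transport is closed automatically. A minimal sketch, using `ConsoleTransport` purely for illustration:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.ConsoleTransport;

class AutoCloseableClientExample {
  public static void main(String[] args) throws Exception {
    // The client (and its transport) is closed when the block exits.
    try (OpenLineageClient client =
        OpenLineageClient.builder().transport(new ConsoleTransport()).build()) {
      // client.emit(runEvent); // emit events as usual inside the block
    }
  }
}
```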
Fixed
- Python: Facet generator does not validate optional arguments. #3054 @JDarDagran
  Fixes an issue where `NominalTimeRunFacet` broke when `nominalEndTime` was `None`.
- SQL: Report only actually used tables from CTEs, rather than all. #2962 @Imbruced
  With this change, if a SQL query defines a CTE but does not use it in the final query, the lineage won't be falsely reported.
- Fluentd: Enhance the plugin's capabilities. #3068 @jonathanlbt1
  Enhances the performance and docs of the Fluentd proxy plugin.
- SQL: Fix the parser to point to the origin table instead of CTEs. #3107 @Imbruced
  For some complex CTEs, the parser emitted the CTE as a target table instead of the original table. This is now fixed.
- Spark: Column lineage is correctly produced for the merge into command. #3095 @Imbruced
  OL now produces column-level lineage correctly when there is a potential view in the middle.
OpenLineage 1.22.0
Added
- SQL: Add support for the `USE` statement with different syntaxes. #2944 @kacpermuda
  Adjusts our Context so that it can use the parser's new support for this statement and pass it to a number of queries (see the sketch after this list).
- Spark: Add script to build Spark dependencies. #3044 @arturowczarek
  Adds a script to rebuild dependencies automatically following releases.
- Website: Versionable docs. #3007 #3023 @pawel-big-lebowski
  Adds a GitHub action that creates a new Docusaurus version on a tag push, verifiable using the openlineage-site repo. Implements a monorepo approach in a new `website` directory.
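
A hedged sketch of exercising the parser through the Java SQL bindings; `OpenLineageSql.parse` follows the openlineage-sql Java API as commonly documented, and whether the `USE` context resolves the table below should be verified against your version:

```java
import io.openlineage.sql.OpenLineageSql;
import io.openlineage.sql.SqlMeta;
import java.util.List;
import java.util.Optional;

class UseStatementExample {
  public static void main(String[] args) {
    // With USE support, `tbl` should resolve against db1 rather than being
    // treated as an unqualified table.
    Optional<SqlMeta> meta =
        OpenLineageSql.parse(List.of("USE db1", "SELECT a, b FROM tbl"));
    meta.ifPresent(m -> System.out.println(m.inTables()));
  }
}
```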
Fixed
- SQL: Add support for `SingleQuotedString` in `Identifier()`. #3035 @kacpermuda
  Single-quoted strings were being treated differently than strings with no quotes, double quotes, or backticks.
- SQL: Support the `IDENTIFIER` function instead of treating it like a table name. #2999 @kacpermuda
  Adds support for this identifier in SELECT, MERGE, UPDATE, and DELETE statements. For now, only static identifiers are supported. When a variable is used, the table is removed from lineage to avoid emitting incorrect lineage.
- Spark: Fix issue with only one table in inputs from a SQL query while reading from JDBC. #2918 @Imbruced
  Events created did not contain the correct input table when the query contained multiple tables.
- Spark: Fix AWS Glue jobs naming for RDD events. #3020 @arturowczarek
  The naming for RDD jobs now uses the same code as SQL and Application events.
OpenLineage 1.21.1
Added
- Spec: Add GCP Dataproc facet. #2987 @tnazarew
  Registers the Google Cloud Platform Dataproc run facet.
Fixed
- Airflow: Update SQL integration code to work with the latest sqlparser-rs main. #2983 @kacpermuda
  Adjusts the SQL integration after our sqlparser-rs fork was updated to the latest main.
- Spark: Fix AWS Glue jobs naming for SQL events. #3001 @arturowczarek
  SQL events now properly use the names of the jobs retrieved from AWS Glue.
- Spark: Fix issue with column lineage when using the delta merge into command. #2986 @Imbruced
  A view instance of a node is now included when gathering data sources for input columns.
- Spark: Minor Spark filters refactor. #2990 @arturowczarek
  Fixes a number of minor issues.
- Spark: Iceberg tables in AWS Glue have slashes instead of dots in symlinks. #2984 @arturowczarek
  They should use slashes and the prefix `table/`.
- Spark: Lineage for Iceberg datasets present outside of Spark's catalog is now reported. #2937 @d-m-h
  Previously, reading Iceberg datasets outside the configured Spark catalog prevented the datasets from being present in the `inputs` property of the `RunEvent`.
OpenLineage 1.20.5
Added
- Python: Add `CompositeTransport`. #2925 @JDarDagran
  Adds a `CompositeTransport` that can accept other transport configs to instantiate transports and use them to emit events.
- Spark: Compile & test the Spark integration on Java 17. #2828 @pawel-big-lebowski
  The Spark integration is always compiled with Java 17, while tests run on both Java 8 and Java 17 according to the configuration.
- Spark: Support the preview release of Spark 4.0. #2854 @pawel-big-lebowski
  Includes the Spark 4.0 preview release in the integration tests.
- Spark: Add handling for `Window`. #2901 @tnazarew
  Adds handling for `Window`-type nodes of a logical plan.
- Spark: Extract and send events with raw SQL from Spark. #2913 @Imbruced
  Adds a parser that traverses `QueryExecution` with a BFS algorithm to get the SQL query used from the SQL field (see the sketch after this list).
- Spark: Support Mongostream source. #2887 @Imbruced
  Adds a Mongo streaming visitor and tests.
- Spark: New mechanism for disabling facets. #2912 @arturowczarek
  The mechanism makes `FacetConfig` accept the disabled flag for any facet instead of passing them as a list.
- Spark: Support Kinesis source. #2906 @Imbruced
  Adds a Kinesis class handler in the streaming source builder.
- Spark: Extract `DatasetIdentifier` from extension `LineageNode`. #2900 @ddebowczyk92
  Adds support for cases in which `LogicalRelation` has a grandchild node that implements the `LineageRelation` interface.
- Spark: Extract Dataset from the underlying `BaseRelation`. #2893 @ddebowczyk92
  `DatasetIdentifier` is now extracted from the underlying node of `LogicalRelation`.
- Spark: Add descriptions and Marquez UI to the Docker Compose file. #2889 @jonathanlbt1
  Adds the `marquez-web` service to docker-compose.yml.
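
A conceptual sketch of the BFS traversal described in the raw-SQL entry above: walk the plan tree breadth-first and take the first node that carries a SQL string. `PlanNode` is a stand-in type for illustration, not an actual Spark or OpenLineage class:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

class RawSqlBfsSketch {
  // Stand-in for a logical plan node; `sql` is null when a node carries none.
  record PlanNode(String sql, List<PlanNode> children) {}

  // Breadth-first search: the shallowest node with a SQL string wins.
  static Optional<String> findSql(PlanNode root) {
    Queue<PlanNode> queue = new ArrayDeque<>();
    queue.add(root);
    while (!queue.isEmpty()) {
      PlanNode node = queue.poll();
      if (node.sql() != null) {
        return Optional.of(node.sql());
      }
      queue.addAll(node.children());
    }
    return Optional.empty();
  }
}
```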
Fixed
- Proxy: Fix bug in error message descriptions. #2880 @jonathanlbt1
  Improves error logging.
- Proxy: Update Docker image for Fluentd 1.17. #2877 @jonathanlbt1
  Upgrades the Fluentd version.
- Spark: Fix issue with the Kafka source when saving with the foreach batch method. #2868 @Imbruced
  Fixes an issue where, with Spark in streaming mode, the Kafka input was not present in the event.
- Spark: Properly set ARN in namespace for Iceberg Glue symlinks. #2943 @arturowczarek
  Makes `IcebergHandler` support Glue catalog tables and create the symlink using the code from `PathUtils`.
- Spark: Accept any provider for AWS Glue storage format. #2917 @arturowczarek
  Makes the AWS Glue ARN-generating method accept every format (including Parquet), not only Hive SerDe.
- Spark: Return valid JSON for failed logical plan serialization. #2892 @arturowczarek
  The `LogicalPlanSerializer` now returns `<failed-to-serialize-logical-plan>` for failed serialization instead of an empty string.
- Spark: Extract legacy column lineage visitors loader. #2883 @arturowczarek
  Refactors `CustomCollectorsUtils` for improved readability.
- Spark: Add Kafka input source when writing in `foreach` batch mode. #2868 @Imbruced
  Fixes a bug keeping Kafka input sources from being produced.
- Spark: Extract `DatasetIdentifier` from `SaveIntoDataSourceCommandVisitor` options. #2934 @ddebowczyk92
  Extracts `DatasetIdentifier` from the command's options instead of relying on `p.createRelation(sqlContext, command.options())`, which is a heavy operation for `JdbcRelationProvider`.
OpenLineage 1.19.0
Added
- Airflow: Add `log_url` to `AirflowRunFacet`. #2852 @dolfinus
  Adds the task instance's `log_url` field to `AirflowRunFacet`.
- Spark: Add handling for `Generate`. #2856 @tnazarew
  Adds handling for `Generate`-type nodes of a logical plan (e.g., explode operations).
- Java: Add `DerbyJdbcExtractor`. #2869 @dolfinus
  Adds a `JdbcExtractor` implementation for the Derby database. As this is a file-based DBMS, its dataset namespace is `file` and its name is an absolute path to the database file.
- Spark: Verify bytecode version of the built jar. #2859 @pawel-big-lebowski
  Extends the `JarVerifier` plugin to ensure all compiled classes have a bytecode version of Java 8 or lower.
- Spark: Add Kafka streaming source support. #2851 @d-m-h
  Adds support for Kafka streaming sources to Kafka streaming sinks. Inputs and outputs are now included in lineage events.
Fixed
- Airflow: Replace `datetime.now` with `airflow.utils.timezone.utcnow`. #2865 @kacpermuda
  Fixes missing timezone information in task FAIL events.
- Spark: Remove shaded dependency in `ColumnLevelLineageBuilder`. #2850 @tnazarew
  Removes the shaded `Streams` dependency in `ColumnLevelLineageBuilder` that was causing a `ClassNotFoundException`.
- Spark: Make Delta dataset symlinks consistent with non-Delta tables. #2863 @dolfinus
  Makes dataset symlinks for Delta and non-Delta tables consistent.
- Spark: Use a Table's properties during column-level lineage construction. #2855 @ddebowczyk92
  Fixes `PlanUtils3` so dataset identifier information based on a Table's properties is also retrieved during the construction of column-level lineage.
- Spark: Extract job name creation to providers. #2861 @arturowczarek
  The integration now detects if `spark.app.name` was autogenerated by Glue and uses the Glue job name in such cases. Also, each job name provisioning strategy is now extracted to a separate provider.
OpenLineage 1.18.0
Added
- Spark: Configurable integration test. #2755 @pawel-big-lebowski
  Provides a command-line tool capable of running Spark integration tests that can be created without Java.
- Spark: OpenLineage Spark extension interfaces without runtime dependency hell. #2809 #2837 @ddebowczyk92
  New Spark extension interfaces without runtime dependency hell. Includes a test to verify the integration is working properly.
- Spark: Support latest versions 3.4.3 and 3.5.1. #2743 @pawel-big-lebowski
  Upgrades CI workflows to run tests against the latest Spark versions: 3.4.2 -> 3.4.3 and 3.5.0 -> 3.5.1.
- Spark: Add extraction of the masking property in column-level lineage. #2789 @tnazarew
  Adds extraction of the masking property during collection of dependencies for `ColumnLineageDatasetFacet` creation.
- Spark: Collect table name from `InsertIntoHadoopFsRelationCommand`. #2794 @dolfinus
  Collects a table name for the `INSERT INTO` command for tables created with `USING $fileFormat` syntax, like `USING orc`.
- Spark, Flink: Add `PostgresJdbcExtractor`. #2806 @dolfinus
  Adds the default `5432` port to Postgres namespaces.
- Spark, Flink: Add `TeradataJdbcExtractor`. #2826 @dolfinus
  Converts JDBC URLs like `jdbc:teradata/host/DBS_PORT=1024,DATABASE=somedb` to datasets with namespace `teradata://host:1024` and name `somedb.table`.
- Spark, Flink: Add `MySqlJdbcExtractor`. #2825 @dolfinus
  Handles different formats of the MySQL JDBC URL and produces datasets with consistent namespaces, like `mysql://host:port` (see the sketch after this list).
- Spark, Flink: Add `OracleJdbcExtractor`. #2824 @dolfinus
  Handles simple Oracle JDBC URLs, like `oracle:thin:@//host:port/serviceName` and `oracle:thin@host:port:sid`, and converts each to a dataset with namespace `oracle://host:port` and name `sid.schema.table` or `serviceName.schema.table`.
- Spark: Configurable test with Docker image provided. #2822 @pawel-big-lebowski
  Extends the configurable integration test feature to enable passing the Docker image name as a parameter.
- Spark: Support Iceberg 1.4 on Spark 3.5.1. #2838 @pawel-big-lebowski
  Includes Iceberg support for Spark 3.5. Fixes the column-level lineage facet for `UNION` queries.
- Spec: Add example for change in #2756. #2801 @Sheeri
  Updates the `customLineage` facet test for the new syntax created in #2756.
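
A simplified illustration of the normalization these JDBC extractors perform: strip the `jdbc:` prefix and reduce the URL to `scheme://host:port` for the namespace. The real extractors handle vendor-specific URL forms (Teradata's comma-separated parameters, Oracle's thin syntax) that this sketch does not:

```java
import java.net.URI;

class JdbcNamespaceSketch {
  // Simplified: works for URL-shaped JDBC strings like MySQL's or Postgres'.
  static String namespaceFor(String jdbcUrl) {
    URI uri = URI.create(jdbcUrl.replaceFirst("^jdbc:", ""));
    return uri.getScheme() + "://" + uri.getAuthority();
  }

  public static void main(String[] args) {
    // Prints "mysql://host:3306", matching the MySQL example above.
    System.out.println(namespaceFor("jdbc:mysql://host:3306/mydb"));
  }
}
```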
Changed
- Spark: Fall back to `spark.sql.warehouse.dir` as table namespace. #2767 @dolfinus
  In cases when a metastore is not used, falls back to `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir` as the table namespace, instead of duplicating the table's location.
Fixed
- Java: Handle dashes in hostnames for `JdbcExtractors`. #2830 @dolfinus
  Proper handling of dashes in JDBC URL hosts.
- Spark: Fix Glue symlinks formatting bug. #2807 @Akash2351
  Fixes Glue symlinks with config parsing for the Glue `catalogid`.
- Spark, Flink: Fix DBFS namespace format. #2800 @dolfinus
  Fixes the DBFS namespace format.
- Spark: Fix Glue naming format. #2766 @dolfinus
  Changes the AWS Glue namespace to match the Glue ARN documentation.
- Spark: Fix Iceberg dataset location. #2797 @dolfinus
  Fixes the Iceberg dataset namespace: instead of `file:/some/path/database.table`, uses `file:/some/path/database/table`. For the dataset TABLE symlink, uses the warehouse location instead of the database location.
- Spark: Fix NPE and incorrect comment. #2827 @pawel-big-lebowski
  Fixes an error caused by a recent upgrade of Spark versions that did not break existing tests.
- Spark: Convert scheme and authority to lowercase in `JdbcLocation`. #2831 @dolfinus
  Converts a valid JDBC URL's scheme and authority to lowercase, leaving the instance/database name intact, as different databases have different default case and case-sensitivity rules.
OpenLineage 1.17.1
Added
- Java: Dataset namespace resolver feature. #2720 @pawel-big-lebowski
  Adds a dataset namespace resolving mechanism that resolves dataset namespaces based on the configured resolvers. The core mechanism is implemented in openlineage-java and can be used within the Flink and Spark integrations (see the sketch after this list).
- Spark: Add transformation extraction. #2758 @tnazarew
  Adds a transformation type extraction mechanism.
- Spark: Add GCP run and job facets. #2643 @codelixir
  Adds `GCPRunFacetBuilder` and `GCPJobFacetBuilder` to report additional facets when running on Google Cloud Platform.
- Spark: Improve namespace format for SQLServer. #2773 @dolfinus
  Improves the namespace format for SQLServer.
- Spark: Verify jar content after build. #2698 @pawel-big-lebowski
  Adds a tool to verify `shadowJar` content and prevent reported issues, which are currently hard to prevent and require manual verification of manually unpacked jar content.
- Spec: Add transformation type info. #2756 @tnazarew
  Adds information about the transformation type in `ColumnLineageDatasetFacet`. `transformationType` and `transformationDescription` are marked as deprecated.
- Spec: Implement facet registry (following #2161). #2729 @harels
  Introduces the foundations of the new facet registry into the repo.
- Spec: Register GCP common job facet. #2740 @ngorchakova
  Registers the GCP job facet that contains common attributes that will improve the way lineage is parsed and displayed by the GCP platform. Based on the proposal, GCP Lineage would like to define facets that are expected from integrations. The list of supported facets is not final and will be extended by later PRs.
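
A conceptual sketch of the resolver idea above, assuming an interface along these lines; the names are illustrative rather than the exact openlineage-java API. Configured resolvers map physical namespaces (for example, individual broker hosts) to stable logical ones:

```java
import java.util.regex.Pattern;

class NamespaceResolverSketch {
  // Illustrative contract: map a raw dataset namespace to a resolved one.
  interface DatasetNamespaceResolver {
    String resolve(String namespace);
  }

  // Example: collapse numbered production Kafka hosts into a single alias.
  static class PatternResolver implements DatasetNamespaceResolver {
    private static final Pattern HOST = Pattern.compile("kafka-prod\\d+\\.company\\.com");

    @Override
    public String resolve(String namespace) {
      return HOST.matcher(namespace).replaceAll("kafka-prod-cluster");
    }
  }

  public static void main(String[] args) {
    DatasetNamespaceResolver resolver = new PatternResolver();
    // Prints "kafka://kafka-prod-cluster:9092"
    System.out.println(resolver.resolve("kafka://kafka-prod13.company.com:9092"));
  }
}
```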
Removed
- Java: Remove deprecated `localServerId` option from Kafka config. #2738 @dolfinus
  Removes `localServerId` from the Kafka config; it had been deprecated since 1.13.0.
- Java: Remove deprecated `Transport.emit(String)`. #2737 @dolfinus
  Removes `Transport.emit(String)` support, deprecated since 1.13.0.
- Spark: Remove `spark-interfaces-scala` module. #2781 @ddebowczyk92
  Replaces the existing `spark-interfaces-scala` interfaces with new ones decoupled from the Scala binary version. Allows for improved integration in environments where one cannot guarantee the same version of `openlineage-java`.
Changed
- Spark: Add log info when emitting lineage from Spark (following #2650). #2769 @algorithmy1
  Enhances logging.
Fixed
- Flink: Use `namespace.name` as the Avro complex field type. #2763 @dolfinus
  `namespace.name` is now used as the Avro `"type"` of complex fields (record, enum, fixed).
- Java: Repair empty dataset name. #2776 @kacpermuda
  The dataset name should not be empty.
- Spark: Fix events emitted for `drop table` for Spark 3.4 and above. #2745 @pawel-big-lebowski @savannavalgi
  Includes the dataset being dropped within the event, as was the case prior to Spark 3.4.
- Spark, Flink: Fix S3 dataset names. #2782 @dolfinus
  Drops the leading slash from the object storage dataset name. Converts `s3a://` and `s3n://` schemes to `s3://`.
- Spark: Fix Hive metastore namespace. #2761 @dolfinus
  Fixes the dataset namespace for cases when the Hive metastore URL is set using `$SPARK_CONF_DIR/hive-site.xml`.
- Spark: Fix NPE in column-level lineage. #2749 @pawel-big-lebowski
  The Spark agent now checks that `cur.getDependencies()` is not null before adding dependencies.
- Spark: Refactor `OpenLineageRunEventBuilder`. #2754 @pawel-big-lebowski
  Adds a separate class containing all the input arguments to call `OpenLineageRunEventBuilder::buildRun`.
- Spark: Fix `historyUrl` format. #2741 @dolfinus
  Fixes the `historyUrl` format in `spark_applicationDetails`.
- SQL: Allow self-recursive aliases. #2753 @mobuchowski
  Expressions like `select * from test_orders as test_orders` are now parsed properly.