Releases: OpenLineage/OpenLineage

OpenLineage 1.26.0 - 2024-12-20

Added

  • dbt: Consume dbt structured logs and report progress in real time. #3314 @MassyB
    If the --consume-structured-logs flag is set, the dbt integration consumes dbt structured logs and reports execution progress in real time (see the invocation sketch after this list).
  • Java: Add transform transport to allow event modification. #3301 @pawel-big-lebowski
    The new transport type allows events to be modified by a specified transformer class.
  • Java: Parallel event emitting for composite transport. #3305 @pawel-big-lebowski
    The composite transport now emits events in parallel by default, with continueOnFailure set to true; the default value of continueOnFailure has changed from false to true.
  • Spark: Collect ScanReport and CommitReport in OpenLineage events when dealing with Iceberg tables. #3256 @pawel-big-lebowski
    Collects additional Iceberg metrics for datasets read or written through the library. See the Dataset Metrics docs for details.
  • dbt: add support for duckdb adapter #3280 @mobuchowski
    Adds support for the duckdb adapter in the dbt integration.
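For illustration, a minimal sketch of enabling the new structured-log mode, assuming the openlineage-dbt package's dbt-ol wrapper accepts the flag; invoking it through Python's subprocess is just one way to run it:

```python
import subprocess

# Run dbt through the OpenLineage wrapper with the new flag; with
# --consume-structured-logs set, execution progress is reported in real time.
subprocess.run(
    ["dbt-ol", "run", "--consume-structured-logs"],
    check=True,
)
```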

Changed

  • Spark: Add DatasetFactory to support Dataset creation. #3207 @pawel-big-lebowski
    Adds a DatasetFactory class used to create Dataset instances.


OpenLineage 1.25.0 - 2024-12-03

Added

  • dbt: Add support for column-level lineage in dbt integration. #3264 @mayurmadnani
    The dbt integration now uses the SQL parser to add column-level lineage information.
  • Spark: Add input and output statistics about datasets read and written. #3240 #3263 @pawel-big-lebowski
    Fixes issues in the existing output statistics collection mechanism and adds input statistics. Output statistics now contain the number of files written, the byte size, and the number of records written. Input statistics contain the byte size and number of files read; record count is collected only for DataSourceV2 sources.
  • Introduced InputStatisticsInputDatasetFacet #3238 @pawel-big-lebowski
    Extends the spec with a new InputStatisticsInputDatasetFacet, modeled after the similar OutputStatisticsOutputDatasetFacet, to contain statistics about the input dataset read by a job.
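As a rough sketch of what the new facet's payload might look like, assuming it mirrors OutputStatisticsOutputDatasetFacet as described above; the field names are illustrative, not copied from the spec:

```python
# Hypothetical input dataset entry carrying the new statistics facet.
input_dataset = {
    "namespace": "s3://warehouse",
    "name": "sales/orders",
    "inputFacets": {
        "inputStatistics": {
            "size": 1_048_576,   # bytes read
            "fileCount": 4,      # number of files read
            "rowCount": 20_000,  # records read (DataSourceV2 sources only)
        }
    },
}
```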

Changed

  • Spark: Exclude META-INF/*TransportBuilder from Spark Extension Interfaces #3244 @tnazarew
    Excludes META-INF/*TransportBuilder files to avoid version conflicts.
  • Spark: enables building input/output facets through DatasetFactory #3207 @pawel-big-lebowski
    Adds extra capabilities to the DatasetFactory class and marks some public developer API methods as deprecated.

Fixed

  • dbt: fix compatibility with dbt v1.8 #3228 @NJA010
    The dbt integration now takes into account the modified test_metadata field.
  • Spark: enabled Delta 3.x version compatibility #3253 @Jorricks
    Takes into account the modified initialSnapshot name.

OpenLineage 1.24.2 - 2024-11-05

Added

  • Spark: Add Dataproc run facet to include jobType property #3167 @codelixir
    Updates the GCP Dataproc run facet to include the jobType property.
  • Add EnvironmentVariablesRunFacet to core spec #3186 @JDarDagran
    Uses EnvironmentVariablesRunFacet in the Python client (a payload sketch follows this list).
  • Add assertions for format in test events #3221 @JDarDagran
  • Spark: Add integration tests for EMR #3142 @arturowczarek
    The Spark integration now has integration tests for EMR.
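A rough sketch of the run facet's payload, assuming the shape implied by its name; the keys shown are illustrative rather than quoted from the spec:

```python
# Hypothetical run entry carrying the environment variables facet.
run = {
    "runId": "11111111-2222-3333-4444-555555555555",
    "facets": {
        "environmentVariables": {
            "environmentVariables": [
                {"name": "SPARK_HOME", "value": "/opt/spark"},
            ]
        }
    },
}
```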

Changed

  • Move Kinesis to separate module, migrate HTTP transport to httpclient5 #3205 @mobuchowski
    Moves Kinesis integration to a separate module and updates HTTP transport to use HttpClient 5.x
  • Docs: Upgrade docusaurus to 3.6 #3219 @arturowczarek
  • Spark: Limit the Seq size in RddPathUtils::extract() #3148 @codelixir
    Adds a flag limiting the collected Seq size in RddPathUtils::extract() to avoid OutOfMemoryError for large jobs.


OpenLineage 1.23.0 - 2024-10-04

Added

  • Java: added CompositeTransport #3039 @JDarDagran
    This allows users to specify multiple targets to which OpenLineage events will be emitted.
  • Spark extension interfaces: support table extended sources #3062 @Imbruced
    The interfaces can now extract lineage from the Table interface, not only RelationProvider.
  • Java: added GCP Dataplex transport #3043 @ddebowczyk92
    The Dataplex transport is now available as a separate Maven package for users who want to send OL events to GCP Dataplex.
  • Java: added Google Cloud Storage transport #3077 @ddebowczyk92
    The GCS transport is now available as a separate Maven package for users who want to send OL events to Google Cloud Storage.
  • Java: added S3 transport #3129 @arturowczarek
    The S3 transport is now available as a separate Maven package for users who want to send OL events to S3.
  • Java: add option to configure client via environment variables #3094 @JDarDagran
    Specified variables are now automatically translated to configuration values.
  • Python: add option to configure client via environment variables #3114 @JDarDagran
    Specified variables are now automatically translated to configuration values (see the configuration sketch after this list).
  • Python: add option to add custom headers in HTTP transport #3116 @JDarDagran
    Allows users to add custom headers, for example for authentication purposes.
  • Column level lineage: add full dataset dependencies #3097 #3098 @arturowczarek
    Now, when datasetLineageEnabled is enabled and column-level lineage depends on the whole dataset, a dataset dependency is added instead of listing all the column fields in that dataset.
  • Java: OpenLineageClient and Transports are now AutoCloseable #3122 @ddebowczyk92
    This prevents a number of issues that might be caused by not closing underlying transports.
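A minimal sketch of the Python variant, assuming the double-underscore-delimited OPENLINEAGE__ variable convention; the URL is a placeholder:

```python
import os

from openlineage.client import OpenLineageClient

# Each OPENLINEAGE__ variable is translated into a nested configuration key.
os.environ["OPENLINEAGE__TRANSPORT__TYPE"] = "http"
os.environ["OPENLINEAGE__TRANSPORT__URL"] = "http://localhost:5000"

# Constructed without an explicit transport, the client resolves its
# configuration from the environment.
client = OpenLineageClient()
```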

Fixed

  • Python: Facet generator does not validate optional arguments #3054 @JDarDagran
    Fixes an issue where the NominalTimeRunFacet broke when nominalEndTime was None (a sketch follows this list).
  • SQL: report only actually used tables from CTEs, rather than all #2962 @Imbruced
    With this change, if a SQL statement defines a CTE but does not use it in the final query, the lineage is no longer falsely reported.
  • Fluentd: enhance plugin's capabilities #3068 @jonathanlbt1
    Enhances the performance and documentation of the Fluentd proxy plugin.
  • SQL: fix parser to point to origin table instead of CTEs #3107 @Imbruced
    For some complex CTEs, the parser emitted the CTE as a target table instead of the original table. This is now fixed.
  • Spark: column lineage is now correctly produced for the merge into command #3095 @Imbruced
    OpenLineage now produces column-level lineage correctly when a view appears in the middle of the plan.
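A minimal sketch of the fixed behavior, assuming the generated facet_v2 classes where optional fields default to None:

```python
from openlineage.client.facet_v2 import nominal_time_run_facet

# Before the fix, omitting nominalEndTime (leaving it None) broke the facet.
facet = nominal_time_run_facet.NominalTimeRunFacet(
    nominalStartTime="2024-10-04T12:00:00Z",
)
```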

OpenLineage 1.22.0 - 2024-09-05

Added

  • SQL: add support for USE statement with different syntaxes #2944 @kacpermuda
    Adjusts our Context so that it can use the parser's new support for this statement and pass it to subsequent queries (see the parse sketch after this list).
  • Spark: add script to build Spark dependencies #3044 @arturowczarek
    Adds a script to rebuild dependencies automatically following releases.
  • Website: versionable docs #3007 #3023 @pawel-big-lebowski
    Adds a GitHub action that creates a new Docusaurus version on a tag push, verifiable using the openlineage-site repo. Implements a monorepo approach in a new website directory.
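A minimal sketch of the USE support, assuming the openlineage-sql Python binding's parse() entry point; exactly how the parser qualifies the table is illustrative:

```python
from openlineage_sql import parse

# With USE supported, tables in the following statements can be qualified
# against the db1 schema instead of being left unresolved.
meta = parse(["USE db1", "SELECT * FROM orders"])
print(meta.in_tables)
```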

Fixed

  • SQL: add support for SingleQuotedString in Identifier() #3035 @kacpermuda
    Single quoted strings were being treated differently than strings with no quotes, double quotes, or backticks.
  • SQL: support IDENTIFIER function instead of treating it like table name #2999 @kacpermuda
    Adds support for this identifier in SELECT, MERGE, UPDATE, and DELETE statements. For now, only static identifiers are supported. When a variable is used, this table is removed from lineage to avoid emitting incorrect lineage.
  • Spark: fix issue with only one table in inputs from SQL query while reading from JDBC #2918 @Imbruced
    Events created did not contain the correct input table when the query contained multiple tables.
  • Spark: fix AWS Glue jobs naming for RDD events #3020 @arturowczarek
    The naming for RDD jobs now uses the same code as SQL and Application events.

OpenLineage 1.21.1 - 2024-08-29

Added

  • Spec: add GCP Dataproc facet #2987 @tnazarew
    Registers the Google Cloud Platform Dataproc run facet.

Fixed

  • Airflow: update SQL integration code to work with latest sqlparser-rs main #2983 @kacpermuda
    Adjusts the SQL integration after our sqlparser-rs fork has been updated to the latest main.
  • Spark: fix AWS Glue jobs naming for SQL events #3001 @arturowczarek
    SQL events now properly use the names of the jobs retrieved from AWS Glue.
  • Spark: fix issue with column lineage when using delta merge into command #2986 @Imbruced
    A view instance of a node is now included when gathering data sources for input columns.
  • Spark: minor Spark filters refactor #2990 @arturowczarek
    Fixes a number of minor issues.
  • Spark: Iceberg tables in AWS Glue have slashes instead of dots in symlinks #2984 @arturowczarek
    Symlinks now use slashes and the table/ prefix.
  • Spark: lineage for Iceberg datasets that are present outside of Spark's catalog is now present #2937 @d-m-h
    Previously, reading Iceberg datasets outside the configured Spark catalog prevented the datasets from being present in the inputs property of the RunEvent.

OpenLineage 1.20.5 - 2024-08-23

Added

  • Python: add CompositeTransport #2925 @JDarDagran
    Adds a CompositeTransport that can accept other transport configs to instantiate transports and use them to emit events (a sketch follows this list).
  • Spark: compile & test Spark integration on Java 17 #2828 @pawel-big-lebowski
    The Spark integration is always compiled with Java 17, while tests are running on both Java 8 and Java 17 according to the configuration.
  • Spark: support preview release of Spark 4.0 #2854 @pawel-big-lebowski
    Includes the Spark 4.0 preview release in the integration tests.
  • Spark: add handling for Window #2901 @tnazarew
    Adds handling for Window-type nodes of a logical plan.
  • Spark: extract and send events with raw SQL from Spark #2913 @Imbruced
    Adds a parser that traverses QueryExecution to get the SQL query used from the SQL field with a BFS algorithm.
  • Spark: support Mongostream source #2887 @Imbruced
    Adds a Mongo streaming visitor and tests.
  • Spark: new mechanism for disabling facets #2912 @arturowczarek
    The mechanism makes FacetConfig accept the disabled flag for any facet instead of passing them as a list.
  • Spark: support Kinesis source #2906 @Imbruced
    Adds a Kinesis class handler in the streaming source builder.
  • Spark: extract DatasetIdentifier from extension LineageNode #2900 @ddebowczyk92
    Adds support for cases in which LogicalRelation has a grandChild node that implements the LineageRelation interface.
  • Spark: extract Dataset from underlying BaseRelation #2893 @ddebowczyk92
    DatasetIdentifier is now extracted from the underlying node of LogicalRelation.
  • Spark: add descriptions and Marquez UI to Docker Compose file #2889 @jonathanlbt1
    Adds the marquez-web service to docker-compose.yml.
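A minimal sketch of the new Python CompositeTransport, assuming its module path and the from_dict() config factory used by other transports; both targets are placeholders:

```python
from openlineage.client import OpenLineageClient
from openlineage.client.transport.composite import CompositeConfig, CompositeTransport

# Fan events out to an HTTP backend and the console at the same time.
config = CompositeConfig.from_dict(
    {
        "type": "composite",
        "transports": [
            {"type": "http", "url": "http://localhost:5000"},
            {"type": "console"},
        ],
    }
)
client = OpenLineageClient(transport=CompositeTransport(config))
```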

Fixed

  • Proxy: fix bug in error message descriptions #2880 @jonathanlbt1
    Improves error logging.
  • Proxy: update Docker image for Fluentd 1.17 #2877 @jonathanlbt1
    Upgrades the Fluentd version.
  • Spark: fix issue with Kafka source when saving with for each batch method #2868 @Imbruced
    Fixes an issue where, when Spark runs in streaming mode, the Kafka input was not present in the event.
  • Spark: properly set ARN in namespace for Iceberg Glue symlinks #2943 @arturowczarek
    Makes IcebergHandler support Glue catalog tables and create the symlink using the code from PathUtils.
  • Spark: accept any provider for AWS Glue storage format #2917 @arturowczarek
    Makes the AWS Glue ARN generating method accept every format (including Parquet), not only Hive SerDe.
  • Spark: return valid JSON for failed logical plan serialization #2892 @arturowczarek
    The LogicalPlanSerializer now returns <failed-to-serialize-logical-plan> for failed serialization instead of an empty string.
  • Spark: extract legacy column lineage visitors loader #2883 @arturowczarek
    Refactors CustomCollectorsUtils for improved readability.
  • Spark: add Kafka input source when writing in foreach batch mode #2868 @Imbruced
    Fixes a bug keeping Kafka input sources from being produced.
  • Spark: extract DatasetIdentifier from SaveIntoDataSourceCommandVisitor options #2934 @ddebowczyk92
    Extracts DatasetIdentifier from command's options instead of relying on p.createRelation(sqlContext, command.options()), which is a heavy operation for JdbcRelationProvider.

OpenLineage 1.19.0 - 2024-07-22

Added

  • Airflow: add log_url to AirflowRunFacet #2852 @dolfinus
    Adds taskinstance's log_url field to AirflowRunFacet.
  • Spark: add handling for Generate #2856 @tnazarew
    Adds handling for Generate-type nodes of a logical plan (e.g., explode operations).
  • Java: add DerbyJdbcExtractor #2869 @dolfinus
    Adds a JdbcExtractor implementation for the Derby database. As this is a file-based DBMS, its dataset namespace is file and its name is the absolute path to the database file.
  • Spark: verify bytecode version of the built jar. #2859 @pawel-big-lebowski
    Extends the JarVerifier plugin to ensure all compiled classes have a bytecode version of Java 8 or lower.
  • Spark: add Kafka streaming source support #2851 @d-m-h
    Adds support for Kafka streaming sources to Kafka streaming sinks. Inputs and outputs are now included in lineage events.

Fixed

  • Airflow: replace datetime.now with airflow.utils.timezone.utcnow #2865 @kacpermuda
    Fixes missing timezone information in task FAIL events.
  • Spark: remove shaded dependency in ColumnLevelLineageBuilder #2850 @tnazarew
    Removes the shaded Streams dependency in ColumnLevelLineageBuilder causing a ClassNotFoundException.
  • Spark: make Delta dataset symlink consistent with non-Delta tables #2863 @dolfinus
    Makes dataset symlinks for Delta and non-Delta tables consistent.
  • Spark: use Table's properties during column-level lineage construction #2855 @ddebowczyk92
    Fixes PlanUtils3 so Dataset identifier information based on a Table's properties is also retrieved during the construction of column-level lineage.
  • Spark: extract job name creation to providers #2861 @arturowczarek
    The integration now detects if the spark.app.name was autogenerated by Glue and uses the Glue job name in such cases. Also, each job name provisioning strategy is now extracted to a separate provider.

OpenLineage 1.18.0 - 2024-07-12

Added

  • Spark: configurable integration test #2755 @pawel-big-lebowski
    Provides a command-line tool capable of running Spark integration tests that can be created without Java code.
  • Spark: OpenLineage Spark extension interfaces without runtime dependency hell #2809 #2837 @ddebowczyk92
    New Spark extension interfaces without runtime dependency hell. Includes a test to verify the integration is working properly.
  • Spark: support latest versions 3.4.3 and 3.5.1. #2743 @pawel-big-lebowski
    Upgrades CI workflows to run tests against latest Spark versions: 3.4.2 -> 3.4.3 and 3.5.0 -> 3.5.1.
  • Spark: add extraction of the masking property in column-level lineage #2789 @tnazarew
    Adds extraction of the masking property during collection of dependencies for ColumnLineageDatasetFacet creation.
  • Spark: collect table name from InsertIntoHadoopFsRelationCommand #2794 @dolfinus
    Collects a table name for INSERT INTO command for tables created with USING $fileFormat syntax, like USING orc.
  • Spark, Flink: add PostgresJdbcExtractor #2806 @dolfinus
    Adds the default 5432 port to Postgres namespaces (see the URL-mapping sketch after this list).
  • Spark, Flink: add TeradataJdbcExtractor #2826 @dolfinus
    Converts JDBC URLs like jdbc:teradata/host/DBS_PORT=1024,DATABASE=somedb to datasets with namespace teradata://host:1024 and name somedb.table.
  • Spark, Flink: add MySqlJdbcExtractor #2825 @dolfinus
    Handles different formats of MySQL JDBC URL, and produces datasets with consistent namespaces, like mysql://host:port.
  • Spark, Flink: add OracleJdbcExtractor #2824 @dolfinus
    Handles simple Oracle JDBC URLs, like oracle:thin:@//host:port/serviceName and oracle:thin@host:port:sid, and converts each to a dataset with namespace oracle://host:port and name sid.schema.table or serviceName.schema.table.
  • Spark: configurable test with Docker image provided #2822 @pawel-big-lebowski
    Extends the configurable integration test feature to allow passing the Docker image name as a parameter.
  • Spark: Support Iceberg 1.4 on Spark 3.5.1. #2838 @pawel-big-lebowski
    Includes Iceberg support for Spark 3.5 and fixes the column-level lineage facet for UNION queries.
  • Spec: add example for change in #2756 #2801 @Sheeri
    Updates the customLineage facet test for the new syntax created in #2756.
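To summarize the JDBC extractor entries above, an illustrative mapping from example URLs to the dataset coordinates they produce; the Postgres and MySQL dataset names are assumptions, the rest restate the examples given in this list:

```python
# JDBC URL -> (dataset namespace, dataset name)
jdbc_url_to_dataset = {
    "jdbc:postgresql://host/somedb": ("postgres://host:5432", "somedb.schema.table"),
    "jdbc:teradata/host/DBS_PORT=1024,DATABASE=somedb": ("teradata://host:1024", "somedb.table"),
    "jdbc:mysql://host:3306/somedb": ("mysql://host:3306", "somedb.table"),
    "oracle:thin:@//host:port/serviceName": ("oracle://host:port", "serviceName.schema.table"),
}
```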

Changed

  • Spark: fallback to spark.sql.warehouse.dir as table namespace #2767 @dolfinus
    In cases when a metastore is not used, falls back to spark.sql.warehouse.dir or hive.metastore.warehouse.dir as table namespace, instead of duplicating the table's location.

Fixed

  • Java: handle dashes in hostname for JdbcExtractors #2830 @dolfinus
    Proper handling of dashes in JDBC URL hosts.
  • Spark: fix Glue symlinks formatting bug #2807 @Akash2351
    Fixes Glue symlinks with config parsing for Glue catalogid.
  • Spark, Flink: fix DBFS namespace format #2800 @dolfinus
    Fixes the DBFS namespace format.
  • Spark: fix Glue naming format #2766 @dolfinus
    Changes the AWS Glue namespace to match Glue ARN documentation.
  • Spark: fix Iceberg dataset location #2797 @dolfinus
    Fixes Iceberg dataset namespace: instead of file:/some/path/database.table uses file:/some/path/database/table. For dataset TABLE symlink, uses warehouse location instead of database location.
  • Spark: fix NPE and incorrect comment #2827 @pawel-big-lebowski
    Fixes an error introduced by a recent Spark version upgrade that was not caught by existing tests.
  • Spark: convert scheme and authority to lowercase in JdbcLocation #2831 @dolfinus
    Converts valid JDBC URL scheme and authority to lowercase, leaving intact instance/database name, as different databases have different default case and case-sensitivity rules.

OpenLineage 1.17.1 - 2024-06-21

Added

  • Java: dataset namespace resolver feature #2720 @pawel-big-lebowski
    Adds a dataset namespace resolving mechanism that resolves dataset namespaces based on the resolvers configured. The core mechanism is implemented in openlineage-java and can be used within the Flink and Spark integrations.
  • Spark: add transformation extraction #2758 @tnazarew
    Adds a transformation type extraction mechanism.
  • Spark: add GCP run and job facets #2643 @codelixir
    Adds GCPRunFacetBuilder and GCPJobFacetBuilder to report additional facets when running on Google Cloud Platform.
  • Spark: improve namespace format for SQLServer #2773 @dolfinus
    Improves the namespace format for SQLServer.
  • Spark: verify jar content after build #2698 @pawel-big-lebowski
    Adds a tool to verify shadowJar content and prevent reported issues, which are currently hard to catch and require manually unpacking and inspecting the jar content.
  • Spec: add transformation type info #2756 @tnazarew
    Adds information about the transformation type in ColumnLineageDatasetFacet; transformationType and transformationDescription are marked as deprecated (a payload sketch follows this list).
  • Spec: implementing facet registry (following #2161) #2729 @harels
    Introduces the foundations of the new facet Registry into the repo.
  • Spec: register GCP common job facet #2740 @ngorchakova
    Registers the GCP job facet containing common attributes that will improve how lineage is parsed and displayed by the GCP platform. Based on the proposal, GCP Lineage would like to define facets that are expected from integrations. The list of supported facets is not final and will be extended by subsequent PRs.
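A rough sketch of the transformation metadata inside ColumnLineageDatasetFacet, assuming the field layout implied by the entry above; all values are illustrative:

```python
# One input field of a column lineage facet, now carrying transformation info.
input_field = {
    "namespace": "postgres://host:5432",
    "name": "db.schema.source_table",
    "field": "email",
    "transformations": [
        {
            "type": "DIRECT",
            "subtype": "TRANSFORMATION",
            "description": "",
            "masking": True,  # e.g. the column value is hashed
        }
    ],
}
```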

Removed

  • Java: remove deprecated localServerId option from Kafka config #2738 @dolfinus
    Removes localServerId from Kafka config, deprecated since 1.13.0.
  • Java: remove deprecated Transport.emit(String) #2737 @dolfinus
    Removes Transport.emit(String) support, deprecated since 1.13.0.
  • Spark: remove spark-interfaces-scala module #2781 @ddebowczyk92
    Replaces the existing spark-interfaces-scala interfaces with new ones decoupled from the Scala binary version. Allows for improved integration in environments where one cannot guarantee the same version of openlineage-java.


Fixed

  • Flink: use namespace.name as Avro complex field type #2763 @dolfinus
    namespace.name is now used as Avro "type" of complex fields (record, enum, fixed).
  • Java: repair empty dataset name #2776 @kacpermuda
    Ensures the dataset name is not empty.
  • Spark: fix events emitted for drop table for Spark 3.4 and above #2745 @pawel-big-lebowski @savannavalgi
    Includes dataset being dropped within the event, as it used to be prior to Spark 3.4.
  • Spark, Flink: fix S3 dataset names #2782 @dolfinus
    Drops the leading slash from the object storage dataset name. Converts s3a:// and s3n:// schemes to s3://.
  • Spark: fix Hive metastore namespace #2761 @dolfinus
    Fixes the dataset namespace for cases when the Hive metastore URL is set using $SPARK_CONF_DIR/hive-site.xml.
  • Spark: fix NPE in column-level lineage #2749 @pawel-big-lebowski
    The Spark agent now checks to determine if cur.getDependencies() is not null before adding dependencies.
  • Spark: refactor OpenLineageRunEventBuilder #2754 @pawel-big-lebowski
    Adds a separate class containing all the input arguments to call OpenLineageRunEventBuilder::buildRun.
  • Spark: fix historyUrl format #2741 @dolfinus
    Fixes the historyUrl format in spark_applicationDetails.
  • SQL: allow self-recursive aliases #2753 @mobuchowski
    Expressions like select * from test_orders as test_orders are now parsed properly.
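A minimal sketch of the fix, again assuming the openlineage-sql Python binding; the self-recursive alias now parses cleanly:

```python
from openlineage_sql import parse

# Previously this alias pattern confused the parser; now it yields the
# single input table test_orders.
meta = parse(["SELECT * FROM test_orders AS test_orders"])
print(meta.in_tables)
```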