Releases · OpenLineage/OpenLineage
OpenLineage 1.26.0
1.26.0 - 2024-12-20
Added
- dbt: Consume dbt structured logs and report progress in real time. #3314 @MassyB
  If the `--consume-structured-logs` flag is set, the dbt integration consumes dbt structured logs and reports execution progress in real time.
- Java: Add `transform` transport to allow event modification. #3301 @pawel-big-lebowski
  The new transport type modifies events using a specified transformer class (see the sketch after this list).
- Java: Parallel event emitting for composite transport. #3305 @pawel-big-lebowski
  The composite transport now emits events in parallel. Running in parallel is the default behaviour, with `continueOnFailure` set to `true`; the default value of `continueOnFailure` changed from `false` to `true`.
- Spark: Collect `ScanReport` and `CommitReport` in OpenLineage events when dealing with Iceberg tables. #3256 @pawel-big-lebowski
  Collects additional Iceberg metrics for datasets read or written through the library. Visit the Dataset Metrics docs for more details.
- dbt: Add support for the duckdb adapter. #3280 @mobuchowski
  Adds support for the duckdb adapter in the dbt integration.
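
For the `transform` transport above, here is a minimal sketch of the transformer concept, assuming a user-supplied class that receives each event before emission; the interface shape shown is illustrative, not the exact OpenLineage API:

```java
import io.openlineage.client.OpenLineage;

// Hypothetical transformer for the `transform` transport. The real contract
// lives in openlineage-java and may differ in name and signature; this only
// illustrates the concept of modifying events before they are emitted.
public class RedactingTransformer {
  public OpenLineage.RunEvent transform(OpenLineage.RunEvent event) {
    // Inspect and, e.g., redact facets or rename jobs, then return the event.
    return event;
  }
}
```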
Changed
- Spark: Add `DatasetFactory` to support `Dataset` creation. #3207 @pawel-big-lebowski
  Adds a `DatasetFactory` class used to create `Dataset` instances.
Fixed
- Spark: Fix inconsistent dataset naming. #3285 @pawel-big-lebowski
  The leading slash is now correctly stripped from GCS paths.
OpenLineage 1.25.0
Added
- dbt: Add support for column-level lineage in the dbt integration. #3264 @mayurmadnani
  The dbt integration now uses the SQL parser to add information about collected column-level lineage.
- Spark: Add input and output statistics about datasets read and written. #3240 #3263 @pawel-big-lebowski
  Fixes issues in the existing output statistics collection mechanism and adds input statistics. Output statistics now contain the number of files written, the byte size, and the records written. Input statistics contain the byte size and number of files read, while record count is collected only for DataSourceV2 sources.
- Introduced `InputStatisticsInputDatasetFacet`. #3238 @pawel-big-lebowski
  Extends the spec with a new facet, `InputStatisticsInputDatasetFacet`, modelled after the similar `OutputStatisticsOutputDatasetFacet`, to contain statistics about the input dataset read by a job (see the sketch after this list).
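
As a rough illustration of the new facet from the Java client, a hedged sketch follows; the builder method and field names (`rowCount`, `fileCount`, `size`) are assumptions drawn from the statistics described above, so verify them against the spec and the generated client for your version:

```java
import io.openlineage.client.OpenLineage;
import java.net.URI;

class InputStatisticsExample {
  // Assumed builder/field names; check the spec and generated client classes.
  static OpenLineage.InputStatisticsInputDatasetFacet inputStats() {
    OpenLineage ol = new OpenLineage(URI.create("https://example.com/producer"));
    return ol.newInputStatisticsInputDatasetFacetBuilder()
        .rowCount(1_000L)  // records read (collected for DataSourceV2 sources)
        .fileCount(4L)     // number of files read
        .size(123_456L)    // bytes read
        .build();
  }
}
```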
Changed
- Spark: Exclude `META-INF/*TransportBuilder` from Spark extension interfaces. #3244 @tnazarew
  Excludes `META-INF/*TransportBuilder` to avoid version conflicts.
- Spark: Enable building input/output facets through `DatasetFactory`. #3207 @pawel-big-lebowski
  Adds extra capabilities to the `DatasetFactory` class and marks some public developer API methods as deprecated.
OpenLineage 1.24.2
Added
- Spark: Add Dataproc run facet to include the jobType property. #3167 @codelixir
  Updates the GCP Dataproc run facet to include the jobType property.
- Add `EnvironmentVariablesRunFacet` to the core spec. #3186 @JDarDagran
  Uses `EnvironmentVariablesRunFacet` in the Python client.
- Add assertions for format in test events. #3221 @JDarDagran
- Spark: Add integration tests for EMR. #3142 @arturowczarek
  The Spark integration now has integration tests for EMR.
Changed
- Move Kinesis to a separate module and migrate the HTTP transport to httpclient5. #3205 @mobuchowski
  Moves the Kinesis integration to a separate module and updates the HTTP transport to use HttpClient 5.x.
- Docs: Upgrade Docusaurus to 3.6. #3219 @arturowczarek
- Spark: Limit the Seq size in `RddPathUtils::extract()`. #3148 @codelixir
  Adds a flag to limit the logs in `RddPathUtils::extract()` to avoid OutOfMemoryError for large jobs.
Fixed
- Docs: Fix outdated Spark-related docs. #3215 @mobuchowski
- Fix docusaurus-mdx-checker errors. #3217 @arturowczarek
- [Integration/dbt] Parse dbt source tests. #3208 @MassyB
  dbt sources are now considered when looking for test results.
- Avoid tests in configurable test. #3141 @pawel-leszczynski
OpenLineage 1.23.0
Added
- Java: Added `CompositeTransport`. #3039 @JDarDagran
  Allows users to specify multiple targets to which OpenLineage events will be emitted.
- Spark extension interfaces: Support table extended sources. #3062 @Imbruced
  Interfaces can now extract lineage from the Table interface, not only RelationProvider.
- Java: Added GCP Dataplex transport. #3043 @ddebowczyk92
  The Dataplex transport is now available as a separate Maven package for users who want to send OL events to GCP Dataplex.
- Java: Added Google Cloud Storage transport. #3077 @ddebowczyk92
  The GCS transport is now available as a separate Maven package for users who want to send OL events to Google Cloud Storage.
- Java: Added S3 transport. #3129 @arturowczarek
  The S3 transport is now available as a separate Maven package for users who want to send OL events to S3.
- Java: Add option to configure the client via environment variables. #3094 @JDarDagran
  Specified variables are now automatically translated to configuration values.
- Python: Add option to configure the client via environment variables. #3114 @JDarDagran
  Specified variables are now automatically translated to configuration values.
- Python: Add option to add custom headers in the HTTP transport. #3116 @JDarDagran
  Allows users to add custom headers, for example for auth purposes.
- Column-level lineage: Add full dataset dependencies. #3097 #3098 @arturowczarek
  Now, if `datasetLineageEnabled` is enabled and column-level lineage depends on the whole dataset, a dataset dependency is added instead of listing all the column fields of that dataset.
- Java: `OpenLineageClient` and `Transport`s are now `AutoCloseable`. #3122 @ddebowczyk92
  This prevents a number of issues that might be caused by not closing the underlying transports (see the sketch after this list).
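
Since `OpenLineageClient` is now `AutoCloseable`, it can be managed with try-with-resources so the underlying transport is closed automatically. A minimal sketch, using `ConsoleTransport` purely for illustration:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.ConsoleTransport;

class AutoCloseableClientExample {
  public static void main(String[] args) throws Exception {
    // The client (and its transport) is closed when the block exits.
    try (OpenLineageClient client =
        OpenLineageClient.builder().transport(new ConsoleTransport()).build()) {
      // client.emit(runEvent); // emit events as usual inside the block
    }
  }
}
```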
Fixed
- Python: Facet generator does not validate optional arguments. #3054 @JDarDagran
  Fixes an issue where `NominalTimeRunFacet` broke when `nominalEndTime` was `None`.
- SQL: Report only actually used tables from CTEs, rather than all. #2962 @Imbruced
  With this change, if a SQL query defines a CTE but does not use it in the final query, the lineage won't be falsely reported.
- Fluentd: Enhance the plugin's capabilities. #3068 @jonathanlbt1
  Enhances the performance and docs of the Fluentd proxy plugin.
- SQL: Fix the parser to point to the origin table instead of CTEs. #3107 @Imbruced
  For some complex CTEs, the parser emitted the CTE as a target table instead of the original table. This is now fixed.
- Spark: Column lineage is correctly produced for the merge into command. #3095 @Imbruced
  OL now produces column-level lineage correctly when there is a potential view in the middle.
OpenLineage 1.22.0
Added
- SQL: Add support for the `USE` statement with different syntaxes. #2944 @kacpermuda
  Adjusts our Context so that it can use the parser's new support for this statement and pass it to a number of queries (see the sketch after this list).
- Spark: Add script to build Spark dependencies. #3044 @arturowczarek
  Adds a script to rebuild dependencies automatically following releases.
- Website: Versionable docs. #3007 #3023 @pawel-big-lebowski
  Adds a GitHub action that creates a new Docusaurus version on a tag push, verifiable using the openlineage-site repo. Implements a monorepo approach in a new `website` directory.
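
A hedged sketch of exercising the parser through the Java SQL bindings; `OpenLineageSql.parse` follows the openlineage-sql Java API as commonly documented, and whether the `USE` context resolves the table below should be verified against your version:

```java
import io.openlineage.sql.OpenLineageSql;
import io.openlineage.sql.SqlMeta;
import java.util.List;
import java.util.Optional;

class UseStatementExample {
  public static void main(String[] args) {
    // With USE support, `tbl` should resolve against db1 rather than being
    // treated as an unqualified table.
    Optional<SqlMeta> meta =
        OpenLineageSql.parse(List.of("USE db1", "SELECT a, b FROM tbl"));
    meta.ifPresent(m -> System.out.println(m.inTables()));
  }
}
```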
Fixed
- SQL: Add support for `SingleQuotedString` in `Identifier()`. #3035 @kacpermuda
  Single-quoted strings were being treated differently than strings with no quotes, double quotes, or backticks.
- SQL: Support the `IDENTIFIER` function instead of treating it like a table name. #2999 @kacpermuda
  Adds support for this identifier in SELECT, MERGE, UPDATE, and DELETE statements. For now, only static identifiers are supported. When a variable is used, the table is removed from lineage to avoid emitting incorrect lineage.
- Spark: Fix issue with only one table in inputs from a SQL query while reading from JDBC. #2918 @Imbruced
  Events created did not contain the correct input table when the query contained multiple tables.
- Spark: Fix AWS Glue jobs naming for RDD events. #3020 @arturowczarek
  The naming for RDD jobs now uses the same code as SQL and Application events.
OpenLineage 1.21.1
Added
- Spec: Add GCP Dataproc facet. #2987 @tnazarew
  Registers the Google Cloud Platform Dataproc run facet.
Fixed
- Airflow: Update SQL integration code to work with the latest sqlparser-rs main. #2983 @kacpermuda
  Adjusts the SQL integration after our sqlparser-rs fork was updated to the latest main.
- Spark: Fix AWS Glue jobs naming for SQL events. #3001 @arturowczarek
  SQL events now properly use the names of the jobs retrieved from AWS Glue.
- Spark: Fix issue with column lineage when using the delta merge into command. #2986 @Imbruced
  A view instance of a node is now included when gathering data sources for input columns.
- Spark: Minor Spark filters refactor. #2990 @arturowczarek
  Fixes a number of minor issues.
- Spark: Iceberg tables in AWS Glue have slashes instead of dots in symlinks. #2984 @arturowczarek
  They should use slashes and the prefix `table/`.
- Spark: Lineage for Iceberg datasets present outside of Spark's catalog is now reported. #2937 @d-m-h
  Previously, reading Iceberg datasets outside the configured Spark catalog prevented the datasets from being present in the `inputs` property of the `RunEvent`.
OpenLineage 1.20.5
Added
- Python: Add `CompositeTransport`. #2925 @JDarDagran
  Adds a `CompositeTransport` that can accept other transport configs to instantiate transports and use them to emit events.
- Spark: Compile & test the Spark integration on Java 17. #2828 @pawel-big-lebowski
  The Spark integration is always compiled with Java 17, while tests run on both Java 8 and Java 17 according to the configuration.
- Spark: Support the preview release of Spark 4.0. #2854 @pawel-big-lebowski
  Includes the Spark 4.0 preview release in the integration tests.
- Spark: Add handling for `Window`. #2901 @tnazarew
  Adds handling for `Window`-type nodes of a logical plan.
- Spark: Extract and send events with raw SQL from Spark. #2913 @Imbruced
  Adds a parser that traverses `QueryExecution` with a BFS algorithm to get the SQL query used from the SQL field (see the sketch after this list).
- Spark: Support Mongostream source. #2887 @Imbruced
  Adds a Mongo streaming visitor and tests.
- Spark: New mechanism for disabling facets. #2912 @arturowczarek
  The mechanism makes `FacetConfig` accept the disabled flag for any facet instead of passing them as a list.
- Spark: Support Kinesis source. #2906 @Imbruced
  Adds a Kinesis class handler in the streaming source builder.
- Spark: Extract `DatasetIdentifier` from extension `LineageNode`. #2900 @ddebowczyk92
  Adds support for cases in which `LogicalRelation` has a grandchild node that implements the `LineageRelation` interface.
- Spark: Extract Dataset from the underlying `BaseRelation`. #2893 @ddebowczyk92
  `DatasetIdentifier` is now extracted from the underlying node of `LogicalRelation`.
- Spark: Add descriptions and Marquez UI to the Docker Compose file. #2889 @jonathanlbt1
  Adds the `marquez-web` service to docker-compose.yml.
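
A conceptual sketch of the BFS traversal described in the raw-SQL entry above: walk the plan tree breadth-first and take the first node that carries a SQL string. `PlanNode` is a stand-in type for illustration, not an actual Spark or OpenLineage class:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

class RawSqlBfsSketch {
  // Stand-in for a logical plan node; `sql` is null when a node carries none.
  record PlanNode(String sql, List<PlanNode> children) {}

  // Breadth-first search: the shallowest node with a SQL string wins.
  static Optional<String> findSql(PlanNode root) {
    Queue<PlanNode> queue = new ArrayDeque<>();
    queue.add(root);
    while (!queue.isEmpty()) {
      PlanNode node = queue.poll();
      if (node.sql() != null) {
        return Optional.of(node.sql());
      }
      queue.addAll(node.children());
    }
    return Optional.empty();
  }
}
```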
Fixed
- Proxy: Fix bug in error message descriptions. #2880 @jonathanlbt1
  Improves error logging.
- Proxy: Update Docker image for Fluentd 1.17. #2877 @jonathanlbt1
  Upgrades the Fluentd version.
- Spark: Fix issue with the Kafka source when saving with the foreach batch method. #2868 @Imbruced
  Fixes an issue where, with Spark in streaming mode, the Kafka input was not present in the event.
- Spark: Properly set ARN in namespace for Iceberg Glue symlinks. #2943 @arturowczarek
  Makes `IcebergHandler` support Glue catalog tables and create the symlink using the code from `PathUtils`.
- Spark: Accept any provider for AWS Glue storage format. #2917 @arturowczarek
  Makes the AWS Glue ARN-generating method accept every format (including Parquet), not only Hive SerDe.
- Spark: Return valid JSON for failed logical plan serialization. #2892 @arturowczarek
  The `LogicalPlanSerializer` now returns `<failed-to-serialize-logical-plan>` for failed serialization instead of an empty string.
- Spark: Extract legacy column lineage visitors loader. #2883 @arturowczarek
  Refactors `CustomCollectorsUtils` for improved readability.
- Spark: Add Kafka input source when writing in `foreach` batch mode. #2868 @Imbruced
  Fixes a bug keeping Kafka input sources from being produced.
- Spark: Extract `DatasetIdentifier` from `SaveIntoDataSourceCommandVisitor` options. #2934 @ddebowczyk92
  Extracts `DatasetIdentifier` from the command's options instead of relying on `p.createRelation(sqlContext, command.options())`, which is a heavy operation for `JdbcRelationProvider`.
OpenLineage 1.19.0
Added
- Airflow: Add `log_url` to `AirflowRunFacet`. #2852 @dolfinus
  Adds the task instance's `log_url` field to `AirflowRunFacet`.
- Spark: Add handling for `Generate`. #2856 @tnazarew
  Adds handling for `Generate`-type nodes of a logical plan (e.g., explode operations).
- Java: Add `DerbyJdbcExtractor`. #2869 @dolfinus
  Adds a `JdbcExtractor` implementation for the Derby database. As this is a file-based DBMS, its dataset namespace is `file` and its name is an absolute path to the database file.
- Spark: Verify bytecode version of the built jar. #2859 @pawel-big-lebowski
  Extends the `JarVerifier` plugin to ensure all compiled classes have a bytecode version of Java 8 or lower.
- Spark: Add Kafka streaming source support. #2851 @d-m-h
  Adds support for Kafka streaming sources to Kafka streaming sinks. Inputs and outputs are now included in lineage events.
Fixed
- Airflow: Replace `datetime.now` with `airflow.utils.timezone.utcnow`. #2865 @kacpermuda
  Fixes missing timezone information in task FAIL events.
- Spark: Remove shaded dependency in `ColumnLevelLineageBuilder`. #2850 @tnazarew
  Removes the shaded `Streams` dependency in `ColumnLevelLineageBuilder` that was causing a `ClassNotFoundException`.
- Spark: Make Delta dataset symlinks consistent with non-Delta tables. #2863 @dolfinus
  Makes dataset symlinks for Delta and non-Delta tables consistent.
- Spark: Use a Table's properties during column-level lineage construction. #2855 @ddebowczyk92
  Fixes `PlanUtils3` so dataset identifier information based on a Table's properties is also retrieved during the construction of column-level lineage.
- Spark: Extract job name creation to providers. #2861 @arturowczarek
  The integration now detects if `spark.app.name` was autogenerated by Glue and uses the Glue job name in such cases. Also, each job name provisioning strategy is now extracted to a separate provider.
OpenLineage 1.18.0
Added
- Spark: Configurable integration test. #2755 @pawel-big-lebowski
  Provides a command-line tool capable of running Spark integration tests that can be created without Java.
- Spark: OpenLineage Spark extension interfaces without runtime dependency hell. #2809 #2837 @ddebowczyk92
  New Spark extension interfaces without runtime dependency hell. Includes a test to verify the integration is working properly.
- Spark: Support latest versions 3.4.3 and 3.5.1. #2743 @pawel-big-lebowski
  Upgrades CI workflows to run tests against the latest Spark versions: 3.4.2 -> 3.4.3 and 3.5.0 -> 3.5.1.
- Spark: Add extraction of the masking property in column-level lineage. #2789 @tnazarew
  Adds extraction of the masking property during collection of dependencies for `ColumnLineageDatasetFacet` creation.
- Spark: Collect table name from `InsertIntoHadoopFsRelationCommand`. #2794 @dolfinus
  Collects a table name for the `INSERT INTO` command for tables created with `USING $fileFormat` syntax, like `USING orc`.
- Spark, Flink: Add `PostgresJdbcExtractor`. #2806 @dolfinus
  Adds the default `5432` port to Postgres namespaces.
- Spark, Flink: Add `TeradataJdbcExtractor`. #2826 @dolfinus
  Converts JDBC URLs like `jdbc:teradata/host/DBS_PORT=1024,DATABASE=somedb` to datasets with namespace `teradata://host:1024` and name `somedb.table`.
- Spark, Flink: Add `MySqlJdbcExtractor`. #2825 @dolfinus
  Handles different formats of the MySQL JDBC URL and produces datasets with consistent namespaces, like `mysql://host:port` (see the sketch after this list).
- Spark, Flink: Add `OracleJdbcExtractor`. #2824 @dolfinus
  Handles simple Oracle JDBC URLs, like `oracle:thin:@//host:port/serviceName` and `oracle:thin@host:port:sid`, and converts each to a dataset with namespace `oracle://host:port` and name `sid.schema.table` or `serviceName.schema.table`.
- Spark: Configurable test with Docker image provided. #2822 @pawel-big-lebowski
  Extends the configurable integration test feature to enable passing the Docker image name as a parameter.
- Spark: Support Iceberg 1.4 on Spark 3.5.1. #2838 @pawel-big-lebowski
  Includes Iceberg support for Spark 3.5. Fixes the column-level lineage facet for `UNION` queries.
- Spec: Add example for change in #2756. #2801 @Sheeri
  Updates the `customLineage` facet test for the new syntax created in #2756.
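
A simplified illustration of the normalization these JDBC extractors perform: strip the `jdbc:` prefix and reduce the URL to `scheme://host:port` for the namespace. The real extractors handle vendor-specific URL forms (Teradata's comma-separated parameters, Oracle's thin syntax) that this sketch does not:

```java
import java.net.URI;

class JdbcNamespaceSketch {
  // Simplified: works for URL-shaped JDBC strings like MySQL's or Postgres'.
  static String namespaceFor(String jdbcUrl) {
    URI uri = URI.create(jdbcUrl.replaceFirst("^jdbc:", ""));
    return uri.getScheme() + "://" + uri.getAuthority();
  }

  public static void main(String[] args) {
    // Prints "mysql://host:3306", matching the MySQL example above.
    System.out.println(namespaceFor("jdbc:mysql://host:3306/mydb"));
  }
}
```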
Changed
- Spark: Fall back to `spark.sql.warehouse.dir` as table namespace. #2767 @dolfinus
  In cases when a metastore is not used, falls back to `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir` as the table namespace, instead of duplicating the table's location.
Fixed
- Java: Handle dashes in hostnames for `JdbcExtractors`. #2830 @dolfinus
  Proper handling of dashes in JDBC URL hosts.
- Spark: Fix Glue symlinks formatting bug. #2807 @Akash2351
  Fixes Glue symlinks with config parsing for the Glue `catalogid`.
- Spark, Flink: Fix DBFS namespace format. #2800 @dolfinus
  Fixes the DBFS namespace format.
- Spark: Fix Glue naming format. #2766 @dolfinus
  Changes the AWS Glue namespace to match the Glue ARN documentation.
- Spark: Fix Iceberg dataset location. #2797 @dolfinus
  Fixes the Iceberg dataset namespace: instead of `file:/some/path/database.table`, uses `file:/some/path/database/table`. For the dataset TABLE symlink, uses the warehouse location instead of the database location.
- Spark: Fix NPE and incorrect comment. #2827 @pawel-big-lebowski
  Fixes an error caused by a recent upgrade of Spark versions that did not break existing tests.
- Spark: Convert scheme and authority to lowercase in `JdbcLocation`. #2831 @dolfinus
  Converts a valid JDBC URL's scheme and authority to lowercase, leaving the instance/database name intact, as different databases have different default case and case-sensitivity rules.
OpenLineage 1.17.1
Added
- Java: Dataset namespace resolver feature. #2720 @pawel-big-lebowski
  Adds a dataset namespace resolving mechanism that resolves dataset namespaces based on the configured resolvers. The core mechanism is implemented in openlineage-java and can be used within the Flink and Spark integrations (see the sketch after this list).
- Spark: Add transformation extraction. #2758 @tnazarew
  Adds a transformation type extraction mechanism.
- Spark: Add GCP run and job facets. #2643 @codelixir
  Adds `GCPRunFacetBuilder` and `GCPJobFacetBuilder` to report additional facets when running on Google Cloud Platform.
- Spark: Improve namespace format for SQLServer. #2773 @dolfinus
  Improves the namespace format for SQLServer.
- Spark: Verify jar content after build. #2698 @pawel-big-lebowski
  Adds a tool to verify `shadowJar` content and prevent reported issues, which are currently hard to prevent and require manual verification of manually unpacked jar content.
- Spec: Add transformation type info. #2756 @tnazarew
  Adds information about the transformation type in `ColumnLineageDatasetFacet`. `transformationType` and `transformationDescription` are marked as deprecated.
- Spec: Implement facet registry (following #2161). #2729 @harels
  Introduces the foundations of the new facet registry into the repo.
- Spec: Register GCP common job facet. #2740 @ngorchakova
  Registers the GCP job facet that contains common attributes that will improve the way lineage is parsed and displayed by the GCP platform. Based on the proposal, GCP Lineage would like to define facets that are expected from integrations. The list of supported facets is not final and will be extended by later PRs.
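
A conceptual sketch of the resolver idea above, assuming an interface along these lines; the names are illustrative rather than the exact openlineage-java API. Configured resolvers map physical namespaces (for example, individual broker hosts) to stable logical ones:

```java
import java.util.regex.Pattern;

class NamespaceResolverSketch {
  // Illustrative contract: map a raw dataset namespace to a resolved one.
  interface DatasetNamespaceResolver {
    String resolve(String namespace);
  }

  // Example: collapse numbered production Kafka hosts into a single alias.
  static class PatternResolver implements DatasetNamespaceResolver {
    private static final Pattern HOST = Pattern.compile("kafka-prod\\d+\\.company\\.com");

    @Override
    public String resolve(String namespace) {
      return HOST.matcher(namespace).replaceAll("kafka-prod-cluster");
    }
  }

  public static void main(String[] args) {
    DatasetNamespaceResolver resolver = new PatternResolver();
    // Prints "kafka://kafka-prod-cluster:9092"
    System.out.println(resolver.resolve("kafka://kafka-prod13.company.com:9092"));
  }
}
```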
Removed
- Java: Remove deprecated `localServerId` option from Kafka config. #2738 @dolfinus
  Removes `localServerId` from the Kafka config; it had been deprecated since 1.13.0.
- Java: Remove deprecated `Transport.emit(String)`. #2737 @dolfinus
  Removes `Transport.emit(String)` support, deprecated since 1.13.0.
- Spark: Remove `spark-interfaces-scala` module. #2781 @ddebowczyk92
  Replaces the existing `spark-interfaces-scala` interfaces with new ones decoupled from the Scala binary version. Allows for improved integration in environments where one cannot guarantee the same version of `openlineage-java`.
Changed
- Spark: Add log info when emitting lineage from Spark (following #2650). #2769 @algorithmy1
  Enhances logging.
Fixed
- Flink: Use `namespace.name` as the Avro complex field type. #2763 @dolfinus
  `namespace.name` is now used as the Avro `"type"` of complex fields (record, enum, fixed).
- Java: Repair empty dataset name. #2776 @kacpermuda
  The dataset name should not be empty.
- Spark: Fix events emitted for `drop table` for Spark 3.4 and above. #2745 @pawel-big-lebowski @savannavalgi
  Includes the dataset being dropped within the event, as was the case prior to Spark 3.4.
- Spark, Flink: Fix S3 dataset names. #2782 @dolfinus
  Drops the leading slash from the object storage dataset name. Converts `s3a://` and `s3n://` schemes to `s3://`.
- Spark: Fix Hive metastore namespace. #2761 @dolfinus
  Fixes the dataset namespace for cases when the Hive metastore URL is set using `$SPARK_CONF_DIR/hive-site.xml`.
- Spark: Fix NPE in column-level lineage. #2749 @pawel-big-lebowski
  The Spark agent now checks that `cur.getDependencies()` is not null before adding dependencies.
- Spark: Refactor `OpenLineageRunEventBuilder`. #2754 @pawel-big-lebowski
  Adds a separate class containing all the input arguments to call `OpenLineageRunEventBuilder::buildRun`.
- Spark: Fix `historyUrl` format. #2741 @dolfinus
  Fixes the `historyUrl` format in `spark_applicationDetails`.
- SQL: Allow self-recursive aliases. #2753 @mobuchowski
  Expressions like `select * from test_orders as test_orders` are now parsed properly.