[SPARK] Reading Apache Iceberg data from a filepath, instead of a Spark Catalog, results in no input datasets being present in the OpenLineage event #2937

d-m-h · 2024-08-15T15:59:37Z

Problem

When Spark reads an Iceberg table directly through its vN.metadata.json file, openlineage-spark assumes the data is being accessed via the Iceberg catalog. As a result, the emitted OpenLineage event contains no input datasets.

Closes: #2935

Solution

The IcebergHandler was changed to be able to handle cases where Spark is instructed to read Iceberg data by accessing an Iceberg-formatted dataset's v1.metadata.json (or v2.metadata.json).
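To illustrate the kind of path handling involved, here is a minimal sketch that derives a table root from a metadata.json location. It is not the code in IcebergHandler: the class name MetadataJsonLocation, the method tableRoot, and the assumption that the file lives at <table root>/metadata/v<N>.metadata.json are all hypothetical and chosen only for illustration.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of openlineage-spark. */
final class MetadataJsonLocation {
  // Assumed layout: <table root>/metadata/v<N>.metadata.json.
  // Group 1 captures everything before "/metadata/", i.e. the table root.
  private static final Pattern METADATA_JSON =
      Pattern.compile("^(.*)/metadata/v\\d+\\.metadata\\.json$");

  private MetadataJsonLocation() {}

  /**
   * Returns the table root if the location points at a v<N>.metadata.json file,
   * or an empty Optional for anything else (for example a catalog identifier).
   */
  static Optional<URI> tableRoot(String location) {
    Matcher matcher = METADATA_JSON.matcher(location);
    return matcher.matches()
        ? Optional.of(URI.create(matcher.group(1)))
        : Optional.empty();
  }
}
```

With this sketch, a location such as file:/tmp/warehouse/db/tbl/metadata/v2.metadata.json resolves to file:/tmp/warehouse/db/tbl, while a catalog-style identifier yields an empty result and would fall through to the catalog-based handling.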
The DataSourceV2ScanRelationInputDatasetBuilder was split into two classes that apply at different stages of the Spark job lifecycle: DataSourceV2ScanRelationOnStartInputDatasetBuilder is applied to SparkListenerSQLExecutionStart events, and DataSourceV2ScanRelationOnEndInputDatasetBuilder is applied to SparkListenerSQLExecutionEnd events. The split retains compatibility with the existing functionality related to the dataset versions facet.
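The start/end split can be pictured with the following sketch. Everything in it is a hypothetical stand-in (the Phase enum, the InputDatasetBuilder interface, and both builder classes); it does not use the Spark listener types or the openlineage-spark builder API, and only shows the idea of each builder declaring which lifecycle stage it applies to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; names here are not the openlineage-spark API. */
final class LifecycleSplitSketch {
  /** Stand-in for SQL execution start and end events. */
  enum Phase { START, END }

  /** Each builder states which lifecycle phase it is defined for. */
  interface InputDatasetBuilder {
    boolean isDefinedAt(Phase phase);
    String name();
  }

  static final class OnStartBuilder implements InputDatasetBuilder {
    public boolean isDefinedAt(Phase phase) { return phase == Phase.START; }
    public String name() { return "OnStart"; }
  }

  static final class OnEndBuilder implements InputDatasetBuilder {
    public boolean isDefinedAt(Phase phase) { return phase == Phase.END; }
    public String name() { return "OnEnd"; }
  }

  /** Returns the names of the builders that apply to the given phase. */
  static List<String> buildersFor(Phase phase) {
    List<InputDatasetBuilder> all =
        Arrays.asList(new OnStartBuilder(), new OnEndBuilder());
    List<String> names = new ArrayList<>();
    for (InputDatasetBuilder builder : all) {
      if (builder.isDefinedAt(phase)) {
        names.add(builder.name());
      }
    }
    return names;
  }
}
```

Keeping the phase check inside each builder means the dispatching code does not need to know which builder belongs to which event.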

One-line summary: Added support for deriving lineage from Iceberg datasets when their data is accessed without using a Spark catalog.

Checklist

You've signed-off your work
Your pull request title follows our guidelines
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

.../io/openlineage/spark3/agent/lifecycle/plan/DataSourceV2ScanRelationInputDatasetBuilder.java

.../spark3/src/main/java/io/openlineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

integration/spark/app/src/test/java/io/openlineage/spark/agent/util/StatefulHttpServer.java

pawel-big-lebowski

A1 piece of code. Thank you @d-m-h for bringing this improvement.

…erg datasets that are not located within the configured Iceberg SparkCatalog

1. Modified the IcebergHandler to accomplish the building of these paths
2. Splits the DataSourceV2ScanRelationInputDatasetBuilder into 2 classes that focus on different parts of the Spark lifecycle.

Regarding (1): Docker container based tests were created in the SparkIcebergMetadataJsonTest class. These tests launch Spark applications present in the "scala-fixtures" module.

Regarding (2): These classes are:

1. DataSourceV2ScanRelationOnEndInputDatasetBuilder
2. DataSourceV2ScanRelationOnStartInputDatasetBuilder

The relevant tests were also updated.

Signed-off-by: Damien Hawes <[email protected]>

d-m-h self-assigned this Aug 15, 2024

d-m-h requested a review from a team as a code owner August 15, 2024 15:59

boring-cyborg bot added area:integration/spark language:java Uses Java programming language area:tests Testing code labels Aug 15, 2024

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 06ef714 to d612e78 Compare August 16, 2024 10:33

pawel-big-lebowski reviewed Aug 19, 2024

.../io/openlineage/spark3/agent/lifecycle/plan/DataSourceV2ScanRelationInputDatasetBuilder.java

boring-cyborg bot added the language:scala Uses Scala programming language label Aug 19, 2024

datadog-integration-openlineage bot reviewed Aug 19, 2024

.../spark3/src/main/java/io/openlineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java Outdated

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 3 times, most recently from a23a511 to b53d197 Compare August 19, 2024 16:19

boring-cyborg bot added the area:ci CI label Aug 19, 2024

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 7d58108 to 00ccdbe Compare August 20, 2024 09:08

datadog-integration-openlineage bot reviewed Aug 20, 2024

integration/spark/app/src/test/java/io/openlineage/spark/agent/util/StatefulHttpServer.java Outdated

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from e4715bc to aa82140 Compare August 20, 2024 09:22

datadog-integration-openlineage bot reviewed Aug 20, 2024

integration/spark/app/src/test/java/io/openlineage/spark/agent/util/StatefulHttpServer.java Outdated

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 891d5fd to e1e73dd Compare August 20, 2024 12:24

boring-cyborg bot added the area:documentation Improvements or additions to documentation label Aug 20, 2024

pawel-big-lebowski approved these changes Aug 20, 2024

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from 4cb01d0 to 55108bc Compare August 20, 2024 13:34

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 58de4a3 to 779a297 Compare August 22, 2024 14:40

d-m-h added 13 commits August 23, 2024 17:53

Update SparkIcebergMetadataJsonTest to use Docker volumes instead of …

6225f74

…FileSystemBinds We use this in order to overcome the file system permissions when running in a CI/CD environment Signed-off-by: Damien Hawes <[email protected]>

Update the host name for the HTTP url

734017c

Signed-off-by: Damien Hawes <[email protected]>

Run spotlessApply against SparkIcebergMetadataJsonTest

45eaf6f

Signed-off-by: Damien Hawes <[email protected]>

Revert completely to using localhost

629b6fe

Signed-off-by: Damien Hawes <[email protected]>

Revert to using Random over SecureRandom as SecureRandom#nextInt isn'…

36ec228

…t available in JDK8 Signed-off-by: Damien Hawes <[email protected]>

Change the random#nextInt(Int, Int) to random#nextInt(Int)

e25545c

Signed-off-by: Damien Hawes <[email protected]>

Use a combination of file transport and Testcontainers's copyFileFrom…

d9d17f9

…Container to get the lineage events instead of using HTTP transport Signed-off-by: Damien Hawes <[email protected]>

Change the test to use "append_data" due to a specific case that exis…

f1d31fd

…ts in Spark 3.5.1 and Iceberg Signed-off-by: Damien Hawes <[email protected]>

Add logging to the test to see what events are being gathered

134b1e5

Signed-off-by: Damien Hawes <[email protected]>

Run spotlessApply against SparkIcebergMetadataJsonTest

81ae508

Signed-off-by: Damien Hawes <[email protected]>

Split readIcebergMetadataJsonOutsideConfiguredCatalog into 2 tests

e44c9c5

1. Spark 3.2.x, 3.3.x, 3.4.x 2. Spark 3.5.x Signed-off-by: Damien Hawes <[email protected]>

Suppress PMD for SparkIcebergMetadataJsonTest

7c9fb50

Signed-off-by: Damien Hawes <[email protected]>

d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from df85faf to 7c9fb50 Compare August 23, 2024 15:54

d-m-h merged commit 3cb3e1c into main Aug 23, 2024
53 checks passed

d-m-h deleted the d-m-h/2935-reading-iceberg-data-using-file-path branch August 23, 2024 17:10
