[SPARK] Reading Apache Iceberg data from a filepath, instead of a Spark Catalog, results in no input datasets being present in the OpenLineage event #2937

Merged
merged 13 commits into from
Aug 23, 2024

Conversation

d-m-h
Contributor

@d-m-h d-m-h commented Aug 15, 2024

Problem

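When Spark reads an Iceberg table directly through its vx.metadata.json file (for example, v1.metadata.json), openlineage-spark still assumes the data is being accessed via the Iceberg catalog. As a result, the emitted OpenLineage event contains no input datasets.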
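
To make the failing case concrete, here is a minimal sketch of how a table location could be classified as a direct metadata-file read rather than a catalog lookup, and how a dataset identifier might be derived from it. The class `MetadataJsonPaths` and its methods are hypothetical and are not taken from the IcebergHandler. The sketch assumes the file is named like `v1.metadata.json` and sits in a `metadata` directory under the table root; the namespace/name split is illustrative only.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of openlineage-spark.
final class MetadataJsonPaths {
  // Assumed layout: "<table root>/metadata/v<N>.metadata.json".
  private static final Pattern METADATA_FILE =
      Pattern.compile("^(.*)/metadata/v\\d+\\.metadata\\.json$");

  private MetadataJsonPaths() {}

  // True when the location points at a versioned metadata file
  // rather than a catalog identifier such as "local.db.tbl".
  static boolean isMetadataJson(String location) {
    return METADATA_FILE.matcher(location).matches();
  }

  // Returns the table root that contains the "metadata" directory, if any.
  static Optional<String> tableRoot(String location) {
    Matcher m = METADATA_FILE.matcher(location);
    return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
  }

  // Splits the table root into {namespace, name}. Locations without a
  // scheme are treated as local files (an assumption of this sketch).
  static Optional<String[]> datasetIdentifier(String location) {
    return tableRoot(location).map(root -> {
      URI uri = URI.create(root);
      String scheme = uri.getScheme() == null ? "file" : uri.getScheme();
      String authority = uri.getAuthority();
      String namespace = authority == null ? scheme : scheme + "://" + authority;
      return new String[] {namespace, uri.getPath()};
    });
  }
}
```

With these assumptions, `s3://bucket/warehouse/db/tbl/metadata/v2.metadata.json` yields the namespace `s3://bucket` and the name `/warehouse/db/tbl`, while a catalog identifier such as `local.db.tbl` is not treated as a file path.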

Closes: #2935

Solution

  1. The IcebergHandler now handles cases where Spark is instructed to read Iceberg data directly from an Iceberg-formatted dataset's v1.metadata.json (or v2.metadata.json) file.
  2. The DataSourceV2ScanRelationInputDatasetBuilder was split into two classes that apply at different stages of the Spark job lifecycle. DataSourceV2ScanRelationOnStartInputDatasetBuilder is applied to SparkListenerSQLExecutionStart events, and DataSourceV2ScanRelationOnEndInputDatasetBuilder is applied to SparkListenerSQLExecutionEnd events. The split retains compatibility with the existing functionality related to the dataset versions facet.
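
The split by lifecycle stage can be pictured with the following sketch. Every type in it is a hypothetical stand-in, not the openlineage-spark API: `EventKind` replaces the Spark listener events, and each builder only tags its output so the dispatch is visible. What each real builder actually emits is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only; the real builders react to Spark's
// SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events.
final class LifecycleBuilders {
  enum EventKind { SQL_EXECUTION_START, SQL_EXECUTION_END }

  interface InputDatasetBuilder {
    // Whether this builder should run for the given event.
    boolean isDefinedAt(EventKind event);

    // Produces a placeholder description of the input dataset.
    String build(String table);
  }

  static final class OnStartBuilder implements InputDatasetBuilder {
    @Override
    public boolean isDefinedAt(EventKind event) {
      return event == EventKind.SQL_EXECUTION_START;
    }

    @Override
    public String build(String table) {
      return table + " [built on start]";
    }
  }

  static final class OnEndBuilder implements InputDatasetBuilder {
    @Override
    public boolean isDefinedAt(EventKind event) {
      return event == EventKind.SQL_EXECUTION_END;
    }

    @Override
    public String build(String table) {
      return table + " [built on end]";
    }
  }

  private static final List<InputDatasetBuilder> BUILDERS =
      Arrays.asList(new OnStartBuilder(), new OnEndBuilder());

  private LifecycleBuilders() {}

  // Runs only the builders that declare themselves applicable to the event.
  static List<String> inputsFor(EventKind event, String table) {
    List<String> inputs = new ArrayList<>();
    for (InputDatasetBuilder builder : BUILDERS) {
      if (builder.isDefinedAt(event)) {
        inputs.add(builder.build(table));
      }
    }
    return inputs;
  }
}
```

Keeping one class per event type means each stage can evolve independently, which is one plausible reading of why the split preserves the existing dataset version behaviour.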

One-line summary: Added support for deriving lineage from Iceberg datasets when their data is accessed without using a Spark catalog.

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@d-m-h d-m-h self-assigned this Aug 15, 2024
@d-m-h d-m-h requested a review from a team as a code owner August 15, 2024 15:59
@boring-cyborg boring-cyborg bot added area:integration/spark language:java Uses Java programming language area:tests Testing code labels Aug 15, 2024
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 06ef714 to d612e78 Compare August 16, 2024 10:33
@boring-cyborg boring-cyborg bot added the language:scala Uses Scala programming language label Aug 19, 2024
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 3 times, most recently from a23a511 to b53d197 Compare August 19, 2024 16:19
@boring-cyborg boring-cyborg bot added the area:ci CI label Aug 19, 2024
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 7d58108 to 00ccdbe Compare August 20, 2024 09:08
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from e4715bc to aa82140 Compare August 20, 2024 09:22
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 891d5fd to e1e73dd Compare August 20, 2024 12:24
@boring-cyborg boring-cyborg bot added the area:documentation Improvements or additions to documentation label Aug 20, 2024
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

A1 piece of code. Thank you @d-m-h for bringing this improvement.

@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from 4cb01d0 to 55108bc Compare August 20, 2024 13:34
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 58de4a3 to 779a297 Compare August 22, 2024 14:40
d-m-h added 13 commits August 23, 2024 17:53
…erg datasets that are not located within the configured Iceberg SparkCatalog

1. Modified the IcebergHandler to build these paths.
2. Split the DataSourceV2ScanRelationInputDatasetBuilder into two classes that focus on different parts of the Spark lifecycle.

Regarding (1)

Docker-container-based tests were created in the SparkIcebergMetadataJsonTest class. These tests launch Spark applications from the "scala-fixtures" module.

Regarding (2)

These classes are:

1. DataSourceV2ScanRelationOnEndInputDatasetBuilder
2. DataSourceV2ScanRelationOnStartInputDatasetBuilder

The relevant tests were also updated.

Signed-off-by: Damien Hawes <[email protected]>
…FileSystemBinds

We use this to work around file system permission issues when running in a CI/CD environment.

Signed-off-by: Damien Hawes <[email protected]>
…Container to get the lineage events instead of using HTTP transport

Signed-off-by: Damien Hawes <[email protected]>
…ts in Spark 3.5.1 and Iceberg

Signed-off-by: Damien Hawes <[email protected]>
1. Spark 3.2.x, 3.3.x, 3.4.x
2. Spark 3.5.x

Signed-off-by: Damien Hawes <[email protected]>
@d-m-h d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from df85faf to 7c9fb50 Compare August 23, 2024 15:54
@d-m-h d-m-h merged commit 3cb3e1c into main Aug 23, 2024
53 checks passed
@d-m-h d-m-h deleted the d-m-h/2935-reading-iceberg-data-using-file-path branch August 23, 2024 17:10
Labels
area:ci CI area:documentation Improvements or additions to documentation area:integration/spark area:tests Testing code language:java Uses Java programming language language:scala Uses Scala programming language
2 participants