[SPARK] Reading Apache Iceberg data from a filepath, instead of a Spark Catalog, results in no input datasets being present in the OpenLineage event #2937
Merged
boring-cyborg bot added the area:integration/spark, language:java, and area:tests labels on Aug 15, 2024
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 06ef714 to d612e78 (August 16, 2024 10:33)
.../io/openlineage/spark3/agent/lifecycle/plan/DataSourceV2ScanRelationInputDatasetBuilder.java
.../spark3/src/main/java/io/openlineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 3 times, most recently from a23a511 to b53d197 (August 19, 2024 16:19)
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from 7d58108 to 00ccdbe (August 20, 2024 09:08)
integration/spark/app/src/test/java/io/openlineage/spark/agent/util/StatefulHttpServer.java
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 2 times, most recently from e4715bc to aa82140 (August 20, 2024 09:22)
integration/spark/app/src/test/java/io/openlineage/spark/agent/util/StatefulHttpServer.java
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 891d5fd to e1e73dd (August 20, 2024 12:24)
boring-cyborg bot added the area:documentation label on Aug 20, 2024
pawel-big-lebowski approved these changes on Aug 20, 2024
A1 piece of code. Thank you @d-m-h for bringing this improvement.
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from 4cb01d0 to 55108bc (August 20, 2024 13:34)
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch 7 times, most recently from 58de4a3 to 779a297 (August 22, 2024 14:40)
…erg datasets that are not located within the configured Iceberg SparkCatalog
1. Modified the IcebergHandler to accomplish the building of these paths
2. Splits the DataSourceV2ScanRelationInputDatasetBuilder into 2 classes that focus on different parts of the Spark lifecycle.
Regarding (1): Docker container based tests were created in the SparkIcebergMetadataJsonTest class. These tests launch Spark applications present in the "scala-fixtures" module.
Regarding (2): These classes are:
1. DataSourceV2ScanRelationOnEndInputDatasetBuilder
2. DataSourceV2ScanRelationOnStartInputDatasetBuilder
The relevant tests were also updated.
Signed-off-by: Damien Hawes <[email protected]>

…FileSystemBinds
We use this in order to overcome the file system permissions when running in a CI/CD environment
Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

…t available in JDK8
Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

…Container to get the lineage events instead of using HTTP transport
Signed-off-by: Damien Hawes <[email protected]>

…ts in Spark 3.5.1 and Iceberg
Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>

1. Spark 3.2.x, 3.3.x, 3.4.x
2. Spark 3.5.x
Signed-off-by: Damien Hawes <[email protected]>

Signed-off-by: Damien Hawes <[email protected]>
d-m-h force-pushed the d-m-h/2935-reading-iceberg-data-using-file-path branch from df85faf to 7c9fb50 (August 23, 2024 15:54)
Labels
area:ci (CI)
area:documentation (Improvements or additions to documentation)
area:integration/spark
area:tests (Testing code)
language:java (Uses Java programming language)
language:scala (Uses Scala programming language)
Problem
When reading from an Iceberg table by accessing the vx.metadata.json file, openlineage-spark assumes the data is being accessed via the Iceberg catalog. As a result, no input datasets are present in the OpenLineage event.
Closes: #2935
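To make the access pattern concrete, here is a minimal sketch (not the code from this PR) of how a path ending in vN.metadata.json might be recognized and mapped back to a table location. It assumes the table root is the directory that contains the metadata/ folder; the class name MetadataPathSketch and the method tableLocation are hypothetical and exist only for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the IcebergHandler from the PR.
public class MetadataPathSketch {

    // Assumption: the metadata file lives at <table root>/metadata/v<N>.metadata.json.
    // Group 1 captures the table root.
    private static final Pattern METADATA_FILE =
        Pattern.compile("^(.*)/metadata/v\\d+\\.metadata\\.json$");

    // Returns the table root when the path points at a versioned metadata file,
    // or an empty Optional when the path does not match that shape.
    public static Optional<String> tableLocation(String path) {
        Matcher matcher = METADATA_FILE.matcher(path);
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // A path that targets the metadata file directly.
        System.out.println(tableLocation("/tmp/warehouse/db/tbl/metadata/v1.metadata.json"));
        // A plain table directory does not match the pattern.
        System.out.println(tableLocation("/tmp/warehouse/db/tbl"));
    }
}
```

A check like this only inspects the path string. It does not confirm that the file exists or that it is valid Iceberg metadata.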
Solution
IcebergHandler was changed to handle cases where Spark is instructed to read Iceberg data by accessing an Iceberg-formatted dataset's v1.metadata.json (or v2.metadata.json).
DataSourceV2ScanRelationInputDatasetBuilder was split into 2 classes that apply at different stages of the Spark job lifecycle. DataSourceV2ScanRelationOnStartInputDatasetBuilder is applied to SparkListenerSQLExecutionStart events. Similarly, DataSourceV2ScanRelationOnEndInputDatasetBuilder is applied to SparkListenerSQLExecutionEnd events. This was done in order to retain compatibility with the existing functionality related to the dataset versions facet.
One-line summary: Added support for deriving lineage from Iceberg datasets when accessing their data without using a Spark catalog.
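The split into a start-time and an end-time builder can be pictured with a small, self-contained sketch. It is not the OpenLineage code: the Event, ExecutionStart, and ExecutionEnd types below are hypothetical stand-ins for Spark's listener events, and InputDatasetBuilder, OnStartBuilder, OnEndBuilder, and applicable are invented names. The sketch only shows the idea that each builder declares which lifecycle event it applies to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types come from Spark or OpenLineage.
public class LifecycleBuilderSketch {

    // Stand-ins for SparkListenerSQLExecutionStart / SparkListenerSQLExecutionEnd.
    public interface Event {}
    public static final class ExecutionStart implements Event {}
    public static final class ExecutionEnd implements Event {}

    // Each builder states which event it handles.
    public interface InputDatasetBuilder {
        boolean isDefinedAt(Event event);
        String name();
    }

    public static final class OnStartBuilder implements InputDatasetBuilder {
        @Override public boolean isDefinedAt(Event event) { return event instanceof ExecutionStart; }
        @Override public String name() { return "OnStart"; }
    }

    public static final class OnEndBuilder implements InputDatasetBuilder {
        @Override public boolean isDefinedAt(Event event) { return event instanceof ExecutionEnd; }
        @Override public String name() { return "OnEnd"; }
    }

    private static final List<InputDatasetBuilder> BUILDERS =
        Arrays.<InputDatasetBuilder>asList(new OnStartBuilder(), new OnEndBuilder());

    // Returns the names of the builders that apply to the given event.
    public static List<String> applicable(Event event) {
        List<String> names = new ArrayList<>();
        for (InputDatasetBuilder builder : BUILDERS) {
            if (builder.isDefinedAt(event)) {
                names.add(builder.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(applicable(new ExecutionStart()));
        System.out.println(applicable(new ExecutionEnd()));
    }
}
```

Keeping the two builders separate lets each one carry its own logic for its stage of the job, which is in line with the compatibility reason stated in the description.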
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project