Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with only one table in inputs from sql query while reading from jdbc. #2918

Conversation

Imbruced
Copy link
Contributor

@Imbruced Imbruced commented Aug 7, 2024

Problem

Closes: #2808

Solution

Take more elements from array than the first one.

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:integration/spark area:tests Testing code language:java Uses Java programming language labels Aug 7, 2024
@Imbruced Imbruced force-pushed the bug/spark/one-table-from-jdbc-query branch from 54b1ab0 to 0ffd1b2 Compare August 7, 2024 14:51
@Imbruced Imbruced marked this pull request as ready for review August 8, 2024 14:15
@Imbruced Imbruced requested a review from a team as a code owner August 8, 2024 14:15
@Tag("integration-test")
@Testcontainers
@Slf4j
class InvalidTablesInJdbcInputsTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test really verify invalid tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not good name for test :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

runEvents.stream()
.map(OpenLineage.RunEvent::getInputs)
.forEach(
inputs -> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we extracting schemas in this case? If not, then it's at least worth writing a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its interesting how we can handle schema when we have more than one table, what we can do is to somehow parse the query and look for column references but I am not sure if its the case here and for simplicity is better to keep the schema to be empty StructType ?

@@ -39,13 39,26 @@ public static <D extends OpenLineage.Dataset> List<D> getDatasets(
String jdbcUrl = relation.jdbcOptions().url();
Properties jdbcProperties = relation.jdbcOptions().asConnectionProperties();

if (meta.columnLineage().isEmpty() && meta.inTables().size() > 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come empty column level lineage makes here a difference? Could you add some comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ve tried to not make any regression. It might be that the condition is not necessary in this form.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ve changed that a little in newest commit, the issue is with how we handle the schema when we have more than one table and we dont have column lineage. In that case we need for each column to set schema to empty StructType or parse plan (but I think thats what column lineage will at the end ll do)

@Imbruced Imbruced force-pushed the bug/spark/one-table-from-jdbc-query branch 5 times, most recently from d7b688c to dff0c5d Compare August 22, 2024 13:40
Signed-off-by: pawelkocinski <[email protected]>
Signed-off-by: pawelkocinski <[email protected]>
@Imbruced Imbruced force-pushed the bug/spark/one-table-from-jdbc-query branch from dff0c5d to e619b82 Compare September 3, 2024 10:12
return datasetFactory.getDataset(
di.getName(),
di.getNamespace(),
numberOfTables == 1 ? schema : new StructType());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this result in schema facet with empty fields? If so, this is something that should be avoided.

If any other OL event emits dataset with schema facet, the schema will be attached to a dataset up until the moment a new schema facet is sent. Sending empty schema facet will clear existing information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That example is only possible to happen when we have no column lineage number of tables in the jdbc query is more than one. Maybe instead this solution we should take one step backward and add column level lineage for jdbc queries with with statement ? I don't see any better solution without stepping back than putting the empty Schema field. Do you have any ideas ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding new method io.openlineage.spark.api.DatasetFactory#getDataset(java.lang.String, java.lang.String) without StructType schema parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: pawelkocinski <[email protected]>
Signed-off-by: pawelkocinski <[email protected]>
Signed-off-by: pawelkocinski <[email protected]>
@pawel-big-lebowski pawel-big-lebowski merged commit b2d8f6a into OpenLineage:main Sep 3, 2024
48 checks passed
pawel-big-lebowski pushed a commit that referenced this pull request Sep 4, 2024
…from jdbc. (#2918)

* Fix issue with only one table in inputs from sql query while reading from jdbc.

Signed-off-by: pawelkocinski <[email protected]>

---------

Signed-off-by: pawelkocinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:integration/spark area:tests Testing code language:java Uses Java programming language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] wrong table taken added to inputs from SQL queries incl WITH statements in openlineage-spark integration
2 participants