Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GCP Dataplex transport as separate module #3043

Merged
merged 6 commits into from
Sep 13, 2024

Conversation

ddebowczyk92
Copy link
Contributor

@ddebowczyk92 ddebowczyk92 commented Sep 4, 2024

This PR adds GCP transport as separate module.

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@ddebowczyk92 ddebowczyk92 requested a review from a team as a code owner September 4, 2024 12:43
@boring-cyborg boring-cyborg bot added area:ci CI area:client/java openlineage-java area:documentation Improvements or additions to documentation area:tests Testing code language:java Uses Java programming language labels Sep 4, 2024
@ddebowczyk92 ddebowczyk92 force-pushed the add-gcp-transport branch 2 times, most recently from d3a80e6 to d436ddb Compare September 4, 2024 13:01
@mobuchowski
Copy link
Member

@ddebowczyk92 GCP -> Dataplex?

@ddebowczyk92 ddebowczyk92 changed the title Add GCP transport as separate module Add GCP Dataplex transport as separate module Sep 4, 2024
@ddebowczyk92 ddebowczyk92 force-pushed the add-gcp-transport branch 6 times, most recently from 3942293 to 1c4d9e8 Compare September 4, 2024 15:14
@codecov-commenter
Copy link

codecov-commenter commented Sep 4, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.70%. Comparing base (162e837) to head (724b320).
Report is 3 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3043    /-   ##
=======================================
  Coverage   85.70%   85.70%           
=======================================
  Files          54       54           
  Lines        3112     3112           
=======================================
  Hits         2667     2667           
  Misses        445      445           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is: is this code specific to OpenLineage or certain lineage backend? In case of the latter: are we happy to build and release transports for all the other OL backends in case someone wants it? What are the advantages of openlineage community taking care of this?

The disadvantage of this approach is that: we take responsibility for the code without being able to verify its correctness. One can easily make a change within DataplexTransport which breaks the integration while passing all the tests. It's difficult to maintain such package, as even upgrading dependencies like `com.google.cloud:google-cloud-datalineage' is risky.

If we want to maintain such code, I think we need an integration test with real DataPlex credentials, same as we have for BigQuery.

.circleci/workflows/openlineage-java.yml Show resolved Hide resolved
@mobuchowski
Copy link
Member

My concern is: is this code specific to OpenLineage or certain lineage backend? In case of the latter: are we happy to build and release transports for all the other OL backends in case someone wants it? What are the advantages of openlineage community taking care of this?

I believe there is no problem with us maintaining wrappers (transports are wrappers for specific libraries) for sending events to significant OpenLineage backends. I don't see a difference between DataplexTransport and KafkaTransport here for example, given that they both produce consumable JAR that the transport uses.

Adding this as a separate releasable Maven package is actually a step forward for modularity: even if something breaks inside, it affects only users that use it - it's not the case with Kafka or Kinesis transports now for example.

The disadvantage of this approach is that: we take responsibility for the code without being able to verify its correctness. One can easily make a change within DataplexTransport which breaks the integration while passing all the tests. It's difficult to maintain such package, as even upgrading dependencies like `com.google.cloud:google-cloud-datalineage' is risky.

I don't agree with this, the analogy is comparable to KafkaTransport. We don't test if 'org.apache.kafka:kafka-clients:3.8.0' can emit data to Kafka. If Google breaks their dependency, it will be on Google to fix it. We don't have to do any workarounds for this, we can purely point to the older version of a dependency or disable release for particular transport.

If we want to maintain such code, I think we need an integration test with real DataPlex credentials, same as we have for BigQuery.

@tnazarew is writing integration consumer tests that would be able to cover whether particular transport produces information that's relevant to the particular consumer.

client/java/transports-dataplex/README.md Outdated Show resolved Hide resolved
client/java/transports-dataplex/README.md Outdated Show resolved Hide resolved
### Configuration

- `type` - string, must be `"dataplex"`. Required.
- `endpoint` - string, specifies the endpoint to which events are sent. Optional.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide the format for the endpoint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@midamkina Is the endpoint config redundant with the ProjectId Location combination?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there can be a different environment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@midamkina Is the endpoint of the form projects/{project}/locations/{location}/? If so, is the distinction here only about the service's project vs the billing project?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that projects/{project}/locations/{location}/ is the parent. Endpoint has an address that can point to a different environment.

client/java/transports-dataplex/README.md Outdated Show resolved Hide resolved
@ddebowczyk92 ddebowczyk92 force-pushed the add-gcp-transport branch 2 times, most recently from aacf882 to 318c25e Compare September 12, 2024 10:16
Signed-off-by: Dominik Dębowczyk <[email protected]>
Signed-off-by: Dominik Dębowczyk <[email protected]>
Signed-off-by: Dominik Dębowczyk <[email protected]>
Signed-off-by: Dominik Dębowczyk <[email protected]>
Signed-off-by: Dominik Dębowczyk <[email protected]>
Copy link

@pulkit-jain-G pulkit-jain-G left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pawel-big-lebowski pawel-big-lebowski merged commit 96b8f10 into OpenLineage:main Sep 13, 2024
87 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:ci CI area:client/java openlineage-java area:documentation Improvements or additions to documentation area:tests Testing code language:java Uses Java programming language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants