fix: Properly set ARN in namespace for Iceberg Glue symlinks #2943

Merged

arturowczarek

Problem

For Iceberg tables registered in the AWS Glue catalog, the symlink identifiers carry the warehouse path in the namespace field instead of the Glue ARN. For example:

"symlinks": {
  "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0-SNAPSHOT/integration/spark",
  "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
  "identifiers": [
    {
      "namespace": "s3://aowczarek-glue-test/iceberg_warehouse",
      "name": "silver.silver_buyinggroup",
      "type": "TABLE"
    }
  ]
}

What we want instead is:

"symlinks": {
  "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0-SNAPSHOT/integration/spark",
  "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
  "identifiers": [
    {
      "namespace": "arn:aws:glue:eu-central-1:654654611584",
      "name": "silver.silver_buyinggroup",
      "type": "TABLE"
    }
  ]
}

The problem is in the io.openlineage.spark3.agent.lifecycle.plan.catalog.IcebergHandler#getDatasetIdentifier method, which doesn't handle the Glue catalog type.

Solution

The IcebergHandler should recognize Glue catalog tables and build the symlink with the ARN as its namespace, reusing the existing code from PathUtils.
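
To illustrate the intended behavior, here is a minimal, self-contained sketch of the namespace selection. It is not the code from this PR: the class `GlueSymlinkSketch`, its methods, and the `"glue"` catalog-type string are hypothetical names chosen for illustration, and the real handler would obtain the region and account ID from the Spark/AWS configuration rather than from method arguments.

```java
import java.util.Optional;

// Hypothetical sketch, NOT the OpenLineage API: all names here are illustrative.
class GlueSymlinkSketch {

    /** Minimal stand-in for a symlink identifier (namespace, name, type). */
    static final class Identifier {
        final String namespace;
        final String name;
        final String type;

        Identifier(String namespace, String name, String type) {
            this.namespace = namespace;
            this.name = name;
            this.type = type;
        }
    }

    /** Builds "arn:aws:glue:<region>:<accountId>" when both parts are known. */
    static Optional<String> glueArn(String region, String accountId) {
        if (region == null || region.isEmpty()
                || accountId == null || accountId.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("arn:aws:glue:" + region + ":" + accountId);
    }

    /**
     * Chooses the symlink namespace: the Glue ARN for Glue catalogs (assumed
     * here to be flagged by the catalog type string "glue"), otherwise the
     * warehouse path. Falls back to the warehouse path if the ARN cannot be
     * built.
     */
    static Identifier tableSymlink(String catalogType, String warehouse,
                                   String tableName, String region,
                                   String accountId) {
        String namespace = warehouse;
        if ("glue".equalsIgnoreCase(catalogType)) {
            namespace = glueArn(region, accountId).orElse(warehouse);
        }
        return new Identifier(namespace, tableName, "TABLE");
    }
}
```

With the values from the example above (region `eu-central-1`, account `654654611584`, table `silver.silver_buyinggroup`), the sketch yields the namespace `arn:aws:glue:eu-central-1:654654611584` for a Glue catalog and keeps the `s3://` warehouse path for any other catalog type. Falling back to the warehouse path when the region or account ID is missing is a design choice of this sketch, not something the PR states.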

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@arturowczarek arturowczarek requested a review from a team as a code owner August 19, 2024 15:06
@boring-cyborg boring-cyborg bot added the area:integration/spark, area:tests, and language:java labels Aug 19, 2024
@arturowczarek arturowczarek force-pushed the pr/fix-iceberg-glue-symlink branch from 67d2bfb to fa21341 Compare August 20, 2024 09:40
* Fix AWS Glue Iceberg symlinks

Signed-off-by: Artur Owczarek <[email protected]>
@arturowczarek arturowczarek force-pushed the pr/fix-iceberg-glue-symlink branch from fa21341 to 91fcb55 Compare August 20, 2024 10:15
@mobuchowski mobuchowski merged commit cc7877a into OpenLineage:main Aug 20, 2024
45 checks passed