fix: Fix AWS Glue jobs naming for RDD events #3020

Merged

arturowczarek

  • RDD events in AWS Glue contain correct job names

Problem

The names of RDD jobs in AWS Glue are autogenerated and differ with every execution.

Solution

The naming for RDD jobs should use the same code as SQL and Application events.

This is the second part of the naming changes. The code is ugly and requires further refactoring, but it provides support for AWS Glue naming.

The most significant changes include:

  • The ContextFactory method createRddExecutionContext, which produces an RddExecutionContext, now accepts an OpenLineageContext with the correct SparkContext (required to determine proper naming)
  • All execution contexts (RDD, SQL, Application) access the OpenLineage object from the OpenLineageContext (until now it was unnecessarily recreated in each of them). The OpenLineage object is accessed through a method instead of the fields so that the tests are not broken
  • The JobNameBuilder has an additional method for building job names for RDDs. It will be refactored further.
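
The idea behind these changes can be illustrated with a minimal sketch. It is not the code from this PR: `JobNameSketch`, `SharedContext`, `normalize`, `buildSqlJobName`, and `buildRddJobName` are hypothetical stand-ins, and the normalization rule is an assumption made only for illustration. The sketch shows RDD and SQL names being derived from one shared context through the same code path, so an RDD job name no longer depends on a value generated per execution.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical, simplified stand-ins; these are NOT the real OpenLineage classes.
class JobNameSketch {

    /** Shared context: carries the stable application name and one shared client object. */
    static final class SharedContext {
        private final String applicationName;
        private final Object client = new Object(); // stand-in for the OpenLineage object

        SharedContext(String applicationName) {
            this.applicationName = Objects.requireNonNull(applicationName);
        }

        String getApplicationName() {
            return applicationName;
        }

        // Exposed through a method rather than a field so tests can stub or override it.
        Object getClient() {
            return client;
        }
    }

    /** Assumed normalization: camelCase to snake_case, other characters to underscores. */
    static String normalize(String raw) {
        return raw.replaceAll("([a-z0-9])([A-Z])", "$1_$2")
                .replaceAll("[^A-Za-z0-9]+", "_")
                .toLowerCase(Locale.ROOT);
    }

    /** Name for a SQL event: stable application prefix plus the plan node name. */
    static String buildSqlJobName(SharedContext ctx, String nodeName) {
        return normalize(ctx.getApplicationName()) + "." + normalize(nodeName);
    }

    /** Name for an RDD event: same prefix and same normalization as the SQL path. */
    static String buildRddJobName(SharedContext ctx, String rddOperation) {
        return normalize(ctx.getApplicationName()) + "." + normalize(rddOperation);
    }
}
```

Under these assumptions, calling `buildRddJobName` with the same context and operation on two separate runs yields the same name, which is the property the autogenerated names lacked.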

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@arturowczarek arturowczarek requested a review from a team as a code owner August 30, 2024 11:06
@boring-cyborg boring-cyborg bot added the area:integration/spark, area:tests, and language:java labels Aug 30, 2024
@arturowczarek arturowczarek force-pushed the pr/fix-aws-glue-rdd branch 2 times, most recently from d3a136d to 2f6b145 Compare August 30, 2024 13:07
* RDD events in AWS Glue contain correct job names

Signed-off-by: Artur Owczarek <[email protected]>

@pawel-big-lebowski pawel-big-lebowski left a comment

Great improvements. It's a pleasure to read and review such code.

@pawel-big-lebowski pawel-big-lebowski merged commit c2218cb into OpenLineage:main Sep 2, 2024
48 checks passed
@arturowczarek arturowczarek mentioned this pull request Sep 5, 2024
10 tasks