Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Add integration tests for EMR #3142

Merged

Conversation

arturowczarek
Copy link
Collaborator

@arturowczarek arturowczarek commented Oct 8, 2024

  • Spark integration has integration tests for EMR

Problem

We are missing the integration tests for EMR. Without them some types of bugs remain undetected and we may introduce regressions without knowing it.

Solution

We should have integration tests. We don't have a sponsor for the AWS infrastructure, so every developer should be able to run them manually using their infrastructure.

For reviewers: Start reading from EmrIntegrationTest class.

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:integration/spark area:tests Testing code language:java Uses Java programming language language:python Uses Python programming language labels Oct 8, 2024
@arturowczarek arturowczarek force-pushed the test/emr-integration-test branch 9 times, most recently from 95faa83 to 0ccd0a0 Compare October 10, 2024 09:48
@arturowczarek arturowczarek marked this pull request as ready for review October 10, 2024 09:50
@arturowczarek arturowczarek requested a review from a team as a code owner October 10, 2024 09:50
@arturowczarek arturowczarek force-pushed the test/emr-integration-test branch from 0ccd0a0 to b567a55 Compare October 10, 2024 13:00
@mobuchowski
Copy link
Member

@arturowczarek can we run "regular" integration tests on EMR, or only special tests otherwise located in the emr_test_jobs directory?

@arturowczarek
Copy link
Collaborator Author

@mobuchowski The loading class locates the files in the specific directory for simplicity. We can extend it if necessary, but passing just the name of the file is convenient, because the name is used to create temporary directory where everything for this test is kept. Otherwise we could have multiple scripts with the same name and they would keep the events under the same location. Alternatively we could have a better strategy for determining the location for the events.

@mobuchowski
Copy link
Member

Makes sense, I just think we'd also potentially benefit with running a bunch of "regular" integration tests on EMR regularly.

@arturowczarek
Copy link
Collaborator Author

Yes, it is one of the methods we could further develop these tests.
There is another challenge we have to find a good solution to - managing resources and determining which tests to run. In some tests I'd like to add, there is a need to have two AWS accounts (to have two separate Glue catalogs). It is not common that developers have access to multiple accounts. In such scenarios they have to skip these tests. I'd also like to create some (possibly) Terraform file to create the infrastructure. I'll do it when we have someone to sponsor it.

Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is huge step forward in improving OL event collection on AWS.
The code looks good to me.

What system properties are required to be set to run the test?

@arturowczarek
Copy link
Collaborator Author

The parameters are listed in DynamicParameter enum. Most of them have some reasonable defaults (instance size, EMR version, roles which should be present in every account). Only one is required (openlineage.tests.bucketName). Apart from that you have to be authenticated (have ~/.aws/credentials file or use any method described in Default credentials provider chain)

@arturowczarek arturowczarek force-pushed the test/emr-integration-test branch from b567a55 to 0c59d58 Compare October 11, 2024 09:28
* Spark integration has integration tests for EMR

Signed-off-by: Artur Owczarek <[email protected]>
@arturowczarek arturowczarek force-pushed the test/emr-integration-test branch from 0c59d58 to bc7befc Compare October 11, 2024 10:03
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great test. Thank you @arturowczarek for live demo.

@pawel-big-lebowski pawel-big-lebowski merged commit bf3d4c5 into OpenLineage:main Oct 11, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:integration/spark area:tests Testing code language:java Uses Java programming language language:python Uses Python programming language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants