Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Fix glue symlinks formatting bug #2807

Merged

Conversation

Akash2351
Copy link
Contributor

@Akash2351 Akash2351 commented Jun 26, 2024

Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

Fixes: #2766
Fixes: #2765

Problem

When using AWS Glue catalogId in the spark configuration - EMR and Athena, it is not picked up and the symlinks point to hive urls instead of glue ARNs.

Solution

Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

Code changes:

  • Reorder the Symlink fetching order to start from glueArn instead of hive metastore URI since hive metastore URIs are present for glue catalog configurations as well.
  • Read params for glue catalog id specified in various formats - EMR spark, Athena spark, etc and use them for glue ARNs

Before:

     ....
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/...integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
          "identifiers": [
            {
              "namespace": "hive://ip-10-1-114-149.ec2.internal:9083",
              "name": "datalake_akash.products",
              "type": "TABLE"
            }
          ]
        }
    ....    

After:

     ....
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/.../integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
          "identifiers": [
            {
              "namespace": "arn:aws:glue:us-east-1:123456789012",
              "name": "table/datalake_akash/products",
              "type": "TABLE"
            }
          ]
        }
    ....    

Tested with actual EMR spark jobs with Glue catalog

One-line summary:

Bug fix: Fixes glue symlinks with config parsing for glue catalog Id

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@Akash2351 Akash2351 requested a review from a team as a code owner June 26, 2024 19:01
@boring-cyborg boring-cyborg bot added area:integration/spark area:tests Testing code language:java Uses Java programming language labels Jun 26, 2024
Copy link

boring-cyborg bot commented Jun 26, 2024

Thanks for opening your first OpenLineage pull request! We appreciate your contribution. If you haven't already, please make sure you've reviewed our guide for new contributors (https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md).

Akash Anjanappa and others added 2 commits June 26, 2024 12:50
Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

Fixes [OpenLineage#2766](https://github.com/OpenLineage/OpenLineage/pull/2766/commits)
 - Reorder the Symlink fetching order to start from glueArn instead of hive metastore URI since hive metastore URIs are present for glue catalog configurations as well.
 - Read params for glue catalog id specified in various formats - EMR spark, Athena spark, etc and use them for glue ARNs

Signed-off-by: Akash Anjanappa <[email protected]>
Fix shared:spotlessApply formatting issues

Signed-off-by: Akash Anjanappa <[email protected]>
@Akash2351 Akash2351 force-pushed the bug/fix-glue-symlinks-format branch from 084a0ac to 45638d0 Compare June 26, 2024 19:50
// For AWS Glue access in Athena for Spark
// Guide: https://docs.aws.amazon.com/athena/latest/ug/spark-notebooks-cross-account-glue.html
Optional<String> glueCatalogIdForAthena =
SparkConfUtils.findSparkConfigKey(sparkConf, "spark.hadoop.hive.metastore.glue.catalogid");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be findHadoopConfigKey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these configs can passed to spark configs as well.
This is from the AWS docs guide.

Copy link
Contributor

@dolfinus dolfinus Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All configs with spark.hadoop. prefix are passed to hadoopConf by Spark (with prefix being stripped).
Also the same option could be passed via hive-site.xml, and if this case it will not be available in SparkConf at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ackd. Lemme test this and fix the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with changes to read from Hadoop config. You were right and SparkHadoopUtil was doing this copy.

clientFactory =
clientFactory.isPresent()
? clientFactory
: SparkConfUtils.findSparkConfigKey(sparkConf, "hive.metastore.client.factory.class");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Spark config properties do not start with hive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass the hive configs to Spark configs as well. Atleast for these Glue catalog config.
eg:

spark = SparkSession.builder.master("yarn").appName(f'{app_name}').\
            config("spark.shuffle.blockTransferService", "nio").\
            config("spark.sql.parquet.enableVectorizedReader", "false"). \
            config("spark.sql.sources.partitionOverwriteMode", "dynamic"). \
            config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10"). \
            config(conf=spark_conf).\
            config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory").\
            config("hive.metastore.glue.catalogid",catalog_id).enableHiveSupport().getOrCreate()

The same is mentioned in EMR guide as well: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html

Tested this with an actual EMR glue catalog job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, never heard about this, okay

Address PR comments - change reading config from hadoopConfig

Signed-off-by: Akash Anjanappa <[email protected]>
@Akash2351
Copy link
Contributor Author

@dolfinus anything else pending to merge this fix?

@dolfinus
Copy link
Contributor

dolfinus commented Jun 27, 2024

I'm not a repo maintainer, I cannot merge this PR

@Akash2351
Copy link
Contributor Author

Akash2351 commented Jun 27, 2024

@pawel-big-lebowski @mobuchowski @tnazarew Can you please review this PR?

@pawel-big-lebowski pawel-big-lebowski merged commit bb022ca into OpenLineage:main Jul 1, 2024
33 checks passed
Copy link

boring-cyborg bot commented Jul 1, 2024

Great job! Congrats on your first merged pull request in OpenLineage!

fafnirZ pushed a commit to fafnirZ/OpenLineage that referenced this pull request Jul 3, 2024
* Spark: Fix glue symlinks formatting bug

Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

Fixes [OpenLineage#2766](https://github.com/OpenLineage/OpenLineage/pull/2766/commits)
 - Reorder the Symlink fetching order to start from glueArn instead of hive metastore URI since hive metastore URIs are present for glue catalog configurations as well.
 - Read params for glue catalog id specified in various formats - EMR spark, Athena spark, etc and use them for glue ARNs

Signed-off-by: Akash Anjanappa <[email protected]>

* Spark: Fix glue symlinks formatting bug - fix formatting

Fix shared:spotlessApply formatting issues

Signed-off-by: Akash Anjanappa <[email protected]>

* Spark: Fix glue symlinks formatting bug - PR comments and feedback

Address PR comments - change reading config from hadoopConfig

Signed-off-by: Akash Anjanappa <[email protected]>

---------

Signed-off-by: Akash Anjanappa <[email protected]>
Co-authored-by: Akash Anjanappa <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:integration/spark area:tests Testing code language:java Uses Java programming language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QUESTION] Glue namespace format
3 participants