Spark: Fix glue symlinks formatting bug #2807

Akash2351 · 2024-06-26T19:01:01Z

Use the correct Spark and Hadoop configuration parameters to populate symlinks when the Glue catalog is used in EMR Spark or Athena Spark jobs. With this fix, the symlinks carry Glue ARNs in the correct namespace and name formats.

Fixes: #2766
Fixes: #2765

Problem

When an AWS Glue catalog ID is set in the Spark configuration (EMR or Athena), it is not picked up, and the symlinks point to Hive URLs instead of Glue ARNs.

Solution

Read the Glue catalog settings from the Spark and Hadoop configuration parameters that EMR Spark and Athena Spark jobs actually set, so that symlinks are built as Glue ARNs.

Code changes:

- Reorder symlink resolution to try the Glue ARN before the Hive metastore URI, because Hive metastore URIs are also present in Glue catalog configurations.
- Read the Glue catalog ID from the various formats in which it can be specified (EMR Spark, Athena Spark, etc.) and use it to build Glue ARNs.
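
The reordering described above can be illustrated with a minimal Python sketch. This is not the PathUtils implementation: the function name `resolve_symlink_namespace` and its parameters are hypothetical, and the Glue ARN is assumed to have been built already from the catalog configuration.

```python
from typing import Optional


def resolve_symlink_namespace(
    glue_arn: Optional[str], hive_metastore_uri: Optional[str]
) -> Optional[str]:
    """Pick the symlink namespace, preferring the Glue ARN (hypothetical helper).

    A Glue-backed job can also expose a Hive metastore URI, so checking the
    URI first would hide the Glue catalog.
    """
    if glue_arn:
        return glue_arn
    if hive_metastore_uri:
        return hive_metastore_uri
    return None
```

With both values present, the Glue ARN wins; the Hive URI is used only when no Glue ARN could be derived.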

Before:

     ....
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/...integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
          "identifiers": [
            {
              "namespace": "hive://ip-10-1-114-149.ec2.internal:9083",
              "name": "datalake_akash.products",
              "type": "TABLE"
            }
          ]
        }
    ....

After:

     ....
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/.../integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
          "identifiers": [
            {
              "namespace": "arn:aws:glue:us-east-1:123456789012",
              "name": "table/datalake_akash/products",
              "type": "TABLE"
            }
          ]
        }
    ....
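
The shape of the "After" identifier can be reproduced with a small Python sketch. The function `glue_symlink_identifier` and its parameter names are hypothetical and only mirror the format shown in the example above; the sketch assumes the Glue catalog ID is used as the account segment of the ARN.

```python
from typing import Dict


def glue_symlink_identifier(
    region: str, catalog_id: str, database: str, table: str
) -> Dict[str, str]:
    """Build a symlink identifier in the format of the "After" example."""
    return {
        # Namespace is the Glue ARN prefix: arn:aws:glue:<region>:<catalog id>
        "namespace": f"arn:aws:glue:{region}:{catalog_id}",
        # Name is the table path inside the catalog: table/<database>/<table>
        "name": f"table/{database}/{table}",
        "type": "TABLE",
    }
```

Calling `glue_symlink_identifier("us-east-1", "123456789012", "datalake_akash", "products")` yields the identifier shown in the "After" example.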

Tested with actual EMR Spark jobs that use the Glue catalog.

One-line summary:

Bug fix: fix Glue symlinks by parsing the Glue catalog ID from the Spark and Hadoop configuration.

Checklist

You've signed-off your work
Your pull request title follows our guidelines
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

boring-cyborg · 2024-06-26T19:01:04Z

Thanks for opening your first OpenLineage pull request! We appreciate your contribution. If you haven't already, please make sure you've reviewed our guide for new contributors (https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md).

Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

Fixes [OpenLineage#2766](https://github.com/OpenLineage/OpenLineage/pull/2766/commits)

- Reorder the Symlink fetching order to start from glueArn instead of hive metastore URI since hive metastore URIs are present for glue catalog configurations as well.
- Read params for glue catalog id specified in various formats - EMR spark, Athena spark, etc and use them for glue ARNs

Signed-off-by: Akash Anjanappa <[email protected]>

Fix shared:spotlessApply formatting issues Signed-off-by: Akash Anjanappa <[email protected]>

dolfinus · 2024-06-26T20:08:21Z

integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PathUtils.java

+    // For AWS Glue access in Athena for Spark
+    // Guide: https://docs.aws.amazon.com/athena/latest/ug/spark-notebooks-cross-account-glue.html
+    Optional<String> glueCatalogIdForAthena =
+        SparkConfUtils.findSparkConfigKey(sparkConf, "spark.hadoop.hive.metastore.glue.catalogid");


This should use findHadoopConfigKey.

These configs can be passed as Spark configs as well.
This is from the AWS docs guide.

All configs with the spark.hadoop. prefix are passed to hadoopConf by Spark (with the prefix stripped).
The same option could also be passed via hive-site.xml, in which case it will not be available in SparkConf at all.

Ackd. Lemme test this and fix the code

Fixed with changes to read from Hadoop config. You were right and SparkHadoopUtil was doing this copy.
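
The prefix handling discussed in this thread can be modeled with a short Python sketch. It is a simplified illustration of the behavior described above, not Spark's actual code; the function name `hadoop_conf_from_spark_conf` is hypothetical and plain dictionaries stand in for SparkConf and the Hadoop Configuration.

```python
from typing import Dict

SPARK_HADOOP_PREFIX = "spark.hadoop."


def hadoop_conf_from_spark_conf(spark_conf: Dict[str, str]) -> Dict[str, str]:
    """Copy spark.hadoop.* entries into a Hadoop-style dict, stripping the prefix."""
    return {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }
```

Under this model, `spark.hadoop.hive.metastore.glue.catalogid` becomes visible as `hive.metastore.glue.catalogid` on the Hadoop side, which is why reading from the Hadoop configuration also covers values supplied through hive-site.xml.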

dolfinus · 2024-06-26T20:09:04Z

integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PathUtils.java

+    clientFactory =
+        clientFactory.isPresent()
+            ? clientFactory
+            : SparkConfUtils.findSparkConfigKey(sparkConf, "hive.metastore.client.factory.class");


But Spark config properties do not start with hive.

We can pass the Hive configs as Spark configs as well, at least for these Glue catalog configs.
e.g.:

    from pyspark.sql import SparkSession

    # app_name, spark_conf (a SparkConf) and catalog_id are defined elsewhere in the job.
    spark = (
        SparkSession.builder.master("yarn")
        .appName(f'{app_name}')
        .config("spark.shuffle.blockTransferService", "nio")
        .config("spark.sql.parquet.enableVectorizedReader", "false")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10")
        .config(conf=spark_conf)
        # Glue catalog settings passed with bare hive.* keys
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
        .config("hive.metastore.glue.catalogid", catalog_id)
        .enableHiveSupport()
        .getOrCreate()
    )

The same is mentioned in EMR guide as well: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html

Tested this with an actual EMR glue catalog job.

Hm, never heard about this, okay
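
Since the same setting may arrive through several routes, a lookup with fallbacks is one way to cover them. The Python sketch below is only an illustration under stated assumptions: `find_config_value` is a hypothetical helper, plain dictionaries stand in for the Hadoop and Spark configurations, and the lookup order is an assumption rather than the order used in PathUtils.

```python
from typing import Mapping, Optional


def find_config_value(
    key: str, hadoop_conf: Mapping[str, str], spark_conf: Mapping[str, str]
) -> Optional[str]:
    """Return the first non-empty value for key across the supported locations."""
    candidates = (
        hadoop_conf.get(key),                    # hive-site.xml or stripped spark.hadoop.* keys
        spark_conf.get(f"spark.hadoop.{key}"),   # prefixed key set on the Spark conf
        spark_conf.get(key),                     # bare hive.* key set on the Spark conf
    )
    return next((value for value in candidates if value), None)
```

For the PySpark session above, looking up `hive.metastore.client.factory.class` would fall through to the bare key on the Spark configuration when the Hadoop side does not carry it.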

Address PR comments - change reading config from hadoopConfig Signed-off-by: Akash Anjanappa <[email protected]>

Akash2351 · 2024-06-27T16:13:46Z

@dolfinus anything else pending to merge this fix?

dolfinus · 2024-06-27T16:19:18Z

I'm not a repo maintainer, I cannot merge this PR

Akash2351 · 2024-06-27T16:58:05Z

@pawel-big-lebowski @mobuchowski @tnazarew Can you please review this PR?

integration/spark/shared/src/test/java/io/openlineage/spark/agent/util/PathUtilsTest.java

boring-cyborg · 2024-07-01T08:08:35Z

Great job! Congrats on your first merged pull request in OpenLineage!

* Spark: Fix glue symlinks formatting bug

  Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.

  Fixes [OpenLineage#2766](https://github.com/OpenLineage/OpenLineage/pull/2766/commits)

  - Reorder the Symlink fetching order to start from glueArn instead of hive metastore URI since hive metastore URIs are present for glue catalog configurations as well.
  - Read params for glue catalog id specified in various formats - EMR spark, Athena spark, etc and use them for glue ARNs

  Signed-off-by: Akash Anjanappa <[email protected]>

* Spark: Fix glue symlinks formatting bug - fix formatting

  Fix shared:spotlessApply formatting issues

  Signed-off-by: Akash Anjanappa <[email protected]>

* Spark: Fix glue symlinks formatting bug - PR comments and feedback

  Address PR comments - change reading config from hadoopConfig

  Signed-off-by: Akash Anjanappa <[email protected]>

---------

Signed-off-by: Akash Anjanappa <[email protected]>
Co-authored-by: Akash Anjanappa <[email protected]>

Akash2351 requested a review from a team as a code owner June 26, 2024 19:01

boring-cyborg bot added the area:integration/spark, area:tests, and language:java labels Jun 26, 2024

Akash Anjanappa and others added 2 commits June 26, 2024 12:50

Spark: Fix glue symlinks formatting bug - fix formatting

45638d0

Akash2351 force-pushed the bug/fix-glue-symlinks-format branch from 084a0ac to 45638d0 Compare June 26, 2024 19:50

dolfinus reviewed Jun 26, 2024

Spark: Fix glue symlinks formatting bug - PR comments and feedback

bb3355b

pawel-big-lebowski reviewed Jul 1, 2024

pawel-big-lebowski approved these changes Jul 1, 2024

pawel-big-lebowski merged commit bb022ca into OpenLineage:main Jul 1, 2024
33 checks passed
