Spark: Fix glue symlinks formatting bug #2807
Conversation
Thanks for opening your first OpenLineage pull request! We appreciate your contribution. If you haven't already, please make sure you've reviewed our guide for new contributors (https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md).
Use the correct Spark and Hadoop configuration params for populating symlinks when using the Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for the namespace and name formats. Fixes [OpenLineage#2766](https://github.com/OpenLineage/OpenLineage/pull/2766/commits)

- Reorder the symlink fetching to start from the Glue ARN instead of the Hive metastore URI, since Hive metastore URIs are present for Glue catalog configurations as well.
- Read the Glue catalog id params specified in various formats (EMR Spark, Athena Spark, etc.) and use them for the Glue ARNs.

Signed-off-by: Akash Anjanappa <[email protected]>
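The reordering described above can be sketched as follows. This is an illustrative, stdlib-only sketch (the function name and shape are hypothetical, not the actual OpenLineage Java code): the point is that the Glue ARN must be consulted first, because Glue-backed jobs also report a Hive metastore URI.

```python
from typing import Optional


def resolve_symlink_source(glue_arn: Optional[str],
                           metastore_uri: Optional[str]) -> Optional[str]:
    """Prefer the Glue ARN over the Hive metastore URI.

    Glue catalog configurations expose a metastore URI too, so checking
    the URI first (the old order) would always win and the Glue ARN
    would never be used for the symlink.
    """
    if glue_arn:
        return glue_arn
    return metastore_uri
```

With both values present, the Glue ARN is returned; the metastore URI is only a fallback.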
Fix shared:spotlessApply formatting issues Signed-off-by: Akash Anjanappa <[email protected]>
Force-pushed from 084a0ac to 45638d0 (Compare)
```java
// For AWS Glue access in Athena for Spark
// Guide: https://docs.aws.amazon.com/athena/latest/ug/spark-notebooks-cross-account-glue.html
Optional<String> glueCatalogIdForAthena =
    SparkConfUtils.findSparkConfigKey(sparkConf, "spark.hadoop.hive.metastore.glue.catalogid");
```
There should be findHadoopConfigKey
So these configs can be passed as Spark configs as well. This is from the AWS docs guide.
All configs with the `spark.hadoop.` prefix are passed to hadoopConf by Spark (with the prefix being stripped). Also, the same option could be passed via `hive-site.xml`, and in that case it will not be available in SparkConf at all.
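The prefix-stripping behavior the reviewer describes can be sketched in a few lines (a simplified stand-in for what Spark's `SparkHadoopUtil` does when building the Hadoop configuration, not Spark's actual implementation):

```python
def to_hadoop_conf(spark_conf: dict) -> dict:
    """Mimic Spark's copy of SparkConf entries into the Hadoop configuration:
    every `spark.hadoop.*` entry is copied over with the prefix stripped;
    everything else stays Spark-only."""
    prefix = "spark.hadoop."
    return {key[len(prefix):]: value
            for key, value in spark_conf.items()
            if key.startswith(prefix)}
```

So a job setting `spark.hadoop.hive.metastore.glue.catalogid` ends up exposing `hive.metastore.glue.catalogid` on the Hadoop side, which is why reading the Hadoop configuration covers both this case and `hive-site.xml`.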
Ackd. Lemme test this and fix the code
Fixed with changes to read from Hadoop config. You were right and SparkHadoopUtil was doing this copy.
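The fixed lookup order can be sketched like this (an illustrative stand-in, not the actual `SparkConfUtils`/`PathUtils` Java code): consult the Hadoop configuration first, since it already contains both `hive-site.xml` values and stripped `spark.hadoop.*` entries, and only then fall back to a raw SparkConf key.

```python
from typing import Mapping, Optional


def find_config(hadoop_conf: Mapping[str, str],
                spark_conf: Mapping[str, str],
                key: str) -> Optional[str]:
    """Look up a config key the way the fix reads it.

    The Hadoop configuration wins because Spark copies `spark.hadoop.*`
    entries into it (prefix stripped) and hive-site.xml values appear
    only there; the plain SparkConf entry is a last resort.
    """
    if key in hadoop_conf:
        return hadoop_conf[key]
    return spark_conf.get(key)
```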
```java
clientFactory =
    clientFactory.isPresent()
        ? clientFactory
        : SparkConfUtils.findSparkConfigKey(sparkConf, "hive.metastore.client.factory.class");
```
But Spark config properties do not start with `hive.`
We can pass the Hive configs as Spark configs as well, at least for these Glue catalog configs, e.g.:
```python
spark = SparkSession.builder.master("yarn").appName(f'{app_name}') \
    .config("spark.shuffle.blockTransferService", "nio") \
    .config("spark.sql.parquet.enableVectorizedReader", "false") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10") \
    .config(conf=spark_conf) \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.metastore.glue.catalogid", catalog_id) \
    .enableHiveSupport().getOrCreate()
```
The same is mentioned in EMR guide as well: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Tested this with an actual EMR glue catalog job.
Hm, never heard about this, okay
Address PR comments - change reading config from hadoopConfig Signed-off-by: Akash Anjanappa <[email protected]>
@dolfinus anything else pending to merge this fix?
I'm not a repo maintainer, I cannot merge this PR
@pawel-big-lebowski @mobuchowski @tnazarew Can you please review this PR?
integration/spark/shared/src/test/java/io/openlineage/spark/agent/util/PathUtilsTest.java
Great job! Congrats on your first merged pull request in OpenLineage!
Squashed merge commit:

* Spark: Fix glue symlinks formatting bug
* Spark: Fix glue symlinks formatting bug - fix formatting
* Spark: Fix glue symlinks formatting bug - PR comments and feedback

Signed-off-by: Akash Anjanappa <[email protected]>
Co-authored-by: Akash Anjanappa <[email protected]>
Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.
Fixes: #2766
Fixes: #2765
Problem
When the AWS Glue catalogId is set in the Spark configuration (EMR and Athena), it is not picked up, and the symlinks point to Hive URLs instead of Glue ARNs.
Solution
Use the correct spark and hadoop configuration params for populating symlinks when using Glue catalog in EMR Spark or Athena Spark jobs. This fixes the symlinks with correct Glue ARNs for namespace and name formats.
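To make the outcome concrete, here is a minimal sketch of assembling a Glue table ARN from the catalog id and table coordinates. The ARN layout below follows AWS's general Glue table ARN format; the exact namespace/name split OpenLineage emits for the symlink facet may differ, so treat this as an assumption for illustration only.

```python
def glue_table_arn(region: str, account_id: str,
                   database: str, table: str) -> str:
    """Build a Glue table ARN (assumed AWS layout:
    arn:aws:glue:<region>:<account-id>:table/<database>/<table>).
    The account_id is the Glue catalog id read from configuration."""
    return f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
```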
Code changes:
Before:
After:
Tested with actual EMR spark jobs with Glue catalog
One-line summary:
Bug fix: fixes Glue symlinks by parsing the Glue catalog id from configuration
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project