Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add dbt-athena as a supported adapter #10020

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

hhhonzik
Copy link

@hhhonzik hhhonzik commented May 8, 2024

Closes: #3554 #789 #4940

Description:

I've recently added a custom adapter for Starrocks which seems to be working well. With some minor time investment i was able to add dbt-athena support as well. If there is a will from contributors to merge this, i can cover the adapter with tests to ensure it's all ready to merge.

Otherwise feel free to close this!

Screenshot 2024-05-08 at 22 02 57

Reviewer actions

  • I have manually tested the changes in the preview environment
  • I have reviewed the code
  • I understand that "request changes" will block this PR from merging

Copy link

netlify bot commented May 8, 2024

👷 Deploy request for peaceful-bassi-cbf284 pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit 664e9bb

@hhhonzik hhhonzik changed the title Add dbt-athena as a supported adapter feat: Add dbt-athena as a supported adapter May 8, 2024
@TuringLovesDeathMetal TuringLovesDeathMetal added the ✨ feature-request Request for a new feature or functionality label May 9, 2024
@TuringLovesDeathMetal
Copy link
Contributor

Note to add docs for this once it's reviewed!

Comment on lines 187 to 188
aws_access_key_id: credentials.awsAccessKeyId,
aws_secret_access_key: credentials.awsSecretKey,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be optional? for example if I run Lightdash in a self hosted environments (e.g. ECS) I would like to leverage the underlying role used by ECS to authenticate, without hard-coding credentials anywhere, simply because any call done by container are already authenticated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, in fact that's how i was testing the build. The only thing needed is AWS_SDK_LOAD_CONFIG=true as environment variable.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. So if I get right AWS_SDK_LOAD_CONFIG should be enough when running from ECS or EKS/pods with authentication via roles right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I've added this to .env file, but it should definitely be mentioned in the docs.

Comment on lines 58 to 67
}: TableInfo) => `SELECT table_catalog
, table_schema
, table_name
, column_name
, data_type
FROM ${database}.information_schema.columns
WHERE table_catalog = lower('${database}')
AND table_schema = lower('${schema}')
AND table_name = lower('${table}')
ORDER BY 1, 2, 3, ordinal_position`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this query is easy, but it's really really slow on scale. I've seen this query running ~2/3 minutes.
What we do in dbt-athena is to use glue apis. Also for example JDBC athena drivers uses the same concept, and in tools like Redash is possible to flag the usage of Glue APIs instead of using information_schema.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hhhonzik I'm happy to have a chat with you to change the behvior of the above, and make the retrival of metadata faster. The API to use should be simply glue get_tables https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/get_tables.html for example in boto3.

const getTables = (databaseName) => {
  const client = new GlueClient({});

  const command = new GetTablesCommand({
    DatabaseName: databaseName,
  });

  return client.send(command);
};

that supports only DatabaseName as filter, means that to to filter by a specific table we must do in memory.

};
}

private async checkQueryExequtionStateAndGetData(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a thought, isn't there any JS library that takes care of the below? for example in python there is PyAthena, that pretty much take care of all the operations around firing a query and returning the result. I'm not really familiar with the JS world, but did you do a quick search to see if we can reusable some library? it will definitely reduce complexity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's athena-express - i would prefer that, but didn't want to add too many dependencies. I'll take this as a green light and use it :)

Copy link

@nicor88 nicor88 May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, just double check with the lightdash team and let's check if athena-express is well maintained. I see that it was published the last time 2 years ago, that is not a good indicator tbh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should use athena-express. Usage is dropping, there is no activity and they haven't been bumping their dependencies to patch security issues.
On the other hand, athena-express-plus ( fork from athena-express ) is gaining popularity. 👀
Screenshot 2024-07-03 at 16 12 44

Note that we are switching our warehouse clients to stream the results instead of loading them all in memory. #10545
So we would need to adjust the code to support this as well.

) {
// In my case, queries run no faster than 800-900ms, which is why I set a 1000ms timeout
await new Promise<void>((res) => {
setTimeout(() => res(), 1000);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the timeout configurable?
Also let me get it right, if a query takes more than 1 second will you timeout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially a sleep, no? It's waiting one second before checking again for results.

@hhhonzik maybe the comment could clarify

Copy link

@nicor88 nicor88 Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I got it, it's the waited time that determine how often we check when the query result is ready. In dbt-athena we call this poll_interval.

My suggestion would be to make the value configurable, with a default value of 500 ms.
I've seen very optimized tables and queries running below 1sec, often not the case, but definitely will for my serving layer I will aim to make my tables as optimized as possible.
Also, as suggested by @magnew let's add a comment please, to give more context 🙏🏻

@magnew magnew self-assigned this Jun 12, 2024
@owlas owlas requested a deployment to duplicate_dbt-athena - jaffle_db_pg_13 PR #10404 June 13, 2024 14:28 — with Render Abandoned
@owlas owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 13, 2024 14:28 — with Render Active
@owlas owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 13, 2024 17:25 — with Render Active
@owlas owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 14, 2024 13:03 — with Render Active
@owlas owlas temporarily deployed to duplicate_dbt-athena - headless-browser PR #10404 June 14, 2024 15:43 — with Render Destroyed
@magnew magnew self-requested a review June 14, 2024 15:50
Copy link
Contributor

@magnew magnew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hhhonzik, very cool. Thanks for posting this. I have a few questions to get it shipped. Bear with me, I am not a deep expert in Athena so there might be something I am not understanding 😄

  1. There is some confusion in my mind (and in this code) about the distinction between database, schema and catalog. It seems like Athena typically treats database and schema as the same, is that right? If so, can we choose one and require it and ignore the other? If they are subtly different and both are needed, which should we require and what do we do with the other? Also, the UI seems to send whatever is entered in the schema input box as catalog. Should it be sending that as schema or database? Can we hardcode catalog or does it need an input?

  2. It looks like AWS_SDK_LOAD_CONFIG allows us to pick up role-based permissions from the aws credentials when using the lightdash CLI because you can specify the profile there. But I don't think that will work in the UI since there is no profile specified. So as far as I understand that means you can only set up the UI with a user that doesn't need to assume a role to use Athena. Does that sound right? If we want to support roles in the UI, should we add an advanced option to provide an amazon profile name (like you can in the DBT profile).

FWIW, we can decide to only support a subset of possible AWS credential configurations for now. What kind of profile definitions do you expect this to work for for now?

Thanks again!

export type WarehouseCredentials =
| SnowflakeCredentials
| RedshiftCredentials
| PostgresCredentials
| BigqueryCredentials
| DatabricksCredentials
| TrinoCredentials;
| TrinoCredentials
| CreateAthenaCredentials;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AthenaCredentials?

region_name: credentials.awsRegion,
s3_staging_dir: credentials.outputLocation,
schema: credentials.schema,
database: credentials.database,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hhhonzik

Are database and schema effectively the same in Athena? Is there a reason to require both? I'm going to make database map to schema for now. If there is a reason we need to require both, let me know.

) {
// In my case, queries run no faster than 800-900ms, which is why I set a 1000ms timeout
await new Promise<void>((res) => {
setTimeout(() => res(), 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially a sleep, no? It's waiting one second before checking again for results.

@hhhonzik maybe the comment could clarify

@owlas owlas marked this pull request as draft June 17, 2024 11:31
@nicor88
Copy link

nicor88 commented Jun 27, 2024

@magnew thanks for the review, as I have a clear interest in having this feature in, and as dbt-athena maintainer I could try to address few of the 2 questions.

  1. Athena it's based on Trino (but with some different interface to communicate to it). Said so Athena has few concepts:
    • catalog: the most important catalog it's awsdatacatalog, that is pretty much an interface to glue catalog. It's possible to use athena as a federated system to query other sources, for example postgres. In this case the catalog name will be different. - most likely the catalog can be hardcoded, and its value should be awsdatacatalog. Given the fact that the user will use lightdash to query assets produced via dbt-athena, and that dbt-athena can only write to awsdatacatalog should be enough IMHO (I cannot foreseen edge cases now, but in the first iteration it's more than enough).
    • database/schema: there is no concept of schema in athena. If we pick the most relevant catalog awsdatacatalog (aka glue catalog) we have only the concept of database. In dbt-athena we use the keyword schema to refer to a database, and I believe that we should do the same in lightdash, as it's really dbt dependent
  2. Regarding permissions I see only 2 most relevant use cases:
    a. The user provided a set of AWS_ credentials (most likely coming from an AWS IAM user). In this scenario the user it's required to provided the credentials in the UI. I believe that this PR address that properly. Specifically this will be the case in an environment where the user cannot control the containers where lightdash will be deployed (e.g. lightdash cloud)
    b. The user do not provide a set of AWS credentials, because for example has lightdash running in its own infra. In this case it will be nice to try to pick the underlying credentials used by the environment where the user deploy lightdash. For example if you deploy lightdash in ECS you can have an ecs role that have already the AWS_ credentials set, and lightdash could leverage that. The same apply to K8S with a role attached to a container, the AWS_ credentials will be available in the underlying container.

Specifically regarding point 2, the minimum requirement that should addressed by this PR, it's the support of AWS provided credentials by the user - that cover a ligthdash cloud deployment, as a self-hosted deployment (in this case, I would like to avoid to provide any credentials, but we can figure this out later).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
✨ feature-request Request for a new feature or functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support For AWS Athena As Source
7 participants