feat: Add dbt-athena as a supported adapter #10020

hhhonzik · 2024-05-08T20:07:26Z

Description:

I recently added a custom adapter for StarRocks, which seems to be working well. With a small time investment I was able to add dbt-athena support as well. If the maintainers are willing to merge this, I can cover the adapter with tests to make sure it's ready to merge.

Otherwise feel free to close this!

Reviewer actions

I have manually tested the changes in the preview environment
I have reviewed the code
I understand that "request changes" will block this PR from merging

netlify · 2024-05-08T20:07:30Z

👷 Deploy request for peaceful-bassi-cbf284 pending review.

Visit the deploys page to approve it

Name	Link
🔨 Latest commit	`664e9bb`

TuringLovesDeathMetal · 2024-05-09T12:46:10Z

Note to add docs for this once it's reviewed!

nicor88 · 2024-05-09T18:41:43Z

packages/backend/src/dbt/profiles.ts

+ aws_access_key_id: credentials.awsAccessKeyId,
+ aws_secret_access_key: credentials.awsSecretKey,


Can this be optional? For example, if I run Lightdash in a self-hosted environment (e.g. ECS), I would like to use the underlying ECS role to authenticate without hard-coding credentials anywhere, since any call made by the container is already authenticated.

They are; in fact, that's how I was testing the build. The only thing needed is AWS_SDK_LOAD_CONFIG=true as an environment variable.

Thanks for the clarification. So if I understand correctly, AWS_SDK_LOAD_CONFIG should be enough when running from ECS or EKS pods with authentication via roles, right?

Yup, I've added this to the .env file, but it should definitely be mentioned in the docs.
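
To make the "optional credentials" idea concrete, here is a minimal sketch of how the dbt target could be built so that static keys are only emitted when the user supplied them. The type `AthenaCredentialsSketch` and the function `buildAthenaTarget` are hypothetical names made up for illustration, not the code in this PR; the profile keys are the ones shown in the diffs in this conversation.

```typescript
// Hypothetical credential shape for this sketch; the real type in the PR may differ.
type AthenaCredentialsSketch = {
    awsRegion: string;
    outputLocation: string;
    schema: string;
    database: string;
    awsAccessKeyId?: string; // optional: omitted when relying on a role
    awsSecretKey?: string; // optional: omitted when relying on a role
};

// Hypothetical helper: builds the dbt-athena target from the credentials.
const buildAthenaTarget = (
    credentials: AthenaCredentialsSketch,
): Record<string, string> => {
    const target: Record<string, string> = {
        type: 'athena',
        region_name: credentials.awsRegion,
        s3_staging_dir: credentials.outputLocation,
        schema: credentials.schema,
        database: credentials.database,
    };
    // Only emit static keys when both are present. Otherwise leave them out so
    // the environment's own credentials (e.g. an ECS task role) can be used.
    if (credentials.awsAccessKeyId && credentials.awsSecretKey) {
        target.aws_access_key_id = credentials.awsAccessKeyId;
        target.aws_secret_access_key = credentials.awsSecretKey;
    }
    return target;
};
```

Leaving the keys out entirely, rather than writing empty strings, is a design choice in this sketch: it avoids passing blank credentials downstream when a role is expected to authenticate the call.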

nicor88 · 2024-05-09T18:46:41Z

packages/warehouses/src/warehouseClients/AthenaWarehouseClient.ts

+}: TableInfo) => `SELECT table_catalog
+ , table_schema
+ , table_name
+ , column_name
+ , data_type
+ FROM ${database}.information_schema.columns
+ WHERE table_catalog = lower('${database}')
+ AND table_schema = lower('${schema}')
+ AND table_name = lower('${table}')
+ ORDER BY 1, 2, 3, ordinal_position`;


This query is simple, but it's really slow at scale. I've seen it run for ~2-3 minutes.
What we do in dbt-athena is use the Glue APIs. The Athena JDBC drivers use the same approach, and in tools like Redash it's possible to enable the Glue APIs instead of querying information_schema.

@hhhonzik I'm happy to have a chat with you about changing the behavior above to make metadata retrieval faster. The API to use should simply be Glue get_tables, for example in boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/get_tables.html

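import { GlueClient, GetTablesCommand } from '@aws-sdk/client-glue';

// Fetch table metadata for one Glue database
const getTables = (databaseName) => {
    const client = new GlueClient({});
    const command = new GetTablesCommand({
        DatabaseName: databaseName,
    });
    return client.send(command);
};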

That supports only DatabaseName as a filter, which means that to filter by a specific table we must do it in memory.
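
The in-memory filtering step mentioned above could look roughly like this. It is only a sketch: the `GlueTable` and `GlueColumn` shapes are hand-written here to mirror the few fields of a Glue table listing that matter (`Name`, `StorageDescriptor.Columns`, `PartitionKeys`) rather than imported from the AWS SDK, and `filterTableColumns` is a hypothetical helper name.

```typescript
// Hand-written shapes for the sketch (not imported from the AWS SDK).
type GlueColumn = { Name?: string; Type?: string };
type GlueTable = {
    Name?: string;
    StorageDescriptor?: { Columns?: GlueColumn[] };
    PartitionKeys?: GlueColumn[];
};

type ColumnInfo = { table: string; column: string; dataType: string };

// Hypothetical helper: keep only the requested tables and flatten their
// regular columns plus partition keys into one list.
const filterTableColumns = (
    tables: GlueTable[],
    wantedTables: string[],
): ColumnInfo[] => {
    const wanted = new Set(wantedTables.map((name) => name.toLowerCase()));
    const result: ColumnInfo[] = [];
    for (const table of tables) {
        const tableName = table.Name;
        if (tableName === undefined || !wanted.has(tableName.toLowerCase())) {
            continue;
        }
        const columns = [
            ...(table.StorageDescriptor?.Columns ?? []),
            ...(table.PartitionKeys ?? []),
        ];
        for (const column of columns) {
            result.push({
                table: tableName,
                column: column.Name ?? '',
                dataType: column.Type ?? 'unknown', // placeholder when Glue has no type
            });
        }
    }
    return result;
};
```

The helper is deliberately a pure function, so it can be unit-tested without any AWS access; the paginated Glue call would feed its `tables` argument.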

nicor88 · 2024-05-09T18:48:43Z

packages/warehouses/src/warehouseClients/AthenaWarehouseClient.ts

+ };
+ }
+
+ private async checkQueryExequtionStateAndGetData(


Just a thought: isn't there a JS library that takes care of the below? For example, in Python there is PyAthena, which pretty much takes care of all the operations around firing a query and returning the result. I'm not really familiar with the JS world, but did you do a quick search to see if we can reuse a library? It would definitely reduce complexity.

There's athena-express. I would prefer that, but I didn't want to add too many dependencies. I'll take this as a green light and use it :)

Perfect, just double-check with the Lightdash team, and let's check whether athena-express is well maintained. I see that it was last published 2 years ago, which is not a good indicator tbh.

I don't think we should use athena-express. Usage is dropping, there is no activity and they haven't been bumping their dependencies to patch security issues.
On the other hand, athena-express-plus (a fork of athena-express) is gaining popularity. 👀

Note that we are switching our warehouse clients to stream the results instead of loading them all in memory. #10545
So we would need to adjust the code to support this as well.

nicor88 · 2024-06-06T06:00:21Z

packages/warehouses/src/warehouseClients/AthenaWarehouseClient.ts

+ ) {
+ // In my case, queries run no faster than 800-900ms, which is why I set a 1000ms timeout
+ await new Promise<void>((res) => {
+ setTimeout(() => res(), 1000);


Could we make the timeout configurable?
Also, let me get this right: if a query takes more than 1 second, will it time out?

This is essentially a sleep, no? It's waiting one second before checking again for results.

@hhhonzik maybe the comment could clarify

Ah, I get it: it's the wait time that determines how often we check whether the query result is ready. In dbt-athena we call this poll_interval.

My suggestion would be to make the value configurable, with a default value of 500 ms.
I've seen very optimized tables and queries running in under 1 second. That's often not the case, but for my serving layer I will definitely aim to make my tables as optimized as possible.
Also, as suggested by @magnew, let's add a comment to give more context, please 🙏🏻
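
A minimal sketch of the configurable poll interval discussed in this thread. Here `getState` is a hypothetical stand-in for whatever call fetches the Athena query execution status, and the `maxAttempts` cap is an assumption added for the sketch, not something proposed in the PR. The 500 ms default is the value suggested above.

```typescript
// Athena query execution states.
type QueryState = 'QUEUED' | 'RUNNING' | 'SUCCEEDED' | 'FAILED' | 'CANCELLED';

// Plain sleep: this is a pause between status checks, not a query timeout.
const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => {
        setTimeout(resolve, ms);
    });

// Hypothetical helper: poll until the query leaves QUEUED/RUNNING.
const waitForQuery = async (
    getState: () => Promise<QueryState>,
    pollIntervalMs = 500, // how often we check whether the result is ready
    maxAttempts = 1200, // assumption for this sketch: give up eventually
): Promise<QueryState> => {
    for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
        const state = await getState();
        if (state !== 'QUEUED' && state !== 'RUNNING') {
            return state; // terminal state: SUCCEEDED, FAILED or CANCELLED
        }
        await sleep(pollIntervalMs);
    }
    throw new Error(`Query still running after ${maxAttempts} polls`);
};
```

Naming the parameter after the interval rather than a "timeout" also addresses the confusion in the review: a slow query is simply polled more times, it is not aborted after one interval.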

magnew

@hhhonzik, very cool. Thanks for posting this. I have a few questions to get it shipped. Bear with me, I am not a deep expert in Athena so there might be something I am not understanding 😄

1. There is some confusion in my mind (and in this code) about the distinction between database, schema and catalog. It seems like Athena typically treats database and schema as the same, is that right? If so, can we choose one and require it and ignore the other? If they are subtly different and both are needed, which should we require and what do we do with the other? Also, the UI seems to send whatever is entered in the schema input box as catalog. Should it be sending that as schema or database? Can we hardcode catalog or does it need an input?
2. It looks like AWS_SDK_LOAD_CONFIG allows us to pick up role-based permissions from the AWS credentials when using the Lightdash CLI, because you can specify the profile there. But I don't think that will work in the UI since there is no profile specified. So as far as I understand, that means you can only set up the UI with a user that doesn't need to assume a role to use Athena. Does that sound right? If we want to support roles in the UI, should we add an advanced option to provide an AWS profile name (like you can in the dbt profile)?

FWIW, we can decide to only support a subset of possible AWS credential configurations for now. What kind of profile definitions do you expect this to work with at this point?

Thanks again!

magnew · 2024-06-14T11:57:03Z

packages/common/src/types/projects.ts

 export type WarehouseCredentials =
 | SnowflakeCredentials
 | RedshiftCredentials
 | PostgresCredentials
 | BigqueryCredentials
 | DatabricksCredentials
- | TrinoCredentials;
+ | TrinoCredentials
+ | CreateAthenaCredentials;


AthenaCredentials?

magnew · 2024-06-14T13:14:19Z

packages/backend/src/dbt/profiles.ts

+ region_name: credentials.awsRegion,
+ s3_staging_dir: credentials.outputLocation,
+ schema: credentials.schema,
+ database: credentials.database,


@hhhonzik

Are database and schema effectively the same in Athena? Is there a reason to require both? I'm going to make database map to schema for now. If there is a reason we need to require both, let me know.

nicor88 · 2024-06-27T06:38:44Z

@magnew thanks for the review. As I have a clear interest in getting this feature in, and as a dbt-athena maintainer, I can try to address the 2 questions.

Athena is based on Trino (but with a different interface for communicating with it). That said, Athena has a few concepts:
- catalog: the most important catalog is awsdatacatalog, which is pretty much an interface to the Glue catalog. It's possible to use Athena as a federated system to query other sources, for example Postgres; in that case the catalog name will be different. Most likely the catalog can be hardcoded to awsdatacatalog. Given that users will use Lightdash to query assets produced via dbt-athena, and that dbt-athena can only write to awsdatacatalog, this should be enough IMHO (I can't foresee edge cases right now, but for a first iteration it's more than enough).
- database/schema: there is no concept of schema in Athena. If we pick the most relevant catalog, awsdatacatalog (aka the Glue catalog), we only have the concept of database. In dbt-athena we use the keyword schema to refer to a database, and I believe we should do the same in Lightdash, as it's really dbt-dependent.
Regarding permissions, I see only 2 relevant use cases:
a. The user provides a set of AWS_ credentials (most likely from an AWS IAM user). In this scenario the user is required to provide the credentials in the UI. I believe this PR addresses that properly. Specifically, this will be the case in an environment where the user cannot control the containers where Lightdash is deployed (e.g. Lightdash Cloud).
b. The user does not provide a set of AWS credentials because, for example, Lightdash is running in their own infra. In this case it would be nice to pick up the underlying credentials used by the environment where Lightdash is deployed. For example, if you deploy Lightdash in ECS, you can have an ECS role that already has the AWS_ credentials set, and Lightdash could leverage that. The same applies to K8s with a role attached to a container: the AWS_ credentials will be available in the underlying container.

Specifically regarding point 2, the minimum requirement that this PR should address is support for AWS credentials provided by the user. That covers a Lightdash Cloud deployment as well as a self-hosted deployment (in the latter case I would like to avoid providing any credentials, but we can figure this out later).
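
The catalog/database/schema mapping described in this comment could be sketched as follows. This is an illustration under the assumptions stated in the thread (the catalog can be hardcoded to awsdatacatalog, and a "database" input is treated as the same thing as "schema"); `resolveAthenaNamespace` and its types are hypothetical names, not code from this PR.

```typescript
// Assumption from the discussion: dbt-athena only writes to the Glue-backed catalog.
const DEFAULT_ATHENA_CATALOG = 'awsdatacatalog';

type AthenaNamespaceInput = {
    schema?: string;
    database?: string; // treated as an alias for schema in this sketch
    catalog?: string; // rarely needed; only for federated sources
};

type AthenaNamespace = { catalog: string; schema: string };

// Hypothetical helper: collapse the three inputs into catalog + schema.
const resolveAthenaNamespace = (
    input: AthenaNamespaceInput,
): AthenaNamespace => {
    // Prefer schema (the dbt-athena keyword); fall back to database.
    const schema = input.schema || input.database;
    if (!schema) {
        throw new Error('Either schema or database must be provided');
    }
    return {
        catalog: input.catalog || DEFAULT_ATHENA_CATALOG,
        schema,
    };
};
```

With this shape the UI would only need a single required input, and the catalog field could stay hidden or become an advanced option later.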

hhhonzik changed the title ~~Add dbt-athena as a supported adapter~~ feat: Add dbt-athena as a supported adapter May 8, 2024

hhhonzik force-pushed the dbt-athena branch from 009e1e8 to 1a6c5b5 Compare May 8, 2024 21:11

TuringLovesDeathMetal added the ✨ feature-request Request for a new feature or functionality label May 9, 2024

nicor88 reviewed May 9, 2024

packages/backend/src/dbt/profiles.ts Outdated

nicor88 reviewed May 9, 2024

packages/cli/src/dbt/targets/athena.ts

nicor88 reviewed May 9, 2024

SimonGodefroid reviewed May 13, 2024

packages/cli/src/dbt/targets/athena.ts Outdated

SimonGodefroid reviewed May 13, 2024

packages/common/src/compiler/translator.ts

SimonGodefroid reviewed May 13, 2024

packages/common/src/utils/timeFrames.ts Outdated

SimonGodefroid reviewed May 13, 2024

packages/frontend/src/components/ProjectConnection/WarehouseForms/AthenaForm.tsx Outdated

feat: Add dbt-athena as a supported adapter

d160208

hhhonzik force-pushed the dbt-athena branch from 1a6c5b5 to d160208 Compare June 3, 2024 08:20

Honza Stepanovsky and others added 6 commits June 3, 2024 10:22

Fix AWS_SECRET_ACCESS_KEY

581c7b1

Update packages/frontend/src/components/ProjectConnection/WarehouseFo…

32d95e4

…rms/AthenaForm.tsx Co-authored-by: SimonGodefroid <17337190 [email protected]>

Update packages/common/src/compiler/translator.ts

f573f55

Co-authored-by: SimonGodefroid <17337190 [email protected]>

Update packages/cli/src/dbt/targets/athena.ts

37a1b17

Co-authored-by: SimonGodefroid <17337190 [email protected]>

add docs

8b629f0

Make Athena independent

fd71795

nicor88 reviewed Jun 6, 2024


magnew self-assigned this Jun 12, 2024

owlas requested a deployment to duplicate_dbt-athena - jaffle_db_pg_13 PR #10404 June 13, 2024 14:28 — with Render Abandoned

owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 13, 2024 14:28 — with Render Active

owlas deployed to duplicate_dbt-athena - lightdash PR #10404 June 13, 2024 14:30 — with Render View deployment

magnew added 3 commits June 13, 2024 17:28

Merge branch 'main' into duplicate_dbt-athena

31e04ae

fix: linter error

2e37a52

typo

25767e5

fix lint error

28555b9

owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 13, 2024 17:25 — with Render Active

magnew added 2 commits June 14, 2024 12:17

Merge branch 'main' into duplicate_dbt-athena

347b48e

add athena to render docker file

a720488

owlas deployed to duplicate_dbt-athena - headless-browser PR #10404 June 14, 2024 13:03 — with Render Active

update UI parts

664e9bb

owlas temporarily deployed to duplicate_dbt-athena - headless-browser PR #10404 June 14, 2024 15:43 — with Render Destroyed

magnew self-requested a review June 14, 2024 15:50

magnew requested changes Jun 14, 2024


owlas marked this pull request as draft June 17, 2024 11:31
