-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add dbt-athena as a supported adapter #10020
base: main
Are you sure you want to change the base?
Conversation
👷 Deploy request for peaceful-bassi-cbf284 pending review.Visit the deploys page to approve it
|
Note to add docs for this once it's reviewed! |
aws_access_key_id: credentials.awsAccessKeyId, | ||
aws_secret_access_key: credentials.awsSecretKey, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be optional? for example if I run Lightdash in a self hosted environments (e.g. ECS) I would like to leverage the underlying role used by ECS to authenticate, without hard-coding credentials anywhere, simply because any call done by container are already authenticated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are, in fact that's how i was testing the build. The only thing needed is AWS_SDK_LOAD_CONFIG=true
as environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. So if I get right AWS_SDK_LOAD_CONFIG
should be enough when running from ECS or EKS/pods with authentication via roles right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I've added this to .env
file, but it should definitely be mentioned in the docs.
}: TableInfo) => `SELECT table_catalog | ||
, table_schema | ||
, table_name | ||
, column_name | ||
, data_type | ||
FROM ${database}.information_schema.columns | ||
WHERE table_catalog = lower('${database}') | ||
AND table_schema = lower('${schema}') | ||
AND table_name = lower('${table}') | ||
ORDER BY 1, 2, 3, ordinal_position`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this query is easy, but it's really really slow on scale. I've seen this query running ~2/3 minutes.
What we do in dbt-athena is to use glue apis. Also for example JDBC athena drivers uses the same concept, and in tools like Redash is possible to flag the usage of Glue APIs instead of using information_schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hhhonzik I'm happy to have a chat with you to change the behvior of the above, and make the retrival of metadata faster. The API to use should be simply glue get_tables
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/get_tables.html for example in boto3.
const getTables = (databaseName) => {
const client = new GlueClient({});
const command = new GetTablesCommand({
DatabaseName: databaseName,
});
return client.send(command);
};
that supports only DatabaseName as filter, means that to to filter by a specific table we must do in memory.
}; | ||
} | ||
|
||
private async checkQueryExequtionStateAndGetData( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a thought, isn't there any JS library that takes care of the below? for example in python there is PyAthena, that pretty much take care of all the operations around firing a query and returning the result. I'm not really familiar with the JS world, but did you do a quick search to see if we can reusable some library? it will definitely reduce complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's athena-express - i would prefer that, but didn't want to add too many dependencies. I'll take this as a green light and use it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, just double check with the lightdash team and let's check if athena-express is well maintained. I see that it was published the last time 2 years ago, that is not a good indicator tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should use athena-express. Usage is dropping, there is no activity and they haven't been bumping their dependencies to patch security issues.
On the other hand, athena-express-plus ( fork from athena-express ) is gaining popularity. 👀
Note that we are switching our warehouse clients to stream the results instead of loading them all in memory. #10545
So we would need to adjust the code to support this as well.
packages/frontend/src/components/ProjectConnection/WarehouseForms/AthenaForm.tsx
Outdated
Show resolved
Hide resolved
…rms/AthenaForm.tsx Co-authored-by: SimonGodefroid <17337190 [email protected]>
Co-authored-by: SimonGodefroid <17337190 [email protected]>
Co-authored-by: SimonGodefroid <17337190 [email protected]>
) { | ||
// In my case, queries run no faster than 800-900ms, which is why I set a 1000ms timeout | ||
await new Promise<void>((res) => { | ||
setTimeout(() => res(), 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make the timeout configurable?
Also let me get it right, if a query takes more than 1 second will you timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially a sleep, no? It's waiting one second before checking again for results.
@hhhonzik maybe the comment could clarify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I got it, it's the waited time that determine how often we check when the query result is ready. In dbt-athena we call this poll_interval.
My suggestion would be to make the value configurable, with a default value of 500 ms.
I've seen very optimized tables and queries running below 1sec, often not the case, but definitely will for my serving layer I will aim to make my tables as optimized as possible.
Also, as suggested by @magnew let's add a comment please, to give more context 🙏🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hhhonzik, very cool. Thanks for posting this. I have a few questions to get it shipped. Bear with me, I am not a deep expert in Athena so there might be something I am not understanding 😄
-
There is some confusion in my mind (and in this code) about the distinction between
database
,schema
andcatalog
. It seems like Athena typically treatsdatabase
andschema
as the same, is that right? If so, can we choose one and require it and ignore the other? If they are subtly different and both are needed, which should we require and what do we do with the other? Also, the UI seems to send whatever is entered in the schema input box ascatalog
. Should it be sending that as schema or database? Can we hardcode catalog or does it need an input? -
It looks like
AWS_SDK_LOAD_CONFIG
allows us to pick up role-based permissions from the aws credentials when using the lightdash CLI because you can specify the profile there. But I don't think that will work in the UI since there is no profile specified. So as far as I understand that means you can only set up the UI with a user that doesn't need to assume a role to use Athena. Does that sound right? If we want to support roles in the UI, should we add an advanced option to provide an amazon profile name (like you can in the DBT profile).
FWIW, we can decide to only support a subset of possible AWS credential configurations for now. What kind of profile definitions do you expect this to work for for now?
Thanks again!
export type WarehouseCredentials = | ||
| SnowflakeCredentials | ||
| RedshiftCredentials | ||
| PostgresCredentials | ||
| BigqueryCredentials | ||
| DatabricksCredentials | ||
| TrinoCredentials; | ||
| TrinoCredentials | ||
| CreateAthenaCredentials; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AthenaCredentials?
packages/backend/src/dbt/profiles.ts
Outdated
region_name: credentials.awsRegion, | ||
s3_staging_dir: credentials.outputLocation, | ||
schema: credentials.schema, | ||
database: credentials.database, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are database and schema effectively the same in Athena? Is there a reason to require both? I'm going to make database map to schema for now. If there is a reason we need to require both, let me know.
) { | ||
// In my case, queries run no faster than 800-900ms, which is why I set a 1000ms timeout | ||
await new Promise<void>((res) => { | ||
setTimeout(() => res(), 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially a sleep, no? It's waiting one second before checking again for results.
@hhhonzik maybe the comment could clarify
@magnew thanks for the review, as I have a clear interest in having this feature in, and as dbt-athena maintainer I could try to address few of the 2 questions.
Specifically regarding point 2, the minimum requirement that should addressed by this PR, it's the support of AWS provided credentials by the user - that cover a ligthdash cloud deployment, as a self-hosted deployment (in this case, I would like to avoid to provide any credentials, but we can figure this out later). |
Closes: #3554 #789 #4940
Description:
I've recently added a custom adapter for Starrocks which seems to be working well. With some minor time investment i was able to add dbt-athena support as well. If there is a will from contributors to merge this, i can cover the adapter with tests to ensure it's all ready to merge.
Otherwise feel free to close this!
Reviewer actions