feat: Async Query Client from Hackathon #9353
Conversation
So pumped to see this! A couple of thoughts:
Reload insight experience
It'd be great if we could enable the experience where a user refreshes a page for an insight that's already running, and instead of re-running the insight, we can get the status of the already running query.
Using the hash of the query as the key gets us close to this, but the problem is that most insight queries contain timestamps, so the hash will be different.
For the results cache today, we use a hash of the insight's filters. We probably want to do the same thing here. However, it feels wrong to have this querying function be so tightly coupled with insights. I wonder if we should just have an optional `query_id` param on `enqueue_execute_with_progress`. If it's set, we use that. If it's not set, then we generate a random UUID like before. That way the "application layer" can manage the querying however it needs to.
If it worked this way, I'd imagine the application side of the "kick off query" function would:
- Generate a query_id based on the insight filters
- Check if the query is already running
- If it isn't running, kick off a new query
- Return the query_id to the client
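The four steps above could be sketched like this. This is a hypothetical illustration, not the actual PostHog API: `generate_query_id`, `kick_off_query`, and the `running`/`enqueue` parameters are assumed names.

```python
import hashlib
import json


def generate_query_id(insight_filters: dict) -> str:
    # Hash a canonical serialization of the insight's filters (not the raw
    # SQL, which embeds timestamps), so the same insight always maps to the
    # same id regardless of dict key ordering.
    canonical = json.dumps(insight_filters, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def kick_off_query(insight_filters: dict, running: set, enqueue) -> str:
    query_id = generate_query_id(insight_filters)
    if query_id not in running:
        # Not already in flight: kick off a new query.
        running.add(query_id)
        enqueue(query_id)
    # Either way, the client gets the id to poll for status.
    return query_id
```

A page refresh then produces the same `query_id` and simply polls the in-flight query instead of re-running it.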
Restart query experience
I imagine we may want to have an explicit "restart the query" option for the customer, for the case where the query seems to hang or whatever.
I don't know if this is possible, but to enable this, it would be great if, when `enqueue_execute_with_progress` gets called with a `query_id` that's already running, it cancelled the existing query and started a new one.
If this were the case, then the flow here would be the same as above, but without the check for an already running query.
If this is adding too much complexity, we could also skip this case for now, but I think we'd still want to handle the case where two queries have the same `query_id`.
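The PR's commit list mentions adding "query_id override, force, and tests", so these semantics might look something like the following sketch. All names here are illustrative assumptions, not the merged implementation.

```python
import uuid
from typing import Optional

# query_id -> submitted query text (stands in for a real task registry)
running: dict = {}


def cancel(query_id: str) -> None:
    # Placeholder for cancelling the underlying task.
    running.pop(query_id, None)


def enqueue_execute_with_progress(query: str, query_id: Optional[str] = None,
                                  force: bool = False) -> str:
    qid = query_id or uuid.uuid4().hex  # random id when the caller doesn't manage one
    if qid in running:
        if force:
            cancel(qid)   # explicit restart: kill the old run, start fresh
        else:
            return qid    # duplicate submit with the same id: no-op
    running[qid] = query
    return qid
```

With `force=False` the second submit is deduplicated; with `force=True` it restarts, which covers both the duplicate-`query_id` case and the explicit "restart the query" button.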
Existing results cache
I was having some internal debate about how this should relate to the existing insight results cache. I keep going back and forth between "these are very different" and "it's the exact same thing"...🤔
I think we should keep them separate, and just wanted to share the reasoning here to make sure others agree too:
- The results cache is a long-term cache. Not all queries run with this function should be stored in a long-term cache (e.g. they might not all be insights)
- The results cache doesn't store raw query results. Rather it stores the transformed data to be returned to the client.
posthog/client.py (Outdated)
```python
status: str = "submitted"
complete: bool = False
error: bool = False
```
Having `status`, `complete`, and `error` feels a bit odd because it's the same information in two different ways. I'm wondering if we could have only `status`?
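One way to collapse the three fields into a single source of truth is a status enum with `complete` and `error` as derived read-only properties. A sketch, not the PR's actual code:

```python
import dataclasses
import enum


class Status(str, enum.Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETE = "complete"
    ERROR = "error"


@dataclasses.dataclass
class QueryStatus:
    status: Status = Status.SUBMITTED

    @property
    def complete(self) -> bool:
        # Derived, so it can never disagree with `status`.
        return self.status is Status.COMPLETE

    @property
    def error(self) -> bool:
        return self.status is Status.ERROR
```

This removes the possibility of inconsistent combinations like `complete=True, error=True`.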
Adding `query_id` to the application layer is a legit fix here, though.
Can we change the timestamps in the frontend? E.g. instead of
Also, on the frontend, can we have a hedgehog-jumping-over-cactuses game when a query takes longer than 15 seconds?
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the
Looks great! One question and a couple nits
Very excited to get this one in!
```python
results=rv,
task_id=task_id,
)
redis_client.set(key, query_status.to_json(), ex=REDIS_STATUS_TTL)  # type: ignore
```
Nit: Don't feel strongly about it, but I think we can get away without the new library by doing something like `json.dumps(dataclasses.asdict(query_status))`.
100% agree, but the other side that's kind of nice is the ability to unmarshal JSON into a dataclass: `DataClassExample.from_json('{"totally": "json"}')`
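Both directions can be had with the stdlib alone, at the cost of writing the two methods by hand. A minimal sketch, assuming all fields are flat JSON scalars (nested dataclasses would need more work, which is where `dataclasses-json` earns its keep):

```python
import dataclasses
import json


@dataclasses.dataclass
class QueryStatus:
    status: str = "submitted"
    complete: bool = False
    error: bool = False

    def to_json(self) -> str:
        # stdlib-only marshalling: dataclass -> dict -> JSON string
        return json.dumps(dataclasses.asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "QueryStatus":
        # Works because every field is a flat JSON scalar; nested
        # dataclasses would deserialize as plain dicts here.
        return cls(**json.loads(raw))
```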
Dude, excellent feedback here. Thanks!
* master: (137 commits)
  - feat(cohorts): add cohort filter grammars (#9540)
  - feat(cohorts): Backwards compatibility of groups and properties (#9462)
  - perf(ingestion): unsubscribe from buffer topic while no events are produced to it (#9556)
  - fix: Fix `Loading` positioning and `LemonButton` disabled state (#9554)
  - test: Speed up backend tests (#9289)
  - fix: LemonSpacer -> LemonDivider (#9549)
  - feat(funnels): Highlight significant deviations in new funnel viz (#9536)
  - docs(storybook): Lemon UI (#9426)
  - feat: add support for list of teams to enable the conversion buffer for (#9542)
  - chore(onboarding): cleanup framework grid experiment (#9527)
  - fix(signup): domain provisioning on cloud (#9515)
  - chore: split out async migrations ci (#9539)
  - feat(ingestion): enable json ingestion for self-hosted by default (#9448)
  - feat(cohort): add all cohort filter selectors to Storybook (#9492)
  - feat(ingestion): conversion events buffer consumer (#9432)
  - ci(run-backend-tests): remove CH version default (#9532)
  - feat: Add person info to events (#9404)
  - feat(ingestion): produce to buffer partitioned by team_id:distinct_id (#9518)
  - fix: bring latest_migrations.manifest up to date (#9525)
  - chore: removes unused feature flag (#9529)
  - ...
Ship it!
* cherry pick client from metahog
* cherrypick celery and requirements from metahog
* Change key to be based on hash of query and add test
* test that caching works
* black formatting
* Remove last references to uuid since it's not a uuid anymore
* don't statically set CLICKHOUSE_DATABASE
* mypy fixes (oof)
* add more tests!
* last test
* black format again
* only a bit of feedback incorporated - more to come
* add query_id override, force, and tests
* black format
* Flake8 and test docs
* black format tests
* mypy fixes
* from_json typing pains
* Feedbacked
* mypy feedback
* pin redis to 6.x
Async query client backend from hackathon
Instead of using a UUID for referencing the status/results of the query, we use a hash of the query itself. This ensures we don't end up submitting the same query over and over in a bad situation. The downside of this method is that future use cases where people run DDLs that look identical within a 10-minute window may no-op as they hit the cache.
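A minimal sketch of that keying scheme; the key prefix, the `team_id` scoping, and the 10-minute TTL constant are illustrative assumptions rather than the PR's exact values:

```python
import hashlib

REDIS_STATUS_TTL = 600  # 10 minutes: the window during which identical queries dedupe


def query_key(team_id: int, query: str) -> str:
    # Key the status/result entry by a hash of the query text instead of a
    # random UUID, so resubmitting the identical query hits the same entry.
    digest = hashlib.md5(f"{team_id}:{query}".encode()).hexdigest()
    return f"async_query:{digest}"
```

The identical-DDL caveat above falls straight out of this: two byte-identical statements from the same team within the TTL produce the same key.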
Note:
mypy fixes are quite annoying. Supposedly there is a better fix in progress. But that fix has been underway for 3 years 😢
lidatong/dataclasses-json#31