feat: Async Query Client from Hackathon #9353

Merged — 23 commits merged into master on Apr 28, 2022
Conversation

@fuziontech (Member) commented Apr 5, 2022

Async query client backend from hackathon

Instead of using a UUID for referencing the status/results of the query, we are using a hash of the query itself. This ensures we don't end up submitting the same query over and over in a bad situation. The downside of this method is that future use cases, where people may want to run DDLs that look the same within a 10-minute period, may no-op as they hit the cache.
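For illustration, a minimal sketch of the idea (not the PR's actual code): the Redis key for a query's status/results is derived from a hash of the query text, so resubmitting an identical query within the TTL window reuses the existing entry. The key prefix and TTL constant here are assumptions.

```python
import hashlib

REDIS_STATUS_TTL = 600  # assumed: 10 minutes, matching the window mentioned above


def generate_redis_key(query: str) -> str:
    # Hash the query itself instead of minting a fresh UUID per submission,
    # so identical queries map to the same status/results entry.
    query_hash = hashlib.md5(query.encode("utf-8")).hexdigest()
    return f"query_async:{query_hash}"  # "query_async:" prefix is an assumption
```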

Note:
mypy fixes are quite annoying. Supposedly there is a better fix in progress. But that fix has been underway for 3 years 😢
lidatong/dataclasses-json#31

@fuziontech fuziontech marked this pull request as draft April 5, 2022 19:14
@EDsCODE EDsCODE added highlight ⭐ Release highlight and removed highlight ⭐ Release highlight labels Apr 5, 2022
@fuziontech fuziontech changed the title [WIP] Async Query Client from Hackathon feat: Async Query Client from Hackathon Apr 7, 2022
@fuziontech fuziontech self-assigned this Apr 7, 2022
@fuziontech fuziontech marked this pull request as ready for review April 7, 2022 20:25
@rcmarron (Contributor) left a comment

So pumped to see this! A couple of thoughts:

Reload insight experience
It'd be great if we could enable the experience where a user refreshes the page for an insight that's already running and, instead of re-running the insight, we fetch the status of the already-running query.

Using the hash of the query as the key gets us close to this, but the problem is that most insight queries contain timestamps, so the hash will be different.

For the results cache today, we use a hash of the insight's filters. We probably want to do the same thing here. However, it feels wrong to have this querying function be so tightly coupled with insights. I wonder if we should just have an optional query_id param on enqueue_execute_with_progress. If it's set, we use that. If it's not set, then we generate a random uuid like before. That way the "application layer" can manage the querying however it needs to.

If it worked this way, I'd imagine the application side of the "kick off query" function would (see the sketch after this list):

  1. Generate a query_id based on the insight filters
  2. Check if the query is already running
  3. If it isn't running, kick off a new query
  4. Return the query_id to the client
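A rough sketch of that flow (generate_insight_cache_key and query_is_running are hypothetical helpers, and the enqueue_execute_with_progress signature is simplified; this is not the PR's actual code):

```python
import hashlib
import json
import uuid
from typing import Optional


def generate_insight_cache_key(insight_filters: dict) -> str:
    # Hypothetical: hash the insight's filters, which stay stable across page refreshes.
    return hashlib.md5(json.dumps(insight_filters, sort_keys=True).encode()).hexdigest()


def query_is_running(team_id: int, query_id: str) -> bool:
    # Hypothetical: would check Redis for an in-flight status entry.
    return False


def enqueue_execute_with_progress(team_id: int, query: str, query_id: Optional[str] = None) -> str:
    # If the caller supplies a query_id, use it; otherwise generate a random UUID as before.
    query_id = query_id or uuid.uuid4().hex
    # ... enqueue the Celery task keyed on (team_id, query_id) ...
    return query_id


def kick_off_insight_query(team_id: int, insight_filters: dict, query: str) -> str:
    query_id = generate_insight_cache_key(insight_filters)  # 1. query_id from insight filters
    if not query_is_running(team_id, query_id):             # 2. already running?
        enqueue_execute_with_progress(team_id, query, query_id=query_id)  # 3. kick off a new query
    return query_id                                         # 4. return the query_id to the client
```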

Restart query experience
I imagine we may want to have an explicit "restart the query" option for the customer, for cases where the query seems to be hanging.

I don't know if this is possible, but to enable this, it would be great if, when enqueue_execute_with_progress is called with a query_id that's already running, it cancelled the existing query and started a new one (sketched below).

If this were the case, then the flow here would be the same as above, but without the check for an already running query.

If this is adding too much complexity, we could also skip this case for now, but I think we'd still want to handle the case where 2 queries have the same query_id.
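A sketch of that restart behaviour, assuming the Celery task id of a running query is stored alongside its status in Redis (get_query_status and this QueryStatus are illustrative stand-ins; the merged PR ended up exposing an override along these lines as a force flag, per the commit messages at the end of this thread):

```python
from dataclasses import dataclass
from typing import Optional

from celery import current_app


@dataclass
class QueryStatus:
    task_id: str
    complete: bool = False


def get_query_status(team_id: int, query_id: str) -> Optional[QueryStatus]:
    # Hypothetical: would read and deserialize the status blob from Redis.
    return None


def enqueue_or_restart(team_id: int, query: str, query_id: str, force: bool = False) -> str:
    existing = get_query_status(team_id, query_id)
    if existing is not None and not existing.complete:
        if not force:
            # A query with this query_id is already in flight: hand back its id.
            return query_id
        # force=True: cancel the running Celery task, then fall through and resubmit.
        current_app.control.revoke(existing.task_id, terminate=True)
    # ... enqueue a fresh task under query_id ...
    return query_id
```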

Existing results cache

I was having some internal debate about how this should relate to the existing insight results cache. I keep going back and forth between "these are very different" and "it's the exact same thing"… 🤔

I think we should keep them separate, and just wanted to share the reasoning here to make sure others agree too:

  • The results cache is a long-term cache. Not all queries run with this function should be stored in a long-term cache (e.g. they might not all be insights)
  • The results cache doesn't store raw query results. Rather it stores the transformed data to be returned to the client.

posthog/client.py (resolved review thread)
Comment on lines 178 to 180
status: str = "submitted"
complete: bool = False
error: bool = False
A contributor commented:

Having status, complete, and error feels a bit odd because it's the same information expressed in two different ways. I'm wondering if we could have only status?
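One way to collapse the three fields into a single source of truth, as a sketch rather than the PR's code: make status an enum and derive complete/error from it.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETE = "complete"
    ERRORED = "errored"


@dataclass
class QueryStatus:
    status: Status = Status.SUBMITTED

    @property
    def complete(self) -> bool:
        return self.status == Status.COMPLETE

    @property
    def error(self) -> bool:
        return self.status == Status.ERRORED
```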

@mariusandra (Collaborator) commented:

Adding query_id to the application layer is a legit fix here, though:

Using the hash of the query as the key gets us close to this, but the problem is that most insight queries contain timestamps, so the hash will be different.

Can we change the timestamps in the frontend? E.g. instead of $now_to_the_millisecond, use $now_to_the_next_hour. Could this provide a caching window large enough to be meaningful?
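For illustration only (the suggestion targets the frontend, and $now_to_the_next_hour is not an existing helper), rounding the timestamp up to the hour would keep the query text, and therefore its hash, stable for up to an hour:

```python
from datetime import datetime, timedelta, timezone


def now_to_the_next_hour() -> datetime:
    # Round "now" up to the next hour boundary so every request within the same
    # hour produces an identical timestamp (and therefore an identical query hash).
    now = datetime.now(timezone.utc)
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
```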

@mariusandra (Collaborator) commented:

Also, on the frontend, can we have a hedgehog-jumping-over-cactuses game when a query takes longer than 15 seconds?

@posthog-bot (Contributor) commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@rcmarron (Contributor) left a comment

Looks great! One question and a couple nits

Very excited to get this one in!

posthog/client.py — resolved review threads (4 outdated, 1 current)
results=rv,
task_id=task_id,
)
redis_client.set(key, query_status.to_json(), ex=REDIS_STATUS_TTL) # type: ignore
A contributor commented:

Nit: Don't feel strongly about it, but I think we can get away without the new library by doing something like: json.dumps(dataclasses.asdict(query_status))

@fuziontech (Member, Author) replied:

100% agree, but the other thing that's kind of nice about the library is the ability to unmarshal JSON back into a dataclass: DataClassExample.from_json('{"totally": "json"}')
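A side-by-side sketch of the two approaches in this thread: plain json.dumps(dataclasses.asdict(...)) versus dataclasses-json, which also provides from_json for unmarshalling back into the dataclass (the QueryStatus fields here are illustrative, not the PR's exact class):

```python
import dataclasses
import json
from dataclasses import dataclass

from dataclasses_json import dataclass_json


@dataclass_json
@dataclass
class QueryStatus:
    status: str = "submitted"
    task_id: str = ""


qs = QueryStatus(status="running", task_id="abc123")

# Standard library only: serializing is easy, deserializing back is manual.
blob = json.dumps(dataclasses.asdict(qs))
qs_roundtrip = QueryStatus(**json.loads(blob))

# dataclasses-json: to_json/from_json handle both directions.
blob2 = qs.to_json()
qs2 = QueryStatus.from_json(blob2)
```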

@fuziontech fuziontech removed the stale label Apr 28, 2022
@fuziontech (Member, Author) commented:

Dude, excellent feedback here. Thanks!

* master: (137 commits)
  feat(cohorts): add cohort filter grammars (#9540)
  feat(cohorts): Backwards compatibility of groups and properties (#9462)
  perf(ingestion): unsubscribe from buffer topic while no events are produced to it (#9556)
  fix: Fix `Loading` positioning and `LemonButton` disabled state (#9554)
  test: Speed up backend tests (#9289)
  fix: LemonSpacer -> LemonDivider (#9549)
  feat(funnels): Highlight significant deviations in new funnel viz (#9536)
  docs(storybook): Lemon UI (#9426)
  feat: add support for list of teams to enable the conversion buffer for (#9542)
  chore(onboarding): cleanup framework grid experiment (#9527)
  fix(signup): domain provisioning on cloud (#9515)
  chore: split out async migrations ci (#9539)
  feat(ingestion): enable json ingestion for self-hosted by default (#9448)
  feat(cohort): add all cohort filter selectors to Storybook (#9492)
  feat(ingestion): conversion events buffer consumer (#9432)
  ci(run-backend-tests): remove CH version default (#9532)
  feat: Add person info to events (#9404)
  feat(ingestion): produce to buffer partitioned by team_id:distinct_id (#9518)
  fix: bring latest_migrations.manifest up to date (#9525)
  chore: removes unused feature flag (#9529)
  ...
@rcmarron rcmarron self-requested a review April 28, 2022 16:40
@rcmarron (Contributor) left a comment

Ship it!

@fuziontech fuziontech merged commit 2a0c69e into master Apr 28, 2022
@fuziontech fuziontech deleted the async_query branch April 28, 2022 18:21
alexkim205 pushed a commit that referenced this pull request May 23, 2022
* cherry pick client from metahog

* cherrypick celery and requirements from metahog

* Change key to be based on hash of query and add test

* test that caching works

* black formatting

* Remove last references to uuid since it's not a uuid anymore

* don't statically set CLICKHOUSE_DATABASE

* mypy fixes (oof)

* add more tests!

* last test

* black format again

* only a bit of feedback incorporated - more to come

* add query_id override, force, and tests

* black format

* Flake8 and test docs

* black format tests

* mypy fixes

* from_json typing pains

* Feedbacked

* mypy feedback

* pin redis to 6.x
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants