Convert nested join in Vector Queries to Pandas Merge. #1298

Merged
merged 6 commits into from
Oct 26, 2023

Contributor

@Chitti-Ankith Chitti-Ankith commented Oct 17, 2023

Profiling Vector Scan showed that we spend a lot of time in the post-processing logic, which performs a nested join. This is an initial commit that replaces it with a join using Pandas. The change showed a ~50% improvement in similarity queries.
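For readers unfamiliar with the pattern, here is a minimal sketch of the two approaches being compared. It is not the code from this PR: the frame names, the `row_id` key, and both helper functions are hypothetical, chosen only to illustrate a row-by-row nested join versus an index-based `pd.merge`.

```python
import pandas as pd

# Hypothetical data: hits returned by a vector index search, with distances.
search_df = pd.DataFrame({"row_id": [2, 0, 3], "distance": [0.1, 0.4, 0.9]})
# Hypothetical base-table rows produced by the child operator.
table_df = pd.DataFrame({"row_id": [0, 1, 2, 3], "name": ["a", "b", "c", "d"]})


def nested_loop_join(search_df, table_df, column_list):
    """Row-by-row join: for every search hit, scan all table rows for a match."""
    table_rows = table_df.to_dict("records")
    rows = []
    for hit in search_df.itertuples(index=False):
        for row in table_rows:
            if row["row_id"] == hit.row_id:
                res_row = {"row_id": hit.row_id, "distance": hit.distance}
                for col_name in column_list:
                    res_row[col_name] = row[col_name]
                rows.append(res_row)
    return pd.DataFrame(rows)


def merge_join(search_df, table_df, column_list):
    """Same join expressed as one index-based left merge."""
    return pd.merge(
        search_df.set_index("row_id"),
        table_df.set_index("row_id")[column_list],
        left_index=True,
        right_index=True,
        how="left",
    ).reset_index()


nested = nested_loop_join(search_df, table_df, ["name"])
merged = merge_join(search_df, table_df, ["name"])
```

Both functions return the same rows. The merge version hands the matching work to pandas instead of looping in Python, which is the kind of change the description above refers to.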

@jiashenC
Member

For 20% speedup, how many rows does the table contain?

@Chitti-Ankith
Contributor Author

For 20% speedup, how many rows does the table contain?

100k

for col_name in column_list:
    res_row[col_name] = row[col_name]
res_row_list[idx] = res_row
result_df = pd.merge(
Member

Instead of doing O(n) merges, would we get better performance if we fetched all batches from the child and merged only once?

Contributor Author

Instead of doing O(n) merges, would we get better performance if we fetched all batches from the child and merged only once?

Thanks for the suggestion. I have also changed the code so that child frames are not added to the result df before merging, which avoids unnecessary processing. The speedup is 2X now.
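A rough sketch of the "merge only once" idea discussed above, under stated assumptions: the child operator is modeled as a plain list of DataFrames, and `merge_once`, `child_batches`, and the column names are hypothetical rather than taken from the PR. Selecting only `column_list` before merging is one possible reading of "not adding child frames into the result df".

```python
import pandas as pd


def merge_once(search_df, child_batches, column_list):
    """Concatenate all child batches, keep only the needed columns,
    then run a single index-based left merge."""
    child_df = pd.concat(child_batches)[column_list]
    return pd.merge(
        search_df,
        child_df,
        left_index=True,
        right_index=True,
        how="left",
    )


# Hypothetical search result, indexed by the row ids of the matching rows.
search_df = pd.DataFrame({"distance": [0.1, 0.4, 0.9]}, index=[4, 1, 2])
# Hypothetical child output, delivered in two batches.
child_batches = [
    pd.DataFrame({"name": ["a", "b", "c"], "unused": [0, 0, 0]}, index=[0, 1, 2]),
    pd.DataFrame({"name": ["d", "e"], "unused": [0, 0]}, index=[3, 4]),
]

result = merge_once(search_df, child_batches, ["name"])
```

With this shape, the number of `pd.merge` calls no longer grows with the number of batches; the trade-off is that all child batches are held in memory at the same time.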

left_index=True,
right_index=True,
how="left",
# sort=False
Member

Just remove this?

Contributor Author

Done

@Chitti-Ankith Chitti-Ankith merged commit f420faa into georgia-tech-db:staging Oct 26, 2023
@xzdandy xzdandy added this to the v0.3.9 milestone Nov 2, 2023
Labels
Optimizations Features/Bugs related to optimizations