
Pass affected labels to MemPostings.Delete() #14307

Merged

Conversation

colega (Contributor) commented Jun 17, 2024

As suggested by @bboreham, we can track the labels of the deleted series and avoid iterating through all the label/value combinations.

This makes the MemPostings.Delete() call much faster. We don't have a benchmark on stripeSeries.gc(), where we'll pay the price of iterating the labels of each of the deleted series.

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb/index
                                            │      main      │              affected              │
                                            │     sec/op     │   sec/op     vs base               │
MemPostings_Delete/refs=1/readers=0-12         139.53m ±  4%   23.81m ± 3%  -82.94% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=1-12         176.45m ±  1%   23.66m ± 2%  -86.59% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=10-12       1171.34m ± 48%   23.95m ± 1%  -97.96% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=0-12       171.41m ± 11%   34.84m ± 5%  -79.67% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=1-12       213.05m ± 11%   35.94m ± 4%  -83.13% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=10-12      863.45m ± 36%   36.39m ± 4%  -95.79% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=0-12     168.06m ±  6%   37.12m ± 3%  -77.91% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=1-12     225.20m ±  3%   37.26m ± 1%  -83.46% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=10-12   1019.95m ± 48%   37.78m ± 8%  -96.30% (p=0.002 n=6)
geomean                                         319.9m         31.68m       -90.10%
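
For context, here is a minimal, self-contained sketch of the idea (the types and names below are illustrative stand-ins, not the actual index.MemPostings code): the caller collects the labels of the series it deletes, so the delete path only has to filter those postings lists instead of scanning every label/value pair in the index.

```go
package main

import "fmt"

// SeriesRef and Label mirror the TSDB types for illustration only.
type SeriesRef uint64
type Label struct{ Name, Value string }

// memPostings is a trimmed-down stand-in for index.MemPostings:
// label name -> label value -> sorted list of series references.
type memPostings struct {
	m map[string]map[string][]SeriesRef
}

// deleteAffected filters only the postings lists of the labels carried by the
// deleted series, instead of iterating all label/value combinations.
func (p *memPostings) deleteAffected(deleted map[SeriesRef]struct{}, affected map[Label]struct{}) {
	for l := range affected {
		vals, ok := p.m[l.Name]
		if !ok {
			continue
		}
		orig := vals[l.Value]
		repl := make([]SeriesRef, 0, len(orig)) // new slice: readers may still hold orig
		for _, id := range orig {
			if _, del := deleted[id]; !del {
				repl = append(repl, id)
			}
		}
		if len(repl) > 0 {
			vals[l.Value] = repl
		} else {
			delete(vals, l.Value)
			if len(vals) == 0 {
				delete(p.m, l.Name)
			}
		}
	}
}

func main() {
	p := &memPostings{m: map[string]map[string][]SeriesRef{
		"job": {"api": {1, 2, 3}},
		"env": {"prod": {1, 2, 3, 4}},
	}}
	p.deleteAffected(
		map[SeriesRef]struct{}{2: {}},
		map[Label]struct{}{{Name: "job", Value: "api"}: {}, {Name: "env", Value: "prod"}: {}},
	)
	fmt.Println(p.m) // map[env:map[prod:[1 3 4]] job:map[api:[1 3]]]
}
```

The real method also needs to hold the postings mutex and handle the all-postings key, which are omitted here for brevity.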

colega changed the title from Pass affected labels to MemPostings.Delete to Pass affected labels to MemPostings.Delete() on Jun 17, 2024
colega marked this pull request as ready for review June 17, 2024 11:30

jesusvazquez (Member) commented:

/prombench main

colega (Contributor, Author) commented Jun 17, 2024

I have added a benchmark for Head.Truncate(), which calls head.gc(), so it benchmarks the combination of MemPostings.Delete() and stripeSeries.gc(). The results look good:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
                            │     main     │               affected               │
                            │    sec/op    │    sec/op     vs base                │
Head_Truncate/churn=10-12     319.4m ± 16%   125.1m ± 10%  -60.81% (p=0.000 n=10)
Head_Truncate/churn=100-12    305.2m ±  3%   138.9m ± 12%  -54.50% (p=0.000 n=10)
Head_Truncate/churn=1000-12   298.2m ±  4%   136.1m ± 21%  -54.36% (p=0.000 n=10)
geomean                       307.5m         133.2m        -56.66%

                            │     main     │              affected               │
                            │     B/op     │     B/op      vs base               │
Head_Truncate/churn=10-12     28.63Mi ± 0%   28.48Mi ± 0%  -0.54% (p=0.000 n=10)
Head_Truncate/churn=100-12    29.04Mi ± 0%   28.92Mi ± 0%  -0.42% (p=0.000 n=10)
Head_Truncate/churn=1000-12   29.36Mi ± 0%   29.48Mi ± 0%   0.42% (p=0.000 n=10)
geomean                       29.01Mi        28.96Mi       -0.18%

                            │    main     │              affected              │
                            │  allocs/op  │  allocs/op   vs base               │
Head_Truncate/churn=10-12     14.98k ± 0%   14.98k ± 0%  -0.01% (p=0.000 n=10)
Head_Truncate/churn=100-12    15.24k ± 0%   15.25k ± 0%   0.07% (p=0.000 n=10)
Head_Truncate/churn=1000-12   17.68k ± 0%   17.71k ± 0%   0.17% (p=0.000 n=10)
geomean                       15.92k        15.93k        0.08%
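
As a rough illustration of how a churn-parameterized benchmark like this can be structured, here is a self-contained toy showing only the sub-benchmark shape and timer handling; it is not the actual Head.Truncate() benchmark added in the PR.

```go
package toy

import (
	"fmt"
	"testing"
)

// BenchmarkDeleteChurn runs one sub-benchmark per churn level: the data set
// is rebuilt outside the timed section and only the deletion is measured.
func BenchmarkDeleteChurn(b *testing.B) {
	for _, churn := range []int{10, 100, 1000} {
		b.Run(fmt.Sprintf("churn=%d", churn), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				b.StopTimer()
				series := make(map[int]struct{}, 10000)
				for id := 0; id < 10000; id++ {
					series[id] = struct{}{}
				}
				b.StartTimer()
				for id := 0; id < churn; id++ {
					delete(series, id)
				}
			}
		})
	}
}
```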

jesusvazquez (Member) left a comment:

LGTM

This is an optimization on top of the former optimization. For some reason it rings a bell that I had already read about or discussed with you the idea of only working over the affected subset 👍 So thanks for adding this.

jesusvazquez enabled auto-merge (squash) June 18, 2024 10:05
jesusvazquez merged commit fd1a89b into prometheus:main Jun 18, 2024
25 checks passed
colega deleted the pass-affected-labels-to-mempostings-delete branch August 23, 2024 10:33
colega added a commit to colega/prometheus that referenced this pull request Oct 29, 2024
This reintroduces some unlocking that was removed in prometheus#13286, but in a more balanced way, as suggested by @pracucci.

For TSDBs with a lot of churn, Delete() can take a couple of seconds, and while it's holding the mutex, reads and writes are blocked waiting for that mutex, increasing the number of connections being handled and memory usage.

This implementation pauses every 4K labels processed (note that, compared to prometheus#13286, we no longer process all the label values, only the affected ones, because of prometheus#14307), makes sure that it's possible to get the read lock, and then waits a few more milliseconds.

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
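
A minimal sketch of the pausing idea described in this commit message (the constant, the names, and the exact yielding mechanism are illustrative, not the PR's code): every few thousand processed labels, the write lock is dropped so that queued readers and writers can make progress before deletion resumes.

```go
package sketch

import (
	"sync"
	"time"
)

const pauseEvery = 4096 // "pauses every 4K labels processed"

// deleteWithPauses processes the affected labels under the write lock, but
// periodically releases it, verifies that a read lock can be obtained (i.e.
// other callers got a chance to run), waits a little longer, and re-acquires it.
func deleteWithPauses(mtx *sync.RWMutex, affected []string, process func(label string)) {
	mtx.Lock()
	for i, l := range affected {
		process(l)
		if (i+1)%pauseEvery == 0 {
			mtx.Unlock()
			mtx.RLock()
			mtx.RUnlock()
			time.Sleep(time.Millisecond)
			mtx.Lock()
		}
	}
	mtx.Unlock()
}
```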
colega added a commit to colega/prometheus that referenced this pull request Nov 2, 2024
The change introduced in prometheus#14307 was great for the performance of MemPostings.Delete(): we know which labels we have to touch, so we just grab the lock once and keep processing them until we're done.

While this is the best for MemPostings.Delete(), it's much worse for other users of that mutex: readers and new-series writes. These now have to wait until we've deleted all the affected postings, which can be millions of entries. Our operation isn't that urgent, but we're not letting other users grab the mutex.

While prometheus#15242 proposes a solution for this by performing small pauses from time to time and letting other callers take the mutex, it still doesn't address the elephant in the room: we're just doing too much work while holding the write mutex, and that is exclusive time for us. We can spread it over a longer period, decreasing the impact, but the overall exclusive time is the same, and we should try to address that.

What's the long operation we're doing while holding the write mutex? We're filtering the postings list for each label by building a new one, and looking up in a hashmap whether each of those elements is deleted. These lists can each be tens or hundreds of thousands of elements long.

Why don't we build that list while holding just an RLock()? There's a small issue with that: a new series might be added to the list after we've built the filtered replacement postings list and before we take the write mutex. The good news is that postings are always appended to the list in order, and we are the only ones who delete things from it, so if we just check whether the list grew, we know for sure whether we're missing something.

Moreover, we don't even need to rebuild the entire list if something was added: we just need to append the extra elements that the list has now compared to our snapshot.

Finally, one last thing to address: with this approach we'd again be taking the write mutex once per affected label value, which is a lot; it causes compactions to take too long under heavy read pressure, and we mitigated that in the past. We also can't just build all the replacement lists and then swap them in one go: we would be referencing too much memory. It is true that while we're swapping, some reader may still reference the old slice, but not at an arbitrary scale: for a 2M-series TSDB, label values which reference all series (like the typical env=prod) have 16MB of postings each (2M * 8B).

So we process labels in batches of at most 128 labels each (so ideally we take the write mutex once per 128 labels), with a maximum of 10*len(allpostings) postings per batch (in our example that would be an extra 160MB allocated temporarily, which should be negligible in a 2M-series instance).
Signed-off-by: Oleg Zaytsev <[email protected]>
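
A hedged sketch of the two-phase approach described above (the types and helpers are simplified stand-ins, not the actual index.MemPostings code, and the batching of up to 128 labels per write-lock acquisition is omitted): the filtered list is built under the read lock, and the write lock is held only long enough to append whatever was added in the meantime and swap the slice in. This works because postings are only ever appended in order, and this code path is the only one that deletes from them.

```go
package sketch

import "sync"

type SeriesRef uint64

type postings struct {
	mtx sync.RWMutex
	m   map[string][]SeriesRef // "name=value" key -> sorted series refs
}

// deleteFrom removes the deleted refs from one postings list, doing the
// expensive filtering under RLock and only a short fix-up under Lock.
func (p *postings) deleteFrom(key string, deleted map[SeriesRef]struct{}) {
	// Phase 1: snapshot and filter while holding only the read lock.
	p.mtx.RLock()
	orig := p.m[key]
	p.mtx.RUnlock()

	repl := make([]SeriesRef, 0, len(orig))
	for _, id := range orig {
		if _, del := deleted[id]; !del {
			repl = append(repl, id)
		}
	}

	// Phase 2: take the write lock briefly. If the list grew while we were
	// filtering, the new refs were appended at the end (and cannot be in
	// `deleted`), so we only copy that tail before swapping the slice in.
	p.mtx.Lock()
	if cur := p.m[key]; len(cur) > len(orig) {
		repl = append(repl, cur[len(orig):]...)
	}
	if len(repl) > 0 {
		p.m[key] = repl
	} else {
		delete(p.m, key)
	}
	p.mtx.Unlock()
}
```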
GiedriusS pushed a commit to vinted/prometheus that referenced this pull request Nov 4, 2024
jesusvazquez pushed a commit that referenced this pull request Nov 5, 2024
juliusv pushed a commit that referenced this pull request Nov 5, 2024