Fix data corruption in remote write if max_sample_age is applied #14078
Conversation
Unfortunately, I had no luck with writing a test that would cover the case yet. Still, it is possible to reproduce the bug with 100% "success". Run Prometheus locally with the following config:

```yaml
global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with a known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100
```

While it is running, block the communication to the remote write endpoint IP. Wait until you see a "context deadline exceeded" log in Prometheus, then enable the traffic again. At this point the issue happens. If you run with a debugger inspecting the
Mentioning @marctc and @tpaschalis, as you were the authors of the original code and have the most context; hope you don't mind.
Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that this doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making things hard to test. My suggestion would be to add a test that uses a version of the test client that can fail and enforce a retry, since you're saying this only happens when we need to retry sending an existing write request. Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
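The kind of failing test client suggested above could be sketched roughly as follows. This is a minimal stand-in, not Prometheus's actual `WriteClient` interface; the `flakyClient` and `sendWithRetry` names and the interface shape are illustrative assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// flakyClient fails the first `failures` sends, then succeeds -- a minimal
// stand-in for a "test client that can fail and enforce a retry".
type flakyClient struct {
	failures int
	attempts int
	got      [][]byte
}

func (c *flakyClient) Store(payload []byte) error {
	c.attempts++
	if c.attempts <= c.failures {
		return errors.New("simulated remote write failure")
	}
	// Copy the payload: the caller may reuse its buffer between retries,
	// which is exactly the behavior under suspicion in this PR.
	c.got = append(c.got, append([]byte(nil), payload...))
	return nil
}

// sendWithRetry retries Store until it succeeds or maxRetries is exhausted.
func sendWithRetry(c *flakyClient, payload []byte, maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = c.Store(payload); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	c := &flakyClient{failures: 2}
	if err := sendWithRetry(c, []byte("series-data"), 5); err != nil {
		panic(err)
	}
	fmt.Printf("attempts=%d delivered=%q\n", c.attempts, c.got[0])
}
```

A real test would then assert that the payload delivered after the forced retries still matches what was originally enqueued.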
Hi @cstyan, thanks for the answer. Yes, I agree. We used this bug fix successfully in our production, since we needed it asap. I'll ping you once I manage to reproduce it, if you don't mind.
I've pinged Paschalis and Marc internally as well, they're aware of the problem and are looking into it.
That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal. Some part of me feels like, while the fix here works, it might not be the correct fix, because it seems like what's happening is that we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries.
👍
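The slice-aliasing concern described above can be sketched in isolation. This is a deliberately simplified illustration, not the actual Prometheus code; the `sample` type and `filterInPlace` function are hypothetical, but the mechanics (an in-place filter shifting survivors left inside a shared backing array) match the suspected failure mode:

```go
package main

import "fmt"

// sample stands in for one buffered remote-write entry (illustrative type).
type sample struct {
	label string
	value float64
}

// filterInPlace drops entries that fail keep, reusing the input's backing
// array. Survivors are compacted to the front of the same array.
func filterInPlace(in []sample, keep func(sample) bool) []sample {
	out := in[:0]
	for _, s := range in {
		if keep(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	buf := []sample{{"a", 1}, {"b", 2}, {"c", 3}}
	// First attempt: drop "a". Survivors are shifted left in buf's array.
	filtered := filterInPlace(buf, func(s sample) bool { return s.label != "a" })
	fmt.Println(filtered) // [{b 2} {c 3}]
	// A retry that re-reads the original buf now sees shifted, partly
	// duplicated data instead of the entries it originally held:
	fmt.Println(buf) // [{b 2} {c 3} {c 3}]
}
```

If a retry path keeps a reference to `buf` (or to index ranges computed before filtering) rather than to the returned `filtered` slice, it will read the wrong entries, which is consistent with labels from one series appearing on another.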
Just to understand: the difference proposed here is to not mutate the input `timeSeries` slice, right? Otherwise the logic looks the same.
Do you think that's why it only manifests on retries? I'd also like to have a test to make sure we don't accidentally regress on this.
@tpaschalis yes, this seemed to fix the issue using the above-described way to reproduce. I'm trying to reproduce it in a test case, but with no luck yet.
@cstyan @tpaschalis would you take a look? As @david-vavra noted above (we are from the same company), we managed to reproduce the issue in #14157, and the fix from this PR resolves it (not saying it should be accepted this way, even though we run it in production without any significant memory impact observed). We still do not know exactly where the actual corruption happens, just that avoiding reusing the
I believe I got to the bottom of this. I propose a fix which would swap data within
EDIT: my bad, I missed that you had pushed a test here already; I will look at that in the morning my time. @FUSAKLA @david-vavra sorry for the delayed response, I had a noisy on-call shift followed by an in-person team offsite. The main thing I still want to see is a test that reproduces the issue. From @david-vavra's comment it sounds like we know the specific case in which the labels corruption can occur, so a test should be our next step. That being said, I'm still not sure how the corruption can happen as you're describing it. In populateTimeSeries we're explicitly writing the new data into the reused
I do agree that the issue is likely in the modifying of the elements in the slice and the reference indexes. In the meantime I've opened this PR: #14318
@FUSAKLA @david-vavra My mistake, I pressed the
It passes! Fails without the change to
And then let's merge 👍
Signed-off-by: David Vavra <[email protected]>
Signed-off-by: David Vavra <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
I rebased and cherry-picked the benchmark mentioned in #14078 (comment). Benchmark result, without the fix (benchmark cherry-picked onto current main):
Nice 👍
Benchmark shows that we're slightly slower now, which could be concerning given that if people turn this on it's on the hot path for sending data, but now that we have a test for correctness we can iterate on improving performance as well.
Fixes #13979

The PR #13002 added the option to drop old samples in remote write. Unfortunately, we bumped into a bug causing data to be corrupted (metrics with labels getting merged with other metrics' labels was the most obvious symptom), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly connected to situations when retries happen and the `filter` in `buildTimeSeries` is applied. We did investigate, and it appears that the issue is in the newly added `buildTimeSeries` function, which modifies the `timeSeries` argument, causing the corruption.

The suggested change, which avoids modification of the original `timeSeries`, seems to fix the issue, but it is naive and we are not sure how optimal it is.