
Fix data corruption in remote write if max_sample_age is applied #14078

Merged · 3 commits · Jun 21, 2024

Conversation

@FUSAKLA (Contributor) commented May 10, 2024

Fixes #13979

The PR #13002 added the option to drop old samples in remote write.

Unfortunately, we ran into a bug causing data corruption (the most obvious symptom being metrics whose labels got merged with other metrics' labels), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly connected to situations where retries happen and the filter in buildTimeSeries is applied.

We investigated, and the issue appears to be in the newly added buildTimeSeries function, which modifies its timeSeries argument in place, causing the corruption.

The suggested change, which avoids modifying the original timeSeries, seems to fix the issue, but it is naive and we are not sure how optimal it is.
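
For illustration, here is a rough sketch of that non-mutating approach (hypothetical names and a simplified signature, not the actual buildTimeSeries API in storage/remote/queue_manager.go):

```go
package sketch

import "github.com/prometheus/prometheus/prompb"

// keepYoungSeries illustrates the non-mutating idea: build a fresh slice of
// the series that pass the filter and leave the caller's (reused) timeSeries
// buffer untouched, so a retry can never alias entries of the next batch.
func keepYoungSeries(timeSeries []prompb.TimeSeries, keep func(prompb.TimeSeries) bool) []prompb.TimeSeries {
	kept := make([]prompb.TimeSeries, 0, len(timeSeries))
	for _, ts := range timeSeries {
		if keep(ts) {
			kept = append(kept, ts)
		}
	}
	return kept
}
```

The trade-off is an extra slice allocation on every retry that drops aged-out samples, which is why the description above calls it naive.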

@FUSAKLA (Contributor, Author) commented May 10, 2024

Unfortunately, I have had no luck writing a test that would cover this case yet.

However, it is possible to reproduce the bug with 100% "success":

Run Prometheus locally with the following config:

global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with a known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100

Once it is running, block communication to the remote write endpoint's IP using: sudo iptables -I OUTPUT -d <ip> -j DROP

Wait until you see a "context deadline exceeded" log in Prometheus, then re-enable the traffic with: sudo iptables -F OUTPUT

At this point the issue occurs. If you attach a debugger and inspect the pendingData variable in the runShard function, you can find, for example, up metrics with duplicated labels such as job, mimir_cluster_id, etc.

@FUSAKLA marked this pull request as ready for review on May 10, 2024 at 22:04
@FUSAKLA (Contributor, Author) commented May 10, 2024

Mentioning @marctc and @tpaschalis since you were the authors of the original code and have the most context; I hope you don't mind.

@cstyan (Member) commented May 15, 2024

Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that this doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making it hard to test things.

My suggestion would be to add a test that uses a version of the test client that can fail and enforce a retry, since you're saying this only happens when we need to retry sending an existing write request.

Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
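
As a sketch of that suggestion (hypothetical types; the actual remote-write test client and retry semantics in storage/remote differ), a wrapper that fails the first few sends could force the retry path:

```go
package sketch

import (
	"context"
	"errors"
)

// sender is a stand-in for whatever write-client interface the test uses.
type sender interface {
	Store(ctx context.Context, req []byte) error
}

// failFirstN wraps a sender and rejects the first `failures` calls, so the
// queue manager has to take the retry (and sample_age_limit filtering) path.
type failFirstN struct {
	inner    sender
	failures int
	calls    int
}

func (c *failFirstN) Store(ctx context.Context, req []byte) error {
	c.calls++
	if c.calls <= c.failures {
		return errors.New("injected failure to trigger a retry")
	}
	return c.inner.Store(ctx, req)
}
```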

@FUSAKLA (Contributor, Author) commented May 15, 2024

Hi @cstyan, thanks for the answer.

Yes, I agree. We used this bug fix successfully in our production, since we needed it ASAP, but I'm planning to look into writing the test so I can verify the corruption really isn't happening anymore (it's hard to find the corrupted data).

I'll ping you once I manage to reproduce it, if you don't mind.

@cstyan (Member) commented May 15, 2024

I've pinged Paschalis and Marc internally as well, they're aware of the problem and are looking into it.

> We used this bug fix successfully in our production, since we needed it ASAP

That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal. Part of me feels that, while the fix here works, it might not be the correct fix, because it seems like what's happening is we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries.

> I'll ping you once I manage to reproduce it, if you don't mind.

👍

@tpaschalis (Contributor) left a comment


Just to understand, the difference proposed here is to not mutate the input timeSeries slice, right? Otherwise the logic looks the same.

Do you think that's why it only manifests on retries? I'd also like to have a test to make sure we don't accidentally regress on this.

@FUSAKLA (Contributor, Author) commented May 28, 2024

@tpaschalis yes, this seemed to fix the issue when reproducing it using the method described above.

I'm trying to reproduce it in a test case, but with no luck yet.
Here is the test case I'm working on right now: #14157

@david-vavra (Contributor) commented

I rebased this over #14157, which adds a test case that reproduces the bug reported in #13979.

@FUSAKLA (Contributor, Author) commented Jun 13, 2024

@cstyan @tpaschalis would you take a look?

As @david-vavra noted above (we are from the same company), we managed to reproduce the issue in #14157, and the fix from this PR resolves it (not saying it should be accepted this way, even though we run it in production without observing any significant memory impact).

We still do not know exactly where the actual corruption happens, just that avoiding reuse of the []prompb.TimeSeries slice does solve the issue.

@david-vavra force-pushed the fus-fix-max-sample-age branch 3 times, most recently from 86a9d7b to b76c6aa on June 18, 2024 at 11:20
@david-vavra (Contributor) commented Jun 18, 2024

I believe I got to the bottom of this.
When only part of a single batch gets dropped (because samples hit maxSampleAge during retrying), the timeSeries []prompb.TimeSeries is rewritten so that the time series to be kept are copied to the beginning of the slice, which is then truncated to the number of kept series; see queue_manager.go:buildTimeSeries.
The problem shows itself when a new batch is filled into the []prompb.TimeSeries (which is reused across batches, see queue_manager.go:populateTimeSeries): positions that were copied from the same prompb.TimeSeries share the same backing data for the underlying slices ([]Labels, []Samples, ...). This leads to mixing/corruption of labels and samples.

I propose a fix that swaps entries within the []prompb.TimeSeries instead of copying them. Other options may be more readable, but I wanted to keep the number of changes to a minimum and avoid touching the "happy path" (where samples do not age out during retrying).
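
For illustration, a simplified sketch of the swap-based compaction (assumed names and a reduced signature; the actual buildTimeSeries in storage/remote/queue_manager.go looks different and does more bookkeeping):

```go
package sketch

import "github.com/prometheus/prometheus/prompb"

// compactKeepers moves the series that pass the filter to the front of the
// slice by swapping entries instead of copying them. After truncation, every
// slot still owns its own Labels/Samples buffers, so reusing the
// full-capacity slice for the next batch cannot make two entries alias the
// same backing arrays.
func compactKeepers(timeSeries []prompb.TimeSeries, keep func(prompb.TimeSeries) bool) []prompb.TimeSeries {
	kept := 0
	for i, ts := range timeSeries {
		if !keep(ts) {
			continue
		}
		if i != kept {
			timeSeries[kept], timeSeries[i] = timeSeries[i], timeSeries[kept]
		}
		kept++
	}
	return timeSeries[:kept]
}
```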

@cstyan (Member) commented Jun 19, 2024

EDIT: my bad, I missed that you had already pushed a test here; I will look at it in the morning my time.

@FUSAKLA @david-vavra sorry for the delayed response, I had a noisy on-call shift followed by an in-person team offsite.

The main thing I still want to see is a test that reproduces the issue we see. From @david-vavra's comment it sounds like we know the specific case in which the label 'corruption' can occur, so a test should be our next step.

That being said, I'm still not sure how the corruption can happen as you're describing it. In populateTimeSeries we're explicitly writing the new data into the reused []prompb.TimeSeries slice starting from index 0, and then returning the count for how many things we wrote into the slice. Am I missing something here?

I do agree that the issue is likely in how we modify the elements of the slice and the indexes we then reference.
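
For illustration (plain Go, not Prometheus code), the aliasing @david-vavra describes can be shown with a small example: copying one struct element over another makes both positions share the same backing arrays, so if the next batch reuses a slot's buffers in place, writing one slot also rewrites the other:

```go
package main

import "fmt"

type series struct{ labels []string }

func main() {
	ts := []series{
		{labels: []string{"job=a"}},
		{labels: []string{"job=b"}},
		{labels: []string{"job=c"}},
	}
	// In-place compaction keeps only element 2 by copying it over element 0.
	ts[0] = ts[2] // ts[0].labels and ts[2].labels now share one backing array
	// The next batch reuses slot 0's label buffer, append-into-place style.
	ts[0].labels = append(ts[0].labels[:0], "job=new")
	// Slot 2 was never written directly, yet its labels changed too.
	fmt.Println(ts[2].labels) // prints [job=new]
}
```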

In the meantime I've opened this PR: #14318

@cstyan (Member) commented Jun 20, 2024

@FUSAKLA @david-vavra My mistake, I pressed the update-branch button because locally I still saw the test failing and thought there were code changes in main that you were missing, which would show up as a CI failure here. I must have missed something last night when trying the test locally.

It passes! It fails without the change to queue_manager.go but passes with the change, which is good news. Feel free to pull main locally, rebase, and force-push over my merge commit. My final ask would be to cherry-pick the benchmark from here: 5609bd7

And then let's merge 👍

@david-vavra (Contributor) commented

I rebased and cherry-picked the benchmark mentioned in #14078 (comment).

Benchmark result:

Without the fix (benchmark cherry-picked onto current main):

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 7 7840HS with Radeon 780M Graphics   
BenchmarkBuildTimeSeries-16    	     738	   1531949 ns/op	 2650813 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     757	   1543058 ns/op	 2650798 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     741	   1557946 ns/op	 2650801 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     751	   1544889 ns/op	 2650809 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     732	   1546391 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     730	   1530907 ns/op	 2650807 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     733	   1532309 ns/op	 2650811 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     739	   1539659 ns/op	 2650808 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     741	   1546674 ns/op	 2650816 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     739	   1538725 ns/op	 2650810 B/op	   39748 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	13.073s

With the fix:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 7 7840HS with Radeon 780M Graphics   
BenchmarkBuildTimeSeries-16    	     721	   1572745 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     733	   1583777 ns/op	 2650815 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     746	   1563748 ns/op	 2650818 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     726	   1563117 ns/op	 2650818 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     728	   1572327 ns/op	 2650805 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     724	   1556734 ns/op	 2650817 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     740	   1574076 ns/op	 2650809 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     722	   1571599 ns/op	 2650806 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     706	   1565431 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     709	   1590074 ns/op	 2650795 B/op	   39748 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	13.101s

@cstyan (Member) left a comment


Nice 👍

The benchmark shows that we're slightly slower now, which could be concerning given that, if people turn this on, it's on the hot path for sending data; but now that we have a test for correctness, we can iterate on improving performance as well.

Successfully merging this pull request may close these issues.

Corrupting data written to remote storage in case sample_age_limit is hit