
Fix data corruption in remote write if max_sample_age is applied #14078

Merged · 3 commits · Jun 21, 2024

Conversation

@FUSAKLA (Contributor) commented May 10, 2024

Fixes #13979

The PR #13002 added the option to drop old samples in remote write.

Unfortunately, we ran into a bug causing data corruption (the most obvious symptom being metrics whose labels got merged with other metrics' labels), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly connected to situations where retries happen and the filter in buildTimeSeries is applied.

We investigated, and the issue appears to be in the newly added buildTimeSeries function, which modifies its timeSeries argument in place, causing the corruption.

The suggested change, which avoids modifying the original timeSeries, seems to fix the issue, but it is naive and we are not sure how optimal it is.
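
For illustration, here is a rough sketch of that non-mutating approach (hypothetical names and a simplified signature, not the actual buildTimeSeries API in storage/remote/queue_manager.go):

```go
package sketch

import "github.com/prometheus/prometheus/prompb"

// keepYoungSeries illustrates the non-mutating idea: build a fresh slice of
// the series that pass the filter and leave the caller's (reused) timeSeries
// buffer untouched, so a retry can never alias entries of the next batch.
func keepYoungSeries(timeSeries []prompb.TimeSeries, keep func(prompb.TimeSeries) bool) []prompb.TimeSeries {
	kept := make([]prompb.TimeSeries, 0, len(timeSeries))
	for _, ts := range timeSeries {
		if keep(ts) {
			kept = append(kept, ts)
		}
	}
	return kept
}
```

The trade-off is an extra slice allocation on every retry that drops aged-out samples, which is why the description above calls it naive.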

@FUSAKLA (Contributor, Author) commented May 10, 2024

Unfortunately, I have had no luck writing a test that would cover this case yet.

However, it is possible to reproduce the bug with 100% "success":

Run Prometheus locally with the following config:

global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with a known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100

Once it is running, block communication to the remote write endpoint's IP using: sudo iptables -I OUTPUT -d <ip> -j DROP

Wait until you see a "context deadline exceeded" log in Prometheus, then re-enable the traffic with: sudo iptables -F OUTPUT

At this point the issue occurs. If you attach a debugger and inspect the pendingData variable in the runShard function, you can find, for example, up metrics with duplicated labels such as job, mimir_cluster_id, etc.

@FUSAKLA marked this pull request as ready for review on May 10, 2024 at 22:04
@FUSAKLA (Contributor, Author) commented May 10, 2024

Mentioning @marctc and @tpaschalis since you were the authors of the original code and have the most context; I hope you don't mind.

@cstyan (Member) commented May 15, 2024

Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that this doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making it hard to test things.

My suggestion would be to add a test that uses a version of the test client that can fail and enforce a retry, since you're saying this only happens when we need to retry sending an existing write request.

Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
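
As a sketch of that suggestion (hypothetical types; the actual remote-write test client and retry semantics in storage/remote differ), a wrapper that fails the first few sends could force the retry path:

```go
package sketch

import (
	"context"
	"errors"
)

// sender is a stand-in for whatever write-client interface the test uses.
type sender interface {
	Store(ctx context.Context, req []byte) error
}

// failFirstN wraps a sender and rejects the first `failures` calls, so the
// queue manager has to take the retry (and sample_age_limit filtering) path.
type failFirstN struct {
	inner    sender
	failures int
	calls    int
}

func (c *failFirstN) Store(ctx context.Context, req []byte) error {
	c.calls++
	if c.calls <= c.failures {
		return errors.New("injected failure to trigger a retry")
	}
	return c.inner.Store(ctx, req)
}
```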

@FUSAKLA (Contributor, Author) commented May 15, 2024

Hi @cstyan, thanks for the answer.

Yes, I agree. We used this bug fix successfully in our production, since we needed it ASAP, but I'm planning to look into writing the test so I can verify the corruption really isn't happening anymore (it's hard to find the corrupted data).

I'll ping you once I manage to reproduce it, if you don't mind.

@cstyan (Member) commented May 15, 2024

I've pinged Paschalis and Marc internally as well, they're aware of the problem and are looking into it.

> We used this bug fix successfully in our production, since we needed it ASAP

That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal. Part of me feels that, while the fix here works, it might not be the correct fix, because it seems like what's happening is we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries.

> I'll ping you once I manage to reproduce it, if you don't mind.

👍

@tpaschalis (Contributor) left a comment


Just to understand, the difference proposed here is to not mutate the input timeSeries slice, right? Otherwise the logic looks the same.

Do you think that's why it only manifests on retries? I'd also like to have a test to make sure we don't accidentally regress on this.

@FUSAKLA (Contributor, Author) commented May 28, 2024

@tpaschalis yes, this seemed to fix the issue when reproducing it using the method described above.

I'm trying to reproduce it in a test case, but with no luck yet.
Here is the test case I'm working on right now: #14157

@david-vavra (Contributor) commented

I rebased this over #14157, which adds a test case that reproduces the bug reported in #13979.

@FUSAKLA (Contributor, Author) commented Jun 13, 2024

@cstyan @tpaschalis would you take a look?

As @david-vavra noted above (we are from the same company), we managed to reproduce the issue in #14157, and the fix from this PR resolves it (not saying it should be accepted this way, even though we run it in production without observing any significant memory impact).

We still do not know exactly where the actual corruption happens, just that avoiding reuse of the []prompb.TimeSeries slice does solve the issue.

@david-vavra force-pushed the fus-fix-max-sample-age branch 3 times, most recently from 86a9d7b to b76c6aa on June 18, 2024 at 11:20
@david-vavra (Contributor) commented Jun 18, 2024

I believe I got to the bottom of this.
When only part of a single batch gets dropped (because samples hit maxSampleAge during retrying), the timeSeries []prompb.TimeSeries is rewritten so that the time series to be kept are copied to the beginning of the slice, which is then truncated to the number of kept series; see queue_manager.go:buildTimeSeries.
The problem shows itself when a new batch is filled into the []prompb.TimeSeries (which is reused across batches, see queue_manager.go:populateTimeSeries): positions that were copied from the same prompb.TimeSeries share the same backing data for the underlying slices ([]Labels, []Samples, ...). This leads to mixing/corruption of labels and samples.

I propose a fix that swaps entries within the []prompb.TimeSeries instead of copying them. Other options may be more readable, but I wanted to keep the number of changes to a minimum and avoid touching the "happy path" (where samples do not age out during retrying).
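
For illustration, a simplified sketch of the swap-based compaction (assumed names and a reduced signature; the actual buildTimeSeries in storage/remote/queue_manager.go looks different and does more bookkeeping):

```go
package sketch

import "github.com/prometheus/prometheus/prompb"

// compactKeepers moves the series that pass the filter to the front of the
// slice by swapping entries instead of copying them. After truncation, every
// slot still owns its own Labels/Samples buffers, so reusing the
// full-capacity slice for the next batch cannot make two entries alias the
// same backing arrays.
func compactKeepers(timeSeries []prompb.TimeSeries, keep func(prompb.TimeSeries) bool) []prompb.TimeSeries {
	kept := 0
	for i, ts := range timeSeries {
		if !keep(ts) {
			continue
		}
		if i != kept {
			timeSeries[kept], timeSeries[i] = timeSeries[i], timeSeries[kept]
		}
		kept++
	}
	return timeSeries[:kept]
}
```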

@cstyan (Member) commented Jun 19, 2024

EDIT: my bad, I missed that you had already pushed a test here; I will look at it in the morning my time.

@FUSAKLA @david-vavra sorry for the delayed response, I had a noisy on-call shift followed by an in-person team offsite.

The main thing I still want to see is a test that reproduces the issue we see. From @david-vavra's comment it sounds like we know the specific case in which the label 'corruption' can occur, so a test should be our next step.

That being said, I'm still not sure how the corruption can happen as you're describing it. In populateTimeSeries we're explicitly writing the new data into the reused []prompb.TimeSeries slice starting from index 0, and then returning the count for how many things we wrote into the slice. Am I missing something here?

I do agree that the issue is likely in how we modify the elements of the slice and the indexes we then reference.
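
For illustration (plain Go, not Prometheus code), the aliasing @david-vavra describes can be shown with a small example: copying one struct element over another makes both positions share the same backing arrays, so if the next batch reuses a slot's buffers in place, writing one slot also rewrites the other:

```go
package main

import "fmt"

type series struct{ labels []string }

func main() {
	ts := []series{
		{labels: []string{"job=a"}},
		{labels: []string{"job=b"}},
		{labels: []string{"job=c"}},
	}
	// In-place compaction keeps only element 2 by copying it over element 0.
	ts[0] = ts[2] // ts[0].labels and ts[2].labels now share one backing array
	// The next batch reuses slot 0's label buffer, append-into-place style.
	ts[0].labels = append(ts[0].labels[:0], "job=new")
	// Slot 2 was never written directly, yet its labels changed too.
	fmt.Println(ts[2].labels) // prints [job=new]
}
```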

In the meantime I've opened this PR: #14318

@cstyan (Member) commented Jun 20, 2024

@FUSAKLA @david-vavra My mistake, I pressed the update-branch button because locally I still saw the test failing and thought there were code changes in main that you were missing, which would show up as a CI failure here. I must have missed something last night when trying the test locally.

It passes! It fails without the change to queue_manager.go but passes with the change, which is good news. Feel free to pull main locally, rebase, and force-push over my merge commit. My final ask would be to cherry-pick the benchmark from here: 5609bd7

And then let's merge 👍

@david-vavra (Contributor) commented

I rebased and cherry-picked the benchmark mentioned in #14078 (comment).

Benchmark result:

Without the fix (benchmark cherry-picked onto current main):

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 7 7840HS with Radeon 780M Graphics   
BenchmarkBuildTimeSeries-16    	     738	   1531949 ns/op	 2650813 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     757	   1543058 ns/op	 2650798 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     741	   1557946 ns/op	 2650801 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     751	   1544889 ns/op	 2650809 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     732	   1546391 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     730	   1530907 ns/op	 2650807 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     733	   1532309 ns/op	 2650811 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     739	   1539659 ns/op	 2650808 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     741	   1546674 ns/op	 2650816 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     739	   1538725 ns/op	 2650810 B/op	   39748 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	13.073s

With the fix:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 7 7840HS with Radeon 780M Graphics   
BenchmarkBuildTimeSeries-16    	     721	   1572745 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     733	   1583777 ns/op	 2650815 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     746	   1563748 ns/op	 2650818 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     726	   1563117 ns/op	 2650818 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     728	   1572327 ns/op	 2650805 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     724	   1556734 ns/op	 2650817 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     740	   1574076 ns/op	 2650809 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     722	   1571599 ns/op	 2650806 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     706	   1565431 ns/op	 2650812 B/op	   39748 allocs/op
BenchmarkBuildTimeSeries-16    	     709	   1590074 ns/op	 2650795 B/op	   39748 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	13.101s

@cstyan (Member) left a comment


Nice 👍

The benchmark shows that we're slightly slower now, which could be concerning given that, if people turn this on, it's on the hot path for sending data; but now that we have a test for correctness, we can iterate on improving performance as well.

Successfully merging this pull request may close these issues.

Corrupting data written to remote storage in case sample_age_limit is hit