[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

bboreham · 2024-05-09T13:32:52Z

This can give a more precise result, by keeping a separate running compensation value to accumulate small errors.
See https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Possible improvement for #14052 (I won't call it a fix).

I'm re-using floatMean which is otherwise unused for sum(); it would be clearer to add a new field but would cost 8 bytes per output value.

Benchmarks show a bigger slow-down than I expected.

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: Intel(R) Core(TM) i7-14700K
                                                                                                          │ before.txt  │             after.txt             │
                                                                                                          │   sec/op    │   sec/op     vs base              │
RangeQuery/expr=sum(a_hundred),steps=1-28                                                                   142.2µ ± 1%   144.2µ ± 0%   1.41% (p=0.009 n=6)
RangeQuery/expr=sum(a_hundred),steps=100-28                                                                 447.5µ ± 2%   460.3µ ± 1%   2.85% (p=0.002 n=6)
RangeQuery/expr=sum(a_hundred),steps=1000-28                                                                2.670m ± 2%   2.760m ± 1%   3.37% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=1-28                                                       1.637m ± 2%   1.653m ± 1%       ~ (p=0.240 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=100-28                                                     5.081m ± 2%   5.352m ± 1%   5.33% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=1000-28                                                    33.30m ± 2%   35.05m ± 0%   5.25% (p=0.002 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=1-28                                                      1.676m ± 2%   1.691m ± 1%   0.89% (p=0.026 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=100-28                                                    5.315m ± 2%   5.503m ± 1%   3.54% (p=0.002 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=1000-28                                                   35.13m ± 1%   36.76m ± 1%   4.62% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=1-28                                                            1.654m ± 1%   1.685m ± 1%   1.84% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=100-28                                                          5.349m ± 2%   5.476m ± 0%   2.39% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=1000-28                                                         34.97m ± 1%   36.87m ± 1%   5.44% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=1-28                                                           1.618m ± 1%   1.647m ± 0%   1.79% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=100-28                                                         5.154m ± 1%   5.379m ± 1%   4.35% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=1000-28                                                        33.34m ± 2%   35.18m ± 1%   5.51% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=1-28                                             174.5µ ± 1%   176.0µ ± 1%       ~ (p=0.065 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=100-28                                           897.6µ ± 1%   911.0µ ± 1%   1.49% (p=0.015 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=1000-28                                          6.615m ± 1%   6.690m ± 0%   1.14% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=1-28      361.2µ ± 1%   360.3µ ± 1%       ~ (p=0.699 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=100-28    1.848m ± 1%   1.859m ± 1%       ~ (p=0.132 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=1000-28   13.46m ± 1%   13.61m ± 1%   1.14% (p=0.041 n=6)
geomean                                                                                                     3.017m        3.095m        2.58%

This can give a more precise result, by keeping a separate running compensation value to accumulate small errors. See https://en.wikipedia.org/wiki/Kahan_summation_algorithm Signed-off-by: Bryan Boreham <[email protected]>

beorn7

Thank you very much.

One day, we should apply all the Kahan tricks to histograms, too…

bboreham · 2024-06-24T10:13:18Z

Current feeling is the slow-down will not be noticeable compared to overheads such as disk IO.

This expands the idea implemented in #14074, where Kahan summation was applied to sum(). This comes at the price of not dual-using `floatMean` anymore, because we need both `floatMean` and `floatKahanC` for `avg`. This creates a small overhead, but also makes the Kahan parts easier to read. The trade-off in terms of performance is similar to the one made in `sum` with a following division. Things that work out for `sum` thanks to Kahan summation should equally work out for `avg`. This is reflected in the added tests. Signed-off-by: beorn7 <[email protected]>

This expands the idea implemented in #14074, where Kahan summation was applied to sum(). This comes at the price of not dual-using `floatMean` anymore, because we need both `floatMean` and `floatKahanC` for `avg`. This creates a small overhead, but also makes the Kahan parts easier to read. The trade-off in terms of performance is similar to the one made in essentially syntactic sugar for `sum` with a following division. Things that work out for `sum` thanks to Kahan summation should therefory equally work out for `avg`. This is reflected in the added tests. Signed-off-by: beorn7 <[email protected]>

The basic idea here is that the previous code was always doing incremental calculation of the mean value, which is more costly and can be less precise. It protects against overflows, but in most cases, an overflow doesn't happen anyway. The other idea applied here is to expand on #14074, where Kahan summation was applied to sum(). With this commit, the average is calculated in a conventional way (adding everything up and divide in the end) as long as the sum isn't overflowing float64. This is combined with Kahan summation so that the avg aggregation, in most cases, is really equivalent to the sum aggregation with a following division (which is the user's expectation as avg is supposed to be syntactic sugar for sum with a following divison). If the sum hits ±Inf, the calculation reverts to incremental calculation of the mean value. Kahan summation is also applied here, although it cannot fully compensate for the numerical errors introduced by the incremental mean calculation. (The tests added in this commit would fail if incremental mean calculation was always used.) Signed-off-by: beorn7 <[email protected]>

Implements the improvements from prometheus/prometheus#14074 prometheus/prometheus#14362

* MQE: Use kahan in sum aggregation Implements the improvements from prometheus/prometheus#14074 prometheus/prometheus#14362 * Update CHANGELOG

[ENHANCEMENT] PromQL: use Kahan summation for sum()

ea82b49

This can give a more precise result, by keeping a separate running compensation value to accumulate small errors. See https://en.wikipedia.org/wiki/Kahan_summation_algorithm Signed-off-by: Bryan Boreham <[email protected]>

bboreham requested a review from roidelapluie as a code owner May 9, 2024 13:32

bboreham mentioned this pull request May 15, 2024

PromQL: Multiplication by 1 creates additional floating point errors (but should be a no-op) #14052

Open

beorn7 approved these changes May 15, 2024

View reviewed changes

beorn7 mentioned this pull request May 15, 2024

PromQL(histograms): Make more use of Kahan summation for native histograms #14105

Open

bboreham merged commit b6aba4f into prometheus:main Jun 24, 2024
25 checks passed

bboreham deleted the kahan-sum-sum branch June 24, 2024 10:13

charleskorn mentioned this pull request Jun 27, 2024

Merge upstream Prometheus at c5040c5 grafana/mimir-prometheus#655

Merged

beorn7 mentioned this pull request Jul 4, 2024

promql: more Kahan summation (avg) and less incremental mean calculation (avg, avg_over_time) #14413

Merged

charleskorn mentioned this pull request Aug 6, 2024

Mimir query engine: sync upstream test cases grafana/mimir#8911

Merged

2 tasks

jhesketh added a commit to jhesketh/mimir that referenced this pull request Aug 7, 2024

MQE: Use kahan in sum aggregation

1183ae4

Implements the improvements from prometheus/prometheus#14074 prometheus/prometheus#14362

jhesketh mentioned this pull request Aug 7, 2024

MQE: Use kahan in sum aggregation grafana/mimir#8923

Merged

4 tasks

jhesketh added a commit to grafana/mimir that referenced this pull request Aug 7, 2024

MQE: Use kahan in sum aggregation (#8923)

f0e7d90

* MQE: Use kahan in sum aggregation Implements the improvements from prometheus/prometheus#14074 prometheus/prometheus#14362 * Update CHANGELOG

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

bboreham commented May 9, 2024 •

edited

Loading

beorn7 left a comment

bboreham commented Jun 24, 2024

[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

Conversation

bboreham commented May 9, 2024 • edited Loading

beorn7 left a comment

Choose a reason for hiding this comment

bboreham commented Jun 24, 2024

bboreham commented May 9, 2024 •

edited

Loading