Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENHANCEMENT] PromQL: use Kahan summation for sum() #14074

Merged
merged 1 commit into from
Jun 24, 2024

Conversation

bboreham
Copy link
Member

@bboreham bboreham commented May 9, 2024

This can give a more precise result, by keeping a separate running compensation value to accumulate small errors.
See https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Possible improvement for #14052 (I won't call it a fix).

I'm re-using floatMean which is otherwise unused for sum(); it would be clearer to add a new field but would cost 8 bytes per output value.

Benchmarks show a bigger slow-down than I expected.

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: Intel(R) Core(TM) i7-14700K
                                                                                                          │ before.txt  │             after.txt             │
                                                                                                          │   sec/op    │   sec/op     vs base              │
RangeQuery/expr=sum(a_hundred),steps=1-28                                                                   142.2µ ± 1%   144.2µ ± 0%   1.41% (p=0.009 n=6)
RangeQuery/expr=sum(a_hundred),steps=100-28                                                                 447.5µ ± 2%   460.3µ ± 1%   2.85% (p=0.002 n=6)
RangeQuery/expr=sum(a_hundred),steps=1000-28                                                                2.670m ± 2%   2.760m ± 1%   3.37% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=1-28                                                       1.637m ± 2%   1.653m ± 1%       ~ (p=0.240 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=100-28                                                     5.081m ± 2%   5.352m ± 1%   5.33% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(h_hundred),steps=1000-28                                                    33.30m ± 2%   35.05m ± 0%   5.25% (p=0.002 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=1-28                                                      1.676m ± 2%   1.691m ± 1%   0.89% (p=0.026 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=100-28                                                    5.315m ± 2%   5.503m ± 1%   3.54% (p=0.002 n=6)
RangeQuery/expr=sum_without_(le)(h_hundred),steps=1000-28                                                   35.13m ± 1%   36.76m ± 1%   4.62% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=1-28                                                            1.654m ± 1%   1.685m ± 1%   1.84% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=100-28                                                          5.349m ± 2%   5.476m ± 0%   2.39% (p=0.002 n=6)
RangeQuery/expr=sum_by_(l)(h_hundred),steps=1000-28                                                         34.97m ± 1%   36.87m ± 1%   5.44% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=1-28                                                           1.618m ± 1%   1.647m ± 0%   1.79% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=100-28                                                         5.154m ± 1%   5.379m ± 1%   4.35% (p=0.002 n=6)
RangeQuery/expr=sum_by_(le)(h_hundred),steps=1000-28                                                        33.34m ± 2%   35.18m ± 1%   5.51% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=1-28                                             174.5µ ± 1%   176.0µ ± 1%       ~ (p=0.065 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=100-28                                           897.6µ ± 1%   911.0µ ± 1%   1.49% (p=0.015 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m])),steps=1000-28                                          6.615m ± 1%   6.690m ± 0%   1.14% (p=0.002 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=1-28      361.2µ ± 1%   360.3µ ± 1%       ~ (p=0.699 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=100-28    1.848m ± 1%   1.859m ± 1%       ~ (p=0.132 n=6)
RangeQuery/expr=sum_without_(l)(rate(a_hundred[1m]))_/_sum_without_(l)(rate(b_hundred[1m])),steps=1000-28   13.46m ± 1%   13.61m ± 1%   1.14% (p=0.041 n=6)
geomean                                                                                                     3.017m        3.095m        2.58%

This can give a more precise result, by keeping a separate running
compensation value to accumulate small errors.

See https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Signed-off-by: Bryan Boreham <[email protected]>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.

One day, we should apply all the Kahan tricks to histograms, too…

@bboreham
Copy link
Member Author

Current feeling is the slow-down will not be noticeable compared to overheads such as disk IO.

@bboreham bboreham merged commit b6aba4f into prometheus:main Jun 24, 2024
25 checks passed
@bboreham bboreham deleted the kahan-sum-sum branch June 24, 2024 10:13
beorn7 added a commit that referenced this pull request Jul 4, 2024
This expands the idea implemented in #14074, where Kahan summation was
applied to sum().

This comes at the price of not dual-using `floatMean` anymore, because
we need both `floatMean` and `floatKahanC` for `avg`. This creates a
small overhead, but also makes the Kahan parts easier to read.

The trade-off in terms of performance is similar to the one made in
`sum` with a following division. Things that work out for `sum` thanks
to Kahan summation should equally work out for `avg`. This is
reflected in the added tests.

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 4, 2024
This expands the idea implemented in #14074, where Kahan summation was
applied to sum().

This comes at the price of not dual-using `floatMean` anymore, because
we need both `floatMean` and `floatKahanC` for `avg`. This creates a
small overhead, but also makes the Kahan parts easier to read.

The trade-off in terms of performance is similar to the one made in
essentially syntactic sugar for `sum` with a following division.
Things that work out for `sum` thanks to Kahan summation should
therefory equally work out for `avg`. This is reflected in the added
tests.

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 4, 2024
The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and divide in the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation
as avg is supposed to be syntactic sugar for sum with a following
divison).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 4, 2024
The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and divide in the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation
as avg is supposed to be syntactic sugar for sum with a following
divison).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 4, 2024
The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and divide in the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation
as avg is supposed to be syntactic sugar for sum with a following
divison).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 4, 2024
The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and divide in the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation
as avg is supposed to be syntactic sugar for sum with a following
divison).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <[email protected]>
beorn7 added a commit that referenced this pull request Jul 10, 2024
The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and divide in the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation
as avg is supposed to be syntactic sugar for sum with a following
divison).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <[email protected]>
jhesketh added a commit to jhesketh/mimir that referenced this pull request Aug 7, 2024
jhesketh added a commit to grafana/mimir that referenced this pull request Aug 7, 2024
* MQE: Use kahan in sum aggregation

Implements the improvements from
prometheus/prometheus#14074
prometheus/prometheus#14362

* Update CHANGELOG
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants