
Fix iteration until high(T) (fix #16353) #19652

Open · wants to merge 4 commits into base: devel

Conversation

Contributor

@Menduist Menduist commented Mar 28, 2022

Fix #16353, bits stolen from #16361

This logic should have a single test per yield, so it stays optimized:

    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        inc(res)

This also allows us to do unchecked increments (+%) on signed integers, so we should have fewer checks per yield than the current devel version
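A minimal sketch (the iterator name is mine, not from the PR) of why the post-yield bound check avoids the overflow: we break on res >= b before the increment, so res is never incremented past high(int).

```nim
# Wrap-safe `..`-style iterator: the bound check runs *before* the
# increment, so res never steps past high(int).
iterator fixedDotDot(a, b: int): int {.inline.} =
  if a <= b:
    var res = a
    while true:
      yield res
      if res >= b: break
      res = res +% 1   # `+%` wraps instead of raising OverflowDefect

var count = 0
for i in fixedDotDot(high(int) - 2, high(int)):
  inc count
doAssert count == 3
```

With the old check-after-increment formulation, reaching high(int) raised OverflowDefect (or hung for unsigned types).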

Contributor

@konsumlamm konsumlamm left a comment


Nice seeing this finally being fixed (I hope)! Could you also port this fix to countup, countdown and the generic ..?
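Porting the same shape to countdown might look something like this sketch (the iterator name is mine; the symmetric hazard there is underflow at low(T)):

```nim
# Check-before-decrement avoids underflow when b == low(int).
iterator safeCountdown(a, b: int): int {.inline.} =
  if a >= b:
    var res = a
    while true:
      yield res
      if res <= b: break
      res = res -% 1   # wrapping decrement, no OverflowDefect

var xs: seq[int]
for i in safeCountdown(low(int) + 2, low(int)):
  xs.add i
doAssert xs.len == 3
```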

tests/stdlib/iterators.nim: two review comments (outdated, resolved)
@Menduist
Contributor Author

Since countup, countdown & .. are generic, inc may carry weird user-defined logic, and trying to optimize them may break some stuff, I think.

But we can try
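For context (my example, not from the thread): the generic .. also has to cover non-integer ordinals, where a wrapping +% simply does not exist, which is why the unchecked-increment trick cannot be applied generically.

```nim
# `..` iterates any ordinal type; `+%`/`-%` are defined only for
# integers, so the unchecked-increment trick isn't available here.
var chars: seq[char]
for c in 'a' .. 'e':
  chars.add c
doAssert chars == @['a', 'b', 'c', 'd', 'e']

type Color = enum red, green, blue
var hits = 0
for c in red .. blue:
  inc hits
doAssert hits == 3
```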

@Menduist Menduist closed this Mar 29, 2022
@Menduist Menduist reopened this Mar 29, 2022
@Araq
Member

Araq commented Mar 29, 2022

In the past fixing this introduced performance regressions. Can we see some benchmarks?

@rockcavera
Contributor

rockcavera commented Mar 29, 2022

I did a simple benchmark, since I also thought it could bring a performance regression, and I noticed that the new solution was slightly faster than the old one with both -d:release and -d:danger, respectively 1.04x and 1.003x.

Edited: on the amd64 architecture

@rockcavera
Contributor

I want to correct my last comment. I performed a new benchmark, code below:

import std/[os, strutils]

import pkg/nimbench

iterator newDotDot(a, b: int): int {.inline.} =
  if a <= b:
    var res = a
    while true:
      yield res
      if res >= b: break
      res = res +% 1

let a = parseInt(paramStr(1))
var r: uint64

when defined(ndot):
  bench(dotDot2):
    var n = 0'u64
    for i in newDotDot(1, a):
      inc(n)
      doNotOptimizeAway(n)
    r = n
else:
  bench(dotDot):
    var n = 0'u64
    for i in 1 .. a:
      inc(n)
      doNotOptimizeAway(n)
    r = n

runBenchmarks()

echo r

I noticed that:

  • With -d:release, the new code performs worse from 1 up to 50 iterations; above 50 iterations it is slightly better.
  • With -d:danger, performance is much the same regardless of the situation, though in some cases the new code seems slightly better (a technical tie).

What I would say about this PR: if the current code doesn't work in all situations, it is bad code and should be fixed, regardless of whether the new solution performs better or worse. Besides, the new solution performs worse only in a few cases, which are very few compared to the possible number; the current solution doesn't work in all cases, which is worse, in my opinion.

@Araq
Member

Araq commented Mar 30, 2022

If the current code doesn't work in all situations it is bad code and should be fixed, regardless of whether the new solution performs better or worse.

I prefer the faster version, the "correct" version is only more correct for a single bit pattern (high(int)) that never comes up in most programs. If my numbers are close to the int's boundaries, I prefer to use BigInt. Good engineers design for error margins, not for perfection.

@Menduist
Contributor Author

Menduist commented Mar 30, 2022

Unfortunately, it's not all black or white, as @rockcavera mentioned.
Because this PR relies on compiler optimization (the C compiler's) to work as efficiently as possible, on int64 it's:

  • 250% of the original speed (so slower) by default
  • 65% of the original speed (so faster) with -d:release
  • similar speed with -d:release -d:danger or just -d:danger

For uint64:

  • 120% of the original speed (so slower) by default
  • similar speed with -d:release -d:danger, -d:danger or -d:release

I could deep-dive into the generated ASM to try to get better performance by default, but unless it's a blocker, I think fixing a bug and getting better perf with -d:release is already a step in the right direction that can be improved later

@Menduist Menduist closed this Mar 30, 2022
@Menduist Menduist reopened this Mar 30, 2022
@arnetheduck
Contributor

If my numbers are close to the int's boundaries, I prefer to use BigInt.

this is fine when you're writing a web page generator - if instead you're writing a bigint library (or cryptography, or AI or any kind of number crunching), you really don't want to be caught by odd and annoying issues like this one where every single bit is absolutely critical to the correct and secure operation of the library. Given Nim's desire to fulfill system programming needs, this bug really needed fixing a long time ago and I'm glad it's being looked at!

@Wh1teDuke
Contributor

The benchmark shows a tiny difference; for such a small number of iterations I would hardly call this a performance regression.

dotDot2                                                    321.89ns    3.11M
dotDot                                                     320.71ns    3.12M
100
dotDot2                                                     33.38ns   29.96M
dotDot                                                      32.05ns   31.20M
10

It is just one nanosecond on my computer (-d:release).

@c-blake
Contributor

c-blake commented Apr 2, 2022

This seemed like a hot-cache including warmed up branch predictor kind of performance situation that might vary a lot from one CPU to another. So, I compiled this more stdlib-only bench:

import std/[os, strutils, times, stats, math]

when defined(nd):
  iterator newDotDot(a, b: int): int {.inline.} =
    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        res = res +% 1

const nAvg = 100
const nMin = 100
let a = parseInt(paramStr(1))
var n = 0'u64
var rs: RunningStat
for k in 1..nAvg:
  var dtMin = float.high
  for j in 1..nMin:
    let t0 = epochTime()
    when defined(nd): (for i in newDotDot(1, a): inc n, i)
    else: (for i in 1 .. a: inc n, i)
    dtMin = min(dtMin, epochTime() - t0)*1e9
  rs.push dtMin
var avg = rs.mean
var err = rs.standardDeviation / nAvg.float.sqrt
echo formatFloat(avg, ffDecimal, 1), "  - ",
     formatFloat(err, ffDecimal, 1), " ", n

with nim-devel 7d32425 and gcc-11.2.1. Then running (with a non-variable CPU frequency):

for n in 5000 50000 500000; do for p in 1 2; do ./$p $n; done; echo; done

I got some interesting d:release/d:danger/cross CPU variability roughly confirming what has been said:

              *** TIME IN nanoseconds ***
n|CPU     AMD2950X           i7-6700K            i5-M540
-d:release
5000     2195.8  - 31.3    3650.2  - 11.5     4980.6  - 18.4
 new     1854.9  -  9.9    1738.1  - 12.3     2498.6  - 13.7

50000   21579.3  - 94.3   26257.0  - 59.1    49052.2  -  10.5
 new    18391.6  - 39.9    7441.0  - 46.5    25057.8  - 133.3

500000 215783.1  - 246.1 252244.5  - 189.8  495541.1  - 725.8
 new   184063.9  - 185.9  63870.0  - 93.4   247118.5  - 498.5
-d:danger
5000     1728.5  - 10.9    1611.7  - 16.9     3325.9  - 11.4
 new     1697.5  -  7.7    1742.8  - 12.5     2522.5  - 14.0

50000   16772.7  - 43.2    5459.8  - 77.1    33283.2  - 148.5
 new    16791.8  - 51.4    7424.4  - 47.1    25110.2  - 137.0

500000 167596.3  - 193.0  43361.2  - 686.7  333151.8  - 784.0
 new   167448.5  - 195.9  63877.1  - 101.7  250370.5  - 709.0

On really old Intel, the new way was always faster. On recent-ish Skylake Intel, the new way is 2-4x faster w/release but 1.4-1.5x slower w/danger. On AMD Zen1 the new way was 1.2x faster w/release with a statistically insignificant diff w/danger. So, pretty mixed results, but not what I would call "bad".

I don't think "in theory" the new way should really be slower, though it seems to perturb the Skylake branch predictor enough to notice. Some people get really excited about 639/433 = 1.5x kinds of ratios, but I think PGO might recover that.

In practice, well, practice for something like this always depends what CPUs you run on as well as probably which backend compilers (e.g. clang, visualC,..). The stdlib tends to pick "one ok way" rather than forking on CPU types/etc. That said, it would be informative for others to report on their own CPUs. This sort of thing really does tend to vary. E.g., someone testing an ARM version or two would be great. E.g. Apple M1 and AWS cloud ARMs.

In terms of this PR, well, if you are really worried about perf changes and you want to fix the bug then you could probably also do a large "if" statement that did the new way only if the upper bound is ==T.high. That might be pretty ugly and create i-cache pressure from how common the code construct is, but it's not actually that "big" in real terms. Anyway, I take no great stand on priorities, but thought I could add some relevant info and mention that unmentioned idea.
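That last idea might look roughly like this (a hypothetical sketch of the branch-on-bound approach, not code from the PR):

```nim
# Hybrid: take the wrap-safe loop only when the bound can actually
# overflow, keeping the old fast loop for every other case.
iterator hybridDotDot(a, b: int): int {.inline.} =
  if b == high(int):
    # rare path: check before incrementing so res never overflows
    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        res = res +% 1
  else:
    # common fast path: res can safely step to b + 1 here
    var res = a
    while res <= b:
      yield res
      inc res

var total = 0
for i in hybridDotDot(1, 5): total += i
doAssert total == 15
var seen = 0
for i in hybridDotDot(high(int) - 1, high(int)): inc seen
doAssert seen == 2
```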

@c-blake
Contributor

c-blake commented Apr 8, 2022

I should also have said (but I know @Araq knows) that branch predictor warm-up times might be impacted a lot, too. Those are harder to measure (vaguely related) - a hot loop benchmark will not do it by definition. You may need specialized per-CPU/OS tools or at least a different benchmark concept. Good coverage over CPUs of the hot cache case is at least a starting point.

Since loop control is the topic here (just structurally, not synthetically), we may only really care about hot cache/warmed up BPs. :-) Just seemed worth a mention. { Also tagging @planetis-m since he has seemed interested in microarch details in the past. }

@rockcavera
Copy link
Contributor

rockcavera commented Apr 20, 2022

Apparently this fix is going to be shelved, unfortunately. But I wanted to leave my last words.

I believe it is important for the language that this correction be accepted, because it works in all possible cases, removing from the language a bug that can happen without the developer even noticing. I don't see it as acceptable for the language to keep a bug in a feature it provides just because that is understood to be "faster". The language should provide bug-free features rather than trade correctness for performance. The ones who should take the risk are the developers who use the language: if someone decides that .. doesn't offer the "best" performance, they can develop their own iterator at any time.

Given what I said, I think this PR should be accepted and, if possible, similar features that have the same bug should adopt the fix.

@Varriount Varriount added the Requires Araq To Merge PR should only be merged by Araq label Aug 28, 2022
@github-actions
Contributor

This pull request is stale because it has been open for 1 year with no activity. Contribute more commits on the pull request and rebase it on the latest devel, or it will be closed in 30 days. Thank you for your contributions.

@github-actions github-actions bot added the stale Staled PR/issues; remove the label after fixing them label Aug 29, 2023
@github-actions github-actions bot removed the stale Staled PR/issues; remove the label after fixing them label Sep 28, 2023

@github-actions github-actions bot added the stale Staled PR/issues; remove the label after fixing them label Sep 28, 2024
Successfully merging this pull request may close these issues.

for i in 0 .. high(int32): discard gives OverflowDefect; high(uint32) hangs
8 participants