
Fix iteration until high(T) (fix #16353) #19652

Open · wants to merge 4 commits into base: devel

Conversation

Contributor

@Menduist Menduist commented Mar 28, 2022

Fix #16353, bits stolen from #16361

This logic should have a single test per yield, so it stays optimized:

    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        inc(res)

This also allows us to do unchecked increments (+%) on signed integers, so we should have fewer checks per yield than the current devel version
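A minimal sketch (the iterator name is mine, not from the PR) of why the post-yield bound check avoids the overflow: we break on res >= b before the increment, so res is never incremented past high(int).

```nim
# Wrap-safe `..`-style iterator: the bound check runs *before* the
# increment, so res never steps past high(int).
iterator fixedDotDot(a, b: int): int {.inline.} =
  if a <= b:
    var res = a
    while true:
      yield res
      if res >= b: break
      res = res +% 1   # `+%` wraps instead of raising OverflowDefect

var count = 0
for i in fixedDotDot(high(int) - 2, high(int)):
  inc count
doAssert count == 3
```

With the old check-after-increment formulation, reaching high(int) raised OverflowDefect (or hung for unsigned types).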

Contributor

@konsumlamm konsumlamm left a comment


Nice seeing this finally being fixed (I hope)! Could you also port this fix to countup, countdown and the generic ..?
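Porting the same shape to countdown might look something like this sketch (the iterator name is mine; the symmetric hazard there is underflow at low(T)):

```nim
# Check-before-decrement avoids underflow when b == low(int).
iterator safeCountdown(a, b: int): int {.inline.} =
  if a >= b:
    var res = a
    while true:
      yield res
      if res <= b: break
      res = res -% 1   # wrapping decrement, no OverflowDefect

var xs: seq[int]
for i in safeCountdown(low(int) + 2, low(int)):
  xs.add i
doAssert xs.len == 3
```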

tests/stdlib/iterators.nim: two review comments (outdated, resolved)
@Menduist
Contributor Author

Since countup, countdown & .. are generic, inc may carry weird user-defined logic, and trying to optimize them may break some stuff, I think.

But we can try
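For context (my example, not from the thread): the generic .. also has to cover non-integer ordinals, where a wrapping +% simply does not exist, which is why the unchecked-increment trick cannot be applied generically.

```nim
# `..` iterates any ordinal type; `+%`/`-%` are defined only for
# integers, so the unchecked-increment trick isn't available here.
var chars: seq[char]
for c in 'a' .. 'e':
  chars.add c
doAssert chars == @['a', 'b', 'c', 'd', 'e']

type Color = enum red, green, blue
var hits = 0
for c in red .. blue:
  inc hits
doAssert hits == 3
```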

@Menduist Menduist closed this Mar 29, 2022
@Menduist Menduist reopened this Mar 29, 2022
@Araq
Member

Araq commented Mar 29, 2022

In the past fixing this introduced performance regressions. Can we see some benchmarks?

@rockcavera
Contributor

rockcavera commented Mar 29, 2022

I did a simple benchmark, since I also thought it could bring a performance regression, and I noticed that the new solution was slightly faster than the old one with both -d:release and -d:danger, respectively 1.04x and 1.003x.

Edited: on the amd64 architecture

@rockcavera
Contributor

I want to correct my last comment. I performed a new benchmark, code below:

import std/[os, strutils]

import pkg/nimbench

iterator newDotDot(a, b: int): int {.inline.} =
  if a <= b:
    var res = a
    while true:
      yield res
      if res >= b: break
      res = res +% 1

let a = parseInt(paramStr(1))
var r: uint64

when defined(ndot):
  bench(dotDot2):
    var n = 0'u64
    for i in newDotDot(1, a):
      inc(n)
      doNotOptimizeAway(n)
    r = n
else:
  bench(dotDot):
    var n = 0'u64
    for i in 1 .. a:
      inc(n)
      doNotOptimizeAway(n)
    r = n

runBenchmarks()

echo r

I noticed that:

  • With -d:release, the new code performs worse from 1 up to 50 iterations; above 50 iterations it is slightly better.
  • With -d:danger, performance is much the same regardless of the situation, though in some cases the new code seems slightly better (a technical tie).

What I would say about this PR: if the current code doesn't work in all situations, it is bad code and should be fixed, regardless of whether the new solution performs better or worse. Besides, the new solution performs worse only in a few cases, which are very few compared to the possible number; the current solution doesn't work in all cases, which is worse, in my opinion.

@Araq
Member

Araq commented Mar 30, 2022

If the current code doesn't work in all situations it is bad code and should be fixed, regardless of whether the new solution performs better or worse.

I prefer the faster version, the "correct" version is only more correct for a single bit pattern (high(int)) that never comes up in most programs. If my numbers are close to the int's boundaries, I prefer to use BigInt. Good engineers design for error margins, not for perfection.

@Menduist
Contributor Author

Menduist commented Mar 30, 2022

Unfortunately, it's not all black or white, as @rockcavera mentioned.
Because this PR relies on compiler optimization (the C compiler's) to work as efficiently as possible, on int64 it's:

  • 250% of the original speed (so slower) by default
  • 65% of the original speed (so faster) with -d:release
  • similar speed with -d:release -d:danger or just -d:danger

For uint64:

  • 120% of the original speed (so slower) by default
  • similar speed with -d:release -d:danger, -d:danger or -d:release

I could deep-dive into the generated ASM to try to get better performance by default, but unless it's a blocker, I think fixing a bug and getting better perf with -d:release is already a step in the right direction that can be improved later

@Menduist Menduist closed this Mar 30, 2022
@Menduist Menduist reopened this Mar 30, 2022
@arnetheduck
Contributor

If my numbers are close to the int's boundaries, I prefer to use BigInt.

this is fine when you're writing a web page generator - if instead you're writing a bigint library (or cryptography, or AI or any kind of number crunching), you really don't want to be caught by odd and annoying issues like this one where every single bit is absolutely critical to the correct and secure operation of the library. Given Nim's desire to fulfill system programming needs, this bug really needed fixing a long time ago and I'm glad it's being looked at!

@Wh1teDuke
Contributor

The benchmark shows a tiny difference; for such a small number of iterations I would hardly call this a performance regression.

dotDot2                                                    321.89ns    3.11M
dotDot                                                     320.71ns    3.12M
100
dotDot2                                                     33.38ns   29.96M
dotDot                                                      32.05ns   31.20M
10

It is just one nanosecond on my computer (-d:release).

@c-blake
Contributor

c-blake commented Apr 2, 2022

This seemed like a hot-cache including warmed up branch predictor kind of performance situation that might vary a lot from one CPU to another. So, I compiled this more stdlib-only bench:

import std/[os, strutils, times, stats, math]

when defined(nd):
  iterator newDotDot(a, b: int): int {.inline.} =
    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        res = res +% 1

const nAvg = 100
const nMin = 100
let a = parseInt(paramStr(1))
var n = 0'u64
var rs: RunningStat
for k in 1..nAvg:
  var dtMin = float.high
  for j in 1..nMin:
    let t0 = epochTime()
    when defined(nd): (for i in newDotDot(1, a): inc n, i)
    else: (for i in 1 .. a: inc n, i)
    dtMin = min(dtMin, epochTime() - t0)*1e9
  rs.push dtMin
var avg = rs.mean
var err = rs.standardDeviation / nAvg.float.sqrt
echo formatFloat(avg, ffDecimal, 1), "  - ",
     formatFloat(err, ffDecimal, 1), " ", n

with nim-devel 7d32425 and gcc-11.2.1. Then running (with a non-variable CPU frequency):

for n in 5000 50000 500000; do for p in 1 2; do ./$p $n; done; echo; done

I got some interesting d:release/d:danger/cross CPU variability roughly confirming what has been said:

              *** TIME IN nanoseconds ***
n|CPU     AMD2950X           i7-6700K            i5-M540
-d:release
5000     2195.8  - 31.3    3650.2  - 11.5     4980.6  - 18.4
 new     1854.9  -  9.9    1738.1  - 12.3     2498.6  - 13.7

50000   21579.3  - 94.3   26257.0  - 59.1    49052.2  -  10.5
 new    18391.6  - 39.9    7441.0  - 46.5    25057.8  - 133.3

500000 215783.1  - 246.1 252244.5  - 189.8  495541.1  - 725.8
 new   184063.9  - 185.9  63870.0  - 93.4   247118.5  - 498.5
-d:danger
5000     1728.5  - 10.9    1611.7  - 16.9     3325.9  - 11.4
 new     1697.5  -  7.7    1742.8  - 12.5     2522.5  - 14.0

50000   16772.7  - 43.2    5459.8  - 77.1    33283.2  - 148.5
 new    16791.8  - 51.4    7424.4  - 47.1    25110.2  - 137.0

500000 167596.3  - 193.0  43361.2  - 686.7  333151.8  - 784.0
 new   167448.5  - 195.9  63877.1  - 101.7  250370.5  - 709.0

On really old Intel, the new way was always faster. On recent-ish Skylake Intel, the new way is 2-4x faster w/release but 1.4-1.5x slower w/danger. On AMD Zen1 the new way was 1.2x faster w/release with a statistically insignificant diff w/danger. So, pretty mixed results, but not what I would call "bad".

I don't think "in theory" the new way should really be slower, though it seems to perturb the Skylake branch predictor enough to notice. Some people get really excited about 639/433 = 1.5x kinds of ratios, but I think PGO might recover that.

In practice, well, practice for something like this always depends what CPUs you run on as well as probably which backend compilers (e.g. clang, visualC,..). The stdlib tends to pick "one ok way" rather than forking on CPU types/etc. That said, it would be informative for others to report on their own CPUs. This sort of thing really does tend to vary. E.g., someone testing an ARM version or two would be great. E.g. Apple M1 and AWS cloud ARMs.

In terms of this PR, well, if you are really worried about perf changes and you want to fix the bug then you could probably also do a large "if" statement that did the new way only if the upper bound is ==T.high. That might be pretty ugly and create i-cache pressure from how common the code construct is, but it's not actually that "big" in real terms. Anyway, I take no great stand on priorities, but thought I could add some relevant info and mention that unmentioned idea.
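That last idea might look roughly like this (a hypothetical sketch of the branch-on-bound approach, not code from the PR):

```nim
# Hybrid: take the wrap-safe loop only when the bound can actually
# overflow, keeping the old fast loop for every other case.
iterator hybridDotDot(a, b: int): int {.inline.} =
  if b == high(int):
    # rare path: check before incrementing so res never overflows
    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        res = res +% 1
  else:
    # common fast path: res can safely step to b + 1 here
    var res = a
    while res <= b:
      yield res
      inc res

var total = 0
for i in hybridDotDot(1, 5): total += i
doAssert total == 15
var seen = 0
for i in hybridDotDot(high(int) - 1, high(int)): inc seen
doAssert seen == 2
```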

@c-blake
Contributor

c-blake commented Apr 8, 2022

I should also have said (but I know @Araq knows) that branch predictor warm-up times might be impacted a lot, too. Those are harder to measure (vaguely related) - a hot loop benchmark will not do it by definition. You may need specialized per-CPU/OS tools or at least a different benchmark concept. Good coverage over CPUs of the hot cache case is at least a starting point.

Since loop control is the topic here (just structurally, not synthetically), we may only really care about hot cache/warmed up BPs. :-) Just seemed worth a mention. { Also tagging @planetis-m since he has seemed interested in microarch details in the past. }

@rockcavera
Copy link
Contributor

rockcavera commented Apr 20, 2022

Apparently this fix is going to be shelved, unfortunately. But I wanted to leave my last words.

I believe it is important for the language that this correction be accepted, because it works in all possible cases, removing from the language a bug that can happen without the developer even noticing. I don't see it as acceptable for the language to keep a bug in a feature it provides just because that is understood to be "faster". The language should provide bug-free features rather than trade correctness for performance. The ones who should take the risk are the developers who use the language: if someone decides that .. doesn't offer the "best" performance, they can develop their own iterator at any time.

Given what I said, I think this PR should be accepted and, if possible, similar features that have the same bug should adopt the fix.

@Varriount Varriount added the Requires Araq To Merge PR should only be merged by Araq label Aug 28, 2022
@github-actions
Contributor

This pull request is stale because it has been open for 1 year with no activity. Contribute more commits on the pull request and rebase it on the latest devel, or it will be closed in 30 days. Thank you for your contributions.

@github-actions github-actions bot added the stale Staled PR/issues; remove the label after fixing them label Aug 29, 2023
@github-actions github-actions bot removed the stale Staled PR/issues; remove the label after fixing them label Sep 28, 2023

@github-actions github-actions bot added the stale Staled PR/issues; remove the label after fixing them label Sep 28, 2024
Successfully merging this pull request may close these issues.

for i in 0 .. high(int32): discard gives OverflowDefect; high(uint32) hangs
8 participants