Fix iteration until high(T) (fix #16353) #19652
base: devel
Conversation
Nice seeing this finally being fixed (I hope)! Could you also port this fix to `countup`, `countdown`, and the generic `..`?
But we can try.
In the past fixing this introduced performance regressions. Can we see some benchmarks?
I did a simple benchmark, because I also thought that it could bring a performance regression, and I noticed that the new solution was a little faster than the old one in both -d:release and -d:danger, respectively 1.04x and 1.003x. Edited: on amd64 architecture.
I want to correct my last comment. I performed a new benchmark, code below:

```nim
import std/[os, strutils]
import pkg/nimbench

iterator newDotDot(a, b: int): int {.inline.} =
  if a <= b:
    var res = a
    while true:
      yield res
      if res >= b: break
      res = res +% 1

let a = parseInt(paramStr(1))
var r: uint64
when defined(ndot):
  bench(dotDot2):
    var n = 0'u64
    for i in newDotDot(1, a):
      inc(n)
    doNotOptimizeAway(n)
    r = n
else:
  bench(dotDot):
    var n = 0'u64
    for i in 1 .. a:
      inc(n)
    doNotOptimizeAway(n)
    r = n
runBenchmarks()
echo r
```

I noticed that:

What I say about this new PR: if the current code doesn't work in all situations, it is bad code and should be fixed, regardless of whether the new solution performs better or worse. In addition, this new solution performs worse in only a few cases, which are very few compared to the number of possible cases. The current solution doesn't work in all cases, which is worse, in my opinion.
I prefer the faster version; the "correct" version is only more correct for a single bit pattern (high(int)) that never comes up in most programs. If my numbers are close to an int's boundaries, I prefer to use a BigInt. Good engineers design for error margins, not for perfection.
Unfortunately, it's not all black or white, as @rockcavera mentioned.
For uint64:
I could deep dive into the generated ASM to try to get better performance by default, but unless it's a blocker, I think fixing a bug and getting better perf with …
This is fine when you're writing a web page generator. If instead you're writing a bigint library (or cryptography, or AI, or any kind of number crunching), you really don't want to be caught by odd and annoying issues like this one, where every single bit is absolutely critical to the correct and secure operation of the library. Given Nim's desire to fulfill system programming needs, this bug really needed fixing a long time ago and I'm glad it's being looked at!
The benchmark shows a tiny difference; for such a small number of iterations I would hardly qualify this as a performance regression.
It is just one nanosecond on my computer (-d:release).
This seemed like a hot-cache (including warmed-up branch predictor) kind of performance situation that might vary a lot from one CPU to another. So, I compiled this more stdlib-only bench:

```nim
import std/[os, strutils, times, stats, math]

when defined(nd):
  iterator newDotDot(a, b: int): int {.inline.} =
    if a <= b:
      var res = a
      while true:
        yield res
        if res >= b: break
        res = res +% 1

const nAvg = 100
const nMin = 100
let a = parseInt(paramStr(1))
var n = 0'u64
var rs: RunningStat
for k in 1..nAvg:
  var dtMin = float.high
  for j in 1..nMin:
    let t0 = epochTime()
    when defined(nd): (for i in newDotDot(1, a): inc n, i)
    else: (for i in 1 .. a: inc n, i)
    dtMin = min(dtMin, (epochTime() - t0)*1e9)
  rs.push dtMin
var avg = rs.mean
var err = rs.standardDeviation / nAvg.float.sqrt
echo formatFloat(avg, ffDecimal, 1), " - ",
     formatFloat(err, ffDecimal, 1), " ", n
```

with nim-devel 7d32425 and gcc-11.2.1. Then running (with a non-variable CPU frequency):

```sh
for n in 5000 50000 500000; do for p in 1 2; do ./$p $n; done; echo; done
```

I got some interesting d:release/d:danger/cross-CPU variability roughly confirming what has been said:

On really old Intel, the new way was always faster. On recent-ish Skylake Intel, the new way is 2-4x faster w/release but 1.4-1.5x slower w/danger. On AMD Zen1 the new way was 1.2x faster w/release and a statistically insignificant diff w/danger. So, pretty mixed results, but not what I would call "bad". I don't think "in theory" the new way should really be slower, though it seems to perturb the Skylake branch predictor enough to see. Some people get really excited about 639/433 = 1.5x kinds of ratios, but I think PGO might recover that. In practice, well, practice for something like this always depends on what CPUs you run on, as well as probably which backend compilers (e.g. clang, Visual C, ...). The stdlib tends to pick "one ok way" rather than forking on CPU types/etc. That said, it would be informative for others to report on their own CPUs; this sort of thing really does tend to vary. E.g., someone testing an ARM version or two would be great, e.g. Apple M1 and AWS cloud ARMs. In terms of this PR, if you are really worried about perf changes and you want to fix the bug, then you could probably also do a large "if" statement that did the new way only if the upper bound is `high(T)`.
I should also have said (but I know @Araq knows) that branch predictor warm-up times might be impacted a lot, too. Those are harder to measure (vaguely related): a hot-loop benchmark will not capture them, by definition. You may need specialized per-CPU/OS tools, or at least a different benchmark concept. Good coverage over CPUs of the hot-cache case is at least a starting point. Since loop control is the topic here (just structurally, not synthetically), we may only really care about hot cache/warmed-up BPs. :-) Just seemed worth a mention. { Also tagging @planetis-m since he has seemed interested in microarch details in the past. }
Apparently this fix is going to be shelved, unfortunately. But I wanted to leave my last words. I believe it is important for the language that this correction be accepted, because it works in all possible cases, removing from the language a bug that can bite without the developer even noticing. I don't see it as acceptable for the language to keep a bug in a feature it provides just because the buggy version is understood to be "faster". The language should provide bug-free features rather than adopt buggy ones in favor of performance; the decision to trade correctness for speed should be left to the developers who use the language. Given what I said, I think this PR should be accepted and, if possible, similar features that have the same bug should adopt the fix.
This pull request is stale because it has been open for 1 year with no activity. Contribute more commits on the pull request and rebase it on the latest devel, or it will be closed in 30 days. Thank you for your contributions.
Fix #16353, bits stolen from #16361
This logic should have a single test per yield, so it stays optimized.
This also allows us to do unchecked increments (`+%`) on signed integers, so we should have fewer checks per yield than the current devel version.