
Tracking Issue for {u8,i8,...}::isqrt #116226

Open
1 of 3 tasks
FedericoStra opened this issue Sep 28, 2023 · 15 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@FedericoStra
Contributor

FedericoStra commented Sep 28, 2023

Feature gate: #![feature(isqrt)]

This is a tracking issue for the methods {u8,u16,u32,u64,u128,usize}::isqrt and {i8,i16,i32,i64,i128,isize}::{isqrt,checked_isqrt}, which compute the integer square root, addressing issue #89273.

Public API

For every suffix N among 8, 16, 32, 64, 128 and size, the feature isqrt introduces the methods

const fn uN::isqrt(self) -> uN;
const fn iN::isqrt(self) -> iN;
const fn iN::checked_isqrt(self) -> Option<iN>;
Full API
const fn u8::isqrt(self) -> u8;
const fn i8::isqrt(self) -> i8;
const fn i8::checked_isqrt(self) -> Option<i8>;

const fn u16::isqrt(self) -> u16;
const fn i16::isqrt(self) -> i16;
const fn i16::checked_isqrt(self) -> Option<i16>;

const fn u32::isqrt(self) -> u32;
const fn i32::isqrt(self) -> i32;
const fn i32::checked_isqrt(self) -> Option<i32>;

const fn u64::isqrt(self) -> u64;
const fn i64::isqrt(self) -> i64;
const fn i64::checked_isqrt(self) -> Option<i64>;

const fn u128::isqrt(self) -> u128;
const fn i128::isqrt(self) -> i128;
const fn i128::checked_isqrt(self) -> Option<i128>;

const fn usize::isqrt(self) -> usize;
const fn isize::isqrt(self) -> isize;
const fn isize::checked_isqrt(self) -> Option<isize>;
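A minimal usage sketch (values chosen by us for illustration; at the time this required `#![feature(isqrt)]` on nightly, and the gate is unneeded on toolchains where the feature has since been stabilized):

```rust
fn main() {
    assert_eq!(10u32.isqrt(), 3); // floor(sqrt(10))
    assert_eq!(16u64.isqrt(), 4);
    assert_eq!(17i32.isqrt(), 4);
    assert_eq!(9i32.checked_isqrt(), Some(3));
    assert_eq!((-4i32).checked_isqrt(), None); // negative input: no result
}
```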

Steps / History

Unresolved Questions

  • None yet.

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@FedericoStra FedericoStra added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Sep 28, 2023
@ChaiTRex
Contributor

ChaiTRex commented Sep 29, 2023

Once #116176 is merged, I would like to submit improved tests and compiler optimization hints for signed integers with tighter upper bounds than with the corresponding unsigned integers.


Separately, I'm going to try to use f32 and f64 to speed up isqrt, as I think that, up to 64-bit integers, the square root function on those might be faster and any errors in the outputs can be adjusted for. This would have the downsides of making the functions non-const and multiplying the written implementations, as doing this effectively would require different code for different bit sizes (for example, u32 could use f64's 53-bit mantissa with no errors, while u64 would need to use f64 with an error adjustment).
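The u32-via-f64 case mentioned above can be sketched as follows (illustrative only; `isqrt_u32_via_f64` is our name, not proposed code). Since f64's 53-bit mantissa represents every u32 exactly and sqrt is correctly rounded, truncation already gives the floor square root; the correction loops are defensive:

```rust
fn isqrt_u32_via_f64(n: u32) -> u32 {
    let mut r = (n as f64).sqrt() as u32;
    // Adjust down if the estimate overshot (or its square overflowed u32).
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    // Adjust up if the estimate undershot.
    while (r + 1).checked_mul(r + 1).is_some_and(|sq| sq <= n) {
        r += 1;
    }
    r
}

fn main() {
    assert_eq!(isqrt_u32_via_f64(0), 0);
    assert_eq!(isqrt_u32_via_f64(99), 9);
    assert_eq!(isqrt_u32_via_f64(100), 10);
    assert_eq!(isqrt_u32_via_f64(u32::MAX), 65_535);
}
```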

@ChaiTRex
Contributor

ChaiTRex commented Sep 30, 2023

In a benchmarking repository, I've implemented some methods based on floating-point instructions. I get the following speed improvements on an AMD Ryzen 9 5900X:

Speed of original and floating-point methods
original_i8             time:   [10.921 ns 10.963 ns 11.010 ns]
floating_i8             time:   [9.7668 ns 9.8591 ns 9.9453 ns]

original_u8             time:   [9.5130 ns 9.5649 ns 9.6401 ns]
floating_u8             time:   [5.3333 ns 5.3488 ns 5.3632 ns]

original_i16            time:   [12.231 ns 12.257 ns 12.289 ns]
floating_i16            time:   [8.5368 ns 8.5592 ns 8.5828 ns]

original_u16            time:   [12.858 ns 12.875 ns 12.896 ns]
floating_u16            time:   [5.3525 ns 5.3536 ns 5.3549 ns]

original_i32            time:   [11.941 ns 11.945 ns 11.950 ns]
floating_i32            time:   [7.5067 ns 7.5106 ns 7.5152 ns]

original_u32            time:   [15.381 ns 15.420 ns 15.478 ns]
floating_u32            time:   [3.8586 ns 3.9171 ns 3.9841 ns]

original_i64            time:   [19.061 ns 19.064 ns 19.068 ns]
floating_i64            time:   [7.2079 ns 7.2427 ns 7.2740 ns]

original_u64            time:   [28.456 ns 28.502 ns 28.549 ns]
floating_u64            time:   [5.8539 ns 5.8567 ns 5.8601 ns]

original_i128           time:   [130.68 ns 130.93 ns 131.15 ns]
floating_i128           time:   [14.931 ns 14.996 ns 15.134 ns]

original_u128           time:   [248.76 ns 248.81 ns 248.86 ns]
floating_u128           time:   [17.346 ns 17.380 ns 17.410 ns]

Edit: Rebenchmarked because I used the Karatsuba square root algorithm for 128-bit integers to get a major speedup.

@ryanavella

@ChaiTRex I noticed in your benchmarking repository that you've commented out what I presume was previously a lookup table implementation for u8 and i8. How did that perform?

I benchmarked a few naive lookup tables, and at least on my machine (ThinkPad T490, i5-8365U) the results look comparable, arguably even competitive.

Benchmarks
floating karatsuba_i8 time: [10.002 ns 10.008 ns 10.015 ns]
floating karatsuba_u8 time: [4.8382 ns 4.8987 ns 4.9682 ns]
floating_i8           time: [11.109 ns 11.162 ns 11.207 ns]
floating_u8           time: [6.2175 ns 6.2430 ns 6.2621 ns]
karatsuba_i8         time: [9.8523 ns 9.9603 ns 10.052 ns]
karatsuba_u8         time: [5.0738 ns 5.1621 ns 5.2458 ns]
original_i8          time: [13.251 ns 13.406 ns 13.539 ns]
original_u8          time: [10.680 ns 10.696 ns 10.709 ns]

lut16_i8             time: [11.032 ns 11.033 ns 11.034 ns]
lut16_u8             time: [7.9575 ns 7.9678 ns 7.9778 ns]
lut32_i8             time: [11.323 ns 11.328 ns 11.334 ns]
lut32_u8             time: [6.7746 ns 6.8290 ns 6.8980 ns]
lut64_i8             time: [10.113 ns 10.150 ns 10.214 ns]
lut64_u8             time: [5.8406 ns 5.8711 ns 5.9037 ns]
lut256_i8            time: [10.007 ns 10.026 ns 10.051 ns]
lut256_u8            time: [4.9602 ns 4.9606 ns 4.9611 ns]

Here are my unsigned implementations for what I'm currently calling lut16, lut32, lut64, and lut256, which offer different tradeoffs between table size and branchiness:

(note: for the signed benchmarks I delegated to the unsigned implementation, on the assumption that we don't want to bloat up binaries with redundant lookup tables)

lut16
```rust
impl UnsignedIsqrt for u8 {
    fn isqrt(self) -> Self {
        const LUT: [u8; 16] = [
            3, 5, 6, 7,
            8, 9, 10, 11,
            11, 12, 13, 13,
            14, 14, 15, 15,
        ];
        let mut est = LUT[(self >> 4) as usize];
        let mut square = est * est;
        if square > self {
            square -= 2*est - 1;
            est -= 1;
            if square > self {
                square -= 2*est - 1;
                est -= 1;
                if square > self {
                    // square -= 2*est - 1;
                    est -= 1;
                }
            }
        }
        est
    }
}
```
lut32
```rust
impl UnsignedIsqrt for u8 {
    fn isqrt(self) -> Self {
        const LUT: [u8; 32] = [
            2, 3, 4, 5, 6, 6, 7, 7,
            8, 8, 9, 9, 10, 10, 10, 11,
            11, 11, 12, 12, 12, 13, 13, 13,
            14, 14, 14, 14, 15, 15, 15, 15,
        ];
        let mut est = LUT[(self >> 3) as usize];
        let mut square = est * est;
        if square > self {
            square -= 2*est - 1;
            est -= 1;
            if square > self {
                // square -= 2*est - 1;
                est -= 1;
            }
        }
        est
    }
}
```
lut64
```rust
impl UnsignedIsqrt for u8 {
    fn isqrt(self) -> Self {
        const LUT: [u8; 64] = [
            1, 2, 3, 3, 4, 4, 5, 5,
            5, 6, 6, 6, 7, 7, 7, 7,
            8, 8, 8, 8, 9, 9, 9, 9,
            9, 10, 10, 10, 10, 10, 11, 11,
            11, 11, 11, 11, 12, 12, 12, 12,
            12, 12, 13, 13, 13, 13, 13, 13,
            13, 14, 14, 14, 14, 14, 14, 14,
            15, 15, 15, 15, 15, 15, 15, 15,
        ];
        let est = LUT[(self >> 2) as usize];
        if est * est > self {
            est - 1
        } else {
            est
        }
    }
}
```
lut256
```rust
impl UnsignedIsqrt for u8 {
    fn isqrt(self) -> Self {
        const LUT: [u8; 256] = [
            0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
            4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
            5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
            8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
            9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
            10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11,
            11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
            12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13,
            13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
            13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
            14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
            14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
            15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
        ];
        LUT[self as usize]
    }
}
```

@ChaiTRex
Contributor

ChaiTRex commented Jan 25, 2024

@ryanavella

@ChaiTRex I noticed in your benchmarking repository that you've commented out what I presume was previously a lookup table implementation for u8 and i8. How did that perform?

I didn't include either the table or libgmp versions because of copyright concerns.

table uses fred_sqrt from part 3c of Paul Hsieh's Square Root page (note that the 256-entry u8 table there contains fixed-point square roots whose integer part is the most significant four bits). That table is then used for u32's isqrt, with fred_sqrt edited by me to exploit the table's repetitiveness and significantly reduce the number of branches.

Benchmarks, including commented-out implementations
original_i8             time:   [8.8528 ns 8.8812 ns 8.9134 ns]
floating_i8             time:   [7.1974 ns 7.2120 ns 7.2323 ns]
floating karatsuba_i8   time:   [6.6171 ns 6.6359 ns 6.6577 ns]
karatsuba_i8            time:   [6.8114 ns 6.8362 ns 6.8623 ns]
table_i8                time:   [6.6069 ns 6.6273 ns 6.6469 ns]
libgmp_i8               time:   [6.7207 ns 6.7242 ns 6.7282 ns]

original_u8             time:   [7.3317 ns 7.3425 ns 7.3522 ns]
floating_u8             time:   [4.6880 ns 4.7037 ns 4.7192 ns]
floating karatsuba_u8   time:   [2.5223 ns 2.5303 ns 2.5377 ns]
karatsuba_u8            time:   [3.1860 ns 3.1993 ns 3.2164 ns]
table_u8                time:   [2.5434 ns 2.5499 ns 2.5565 ns]
libgmp_u8               time:   [2.5312 ns 2.5382 ns 2.5450 ns]

original_i16            time:   [9.9607 ns 9.9781 ns 9.9970 ns]
floating_i16            time:   [6.8501 ns 6.8723 ns 6.8935 ns]
floating karatsuba_i16  time:   [6.8666 ns 6.8890 ns 6.9088 ns]
karatsuba_i16           time:   [7.5343 ns 7.5497 ns 7.5659 ns]
table_i16               time:   [7.3142 ns 7.3333 ns 7.3528 ns]
libgmp_i16              time:   [7.4119 ns 7.4263 ns 7.4410 ns]

original_u16            time:   [10.318 ns 10.332 ns 10.349 ns]
floating_u16            time:   [6.5575 ns 6.5711 ns 6.5827 ns]
floating karatsuba_u16  time:   [3.2735 ns 3.2822 ns 3.2902 ns]
karatsuba_u16           time:   [4.6550 ns 4.6614 ns 4.6684 ns]
table_u16               time:   [3.7317 ns 3.7352 ns 3.7390 ns]
libgmp_u16              time:   [3.6700 ns 3.6795 ns 3.6894 ns]

original_i32            time:   [11.113 ns 11.138 ns 11.160 ns]
floating_i32            time:   [6.6519 ns 6.6616 ns 6.6697 ns]
floating karatsuba_i32  time:   [6.5736 ns 6.5880 ns 6.6057 ns]
karatsuba_i32           time:   [8.2693 ns 8.2823 ns 8.3002 ns]
table_i32               time:   [7.2038 ns 7.2214 ns 7.2417 ns]
libgmp_i32              time:   [7.2301 ns 7.2389 ns 7.2476 ns]

original_u32            time:   [13.647 ns 13.680 ns 13.719 ns]
floating_u32            time:   [2.8695 ns 2.8776 ns 2.8846 ns]
floating karatsuba_u32  time:   [2.7922 ns 2.8011 ns 2.8093 ns]
karatsuba_u32           time:   [7.5492 ns 7.6069 ns 7.6858 ns]
table_u32               time:   [4.7263 ns 4.7436 ns 4.7614 ns]
libgmp_u32              time:   [4.4509 ns 4.4625 ns 4.4740 ns]

original_i64            time:   [19.217 ns 19.287 ns 19.371 ns]
floating_i64            time:   [9.1175 ns 9.1324 ns 9.1488 ns]
floating karatsuba_i64  time:   [9.3864 ns 9.4072 ns 9.4309 ns]
karatsuba_i64           time:   [13.713 ns 13.737 ns 13.770 ns]
table_i64               time:   [11.280 ns 11.285 ns 11.290 ns]
libgmp_i64              time:   [11.359 ns 11.420 ns 11.488 ns]

original_u64            time:   [28.746 ns 28.808 ns 28.886 ns]
floating_u64            time:   [5.7362 ns 5.7439 ns 5.7525 ns]
floating karatsuba_u64  time:   [5.8237 ns 5.8340 ns 5.8452 ns]
karatsuba_u64           time:   [18.599 ns 18.628 ns 18.674 ns]
table_u64               time:   [12.631 ns 12.656 ns 12.686 ns]
libgmp_u64              time:   [12.943 ns 12.947 ns 12.951 ns]

original_i128           time:   [129.46 ns 129.99 ns 130.48 ns]
floating_i128           time:   [16.033 ns 16.112 ns 16.224 ns]
floating karatsuba_i128 time:   [15.891 ns 15.918 ns 15.948 ns]
karatsuba_i128          time:   [21.446 ns 21.527 ns 21.597 ns]
table_i128              time:   [17.982 ns 18.018 ns 18.052 ns]
libgmp_i128             time:   [17.814 ns 17.832 ns 17.851 ns]

original_u128           time:   [250.32 ns 251.21 ns 252.13 ns]
floating_u128           time:   [20.254 ns 20.288 ns 20.333 ns]
floating karatsuba_u128 time:   [19.654 ns 19.725 ns 19.804 ns]
karatsuba_u128          time:   [30.799 ns 30.853 ns 30.915 ns]
table_u128              time:   [23.780 ns 23.824 ns 23.873 ns]
libgmp_u128             time:   [23.687 ns 23.777 ns 23.864 ns]

It should be noted that libgmp was somewhat faster than table last time I checked, but now they're essentially equivalent.


Here are my unsigned implementations for what I'm currently calling lut16, lut32, lut64, and lut256, which offer different tradeoffs between table size and branchiness:

Since the goal is reducing memory use, code size matters as well as table size:

| Implementation | Table size | Code size | Total size |
|----------------|------------|-----------|------------|
| lut16          | 16         | 0x4a = 74 | 90         |
| lut32          | 32         | 0x3d = 61 | 93         |
| lut64          | 64         | 0x20 = 32 | 96         |
| lut256         | 256        | 0x10 = 16 | 272        |

Since each value in your tables takes up four bits, two of them can fit in a u8, so maybe that can help as well.

It should be noted that lut256 can probably be inlined into one memory access with a small computation of the address.
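One way the four-bit packing suggestion could look (entirely hypothetical names and code, not from the thread): since every u8 square root fits in a nibble, two table entries share one byte, halving lut256's 256-byte table to 128 bytes.

```rust
// Simple reference isqrt used to build the table at compile time.
const fn isqrt_u8_ref(n: u8) -> u8 {
    let mut r: u16 = 0;
    while (r + 1) * (r + 1) <= n as u16 {
        r += 1;
    }
    r as u8
}

// PACKED[i] holds isqrt(2i) in the low nibble and isqrt(2i + 1) in the high.
const PACKED: [u8; 128] = {
    let mut t = [0u8; 128];
    let mut i: usize = 0;
    while i < 128 {
        t[i] = isqrt_u8_ref(2 * i as u8) | (isqrt_u8_ref(2 * i as u8 + 1) << 4);
        i += 1;
    }
    t
};

fn isqrt_u8_packed(n: u8) -> u8 {
    // Select the byte, then the nibble for even/odd inputs.
    (PACKED[(n >> 1) as usize] >> ((n & 1) * 4)) & 0xF
}

fn main() {
    // Exhaustive agreement with the reference over all 256 inputs.
    for n in 0..=u8::MAX {
        assert_eq!(isqrt_u8_packed(n), isqrt_u8_ref(n));
    }
}
```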

@tfpf

tfpf commented Feb 4, 2024

Is there an estimate on when this might become available in stable?

@ryanavella

Is there an estimate on when this might become available in stable?

I'm not sure these necessarily block stabilization, but here are some outstanding questions I'd like to see answered:

  • Whether we even want to provide a panicking API for signed integers (it is kind of redundant with x.checked_isqrt().unwrap())
  • Whether to use floating point acceleration (and if so, a proof that it yields the same results regardless of FP optimizations and FPU quirks)
  • How to detect hardware floating point support (Needed: cfg(target_feature) support for hardware floating point detection. #64514)
  • What kind of tradeoff is acceptable between code size, table size (if applicable), and speed

Also, we could always use more testers and more benchmarks.

@tfpf

tfpf commented Feb 16, 2024

My two cents, as a relatively new user (started using Rust just a year ago), who is absolutely not an expert in compiler/library design.

  • x.ilog2(), which has been stabilised, panics if x <= 0. (I assume that's not the same as 'having a panicking API'?) Whatever is true of it should also be true of x.isqrt() when x < 0.
  • From a cursory glance at musl (used on lightweight OSes like Alpine), it appears to not use lookup tables. Perhaps the same could be done here?
    • A 64-bit system could be assumed to be not terribly memory-constrained, so full use of lookup tables might be considered.

@ryanavella

  • From a cursory glance at musl (used on lightweight OSes like Alpine), it appears to not use lookup tables. Perhaps the same could be done here?

Maybe I'm reading the musl sources wrong, but I don't see an implementation for integer square root? I did find this lookup table used for (floating-point) sqrt. Keep in mind that this implementation is probably only relevant to platforms without hardware floating point.

@tfpf

tfpf commented Feb 16, 2024

Nice find; I didn't check thoroughly. Yes, they don't have an integer square root function, but I was trying to see whether they used any lookup tables at all, since musl is a relatively new library (compared to C itself), comparable in age to Rust. (Probably not fair to compare a library and a language, though.)

@ChaiTRex
Contributor

ChaiTRex commented Feb 18, 2024

@ryanavella With regard to your questions:

  • Whether we even want to provide a panicking API for signed integers (it is kind of redundant with x.checked_isqrt().unwrap())

I agree with what @tfpf said about maintaining consistency with the behavior of ilog functions.

  • Whether to use floating point acceleration (and if so, a proof that it yields the same results regardless of FP optimizations and FPU quirks)

I think that, for a first release, floating point support isn't necessary: a decently fast non-FP method for u8 (some targets have no FP at all, or a slow FP square root) plus the Karatsuba method for everything else will be pretty fast.
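For reference, a common shape for such a non-FP, const-compatible method is Newton's method on integers (illustrative only, not the actual std implementation):

```rust
const fn isqrt_newton(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    // Start from a power of two guaranteed to be >= sqrt(n),
    // derived from the bit length of n.
    let mut x = 1u64 << ((64 - n.leading_zeros() + 1) / 2);
    loop {
        let y = (x + n / x) / 2; // Newton step for x^2 - n = 0
        if y >= x {
            return x; // sequence stopped decreasing: x = floor(sqrt(n))
        }
        x = y;
    }
}

fn main() {
    assert_eq!(isqrt_newton(0), 0);
    assert_eq!(isqrt_newton(99), 9);
    assert_eq!(isqrt_newton(100), 10);
    assert_eq!(isqrt_newton(u64::MAX), u32::MAX as u64);
}
```

Starting from an overestimate, the iteration decreases monotonically and terminates at the floor square root, so no final correction step is needed.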

As far as a proof, a rough one is available in the 34th comment of this linked discussion.

If it matters, I also looked into the rounding mode used and found that Rust uses and requires the default rounding mode and that the default rounding mode for IEEE-754 is "round-to-nearest-ties-to-even".

  • What kind of tradeoff is acceptable between code size, table size (if applicable), and speed

It would be nice if the s and z optimization settings were available as a cfg flag: then, just by the developer choosing in the standard way whether to optimize for speed or size, the isqrt implementations could pick between the smallest code/table size and the fastest available implementation.

That and FP-detection may make thoroughly testing it difficult, however.

Maybe I'm reading the musl sources wrong, but I don't see an implementation for integer square root? I did find this lookup table used for (floating-point) sqrt.

I'd caution that licensing should be considered before using or adapting musl's code, so that the Rust standard library's two licenses (including proper credit for the code) remain all that's needed to use it.

@ChaiTRex
Contributor

ChaiTRex commented Feb 18, 2024

In my last comment in this thread, I pointed out a discussion I had started on internals.rust-lang.org.

I also started a discussion on the Rust Zulip, where I was informed that core::intrinsics::const_eval_select can take two implementations of isqrt (one const fn, and a faster one that can't be const fn, like the floating-point version here) and meld them into a combined const fn that uses the fast version except in a const context.

The const fn version could use a large table to speed itself up without affecting the executable, as that version is only used at compile time.

@tgross35
Contributor

Using floating point operations to speed up integer operations seems very out of scope for std, especially starting here: we don't do this for any other methods, the possible loss of precision isn't nice, and dealing with different behavior on softfloat targets also isn't ideal. Methods that accept lossy calculations in exchange for performance are a cool thing to have in the ecosystem, though. Have you considered publishing a faster isqrt (and others that would benefit) as a crate? (the name fast-int isn't taken yet :) )


These haven't yet been around a year but seem uncontroversial. Putting up a stabilization PR is easy; anyone interested could probably do that to get the discussion ball rolling.

@ChaiTRex
Contributor

possible loss of precision isn't nice

The floating point methods give precise results. This can be seen by checking the square root of all the perfect squares ((n * n).isqrt() == n) and all the perfect squares minus one (((n * n) - 1).isqrt() == n - 1) as well as checking 0 and MAX, which has been done exhaustively for u8 through u64. u128 uses the correct results of u64 along with the Karatsuba square root algorithm, which uses integer operations only.

A proof that it works can also be found in a comment in the internals forum.


I do think that this should be stabilized with the Karatsuba methods with no floating point, as it appears that there is a lot of work involved in getting the floating point stuff working.
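The kind of exhaustive check described above can be sketched for all u16 inputs (our own reference code, not the actual test harness): a correctly rounded f64 sqrt, truncated, is compared against a plain integer-only loop.

```rust
fn isqrt_via_f64(n: u16) -> u16 {
    // Every u16 is exact in f64 (53-bit mantissa), and sqrt is correctly
    // rounded, so truncation yields the floor square root in this range.
    (n as f64).sqrt() as u16
}

fn isqrt_loop(n: u16) -> u16 {
    // Slow but obviously correct reference.
    let mut r: u32 = 0;
    while (r + 1) * (r + 1) <= n as u32 {
        r += 1;
    }
    r as u16
}

fn main() {
    for n in 0..=u16::MAX {
        assert_eq!(isqrt_via_f64(n), isqrt_loop(n), "mismatch at {n}");
    }
}
```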

@tgross35
Contributor

u128 uses the correct results of u64 along with the Karatsuba square root algorithm, which uses integer operations only.

This is specifically the case I was referring to, but I guess it is covered if there is a fallback.

You could make a PR to change to Karatsuba at any time if that shows improvement; changing the internal implementation is independent of stabilization.

@kennytm
Member

kennytm commented Jun 3, 2024

Could we have NonZero<uN>::isqrt() -> NonZero<uN> too 🤔

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 19, 2024
Add `isqrt` to `NonZero<uN>`

Implements [rust-lang#70887 (comment)](rust-lang#116226 (comment)), with the following signature:

```rust
impl NonZero<uN> {
    const fn isqrt(self) -> Self;
}
```

Unintended benefits include one fewer panicking branch in `ilog2` for LLVM to optimize away, and one fewer `assume_unchecked` as `NonZero` already does that.

The fast path for `self == 1` is dropped, but the current implementation is very slow anyways compared to hardware. Performance improvements can always come later.

(I didn't add the function to `NonZero<iN>`, since _every_ existing `NonZero` method is non-panicking, and it might be nice to leave it that way.)
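For illustration, the signature above is used like this (a usage sketch on a recent toolchain; at the time this required nightly):

```rust
use std::num::NonZero;

fn main() {
    let n = NonZero::new(17u32).unwrap();
    // The result stays NonZero: isqrt(n) >= 1 whenever n >= 1.
    assert_eq!(n.isqrt().get(), 4);
}
```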
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jul 19, 2024
Rollup merge of rust-lang#126199 - ivan-shrimp:nonzero_isqrt, r=tgross35

Add `isqrt` to `NonZero<uN>`

Implements [rust-lang#70887 (comment)](rust-lang#116226 (comment)), with the following signature:

```rust
impl NonZero<uN> {
    const fn isqrt(self) -> Self;
}
```

Unintended benefits include one fewer panicking branch in `ilog2` for LLVM to optimize away, and one fewer `assume_unchecked` as `NonZero` already does that.

The fast path for `self == 1` is dropped, but the current implementation is very slow anyways compared to hardware. Performance improvements can always come later.

(I didn't add the function to `NonZero<iN>`, since _every_ existing `NonZero` method is non-panicking, and it might be nice to leave it that way.)