Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YJIT: Enhance the String#<< method substitution to handle integer codepoint values. #11032

Merged
merged 4 commits into from
Aug 2, 2024

Conversation

nirvdrum
Copy link
Contributor

@nirvdrum nirvdrum commented Jun 20, 2024

This PR extends YJIT's method substitution for String#<< to handle integer codepoints as well. If the string is ASCII-8BIT and the codepoint is a byte value, YJIT will dispatch to rb_str_buf_cat_byte as a fast path for working with binary strings. Otherwise, it'll dispatch to the general rb_str_concat just as vm_opt_ltlt would.

rb_str_buf_cat_byte currently works with both ASCII-8BIT and US-ASCII, but this YJIT side only optimizes for ASCII-8BIT. It could be extended easily enough with an additional comparison. The encoding indices for a handful of encodings, including both ASCII-8BIT and US-ASCII are fixed and sequential, so we could also do a range check. For the time being, I've omitted the handling of US-ASCII. I'd like to get feedback on the simplified PR and extend it with US-ASCII handling if needed (which I'm also not convinced we do).

Please advise if the mechanism I'm using to handle polymorphic dispatch is incorrect. We already have jit_rb_str_concat as a method substitution for String#<<, but it deliberately only handled string arguments. I could have merged the two, but that struck me as being complicated. However, I also don't know if it's fine to call between methods like this. And, it ends up duplicating some of the type checks to ensure we dispatch to the correct type depending on what we see at compile time. I was unsure on how to handle the runtime guards, so please pay extra attention to that.

This comment has been minimized.

yjit/src/codegen.rs Outdated Show resolved Hide resolved
yjit/src/codegen.rs Outdated Show resolved Hide resolved
yjit/src/codegen.rs Outdated Show resolved Hide resolved
@maximecb
Copy link
Contributor

Hi Kevin.

Commenting to give you as much feedback as possible.

This PR extends YJIT's method substitution for String#<< to handle integer codepoints as well. If the string is ASCII-8BIT and the codepoint is a byte value, YJIT will dispatch to rb_str_buf_cat_byte as a fast path for working with binary strings.

I think that the idea of this PR is a good one. It seems like a sensible specialization to implement.

I wrote some comments on the code itself.

Two things I would like to see in a PR like this:

  • Benchmark results to know how it affects performance on protoboeuf and our headline benchmarks (if any impact)
  • Maybe a dump of machine code generated when concatenating a few bytes in a loop just to see if we're doing anything obviously inefficient

Please advise if the mechanism I'm using to handle polymorphic dispatch is incorrect. We already have jit_rb_str_concat as a method substitution for String#<<, but it deliberately only handled string arguments. I could have merged the two, but that struck me as being complicated. However, I also don't know if it's fine to call between methods like this. And, it ends up duplicating some of the type checks to ensure we dispatch to the correct type depending on what we see at compile time. I was unsure on how to handle the runtime guards, so please pay extra attention to that.

Seems sensible. Duplicating checks that occur at run-time would not be good but the checks you duplicated occur at compilation time. It makes sense to write the logic for this in a separate function to avoid ending up with a megafunction that is hard to follow.

@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 2 times, most recently from 42ff8d8 to f118dd1 Compare June 21, 2024 18:04
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 2 times, most recently from 477f936 to 00cc8e4 Compare July 9, 2024 19:35
@tenderlove
Copy link
Member

Before this commit Protoboeuf YJIT is about 5.28x slower than Google's protobuf implementation, but after this commit it is 4.33x slower.

Before:

ruby 3.4.0dev (2024-07-09T17:22:29Z master 6f6aff56b1)  YJIT [arm64-darwin23]
Warming up --------------------------------------
     encode upstream    14.000 i/100ms
   encode protoboeuf     2.000 i/100ms
Calculating -------------------------------------
     encode upstream    140.854 (± 0.7%) i/s -    714.000 in   5.069352s
   encode protoboeuf     26.672 (± 3.7%) i/s -    134.000 in   5.028086s

Comparison:
     encode upstream:      140.9 i/s
   encode protoboeuf:       26.7 i/s - 5.28x  slower

After:

ruby 3.4.0dev (2024-07-09T19:35:29Z yjit-optimize-stri.. 00cc8e4429)  YJIT [arm64-darwin23]
Warming up --------------------------------------
     encode upstream    13.000 i/100ms
   encode protoboeuf     3.000 i/100ms
Calculating -------------------------------------
     encode upstream    137.078 (± 0.7%) i/s -    689.000 in   5.026818s
   encode protoboeuf     31.678 (± 3.2%) i/s -    159.000 in   5.023590s

Comparison:
     encode upstream:      137.1 i/s
   encode protoboeuf:       31.7 i/s - 4.33x  slower

Copy link
Member

@XrXr XrXr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative to what you have is to make a new Rust/C function that essentially does the logic that you currently inline into every site (if codepoint.is_fixnum() { rb_str_buf_cat_byte } else { rb_str_concat }). That way you only generate one ccall at each site.

It should be easier to read than checking in assembler, perf should be about the same, and it's a code size win.

If you write it in C, it could go into string.c and you can additionally avoid removing static from rb_str_buf_cat_byte().

@maximecb
Copy link
Contributor

maximecb commented Jul 9, 2024

I like Alan's idea of embedding some of the checks in a C function to save on code size and maybe make the code a bit more readable.

Also thanks Kevin for persisting. You picked a hard problem for your first big PR! 😉

@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch from 00cc8e4 to 869d076 Compare July 10, 2024 15:43
@nirvdrum nirvdrum marked this pull request as ready for review July 10, 2024 15:43
@matzbot matzbot requested a review from a team July 10, 2024 15:43
@nirvdrum
Copy link
Contributor Author

An alternative to what you have is to make a new Rust/C function that essentially does the logic that you currently inline into every site (if codepoint.is_fixnum() { rb_str_buf_cat_byte } else { rb_str_concat }). That way you only generate one ccall at each site.

Okay. I think you were simplifying, but for completeness the check is really (in pseudo-code) if codepoint.is_fixnum() && receiver.is_binary_string() && codepoint.is_byte(). rb_str_buf_cat_byte does not have that logic. The caller is expected to call it only when those conditions hold (actually, rb_str_buf_cat_byte also works on US-ASCII strings, but I omitted that in YJIT) and there are assertions at the start of rb_str_buf_cat_byte that checks those conditions are held.

If you write it in C, it could go into string.c and you can additionally avoid removing static from rb_str_buf_cat_byte().

I'll have to introduce a new function. I don't think updating rb_str_buf_cat_byte to do those checks is the right way to go. It would mean duplicating checks in the interpreter to derive information we already have.

@XrXr
Copy link
Member

XrXr commented Jul 10, 2024

Yes, you should make clear that the new function you add is a YJIT helper and by not putting it in headers, express that it has many preconditions that are hard to meet and is not for general usage.

yjit/src/codegen.rs Outdated Show resolved Hide resolved
yjit/src/codegen.rs Outdated Show resolved Hide resolved
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 4 times, most recently from c89be19 to edc7703 Compare July 24, 2024 20:36
yjit/src/codegen.rs Outdated Show resolved Hide resolved
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 5 times, most recently from a350c88 to 4026c6d Compare July 25, 2024 17:12
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 3 times, most recently from dff85fe to 3c3aba4 Compare July 26, 2024 19:34
@maximecb
Copy link
Contributor

maximecb commented Jul 29, 2024

I see that you modified the regular string concat to fold it into the C call as well. The C code looks well structured, but the downside of that is that we don't necessarily benefit from type information that YJIT might have.

I benchmarked this PR locally on my macbook:

master: ruby 3.4.0dev (2024-07-29T15:07:53Z master 86a762ce56) [arm64-darwin23]
cat: ruby 3.4.0dev (2024-07-26T19:34:15Z yjit-optimize-stri.. 3c3aba4119) [arm64-darwin23]

----------  -----------  ----------  --------  ----------  -----------  ----------
bench       master (ms)  stddev (%)  cat (ms)  stddev (%)  cat 1st itr  master/cat
protoboeuf  78.7         0.9         81.8      1.1         0.97         0.96      
str_concat  39.7         6.4         42.7      4.0         0.92         0.93      
----------  -----------  ----------  --------  ----------  -----------  ----------

At the moment it seems like it's slightly slower, so not a win at the moment. Initially Aaron had reported a big speedup on protoboeuf with your original patch.

I'm sorry about all the back and forth. I don't fully understand why the original patch had such a big speedup in the first place compared to this. It might be worth digging more into where the overhead is coming from and doing more experiments. Maybe we could organize a pairing session with @XrXr ?

@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch from 3c3aba4 to a49aa17 Compare July 29, 2024 15:39
@XrXr
Copy link
Member

XrXr commented Jul 29, 2024

Initially Aaron had reported a big speedup on protoboeuf with your original patch. ... I don't fully understand why the original patch had such a big speedup in the first place compared to this.

It looks like the speed-up was on the encoding benchmark and you ran the decoding benchmark. #11032 (comment)

@maximecb
Copy link
Contributor

Initially Aaron had reported a big speedup on protoboeuf with your original patch. ... I don't fully understand why the original patch had such a big speedup in the first place compared to this.

It looks like the speed-up was on the encoding benchmark and you ran the decoding benchmark. #11032 (comment)

Mildly embarrassed 😅 Nevertheless, not ideal that it would slow down decode.

It does seem slower on encode, too but we should probably run these benchmarks on an AWS machine as well:

master: ruby 3.4.0dev (2024-07-29T15:07:53Z master 86a762ce56) [arm64-darwin23]
cat: ruby 3.4.0dev (2024-07-26T19:34:15Z yjit-optimize-stri.. 3c3aba4119) [arm64-darwin23]

-----------------  -----------  ----------  --------  ----------  -----------  ----------
bench              master (ms)  stddev (%)  cat (ms)  stddev (%)  cat 1st itr  master/cat
protoboeuf         76.2         2.9         78.6      0.8         1.00         0.97      
protoboeuf-encode  89.8         1.2         92.6      1.4         0.96         0.97      
str_concat         43.8         4.8         44.0      4.0         0.96         0.99      
-----------------  -----------  ----------  --------  ----------  -----------  ----------

Also, @nirvdrum, have you rebased this PR on master recently? Just to make sure it's a fair comparison.

@nirvdrum
Copy link
Contributor Author

I put the code that moves the older jit_rb_str_concat into a separate commit. It should be easy to compare performance with and without that change.

I've been rebasing the branch each time I edit it (although, it's possible I've forgotten to pull first). It's currently based on 86a762c, which matched the version of master you were using. But, I pushed that about 7 minutes after your comment (sorry, I didn't see it before our I would have mentioned it). It looks like the version you used was based off 83b0ced.

@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 3 times, most recently from a49aa17 to 2c2b0ff Compare July 31, 2024 20:15
string.c Outdated Show resolved Hide resolved
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch 2 times, most recently from 0a6e580 to 1fbe118 Compare August 1, 2024 20:05
Performing the check in YJIT means we can make assumptions about the type. It also improves correctness of stack traces in cases where the codepoint argument is not a String or a Fixnum.
@nirvdrum nirvdrum force-pushed the yjit-optimize-string-append-byte branch from 1fbe118 to 44af5ee Compare August 2, 2024 17:58
@maximecb maximecb merged commit 04a6165 into ruby:master Aug 2, 2024
106 checks passed
@maximecb maximecb deleted the yjit-optimize-string-append-byte branch August 2, 2024 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants