Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize EVMMAX montgomery multiplication #777

Open
chfast opened this issue Jan 5, 2024 · 8 comments
Open

Optimize EVMMAX montgomery multiplication #777

chfast opened this issue Jan 5, 2024 · 8 comments
Labels
EVMMAX good first issue Good for newcomers optimization Iproves performance without functional changes

Comments

@chfast
Copy link
Member

chfast commented Jan 5, 2024

Now that we have benchmarks for EVMMAX precompiles we can experiment with different Montgomery multiplication algorithms. Currently we use CIOS.

The reference paper we used have a summary of the complexity of each algorithm. However, these may not be directly applicable to small numbers like 256-bit. Check also what other projects are doing.

Also study https://www.amazon.science/blog/better-performing-25519-elliptic-curve-cryptography.

@chfast chfast added optimization Iproves performance without functional changes precompiles Related to EVM precompiles good first issue Good for newcomers labels Jan 5, 2024
@triggeredcode
Copy link

@chfast Could I help here? I am good with algorithms, also it will help if I can get a bit more context. Thanks

@chfast
Copy link
Member Author

chfast commented Jan 5, 2024

Hi @triggeredcode,

The two important references are already in the description. You should check out our existing code implementing Montgomery multiplication using CIOS algorithm. https://github.com/ethereum/evmone/blob/v0.11.0/lib/evmmax/evmmax.cpp#L55-L83.

We want to know if swapping this algorithm with any other would bring performance improvements. The good overview of the algorithms is https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf. But you can also do your research.

This code is well tested and we have benchmarks so it should be relatively easy to get feedback.

@chfast chfast added EVMMAX and removed precompiles Related to EVM precompiles labels Jan 6, 2024
@chfast
Copy link
Member Author

chfast commented Jan 31, 2024

Possibly generate the algorithms with https://github.com/mit-plv/fiat-crypto.

@clementjuventin
Copy link

clementjuventin commented Sep 4, 2024

Hey @chfast, I just saw this issue and I would like to give it a try.

However, I am a beginner and this is my first time working with a benchmark, if I understand correctly, the goal is to improve the performance of the benchmark.

Running this command ./build/bin/evmone-bench test/internal_benchmarks twice, I got these results:

// Execution 1
advanced/total/synth/loop_v1                 5738 ns         5738 ns       121862 gas_rate=1.02273G/s gas_used=5.868k
// Execution 2
advanced/total/synth/loop_v1                 5763 ns         5762 ns       121192 gas_rate=1.01833G/s gas_used=5.868k

So if I understand well, the purpose is to try different implementations of the CIOS version of the Montgomery algorithm such as the FIPS version, Barrett Reduction, or Karatsuba Multiplication and see which one is the most efficient according to the benchmark.

The point I’m wondering about is whether the existing unit tests will be sufficient to ensure the quality of the tested algorithm

Thank you in advance for your response. I am not committing to completing this issue, but I find the problem very interesting and I would like to be able to contribute.

@clementjuventin
Copy link

clementjuventin commented Sep 7, 2024

Hello @chfast,

I have made progress since the message and implemented the Separated Operand Scanning (SOS) version of the Montgomery algorithm to compare it with the current version. I’m leaving the benchmark traces below, but could you give me some guidance on how to achieve a quality benchmark?

CIOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:06:36 02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.41, 1.11, 1.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.48 ns         3.48 ns    202969266
evmmax_add<uint256, secp256k1>       2.79 ns         2.79 ns    250441458
evmmax_sub<uint256, bn254>           1.78 ns         1.78 ns    393497888
evmmax_sub<uint256, secp256k1>       1.78 ns         1.78 ns    393838716
evmmax_mul<uint256, bn254>           27.5 ns         27.5 ns     25418032
evmmax_mul<uint256, secp256k1>       29.1 ns         29.1 ns     23971772

SOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:08:29 02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.83, 1.38, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.43 ns         3.43 ns    205808992
evmmax_add<uint256, secp256k1>       2.85 ns         2.85 ns    245437966
evmmax_sub<uint256, bn254>           1.79 ns         1.79 ns    392664952
evmmax_sub<uint256, secp256k1>       1.77 ns         1.77 ns    392497508
evmmax_mul<uint256, bn254>           26.7 ns         26.7 ns     26228834
evmmax_mul<uint256, secp256k1>       19.1 ns         19.1 ns     36523676

There are differences that seem to suggest that the SOS version could be more efficient on certain benchmarks, but how can I be sure of the accuracy of the results? Plus, the code compiles, and the unit tests pass, but is that enough to prove the correct implementation of my algorithm?

Could you share the methodology you used or some resources that will help me move forward in line with the project expectations?

@chfast
Copy link
Member Author

chfast commented Sep 10, 2024

There are interesting results. Also because people usually suggest to use CIOS. If you can share your code I can also benchmark it.

You can also take a look at #869.

@clementjuventin
Copy link

My apologies as I realized after posting that message that my algorithm did not pass all the unit tests. I am actively working on it to propose a correct implementation and run new benchmarks.

@chfast
Copy link
Member Author

chfast commented Sep 11, 2024

This is fine. You may still want to take a loot at #869 as a smaller optimization step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EVMMAX good first issue Good for newcomers optimization Iproves performance without functional changes
Projects
None yet
Development

No branches or pull requests

3 participants