Optimize EVMMAX montgomery multiplication #777

chfast · 2024-01-05T10:46:52Z

Now that we have benchmarks for EVMMAX precompiles we can experiment with different Montgomery multiplication algorithms. Currently we use CIOS.

The reference paper we used have a summary of the complexity of each algorithm. However, these may not be directly applicable to small numbers like 256-bit. Check also what other projects are doing.

Also study https://www.amazon.science/blog/better-performing-25519-elliptic-curve-cryptography.

triggeredcode · 2024-01-05T14:26:03Z

@chfast Could I help here? I am good with algorithms, also it will help if I can get a bit more context. Thanks

chfast · 2024-01-05T22:38:30Z

Hi @triggeredcode,

The two important references are already in the description. You should check out our existing code implementing Montgomery multiplication using CIOS algorithm. https://github.com/ethereum/evmone/blob/v0.11.0/lib/evmmax/evmmax.cpp#L55-L83.

We want to know if swapping this algorithm with any other would bring performance improvements. The good overview of the algorithms is https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf. But you can also do your research.

This code is well tested and we have benchmarks so it should be relatively easy to get feedback.

chfast · 2024-01-31T16:17:32Z

Possibly generate the algorithms with https://github.com/mit-plv/fiat-crypto.

clementjuventin · 2024-09-04T20:31:32Z

Hey @chfast, I just saw this issue and I would like to give it a try.

However, I am a beginner and this is my first time working with a benchmark, if I understand correctly, the goal is to improve the performance of the benchmark.

Running this command ./build/bin/evmone-bench test/internal_benchmarks twice, I got these results:

// Execution 1
advanced/total/synth/loop_v1                 5738 ns         5738 ns       121862 gas_rate=1.02273G/s gas_used=5.868k
// Execution 2
advanced/total/synth/loop_v1                 5763 ns         5762 ns       121192 gas_rate=1.01833G/s gas_used=5.868k

So if I understand well, the purpose is to try different implementations of the CIOS version of the Montgomery algorithm such as the FIPS version, Barrett Reduction, or Karatsuba Multiplication and see which one is the most efficient according to the benchmark.

The point I’m wondering about is whether the existing unit tests will be sufficient to ensure the quality of the tested algorithm

Thank you in advance for your response. I am not committing to completing this issue, but I find the problem very interesting and I would like to be able to contribute.

clementjuventin · 2024-09-07T12:22:22Z

Hello @chfast,

I have made progress since the message and implemented the Separated Operand Scanning (SOS) version of the Montgomery algorithm to compare it with the current version. I’m leaving the benchmark traces below, but could you give me some guidance on how to achieve a quality benchmark?

CIOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:06:36 02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.41, 1.11, 1.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.48 ns         3.48 ns    202969266
evmmax_add<uint256, secp256k1>       2.79 ns         2.79 ns    250441458
evmmax_sub<uint256, bn254>           1.78 ns         1.78 ns    393497888
evmmax_sub<uint256, secp256k1>       1.78 ns         1.78 ns    393838716
evmmax_mul<uint256, bn254>           27.5 ns         27.5 ns     25418032
evmmax_mul<uint256, secp256k1>       29.1 ns         29.1 ns     23971772

SOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:08:29 02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.83, 1.38, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.43 ns         3.43 ns    205808992
evmmax_add<uint256, secp256k1>       2.85 ns         2.85 ns    245437966
evmmax_sub<uint256, bn254>           1.79 ns         1.79 ns    392664952
evmmax_sub<uint256, secp256k1>       1.77 ns         1.77 ns    392497508
evmmax_mul<uint256, bn254>           26.7 ns         26.7 ns     26228834
evmmax_mul<uint256, secp256k1>       19.1 ns         19.1 ns     36523676

There are differences that seem to suggest that the SOS version could be more efficient on certain benchmarks, but how can I be sure of the accuracy of the results? Plus, the code compiles, and the unit tests pass, but is that enough to prove the correct implementation of my algorithm?

Could you share the methodology you used or some resources that will help me move forward in line with the project expectations?

chfast · 2024-09-10T10:44:57Z

There are interesting results. Also because people usually suggest to use CIOS. If you can share your code I can also benchmark it.

You can also take a look at #869.

clementjuventin · 2024-09-11T10:56:09Z

My apologies as I realized after posting that message that my algorithm did not pass all the unit tests. I am actively working on it to propose a correct implementation and run new benchmarks.

chfast · 2024-09-11T12:30:47Z

This is fine. You may still want to take a loot at #869 as a smaller optimization step.

chfast added optimization Iproves performance without functional changes precompiles Related to EVM precompiles good first issue Good for newcomers labels Jan 5, 2024

chfast mentioned this issue Jan 5, 2024

Add support for sparce Montgomery multiplication in EVMMAX #778

Open

chfast added EVMMAX and removed precompiles Related to EVM precompiles labels Jan 6, 2024

clementjuventin mentioned this issue Sep 16, 2024

Implement Fast CIOS for Montgomery modular multiplication #869

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize EVMMAX montgomery multiplication #777

Optimize EVMMAX montgomery multiplication #777

chfast commented Jan 5, 2024 •

edited

Loading

triggeredcode commented Jan 5, 2024

chfast commented Jan 5, 2024

chfast commented Jan 31, 2024

clementjuventin commented Sep 4, 2024 •

edited

Loading

clementjuventin commented Sep 7, 2024 •

edited

Loading

chfast commented Sep 10, 2024

clementjuventin commented Sep 11, 2024

chfast commented Sep 11, 2024

Optimize EVMMAX montgomery multiplication #777

Optimize EVMMAX montgomery multiplication #777

Comments

chfast commented Jan 5, 2024 • edited Loading

triggeredcode commented Jan 5, 2024

chfast commented Jan 5, 2024

chfast commented Jan 31, 2024

clementjuventin commented Sep 4, 2024 • edited Loading

clementjuventin commented Sep 7, 2024 • edited Loading

CIOS Implementation

SOS Implementation

chfast commented Sep 10, 2024

clementjuventin commented Sep 11, 2024

chfast commented Sep 11, 2024

chfast commented Jan 5, 2024 •

edited

Loading

clementjuventin commented Sep 4, 2024 •

edited

Loading

clementjuventin commented Sep 7, 2024 •

edited

Loading