Autovectorization #1692
base: master
Conversation
- but I might've broken something. We will see.
- currently appears to unroll x4
- removes dependency on VectorIntegerStamp from places that shouldn't depend on this
- it causes 2 failures with LoopPartialUnrollTest unit tests
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address nvangerow -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address ddegazio -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address usrinivasan -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
Variable result = getLIRGen().newVariable(vectorKind);
// getLIRGen().append(new AMD64Unary.VectorReadMemory(result, loadAddress));
// Use the line below instead of the line above for the temporary-stack-space load op.
getLIRGen().append(new AMD64Packing.LoadStackOp(getLIRGen(), asAllocatable(result), loadAddress, count));
Currently we go via the stack to support arbitrary-size vectors. This could be slow and there might be a better approach.
public void emitVectorStore(LIRKind kind, int count, Value address, Value value, LIRFrameState state) {
    AMD64AddressValue storeAddress = getAMD64LIRGen().asAddressValue(address);
    // getLIRGen().append(new AMD64Unary.VectorWriteMemory(storeAddress, asAllocatable(value)));
    getLIRGen().append(new AMD64Packing.StoreStackOp(getLIRGen(), storeAddress, asAllocatable(value), count));
Currently we go via the stack to support arbitrary-size vectors. This could be slow and there might be a better approach.
    return 1;
}

protected abstract int maxVectorWidth(PrimitiveStamp stamp);
We've added this method to determine the platform vector length. There may be a better way.
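As a rough illustration of what a platform-specific override of such a hook might look like, here is a minimal sketch. The class and method names below (other than the idea of `maxVectorWidth`) are assumptions for illustration, not the PR's actual API, and the AVX2 numbers are a simplified model.

```java
// Sketch only: a per-platform vector-width hook, loosely modeled on the
// abstract maxVectorWidth method above. Names and numbers are illustrative.
abstract class VectorDescriptionSketch {
    /** Maximum number of lanes for an element of the given size in bytes. */
    protected abstract int maxVectorWidth(int elementBytes);

    /** Hypothetical AVX2 specialization: 256-bit (32-byte) YMM registers. */
    static final class Avx2Description extends VectorDescriptionSketch {
        @Override
        protected int maxVectorWidth(int elementBytes) {
            return 32 / elementBytes; // e.g. 8 int lanes, 4 long lanes
        }
    }

    public static void main(String[] args) {
        VectorDescriptionSketch d = new Avx2Description();
        System.out.println(d.maxVectorWidth(4)); // 8 int lanes on AVX2
        System.out.println(d.maxVectorWidth(8)); // 4 long lanes on AVX2
    }
}
```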
import jdk.vm.ci.meta.ResolvedJavaType;
import jdk.vm.ci.meta.SerializableConstant;

public abstract class VectorPrimitiveStamp extends ArithmeticStamp {
We've added new stamps that represent vector types so that we can reuse many parts of the code generator. Since these changes are fairly invasive, this may pose some challenges when integrating with EE.
    list = Arrays.asList(listValue.split(","));
}

appendPhase(new IsomorphicPackingPhase(
The IsomorphicPackingPhase currently runs in the low tier, in part to avoid having to change other phases to support the vector stamp types and nodes. We'd be interested in your opinion on this approach.
@@ -343,7 +359,12 @@ public ExtractShortOp(AllocatableValue result, AllocatableValue vector, int sele

     @Override
     public void emitCode(CompilationResultBuilder crb, AMD64MacroAssembler masm) {
-        VPEXTRW.emit(masm, XMM, asRegister(result), asRegister(vector), selector);
+        if (isRegister(result)) {
These operations are not used in CE but may have an impact on EE.
@@ -64,7 +65,7 @@ protected void run(StructuredGraph graph, CoreProviders context) {
         if (!LoopTransformations.isUnrollableLoop(loop)) {
             continue;
         }
-        if (getPolicies().shouldPartiallyUnroll(loop)) {
+        if (context instanceof MidTierContext && getPolicies().shouldPartiallyUnroll(loop, ((MidTierContext) context).getVectorDescription())) {
There ought to be a better way to pass this contextual data.
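The pattern being questioned above can be reduced to a small sketch: a phase receives a general context and downcasts via `instanceof` when it needs tier-specific data. The class and method names below are illustrative stand-ins, not the actual Graal types.

```java
// Sketch of passing tier-specific data via an instanceof downcast,
// the pattern used in the hunk above. All names here are hypothetical.
interface CoreContext { }

final class MidTierCtx implements CoreContext {
    // Stand-in for getVectorDescription() on the real MidTierContext.
    String vectorDescription() { return "avx2"; }
}

final class PhaseSketch {
    static String describe(CoreContext context) {
        // Only the mid tier carries a vector description; other tiers fall
        // back to a default. This is the coupling the comment objects to.
        if (context instanceof MidTierCtx) {
            return ((MidTierCtx) context).vectorDescription();
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(describe(new MidTierCtx()));    // avx2
        System.out.println(describe(new CoreContext() { })); // none
    }
}
```

An alternative design would be to widen the phase's context type parameter or move the vector description onto a shared provider interface, avoiding the downcast entirely.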
// `processPreLoopPhis` uses loop.inside() and asserts that nodes have been duplicated.
// However, since the duplication above removes some nodes from the LoopFragmentWhole and
// therefore does not duplicate those nodes, the assertion fails when we encounter nodes
// that were not duplicated.
loop.invalidateInsideFragment();
This was a band-aid solution for an unrolling-related problem. Should probably be removed once unrolling is fixed.
import jdk.vm.ci.meta.MetaAccessProvider;

public class VectorizationLoopPolicies implements LoopPolicies {
These are the policies used for the first unroll pass (now removed). If unrolling happens before guard lowering, this class can be used; otherwise it could be integrated with the default policy.
import java.util.Set;
import java.util.stream.Collectors;

public class DefaultAutovectorizationPolicies implements AutovectorizationPolicies {
These are the heuristics that need some work.
All the authors worked for Twitter Inc. at the time of contribution to this code. Twitter has signed the OCA.
    str.append(" of ");
    str.append(getScalar().toString());
} else {
    str.append("<empty>");
Let's return this constant without string building if !hasValues()?
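The suggestion above can be sketched as a short-circuit before any `StringBuilder` work. This is a minimal illustration of the proposed refactor, not the actual stamp class; the method signature and parameter names are assumptions.

```java
// Hypothetical sketch of the reviewer's suggestion: return a shared constant
// on the empty path instead of building "<empty>" through a StringBuilder.
final class VectorStampToString {
    private static final String EMPTY = "<empty>";

    static String toString(boolean hasValues, int elementCount, String scalar) {
        if (!hasValues) {
            return EMPTY; // no StringBuilder allocation when the stamp is empty
        }
        StringBuilder str = new StringBuilder();
        str.append(elementCount);
        str.append(" of ");
        str.append(scalar);
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(toString(false, 0, null)); // <empty>
        System.out.println(toString(true, 4, "i32")); // 4 of i32
    }
}
```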
Can you share the impact on compile time, code size and peak performance on the workloads as suggested by https://mail.openjdk.java.net/pipermail/graal-dev/2019-July/005862.html when the patch was initially proposed? Specifically interesting for us would be the measured impact for Twitter's services that run the GraalVM compiler in production. From the summary above it seems there is a negative impact even on the Scimark benchmark. This is one of the workloads that should certainly profit from vectorization.
@thomaswue can I use standard instructions to build binaries (AMD64, Linux) for this PR? I'm going to run this benchmark for them with both values of the -Dgraal.Autovectorize flag.
@plokhotnyuk I would assume that the PR is at least in a state where you should be able to do that.
That's right! We use
Thanks for sharing the patch! Any update in terms of the next steps? (e.g. the fix of the performance regression on the Scimark benchmark, or the plan for merging this PR into master). I ran a few workloads which are supposed to benefit from autovectorization. However, I did not see much difference when I applied this PR. It's possible that the loops in my workload are not in a good shape. I guess my question is: what would be good scenarios to use this patch? Or when should I not use this patch because of its current limitations?
We discussed this briefly at the community workshop. There are a lot of possible interactions of vectorization with other loop transformations in the compiler due to different loop shapes. We do have support for auto-vectorization in our EE version. We will work towards having basic support for this feature also in CE. |
@thomaswue Thanks for the information.
Any idea about the timeline for CE to have basic autovec support? Just curious when is a good time to come back and test it again. |
@helloguo You should certainly test the EE version on your workloads ;). It is free for evaluation and I assume it could make a difference at Facebook. Note that auto-vectorization is only a small part of the CE/EE performance-related differences. Productization of this feature in CE will require quite some engineering effort. The current plan is to have it in one of the 20.x releases next year. But take this as a rough estimate only.
I've done some linear algebra benchmarks comparing the EE and CE versions, using JMH. Here are the results doing matrix multiplication using Apache Commons Math (ACM), EJML and ojAlgo. Throughput ops/min on the y-axis, and matrix size (square) on the x-axis. The results vary depending on the library, but the CE is typically slower than HotSpot and the EE typically faster. The difference between CE and EE can be significant.
@thomaswue thanks for the update on a possible timeline!
Graal Autovectorization
Here's our first prototype for Autovectorization in Graal, as mentioned in our email to the graal-dev mailing list in July.
The feature can be enabled using the -Dgraal.Autovectorize=true flag. It's disabled by default.

Current status

The current implementation vectorizes everything in order to stress-test the implementation with the mx unittest tests.

Caveats

The CE LoopPartialUnrollPhase does not unroll loops containing array accesses, due to guard lowering. We've explored several alternatives, like a separate unroll phase that runs before lowering, but this is also not ideal. Right now we have chosen to use one LoopPartialUnrollPhase, so the implementation currently does not leverage SLP in loops.

We need to develop a heuristic to restrict vectorization to cases where it is beneficial. Right now we vectorize everything, and performance takes a hit as a result. The heuristics we need are: a loop unrolling heuristic, a pair packing savings heuristic, and a pack filtering heuristic.
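To make the "pair packing savings" idea concrete, here is a toy cost model: packing scalar operations into a vector operation only pays off when the vector cost plus the pack/unpack overhead beats the summed scalar cost. All numbers and names below are illustrative assumptions, not the PR's actual heuristics.

```java
// Toy sketch of a pair-packing savings heuristic. The cost constants are
// made up for illustration; a real model would come from the target CPU.
final class PackingCostSketch {
    /**
     * Returns true when executing `lanes` scalar ops is estimated to be more
     * expensive than one vector op plus the packing/unpacking overhead.
     */
    static boolean worthPacking(int lanes, int scalarOpCost, int vectorOpCost, int packUnpackCost) {
        int scalarTotal = lanes * scalarOpCost;
        int vectorTotal = vectorOpCost + packUnpackCost;
        return vectorTotal < scalarTotal;
    }

    public static void main(String[] args) {
        // 4 adds at cost 1 each vs. one vector add (1) + packing overhead (2).
        System.out.println(worthPacking(4, 1, 1, 2)); // true
        // With only 2 lanes the packing overhead dominates.
        System.out.println(worthPacking(2, 1, 1, 2)); // false
    }
}
```

This kind of model would also explain the reported Scimark regression: when everything is vectorized unconditionally, cases where `worthPacking` would return false still pay the packing overhead.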
There are a few areas where we're unsure of the approach to take, as we don't want to break anything on the EE side. We'll explicitly call out areas in the patch in places where we think there might be a better approach.
@nvgrw @ddegazio @usrinivasan @christhalinger