Autovectorization #1692
base: master
Conversation
- but I might've broken something. We will see.
- currently appears to unroll x4
- removes dependency on VectorIntegerStamp from places that shouldn't depend on this
- it causes 2 failures with LoopPartialUnrollTest unit tests
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address nvangerow -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address ddegazio -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address usrinivasan -(at)- twitter -(dot)- com. You can sign it at that link. If you think you've already signed it, please comment below and we'll check.
Variable result = getLIRGen().newVariable(vectorKind);
// getLIRGen().append(new AMD64Unary.VectorReadMemory(result, loadAddress));
// Use the line below instead of the line above for the temporary-stack-space load op.
getLIRGen().append(new AMD64Packing.LoadStackOp(getLIRGen(), asAllocatable(result), loadAddress, count));
Currently we go via the stack to support arbitrary-size vectors. This could be slow and there might be a better approach.
public void emitVectorStore(LIRKind kind, int count, Value address, Value value, LIRFrameState state) {
    AMD64AddressValue storeAddress = getAMD64LIRGen().asAddressValue(address);
    // getLIRGen().append(new AMD64Unary.VectorWriteMemory(storeAddress, asAllocatable(value)));
    getLIRGen().append(new AMD64Packing.StoreStackOp(getLIRGen(), storeAddress, asAllocatable(value), count));
Currently we go via the stack to support arbitrary-size vectors. This could be slow and there might be a better approach.
    return 1;
}

protected abstract int maxVectorWidth(PrimitiveStamp stamp);
We've added this method to determine the platform vector length. There may be a better way.
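As a rough illustration of what a platform-specific override of such a hook might look like, here is a minimal sketch. The class and method names below (other than the idea of `maxVectorWidth`) are assumptions for illustration, not the PR's actual API, and the AVX2 numbers are a simplified model.

```java
// Sketch only: a per-platform vector-width hook, loosely modeled on the
// abstract maxVectorWidth method above. Names and numbers are illustrative.
abstract class VectorDescriptionSketch {
    /** Maximum number of lanes for an element of the given size in bytes. */
    protected abstract int maxVectorWidth(int elementBytes);

    /** Hypothetical AVX2 specialization: 256-bit (32-byte) YMM registers. */
    static final class Avx2Description extends VectorDescriptionSketch {
        @Override
        protected int maxVectorWidth(int elementBytes) {
            return 32 / elementBytes; // e.g. 8 int lanes, 4 long lanes
        }
    }

    public static void main(String[] args) {
        VectorDescriptionSketch d = new Avx2Description();
        System.out.println(d.maxVectorWidth(4)); // 8 int lanes on AVX2
        System.out.println(d.maxVectorWidth(8)); // 4 long lanes on AVX2
    }
}
```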
import jdk.vm.ci.meta.ResolvedJavaType;
import jdk.vm.ci.meta.SerializableConstant;

public abstract class VectorPrimitiveStamp extends ArithmeticStamp {
We've added new stamps that represent vector types so that we can reuse many parts of the code generator. Since these changes are fairly invasive, this may pose some challenges when integrating with EE.
    list = Arrays.asList(listValue.split(","));
}

appendPhase(new IsomorphicPackingPhase(
The IsomorphicPackingPhase currently runs in the low tier, in part to avoid having to change other phases to support the vector stamp types and nodes. We'd be interested in your opinion on this approach.
@@ -343,7 +359,12 @@ public ExtractShortOp(AllocatableValue result, AllocatableValue vector, int sele

     @Override
     public void emitCode(CompilationResultBuilder crb, AMD64MacroAssembler masm) {
-        VPEXTRW.emit(masm, XMM, asRegister(result), asRegister(vector), selector);
+        if (isRegister(result)) {
These operations are not used in CE but may have an impact on EE.
@@ -64,7 +65,7 @@ protected void run(StructuredGraph graph, CoreProviders context) {
         if (!LoopTransformations.isUnrollableLoop(loop)) {
             continue;
         }
-        if (getPolicies().shouldPartiallyUnroll(loop)) {
+        if (context instanceof MidTierContext && getPolicies().shouldPartiallyUnroll(loop, ((MidTierContext) context).getVectorDescription())) {
There ought to be a better way to pass this contextual data.
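The pattern being questioned above can be reduced to a small sketch: a phase receives a general context and downcasts via `instanceof` when it needs tier-specific data. The class and method names below are illustrative stand-ins, not the actual Graal types.

```java
// Sketch of passing tier-specific data via an instanceof downcast,
// the pattern used in the hunk above. All names here are hypothetical.
interface CoreContext { }

final class MidTierCtx implements CoreContext {
    // Stand-in for getVectorDescription() on the real MidTierContext.
    String vectorDescription() { return "avx2"; }
}

final class PhaseSketch {
    static String describe(CoreContext context) {
        // Only the mid tier carries a vector description; other tiers fall
        // back to a default. This is the coupling the comment objects to.
        if (context instanceof MidTierCtx) {
            return ((MidTierCtx) context).vectorDescription();
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(describe(new MidTierCtx()));    // avx2
        System.out.println(describe(new CoreContext() { })); // none
    }
}
```

An alternative design would be to widen the phase's context type parameter or move the vector description onto a shared provider interface, avoiding the downcast entirely.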
// `processPreLoopPhis` uses loop.inside() and asserts that nodes have been duplicated.
// However, since the duplication above removes some nodes from the LoopFragmentWhole and
// therefore does not duplicate those nodes, the assertion fails when we encounter nodes
// that were not duplicated.
loop.invalidateInsideFragment();
This was a band-aid solution for an unrolling-related problem. Should probably be removed once unrolling is fixed.
import jdk.vm.ci.meta.MetaAccessProvider;

public class VectorizationLoopPolicies implements LoopPolicies {
These are the policies used for the first unroll pass (now removed). If unrolling happens before guard lowering, this class can be used; otherwise it could be integrated with the default policy.
import java.util.Set;
import java.util.stream.Collectors;

public class DefaultAutovectorizationPolicies implements AutovectorizationPolicies {
These are the heuristics that need some work.
All the authors worked for Twitter Inc. at the time of contribution to this code. Twitter has signed the OCA.
    str.append(" of ");
    str.append(getScalar().toString());
} else {
    str.append("<empty>");
Let's return this constant without string building if !hasValues()?
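The suggestion above can be sketched as a short-circuit before any `StringBuilder` work. This is a minimal illustration of the proposed refactor, not the actual stamp class; the method signature and parameter names are assumptions.

```java
// Hypothetical sketch of the reviewer's suggestion: return a shared constant
// on the empty path instead of building "<empty>" through a StringBuilder.
final class VectorStampToString {
    private static final String EMPTY = "<empty>";

    static String toString(boolean hasValues, int elementCount, String scalar) {
        if (!hasValues) {
            return EMPTY; // no StringBuilder allocation when the stamp is empty
        }
        StringBuilder str = new StringBuilder();
        str.append(elementCount);
        str.append(" of ");
        str.append(scalar);
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(toString(false, 0, null)); // <empty>
        System.out.println(toString(true, 4, "i32")); // 4 of i32
    }
}
```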
Can you share the impact on compile time, code size and peak performance on the workloads as suggested by https://mail.openjdk.java.net/pipermail/graal-dev/2019-July/005862.html when the patch was initially proposed? Specifically interesting for us would be the measured impact for Twitter's services that run the GraalVM compiler in production. From the summary above it seems there is a negative impact even on the Scimark benchmark. This is one of the workloads that should certainly profit from vectorization.
@thomaswue can I use standard instructions to build binaries (AMD64, Linux) for this PR? I'm going to run this benchmark for them with both values of the -Dgraal.Autovectorize flag.
@plokhotnyuk I would assume that the PR is at least in a state where you should be able to do that.
That's right! We use
Thanks for sharing the patch! Any update in terms of the next steps? (e.g. the fix of the performance regression on the Scimark benchmark, or the plan for merging this PR into master). I ran a few workloads which are supposed to benefit from autovectorization. However, I did not see much difference when I applied this PR. It's possible that the loops in my workload are not in a good shape. I guess my question is: what would be good scenarios to use this patch? Or when should I not use this patch because of its current limitations?
We discussed this briefly at the community workshop. There are a lot of possible interactions of vectorization with other loop transformations in the compiler due to different loop shapes. We do have support for auto-vectorization in our EE version. We will work towards having basic support for this feature also in CE. |
@thomaswue Thanks for the information.
Any idea about the timeline for CE to have basic autovec support? Just curious when is a good time to come back and test it again. |
@helloguo You should certainly test the EE version on your workloads ;). It is free for evaluation and I assume it could make a difference at Facebook. Note that auto-vectorization is only a small part of the CE/EE performance-related differences. Productization of this feature in CE will require quite some engineering effort. The current plan is to have it in one of the 20.x releases next year. But take this as a rough estimate only.
I've done some linear algebra benchmarks comparing the EE and CE versions, using JMH. Here are the results doing matrix multiplication using Apache Commons Math (ACM), EJML and ojAlgo. Throughput ops/min on the y-axis, and matrix size (square) on the x-axis. The results vary depending on the library, but the CE is typically slower than HotSpot and the EE typically faster. The difference between CE and EE can be significant.
@thomaswue thanks for the update on a possible timeline!
Graal Autovectorization
Here's our first prototype for Autovectorization in Graal, as mentioned in our email to the graal-dev mailing list in July.
The feature can be enabled using the -Dgraal.Autovectorize=true flag. It's disabled by default.

Current status

The current implementation vectorizes everything in order to stress-test the implementation with the mx unittest tests.

Caveats

The CE LoopPartialUnrollPhase does not unroll loops containing array accesses, due to guard lowering. We've explored several alternatives, like a separate unroll phase that runs before lowering, but this is also not ideal. Right now we have chosen to use one LoopPartialUnrollPhase, so the implementation currently does not leverage SLP in loops.

We need to develop a heuristic to restrict vectorization to cases where it is beneficial. Right now we vectorize everything, and performance takes a hit as a result. The heuristics we need are: a loop unrolling heuristic, a pair packing savings heuristic, and a pack filtering heuristic.
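To make the "pair packing savings" idea concrete, here is a toy cost model: packing scalar operations into a vector operation only pays off when the vector cost plus the pack/unpack overhead beats the summed scalar cost. All numbers and names below are illustrative assumptions, not the PR's actual heuristics.

```java
// Toy sketch of a pair-packing savings heuristic. The cost constants are
// made up for illustration; a real model would come from the target CPU.
final class PackingCostSketch {
    /**
     * Returns true when executing `lanes` scalar ops is estimated to be more
     * expensive than one vector op plus the packing/unpacking overhead.
     */
    static boolean worthPacking(int lanes, int scalarOpCost, int vectorOpCost, int packUnpackCost) {
        int scalarTotal = lanes * scalarOpCost;
        int vectorTotal = vectorOpCost + packUnpackCost;
        return vectorTotal < scalarTotal;
    }

    public static void main(String[] args) {
        // 4 adds at cost 1 each vs. one vector add (1) + packing overhead (2).
        System.out.println(worthPacking(4, 1, 1, 2)); // true
        // With only 2 lanes the packing overhead dominates.
        System.out.println(worthPacking(2, 1, 1, 2)); // false
    }
}
```

This kind of model would also explain the reported Scimark regression: when everything is vectorized unconditionally, cases where `worthPacking` would return false still pay the packing overhead.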
There are a few areas where we're unsure of the approach to take, as we don't want to break anything on the EE side. We'll explicitly call out areas in the patch in places where we think there might be a better approach.
@nvgrw @ddegazio @usrinivasan @christhalinger