
Autovectorization #1692

Open · wants to merge 516 commits into base: master

Conversation

@usrinivasan commented Sep 18, 2019

Graal Autovectorization

Here's our first prototype for Autovectorization in Graal, as mentioned in our email to the graal-dev mailing list in July.

The feature can be enabled using the -Dgraal.Autovectorize=true flag. It's disabled by default.

Current status

  • Implementation of the SLP algorithm.
  • Operand packing and extraction.
  • Code generation for AMD64.
  • Not limited to loops.
    The current prototype vectorizes everything, not just loop bodies, in order to stress-test the implementation.
  • Vector stamps.
  • We're passing all mx unittest tests.
  • Scimark 2.0 runs correctly (but is currently significantly slower, due to loop-unrolling limitations and the lack of a cost heuristic).
  • Support for arbitrary vector lengths (within the vector register limit).
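
For context, here is a minimal illustration (ours, not code from this PR) of the straight-line pattern that SLP packing targets: isomorphic, independent statements over adjacent array elements, which the algorithm can combine into one vector load, add, and store.

```java
// Hypothetical illustration (not from this PR): SLP looks for isomorphic,
// independent statements on adjacent memory, like these four float adds,
// and packs them into a single vector operation.
public class SlpCandidate {
    static void add4(float[] a, float[] b, float[] c) {
        c[0] = a[0] + b[0];
        c[1] = a[1] + b[1];
        c[2] = a[2] + b[2];
        c[3] = a[3] + b[3];
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {10, 20, 30, 40};
        float[] c = new float[4];
        add4(a, b, c);
        System.out.println(java.util.Arrays.toString(c)); // [11.0, 22.0, 33.0, 44.0]
    }
}
```

Because the four statements are independent and operate on consecutive indices, a 4-lane vector add is semantically equivalent to the scalar sequence.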

Caveats

  • Unrolling / bounds elimination
    The CE LoopPartialUnrollPhase does not unroll loops containing array accesses, due to guard lowering. We've explored several alternatives, such as a separate unroll phase that runs before lowering, but that is also not ideal. For now we have settled on a single LoopPartialUnrollPhase, so the implementation currently does not leverage SLP inside loops.
  • Heuristic
    We need to develop a heuristic that restricts vectorization to the cases where it is beneficial. Right now we vectorize everything, and performance takes a hit as a result. The heuristics we need are: a loop-unrolling heuristic, a pair-packing-savings heuristic, and a pack-filtering heuristic.
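
As a rough sketch of what a pair-packing-savings heuristic might weigh (our illustration, not code from the patch): only vectorize when the scalar work saved outweighs the packing and extraction overhead.

```java
// Hypothetical cost-model sketch: a pack of `width` lanes replaces `width`
// scalar ops with one vector op, but pays a packing/extraction overhead.
// Vectorization is only worthwhile when the estimated savings are positive.
public class PackSavings {
    static int estimatedSavings(int width, int scalarOpCost, int vectorOpCost, int packOverhead) {
        int scalarCost = width * scalarOpCost;
        int vectorCost = vectorOpCost + packOverhead;
        return scalarCost - vectorCost;
    }

    public static void main(String[] args) {
        // Profitable: 4 lanes, cheap packing.
        System.out.println(estimatedSavings(4, 1, 1, 1)); // 2
        // Unprofitable: packing overhead dominates.
        System.out.println(estimatedSavings(2, 1, 1, 4)); // -3
    }
}
```

A real heuristic would also account for operand reuse between packs and for the stack round-trips mentioned in the review comments below, but the sign of this difference is the basic filter.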

There are a few areas where we're unsure of the approach to take, as we don't want to break anything on the EE side. We'll explicitly call out areas in the patch in places where we think there might be a better approach.

@nvgrw @ddegazio @usrinivasan @christhalinger

nvgrw and others added 30 commits June 27, 2019 15:53
@graalvmbot (Collaborator)

  • Hello Niklas Vangerow, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address nvangerow -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

  • Hello David Degazio, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address ddegazio -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

  • Hello Uma Srinivasan, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address usrinivasan -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

Variable result = getLIRGen().newVariable(vectorKind);
// getLIRGen().append(new AMD64Unary.VectorReadMemory(result, loadAddress));
// Use the line below instead of the line above for the temporary-stack-space load op.
getLIRGen().append(new AMD64Packing.LoadStackOp(getLIRGen(), asAllocatable(result), loadAddress, count));

Currently we go via the stack to support arbitrary-size vectors. These stack round-trips could be slow, and there might be a better approach.

public void emitVectorStore(LIRKind kind, int count, Value address, Value value, LIRFrameState state) {
AMD64AddressValue storeAddress = getAMD64LIRGen().asAddressValue(address);
// getLIRGen().append(new AMD64Unary.VectorWriteMemory(storeAddress, asAllocatable(value)));
getLIRGen().append(new AMD64Packing.StoreStackOp(getLIRGen(), storeAddress, asAllocatable(value), count));

Currently we go via the stack to support arbitrary-size vectors. These stack round-trips could be slow, and there might be a better approach.

return 1;
}

protected abstract int maxVectorWidth(PrimitiveStamp stamp);

We've created this class method to determine the platform vector length. There could be a better way.
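
For illustration (our sketch with hypothetical names, not the PR's actual implementation), such a method would typically derive the lane count from the platform's vector register width and the element size:

```java
// Hypothetical sketch: maximum number of lanes for an element of
// `elementBytes` bytes on a platform with `registerBits`-bit vector
// registers (e.g. 128 for SSE, 256 for AVX2).
public class VectorWidth {
    static int maxVectorWidth(int registerBits, int elementBytes) {
        return registerBits / (8 * elementBytes);
    }

    public static void main(String[] args) {
        System.out.println(maxVectorWidth(256, 4)); // 8 ints per AVX2 register
        System.out.println(maxVectorWidth(128, 8)); // 2 longs per SSE register
    }
}
```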

import jdk.vm.ci.meta.ResolvedJavaType;
import jdk.vm.ci.meta.SerializableConstant;

public abstract class VectorPrimitiveStamp extends ArithmeticStamp {

We've added new stamps that represent vector types so we can re-use many parts of the code generator. Since these changes are fairly invasive, this may pose some challenges when integrating with EE.

list = Arrays.asList(listValue.split(","));
}

appendPhase(new IsomorphicPackingPhase(

The IsomorphicPackingPhase (IPP) currently runs in the low tier, in part to avoid having to change other phases to support the vector stamp types and nodes. We're interested in your opinion on this approach.

@@ -343,7 359,12 @@ public ExtractShortOp(AllocatableValue result, AllocatableValue vector, int sele

@Override
public void emitCode(CompilationResultBuilder crb, AMD64MacroAssembler masm) {
VPEXTRW.emit(masm, XMM, asRegister(result), asRegister(vector), selector);
if (isRegister(result)) {

These operations are not used in CE but may have an impact on EE

@@ -64,7 65,7 @@ protected void run(StructuredGraph graph, CoreProviders context) {
if (!LoopTransformations.isUnrollableLoop(loop)) {
continue;
}
if (getPolicies().shouldPartiallyUnroll(loop)) {
if (context instanceof MidTierContext && getPolicies().shouldPartiallyUnroll(loop, ((MidTierContext) context).getVectorDescription())) {

There ought to be a better way to pass this contextual data
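
One alternative (our sketch, with hypothetical names) would be to inject the vector description when the phase is constructed, avoiding the instanceof downcast on the phase context:

```java
// Hypothetical sketch: pass the vector description to the phase at
// construction time instead of downcasting the phase context at run time.
public class UnrollPhaseSketch {
    static class VectorDescription {
        final int maxVectorBits;
        VectorDescription(int maxVectorBits) { this.maxVectorBits = maxVectorBits; }
    }

    interface UnrollPolicy {
        boolean shouldPartiallyUnroll(VectorDescription vd);
    }

    static class PartialUnrollPhase {
        private final VectorDescription vectorDescription; // injected once

        PartialUnrollPhase(VectorDescription vd) { this.vectorDescription = vd; }

        boolean run(UnrollPolicy policy) {
            // No instanceof check needed: the description is always available.
            return policy.shouldPartiallyUnroll(vectorDescription);
        }
    }

    public static void main(String[] args) {
        PartialUnrollPhase phase = new PartialUnrollPhase(new VectorDescription(256));
        System.out.println(phase.run(vd -> vd.maxVectorBits >= 128)); // true
    }
}
```

The trade-off is that every tier constructing the phase must then supply a description, which may or may not fit the existing phase-suite plumbing.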

// `processPreLoopPhis` utilizes loop.inside() and asserts that nodes have been duplicated.
// However, since the duplication above removes some nodes from the LoopFragmentWhole and
// therefore does not duplicate those nodes, the assertion fails when we encounter nodes that were not duplicated.
loop.invalidateInsideFragment();

This was a band-aid solution for an unrolling-related problem. Should probably be removed once unrolling is fixed.


import jdk.vm.ci.meta.MetaAccessProvider;

public class VectorizationLoopPolicies implements LoopPolicies {

Policies to use for the first unroll pass (now removed). If we unroll before guard lowering, this can be used; otherwise it could be integrated with the default policy.

import java.util.Set;
import java.util.stream.Collectors;

public class DefaultAutovectorizationPolicies implements AutovectorizationPolicies {

These are the heuristics that need some work

@usrinivasan (Author)

  • Hello Niklas Vangerow, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address nvangerow -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

  • Hello David Degazio, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address ddegazio -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

  • Hello Uma Srinivasan, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address usrinivasan -(at)- twitter -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

All the authors worked for Twitter Inc. at the time of contribution to this code. Twitter has signed the OCA.

@usrinivasan marked this pull request as ready for review September 19, 2019 00:27
str.append(" of ");
str.append(getScalar().toString());
} else {
str.append("<empty>");

let's return this constant without string building if !hasValues() ?
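
The suggestion above would look something like this (a self-contained sketch with hypothetical class and field names, not the PR's actual stamp class):

```java
// Hypothetical sketch of the suggested early return: skip the
// StringBuilder entirely when the stamp has no values.
public class PackToString {
    private final String scalar; // null models !hasValues()

    PackToString(String scalar) { this.scalar = scalar; }

    boolean hasValues() { return scalar != null; }

    @Override
    public String toString() {
        if (!hasValues()) {
            return "<empty>"; // constant returned directly, no string building
        }
        return "pack of " + scalar;
    }

    public static void main(String[] args) {
        System.out.println(new PackToString(null));  // <empty>
        System.out.println(new PackToString("i32")); // pack of i32
    }
}
```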

@thomaswue self-assigned this Sep 25, 2019
@thomaswue (Member)

Can you share the impact on compile time, code size and peak performance on the workloads as suggested by https://mail.openjdk.java.net/pipermail/graal-dev/2019-July/005862.html when the patch was initially proposed? Specifically interesting for us would be the measured impact for Twitter's services that run the GraalVM compiler in production.

From the summary above it seems there is a negative impact even on the Scimark benchmark. This is one of the workloads that should certainly profit from vectorization.

@plokhotnyuk commented Sep 25, 2019

@thomaswue can I use standard instructions to build binaries (AMD64, Linux) for this PR?

I'm going to run this benchmark for them with both values of the -Dgraal.Autovectorize option.

@thomaswue (Member)

@plokhotnyuk I would assume that the PR is at least in a state where you should be able to do that.

@usrinivasan (Author)

@plokhotnyuk I would assume that the PR is at least in a state where you should be able to do that.

That's right! We use mx build to build the graal jar and use it with a jdk11 vm to run tests. Please let us know if you run into any issues.

@helloguo commented Dec 1, 2019

Thanks for sharing the patch! Any update in terms of the next steps? (e.g. the fix of performance regression for Scimark benchmark, the plan of merging this PR into master).

I ran a few workloads that are supposed to benefit from autovectorization. However, I did not see much difference when I applied this PR. It's possible that the loops in my workloads are not in good shape. I guess my question is: what would be good scenarios for this patch? Or when should I not use it, given its current limitations?

@thomaswue (Member)

We discussed this briefly at the community workshop. There are a lot of possible interactions of vectorization with other loop transformations in the compiler due to different loop shapes. We do have support for auto-vectorization in our EE version. We will work towards having basic support for this feature also in CE.

@helloguo commented Dec 1, 2019

@thomaswue Thanks for the information.

We do have support for auto-vectorization in our EE version. We will work towards having basic support for this feature also in CE.

Any idea about the timeline for CE to have basic autovec support? Just curious when is a good time to come back and test it again.

@thomaswue (Member)

@helloguo You should certainly test the EE version on your workloads ;). It is free for evaluation and I assume it could make a difference at Facebook. Note that auto-vectorization is only a small part of the CE/EE performance-related differences.

Productization of this feature in CE will require quite some engineering effort. The current plan is to have it in one of the 20.x releases next year. But take this as a rough estimate only.

@apete commented Dec 2, 2019

I've done some linear algebra benchmarks comparing the EE and CE versions, using JMH. Here are the results doing matrix multiplication using Apache Commons Math (ACM), EJML and ojAlgo.

[Chart: FillByMultiplying. Throughput (ops/min) on the y-axis, matrix size (square) on the x-axis.]

The results vary depending on the library, but the CE is typically slower than HotSpot and the EE typically faster. The difference between CE and EE can be significant.

@christhalinger (Contributor)

@thomaswue thanks for the update on a possible timeline!

