Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[apex] Replace Jorje with fully open source front-end #3766

Closed
16 tasks done
Tracked by #3898
aaronhurst-google opened this issue Feb 8, 2022 · 30 comments · Fixed by #4806
Closed
16 tasks done
Tracked by #3898

[apex] Replace Jorje with fully open source front-end #3766

aaronhurst-google opened this issue Feb 8, 2022 · 30 comments · Fixed by #4806
Labels
an:enhancement An improvement on existing features / rules in:grammar About the grammar of a lexer or parser, eg, a parse/lex exception
Milestone

Comments

@aaronhurst-google
Copy link
Contributor

aaronhurst-google commented Feb 8, 2022

Description

PMD utilizes the Apex Jorje library to parse Apex source and generate an AST. The fact that this library is closed-source introduces two major headaches for us:

  • Without source transparency, as a policy, there are severe restrictions on how and where code can be run in our environment. The closed-source Jorje dependency is tainting the whole PMD Apex analyzer.
  • This limits enhancement or modification, either to provide early patches for new language features or for other extensions.

I can't imagine it's particularly pleasant to develop on top of an undocumented API. Or to slice-and-dice and integrate new versions.

Describe the solution you'd like

I'd like to explore (and contribute to) the transition to a fully open-source Apex front-end.

In my organization, we're using @nawforce's Apex grammar to parse almost 1MLoC, now with zero issues. Built on top of this, we've got an implementation to convert the resulting parse tree to an abstract syntax tree data structure-- excluding the SOQL query syntax, which isn't relevant here. This is not yet open-sourced, but that is obviously a prerequisite to move forward.

The integration of this code ought to be fairly mechanical, to translate this AST data structure to the existing PMD AST. If there are any gaps or incompatibilities, I guess I'm volunteering to accommodate the requested modifications. Validating the semantic equivalence of the representation is a little trickier, but my sense is that this wouldn't be a big hurdle.

The biggest risk is likely in swapping out one parser for another, and that the tool no longer parses an Apex file that it previously accepted. OTOH, any of these issues could actually be fixed upstream, which is not currently a possibility.

So... that's the idea and the offer... wanted to gauge your interest and gather your concerns.

The chosen open source front-end: Summit AST

... is Summit AST
Maven Repo for Snapshots: https://google.oss.sonatype.org/content/repositories/snapshots/com/google/summit/summit-ast/

Releases: https://central.sonatype.dev/artifact/com.google.summit/summit-ast/1.0.0/overview
https://repo1.maven.org/maven2/com/google/summit/summit-ast/

PR List

Note: All PRs are merged into branch https://github.com/pmd/pmd/tree/experimental-apex-parser

Related work

@aaronhurst-google aaronhurst-google added the an:enhancement An improvement on existing features / rules label Feb 8, 2022
@aaronhurst-google
Copy link
Contributor Author

FYI @adangel since you've been updating jorje.

@rsoesemann
Copy link
Member

rsoesemann commented Feb 8, 2022

@aaronhurst-google thanks for your initiative and the courage to offer to actively work on it.

When I started the apex language module many years ago, I just wasn't able to build my own Apex parser and was happy to get the close-source Jorje by the Eclipse IDE team. For sure I was always aware of all the drawbacks.

The optimal solution would be a public and well documented Parser implementation by Salesforce. But I fear they decided to never bring that.

I ported most of the Java rules to Apex myself and added the facade to map Jorje AST elements. I must say writing new rules has never been hard and the JORJE AST is very stable. So I did never miss a custom Parser implementation like the one @nawforce has built.

I am doubting that "a own parser" would be of so much benefit. Especially when it come to easy upgrading on new language features. What if your team looses interest in keeping the Apex Parser up to date. Just adding the latest Jorje is not an option then anymore?

FYI @adangel @oowekyala

@aaronhurst-google
Copy link
Contributor Author

Thanks @rsoesemann for the thoughtful reply.

It certainly makes sense that Jorje remains a conservative solution. PMD could continue to sail along and reverse engineer and integrate any minor updates to its output... and this is all you really need to keep writing structural matching rules.

Since I'm the one suggesting to rock the boat, allow me to go into more detail...

The up-to-datedness of the language support isn't our pain point, but I understand that this undertaking would introduce a (different) set of risks in this area. Jorje has a recent track record of regular updates and stability, but it dangles the risk of disappearing (due to availability or technology changes) and leaving PMD with no incremental path forward. A new open-source dependency is missing that track record, but it always leaves open the option to fork. Different risk profile. And as you say, maybe not a clear benefit.

The big pain point for us is our internal policy against closed-source code in production environments, which currently applies to PMD because of its inclusion of Jorje. This may not be a concern for many users.. but we're certainly not the only organization concerned with code provenance. JARs checked into github are a giant red flag. I'd like to promulgate the use of PMD, but this is a blocker.

The other area where I feel Jorje is selling PMD short is its extensibility. For example, we built an experimental over-approximate Apex callgraph inside PMD this summer, but the precise version needs type and call resolution. And those are pieces that really ought to be built in the front-end. That's not going to happen on top of the Jorje code, and it's out of scope for their IDE use case. I'm not sure what future analyses are on the roadmap, but semantic analysis is a common prerequisite to many.

@rsoesemann
Copy link
Member

rsoesemann commented Feb 9, 2022

A lot of benefits. That indeed more than outweighing the drawbacks. You are absolutely right and I honestly just love the fact that a company like Google with manpower and people way smarter than me pushing this parser issue. Please go for it.

@rsoesemann rsoesemann added the in:grammar About the grammar of a lexer or parser, eg, a parse/lex exception label Feb 9, 2022
@nawforce
Copy link
Contributor

nawforce commented Feb 9, 2022

Hi Aaron/Robert,

This might be a bit of a tangent to your original posts but thought I should mention that the libraries I created for supporting VSCode extension & CLI plugins are in the middle of a major refactor at the moment. We are adopting Jorje and supplementing with a new Apex parser. There is a specific use case driving this restructuring which I think might cross your interests. This work is being done by a small team in FinancialForce based off a fork of my open source libraries so I can't point you at anything just yet.

If you think this might have interest I would suggest it might be best to have a call to enable to discuss the details a bit better and what the open source plans might look like.

@aaronhurst-google
Copy link
Contributor Author

Hi Kevin! I'm definitely interested to hear more. I'll reach out via email to set something up.

@oowekyala
Copy link
Member

Hi Aaron, I am thrilled you're proposing this and would very much like PMD to migrate away from using Jorje. As you said, pmd-apex is not as widely usable without a fully open-source technology stack. Jorje also poses some hurdles to PMD, for instance, it does not preserve the positions information for all nodes (and when it does, it is approximate). I'm also happy to read that your front-end implementation might also implement semantic analyses like type and call resolution.

The integration of this code ought to be fairly mechanical, to translate this AST data structure to the existing PMD AST.

I'm not sure what you mean by this. If we swap Jorje for an open-source library, I hope we can spare the effort of writing an extensive wrapper API like the one we have for Jorje. This would improve the quality of the API available to rules, and thereby the quality of the rules themselves. Maybe we would need to somewhat synchronize our release cycles though.

I'm looking forward to hearing more about this!

@adangel
Copy link
Member

adangel commented Feb 10, 2022

I fully consent with your assessment: I'd be very happy to throw out the binary blob which is called Jorje and replace it with an open-source solution. I think, open-source solutions have more benefits than draw-backs (you can actually see the code without a decompiler and can fix things upstream).

I can't imagine it's particularly pleasant to develop on top of an undocumented API. Or to slice-and-dice and integrate new versions.

Absolutely agree. I almost never know what the new binary blob brings in (unless we need to update because of a bugfix).

The biggest risk is likely in swapping out one parser for another, and that the tool no longer parses an Apex file that it previously accepted. OTOH, any of these issues could actually be fixed upstream, which is not currently a possibility.

Well, then we have a bug that we can fix - that's how we can move forward and improve. We already have many tests, so we probably only fail on seldom edge-cases (like we still do in our Java parser sometimes).

The integration of this code ought to be fairly mechanical, to translate this AST data structure to the existing PMD AST.

I think, that's a bit difficult to predict. Currently the PMD Apex AST wraps the Jorje nodes 1:1. The PMD Apex AST nodes is the API that the rules see and use. It would be ideal of course, if this API would stay the same, but that might not be possible. It's no problem to update the rules, that we have in our repository, but we need to think about custom rules then (rules developed outside of github.com/pmd/pmd) - how we want to deal with such incompatibilities.

We are working also in parallel on PMD7 (the next major version), where such an change which is likely incompatible would fit in nicely. However, I don't dare to predict when this will be ready...

@adangel
Copy link
Member

adangel commented Feb 10, 2022

Especially when it come to easy upgrading on new language features. What if your team looses interest in keeping the Apex Parser up to date.

We have the same issue with Java where we maintain the grammar/parser ourselves. Someone needs to update it regularly to support the new language features. Luckily Java has meanwhile a clear communication of new features through JEPs (e.g. java 17), so it made it easier to understand the new features (but they still need to be implemented).
I don't know, if Apex has something similar: an overview and detailed description what changes from version to version. Or something like an "Apex Language Specification".

@rsoesemann
Copy link
Member

I don't know, if Apex has something similar: an overview and detailed description what changes from version to version. Or something like an "Apex Language Specification".

@adangel unfortunately there is nothing like this for Apex. There is no structured formal way to easily find out about Jorje changes. The Release Notes documents for sure indicate Language changes but it's a cumbersome process.
The indication for the Switch statement added to Apex was in this document: https://help.salesforce.com/s/articleView?id=release-notes.rn_apex_switch_statement.htm&type=5&release=214

@rsoesemann rsoesemann added this to the 7.0.0 milestone Feb 10, 2022
@rsoesemann
Copy link
Member

@aaronhurst-google I put you and me as assignee. I can help you understand and migrate existing rules to the new parser. Most of them were ported by me in a "hacky fashion, meaning I took the Java code and converted it to Apex without minimal changes. Don't expect it to be nice or minimal code. ;-)

@rsoesemann rsoesemann self-assigned this Feb 10, 2022
@oowekyala
Copy link
Member

@rsoesemann So there is consensus about the fact that this a good idea, but I don't want to rush this and immediately start porting rules or so. I think the best is to organize a call where we could exchange about the long term vision and the next first steps. @aaronhurst-google @rsoesemann @adangel @nawforce Could we do that together? I can send out a doodle poll to figure out a time slot

@rsoesemann
Copy link
Member

rsoesemann commented Feb 10, 2022

Sure @oowekyala I didn't want to say that. This is nothing to rush. Making me an assignee was just to indicate @aaronhurst-google that he will get support by me coping with the existing Apex codebase. And only after we all together agreed on a plan.

@aaronhurst-google
Copy link
Contributor Author

So there is consensus about the fact that this a good idea, but I don't want to rush this and immediately start porting rules or so. I think the best is to organize a call where we could exchange about the long term vision and the next first steps.

Definitely, let's sync up. Feel free to send a poll. (My usual availability is PST working hours, but I can stretch earlier.)

It seems that the role of the existing PMD Apex AST wrapper is one point to cover. I had generally assumed that this layer of abstraction would remain in place (@oowekyala which is why I mentioned a "mechanical" translation aspect). From my point of view, it's easier to target this than to have to dive into the rules and understand the union of their requirements-- even if if in the end this is what matters. If eliminating the wrapper is an important goal, do you have any thoughts on doing this in two steps?

@adangel The 1:1 mapping of course wouldn't necessarily persist with a different AST, but my gut feeling is the structure isn't so different that it would be difficult to split/join a few cases to match the current implementation. I am more cautious about expectations on the general parent/child/sibling relationships (and thus visitation order) of nodes, but it may or may not be a major issue.

@rsoesemann
Copy link
Member

Regarding the consensus: I'm not totally in favour with replacing it (for the reasons mentioned above). Couldn't we find a middle ground like

Making the Parser implementation (plus Wrappers for the AST) pluggable / configurable. People who cannot use Jorje just plug in one of potentially many OSS Parsers. So people or companies that don't care or simply trust Salesforce remain with #Jorje and others plug their own parser.

@adangel @oowekyala @aaronhurst-google @nawforce

What do you think about that?

@oowekyala
Copy link
Member

I had generally assumed that this layer of abstraction would remain in place

If we rely on an open source parser I think it would be interesting to see what its AST API would need to support in order to get rid of the wrapper. The problem with the wrapper is it's supposed to insulate the parser lib from rules. If the parser lib starts to support more complex features like type analysis or dataflow analysis or so, we also need to write a wrapper for these, which is expensive and hard to maintain. Jorje implements type analysis but we never wrote a wrapper for that, so no rules use it, which is a shame. Similarly, the API of wrapper nodes is strictly less useful than the API of the underlying Jorje nodes.

There is no benefit to keeping a wrapper if the parser library has good binary/source compatibility policies. There are many benefits to not having a wrapper, for instance, less overhead, access to all semantic analysis features of the parser lib, less maintenance effort, etc. Since the parser would be open source, we could upstream improvements to the parser lib and AST more directly. AFAICT, the only requirements on the library's AST API would be:

This would be a first for PMD though, so there might be other things I'm not seeing.

Dropping the wrapper obviously has a one-time cost for updating rules, but Apex has fewer rules than Java and this might be reasonable.

The 1:1 mapping of course wouldn't necessarily persist with a different AST, but my gut feeling is the structure isn't so different that it would be difficult to split/join a few cases to match the current implementation. I am more cautious about expectations on the general parent/child/sibling relationships (and thus visitation order) of nodes, but it may or may not be a major issue.

Even if we keep the wrapper API, I think it's important to preserve the 1:1 mapping. However, if your parser produces a different AST from Jorje, then it's the wrapper that needs to be modified to match that AST, not the other way around. This might require updating rules, but that's what we do in all wrapper-based language modules.

@oowekyala
Copy link
Member

Btw I just sent an email with the doodle poll to all people tagged in #3766 (comment). Let me know if you haven't received it

@adangel
Copy link
Member

adangel commented Mar 27, 2022

FYI - since 6.44.0 we have now 3 apex projects integrated in our regression tester. That means, when a PR is opened that changes a Apex rule or anything in apex (like replacing the parser), we should easily see the differences.

@eklimo
Copy link
Contributor

eklimo commented Jul 21, 2022

Hi all,

I am working with @aaronhurst-google and I wanted to provide an update regarding this issue. Development is now underway to replace Jorje in the Apex module with Summit AST, our recently open-sourced Apex AST library.

We are first developing these changes for PMD version 6, but we will also integrate them into version 7 afterwards. Would you be willing to host an experimental branch of version 6 containing this work?

Also, what is your policy on the use of Kotlin for development within PMD?

Please let me know your thoughts.

@aaronhurst-google
Copy link
Contributor Author

Hi @adangel, if you're ok with an experimental branch for 6.x, could you help create that for us? Maybe experimental-apex-parser (or whatever you'd prefer), branched from master.

We're doing the code reviews in a fork. (Unless you prefer a different process.) Would you or @oowekyala like to observe and/or give early feedback at that point?

@adangel
Copy link
Member

adangel commented Jul 28, 2022

Hi @eklimo && @aaronhurst-google ,
thanks for the heads up!

Would you be willing to host an experimental branch of version 6 containing this work?

We can do this. But I'm not sure about the benefit over a fork - unless the plan is to provide two separate PMD versions... I assume, that your changes will eventually end up in the main development branch (still named "master").
You can also create a draft PR with work in progress. However, I probably won't have much time to give early feedback. But the draft PR also runs the build in github-actions, so we have at least the minimal feedback, whether all tests are passing.

Hang-on: Is your plan maybe to create a couple of (more or less smallish) PRs that we merge into experimental-apex-parser? And once the feature is complete, we can merge the branch back into master? That's certainly doable (and preferable over a single big PR).

Also, what is your policy on the use of Kotlin for development within PMD?

We currently use Kotlin only for unit tests at some places (e.g. pmd-lang-test module provides a couple of base test classes). I'm personally a bit reserved on this topic as a (former) eclipse-IDE-only user - where it's very difficult to impossible to develop Kotlin. Using kotlin means, that we accept, that PMD can only be developed with IntelliJ IDEA... which feels like a vendor lock-in.
Also bringing in a mix of languages might make maintenance a bit harder and make it harder for new contributors. However - PMD is a tool that deals with many, many different languages anyway, so this is maybe not a real argument.

However, we discussed this and we are generally open to the idea to increase usage of Kotlin. We came up with the following rules (which we hopefully document at some point in the project documentation):

  • pmd-core should stay in plain Java. This helps in keeping binary compatibility when changing stuff. pmd-core contains the main APIs for all language modules. Currently we release all modules at the same time - so not a real problem -, but that might change in the future (because only few language modules have actual changes per release).
  • For (unit) testing, Kotlin can be used. Also the test frameworks can use Kotlin (pmd-test doesn't yet, pmd-lang-test does already)
  • From now on, we would allow to have the individual language modules be implemented in different languages when it makes sense. So a language module might decide to use plain Java (like now) or also Kotlin (or other languages if it fits).
    • When mixing languages (e.g. Java Kotlin), we need to care that the modules can still be used with plain Java - e.g. when writing custom rules: pmd-java provides a couple of APIs for rules (like symbol table, type resolution) and we should not force the users to use Kotlin (at least not for language modules which already exist and for which users might have written custom rules in Java already).
    • It's probably less a problem, if the entire language module is written entirely in Kotlin. But that probably applies only for new language modules.
  • We might revise these rules/policies in the future.

I hope this all make sense 😄

FYI: @oowekyala

@eklimo
Copy link
Contributor

eklimo commented Aug 3, 2022

Thank you for the thorough response!

A branch on the main repo is beneficial over a fork for a few reasons: As you mentioned, this branch would serve as an alternative PMD 6 version that can be used by those interested in a fully open-source Apex module. We plan to use this branch until PMD 7 is released, and hosting it in this repo provides a stable location for the code. Additionally, it'll enable you to have more oversight of our changes and potentially provide reviews during development.

Our plan is to keep this branch separate from master during version 6 because we can't guarantee that these changes will maintain compatibility. After the code is ported to version 7, it can be merged into the main 7.0 branch if needed.

@aaronhurst-google
Copy link
Contributor Author

Thanks for sharing @rsoesemann. I hadn't seen this. Another Apex grammar is a great community addition!

I've integrated tree-sitter languages into other tools, and it has better support for non-JVM target languages than ANTLR. But that's a moot point for PMD.

I think @nawforce's apex grammar is more mature.... but he has some competition now. 😄

@adangel
Copy link
Member

adangel commented Nov 18, 2022

@adangel @oowekyala QQ: where should ASTFormalComment nodes be attached in the AST? Jorje appeared to assign a nearestNode but its behavior is opaque to me. Is there a standard approach for the other languages?

For Apex, we manually created the nodes for formal comments and added them as the first node to the declaration, to which they belong. We try to determine the "nearestNode" with the location (offsets) - the AST node that immediately follows the comment is the nearest node.

In other languages (like Java), we don't have the comments directly in the AST. Here the comments are added to the root node (

public void setComments(List<Comment> comments) {
) but not added into the tree. It would be nice to have the comments as part of the tree, but adding it afterwards might break many rules (e.g. rules might expect the first child of a node to be of a specific type - with comments, a comment node could appear everywhere). That's why it wasn't added directly in Java. So the "CommentRequired" rule in Java needs to assign the comments to nodes manually.

Having the comments in the AST has the advantage, that the comments are easily accessible by XPath rules as well, without the need of additional functions.

For PMD 7 (java), we have a note in #3784: "Stabilize how Comment nodes are represented".

@aaronhurst-google
Copy link
Contributor Author

Thanks for the explanation @adangel . I found the code you mention in ApexTreeBuilder and was able to reuse it with minimal modification, so at least the ASTFormalComment nodes are attached in the same places.

Agreed: this seems like a tricky problem in general. Some comments are naturally left- and some naturally right- associative, and some apply better to a parent than the nearest leaf node.

@adangel
Copy link
Member

adangel commented Mar 23, 2023

Hi @aaronhurst-google ,

we merged pmd7 into the master branch this month. We plan to ship PMD7 in about 3 months, with the first release candidate due this week.

We really would like to ship your changes with summit-ast with PMD 7.

Could you give it a try to merge current master into the experimental-apex-parser branch?

Let us know, if we can help you.

Regards,
Andreas

@aaronhurst-google
Copy link
Contributor Author

aaronhurst-google commented Mar 30, 2023

@adangel Thanks for the ping. This timing works out well, since I no longer need to build on the internal PMD 6 version. I have a few weeks cleared to work on this. Will get started...

@adangel
Copy link
Member

adangel commented Mar 31, 2023

Awesome, thanks! 👍

@xixiaofinland
Copy link

It's an old post, but just wanna let folks know that Nvim-treesitter already has the open source Apex parser sfapex added. I made a LinkedIn post for this news, and I'm a daily Nvim SF developer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
an:enhancement An improvement on existing features / rules in:grammar About the grammar of a lexer or parser, eg, a parse/lex exception
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants