-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[apex] Replace Jorje with fully open source front-end #3766
Comments
FYI @adangel since you've been updating jorje. |
@aaronhurst-google thanks for your initiative and the courage to offer to actively work on it. When I started the apex language module many years ago, I just wasn't able to build my own Apex parser and was happy to get the close-source Jorje by the Eclipse IDE team. For sure I was always aware of all the drawbacks. The optimal solution would be a public and well documented Parser implementation by Salesforce. But I fear they decided to never bring that. I ported most of the Java rules to Apex myself and added the facade to map Jorje AST elements. I must say writing new rules has never been hard and the JORJE AST is very stable. So I did never miss a custom Parser implementation like the one @nawforce has built. I am doubting that "a own parser" would be of so much benefit. Especially when it come to easy upgrading on new language features. What if your team looses interest in keeping the Apex Parser up to date. Just adding the latest Jorje is not an option then anymore? FYI @adangel @oowekyala |
Thanks @rsoesemann for the thoughtful reply. It certainly makes sense that Jorje remains a conservative solution. PMD could continue to sail along and reverse engineer and integrate any minor updates to its output... and this is all you really need to keep writing structural matching rules. Since I'm the one suggesting to rock the boat, allow me to go into more detail... The up-to-datedness of the language support isn't our pain point, but I understand that this undertaking would introduce a (different) set of risks in this area. Jorje has a recent track record of regular updates and stability, but it dangles the risk of disappearing (due to availability or technology changes) and leaving PMD with no incremental path forward. A new open-source dependency is missing that track record, but it always leaves open the option to fork. Different risk profile. And as you say, maybe not a clear benefit. The big pain point for us is our internal policy against closed-source code in production environments, which currently applies to PMD because of its inclusion of Jorje. This may not be a concern for many users.. but we're certainly not the only organization concerned with code provenance. JARs checked into github are a giant red flag. I'd like to promulgate the use of PMD, but this is a blocker. The other area where I feel Jorje is selling PMD short is its extensibility. For example, we built an experimental over-approximate Apex callgraph inside PMD this summer, but the precise version needs type and call resolution. And those are pieces that really ought to be built in the front-end. That's not going to happen on top of the Jorje code, and it's out of scope for their IDE use case. I'm not sure what future analyses are on the roadmap, but semantic analysis is a common prerequisite to many. |
A lot of benefits. That indeed more than outweighing the drawbacks. You are absolutely right and I honestly just love the fact that a company like Google with manpower and people way smarter than me pushing this parser issue. Please go for it. |
Hi Aaron/Robert, This might be a bit of a tangent to your original posts but thought I should mention that the libraries I created for supporting VSCode extension & CLI plugins are in the middle of a major refactor at the moment. We are adopting Jorje and supplementing with a new Apex parser. There is a specific use case driving this restructuring which I think might cross your interests. This work is being done by a small team in FinancialForce based off a fork of my open source libraries so I can't point you at anything just yet. If you think this might have interest I would suggest it might be best to have a call to enable to discuss the details a bit better and what the open source plans might look like. |
Hi Kevin! I'm definitely interested to hear more. I'll reach out via email to set something up. |
Hi Aaron, I am thrilled you're proposing this and would very much like PMD to migrate away from using Jorje. As you said, pmd-apex is not as widely usable without a fully open-source technology stack. Jorje also poses some hurdles to PMD, for instance, it does not preserve the positions information for all nodes (and when it does, it is approximate). I'm also happy to read that your front-end implementation might also implement semantic analyses like type and call resolution.
I'm not sure what you mean by this. If we swap Jorje for an open-source library, I hope we can spare the effort of writing an extensive wrapper API like the one we have for Jorje. This would improve the quality of the API available to rules, and thereby the quality of the rules themselves. Maybe we would need to somewhat synchronize our release cycles though. I'm looking forward to hearing more about this! |
I fully consent with your assessment: I'd be very happy to throw out the binary blob which is called Jorje and replace it with an open-source solution. I think, open-source solutions have more benefits than draw-backs (you can actually see the code without a decompiler and can fix things upstream).
Absolutely agree. I almost never know what the new binary blob brings in (unless we need to update because of a bugfix).
Well, then we have a bug that we can fix - that's how we can move forward and improve. We already have many tests, so we probably only fail on seldom edge-cases (like we still do in our Java parser sometimes).
I think, that's a bit difficult to predict. Currently the PMD Apex AST wraps the Jorje nodes 1:1. The PMD Apex AST nodes is the API that the rules see and use. It would be ideal of course, if this API would stay the same, but that might not be possible. It's no problem to update the rules, that we have in our repository, but we need to think about custom rules then (rules developed outside of github.com/pmd/pmd) - how we want to deal with such incompatibilities. We are working also in parallel on PMD7 (the next major version), where such an change which is likely incompatible would fit in nicely. However, I don't dare to predict when this will be ready... |
We have the same issue with Java where we maintain the grammar/parser ourselves. Someone needs to update it regularly to support the new language features. Luckily Java has meanwhile a clear communication of new features through JEPs (e.g. java 17), so it made it easier to understand the new features (but they still need to be implemented). |
@adangel unfortunately there is nothing like this for Apex. There is no structured formal way to easily find out about Jorje changes. The Release Notes documents for sure indicate Language changes but it's a cumbersome process. |
@aaronhurst-google I put you and me as assignee. I can help you understand and migrate existing rules to the new parser. Most of them were ported by me in a "hacky fashion, meaning I took the Java code and converted it to Apex without minimal changes. Don't expect it to be nice or minimal code. ;-) |
@rsoesemann So there is consensus about the fact that this a good idea, but I don't want to rush this and immediately start porting rules or so. I think the best is to organize a call where we could exchange about the long term vision and the next first steps. @aaronhurst-google @rsoesemann @adangel @nawforce Could we do that together? I can send out a doodle poll to figure out a time slot |
Sure @oowekyala I didn't want to say that. This is nothing to rush. Making me an assignee was just to indicate @aaronhurst-google that he will get support by me coping with the existing Apex codebase. And only after we all together agreed on a plan. |
Definitely, let's sync up. Feel free to send a poll. (My usual availability is PST working hours, but I can stretch earlier.) It seems that the role of the existing PMD Apex AST wrapper is one point to cover. I had generally assumed that this layer of abstraction would remain in place (@oowekyala which is why I mentioned a "mechanical" translation aspect). From my point of view, it's easier to target this than to have to dive into the rules and understand the union of their requirements-- even if if in the end this is what matters. If eliminating the wrapper is an important goal, do you have any thoughts on doing this in two steps? @adangel The 1:1 mapping of course wouldn't necessarily persist with a different AST, but my gut feeling is the structure isn't so different that it would be difficult to split/join a few cases to match the current implementation. I am more cautious about expectations on the general parent/child/sibling relationships (and thus visitation order) of nodes, but it may or may not be a major issue. |
Regarding the consensus: I'm not totally in favour with replacing it (for the reasons mentioned above). Couldn't we find a middle ground like Making the Parser implementation (plus Wrappers for the AST) pluggable / configurable. People who cannot use Jorje just plug in one of potentially many OSS Parsers. So people or companies that don't care or simply trust Salesforce remain with #Jorje and others plug their own parser. @adangel @oowekyala @aaronhurst-google @nawforce What do you think about that? |
If we rely on an open source parser I think it would be interesting to see what its AST API would need to support in order to get rid of the wrapper. The problem with the wrapper is it's supposed to insulate the parser lib from rules. If the parser lib starts to support more complex features like type analysis or dataflow analysis or so, we also need to write a wrapper for these, which is expensive and hard to maintain. Jorje implements type analysis but we never wrote a wrapper for that, so no rules use it, which is a shame. Similarly, the API of wrapper nodes is strictly less useful than the API of the underlying Jorje nodes. There is no benefit to keeping a wrapper if the parser library has good binary/source compatibility policies. There are many benefits to not having a wrapper, for instance, less overhead, access to all semantic analysis features of the parser lib, less maintenance effort, etc. Since the parser would be open source, we could upstream improvements to the parser lib and AST more directly. AFAICT, the only requirements on the library's AST API would be:
This would be a first for PMD though, so there might be other things I'm not seeing. Dropping the wrapper obviously has a one-time cost for updating rules, but Apex has fewer rules than Java and this might be reasonable.
Even if we keep the wrapper API, I think it's important to preserve the 1:1 mapping. However, if your parser produces a different AST from Jorje, then it's the wrapper that needs to be modified to match that AST, not the other way around. This might require updating rules, but that's what we do in all wrapper-based language modules. |
Btw I just sent an email with the doodle poll to all people tagged in #3766 (comment). Let me know if you haven't received it |
FYI - since 6.44.0 we have now 3 apex projects integrated in our regression tester. That means, when a PR is opened that changes a Apex rule or anything in apex (like replacing the parser), we should easily see the differences. |
Hi all, I am working with @aaronhurst-google and I wanted to provide an update regarding this issue. Development is now underway to replace Jorje in the Apex module with Summit AST, our recently open-sourced Apex AST library. We are first developing these changes for PMD version 6, but we will also integrate them into version 7 afterwards. Would you be willing to host an experimental branch of version 6 containing this work? Also, what is your policy on the use of Kotlin for development within PMD? Please let me know your thoughts. |
Hi @adangel, if you're ok with an experimental branch for 6.x, could you help create that for us? Maybe We're doing the code reviews in a fork. (Unless you prefer a different process.) Would you or @oowekyala like to observe and/or give early feedback at that point? |
Hi @eklimo && @aaronhurst-google ,
We can do this. But I'm not sure about the benefit over a fork - unless the plan is to provide two separate PMD versions... I assume, that your changes will eventually end up in the main development branch (still named "master"). Hang-on: Is your plan maybe to create a couple of (more or less smallish) PRs that we merge into
We currently use Kotlin only for unit tests at some places (e.g. pmd-lang-test module provides a couple of base test classes). I'm personally a bit reserved on this topic as a (former) eclipse-IDE-only user - where it's very difficult to impossible to develop Kotlin. Using kotlin means, that we accept, that PMD can only be developed with IntelliJ IDEA... which feels like a vendor lock-in. However, we discussed this and we are generally open to the idea to increase usage of Kotlin. We came up with the following rules (which we hopefully document at some point in the project documentation):
I hope this all make sense 😄 FYI: @oowekyala |
Thank you for the thorough response! A branch on the main repo is beneficial over a fork for a few reasons: As you mentioned, this branch would serve as an alternative PMD 6 version that can be used by those interested in a fully open-source Apex module. We plan to use this branch until PMD 7 is released, and hosting it in this repo provides a stable location for the code. Additionally, it'll enable you to have more oversight of our changes and potentially provide reviews during development. Our plan is to keep this branch separate from |
Thanks for sharing @rsoesemann. I hadn't seen this. Another Apex grammar is a great community addition! I've integrated tree-sitter languages into other tools, and it has better support for non-JVM target languages than ANTLR. But that's a moot point for PMD. I think @nawforce's apex grammar is more mature.... but he has some competition now. 😄 |
For Apex, we manually created the nodes for formal comments and added them as the first node to the declaration, to which they belong. We try to determine the "nearestNode" with the location (offsets) - the AST node that immediately follows the comment is the nearest node. In other languages (like Java), we don't have the comments directly in the AST. Here the comments are added to the root node ( pmd/pmd-java/src/main/java/net/sourceforge/pmd/lang/java/ast/ASTCompilationUnit.java Line 38 in c70f894
Having the comments in the AST has the advantage, that the comments are easily accessible by XPath rules as well, without the need of additional functions. For PMD 7 (java), we have a note in #3784: "Stabilize how Comment nodes are represented". |
Thanks for the explanation @adangel . I found the code you mention in ApexTreeBuilder and was able to reuse it with minimal modification, so at least the Agreed: this seems like a tricky problem in general. Some comments are naturally left- and some naturally right- associative, and some apply better to a parent than the nearest leaf node. |
Hi @aaronhurst-google , we merged pmd7 into the master branch this month. We plan to ship PMD7 in about 3 months, with the first release candidate due this week. We really would like to ship your changes with summit-ast with PMD 7. Could you give it a try to merge current master into the Let us know, if we can help you. Regards, |
@adangel Thanks for the ping. This timing works out well, since I no longer need to build on the internal PMD 6 version. I have a few weeks cleared to work on this. Will get started... |
Awesome, thanks! 👍 |
It's an old post, but just wanna let folks know that Nvim-treesitter already has the open source Apex parser |
Description
PMD utilizes the Apex Jorje library to parse Apex source and generate an AST. The fact that this library is closed-source introduces two major headaches for us:
I can't imagine it's particularly pleasant to develop on top of an undocumented API. Or to slice-and-dice and integrate new versions.
Describe the solution you'd like
I'd like to explore (and contribute to) the transition to a fully open-source Apex front-end.
In my organization, we're using @nawforce's Apex grammar to parse almost 1MLoC, now with zero issues. Built on top of this, we've got an implementation to convert the resulting parse tree to an abstract syntax tree data structure-- excluding the SOQL query syntax, which isn't relevant here. This is not yet open-sourced, but that is obviously a prerequisite to move forward.
The integration of this code ought to be fairly mechanical, to translate this AST data structure to the existing PMD AST. If there are any gaps or incompatibilities, I guess I'm volunteering to accommodate the requested modifications. Validating the semantic equivalence of the representation is a little trickier, but my sense is that this wouldn't be a big hurdle.
The biggest risk is likely in swapping out one parser for another, and that the tool no longer parses an Apex file that it previously accepted. OTOH, any of these issues could actually be fixed upstream, which is not currently a possibility.
So... that's the idea and the offer... wanted to gauge your interest and gather your concerns.
The chosen open source front-end: Summit AST
... is Summit AST
Maven Repo for Snapshots: https://google.oss.sonatype.org/content/repositories/snapshots/com/google/summit/summit-ast/
Releases: https://central.sonatype.dev/artifact/com.google.summit/summit-ast/1.0.0/overview
https://repo1.maven.org/maven2/com/google/summit/summit-ast/
PR List
Note: All PRs are merged into branch https://github.com/pmd/pmd/tree/experimental-apex-parser
Related work
ast
package #4081The text was updated successfully, but these errors were encountered: