Parsing Performance & Maintainability. #3369

andrevidela · 2024-08-09T22:56:05Z

Long term users of the idris programming language will know that one of the weaknesses of the compiler is
it's most common interaction model: the parser. Because the parser is quite a complex piece of software, there
is not a single element that we can point out and fix to resolve all the symptoms we experience because of it.

The role of this issue is to keep track of problems related to the parser, and different efforts deployed to solve
them.

This issue will be closed when the current state of the parser will be improve in at least two aspects: Performance
and maintainability. First, a non-exhaustive list of symptoms we experience and would like to address.

Existing and known problems

Performance is subpar
Inconsistent error recovery messages
Unexpected whitespace rules
Code architecture regarding location is brittle
Code architecture around desugaring and resugaring is brittle
No tree sitter support
Code synthesis generates unparsable code

Next steps

Here are a list of smaller steps that are not necessarily related to code that we can push as a PR on the
project but we still need to do in order to make progress toward better identifying and scoping solutions.

Measure performance in multiple scenarios: Large files, interactive modes, visually ambiguous code, etc.
investigate options around tree sitter. Are there ways to generate idris-compatible code from tree-sitter? Is there a way to generate tree-sitter spec from the parser? What kind of custom lexer do we need to write?
Investigate a better API for file location that is compatible with desugaring and resugaring, and does not require mindful use of FC in constructors but can be handled automatically

Conclusion

There are definitely more things we could do but I thought those would be small, achievable targets in the currently underspecified
goal to address parser runtime performance and maintainability. Please feel free to add to this discussion and share ideas of
small goals we can achieve easily to better design issues. Please, also share your experience with the parser if it is not represented
here in a way that we might be able to address with this kind of project

The text was updated successfully, but these errors were encountered:

andrevidela · 2024-08-10T09:45:38Z

Some personal notes:

Last year @stefan-hoeck showed us a very efficient constant-time parser using his library idris2-parser. My personal assessment was that it's essentially the target to hit in term of performance, but re-writing the entire parser with it might be too much work.
At the beginning of the year I had a chat with @wenkokke about Tree-Sitter, it is undeniable that tree-sitter offers large benefits in terms of ecosystem and accessibility. Because of this, I personally think it is worth investing in a tree-sitter compatible solution. I started working on a prototype to get familiarised with the API but made no progress on the biggest piece of software that we need for an idris2 version: the whitespace lexer.
Across the years, many contributors, including myself, thought about tackling the problem of partial-evaluation for idris. But nothing materialised yet. At this point the problem is so big and complex it's a project in and of itself, because of that, I don't think we should think of it as a "quick fix for parser performance" but as a genuine research project.
There might be a way out of this local minimum by developping a solution akin to happy/bison/BNFC where we have a static definition of the grammar and generate idris2-parser code from it, and a tree-sitter grammar for editor support. This solution takes us away from the realm of research and into pure engineering, which might be more achievable given the resources we are given.
The parser code itself leaves at lot to be desired. Its monolithic architecture makes it hard to test in small fragments. The code is littered with file-tracking code that's unrelated to the parsing itself. The parser performs some desugaring steps before the desugarer, for example in the case of pi-types or sigmas. All this complexity makes any changes to the parser fraught with unexpected challenges. Cleaning this up is going to take some time and we need to be mindful of doing it in a way that enables our other goals.
While it is not directly related to the parser, the desugaring steps and resugaring steps have their own problems that I think should be adressed as part of any efforts to improve the parser: generating provably parsable code. The issue is that there is no way to prove that a fragment of code can be parsed, desugared and resugared in a way that it can be parsed again. In code, we want to have the following lemma: (code : String) -> parse code >>= desugar >>= resugar >>= print ~~~ code where ~~~ is a notion of parsing equivalence where both sides must both parse or fail with a similar error. Additionally we want the internal trees to be the same: (tree : PTerm) -> tree >>= desugar >>= resugar === Just tree
One very visible area where parser, desugaring and resugaring fail is in location tracking. Often, location is lost, inaccurate or wrong, essentialy because location tracking needs to be carefully threaded by the programming at each of those steps and as a programmer there is no way to ensure that the assumption we make will hold after multiple passes of tree transformation, least of which are the desugarer and resugarer, but go much deeper with features like proof search, with clauses, elaborator reflection, and more. Because elaborated code is much harder to deal with, and further away from the parser, I think they are out of scope for this specific project. But desugared code should definitely be aware of the parser's capabilities.

Matthew-Mosior · 2024-08-10T13:51:29Z

Depending on the timeline to re-write the parser (either all at once or in chunks), I feel that it might make the most sense/provide the best return on investment to spend the time/effort to utilize @stefan-hoeck's idris2-parser library. The potential performance improvement alone may be well worth the effort.

jmanuel1 · 2024-08-11T04:25:03Z

There's also an archived attempt at a tree-sitter parser at https://github.com/gwerbin/tree-sitter-idris2 and an active one at https://github.com/kayhide/tree-sitter-idris that seems to be useful for my purposes (program reduction).

I'm most interested in a parser that can produce partial results, in case of an error, and that can produce a concrete syntax tree. I know tree-sitter does the former, but I haven't investigated its ability to do the latter well. These abilities would be useful for a code formatter, too.

wenkokke · 2024-08-11T15:45:08Z

I'd love to support a tree-sitter implementation, but this complicated the manner in which Idris is currently bootstrapped, since it would introduce a dependency on a C library.

It might be better to implement a parser library in Idris which produces incremental and streaming results and has error correction.

That said, the implementation effort would be significantly more complicated than simply reusing tree sitter.

andrevidela pinned this issue Aug 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing Performance & Maintainability. #3369

Parsing Performance & Maintainability. #3369

andrevidela commented Aug 9, 2024

andrevidela commented Aug 10, 2024 •

edited

Loading

Matthew-Mosior commented Aug 10, 2024 •

edited

Loading

jmanuel1 commented Aug 11, 2024

wenkokke commented Aug 11, 2024 •

edited

Loading

Parsing Performance & Maintainability. #3369

Parsing Performance & Maintainability. #3369

Comments

andrevidela commented Aug 9, 2024

Existing and known problems

Next steps

Conclusion

andrevidela commented Aug 10, 2024 • edited Loading

Matthew-Mosior commented Aug 10, 2024 • edited Loading

jmanuel1 commented Aug 11, 2024

wenkokke commented Aug 11, 2024 • edited Loading

andrevidela commented Aug 10, 2024 •

edited

Loading

Matthew-Mosior commented Aug 10, 2024 •

edited

Loading

wenkokke commented Aug 11, 2024 •

edited

Loading