Why is the lambda being parsed as an identifier instead of a keyword in this grammar?
#3386
I'm trying to parse this syntax:

But after parsing

Here is the grammar:

module.exports = grammar({
  name: 'bend',

  extras: $ => [
    $.comment,
    /[\s\f\uFEFF\u2060\u200B]|\r?\n/,
  ],

  externals: $ => [
    $._newline,
    $._indent,
    $._dedent,
    $.comment,
  ],

  inline: $ => [
    $._simple_statement,
    $.expression,
    $.simple_expression,
  ],

  word: $ => $._id,

  rules: {
    source_file: $ => $._top_level_defs,

    _top_level_defs: $ => choice(
      $.function_definition,
    ),

    _statement: $ => choice(
      $._simple_statements,
    ),

    // Simple statements
    _simple_statements: $ => seq(
      $._simple_statement,
      $._newline,
    ),

    _simple_statement: $ => choice(
      $.return_statement,
    ),

    return_statement: $ => seq(
      'return',
      $.expression,
    ),

    // Compound statements
    function_definition: $ => seq(
      'def',
      field('name', $.identifier),
      field('parameters', $.parameters),
      ':',
      $.body,
    ),

    parameters: $ => seq(
      '(',
      optional($._parameters),
      ')',
    ),

    _parameters: $ => seq(
      commaSep1($.identifier),
      optional(',')
    ),

    body: $ => seq(
      $._indent,
      repeat($._statement),
      $._dedent,
    ),

    // Expressions
    expression: $ => choice(
      $.lambda_expression,
      $.simple_expression,
    ),

    simple_expression: $ => choice(
      $.identifier,
    ),

    lambda_expression: $ => seq(
      choice('λ', 'lambda'),
      alias(optionalCommaSep1($.expression), $.parameters),
      optional(','),
      ':',
      field('body', $.expression)
    ),

    // Identifier without two consecutive underscores __
    _top_level_identifier: _ => /[a-zA-Z][A-Za-z0-9.-/]*(?:_[A-Za-z0-9.-/]+)*/,
    _id: _ => /[a-zA-Z][A-Za-z0-9.-/]*/,
    identifier: $ => choice($._id, $._top_level_identifier),

    comment: _ => token(seq('#', /.*/)),
  },
});

function optionalCommaSep1(rule) {
  return repeat1(seq(optional(','), rule))
}

function commaSep1(rule) {
  return sep1(rule, ',');
}

function sep1(rule, separator) {
  return seq(rule, repeat(seq(separator, rule)));
}
I think the problem in this case lies with how you've implicitly defined lexical precedence between `_top_level_identifier` and `_id`.

It's mentioned in the docs, but it's not immediately obvious: when you place two terminals (e.g. patterns formed from a regex or literals formed from a string) one after the other, tree-sitter will assign the first one a higher lexical precedence. In this case, that means `_top_level_identifier` ranks higher than `_id`. Then, you happen to make `_id` the word token, meaning that tree-sitter will attempt to lex these whenever tokens like 'lambda' are valid too, and here comes the problem. We are trying to see if either the word `_id` or…

You can easily fix this by just swapping the order of … A rule of thumb that I follow (and maybe this should be documented, tbh) is to place your word token rule (in this case …
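To make the tie-breaking described above concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not tree-sitter's actual lexer: `lexWord`, `KEYWORDS`, and the `keyword:` naming are hypothetical, and the two regexes are simplified stand-ins for the identifier patterns in the grammar. It assumes only what the reply states: when two terminals match the same text, the one declared first wins, and a keyword is only recognized when the winning token is the word token.

```javascript
// Toy model only: illustrates declaration-order tie-breaking, not tree-sitter itself.
// Simplified stand-ins for the grammar's two identifier patterns (anchored at the start).
const ID = /^[a-zA-Z][A-Za-z0-9.-/]*/;
const TOP_LEVEL_ID = /^[a-zA-Z][A-Za-z0-9.-/]*(?:_[A-Za-z0-9.-/]+)*/;

// Hypothetical keyword set, taken from the string literals used in the grammar.
const KEYWORDS = new Set(['lambda', 'return', 'def']);

function lexWord(input, tokens, wordToken) {
  let best = null;
  for (const { name, pattern } of tokens) {
    const m = pattern.exec(input);
    // Strict '>' keeps the earlier-declared token when match lengths tie.
    if (m && (best === null || m[0].length > best.text.length)) {
      best = { name, text: m[0] };
    }
  }
  if (best === null) return null;
  // Keyword promotion is only attempted when the winner is the word token.
  if (best.name === wordToken && KEYWORDS.has(best.text)) {
    return { name: `keyword:${best.text}`, text: best.text };
  }
  return best;
}

// Declaration order as in the question: _top_level_identifier before _id.
const original = [
  { name: '_top_level_identifier', pattern: TOP_LEVEL_ID },
  { name: '_id', pattern: ID },
];
// Same tokens with the order swapped, so the word token comes first.
const swapped = [original[1], original[0]];

const before = lexWord('lambda x: x', original, '_id');
const after = lexWord('lambda x: x', swapped, '_id');

console.log(before.name); // _top_level_identifier
console.log(after.name);  // keyword:lambda
```

In this model, both patterns match `lambda` with the same length, so the earlier declaration decides the outcome. With `_top_level_identifier` first, the word token never wins and the keyword check is never reached. With `_id` first, it wins the tie and `lambda` is promoted to a keyword.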