Why is the `lambda` being parsed as an identifier instead of a keyword in this grammar? #3386

LaBatata101 · 2024-05-25T20:50:55Z

I'm trying to parse this syntax:

def main():
  return lambda a b: a

But after parsing `return` as a keyword, tree-sitter parses `lambda` as an identifier instead of a keyword. Why does that happen?

Here is the grammar:

module.exports = grammar({
  name: 'bend',

  extras: $ => [
    $.comment,
    /[\s\f\uFEFF\u2060\u200B]|\r?\n/,
  ],

  externals: $ => [
    $._newline,
    $._indent,
    $._dedent,
    $.comment,
  ],

  inline: $ => [
    $._simple_statement,
    $.expression,
    $.simple_expression,
  ],

  word: $ => $._id,

  rules: {
    source_file: $ => $._top_level_defs,

    _top_level_defs: $ => choice(
      $.function_definition,
    ),

    _statement: $ => choice(
      $._simple_statements,
    ),

    // Simple statements

    _simple_statements: $ => seq(
      $._simple_statement,
      $._newline,
    ),

    _simple_statement: $ => choice(
      $.return_statement,
    ),

    return_statement: $ => seq(
      'return',
      $.expression,
    ),

    // Compound statements
    function_definition: $ => seq(
      'def',
      field('name', $.identifier),
      field('parameters', $.parameters),
      ':',
      $.body,
    ),

    parameters: $ => seq(
      '(',
      optional($._parameters),
      ')',
    ),

    _parameters: $ => seq(
      commaSep1($.identifier),
      optional(',')
    ),

    body: $ => seq(
      $._indent,
      repeat($._statement),
      $._dedent,
    ),

    // Expressions

    expression: $ => choice(
      $.lambda_expression,
      $.simple_expression,
    ),

    simple_expression: $ => choice(
      $.identifier,
    ),

    lambda_expression: $ => seq(
      choice('λ', 'lambda'),
      alias(optionalCommaSep1($.expression), $.parameters),
      optional(','),
      ':',
      field('body', $.expression)
    ),

    // Identifier without two consecutive underscores __
    _top_level_identifier: _ => /[a-zA-Z][A-Za-z0-9.-/]*(?:_[A-Za-z0-9.-/] )*/,

    _id: _ => /[a-zA-Z][A-Za-z0-9.-/]*/,
    identifier: $ => choice($._id, $._top_level_identifier),

    comment: _ => token(seq('#', /.*/)),
  },
});
function optionalCommaSep1(rule) {
  return repeat1(seq(optional(','), rule))
}
function commaSep1(rule) {
  return sep1(rule, ',');
}
function sep1(rule, separator) {
  return seq(rule, repeat(seq(separator, rule)));
}

amaanq · 2024-05-25T21:19:50Z

amaanq
May 25, 2024
Maintainer

I think the problem in this case lies with how you've implicitly defined lexical precedence between _top_level_identifier and _id.

It's mentioned in the docs, though it's not immediately obvious: when you define two terminals (e.g. patterns formed from a regex or literals formed from a string) one after the other, tree-sitter assigns the first one a higher lexical precedence. In this case that means _top_level_identifier ranks higher than _id. You also happen to make _id the word token, meaning that tree-sitter will attempt to lex it whenever tokens like 'lambda' are valid too, and here comes the problem. We are trying to see whether either the word _id or the literal 'lambda' is valid at a given point, but because _top_level_identifier has a higher lexical precedence and is not the word token, it is attempted first, and it matches, thus causing the problem.

You can easily fix this by swapping the order of _id and _top_level_identifier so that _id comes first. I think we could improve the behavior or the documentation here tbh, since it's not immediately obvious that _id being the word token is what's causing the issue.

A rule of thumb that I follow (and maybe this should be documented tbh) is to place your word token rule (in this case _id) above any other rules that are a regex pattern, to avoid this.
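To make the tie-break concrete, here is a toy model in plain JavaScript. It is only a sketch of the behavior described above, not tree-sitter's actual lexer: `pickToken` is a hypothetical helper, the two patterns are simplified versions of the ones in the question, and the "keyword check only happens for the word token" step is an assumption taken from the explanation above.

```javascript
// Toy model of the tie-break described above. This is NOT tree-sitter's real
// lexer; pickToken and the simplified patterns below are hypothetical.
function pickToken(rules, wordRule, keywords, input) {
  let best = null;
  for (const [name, pattern] of rules) {
    const match = pattern.exec(input);
    if (match === null) continue;
    // A strictly longer match wins; on a tie the earlier-defined rule is kept.
    if (best === null || match[0].length > best.length) {
      best = { name, length: match[0].length };
    }
  }
  if (best === null) return null;
  const text = input.slice(0, best.length);
  // Assumption: keywords are only recognised when the word token won.
  if (best.name === wordRule && keywords.includes(text)) {
    return `keyword '${text}'`;
  }
  return best.name;
}

// Simplified versions of the two identifier patterns from the question.
const ID = /^[a-zA-Z][A-Za-z0-9]*/;
const TOP_LEVEL = /^[a-zA-Z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*/;

// Rule order as in the original grammar, and with _id moved first.
const originalOrder = [['_top_level_identifier', TOP_LEVEL], ['_id', ID]];
const swappedOrder = [['_id', ID], ['_top_level_identifier', TOP_LEVEL]];

console.log(pickToken(originalOrder, '_id', ['lambda'], 'lambda')); // _top_level_identifier
console.log(pickToken(swappedOrder, '_id', ['lambda'], 'lambda'));  // keyword 'lambda'
```

In this model both patterns match all of 'lambda', so the only thing deciding the winner is which rule was defined first. A name like foo_bar would still come out as _top_level_identifier in either order, because that pattern matches more of the input.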

5 replies

LaBatata101 May 25, 2024
Author

Thanks!

amaanq May 25, 2024
Maintainer

Np. Another tip I left out is to run tree-sitter parse with the -d flag. I immediately noticed that it lexed 'lambda' as a _top_level_identifier, which helped me figure out the problem. (Also, you didn't paste your scanner code even though you have externals; I just assumed it was the Python one copied 😅. It might be good to include or mention that in future questions, because the grammar didn't generate without it, and someone else might get lost trying to help out.)

LaBatata101 May 26, 2024
Author

Sorry about that 😅

Now I think I'm facing a similar issue in another rule but I don't know how to solve it. Here's the minimal grammar:

module.exports = grammar({
  name: 'bend',

  extras: $ => [
    $.comment,
    /[\s\f\uFEFF\u2060\u200B]|\r?\n/,
  ],

  externals: $ => [
    $._newline,
    $._indent,
    $._dedent,
    $.comment,
  ],

  word: $ => $._id,

  rules: {
    source_file: $ => $._top_level_defs,

    _top_level_defs: $ => choice(
      $.type_definition,
    ),

    // Top-level definitions

    type_definition: $ => seq(
      'type',
      $.identifier,
      ':',
      $._type_def_body,
    ),

    _type_def_body: $ => seq(
      $._indent,
      repeat1($.type_constructor),
      $._dedent,
    ),

    type_constructor: $ => seq(
      $.identifier,
      optional($.type_constructor_field),
      // $._newline,
    ),

    type_constructor_field: $ => seq(
      '{',
      // TODO: maybe create a node or field for the recursive field?
      commaSep1(choice(
        seq('~', $.identifier),
        $.identifier,
      )),
      optional(','),
      '}'
    ),

    _id: _ => /[a-zA-Z][A-Za-z0-9.-/]*/,
    // Identifier without two consecutive underscores __
    _top_level_identifier: _ => /[a-zA-Z][A-Za-z0-9.-/]*(?:_[A-Za-z0-9.-/] )*/,

    identifier: $ => choice($._id, $._top_level_identifier),

    comment: _ => token(seq('#', /.*/)),

  },
});
function optionalCommaSep1(rule) {
  return repeat1(seq(optional(','), rule))
}
function commaSep1(rule) {
  return sep1(rule, ',');
}
function sep1(rule, separator) {
  return seq(rule, repeat(seq(separator, rule)));
}

and the scanner code.

I'm trying to parse this syntax:

type Option:
  None

type Tree:
  A

After parsing the first type definition, tree-sitter treats the second `type` keyword as an _id. What's going on here?

amaanq May 26, 2024
Maintainer

source_file can only contain a single _top_level_defs; make it repeat1($._top_level_defs).

It is a little hard to understand. I think something like "eof expected" would be really cool in this scenario, since it took me a few minutes to catch it 😅. That's something we want to improve in the future tbh.

LaBatata101 May 27, 2024
Author

Nice! Thanks again!

So I have another one for you 😅. This time I'm trying to parse an if-else statement, but tree-sitter is not recognizing the else keyword.

def main():
    if a > b:
        return c
    else:
        return d

Minimal grammar:

const PREC = {
  comparison: 13,
}

module.exports = grammar({
  name: 'bend',

  extras: $ => [
    $.comment,
    /[\s\f\uFEFF\u2060\u200B]|\r?\n/,
  ],

  externals: $ => [
    $._newline,
    $._indent,
    $._dedent,
    $.comment,
  ],

  inline: $ => [
    $._simple_statement,
    $._compound_statement,
    $.expression,
    $.simple_expression,
  ],

  word: $ => $._id,

  rules: {
    source_file: $ => repeat($._top_level_defs),

    _top_level_defs: $ => choice(
      $.function_definition,
    ),

    // Top-level definitions

    function_definition: $ => seq(
      'def',
      field('name', $.identifier),
      field('parameters', $.parameters),
      ':',
      $.body,
    ),

    parameters: $ => seq(
      '(',
      optional($._parameters),
      ')',
    ),

    _parameters: $ => seq(
      commaSep1($.identifier),
      optional(',')
    ),

    body: $ => seq(
      $._indent,
      repeat($._statement),
      $._dedent,
    ),

    _statement: $ => choice(
      $._simple_statements,
      $._compound_statement,
    ),

    // Simple statements

    _simple_statements: $ => seq(
      $._simple_statement,
      $._newline,
    ),

    _simple_statement: $ => choice(
      $.return_statement,
    ),

    return_statement: $ => seq(
      'return',
      $.expression,
    ),

    // Compound statements

    _compound_statement: $ => choice(
      $.if_statement,
    ),

    if_statement: $ => seq(
      'if',
      field('condition', $.expression),
      ':',
      $.body,
      repeat($.elif_clause),
      $.else_clause,
    ),

    elif_clause: $ => seq(
      'elif',
      field('condition', $.expression),
      ':',
      $.body,
    ),

    else_clause: $ => seq(
      'else',
      ':',
      $.body,
    ),

    // Expressions

    expression: $ => choice(
      $.simple_expression,
    ),

    simple_expression: $ => choice(
      $.identifier,
      $.comparison_op,
    ),

    comparison_op: $ => prec.left(PREC.comparison, seq(
      $.simple_expression,
      seq(
        choice(
          '==',
          '<',
          '>',
          '!=',
        ),
        $.simple_expression
      )
    )),

    _id: _ => /[a-zA-Z][A-Za-z0-9.-/]*/,
    // Identifier without two consecutive underscores __
    _top_level_identifier: _ => /[a-zA-Z][A-Za-z0-9.-/]*(?:_[A-Za-z0-9.-/] )*/,

    identifier: $ => choice($._id, $._top_level_identifier),

    comment: _ => token(seq('#', /.*/)),

  },
});
function optionalCommaSep1(rule) {
  return repeat1(seq(optional(','), rule))
}
function commaSep1(rule) {
  return sep1(rule, ',');
}
function sep1(rule, separator) {
  return seq(rule, repeat(seq(separator, rule)));
}
