[ML] Update Deberta tokenizer #116358

maxhniebergall · 2024-11-06T20:33:02Z

The end of the offset was being computed from the byte position, but the rest of the code uses char position, so we should probably do the same here.
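To illustrate why mixing the two kinds of position matters, here is a minimal sketch. It is not the actual `UnigramTokenizer.java` code: the class `OffsetDemo` and its helper methods are hypothetical names made up for this example. It only shows that, for non-ASCII text, an end offset derived from a UTF-8 byte count runs past the end of the token when it is applied to a Java `String`, which is indexed by chars.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Elasticsearch.
final class OffsetDemo {

    // Length of the text measured in UTF-8 bytes.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length of the text measured in Java chars (UTF-16 code units).
    static int charLength(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 au lait"; // "café au lait"
        String token = "caf\u00e9";        // first token, "café"
        int start = 0;

        // 'é' (U+00E9) is one char but two UTF-8 bytes.
        int endByChars = start + charLength(token); // 4
        int endByBytes = start + utf8Length(token); // 5

        // The char-based end offset recovers exactly the token.
        System.out.println("[" + text.substring(start, endByChars) + "]"); // [café]

        // The byte-based end offset overshoots by one position here.
        System.out.println("[" + text.substring(start, endByBytes) + "]"); // [café ]
    }
}
```

For pure ASCII input the two counts agree, which is one reason a mismatch like this can go unnoticed until text with multi-byte characters is tokenized.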


elasticsearchmachine · 2024-11-06T20:33:27Z

Hi @maxhniebergall, I've created a changelog YAML for you.

maxhniebergall · 2024-11-07T15:12:26Z

maxhniebergall · 2024-11-19T20:20:32Z

Upon further investigation, I have determined that the bug addressed by this fix was not the cause of the reported problem (although it is still a bug in its own right), and the problem doesn't seem to be with byte fallback either. In fact, the Hugging Face tokenizer doesn't produce any output at all for the problematic characters.

elasticsearchmachine · 2024-11-19T20:22:28Z

Pinging @elastic/ml-core (Team:ML)

maxhniebergall · 2024-11-19T20:22:36Z

@elasticmachine merge upstream

dan-rubinstein

LGTM

elasticsearchmachine · 2024-11-20T20:10:17Z

💚 Backport successful

Status	Branch	Result
✅	8.x
✅	8.16

* Was using byte position for end of offset, but it seems like using char position is correct
* Update docs/changelog/116358.yaml
* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>

Was using byte position for end of offset, but it seems like using char position is correct

21894d9

maxhniebergall added >bug :ml Machine learning v9.0.0 v8.16.1 v8.17.0 labels Nov 6, 2024

Update docs/changelog/116358.yaml

bec7de7

Merge branch 'main' into useCharPosInsteadOfBytePos

8e380e4

elasticsearchmachine added v8.16.2 and removed v8.16.1 labels Nov 19, 2024

maxhniebergall marked this pull request as ready for review November 19, 2024 20:22

elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 19, 2024

Merge branch 'main' into useCharPosInsteadOfBytePos

c036769

maxhniebergall added the auto-backport Automatically create backport pull requests when merged label Nov 19, 2024

Update UnigramTokenizer.java

aa3556b

dan-rubinstein approved these changes Nov 20, 2024


maxhniebergall merged commit 7705514 into main Nov 20, 2024
17 checks passed

maxhniebergall deleted the useCharPosInsteadOfBytePos branch November 20, 2024 20:08

This was referenced Nov 20, 2024

[8.x] [ML] Update Deberta tokenizer (#116358) #117194

Merged

[8.16] [ML] Update Deberta tokenizer (#116358) #117195

Merged
