[ML] Update Deberta tokenizer (#116358)

* Was using byte position for end of offset, but it seems like using char position is correct
* Update docs/changelog/116358.yaml
* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
elastic · Nov 20, 2024 · 7705514
1 parent 311412d

Showing 2 changed files with 8 additions and 1 deletion.
diff --git a/docs/changelog/116358.yaml b/docs/changelog/116358.yaml
@@ -0,0 +1,5 @@
+pr: 116358
+summary: Update Deberta tokenizer
+area: Machine Learning
+type: bug
+issues: []
diff --git a/...l/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java b/...l/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java
@@ -367,8 +367,10 @@ List<DelimitedToken.Encoded> tokenize(CharSequence inputSequence, IntToIntFuncti
                         new DelimitedToken.Encoded(
                             Strings.format("<0x%02X>", bytes[i]),
                             pieces[i],
+                            // even though we are changing the number of characters in the output, we don't
+                            // need to change the offsets. The offsets refer to the input characters
                             offsetCorrection.apply(node.startsAtCharPos),
-                            offsetCorrection.apply(startsAtBytes + i)
+                            offsetCorrection.apply(endsAtChars)
                         )
                     );
                 }
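
Note on the change: the fix rests on the difference between UTF-8 byte positions and character positions in the input. The following is a minimal, self-contained sketch of that idea, not the Elasticsearch implementation. `ByteFallbackOffsets`, `ByteToken`, and `byteFallback` are hypothetical names introduced only for illustration, and the sketch assumes every byte-fallback token should carry the character span of the input text it was derived from.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteFallbackOffsets {

    // Hypothetical stand-in for a token with offsets; not DelimitedToken.Encoded.
    static final class ByteToken {
        final String text;
        final int startOffset; // char position in the input (inclusive)
        final int endOffset;   // char position in the input (exclusive)

        ByteToken(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + "[" + startOffset + "," + endOffset + ")";
        }
    }

    // Emits one "<0xNN>" token per UTF-8 byte of input[startChar, endChar).
    // All tokens share the same char-based offsets, because the offsets
    // describe the input text, not the expanded byte tokens.
    static List<ByteToken> byteFallback(String input, int startChar, int endChar) {
        byte[] bytes = input.substring(startChar, endChar).getBytes(StandardCharsets.UTF_8);
        List<ByteToken> tokens = new ArrayList<>(bytes.length);
        for (byte b : bytes) {
            tokens.add(new ByteToken(String.format("<0x%02X>", b), startChar, endChar));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u20AC" (the euro sign) is one char but three UTF-8 bytes.
        String input = "a\u20ACb";
        for (ByteToken token : byteFallback(input, 1, 2)) {
            System.out.println(token);
        }
    }
}
```

In this sketch the single character at position 1 expands to three byte tokens, and each one reports the span [1,2). An end offset derived from a byte index would instead grow with the number of bytes and could point past the character it describes.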