counting pairs is inaccurate for repeating tokens? #51

Open

JohannesVod opened this issue Feb 28, 2024 · 5 comments

@JohannesVod

I just noticed that counting pairs might be slightly inaccurate when a token repeats many times. For example, in the sequence 1, 1, 1, 1 the pair (1, 1) gets counted 3 times, but we can only replace it twice. This is slightly inaccurate because, for example, in the sequence 1, 1, 1, 1, 2, 1, 2, 1, 2 we might choose the pair (1, 1) over (1, 2).
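For concreteness, here is a small illustrative sketch (hypothetical helper names, not code from this repo) comparing the naive count with the number of merges that are actually possible:

```python
# Naive pair counting vs. non-overlapping counting for runs of a repeated token.

def count_pairs_naive(ids):
    """Count every adjacent pair, overlapping occurrences included."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def count_pairs_nonoverlapping(ids):
    """Count pairs without reusing a token already consumed by the previous
    occurrence of the same pair (only matters for runs like 1, 1, 1, 1)."""
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        # skip the overlapping occurrence inside a run of identical tokens
        if i + 2 < len(ids) and ids[i] == ids[i + 1] == ids[i + 2]:
            i += 2
        else:
            i += 1
    return counts

print(count_pairs_naive([1, 1, 1, 1]))            # {(1, 1): 3}
print(count_pairs_nonoverlapping([1, 1, 1, 1]))   # {(1, 1): 2}

seq = [1, 1, 1, 1, 2, 1, 2, 1, 2]
print(count_pairs_naive(seq)[(1, 1)], count_pairs_naive(seq)[(1, 2)])                    # 3 3 (tie)
print(count_pairs_nonoverlapping(seq)[(1, 1)], count_pairs_nonoverlapping(seq)[(1, 2)])  # 2 3
```

With the adjusted counts, (1, 2) clearly wins over (1, 1) in the second sequence instead of tying with it.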

@lukasugar

That's an interesting observation! I think it's not a problem in a real scenario:

  • Even if (1, 1) gets picked first, (1, 2) would be picked right after it, unless the max number of merges is reached (highly unlikely).
  • I assume it's not too common to have very long repeating sequences - what would that text look like? Is it meaningful? I don't have many ideas except the whitespace in Python :)
  • When looking at a large corpus of data (which is what tokenizers are trained on), this tends to be a rare edge case.

@JohannesVod

Yes, I think it's not that bad in a real-life scenario. However, look at what happens in our example: 1, 1, 1, 1, 2, 1, 2, 1, 2 -> 3, 3, 2, 1, 2, 1, 2. If we pick (1, 2) now, we can only substitute it 2 times, whereas in the original sequence we could have substituted it 3 times. I'm not sure how bad it is, but I noticed (after writing this issue) that this exact problem is mentioned in the paper from the first issue. They use the same counting function as this repo and basically say that it is not that important. But I think you should keep it in the back of your mind.
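A quick numerical check of this, as a sketch (the merge helper below mimics the usual left-to-right BPE replacement step but is written here purely for illustration):

```python
# Greedily merge one pair left-to-right and compare how many times (1, 2)
# can still be substituted before vs. after merging (1, 1) first.

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

seq = [1, 1, 1, 1, 2, 1, 2, 1, 2]
after_11 = merge(seq, (1, 1), 3)
print(after_11)                                         # [3, 3, 2, 1, 2, 1, 2]
print(len(seq) - len(merge(seq, (1, 2), 4)))            # 3 substitutions of (1, 2) originally
print(len(after_11) - len(merge(after_11, (1, 2), 4)))  # only 2 after (1, 1) was merged first
```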

@zouharvi

zouharvi commented Mar 2, 2024

We mention it in the paper (page 3, top left) and include an algorithm with the adjustment on the last page (screenshot here). We were particular about it in the paper because it has formal implications for the proofs: there we want the quantity to be how much is compressed in the next step, and raw pair frequency is not that.

Libraries usually fix it, but IMO it does not have a big practical impact on natural language (it could if the data contains very long repeated strings such as =============, since their counts get inflated). The original paper by Rico Sennrich does not have this adjustment.

[Screenshot: the adjusted pair-counting algorithm from the last page of the paper]
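To make the inflated-counts point concrete, a tiny sketch (assuming the repeated string is raw '=' bytes, value 61):

```python
# A long run of one repeated token inflates the naive pair count well beyond
# the number of merges actually available in the next step.

run = [61] * 13                    # e.g. "=============" as byte tokens
naive_count = len(run) - 1         # every adjacent (=, =) pair, overlaps included
possible_merges = len(run) // 2    # non-overlapping replacements actually possible

print(naive_count, possible_merges)   # 12 6
```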

@JohannesVod

JohannesVod commented Mar 3, 2024

Yes, thanks for your reply. I already looked at your paper, which you linked in the first issue. Personally, I think the algorithm is quite nice, and I am currently implementing it in C and calling it from Python using ctypes; I hope to get a much faster implementation that way. If you want, I can link it here when it is finished. Another thing I noticed is that the encoding function could also be made much faster by using a linked list: when iterating through each pair from the tokenizer, the linked list can be adjusted similarly to your algorithm. I'm not sure what the exact time complexity is, but it is obviously much faster. I'm guessing something like O(n m), where n is the sequence length and m the number of tokens, but I haven't thought much about it.
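A rough sketch of the linked-list idea (hypothetical names, not the repo's encode function): walk the merge table in the order the merges were learned and splice merged tokens out of a doubly linked list, so each merge is a single O(n) pass and the whole encode is roughly O(n m).

```python
# Sketch of linked-list based encoding: apply learned merges in training order,
# unlinking the second token of each merged pair in O(1).

class Node:
    __slots__ = ("val", "prev", "next")
    def __init__(self, val):
        self.val, self.prev, self.next = val, None, None

def encode_with_linked_list(ids, merges):
    """`merges` maps (a, b) -> new_id, in the order the merges were learned."""
    # build a doubly linked list of the initial token ids
    head, prev = None, None
    for t in ids:
        node = Node(t)
        if prev is None:
            head = node
        else:
            prev.next, node.prev = node, prev
        prev = node

    # one pass over the list per learned merge
    for (a, b), new_id in merges.items():
        cur = head
        while cur is not None and cur.next is not None:
            if cur.val == a and cur.next.val == b:
                cur.val = new_id                # merge in place
                dropped = cur.next
                cur.next = dropped.next         # unlink the second token, O(1)
                if dropped.next is not None:
                    dropped.next.prev = cur
            else:
                cur = cur.next

    # collect the merged sequence
    out, cur = [], head
    while cur is not None:
        out.append(cur.val)
        cur = cur.next
    return out

# usage with merges learned as {(1, 1): 3, (1, 2): 4}
print(encode_with_linked_list([1, 1, 1, 1, 2, 1, 2, 1, 2], {(1, 1): 3, (1, 2): 4}))
# -> [3, 3, 2, 4, 4]
```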

@Majdoddin

I submitted PR #51 for it.
