counting pairs is inaccurate for repeating tokens? #51

Open

JohannesVod opened this issue Feb 28, 2024 · 5 comments

@JohannesVod

I just noticed that counting pairs might be slightly inaccurate when a token repeats many times. For example, in the sequence 1, 1, 1, 1 the pair (1, 1) gets counted 3 times, but we can only replace it twice. This is slightly inaccurate because, for example, in the sequence 1, 1, 1, 1, 2, 1, 2, 1, 2 we might choose the pair (1, 1) over (1, 2).
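For concreteness, here is a small illustrative sketch (hypothetical helper names, not code from this repo) comparing the naive count with the number of merges that are actually possible:

```python
# Naive pair counting vs. non-overlapping counting for runs of a repeated token.

def count_pairs_naive(ids):
    """Count every adjacent pair, overlapping occurrences included."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def count_pairs_nonoverlapping(ids):
    """Count pairs without reusing a token already consumed by the previous
    occurrence of the same pair (only matters for runs like 1, 1, 1, 1)."""
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        # skip the overlapping occurrence inside a run of identical tokens
        if i + 2 < len(ids) and ids[i] == ids[i + 1] == ids[i + 2]:
            i += 2
        else:
            i += 1
    return counts

print(count_pairs_naive([1, 1, 1, 1]))            # {(1, 1): 3}
print(count_pairs_nonoverlapping([1, 1, 1, 1]))   # {(1, 1): 2}

seq = [1, 1, 1, 1, 2, 1, 2, 1, 2]
print(count_pairs_naive(seq)[(1, 1)], count_pairs_naive(seq)[(1, 2)])                    # 3 3 (tie)
print(count_pairs_nonoverlapping(seq)[(1, 1)], count_pairs_nonoverlapping(seq)[(1, 2)])  # 2 3
```

With the adjusted counts, (1, 2) clearly wins over (1, 1) in the second sequence instead of tying with it.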

@lukasugar

That's an interesting observation! I think it's not a problem in a real scenario:

  • Even if (1, 1) gets picked first, (1, 2) would be picked right after it, unless the max number of merges is reached (highly unlikely).
  • I assume it's not too common to have very long repeating sequences - what would that text look like? Is it meaningful? I don't have many ideas except the whitespace in Python :)
  • When looking at a large corpus of data (which is what tokenizers are trained on), this tends to be a rare edge case.

@JohannesVod

Yes, I think it's not that bad in a real-life scenario. However, look at what happens in our example: 1, 1, 1, 1, 2, 1, 2, 1, 2 -> 3, 3, 2, 1, 2, 1, 2. If we pick (1, 2) now, we can only substitute it 2 times, whereas in the original sequence we could have substituted it 3 times. I'm not sure how bad it is, but I noticed (after writing this issue) that this exact problem is mentioned in the paper from the first issue. They use the same counting function as this repo and basically say that it is not that important. But I think you should keep it in the back of your mind.
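A quick numerical check of this, as a sketch (the merge helper below mimics the usual left-to-right BPE replacement step but is written here purely for illustration):

```python
# Greedily merge one pair left-to-right and compare how many times (1, 2)
# can still be substituted before vs. after merging (1, 1) first.

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

seq = [1, 1, 1, 1, 2, 1, 2, 1, 2]
after_11 = merge(seq, (1, 1), 3)
print(after_11)                                         # [3, 3, 2, 1, 2, 1, 2]
print(len(seq) - len(merge(seq, (1, 2), 4)))            # 3 substitutions of (1, 2) originally
print(len(after_11) - len(merge(after_11, (1, 2), 4)))  # only 2 after (1, 1) was merged first
```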

@zouharvi

zouharvi commented Mar 2, 2024

We mention it in the paper (page 3, top left) and include an algorithm with the adjustment on the last page (screenshot here). We were particular about it in the paper because it has formal implications for the proofs: there we want the quantity to be how much is compressed in the next step, and raw pair frequency is not that.

Libraries usually fix it, but IMO it does not have a big practical impact on natural language (it could if the data contains very long repeated strings such as =============, since their counts get inflated). The original paper by Rico Sennrich does not have this adjustment.

[Screenshot: the adjusted pair-counting algorithm from the last page of the paper]
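To make the inflated-counts point concrete, a tiny sketch (assuming the repeated string is raw '=' bytes, value 61):

```python
# A long run of one repeated token inflates the naive pair count well beyond
# the number of merges actually available in the next step.

run = [61] * 13                    # e.g. "=============" as byte tokens
naive_count = len(run) - 1         # every adjacent (=, =) pair, overlaps included
possible_merges = len(run) // 2    # non-overlapping replacements actually possible

print(naive_count, possible_merges)   # 12 6
```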

@JohannesVod

JohannesVod commented Mar 3, 2024

Yes, thanks for your reply. I already looked at your paper, which you linked in the first issue. Personally, I think the algorithm is quite nice, and I am currently implementing it in C and calling it from Python using ctypes; I hope to get a much faster implementation that way. If you want, I can link it here when it is finished. Another thing I noticed is that the encoding function could also be made much faster by using a linked list: when iterating through each pair from the tokenizer, the linked list can be adjusted similarly to your algorithm. I'm not sure what the exact time complexity is, but it is obviously much faster. I'm guessing something like O(n m), where n is the sequence length and m the number of tokens, but I haven't thought much about it.
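A rough sketch of the linked-list idea (hypothetical names, not the repo's encode function): walk the merge table in the order the merges were learned and splice merged tokens out of a doubly linked list, so each merge is a single O(n) pass and the whole encode is roughly O(n m).

```python
# Sketch of linked-list based encoding: apply learned merges in training order,
# unlinking the second token of each merged pair in O(1).

class Node:
    __slots__ = ("val", "prev", "next")
    def __init__(self, val):
        self.val, self.prev, self.next = val, None, None

def encode_with_linked_list(ids, merges):
    """`merges` maps (a, b) -> new_id, in the order the merges were learned."""
    # build a doubly linked list of the initial token ids
    head, prev = None, None
    for t in ids:
        node = Node(t)
        if prev is None:
            head = node
        else:
            prev.next, node.prev = node, prev
        prev = node

    # one pass over the list per learned merge
    for (a, b), new_id in merges.items():
        cur = head
        while cur is not None and cur.next is not None:
            if cur.val == a and cur.next.val == b:
                cur.val = new_id                # merge in place
                dropped = cur.next
                cur.next = dropped.next         # unlink the second token, O(1)
                if dropped.next is not None:
                    dropped.next.prev = cur
            else:
                cur = cur.next

    # collect the merged sequence
    out, cur = [], head
    while cur is not None:
        out.append(cur.val)
        cur = cur.next
    return out

# usage with merges learned as {(1, 1): 3, (1, 2): 4}
print(encode_with_linked_list([1, 1, 1, 1, 2, 1, 2, 1, 2], {(1, 1): 3, (1, 2): 4}))
# -> [3, 3, 2, 4, 4]
```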

@Majdoddin

I submitted PR #51 for it.
