Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

mahdibaghbanzadeh · 2024-12-12T21:37:43Z

This pull request introduces enhancements to the DataCollatorForLanguageModeling class, providing greater flexibility for token replacement during masked language modeling (MLM). The key changes include:

Configurable Replacement Probabilities:
- mask_replace_prob: Specifies the probability of replacing masked tokens with the [MASK] token (default: 80%).
- random_replace_prob: Specifies the probability of replacing masked tokens with random tokens from the vocabulary (default: 10%).
- The remaining masked tokens are left unchanged (default: 10%).
Edge Case Handling:
- Properly scales random_replace_prob to the remaining probability after applying mask_replace_prob.
- Includes validation to ensure the sum of mask_replace_prob and random_replace_prob does not exceed 1.
Backward Compatibility:
- Default behavior mimics the traditional 80-10-10 rule for MLM token replacement.

Examples of New Functionality

Default Behavior:
Replace 80% of masked tokens with [MASK], 10% with random tokens, and leave 10% unchanged.

Custom Configurations:

Replace all masked tokens with [MASK]:

mask_replace_prob=1.0, random_replace_prob=0.0

Replace all masked tokens with random tokens:

mask_replace_prob=0.0, random_replace_prob=1.0

Balanced replacement:

mask_replace_prob=0.5, random_replace_prob=0.4

Additional Notes

Updated docstrings to reflect the new configuration options.
Added validations for probability values and enhanced edge case handling for robust training workflows.

This enhancement gives users greater control over MLM training configurations, catering to various pretraining and fine-tuning use cases.

… that provides more control over the token masking and relacing

…lator_mlm

Rocketknight1

I like this addition to the class! Some suggestions before we can merge it, though:

You"ll need to run pip install transformers[quality] followed by make style to fix the code style issues
We"ll need some tests to cover these new options! They should go in tests/trainer/test_data_collator.py.

Because the collator uses random sampling, though, please don"t write tests that check the number of masked tokens is close to the expected value - these are very flaky and tend to randomly fail 1% of the time, which is very annoying in our CI. Instead, I suggest setting values to 0 or 1 and confirming that you get the expected behaviour - e.g. set mask_replace_prob=1 and confirm that every token is either the original token or [MASK]. You can also set illegal values and confirm that an error is raised.

src/transformers/data/data_collator.py

… the DataCollatorForLanguageModeling

mahdibaghbanzadeh · 2024-12-13T16:54:12Z

Thanks for the feedback!
I updated the docstring and added the following tests:

test_probability_sum_error: Ensures an error is raised if mask_replace_prob + random_replace_prob is not within [0, 1].
test_all_mask_replacement: Verifies functionality when mask_replace_prob=1, ensuring all tokens are either the original token or [MASK].

Rocketknight1 · 2024-12-17T16:05:03Z

@mahdibaghbanzadeh this looks good now! Let me know whenever you"re ready for final review and I"ll ping a core maintainer

mahdibaghbanzadeh · 2024-12-17T16:10:09Z

@Rocketknight1 Thanks, Please let them know to do the final review.

Rocketknight1 · 2024-12-17T16:14:10Z

cc @ArthurZucker for core maintainer review!

Rocketknight1 · 2025-01-14T14:52:42Z

hi @mahdibaghbanzadeh, and sorry for the Christmas delay! The core maintainers are pretty overworked at the moment, but I just did a final review and I"m happy with this. Let me know if there"s anything you want to change before we merge!

HuggingFaceDocBuilderDev · 2025-01-14T15:13:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mahdibaghbanzadeh · 2025-01-14T15:55:34Z

hi @Rocketknight1 and happy New Year!
Thank you for letting me know. At this time, this looks fine and final from my end.

Rocketknight1

Merging, in that case, and thank you for the PR!

mahdibaghbanzadeh and others added 4 commits December 12, 2024 16:26

DataCollatorForLanguageModeling class was updated with new parameters…

f26418c

… that provides more control over the token masking and relacing

Merge branch "main" into data_collator_mlm

b0deb66

DataCollatorForLanguageModeling class was updated with new parameters…

cab24a3

… that provides more control over the token masking and relacing

Merge remote-tracking branch "origin/data_collator_mlm" into data_col…

cb0541e

…lator_mlm

Rocketknight1 reviewed Dec 13, 2024

View reviewed changes

src/transformers/data/data_collator.py Outdated Show resolved Hide resolved

Addressed review comments, modified the docstring and made a test for…

72d6cac

… the DataCollatorForLanguageModeling

mahdibaghbanzadeh added 2 commits December 17, 2024 09:03

Merge branch "main" into data_collator_mlm

3942102

Merge branch "main" into data_collator_mlm

450ac60

Merge branch "main" into data_collator_mlm

3c40e02

mahdibaghbanzadeh requested a review from ArthurZucker as a code owner January 9, 2025 16:14

Merge branch "main" into data_collator_mlm

7403c03

Rocketknight1 approved these changes Jan 14, 2025

View reviewed changes

Rocketknight1 merged commit c61fcde into huggingface:main Jan 14, 2025
25 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

mahdibaghbanzadeh commented Dec 12, 2024

Rocketknight1 left a comment •

edited

Loading

mahdibaghbanzadeh commented Dec 13, 2024

Rocketknight1 commented Dec 17, 2024

mahdibaghbanzadeh commented Dec 17, 2024

Rocketknight1 commented Dec 17, 2024

Rocketknight1 commented Jan 14, 2025

HuggingFaceDocBuilderDev commented Jan 14, 2025

mahdibaghbanzadeh commented Jan 14, 2025

Rocketknight1 left a comment

Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

Conversation

mahdibaghbanzadeh commented Dec 12, 2024

Examples of New Functionality

Additional Notes

Rocketknight1 left a comment • edited Loading

Choose a reason for hiding this comment

mahdibaghbanzadeh commented Dec 13, 2024

Rocketknight1 commented Dec 17, 2024

mahdibaghbanzadeh commented Dec 17, 2024

Rocketknight1 commented Dec 17, 2024

Rocketknight1 commented Jan 14, 2025

HuggingFaceDocBuilderDev commented Jan 14, 2025

mahdibaghbanzadeh commented Jan 14, 2025

Rocketknight1 left a comment

Choose a reason for hiding this comment

Rocketknight1 left a comment •

edited

Loading