Documentation for SWAG contradicts itself when constructing the first sentence. #35095

Closed
2 of 4 tasks
bauwenst opened this issue Dec 5, 2024 · 2 comments

bauwenst commented Dec 5, 2024

System Info

Not relevant.

Who can help?

@stevhliu @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The docs for multiple choice use SWAG as an example, which is the task of selecting the next sentence given a context. Somewhat strangely, rather than being given in the format (sentence1, [sentence2a, sentence2b, sentence2c, sentence2d]), the dataset is given in the format (sentence1, sentence2_start, [sentence2_endA, sentence2_endB, sentence2_endC, sentence2_endD]).

The code given in the docs basically turns the dataset into the first format, where sentence 1 is kept intact and the start of sentence 2 is concatenated to each ending:

    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]
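
To make the effect of that snippet concrete, here is a minimal sketch that runs the same logic on a hand-made batch. The sentences and the function name `preprocess_as_in_docs` are invented for illustration, and `ending_names` is assumed to be the list of the four ending columns (`ending0` to `ending3`), since the snippet above does not define it.

```python
# Toy batch in the SWAG column layout; the sentences are made up for illustration.
examples = {
    "sent1": ["A man opens a door."],
    "sent2": ["He"],
    "ending0": ["walks inside."],
    "ending1": ["eats the door."],
    "ending2": ["flies away."],
    "ending3": ["sings to the handle."],
}
# Assumption: the four ending columns, as the docs snippet appears to expect.
ending_names = ["ending0", "ending1", "ending2", "ending3"]


def preprocess_as_in_docs(examples):
    # sent1 is copied four times and left untouched (sent2 is NOT appended).
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    # sent2 is prepended to each of the four endings.
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names]
        for i, header in enumerate(question_headers)
    ]
    return first_sentences, second_sentences


first, second = preprocess_as_in_docs(examples)
print(first[0][0])   # A man opens a door.
print(second[0][0])  # He walks inside.
```

On this batch the first element of every pair is `sent1` alone, so `sent2` shows up only in the second element.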

Yet, the docs say:

The preprocessing function you want to create needs to:
1. Make four copies of the `sent1` field and combine each of them with `sent2` to recreate how a sentence starts.
2. Combine `sent2` with each of the four possible sentence endings.

What is being described is formatting the dataset as (sentence1 sentence2_start, [sentence2_start sentence2_endA, sentence2_start sentence2_endB, sentence2_start sentence2_endC, sentence2_start sentence2_endD]), where there is overlap between the first and the second sentence (namely sentence2_start).

Expected behavior

Either the code is wrong or the description is wrong.

If the description is wrong, it should be:

The preprocessing function you want to create needs to:

  1. Make four copies of the sent1 field.
  2. Combine sent2 with each of the four possible sentence endings.

If the code is wrong, it should be:

    first_sentences = [[f"{s1} {s2_start}"] * 4 for s1,s2_start in zip(examples["sent1"], examples["sent2"])]
    second_sentences = [
        [f"{s2_start} {examples[end][i]}" for end in ending_names] for i, s2_start in enumerate(examples["sent2"])
    ]
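
For comparison, a minimal sketch of what this second variant would produce on a hand-made batch. The sentences and the function name `preprocess_as_described` are invented for illustration, and `ending_names` is again assumed to be the four ending columns `ending0` to `ending3`.

```python
# Toy batch in the SWAG column layout; the sentences are made up for illustration.
examples = {
    "sent1": ["A man opens a door."],
    "sent2": ["He"],
    "ending0": ["walks inside."],
    "ending1": ["eats the door."],
    "ending2": ["flies away."],
    "ending3": ["sings to the handle."],
}
# Assumption: the four ending columns of the dataset.
ending_names = ["ending0", "ending1", "ending2", "ending3"]


def preprocess_as_described(examples):
    # sent1 and sent2 are joined first, then copied four times.
    first_sentences = [
        [f"{s1} {s2_start}"] * 4
        for s1, s2_start in zip(examples["sent1"], examples["sent2"])
    ]
    # sent2 is ALSO prepended to each ending, so it appears in both halves.
    second_sentences = [
        [f"{s2_start} {examples[end][i]}" for end in ending_names]
        for i, s2_start in enumerate(examples["sent2"])
    ]
    return first_sentences, second_sentences


first, second = preprocess_as_described(examples)
print(first[0][0])   # A man opens a door. He
print(second[0][0])  # He walks inside.
```

Here `sent2` ("He") appears at the end of the first element and at the start of the second, which is the overlap the current description implies.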
@bauwenst bauwenst added the bug label Dec 5, 2024
@Rocketknight1
Member

cc @stevhliu


github-actions bot commented Jan 4, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
