
Token out of vocabulary at train_gpt2.cu:675 #786

Open

aidando73 opened this issue Nov 20, 2024 · 1 comment

aidando73 commented Nov 20, 2024

I'm trying to follow #481 but I'm getting this error:

evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
Writing state to log124M/state_00019560_00002.bin
Error: Token out of vocabulary at train_gpt2.cu:675
Error details:
  File: train_gpt2.cu
  Line: 675
  Token: -1149026846
  Position: 0
  Vocab: 50257
generating:
---
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20376,1],0]
  Exit code:    1
--------------------------------------------------------------------------

This happens at the end of training, so I don't end up getting the final model weights.
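
For context, the check that fires at train_gpt2.cu:675 is presumably a simple bounds check of each token id against the vocabulary size before generation. Here is a minimal sketch of that kind of validation; the function name and layout are my assumptions, not the actual llm.c code, but it reproduces an error report in the same shape as the one above:

```c
// Sketch of a token-validation check like the one that appears to fire at
// train_gpt2.cu:675. All names here are illustrative assumptions.
#include <stdio.h>
#include <stdlib.h>

static void validate_tokens(const int *tokens, int count, int vocab_size,
                            const char *file, int line) {
    for (int i = 0; i < count; i++) {
        if (tokens[i] < 0 || tokens[i] >= vocab_size) {
            fprintf(stderr, "Error: Token out of vocabulary at %s:%d\n", file, line);
            fprintf(stderr, "Error details:\n  File: %s\n  Line: %d\n"
                            "  Token: %d\n  Position: %d\n  Vocab: %d\n",
                    file, line, tokens[i], i, vocab_size);
            exit(EXIT_FAILURE);
        }
    }
}

int main(void) {
    // A garbage value like -1149026846 at position 0 trips the check
    // immediately, which matches the error output above (Vocab: 50257).
    int tokens[4] = {-1149026846, 50256, 11, 318};
    validate_tokens(tokens, 4, 50257, "train_gpt2.cu", 675);
    return 0;
}
```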

Running:

nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -y 1 \
    -v 250 -s 20000 \
    -h 1 && echo "end $(date)"' &

You can find the model checkpoint state at step 1500 here:
https://huggingface.co/aidando73/repro-gpt-2-124M/tree/086c8895ae49f2472bcde14c7866e792b0a330f1/8x_A100_40GB/log124M

Commit hash I checked out: 7ecd890

Note that I didn't run python train_gpt2.py beforehand.

Anyone else getting this error?

aidando73 commented Nov 21, 2024

Note that I didn't run python train_gpt2.py beforehand.

When I was using train_gpt2.cu for inference, I ran into the same issue, but if I ran python train_gpt2.py beforehand, the issue went away.

My hypothesis is that -1149026846 is the end-of-file token, which isn't being set correctly in the case where python train_gpt2.py isn't run beforehand.
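
If that's right, the failure mode would look roughly like the sketch below: when the tokenizer file that python train_gpt2.py produces is missing, the end-of-file/end-of-text token field is never initialized, so generation starts from whatever garbage happens to be in memory. This is purely an illustration of the hypothesis; the struct and field names are my assumptions, not the actual train_gpt2.cu code.

```c
// Illustration of the hypothesis only -- names are assumptions, not llm.c code.
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int init_ok;    // set to 1 only if the tokenizer file was loaded
    int eot_token;  // end-of-text token used to kick off generation
} Tokenizer;

static void tokenizer_init(Tokenizer *t, const char *filename) {
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {
        // Tokenizer file missing (e.g. python train_gpt2.py was never run):
        // init_ok stays 0 and eot_token keeps whatever was in memory.
        t->init_ok = 0;
        return;
    }
    // ... read header and eot_token from the file ...
    t->init_ok = 1;
    fclose(f);
}

int main(void) {
    Tokenizer tokenizer;  // uninitialized stack memory
    tokenizer_init(&tokenizer, "gpt2_tokenizer.bin");
    int first_gen_token = tokenizer.eot_token;  // garbage if init_ok == 0
    printf("first generation token: %d\n", first_gen_token);
    // A value like -1149026846 here would then fail the vocabulary check
    // and abort right before the final generation/checkpoint step.
    return 0;
}
```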
