Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaner dockerfile: Remove already installed deps #1175

Merged
merged 1 commit into from
Mar 8, 2024

Conversation

tf-nv
Copy link
Contributor

@tf-nv tf-nv commented Mar 7, 2024

Cleaning up the Dockerfile after the ngc pytorch switch (#1170):

  • Eliminate already installed apt packages
  • sparse attn requirement lead to a triton downgrade
  • flash attn is already part of the ngc container (in another version that is compatible with TE)

The flash-attn change might be required for TE, see the TE setup script. The following error from the docker log was indicating so, too. I am not 100% sure if the version upper limit is a technological requirement, or just a precaution. In any case this is now resolved, plus there's no need to build flash-attn.

#14 10.98 Installing collected packages: flash-attn
#14 10.98   Attempting uninstall: flash-attn
#14 10.98     Found existing installation: flash-attn 2.4.2
#14 10.98     Uninstalling flash-attn-2.4.2:
#14 11.74       Successfully uninstalled flash-attn-2.4.2
#14 13.81 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
#14 13.81 transformer-engine 1.3.0 5b90b7f requires flash-attn!=2.0.9,!=2.1.0,<=2.4.2,>=2.0.6, but you have flash-attn 2.5.6 which is incompatible.
#14 13.81 Successfully installed flash-attn-2.5.6

Training pythia14M works (with gpt-j-residual=false , see #1174)
Training a 20B model with TP8 works as well

Just a few steps on 8xH100, not to convergence :)

- Eliminate already installed apt packages
- sparse attn requirement lead to a triton downgrade
- flash attn is already part of the ngc container (in another version
  that is compatible with TE)
@tf-nv tf-nv requested a review from Quentin-Anthony as a code owner March 7, 2024 14:48
Copy link
Member

@Quentin-Anthony Quentin-Anthony left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

The requirements files are intended for source-builds anyway, so any package overlap between them and the Dockerfile should be handled like this PR (i.e. removing the requirements file from Dockerfile)

@Quentin-Anthony Quentin-Anthony merged commit 8c13642 into EleutherAI:main Mar 8, 2024
2 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants