docker #2

Open
14 of 18 tasks
hrhee opened this issue Nov 3, 2021 · 5 comments

@hrhee
Collaborator

hrhee commented Nov 3, 2021

Docker Build File

Plan: Base the Docker build file on the nvidia/cuda image, copy the project's source code into the image, define Docker volumes, ...

To Do:

  • run cctest: run_training.py
    • install python 3.9 via miniconda
    • poetry install
    • poetry shell (see comment)
    • edit .env
    • download MNIST data
    • python run_training.py: running (see comment)
  • 11.4.3-cudnn8-runtime-ubuntu20.04 image:
    • on openstack
    • on maxwell
  • build image with gitlab-desy CI pipeline: docker pull gitlab.desy.de:5555/franz.rhee/docker-maxwell:latest
  • Dockerfile: poetry install
  • Dockerfile: COPY source code into image

links:

misc:

[1] https://confluence.desy.de/display/MXW/Running a single job with Docker

@Ivo-B Ivo-B added the enhancement New feature or request label Nov 3, 2021
@hrhee hrhee self-assigned this Nov 4, 2021
@hrhee
Collaborator Author

hrhee commented Nov 4, 2021

WIP

$ poetry shell
>Spawning shell within /home/pan/miniconda3/envs/py39
>. /home/pan/miniconda3/envs/py39/bin/activate

$ . /home/pan/miniconda3/envs/py39/bin/activate
>bash: /home/pan/miniconda3/envs/py39/bin/activate: No such file or directory
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

But:

conda activate py39
conda list

All packages installed by poetry are listed in the conda environment py39.
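
One possible reading, offered as an assumption rather than a confirmed diagnosis: poetry seems to treat the active conda environment as its virtual environment and then tries to source a virtualenv-style bin/activate script, which the py39 conda environment does not contain. The minimal Python sketch below only reports which environment the running interpreter belongs to and whether such a script exists. The function name describe_environment is hypothetical and not part of the project.

```python
import os
import sys


def describe_environment() -> dict:
    """Report which Python environment the current interpreter belongs to."""
    activate_script = os.path.join(sys.prefix, "bin", "activate")
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate <env>`; None if no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Set by virtualenv/venv activation scripts; None otherwise.
        "virtual_env": os.environ.get("VIRTUAL_ENV"),
        # The file that `poetry shell` tried to source in the log above.
        "has_activate_script": os.path.isfile(activate_script),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once inside the conda environment would show whether the packages are reachable without `poetry shell`, for example when `conda_env` reports py39 while `has_activate_script` is False.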

@hrhee
Collaborator Author

hrhee commented Nov 4, 2021

[2021-11-04 17:02:29,421][tensorflow][INFO] - Assets written to: checkpoints/epoch_005-0.06.tf/assets
Exception ignored in: <function Pool.__del__ at 0x7fe732686700>
Traceback (most recent call last):
  File "/home/pan/miniconda3/envs/py39/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/pan/miniconda3/envs/py39/lib/python3.9/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/home/pan/miniconda3/envs/py39/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/pan/miniconda3/envs/py39/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/home/pan/miniconda3/envs/py39/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
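
The traceback shows Pool.__del__ running as the program exits, apparently after the pool's internal pipe has already been closed. Which component creates the pool is not stated in this thread. As a general illustration only (not the fix used in this project), a pool that is closed and joined explicitly does not depend on __del__ for cleanup. The names square and map_with_explicit_cleanup are hypothetical, and ThreadPool is used here purely to keep the sketch simple.

```python
from multiprocessing.pool import ThreadPool


def square(x: int) -> int:
    """Toy workload standing in for real per-item work."""
    return x * x


def map_with_explicit_cleanup(values, workers: int = 2):
    """Run square() over values and shut the pool down deterministically."""
    pool = ThreadPool(workers)
    try:
        return pool.map(square, values)
    finally:
        # close() stops new work from being submitted, join() waits for the
        # workers to exit, so nothing is left for __del__ at shutdown.
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(map_with_explicit_cleanup([1, 2, 3]))
```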

@Ivo-B
Owner

Ivo-B commented Nov 4, 2021

I had hoped that this error only shows up on Windows... it somehow happens at the very end of hydra

@hrhee
Collaborator Author

hrhee commented Jan 13, 2022

Current behaviour on maxwell:

$ dockerrun --gpus all --rm -it gitlab.desy.de:5555/franz.rhee/docker-maxwell:latest bash
docker_pwd$ python run_training.py mode=exp name=exp_test 

produces

[2022-01-13 17:56:21,203][cctest.executor.training][INFO] - Instantiating trainer <cctest.model.base_model_trainer.TrainingModule>
Error executing job with overrides: ['mode=exp', 'name=exp_test']
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/instantiate/_instantiate2.py", line 62, in _call_target
    return _target_(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'mixed_precision'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/docker_pwd/run_training.py", line 36, in main
    return train(config)
  File "/docker_pwd/cctest/executor/training.py", line 120, in train
    trainer: TrainingModule = hydra.utils.instantiate(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/instantiate/_instantiate2.py", line 180, in instantiate
    return instantiate_node(config, *args, recursive=_recursive_, convert=_convert_)
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/instantiate/_instantiate2.py", line 249, in instantiate_node
    return _call_target(_target_, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/instantiate/_instantiate2.py", line 64, in _call_target
    raise type(e)(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/instantiate/_instantiate2.py", line 62, in _call_target
    return _target_(*args, **kwargs)
TypeError: Error instantiating 'cctest.model.base_model_trainer.TrainingModule' : __init__() got an unexpected keyword argument 'mixed_precision'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@Ivo-B
Owner

Ivo-B commented Jan 14, 2022

5358e37 should fix this problem.
