
TransE - CUDA out of memory #1362

Open
Labels: bug Something isn't working
bolak92 opened this issue Jan 21, 2024 · 3 comments

Comments

@bolak92

bolak92 commented Jan 21, 2024

Describe the bug

Unlike the other models, when I train a TransE model it fails after a few epochs (around 19)
with a torch.cuda.OutOfMemoryError.
I tested this on several GPUs and machines and got the same result.

Training epochs on cuda:0:   6%| | 19/300 [12:49<3:09:42, 40.51s/epoch, loss=1.6
Traceback (most recent call last):
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1546, in pipeline
    stopper_instance, configuration, losses, train_seconds = _handle_training(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1190, in _handle_training
    losses = training_loop_instance.train(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 378, in train
    result = self._train(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 735, in _train
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 443, in post_epoch
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 367, in post_epoch
    if self.stopper.should_stop(epoch):
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/stoppers/early_stopping.py", line 230, in should_stop
    metric_results = self.evaluator.evaluate(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 213, in evaluate
    rv = evaluate(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 687, in evaluate
    relation_filter = _evaluate_batch(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 760, in _evaluate_batch
    scores = model.predict(hrt_batch=batch, target=target, slice_size=slice_size, mode=mode)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 481, in predict
    return self.predict_h(hrt_batch, **kwargs, heads=ids)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 372, in predict_h
    scores = self.score_h_inverse(rt_batch=rt_batch, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 528, in score_h_inverse
    return self.score_t(hr_batch=t_r_inv, tails=heads, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/nbase.py", line 505, in score_t
    scores=self.interaction.score(h=h, r=r, t=t, slice_size=slice_size, slice_dim=1),
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 265, in score
    return self(h=h, r=r, t=t)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 412, in forward
    return self.__class__.func(**self._prepare_for_functional(h=h, r=r, t=t))
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/functional.py", line 754, in transe_interaction
    return negative_norm_of_sum(h, r, -t, p=p, power_norm=power_norm)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 652, in negative_norm_of_sum
    return negative_norm(tensor_sum(*x), p=p, power_norm=power_norm)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 626, in tensor_sum
    return sum(_reorder(tensors=tensors))

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.90 GiB. GPU 0 has a total capacty of 23.70 GiB of which 7.60 GiB is free. Including non-PyTorch memory, this process has 16.10 GiB memory in use. Of the allocated memory 964.10 MiB is allocated by PyTorch, and 14.43 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
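
The allocator message above suggests setting max_split_size_mb to reduce fragmentation. A minimal sketch of that workaround, assuming the environment variable is set before torch initializes its CUDA allocator; the value 128 is only illustrative, not a tested setting:

import os

# Configure the CUDA caching allocator before torch initializes it;
# "max_split_size_mb:128" is an illustrative value, not a verified fix.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the allocator configuration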

How to reproduce

from pykeen.pipeline import pipeline

# train, test, and valid are the pre-split triples factories of the dataset
result = pipeline(
    training=train,
    testing=test,
    validation=valid,
    model="TransE",
    model_kwargs={"embedding_dim": 300, "scoring_fct_norm": 1},
    epochs=300,
    stopper="early",
    stopper_kwargs={"frequency": 10, "patience": 2},
    result_tracker="wandb",
    result_tracker_kwargs=dict(project="project_name"),
    device="cuda",
)
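
The traceback shows the OOM occurring inside the early stopper's evaluation (scoring all candidate heads), not in the training loop itself, so capping the evaluation batch size may be worth testing. A sketch under the assumption that the pipeline's evaluator_kwargs forwards batch_size to the rank-based evaluator; the value 512 and the kwarg routing are assumptions, not a confirmed fix:

from pykeen.pipeline import pipeline

# Same train/test/valid splits as above.
result = pipeline(
    training=train,
    testing=test,
    validation=valid,
    model="TransE",
    model_kwargs={"embedding_dim": 300, "scoring_fct_norm": 1},
    epochs=300,
    stopper="early",
    stopper_kwargs={"frequency": 10, "patience": 2},
    # Assumption: limit the evaluator's batch size instead of relying on the
    # automatic memory optimization to pick one; 512 is an illustrative value.
    evaluator_kwargs={"batch_size": 512},
    device="cuda",
)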

Environment

Unable to handle parameter in CooccurrenceFilteredModel: base

OS: posix
Platform: Linux
Release: 3.10.0-1160.15.2.el7.x86_64
Time: Sun Jan 21 22:36:32 2024
Python: 3.9.18
PyKEEN: 1.10.1
PyKEEN Hash: UNHASHED
PyKEEN Branch:
PyTorch: 2.1.2
CUDA Available?: true
CUDA Version: 11.8
cuDNN Version: 8700

Additional information

No response

Issue Template Checks

  • This is not a feature request (use a different issue template if it is)
  • This is not a question (use the discussions forum instead)
  • I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed
@bolak92 bolak92 added the bug label Jan 21, 2024
@lukas-schwab

I believe this bug affects other models as well. I'm running a TextRepresentation DistMult interaction model, and despite having 80 GB of VRAM, PyKEEN still tries to allocate 14.90 GiB more than I have. Coincidentally, that's OOM by exactly the same margin as in your example.

@mberr
Member

mberr commented Feb 19, 2024

Hi @bolak92,

Could you try whether #1261 solves your issue? It's not yet in a release, but you can use it by installing from source:

pip install git+https://github.com/pykeen/pykeen.git

@ddofer

ddofer commented May 8, 2024

I can confirm that I get this same issue on the latest version when using the Apple Silicon "mps" device,
i.e. it consistently crashes on evaluation due to OOM when using "mps" (MacBook Pro M3).
