Describe the bug
Unlike the other models, when I train a TransE model it fails after a few epochs (around 19) with a torch.cuda.OutOfMemoryError.
This was tested on several GPUs and machines, with the same result each time.
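For context, the failure occurs with a setup along the following lines. This is only an illustrative sketch: the dataset and hyperparameters are placeholders rather than the exact configuration (in the failing run the model is additionally wrapped in CooccurrenceFilteredModel, see the warning further below).

```python
from pykeen.pipeline import pipeline

# Illustrative sketch only: dataset and hyperparameters are placeholders,
# not the exact configuration that produced the log below.
result = pipeline(
    dataset="FB15k237",                    # placeholder dataset
    model="TransE",
    training_kwargs=dict(num_epochs=300),
    stopper="early",                       # early stopping triggers periodic evaluation
    stopper_kwargs=dict(frequency=10, patience=2),
    device="cuda:0",
)
```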
Training epochs on cuda:0: 6%| | 19/300 [12:49<3:09:42, 40.51s/epoch, loss=1.6
Traceback (most recent call last):
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1546, in pipeline
stopper_instance, configuration, losses, train_seconds = _handle_training(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1190, in _handle_training
losses = training_loop_instance.train(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 378, in train
result = self._train(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 735, in _train
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 443, in post_epoch
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 367, in post_epochifself.stopper.should_stop(epoch):
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/stoppers/early_stopping.py", line 230, in should_stop
metric_results = self.evaluator.evaluate(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 213, in evaluate
rv = evaluate(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 687, in evaluate
relation_filter = _evaluate_batch(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 760, in _evaluate_batch
scores = model.predict(hrt_batch=batch, target=target, slice_size=slice_size, mode=mode)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 481, in predictreturnself.predict_h(hrt_batch, **kwargs, heads=ids)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 372, in predict_h
scores =self.score_h_inverse(rt_batch=rt_batch, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 528, in score_h_inversereturnself.score_t(hr_batch=t_r_inv, tails=heads, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/nbase.py", line 505, in score_t
scores = self.interaction.score(h=h, r=r, t=t, slice_size=slice_size, slice_dim=1),
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 265, in score
return self(h=h, r=r, t=t)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 412, in forward
return self.__class__.func(**self._prepare_for_functional(h=h, r=r, t=t))
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/functional.py", line 754, in transe_interaction
return negative_norm_of_sum(h, r, -t, p=p, power_norm=power_norm)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 652, in negative_norm_of_sum
return negative_norm(tensor_sum(*x), p=p, power_norm=power_norm)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 626, in tensor_sum
return sum(_reorder(tensors=tensors))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.90 GiB. GPU 0 has a total capacty of 23.70 GiB of which 7.60 GiB is free. Including non-PyTorch memory, this process has 16.10 GiB memory in use. Of the allocated memory 964.10 MiB is allocated by PyTorch, and 14.43 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
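Note from the error text above that only about 964 MiB is actually allocated by PyTorch while 14.43 GiB is reserved but unallocated, so this looks more like allocator fragmentation than the model genuinely needing an extra 14.9 GiB. The message suggests trying max_split_size_mb; a minimal way to set it from Python (the value below is just an example, not a tuned recommendation):

```python
import os

# Allocator hint suggested by the OOM message above. Set it before CUDA is
# initialized; setting it before the first `import torch` is the safe option.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # noqa: E402  -- imported only after configuring the allocator
```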
Environment
Unable to handle parameter in CooccurrenceFilteredModel: base
| Key | Value |
|-----|-------|
| OS | posix |
| Platform | Linux |
| Release | 3.10.0-1160.15.2.el7.x86_64 |
| Time | Sun Jan 21 22:36:32 2024 |
| Python | 3.9.18 |
| PyKEEN | 1.10.1 |
| PyKEEN Hash | UNHASHED |
| PyKEEN Branch | |
| PyTorch | 2.1.2 |
| CUDA Available? | true |
| CUDA Version | 11.8 |
| cuDNN Version | 8700 |
Additional information
No response
Issue Template Checks
This is not a feature request (use a different issue template if it is)
This is not a question (use the discussions forum instead)
I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed
I believe this is a bug for other models as well. I'm running a DistMult interaction model with TextRepresentation, and despite having 80 GB of VRAM, PyKEEN still tries to allocate 14.90 GiB more than is available. Coincidentally, that's OOM by exactly the same margin as in your example.
I can confirm that I get this same issue on the latest version when using the Apple Silicon / "mps" device, i.e. it consistently crashes during evaluation due to OOM when using "mps" (MacBook Pro M3).
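A possible workaround, untested and with keyword names recalled from memory rather than checked against the docs, is to cap the evaluation batch size explicitly instead of letting the evaluator search for one, both for the early stopper's intermediate evaluations (where the traceback above shows the OOM being raised) and for the final evaluation:

```python
from pykeen.pipeline import pipeline

# Untested sketch: verify the keyword names against the installed PyKEEN
# version before relying on them.
result = pipeline(
    dataset="FB15k237",  # placeholder
    model="TransE",
    stopper="early",
    # cap the batch size used by the early stopper's evaluations
    stopper_kwargs=dict(frequency=10, evaluation_batch_size=128),
    # cap the batch size used by the final evaluation
    evaluation_kwargs=dict(batch_size=128),
)
```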