Describe the bug
Unlike the other models, when I train a TransE model it fails after a few epochs (around 19) with a torch.cuda.OutOfMemoryError.
This was tested on several GPUs and machines, with the same result each time.
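For context, the failure occurs with a setup along the following lines. This is only an illustrative sketch: the dataset and hyperparameters are placeholders rather than the exact configuration (in the failing run the model is additionally wrapped in CooccurrenceFilteredModel, see the warning further below).

```python
from pykeen.pipeline import pipeline

# Illustrative sketch only: dataset and hyperparameters are placeholders,
# not the exact configuration that produced the log below.
result = pipeline(
    dataset="FB15k237",                    # placeholder dataset
    model="TransE",
    training_kwargs=dict(num_epochs=300),
    stopper="early",                       # early stopping triggers periodic evaluation
    stopper_kwargs=dict(frequency=10, patience=2),
    device="cuda:0",
)
```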
Training epochs on cuda:0: 6%| | 19/300 [12:49<3:09:42, 40.51s/epoch, loss=1.6
Traceback (most recent call last):
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1546, in pipeline
stopper_instance, configuration, losses, train_seconds = _handle_training(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1190, in _handle_training
losses = training_loop_instance.train(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 378, in train
result = self._train(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 735, in _train
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 443, in post_epoch
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 367, in post_epochifself.stopper.should_stop(epoch):
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/stoppers/early_stopping.py", line 230, in should_stop
metric_results = self.evaluator.evaluate(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 213, in evaluate
rv = evaluate(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 687, in evaluate
relation_filter = _evaluate_batch(
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 760, in _evaluate_batch
scores = model.predict(hrt_batch=batch, target=target, slice_size=slice_size, mode=mode)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 481, in predictreturnself.predict_h(hrt_batch, **kwargs, heads=ids)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 372, in predict_h
scores =self.score_h_inverse(rt_batch=rt_batch, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 528, in score_h_inversereturnself.score_t(hr_batch=t_r_inv, tails=heads, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/nbase.py", line 505, in score_t
scores = self.interaction.score(h=h, r=r, t=t, slice_size=slice_size, slice_dim=1),
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 265, in score
return self(h=h, r=r, t=t)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 412, in forward
return self.__class__.func(**self._prepare_for_functional(h=h, r=r, t=t))
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/functional.py", line 754, in transe_interaction
return negative_norm_of_sum(h, r, -t, p=p, power_norm=power_norm)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 652, in negative_norm_of_sum
return negative_norm(tensor_sum(*x), p=p, power_norm=power_norm)
File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 626, in tensor_sum
return sum(_reorder(tensors=tensors))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.90 GiB. GPU 0 has a total capacty of 23.70 GiB of which 7.60 GiB is free. Including non-PyTorch memory, this process has 16.10 GiB memory in use. Of the allocated memory 964.10 MiB is allocated by PyTorch, and 14.43 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
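Note from the error text above that only about 964 MiB is actually allocated by PyTorch while 14.43 GiB is reserved but unallocated, so this looks more like allocator fragmentation than the model genuinely needing an extra 14.9 GiB. The message suggests trying max_split_size_mb; a minimal way to set it from Python (the value below is just an example, not a tuned recommendation):

```python
import os

# Allocator hint suggested by the OOM message above. Set it before CUDA is
# initialized; setting it before the first `import torch` is the safe option.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # noqa: E402  -- imported only after configuring the allocator
```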
Environment
Unable to handle parameter in CooccurrenceFilteredModel: base
| Key | Value |
|-----|-------|
| OS | posix |
| Platform | Linux |
| Release | 3.10.0-1160.15.2.el7.x86_64 |
| Time | Sun Jan 21 22:36:32 2024 |
| Python | 3.9.18 |
| PyKEEN | 1.10.1 |
| PyKEEN Hash | UNHASHED |
| PyKEEN Branch | |
| PyTorch | 2.1.2 |
| CUDA Available? | true |
| CUDA Version | 11.8 |
| cuDNN Version | 8700 |
Additional information
No response
Issue Template Checks
This is not a feature request (use a different issue template if it is)
This is not a question (use the discussions forum instead)
I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed
I believe this is a bug for other models as well. I'm running a DistMult interaction model with TextRepresentation, and despite having 80 GB of VRAM, PyKEEN still tries to allocate 14.90 GiB more than is available. Coincidentally, that's OOM by exactly the same margin as in your example.
I can confirm that I get this same issue on the latest version when using the Apple Silicon / "mps" device, i.e. it consistently crashes during evaluation due to OOM when using "mps" (MacBook Pro M3).
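A possible workaround, untested and with keyword names recalled from memory rather than checked against the docs, is to cap the evaluation batch size explicitly instead of letting the evaluator search for one, both for the early stopper's intermediate evaluations (where the traceback above shows the OOM being raised) and for the final evaluation:

```python
from pykeen.pipeline import pipeline

# Untested sketch: verify the keyword names against the installed PyKEEN
# version before relying on them.
result = pipeline(
    dataset="FB15k237",  # placeholder
    model="TransE",
    stopper="early",
    # cap the batch size used by the early stopper's evaluations
    stopper_kwargs=dict(frequency=10, evaluation_batch_size=128),
    # cap the batch size used by the final evaluation
    evaluation_kwargs=dict(batch_size=128),
)
```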