
pickle is a security issue #52596

Open

KOLANICH opened this issue Feb 22, 2021 · 11 comments
Labels

module: hub · module: pickle (Problems related to pickling of PyTorch objects) · module: serialization (Issues related to serialization, e.g. via pickle or otherwise, of PyTorch objects) · topic: security · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@KOLANICH

KOLANICH commented Feb 22, 2021

🚀 Feature

We need to do something about it.

Motivation

Pickle is a security issue that can be used to hide backdoors. Unfortunately, lots of projects keep using torch.save and torch.load.

Pitch

  • make torch.load use pickle only as a serialization format: process pickle files with a restricted virtual machine of PyTorch's own (https://github.com/CensoredUsername/picklemagic can be helpful) that performs only allowed operations in a completely controlled way, instead of relying on the standard pickle machinery
  • replace it with ONNX
  • deprecate torch.load and torch.save
  • remove torch.save, or make it save to ONNX (a sketch of the ONNX route follows this list)
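A minimal sketch of what the ONNX route could look like with today's API, using torch.onnx.export (the model and file name here are illustrative):

```python
import torch

# An ONNX file stores a computation graph plus weights; loading it does not
# unpickle arbitrary Python objects the way pickle-based torch.load does.
model = torch.nn.Linear(4, 2)
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "model.onnx")
```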

Alternatives

  • support pickle via a VM indefinitely.

cc @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @ailzhang

@vadimkantorov
Contributor

vadimkantorov commented Feb 22, 2021

Related: #52181 ... One of my suggestions was supporting hdf5 for safe weights storage

@rgommers
Collaborator

It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html:

[screenshot of the security warning from the torch.load documentation]

Deprecating doesn't seem warranted; the Python pickle module itself (https://docs.python.org/3/library/pickle.html) has a similar warning. As does numpy.load (the issue is smaller there, but it exists and will remain with a docs warning).

A good alternative does indeed seem needed. That part of the issue seems to be a duplicate of gh-52181.

@vadimkantorov
Contributor

vadimkantorov commented Feb 22, 2021

torch.hub.load doesn't have such a big, clear notice; it may be good if it printed a UserWarning, which is easier to notice (not everyone would check the docs for these "simple" methods). I also didn't see such big red warnings at model zoos like Hugging Face.

A nightmare scenario: some "cool" recent model gets uploaded to Hugging Face by malicious actors, publicized on Twitter, and then gains backdoor access at research labs all over the world... Since loading it must execute code, I propose a mandatory keyword argument like this_is_dangerous_and_can_in_principle_execute_malicious = True on these code-executing methods, so that the user is actively aware of this.
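A minimal sketch of what such an opt-in could look like as a user-side wrapper; guarded_load and the argument name are the commenter's suggestion as illustrated here, not real PyTorch API:

```python
import torch

# Hypothetical wrapper: refuse to load until the caller explicitly
# acknowledges that unpickling can execute arbitrary code.
def guarded_load(path, *, this_is_dangerous_and_can_in_principle_execute_malicious=False):
    if not this_is_dangerous_and_can_in_principle_execute_malicious:
        raise RuntimeError(
            "Loading this file can execute arbitrary code; pass "
            "this_is_dangerous_and_can_in_principle_execute_malicious=True "
            "to acknowledge the risk."
        )
    return torch.load(path)

# model = guarded_load("model.pt", this_is_dangerous_and_can_in_principle_execute_malicious=True)
```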

@KOLANICH
Author

KOLANICH commented Feb 22, 2021

It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html:

It doesn't prevent people from loading models with it, or from storing models in this format.

i propose to have some this_is_dangerous_and_can_in_principle_execute_malicious = True mandatory keyword argument to these code-executing methods, so that the user is actively aware of this

Not enough. Phasing it out is the only feasible way in the long term. In the short term, the solution is to defuse the pickle that is used, and to make it clear not only to the devs of the software but also to its users that they should blame devs who still use pickle. Such dangerous temporary solutions, when introduced, should have a clear deadline. Also, IMHO, it was a big mistake to introduce pickle into Python at all.

@bdhirsh bdhirsh added module: pickle Problems related to pickling of PyTorch objects triage review labels Feb 22, 2021
@zou3519 zou3519 added the module: serialization Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects label Mar 1, 2021
@anjali411 anjali411 added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Mar 1, 2021
@zou3519
Contributor

zou3519 commented Mar 1, 2021

cc @ailzhang for hub -- what do you think? Folks are concerned about who can post to hub and whether downloaded models can contain viruses.

cc @suo for package or deploy: I remember there was some discussion around unifying the serialization behavior of torch.save/load (and using zipfile serialization?). One of the things proposed in this issue is to change the serialization format for PyTorch. We probably don't want to go down the route of yet another serialization format here.

@vadimkantorov
Contributor

For zipfile: I hope it doesn't resort to the good old pickle-based torch.load to handle loading of weights from inside the zipfile.

In my opinion, all existing frameworks, including TF/tfjs, have screwed up weights serialization, so we have already ended up with 10 different formats :( So adding another "good" one should not be a problem, if it is secure, performant, and interoperable.

@vadimkantorov
Contributor

Also, I wonder if Hugging Face does anything about verifying who submits models there. I think the Hugging Face hub is now getting more popular than the slow-moving Torch Hub.

@KOLANICH
Author

KOLANICH commented Mar 1, 2021

I don't think WHO sends the model is the main issue. If the models themselves cannot be Turing-complete, if the combination of a model and the PyTorch machinery by itself cannot be Turing-complete, and if models don't allow privileged operations within the computation graph (such as calling any API whose effect is not a simple computation), then we are likely safe, provided that operations on numbers are implemented correctly (they can be not: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3177).

So at first we should concentrate not on WHO can upload models, but on WHAT models can do. Authenticating models via digital signatures is clearly useful, but its effect will be very minor.

The problem with models is that they are large, binary and hard to audit for humans. It won't help much to know that the model was published by huggingface, if we still don't trust huggingface and have no real reasons to trust them.

Also, there are parties other than huggingface. Should we not use their models because we don't trust them?

So what we need in the first place is not attribution, but a proof that no model can compromise the integrity of a system or the confidentiality of the information processed in it, under some sensible threat model. Then we trust not the persons issuing the models, but the belief that our threat model is sensible.

@malfet
Contributor

malfet commented Sep 7, 2022

After looking at the https://github.com/python/cpython/blob/0d04b8d9e1953d2311f78b772f21c9e07fbcbb6d/Lib/pickle.py#L1137 implementation (which can be considered a reference implementation of the feature), it looks like the above requirements can be achieved by restricting which classes/methods are safe to instantiate via the find_class method, which torch.load already overrides in

pytorch/torch/serialization.py

Lines 1086 to 1093 in 1a33e94

def find_class(self, mod_name, name):
    if type(name) is str and 'Storage' in name:
        try:
            return StorageType(name)
        except KeyError:
            pass
    mod_name = load_module_mapping.get(mod_name, mod_name)
    return super().find_class(mod_name, name)
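For illustration, a minimal sketch of the restriction described above, modeled on the "restricting globals" pattern from the Python pickle documentation; the allowlist contents here are illustrative, not PyTorch's actual policy:

```python
import io
import pickle

# Illustrative allowlist; a real one would cover all tensor/storage rebuild helpers.
_ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, mod_name, name):
        # Resolve only allowlisted globals; refuse everything else instead of
        # importing arbitrary callables named by the pickle stream.
        if (mod_name, name) not in _ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(
                f"global '{mod_name}.{name}' is forbidden")
        return super().find_class(mod_name, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```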

pytorchmergebot pushed a commit that referenced this issue Oct 21, 2022
This addresses the security issue in the default Python `unpickler` that allows arbitrary code execution while unpickling. Classes allowed to be unpickled are restricted to `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict`, as well as `torch.Size`, `torch.nn.Param`, and the `torch.Tensor` and `torch.Storage` variants.

By default `weights_only` is set to `False`, but a global override to safe-only loading is available via the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable.

To some extent, addresses #52596
Pull Request resolved: #86812
Approved by: https://github.com/ezyang
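A minimal usage sketch of the option added in #86812 (the file name here is illustrative):

```python
import torch

# Save only the state_dict (tensors in basic containers), not a pickled module.
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "weights.pt")

# weights_only=True restricts the unpickler to the allowlisted types above;
# pickles that reference other globals raise an error instead of running code.
state_dict = torch.load("weights.pt", weights_only=True)
model.load_state_dict(state_dict)
```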
malfet added a commit that referenced this issue Oct 21, 2022

(same `weights_only` change as above; cherry picked from commit 961ebca)
atalman pushed a commit that referenced this issue Oct 21, 2022
* Tweak several serialization tests to store models' state_dict (#87143)

Namely, change:
- `test_meta_serialization`
- `test_serialization_2gb_file`
- `test_pathlike_serialization`
Pull Request resolved: #87143
Approved by: https://github.com/ezyang

(cherry picked from commit 4a533f1)

* Add `weights_only` option to `torch.load` (#86812)

(same `weights_only` change as above; cherry picked from commit 961ebca)
@dhrubo-os

dhrubo-os commented Mar 29, 2023

Is there any update on this issue?

Is this also true for TorchScript files? If I trace a model and then load it, will this risk still be there? In my understanding, when we trace a model, only the weights can be picked up.

Can anybody please confirm?

@KOLANICH, @malfet @vadimkantorov

Thanks.

@JaimeArboleda

Hello! I am concerned about this issue. In our organization, the security team wants to avoid these types of risks, but on the other hand, we data scientists want to work with the torch hub and the Hugging Face hub. I was wondering if there is any way of checking the safety of a torch model in an environment like Google Colab prior to downloading it onto our private servers. Do you think this could work?
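One possible approach is to disassemble the pickle stream with the standard pickletools module, which inspects opcodes without executing anything. A minimal sketch, assuming the zip-based torch.save format where the pickle lives in a data.pkl entry; list_imports is an illustrative helper, not a vetted security tool:

```python
import pickletools
import zipfile

# A zip-format torch.save checkpoint keeps its pickle stream in a
# "<archive name>/data.pkl" entry; pickletools disassembles it without executing it.
def list_imports(checkpoint_path):
    imports = []
    with zipfile.ZipFile(checkpoint_path) as zf:
        for entry in zf.namelist():
            if not entry.endswith("data.pkl"):
                continue
            data = zf.read(entry)
            recent_strings = []  # STACK_GLOBAL reads module/name from prior string opcodes
            for opcode, arg, _pos in pickletools.genops(data):
                if isinstance(arg, str):
                    recent_strings = (recent_strings + [arg])[-2:]
                if opcode.name == "GLOBAL":
                    imports.append((entry, arg))  # arg is "module name"
                elif opcode.name == "STACK_GLOBAL" and len(recent_strings) == 2:
                    imports.append((entry, " ".join(recent_strings)))
    return imports

# Anything beyond torch rebuild helpers and basic collections in this list
# deserves scrutiny before the file is ever passed to torch.load.
print(list_imports("model.pt"))
```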
