TensorFlow binary crashes on Apple M1 in x86_64 Docker container #52845

Closed
dwyatte opened this issue Oct 28, 2021 · 50 comments
Assignees
Labels
2.6.0 stat:awaiting response Status - Awaiting response from author subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues type:bug Bug type:build/install Build and install issues

Comments

@dwyatte
Contributor

dwyatte commented Oct 28, 2021


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): TensorFlow 2.6.0, tf-nightly 2.8.0.dev20211028
  • Python version: 3.6.9, 3.7.x, 3.8.x
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Describe the current behavior

dwyatte-macbookpro:~ dwyatte$ docker run tensorflow/tensorflow:latest python -c "import tensorflow as tf"    
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
2021-10-28 22:50:41.481158: F tensorflow/core/lib/monitoring/sampler.cc:42] Check failed: bucket_limits_[i] > bucket_limits_[i - 1] (0 vs. 10)
qemu: uncaught target signal 6 (Aborted) - core dumped

Describe the expected behavior
Clean exit

Standalone code to reproduce the issue
Requires an Apple M1 (arm64) host OS:
docker run tensorflow/tensorflow:latest python -c "import tensorflow as tf"

This was previously reported in #42387, but that issue was unfortunately closed. When importing TensorFlow in an x86_64 Docker container on an Apple M1, TensorFlow crashes. As far as I can tell, this should work, since I can import and use other Python packages in the same container without problems (including things like numpy).

It's unclear whether this is something that can be avoided at the TensorFlow level or an unavoidable bug in qemu ([1], [2]), but I wanted to re-raise the issue.

@dwyatte dwyatte added the type:bug Bug label Oct 28, 2021
@mohantym mohantym added type:build/install Build and install issues subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues 2.6.0 labels Oct 29, 2021
@mohantym
Contributor

Hi @dwyatte! Could you check these threads? link1, link2

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Oct 29, 2021
@dwyatte
Contributor Author

dwyatte commented Oct 29, 2021

Thanks @mohantym

The links just reference the warning above which I believe is innocuous since Docker can emulate the image's platform. TensorFlow doesn't publish official linux/arm64/v8 images (would require an aarch64 TensorFlow build), but I would think that would remove the warning. Note that the problem is specifically with TensorFlow's assumptions about the emulated platform and not the image or other libraries, which run fine when emulating linux/amd64:

dwyatte-macbookpro:~ dwyatte$ docker run tensorflow/tensorflow:latest python -c "import numpy as np; print(np.random.rand(10))"   
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
[0.86125896 0.40657583 0.76832123 0.77205272 0.99326573 0.513298
 0.64218547 0.15977918 0.37553315 0.56692333]

I suspect Check failed: bucket_limits_[i] > bucket_limits_[i - 1] (0 vs. 10) is a sanity check that TensorFlow runs on startup that fails under emulation. IMO this issue is about whether there is anything that can be done on the TensorFlow side to relax or correct this check or whether this is a critical check that is violated e.g., by qemu (https://gitlab.com/qemu-project/qemu/-/issues/601 suggests it could be floating point inaccuracy, although that seems to just be a guess).
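To make the failing condition concrete, here is a minimal Python sketch of the kind of check the log message describes: every bucket limit must be strictly greater than the one before it. This is not TensorFlow's source. The names check_bucket_limits and exponential_buckets are hypothetical, and the assumption that the limits come from an exponential series is only for illustration.

```python
def exponential_buckets(scale, growth_factor, bucket_count):
    """Build limits scale * growth_factor**i (hypothetical helper, for illustration)."""
    return [scale * growth_factor ** i for i in range(bucket_count)]


def check_bucket_limits(bucket_limits):
    """Raise ValueError unless the limits are strictly increasing."""
    for i in range(1, len(bucket_limits)):
        if not bucket_limits[i] > bucket_limits[i - 1]:
            # Same ordering as the log line: current value first, previous second.
            raise ValueError(
                f"Check failed: bucket_limits[{i}] > bucket_limits[{i - 1}] "
                f"({bucket_limits[i]} vs. {bucket_limits[i - 1]})"
            )
    return True


# A well-formed series passes the check.
check_bucket_limits(exponential_buckets(1.0, 2.0, 5))
```

If a floating point operation returned a wrong value under emulation (for example 0 where a larger number was expected), a check like this would abort in the same way as the log above, even though the calling code is correct.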

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 31, 2021
@mohantym
Contributor

mohantym commented Nov 2, 2021

Hi @sanatmpa1! Could you please look at this issue?

@mohantym mohantym assigned sanatmpa1 and unassigned mohantym Nov 2, 2021
@vazkir

vazkir commented Nov 2, 2021

I am taking a class where we use TensorFlow inside Docker containers, and everybody in that class with an M1 Mac, including me, had this exact same issue. Unfortunately nobody has found a fix, so I am going to subscribe to this issue as well. I hope some kind of workaround or solution exists!

@alexcombessie

Hi,

I have the exact same issue. It is hindering my development process. While my app is deployed on an x86 server, I do need to use my M1 mac with emulation to develop code locally and to push it to production.

All other major data science packages work correctly under x86 rosetta emulation: pandas, scikit-learn, torch, transformers, spacy, xgboost, lightgbm.

I appreciate the great work you are doing with TensorFlow. I would be really grateful if you could take the time to help the data scientists / ML engineers out there who are using ARM-based development laptops.

Thanks a lot,

Alex

PS: I am not interested in forks like tensorflow-macos etc as I need my work to be cross-platform.

@bhack
Contributor

bhack commented Nov 7, 2021

apple/tensorflow_macos#164 (comment)

https://github.com/ARM-software/Tool-Solutions/tree/master/docker/tensorflow-aarch64

But since some people still need to use this under emulation, I suppose in that case it could be a QEMU bug with DBL_MAX in emulation.

@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 11, 2021
@vazkir

vazkir commented Nov 14, 2021

Did anybody find any way to run tensorflow inside a docker container on any M1, M1 Pro or M1 Max device? Would really love to know any workaround so I can start building containers with tf. Thanks in advance for any tips!

@bhack
Contributor

bhack commented Nov 14, 2021

If the point is to have a published x86 wheel without AVX, we already have an open ticket, so it is better to add a comment there instead of opening a new ticket:

#19584

If instead you want to have AVX TCG support in QEMU e.g. on M1 there is already an open ticket at:
https://gitlab.com/qemu-project/qemu/-/issues/164

@dwyatte
Contributor Author

dwyatte commented Nov 15, 2021

So I do think this is due to AVX instructions. If I install an unofficial wheel (e.g., from yaroslavvb/tensorflow-community-wheels#198) and run a variant of the docker run command above, I do not get a crash on import.

dwyatte-macbookpro:~ dwyatte$ docker run -it tensorflow/tensorflow:latest bash -c 'pip uninstall -y tensorflow-cpu && pip install -U https://tf.novaal.de/barcelona/tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl && python -c "import tensorflow as tf; tf.print(\"hello world\")"'
...
2021-11-15 23:44:35.660302: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
hello world

Thanks for the lead @bhack. I agree, some solutions which you mention are:
1.) Publishing non-AVX wheels (or having non-AVX code paths available within a single wheel)
2.) Correctly handling in qemu via emulation/TCG/etc.
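As a rough way to see whether a container advertises AVX at all, the flags line of /proc/cpuinfo can be inspected. The sketch below is only a hint, under two assumptions: that the kernel exposes an x86-style "flags" line, and that the file reflects the emulated CPU. Under user-mode emulation the file may instead reflect the host, so a negative or empty result is not conclusive. The function names cpu_flags and has_avx are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    # No "flags" line (e.g. an ARM kernel, which lists "Features" instead).
    return set()


def has_avx(cpuinfo_text):
    """True if the text lists the 'avx' flag."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # Not on Linux, or the file is unavailable.
    print("AVX reported:", has_avx(text))
```

A wheel built without AVX, like the community wheel above, sidesteps the question, which matches the SSE4.1/SSE4.2-only message in its startup log.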

@bhack
Contributor

bhack commented Nov 16, 2021

For the first point, I don't know if anyone at @Intel-tensorflow is interested in publishing an SSE4.x-only wheel at https://pypi.org/project/intel-tensorflow/

@gabac

gabac commented Nov 18, 2021

@dwyatte Thanks a lot for the tip. With an unofficial wheel I was able to get Tensorflow running within Docker on an Apple M1 processor 🚀

@janvdp

janvdp commented Nov 19, 2021

@gabac One you built or one that is available online? I'm facing the same issue...

@gabac

gabac commented Nov 19, 2021

For example, if you use pip as your package manager, run pip install -U https://tf.novaal.de/barcelona/tensorflow-2.5.0-cp37-cp37m-linux_x86_64.whl for Python 3.7 and TensorFlow 2.5.0.

@janvdp

janvdp commented Nov 19, 2021

Thanks, that did the trick! Unfortunately, Docker M1 Mac seems to be pretty slow... :( (not talking about training...)

@bhack
Contributor

bhack commented Nov 19, 2021

For performance you need to use tensorflow-macos

@josemiguelalves

any update?

@harraz

harraz commented Oct 7, 2022

I've tried Tensorflow 2.3.1 and I still get F tensorflow/core/lib/monitoring/sampler.cc:42] Check failed: bucket_limits_[i] > bucket_limits_[i - 1] (0 vs. 10) qemu: uncaught target signal 6 (Aborted) - core dumped Any suggestions would be great - thanks.

Any luck with this issue? I get this when I try to import tensorflow in Python.

@dwyatte
Contributor Author

dwyatte commented Oct 8, 2022

While this issue was originally opened around emulating TensorFlow on x86_64 in Docker, it does look like there are now tensorflow aarch64 binaries that can be used in linux/arm64/v8 Docker containers. More info here: https://blog.tensorflow.org/2022/09/announcing-tensorflow-official-build-collaborators.html

Dockerfile

FROM python:3.7-slim

RUN pip install tensorflow==2.10.0 tensorflow-io==0.27.0
CMD python -c "import tensorflow as tf; print(tf.constant(42) / 2 + 2)"
docker build --platform=linux/arm64/v8 . -t tensorflow
docker run --platform=linux/arm64/v8 tensorflow 
tf.Tensor(23.0, shape=(), dtype=float64)
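To confirm that a container is really running natively rather than under x86_64 emulation, a small check of the reported machine architecture can help. This is a sketch using only the standard library. The helper name is_native_arm64 is hypothetical, and it assumes that an emulated linux/amd64 container reports x86_64 while a native linux/arm64/v8 container reports aarch64.

```python
import platform


def is_native_arm64(machine=None):
    """True if the given (or current) machine string names a 64-bit ARM architecture."""
    machine = (machine or platform.machine()).lower()
    # Linux reports "aarch64"; macOS reports "arm64".
    return machine in {"aarch64", "arm64"}


if __name__ == "__main__":
    print("machine:", platform.machine())
    print("native arm64:", is_native_arm64())
```

Running this inside the image built above, for example by swapping it into the CMD, gives a quick sanity check before debugging any TensorFlow import problems.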

@sachinprasadhs
Contributor

@dwyatte , Thanks for confirming, if your issue is resolved, could you please close this issue.
Also, refer to https://www.tensorflow.org/install for the latest install instructions. Thanks!

@sachinprasadhs sachinprasadhs added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Nov 30, 2022
@dwyatte
Contributor Author

dwyatte commented Dec 4, 2022

@dwyatte , Thanks for confirming, if your issue is resolved, could you please close this issue.

Sure, I think we can close this now. QEMU also appears to have merged support for AVX instructions, so once that is pulled into Docker, it might also be possible to run via emulation.

https://gitlab.com/qemu-project/qemu/-/issues/164#note_1140802183

@dwyatte dwyatte closed this as completed Dec 4, 2022

@fumoboy007

fumoboy007 commented Dec 5, 2022

@sachinprasadhs Will Google release prebuilt ARM64 Docker images to Docker Hub? I’m especially interested in an ARM64 tensorflow/serving image.

@sachinprasadhs
Contributor

CC:@angerson , @learning-to-play

@learning-to-play
Collaborator

Thanks for reaching out! I'm not aware of any plans to release prebuilt ARM64 Docker images.

@fumoboy007

@learning-to-play It would be great for the community if we had prebuilt images for all architectures that we support. 🙏

DGaffney added a commit to meedan/alegre that referenced this issue Dec 13, 2022
* Use community version of Tensorflow that works with M1

The TensorFlow binary downloaded from a normal TensorFlow 2.3.1 pip install (from requirements)
was crashing when we used the linux/x86_64 emulated arch with M1 macs (which is needed
because TensorFlow does not yet have an arm-supported version).

To solve this, we are using a community wheel of Tensorflow 2.3.1 compiled as we need it.

More on this here: tensorflow/tensorflow#52845

Paired with Ahmed!

CHECK-2147

@hkayann

hkayann commented Nov 22, 2024

I can’t believe this issue still exists in 2024. I tried installing TensorFlow with Anaconda, Docker, and specific TensorFlow versions, but none worked. Finally, this is how I solved it:

Step 1: Install Miniforge

Miniforge is optimized for Apple Silicon.

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
source ~/miniforge3/bin/activate

Step 2: Create a New Environment and Install TensorFlow

1. Create and activate a new Conda environment:

conda create -n tensorflow_env python=3.10 -y
conda activate tensorflow_env

2. Install TensorFlow:

conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal

Step 3: Test the Installation

Run the following script to verify:

import tensorflow as tf
print("TensorFlow Version:", tf.__version__)
print("Is GPU available:", tf.config.list_physical_devices('GPU'))

This issue is about Docker; please check the title.
