
Intermittent test case failure on TensorFlow GPU env #20027

Open
shashaka opened this issue Jul 22, 2024 · 6 comments

Comments

@shashaka
Contributor

In keras/src/trainers/data_adapters/generator_data_adapter_test.py, I found that a test case fails intermittently in the TensorFlow GPU environment.
The failure is related to the test_basic_flow method in that test file, so I put together the following repro code on my local machine.

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math

import jax
import numpy as np
import tensorflow as tf
import torch
from absl.testing import parameterized
from jax import numpy as jnp

from keras.src import backend
from keras.src import testing
from keras.src.trainers.data_adapters import generator_data_adapter


def example_generator(x, y, sample_weight=None, batch_size=32):
    def make():
        for i in range(math.ceil(len(x) / batch_size)):
            low = i * batch_size
            high = min(low + batch_size, len(x))
            batch_x = x[low:high]
            batch_y = y[low:high]
            if sample_weight is not None:
                yield batch_x, batch_y, sample_weight[low:high]
            else:
                yield batch_x, batch_y

    return make


class TestCase(testing.TestCase, parameterized.TestCase):

    def test_basic_flow(self, use_sample_weight, generator_type):
        x = np.random.random((34, 4)).astype("float32")
        y = np.array([[i, i] for i in range(34)], dtype="float32")
        sw = np.random.random((34,)).astype("float32")
        if generator_type == "tf":
            x, y, sw = tf.constant(x), tf.constant(y), tf.constant(sw)
        elif generator_type == "jax":
            x, y, sw = jnp.array(x), jnp.array(y), jnp.array(sw)
        elif generator_type == "torch":
            x, y, sw = (
                torch.as_tensor(x),
                torch.as_tensor(y),
                torch.as_tensor(sw),
            )
        if not use_sample_weight:
            sw = None
        make_generator = example_generator(
            x,
            y,
            sample_weight=sw,
            batch_size=16,
        )

        adapter = generator_data_adapter.GeneratorDataAdapter(make_generator())
        if backend.backend() == "numpy":
            it = adapter.get_numpy_iterator()
            expected_class = np.ndarray
        elif backend.backend() == "tensorflow":
            it = adapter.get_tf_dataset()
            expected_class = tf.Tensor
        elif backend.backend() == "jax":
            it = adapter.get_jax_iterator()
            expected_class = (
                jax.Array if generator_type == "jax" else np.ndarray
            )
        elif backend.backend() == "torch":
            it = adapter.get_torch_dataloader()
            expected_class = torch.Tensor

        sample_order = []
        for i, batch in enumerate(it):
            if use_sample_weight:
                self.assertEqual(len(batch), 3)
                bx, by, bsw = batch
            else:
                self.assertEqual(len(batch), 2)
                bx, by = batch
            self.assertIsInstance(bx, expected_class)
            self.assertIsInstance(by, expected_class)
            self.assertEqual(bx.dtype, by.dtype)
            self.assertContainsExactSubsequence(str(bx.dtype), "float32")
            if i < 2:
                self.assertEqual(bx.shape, (16, 4))
                self.assertEqual(by.shape, (16, 2))
            else:
                self.assertEqual(bx.shape, (2, 4))
                self.assertEqual(by.shape, (2, 2))
            if use_sample_weight:
                self.assertIsInstance(bsw, expected_class)
            for i in range(by.shape[0]):
                sample_order.append(by[i, 0])
        self.assertAllClose(sample_order, list(range(34)))

        print(f"*" * 50)


for _ in range(1000):
    TestCase().test_basic_flow(True, 'tf')
    print("All passed!")

And I got the error below. Most runs succeeded, but some failed.

InvalidArgumentError                      Traceback (most recent call last)
Cell In[2], line 85
     81         print(f"*" * 50)
     84 for _ in range(1000):
---> 85     TestCase().test_basic_flow(True, 'tf')
     86     print("All passed!")

Cell In[2], line 78, in TestCase.test_basic_flow(self, use_sample_weight, generator_type)
     76         self.assertIsInstance(bsw, expected_class)
     77     for i in range(by.shape[0]):
---> 78         sample_order.append(by[i, 0])
     79 self.assertAllClose(sample_order, list(range(34)))
     81 print(f"*" * 50)

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:5983, in raise_from_not_ok_status(e, name)
   5981 def raise_from_not_ok_status(e, name) -> NoReturn:
   5982   e.message += (" name: " + str(name if name is not None else ""))
-> 5983   raise core._status_to_exception(e) from None

InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected begin, end, and strides to be 1D equal size tensors, but got shapes [2], [1], and [1] instead. [Op:StridedSlice] name: strided_slice/

So, can anyone confirm whether this is a bug or not?
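
For context, the message itself describes a StridedSlice call whose begin, end, and strides arguments have mismatched sizes. A minimal sketch that reproduces the same message in isolation (just to show what the shapes [2], [1], and [1] refer to; this is not the actual code path in the adapter):

import tensorflow as tf

x = tf.zeros((4, 2))

# begin has length 2 while end and strides have length 1, which triggers
# "Expected begin, end, and strides to be 1D equal size tensors".
try:
    tf.strided_slice(x, begin=[0, 0], end=[2], strides=[1])
except tf.errors.InvalidArgumentError as e:
    print(e)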

@sachinprasadhs
Collaborator

sachinprasadhs commented Jul 22, 2024

I tried it in a Colab GPU runtime with the code you provided, and all 1000 runs printed the "All passed!" message.
Attaching the Gist here for reference

@shashaka
Contributor Author

shashaka commented Jul 23, 2024

@sachinprasadhs
When I install Keras from source (the GitHub master branch), the issue reproduces.
Can you check the Colab notebook below?

https://colab.sandbox.google.com/gist/sachinprasadhs/e73e2c7428f44ccc0d2ef486bed047c6/20027.ipynb

@grasskin
Member

Hi @shashaka, could we try to get a pared-down Colab of this issue? Please remove anything not relevant to TensorFlow or to this reproduction. Please also add keras.config.disable_traceback_filtering() so we can get a full error trace.
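
For reference, a minimal sketch of what that looks like at the top of the repro script (assuming the Keras 3 keras.config API mentioned above):

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras

# Report full, unfiltered stack traces instead of the filtered Keras traceback.
keras.config.disable_traceback_filtering()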

@grasskin grasskin removed the keras-team-review-pending Pending review by a Keras team member. label Jul 23, 2024
@ghsanti
Contributor

ghsanti commented Jul 24, 2024

Here is a simplified gist (it shows the error), with traceback filtering disabled.

It happens with both GPU and CPU, but only some of the time!

PS: This might be obvious, but without the test environment there seems to be no error (gist).
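
For illustration, a stripped-down loop without the test harness might look like the sketch below (this assumes the same GeneratorDataAdapter API used in the repro above; it is not the exact gist):

import numpy as np
import tensorflow as tf

from keras.src.trainers.data_adapters import generator_data_adapter

x = tf.constant(np.random.random((34, 4)).astype("float32"))
y = tf.constant(np.array([[i, i] for i in range(34)], dtype="float32"))


def gen():
    # Same slicing pattern as example_generator with batch_size=16.
    for low in range(0, 34, 16):
        yield x[low:low + 16], y[low:low + 16]


adapter = generator_data_adapter.GeneratorDataAdapter(gen())
for bx, by in adapter.get_tf_dataset():
    print(bx.shape, by.shape)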

@shashaka
Contributor Author

I also updated my gist based on @ghsanti's.
It seems that this error occurs when slicing the data in the data generator.

https://colab.research.google.com/gist/shashaka/71e1e97d1486398c0bcca1fb4fc084d8/20027.ipynb
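
For context, Python slicing and indexing on a tf.Tensor (as in x[low:high] inside the generator, or by[i, 0] in the test loop) is lowered to the StridedSlice op named in the error, so that is where such a failure would surface. A minimal sketch of the equivalence:

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Both of these dispatch to the StridedSlice op under the hood.
batch = x[0:2]    # roughly tf.strided_slice(x, [0], [2], [1])
item = x[1, 0]    # strided_slice with shrink_axis_mask to drop both axes
print(batch.shape, float(item))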

@grasskin grasskin added the keras-team-review-pending Pending review by a Keras team member. label Jul 25, 2024
@grasskin
Member

Thank you @shashaka and @ghsanti. Unless this shows up in our own testing environment (internally or in GitHub CI), we are unlikely to have the bandwidth to dig deeper into what is happening, since it might be environment-specific. If you take a closer look and find the code responsible, we'd be happy to support any PRs. Leaving this open for now!

@grasskin grasskin removed the keras-team-review-pending Pending review by a Keras team member. label Jul 25, 2024