
Intermittent test case failure on TensorFlow GPU env #20027

Open
shashaka opened this issue Jul 22, 2024 · 6 comments

Comments

@shashaka
Contributor

In keras/src/trainers/data_adapters/generator_data_adapter_test.py, I found that a test case fails intermittently in the TensorFlow GPU environment.
The failure is related to the test_basic_flow method in that test file, so I put together the following repro code on my local machine.

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math

import jax
import numpy as np
import tensorflow as tf
import torch
from absl.testing import parameterized
from jax import numpy as jnp

from keras.src import backend
from keras.src import testing
from keras.src.trainers.data_adapters import generator_data_adapter


def example_generator(x, y, sample_weight=None, batch_size=32):
    def make():
        for i in range(math.ceil(len(x) / batch_size)):
            low = i * batch_size
            high = min(low + batch_size, len(x))
            batch_x = x[low:high]
            batch_y = y[low:high]
            if sample_weight is not None:
                yield batch_x, batch_y, sample_weight[low:high]
            else:
                yield batch_x, batch_y

    return make


class TestCase(testing.TestCase, parameterized.TestCase):

    def test_basic_flow(self, use_sample_weight, generator_type):
        x = np.random.random((34, 4)).astype("float32")
        y = np.array([[i, i] for i in range(34)], dtype="float32")
        sw = np.random.random((34,)).astype("float32")
        if generator_type == "tf":
            x, y, sw = tf.constant(x), tf.constant(y), tf.constant(sw)
        elif generator_type == "jax":
            x, y, sw = jnp.array(x), jnp.array(y), jnp.array(sw)
        elif generator_type == "torch":
            x, y, sw = (
                torch.as_tensor(x),
                torch.as_tensor(y),
                torch.as_tensor(sw),
            )
        if not use_sample_weight:
            sw = None
        make_generator = example_generator(
            x,
            y,
            sample_weight=sw,
            batch_size=16,
        )

        adapter = generator_data_adapter.GeneratorDataAdapter(make_generator())
        if backend.backend() == "numpy":
            it = adapter.get_numpy_iterator()
            expected_class = np.ndarray
        elif backend.backend() == "tensorflow":
            it = adapter.get_tf_dataset()
            expected_class = tf.Tensor
        elif backend.backend() == "jax":
            it = adapter.get_jax_iterator()
            expected_class = (
                jax.Array if generator_type == "jax" else np.ndarray
            )
        elif backend.backend() == "torch":
            it = adapter.get_torch_dataloader()
            expected_class = torch.Tensor

        sample_order = []
        for i, batch in enumerate(it):
            if use_sample_weight:
                self.assertEqual(len(batch), 3)
                bx, by, bsw = batch
            else:
                self.assertEqual(len(batch), 2)
                bx, by = batch
            self.assertIsInstance(bx, expected_class)
            self.assertIsInstance(by, expected_class)
            self.assertEqual(bx.dtype, by.dtype)
            self.assertContainsExactSubsequence(str(bx.dtype), "float32")
            if i < 2:
                self.assertEqual(bx.shape, (16, 4))
                self.assertEqual(by.shape, (16, 2))
            else:
                self.assertEqual(bx.shape, (2, 4))
                self.assertEqual(by.shape, (2, 2))
            if use_sample_weight:
                self.assertIsInstance(bsw, expected_class)
            for i in range(by.shape[0]):
                sample_order.append(by[i, 0])
        self.assertAllClose(sample_order, list(range(34)))

        print(f"*" * 50)


for _ in range(1000):
    TestCase().test_basic_flow(True, 'tf')
    print("All passed!")

And I got the error below. Most runs succeeded, but some failed.

InvalidArgumentError                      Traceback (most recent call last)
Cell In[2], line 85
     81         print(f"*" * 50)
     84 for _ in range(1000):
---> 85     TestCase().test_basic_flow(True, 'tf')
     86     print("All passed!")

Cell In[2], line 78, in TestCase.test_basic_flow(self, use_sample_weight, generator_type)
     76         self.assertIsInstance(bsw, expected_class)
     77     for i in range(by.shape[0]):
---> 78         sample_order.append(by[i, 0])
     79 self.assertAllClose(sample_order, list(range(34)))
     81 print(f"*" * 50)

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:5983, in raise_from_not_ok_status(e, name)
   5981 def raise_from_not_ok_status(e, name) -> NoReturn:
   5982   e.message += (" name: " + str(name if name is not None else ""))
-> 5983   raise core._status_to_exception(e) from None

InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected begin, end, and strides to be 1D equal size tensors, but got shapes [2], [1], and [1] instead. [Op:StridedSlice] name: strided_slice/

So, can anyone confirm whether this is a bug or not?
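
For context, the message itself describes a StridedSlice call whose begin, end, and strides arguments have mismatched sizes. A minimal sketch that reproduces the same message in isolation (just to show what the shapes [2], [1], and [1] refer to; this is not the actual code path in the adapter):

import tensorflow as tf

x = tf.zeros((4, 2))

# begin has length 2 while end and strides have length 1, which triggers
# "Expected begin, end, and strides to be 1D equal size tensors".
try:
    tf.strided_slice(x, begin=[0, 0], end=[2], strides=[1])
except tf.errors.InvalidArgumentError as e:
    print(e)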

@sachinprasadhs
Collaborator

sachinprasadhs commented Jul 22, 2024

I tried it in a Colab GPU runtime with the code you provided, and all 1000 runs printed the "All passed!" message.
Attaching the Gist here for reference

@shashaka
Contributor Author

shashaka commented Jul 23, 2024

@sachinprasadhs
When I install Keras from source (the GitHub master branch), the issue reproduces.
Can you check the Colab notebook below?

https://colab.sandbox.google.com/gist/sachinprasadhs/e73e2c7428f44ccc0d2ef486bed047c6/20027.ipynb

@grasskin
Member

Hi @shashaka, could we try to get a pared-down Colab of this issue? Please remove anything not relevant to TensorFlow or to this reproduction. Please also add keras.config.disable_traceback_filtering() so we can get a full error trace.
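
For reference, a minimal sketch of what that looks like at the top of the repro script (assuming the Keras 3 keras.config API mentioned above):

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras

# Report full, unfiltered stack traces instead of the filtered Keras traceback.
keras.config.disable_traceback_filtering()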

@grasskin grasskin removed the keras-team-review-pending Pending review by a Keras team member. label Jul 23, 2024
@ghsanti
Contributor

ghsanti commented Jul 24, 2024

Here is a simplified gist (it shows the error), with traceback filtering disabled.

It happens with both GPU and CPU, but only some of the time!

PS: This might be obvious, but without the test environment there seems to be no error (gist).
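
For illustration, a stripped-down loop without the test harness might look like the sketch below (this assumes the same GeneratorDataAdapter API used in the repro above; it is not the exact gist):

import numpy as np
import tensorflow as tf

from keras.src.trainers.data_adapters import generator_data_adapter

x = tf.constant(np.random.random((34, 4)).astype("float32"))
y = tf.constant(np.array([[i, i] for i in range(34)], dtype="float32"))


def gen():
    # Same slicing pattern as example_generator with batch_size=16.
    for low in range(0, 34, 16):
        yield x[low:low + 16], y[low:low + 16]


adapter = generator_data_adapter.GeneratorDataAdapter(gen())
for bx, by in adapter.get_tf_dataset():
    print(bx.shape, by.shape)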

@shashaka
Contributor Author

I also updated my gist based on @ghsanti's.
It seems that this error occurs when slicing the data in the data generator.

https://colab.research.google.com/gist/shashaka/71e1e97d1486398c0bcca1fb4fc084d8/20027.ipynb
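
For context, Python slicing and indexing on a tf.Tensor (as in x[low:high] inside the generator, or by[i, 0] in the test loop) is lowered to the StridedSlice op named in the error, so that is where such a failure would surface. A minimal sketch of the equivalence:

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Both of these dispatch to the StridedSlice op under the hood.
batch = x[0:2]    # roughly tf.strided_slice(x, [0], [2], [1])
item = x[1, 0]    # strided_slice with shrink_axis_mask to drop both axes
print(batch.shape, float(item))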

@grasskin grasskin added the keras-team-review-pending Pending review by a Keras team member. label Jul 25, 2024
@grasskin
Member

Thank you @shashaka and @ghsanti. Unless this shows up in our own testing environment (internally or in GitHub CI), we are unlikely to have the bandwidth to dig deeper into what is happening, since it might be environment-specific. If you take a closer look and find the code responsible, we'd be happy to support any PRs. Leaving this open for now!

@grasskin grasskin removed the keras-team-review-pending Pending review by a Keras team member. label Jul 25, 2024