
[bug] keras issue on tpu-vm #19448

Open
@innat

Description

keras: 3.0.5
tensorflow: 2.15.0
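
For anyone trying to reproduce this, the installed versions can be collected with a small standard-library helper. This is only a sketch: `installed_versions` is a hypothetical helper name, and it assumes the packages are installed under the distribution names `keras` and `tensorflow`, which may not hold on every TPU VM image.

```python
from importlib import metadata


def installed_versions(packages=("keras", "tensorflow")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution not found under this name (it may be installed under another).
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry only means no distribution with that exact name was found, so it is worth checking `pip list` as well before drawing conclusions.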

There seems to be a conflict when using Keras 3 on a TPU VM. See Kaggle/docker-python#1370 (comment).

import numpy as np
import tensorflow as tf
import keras

tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
strategy = tf.distribute.TPUStrategy(tpu)

with strategy.scope():
    # Construct and compile a small functional model
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Just use `fit` as usual
x = np.random.random((1000, 32))
y = np.random.random((1000, 1))
model.fit(x, y, epochs=3)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1712289536.759567      13 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
Epoch 1/3
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
Cell In[6], line 11
      9 x = np.random.random((1000, 32))
     10 y = np.random.random((1000, 1))
---> 11 model.fit(x, y, epochs=3)

File /usr/local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:123, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    120     filtered_tb = _process_traceback_frames(e.__traceback__)
    121     # To get the full stack trace, call:
    122     # `keras.config.disable_traceback_filtering()`
--> 123     raise e.with_traceback(filtered_tb) from None
    124 finally:
    125     del filtered_tb

File /usr/local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

NotFoundError: Graph execution error:

Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
9 root error(s) found.
  (0) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
  (1) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
  (2) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
  (3) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
  (4) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
  (5) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
  (6) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
  (7) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_236]]
  (8) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_236]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_303]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_iterator_2865]
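
The reproduction above builds `TPUStrategy` directly from the resolver without an explicit initialization step. One thing that may be worth ruling out, as an assumption and not a confirmed cause of the `NOT_FOUND` error, is whether the TPU system was initialized before the strategy was created. A minimal sketch of the connect-and-initialize sequence follows; `make_tpu_strategy` is a hypothetical helper name, and the import is deferred so the function can be defined on machines without TensorFlow.

```python
def make_tpu_strategy(tpu="local"):
    """Connect to the TPU, initialize the TPU system, then build a TPUStrategy."""
    # Imported lazily so this module loads even where TensorFlow is absent.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu)
    # Explicit connect + initialize before creating the strategy.
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)
```

If the same error appears with `strategy = make_tpu_strategy()` in place of the two setup lines in the script, missing initialization can be excluded and the Keras 3 / TensorFlow 2.15 combination remains the main suspect.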

Metadata

Labels

To investigate: Looks like a bug. It needs someone to investigate.
