
Keras callback creating .profile-empty file blocks loading data #2084

Closed

@wchargin

Description

Repro steps:

  1. Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
    and open two terminals in this environment.

  2. In one terminal, run the following simple Python script (but
    continue to the next step while this script is still running):

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import tensorflow as tf
    
    
    DATASET = tf.keras.datasets.mnist
    INPUT_SHAPE = (28, 28)
    OUTPUT_CLASSES = 10
    
    
    def model_fn():
      model = tf.keras.models.Sequential([
          tf.keras.layers.Input(INPUT_SHAPE),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(256, activation="relu"),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
      ])
      model.compile(
          loss="sparse_categorical_crossentropy",
          optimizer="adagrad",
          metrics=["accuracy"],
      )
      return model
    
    
    def main():
      model = model_fn()
      ((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
      model.fit(
          x=x_train,
          y=y_train,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.TensorBoard()],
          epochs=5,
      )
    
    
    if __name__ == "__main__":
      main()
  3. Wait for (say) epoch 2/5 to finish training. Then, in the other
    terminal, launch tensorboard --logdir ./logs.

  4. Open TensorBoard and observe that both training and validation runs
    appear with two epochs’ worth of data:

    Screenshot just after launching TensorBoard

  5. As training continues, refresh TensorBoard and/or reload the page.
    Observe that validation data continues to appear, but training data
    has stalled; even well after training has completed, the plot is
    incomplete (a sketch for inspecting the logdir contents at this
    point follows these steps):

    Screenshot of bad state

  6. Kill the TensorBoard process and restart it. Note that the data
    appears as desired:

    Screenshot of good state after TensorBoard relaunch
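
A quick way to see what the Keras callback has written while reproducing
the above is to list the logdir contents directly. The following is a
minimal diagnostic sketch, not part of the repro: list_logdir is a
hypothetical helper, the exact event-file names and subdirectory layout
depend on the TF build, and the check simply flags any file ending in
.profile-empty of the kind named in the title:

    import os


    def list_logdir(logdir="./logs"):
      """Print every file under `logdir`, flagging profiler sentinels."""
      for dirpath, _, filenames in os.walk(logdir):
        for filename in sorted(filenames):
          path = os.path.join(dirpath, filename)
          if filename.endswith(".profile-empty"):
            print(path + "  <-- profiler sentinel")
          else:
            print(path)


    if __name__ == "__main__":
      list_logdir()

If such a sentinel shows up next to the train events file around the time
the train run stalls, that is consistent with the behavior described in
the title.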

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, all data stops being displayed after
the epoch in which TensorBoard is opened.

As a special case of this, note that if TensorBoard is already running
before training starts, then train data may not appear at all:

Screenshot of validation-only data
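
A possible interim workaround, assuming the .profile-empty sentinel named
in the title is indeed what blocks loading: disable the callback's
profiling so the sentinel is never written. As I understand the 2.0
preview API, tf.keras.callbacks.TensorBoard accepts a profile_batch
argument and passing 0 turns per-batch profiling off. A sketch of the
corresponding change to the repro script (model, x_train, etc. refer to
the script in step 2):

    # Hypothetical workaround: construct the TensorBoard callback with
    # profiling disabled so that no .profile-empty sentinel is written.
    # profile_batch=0 turns per-batch profiling off.
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir="./logs", profile_batch=0)

    model.fit(
        x=x_train,
        y=y_train,
        validation_data=(x_test, y_test),
        callbacks=[tensorboard_callback],
        epochs=5,
    )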
