requirements.txt not detected in TensorFlow script mode #911

@xkumiyu

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: 1.12.0
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version: 1.32.1
  • Are you using a custom image: No

Describe the problem

In #839, it was mentioned that the Docker image detects requirements.txt and installs the Python libraries listed in it, but in my experiment they were not installed.
I suspect the image did not detect requirements.txt, although I confirmed that requirements.txt exists in the sourcedir.tar.gz uploaded to S3.
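
For anyone who wants to repeat the archive check, here is a minimal sketch of how the contents of a downloaded sourcedir.tar.gz could be inspected. The helper names (`list_archive_members`, `has_top_level_requirements`, `build_sample_archive`) are hypothetical and only for illustration. The sample archive is built in memory so the snippet runs on its own. Whether the container needs requirements.txt at the archive root is an assumption, not something confirmed in this issue.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return member names of a gzip-compressed tar, without a leading './'."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return [n[2:] if n.startswith("./") else n for n in tar.getnames()]


def has_top_level_requirements(names):
    """True if requirements.txt sits at the archive root (assumed location)."""
    return "requirements.txt" in names


def build_sample_archive(files):
    """Build an in-memory .tar.gz from a {name: bytes} mapping (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# Stand-in for the real sourcedir.tar.gz, mirroring the src layout in this report.
sample = build_sample_archive({
    "train.py": b"import matplotlib.pyplot as plt\n",
    "requirements.txt": b"matplotlib==3.1.1\n",
})
names = list_archive_members(sample)
print(sorted(names), has_top_level_requirements(names))
```

For a real job, the same functions can be pointed at a local copy of the archive, e.g. `list_archive_members(open("sourcedir.tar.gz", "rb"))`.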

Minimal repro / logs

2019-07-07 06:54:53 Starting - Starting the training job...
2019-07-07 06:54:58 Starting - Launching requested ML instances......
2019-07-07 06:56:05 Starting - Preparing the instances for training...
2019-07-07 06:56:43 Downloading - Downloading input data...
2019-07-07 06:57:31 Training - Training image download completed. Training in progress.
2019-07-07 06:57:31 Uploading - Uploading generated training model
2019-07-07 06:57:31 Failed - Training job failed

2019-07-07 06:57:21,299 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-07-07 06:57:21,306 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,547 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,561 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,572 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "model_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 2,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=2
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--model_dir","s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_MODEL_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages

Invoking script with the following command:

/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model


Traceback (most recent call last):
  File "train.py", line 1, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
2019-07-07 06:57:21,597 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
  • Exact command to reproduce:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',
                       source_dir='src',
                       role=role,
                       train_instance_type='ml.m5.large',
                       train_instance_count=1,
                       framework_version='1.12.0',
                       py_version='py3')
estimator.fit(input_data)

$ tree src
src
|-- train.py
`-- requirements.txt

  • train.py

import matplotlib.pyplot as plt

if __name__ == "__main__":
    pass

  • requirements.txt

-i https://pypi.org/simple
matplotlib==3.1.1
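
As a possible stopgap while the image does not pick up requirements.txt, the entry point could install the file itself before importing third-party packages. This is a sketch under assumptions, not a fix confirmed in this issue: it assumes requirements.txt is extracted next to train.py and that the training container can reach the package index. The helper names `pip_install_command` and `install_requirements` are hypothetical.

```python
import os
import subprocess
import sys


def pip_install_command(requirements_path):
    """Build the pip command for the interpreter that is running this script."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


def install_requirements(requirements_path):
    """Install from requirements_path if it exists; return True if pip was run."""
    if not os.path.isfile(requirements_path):
        return False
    subprocess.check_call(pip_install_command(requirements_path))
    return True


def requirements_next_to(script_path):
    """Path of a requirements.txt in the same directory as script_path (assumed layout)."""
    return os.path.join(os.path.dirname(os.path.abspath(script_path)), "requirements.txt")
```

In train.py this would be called as `install_requirements(requirements_next_to(__file__))` at the top of the file, with `import matplotlib.pyplot as plt` moved below that call so the import happens only after the install.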