Closed
Description
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
- Framework Version: 1.12.0
- Python Version: 3.6
- CPU or GPU: CPU
- Python SDK Version: 1.32.1
- Are you using a custom image: No
Describe the problem
In #839, it was mentioned that the Docker image detects requirements.txt and installs the Python libraries listed in it, but in my experiment the libraries were not installed.
I suspect the image did not detect requirements.txt, although I confirmed that requirements.txt exists in the sourcedir.tar.gz uploaded to S3.
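For anyone who wants to repeat the check on the uploaded archive, here is a minimal sketch. It assumes sourcedir.tar.gz has already been downloaded from S3 to a local path; the helper names `list_archive_files` and `has_top_level_file` are hypothetical and not part of the SageMaker SDK.

```python
import tarfile


def list_archive_files(archive_path):
    """Return member names of a .tar.gz archive, without a leading './'."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
    # Archives built with `tar -C src .` prefix members with './'; strip it
    # so both layouts compare the same way.
    return [name[2:] if name.startswith("./") else name for name in names]


def has_top_level_file(archive_path, filename="requirements.txt"):
    """Check that `filename` sits at the archive root, not in a subdirectory."""
    return filename in list_archive_files(archive_path)
```

Calling `has_top_level_file("sourcedir.tar.gz")` on the downloaded archive should return `True` if requirements.txt sits next to train.py at the root of the archive.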
Minimal repro / logs
2019-07-07 06:54:53 Starting - Starting the training job...
2019-07-07 06:54:58 Starting - Launching requested ML instances......
2019-07-07 06:56:05 Starting - Preparing the instances for training...
2019-07-07 06:56:43 Downloading - Downloading input data...
2019-07-07 06:57:31 Training - Training image download completed. Training in progress.
2019-07-07 06:57:31 Uploading - Uploading generated training model
2019-07-07 06:57:31 Failed - Training job failed
2019-07-07 06:57:21,299 sagemaker-containers INFO Imported framework sagemaker_tensorflow_container.training
2019-07-07 06:57:21,306 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,547 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,561 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,572 sagemaker-containers INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"training": "/opt/ml/input/data/training"
},
"current_host": "algo-1",
"framework_module": "sagemaker_tensorflow_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"model_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"training": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 2,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=2
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--model_dir","s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_MODEL_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model
Traceback (most recent call last):
File "train.py", line 1, in <module>
import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
2019-07-07 06:57:21,597 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
- Exact command to reproduce:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(entry_point='train.py',
                       source_dir='src',
                       role=role,
                       train_instance_type='ml.m5.large',
                       train_instance_count=1,
                       framework_version='1.12.0',
                       py_version='py3')
estimator.fit(input_data)

$ tree src
src
|-- train.py
`-- requirements.txt
- train.py
import matplotlib.pyplot as plt
if __name__ == "__main__":
    pass

- requirements.txt
-i https://pypi.org/simple
matplotlib==3.1.1
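
To make this failure easier to diagnose from the training logs, a check like the following could be placed at the top of train.py in place of the bare import. This is only a sketch: `missing_modules` is a hypothetical helper, and it reports which modules cannot be found without attempting to install anything.

```python
import importlib.util


def missing_modules(module_names):
    """Return the subset of `module_names` that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised for dotted names when the parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'matplotlib' is the module that failed to import in the log above.
    not_found = missing_modules(["matplotlib"])
    if not_found:
        print("Modules not installed in the container:", not_found)
```

Printing the missing modules before the first third-party import would show in the logs whether the packages from requirements.txt were installed at all, rather than failing on the first `import` line.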