-
Notifications
You must be signed in to change notification settings - Fork 78
Add convert to recordio format function #183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
typhoonzero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need a Dockerfile and k8s yaml file and README.md.
| path = output_path + "/" + name | ||
| mkdir_p(path) | ||
|
|
||
| mod.convert(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we do convert and split at the same time? @Yancey1989
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convert的时候已经split了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh,sorry,我应该把相关的PR都放上来的。其实,跟这个相关的PR是这个:
PaddlePaddle/Paddle#2608
docker/convert/convert.py
Outdated
| raise | ||
|
|
||
| def convert(output_path, name): | ||
| print "proc " + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logging instead of printing, it will log the time and log level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
docker/convert/convert.py
Outdated
| print "proc " + name | ||
| mod = __import__("paddle.v2.dataset." + name, fromlist=['']) | ||
|
|
||
| path = output_path + "/" + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use os.path.join(output_path, name) instead of hard code delimiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
| path = output_path + "/" + name | ||
| mkdir_p(path) | ||
|
|
||
| mod.convert(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?
docker/convert/Dockerfile
Outdated
| @@ -0,0 +1,4 @@ | |||
| FROM paddlepaddle/paddle | |||
| ADD ./convert.py /convert/ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add multiple files using a single line to reduce the image layers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
docker/convert/convert.py
Outdated
| import logging | ||
| import logging.config | ||
|
|
||
| logging.config.fileConfig('logging.conf') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use [dict_config], so we don't need another config file.
| FROM paddlepaddle/paddle | ||
| ADD ./convert.py /convert/ | ||
| ADD ./logging.conf /convert/ | ||
| ADD .cache/paddle/dataset /root/.cache/paddle/dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a default command for this dockerfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没有default command,路径是需要指定的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get your point. Default command could be CMD ["/program", "argument"], this can be overrided by k8s yaml definations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.Done.
helinwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could you let me know why users need to submit a k8s job to convert the public datasets?
I thought we will convert them locally (or by a k8s job), upload to the cluster public storage. And users don't need to convert them again?
|
@helinwang 这里的 |
helinwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gongweibao 懂了,谢谢!LGTM.
Fix PaddlePaddle/Paddle#2550