Conversation

@beckett1124 (Contributor):

No description provided.

Collaborator:

You need to write a title with #, e.g. # 常用数据集 (Commonly Used Datasets).

Contributor Author:

done

Collaborator:

The cause for the "therefore" in this line should be:

"Therefore, we preprocess some datasets in advance and host them on Paddle's servers."

Contributor Author:

done

Collaborator:

The cause corresponding to the "therefore" in this paragraph should be:

"To make it convenient to directly access these preprocessed datasets when experimenting with Paddle, we provide a Python library."

Contributor Author:

done

Collaborator:

Preprocessing sometimes changes the data format or other important properties, so raw data and preprocessed data can differ substantially, and so can the way they are accessed; a boolean flag is unlikely to cover such differences. Besides, some datasets need no preprocessing at all. I suggest putting raw and preprocessed data in different Python packages, e.g. paddle.data.amazon.product_review is our preprocessed data, and paddle.data.amazon.product_review.raw is the raw data.
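A minimal sketch of this suggested split, with raw and preprocessed data in separate modules instead of behind a flag. The module paths mirror the proposal above, but the fetch functions and the sample record below are illustrative stand-ins, not Paddle's actual API:

```python
import json

# Stand-in for paddle.data.amazon.product_review.raw:
# returns records exactly as downloaded (JSON strings).
def fetch_raw():
    return ['{"reviewText": "works great", "overall": 5.0}']

# Stand-in for paddle.data.amazon.product_review:
# returns preprocessed (text, label) pairs derived from the raw records.
def fetch():
    pairs = []
    for record in fetch_raw():
        r = json.loads(record)
        pairs.append((r["reviewText"], int(r["overall"] >= 4.0)))
    return pairs
```

Because the two sources are separate functions (separate packages in the real proposal), each can evolve its own format and access pattern without a shared flag.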

Contributor Author:

The bool flag only distinguishes the data source; the subsequent processing lives in each dataset's own package. done

Collaborator:

This example is missing the import statements, so it does not show how TensorFlow's data package directory is organized.

Contributor Author:

done

Collaborator:

There are a few extra blank lines, which make Paddle's format check fail. Please run the pre-commit script to format the code.

Collaborator @wangkuiyi left a comment:

I have some thoughts on the design of the data packages, written in the comments below.

The data packages will be strongly coupled to our new Paddle API, so I suggest discussing with @jacquesqiao @reyoung to work out the design that is most convenient for users.

Thanks @beckett1124

Collaborator:

What does this path mean? Does it mean the data is downloaded and written into this directory?

I am wondering whether there is a way to integrate more tightly with the Paddle API. In Qiao Longfei's and Yu Yang's new Python API design, there seems to be a function, apparently called feed_data: pass it a data instance and you can train or test a model with it.

If so, could each data package provide a fetch function that returns a data instance, so that a training program could be written as:

for iter in range(1, 100):
    trainer.feed(paddle.data.amazon.product_review.training.fetch())

and that is all?

This could be discussed with @reyoung @jacquesqiao.
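A runnable toy version of that proposed loop. Trainer and fetch below are stand-ins, since feed_data and the data packages were still being designed at this point:

```python
# Stand-in for the trainer with a feed_data-like method: it just records
# each instance instead of training a model.
class Trainer(object):
    def __init__(self):
        self.seen = []

    def feed(self, instance):
        self.seen.append(instance)

# Stand-in for paddle.data.amazon.product_review.training.fetch():
# returns one data instance per call.
def fetch():
    return ("review text", 1)

trainer = Trainer()
for it in range(1, 100):
    trainer.feed(fetch())
```

The appeal of this shape is that the training program never touches file paths or download logic; each data package hides those behind fetch().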

Collaborator:

What does data_unneed mean?

Collaborator:

Does src_flag mean using the raw data? The usual English term for 原始 is "raw".

Collaborator:

What does shirnk_txt mean?

Contributor Author:

It splits a txt data file: shirnk_txt('train', 10) means shrinking the train file down to 10%, for training.
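A minimal sketch of the described behavior. The real shirnk_txt apparently operates on files; this in-memory version over a list of lines is an assumption for illustration:

```python
# Keep the first `percent` percent of the given lines, as in
# shirnk_txt('train', 10) shrinking the train file to 10%.
def shrink_txt(lines, percent):
    keep = max(1, len(lines) * percent // 100)  # always keep at least one line
    return lines[:keep]
```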

Contributor Author:

We discussed this part with Yu Yang and Qiao Longfei; they will take care of the subsequent file splitting.

Collaborator:

I originally thought we would have a series of different packages:

paddle.data.amazon.product_review.raw      # raw data, JSON format
paddle.data.amazon.product_review.training # preprocessed, JSON format
paddle.data.amazon.product_review.testing  # preprocessed, JSON format
paddle.data.switchboard.training.wav       # preprocessed audio clips
paddle.data.switchboard.training.labels    # the text label for each audio clip
paddle.data.switchboard.testing.wav
paddle.data.switchboard.testing.labels
paddle.data.youtube.video      # some of YouTube's public videos
paddle.data.youtube.subtitles  # the subtitles for each video

This design is freer and can cover training data in all kinds of formats. It is also why I was worried earlier that a flag like src_flag might make the design insufficiently flexible.

if not os.path.exists(file_path):
    temp_file_name, _ = download_with_urlretrieve(source_url)
    temp_file_path = os.getcwd()
    os.rename(temp_file_name, src_file)
Member:

Some parameter lists have spaces after the commas and some do not; please make them consistent.

Contributor Author:

OK, I will fix that.

@@ -0,0 +1,100 @@
#!/usr/bin/env python
Member:

It seems python/CMakeLists.txt needs to be modified, with something like:

file(GLOB UTILS_PY_FILES . ./paddle/data/*.py)

Contributor Author:

That is not needed. I asked Liao Gang; only setup.py.in needs to be modified. Tested locally, and it loads.


Member @jacquesqiao left a comment:

A few small issues.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator:

Add a docstring: a module-level comment for the whole file.

Contributor Author:

done

import numpy as np
from six.moves import urllib
import stat

Collaborator:

Add __all__.

Contributor Author:

I don't think we need to add __all__ here.

source_name = "cifar"
file_source = "cifar-10-batches-py"
# Set the download dir for cifar.
data_home = set_data_path(source_name)
Collaborator:

data_home should be a parameter of fetch, defaulting to "~/paddle_data_directory", i.e.:

def fetch(directory=None):
    if directory is None:
         directory = "~/paddle_data_directory"

There is no need for something as complicated as set_data_path.

Contributor Author:

OK, done.

Contributor Author:

But we need to take file permissions into account.

Contributor Author:

done

Returns:
path to downloaded file.
"""
num_images_train = 50000
Collaborator:

Quite a few parameters here are unused, e.g. num_images_train?

Contributor Author:

Let's remove them all.

#Set the download dir for cifar.
data_home = set_data_path(source_name)
filepath = data_download(data_home,source_url)
"""
Collaborator:

Either keep this code uncommented or delete it.

Never use comments to remove code in a version control system.

    return download_dir


def move_files(source_dire, target_dire):
Collaborator:

This function is unnecessary; just call shutil directly.
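For reference, a self-contained demo of what the standard library already covers, using temporary paths so it runs anywhere:

```python
import os
import shutil
import tempfile

# Instead of a custom move_files(source_dire, target_dire) wrapper,
# shutil.move moves a file (or directory) to the target directory.
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src_file = os.path.join(src_dir, "data.txt")
with open(src_file, "w") as f:
    f.write("example")

moved_path = shutil.move(src_file, dst_dir)  # returns the destination path
```

shutil also provides copytree and rmtree for recursive copies and deletes, so dataset packages rarely need their own file-shuffling helpers.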

Contributor Author:

OK, done.

else:
    tarfile.open(name=file_path, mode="r:gz").extractall(download_dir)
    print("Data has been already downloaded and unpacked!")
return download_dir
Collaborator:

If convenient, please return the path of the extracted files.

Contributor Author:

done

g_file.close()
print("Data has been already downloaded and unpacked!")
os.remove(file_path)
return download_dir
Collaborator:

Please return the paths of these files.

Contributor Author:

done

data_url = urlparse.urljoin(source_url, file)
file_path = os.path.join(download_dir, file)
untar_path = os.path.join(download_dir, file.replace(".gz", ""))
if not os.path.exists(file_path):
Collaborator:

A plain gzip file does not need to be extracted, because Python can read it directly.
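A self-contained demo of that point, using Python's gzip module (the file contents below are made up for illustration):

```python
import gzip
import os
import tempfile

# Write a small gzip file, then read it back without an extraction step.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt") as f:
    f.write("label,pixel\n1,255\n")

with gzip.open(path, "rt") as f:  # transparent decompression on read
    content = f.read()
```

So for a .gz download, the dataset code can open the file in place instead of unpacking it next to the original.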

Contributor Author:

done

@@ -0,0 +1,168 @@
#!/usr/bin/env python
Collaborator:

There is too much copy-pasted code; please clean it up.

Under the paddle.data package we could add a submodule, say download_http; or put all of these files into a single .py, say fetch.py, which exposes functions like fetch_mnist.

Contributor Author:

done

BASE_URL = 'http://yann.lecun.com/exdb/mnist/%s-ubyte.gz'


class Categories(object):
Collaborator:

The Amazon dataset is called Categories because it really has many categories, but mnist is probably just a single dataset, so there is no need to call it Categories.

    :param directory:
    :return:
    """
    if directory is None:
Collaborator:

The mnist dataset always downloads those four files at once; there is no concept of a category.



class Categories(object):
M1m = "ml-1m"
Collaborator:

M1m => 'ML1M'

ML is short for MovieLens, and 1M refers to the size of the dataset (one million ratings).

__all__ = ['fetch', 'Categories', 'preprocess']


def calculate_md5(fn):
Collaborator:

This function could be extracted into a shared utility file.



class Categories(object):
Conll05test = "conll05st-tests"
Collaborator:

If there is only one corpus, there is no need for this Categories object.


if directory is None:
    directory = os.path.expanduser(
        os.path.join('~', 'paddle_data', 'amazon'))
Collaborator:

Why is this path amazon?



class Categories(object):
AclImdb = "aclImdb_v1"
Collaborator:

With only one dataset, there is no need for Categories.

@@ -0,0 +1,5 @@
"""
The :mod:`paddle.datasets` module includes utilities to load datasets,
Collaborator:

paddle.datasets ==> paddle.data ?

    label_file = os.path.join(directory, filename + label)
else:
    image_file = os.path.join(directory, 't10' + image)
    label_file = os.path.join(directory, 't10' + label)
Contributor:

t10 -> t10k

@CLAassistant commented Jul 22, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 3 committers have signed the CLA.

❌ baidu
❌ beckett1124
❌ reyoung


baidu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
* add contribute faq

* refine with comments
@paddle-bot-old paddle-bot-old bot closed this Aug 26, 2022

paddle-bot bot commented Aug 26, 2022:

Sorry to inform you that through our discussion, your PR fails to meet the merging standard (Reference: Paddle Custom Operator Design Doc). You can also submit a new one. Thank you for your contribution.
