Added docs for CC100 and SST2 #1604

Merged 1 commit on Feb 12, 2022
88 changes: 44 additions & 44 deletions docs/source/datasets.rst
@@ -32,84 +32,74 @@ AG_NEWS

.. autofunction:: AG_NEWS

AmazonReviewFull
~~~~~~~~~~~~~~~~

SogouNews
~~~~~~~~~
.. autofunction:: AmazonReviewFull

.. autofunction:: SogouNews
AmazonReviewPolarity
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: AmazonReviewPolarity

DBpedia
~~~~~~~

.. autofunction:: DBpedia

YelpReviewPolarity
~~~~~~~~~~~~~~~~~~
IMDb
~~~~

.. autofunction:: YelpReviewPolarity
.. autofunction:: IMDB

YelpReviewFull
~~~~~~~~~~~~~~
SogouNews
~~~~~~~~~

.. autofunction:: YelpReviewFull
.. autofunction:: SogouNews

SST2
~~~~

.. autofunction:: SST2

YahooAnswers
~~~~~~~~~~~~

.. autofunction:: YahooAnswers

AmazonReviewPolarity
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: AmazonReviewPolarity

AmazonReviewFull
~~~~~~~~~~~~~~~~

.. autofunction:: AmazonReviewFull

IMDb
~~~~
YelpReviewFull
~~~~~~~~~~~~~~

.. autofunction:: IMDB
.. autofunction:: YelpReviewFull

SST2
~~~~
YelpReviewPolarity
~~~~~~~~~~~~~~~~~~

.. autofunction:: SST2
.. autofunction:: YelpReviewPolarity


Language Modeling
^^^^^^^^^^^^^^^^^

PennTreebank
~~~~~~~~~~~~

.. autofunction:: PennTreebank

WikiText-2
~~~~~~~~~~

.. autofunction:: WikiText2


WikiText103
~~~~~~~~~~~

.. autofunction:: WikiText103


PennTreebank
~~~~~~~~~~~~

.. autofunction:: PennTreebank


Machine Translation
^^^^^^^^^^^^^^^^^^^

Multi30k
~~~~~~~~

.. autofunction:: Multi30k



IWSLT2016
~~~~~~~~~

@@ -120,20 +110,25 @@ IWSLT2017

.. autofunction:: IWSLT2017

Multi30k
~~~~~~~~

Sequence Tagging
^^^^^^^^^^^^^^^^
.. autofunction:: Multi30k

UDPOS
~~~~~

.. autofunction:: UDPOS
Sequence Tagging
^^^^^^^^^^^^^^^^

CoNLL2000Chunking
~~~~~~~~~~~~~~~~~

.. autofunction:: CoNLL2000Chunking

UDPOS
~~~~~

.. autofunction:: UDPOS


Question Answer
^^^^^^^^^^^^^^^
@@ -153,6 +148,11 @@ SQuAD 2.0
Unsupervised Learning
^^^^^^^^^^^^^^^^^^^^^

CC100
~~~~~~

.. autofunction:: CC100

EnWik9
~~~~~~
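
The reordering in this file appears to sort the dataset entries alphabetically (case-insensitively) within each category. A minimal sketch of how that ordering could be checked is shown here; `subsection_titles` and `is_sorted_case_insensitive` are hypothetical helpers written for illustration, not part of torchtext or Sphinx, and the parser only recognizes titles underlined with a single repeated character.

```python
from typing import List


def subsection_titles(rst_text: str, underline_char: str = "~") -> List[str]:
    """Collect titles that are followed by an underline made only of underline_char."""
    lines = rst_text.splitlines()
    titles = []
    for title, underline in zip(lines, lines[1:]):
        is_underline = bool(underline) and set(underline) == {underline_char}
        # A valid underline must be at least as long as the title text.
        if title.strip() and is_underline and len(underline) >= len(title):
            titles.append(title.strip())
    return titles


def is_sorted_case_insensitive(titles: List[str]) -> bool:
    """Return True if the titles are in case-insensitive alphabetical order."""
    keys = [t.lower() for t in titles]
    return keys == sorted(keys)


sample = """AG_NEWS
~~~~~~~

.. autofunction:: AG_NEWS

AmazonReviewFull
~~~~~~~~~~~~~~~~

SogouNews
~~~~~~~~~

SST2
~~~~
"""
print(is_sorted_case_insensitive(subsection_titles(sample)))
```

Running the check per category (each block under a `^^^^` heading) rather than over the whole file would be needed in practice, since the ordering only holds within a category.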

11 changes: 11 additions & 0 deletions torchtext/datasets/cc100.py
@@ -30,6 +30,17 @@

@_create_dataset_directory(dataset_name=DATASET_NAME)
def CC100(root: str, language_code: str = "en"):
"""CC100 Dataset

For additional details refer to https://data.statmt.org/cc-100/

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
language_code: Language code of the dataset to load. Must be one of VALID_CODES. Default: "en"

:returns: DataPipe that yields tuple of language code and text
:rtype: (str, str)
"""
if language_code not in VALID_CODES:
raise ValueError(f"Invalid language code {language_code}")
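
The docstring and the check above describe two behaviours: an unknown language code is rejected with a `ValueError`, and the result yields `(language code, text)` tuples. A minimal, self-contained sketch of that contract follows. It is an illustration only: `tag_lines` is a hypothetical helper, the `VALID_CODES` set here is a made-up three-item subset, and plain iterables stand in for the DataPipe that the real function returns.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical subset for illustration; the real VALID_CODES is defined in cc100.py.
VALID_CODES = {"en", "de", "fr"}


def tag_lines(lines: Iterable[str], language_code: str = "en") -> Iterator[Tuple[str, str]]:
    """Validate the language code, then pair every line of text with it."""
    # Validation happens at call time because this is a regular function
    # returning a generator expression, not a generator function.
    if language_code not in VALID_CODES:
        raise ValueError(f"Invalid language code {language_code}")
    return ((language_code, line.rstrip("\n")) for line in lines)


for code, text in tag_lines(["first line\n", "second line\n"], language_code="en"):
    print(code, text)
```

Validating before building the iterator means a bad code fails immediately rather than on the first iteration, which matches the placement of the check at the top of the function body in the diff.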

5 changes: 2 additions & 3 deletions torchtext/datasets/conll2000chunking.py
@@ -39,9 +39,8 @@ def CoNLL2000Chunking(root: str, split: Union[Tuple[str], str]):
For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/

Number of lines per split:
train: 8936

test: 2012
- train: 8936
- test: 2012

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
5 changes: 5 additions & 0 deletions torchtext/datasets/multi30k.py
@@ -47,6 +47,11 @@ def Multi30k(

For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1

Number of lines per split:
- train: 29000
- valid: 1014
- test: 1000

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split: split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
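
The `split` argument is documented as accepting either a single string or a tuple of strings. One way such an argument could be normalized is sketched here; `normalize_splits` is a hypothetical helper for illustration and is not the implementation of torchtext's `_wrap_split_argument`, which this diff does not show.

```python
from typing import Sequence, Tuple, Union


def normalize_splits(
    split: Union[str, Sequence[str]],
    valid: Tuple[str, ...] = ("train", "valid", "test"),
) -> Tuple[str, ...]:
    """Return the requested splits as a tuple, rejecting unknown names."""
    # A bare string is treated as a single split, not as a sequence of characters.
    requested = (split,) if isinstance(split, str) else tuple(split)
    unknown = [name for name in requested if name not in valid]
    if unknown:
        raise ValueError(f"Unknown split(s) {unknown}; expected a subset of {valid}")
    return requested


print(normalize_splits("train"))
print(normalize_splits(("train", "test")))
```

Checking for `str` first matters because a string is itself iterable; without that branch, `"train"` would be split into individual characters.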
6 changes: 2 additions & 4 deletions torchtext/datasets/squad1.py
@@ -38,10 +38,8 @@ def SQuAD1(root: str, split: Union[Tuple[str], str]):
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:
train: 87599

Dev: 10570

- train: 87599
- dev: 10570

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
5 changes: 2 additions & 3 deletions torchtext/datasets/squad2.py
@@ -38,9 +38,8 @@ def SQuAD2(root: str, split: Union[Tuple[str], str]):
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:
train: 130319

Dev: 11873
- train: 130319
- dev: 11873


Args:
18 changes: 16 additions & 2 deletions torchtext/datasets/sst2.py
@@ -3,7 +3,6 @@

from torchtext._internal.module_utils import is_module_available
from torchtext.data.datasets_utils import (
_add_docstring_header,
_create_dataset_directory,
_wrap_split_argument,
)
@@ -37,10 +36,25 @@
}


@_add_docstring_header(num_lines=NUM_LINES, num_classes=2)
@_create_dataset_directory(dataset_name=DATASET_NAME)
@_wrap_split_argument(("train", "dev", "test"))
def SST2(root, split):
"""SST2 Dataset

For additional details refer to https://nlp.stanford.edu/sentiment/

Number of lines per split:
- train: 67349
- dev: 872
- test: 1821

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split: split or splits to be returned. Can be a string or tuple of strings. Default: (`train`, `dev`, `test`)

:returns: DataPipe that yields tuple of text and/or label (0 or 1). The `test` split only returns text.
:rtype: Union[(str, int), (str,)]
"""
# TODO Remove this after removing conditional dependency
if not is_module_available("torchdata"):
raise ModuleNotFoundError(
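
The guard above makes `torchdata` a conditional dependency: the function fails with a `ModuleNotFoundError` when the package is missing. A minimal sketch of that pattern using only the standard library is shown here. The names `module_available` and `require_module` are hypothetical; the actual `is_module_available` lives in `torchtext._internal.module_utils` and its body is not part of this diff.

```python
import importlib.util


def module_available(*modules: str) -> bool:
    """Return True only if every named top-level module can be located."""
    # find_spec returns None for a top-level module that is not installed.
    return all(importlib.util.find_spec(name) is not None for name in modules)


def require_module(name: str, feature: str) -> None:
    """Raise a descriptive error if an optional dependency is missing."""
    if not module_available(name):
        raise ModuleNotFoundError(
            f"{feature} requires the `{name}` package, which is not installed."
        )


# The standard-library json module is always present, so this passes silently.
require_module("json", feature="This example")
```

The sketch is limited to top-level module names; for dotted names, `importlib.util.find_spec` imports the parent package first and raises if that parent is absent.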
8 changes: 3 additions & 5 deletions torchtext/datasets/udpos.py
@@ -33,11 +33,9 @@ def UDPOS(root: str, split: Union[Tuple[str], str]):
"""UDPOS Dataset

Number of lines per split:
train: 12543

valid: 2002

test: 2077
- train: 12543
- valid: 2002
- test: 2077

Args:
root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
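
Several docstrings in this change rewrite the "Number of lines per split" section as a bullet list by hand. As a sketch of how the same list could be generated from a mapping instead, the hypothetical helper `render_num_lines` below formats a dictionary of split sizes. The four-space indent is an assumption made for illustration; the counts are the SST2 figures quoted in this diff.

```python
from typing import Dict


def render_num_lines(num_lines: Dict[str, int], indent: str = "    ") -> str:
    """Format split sizes as the bullet list used in the dataset docstrings."""
    lines = ["Number of lines per split:"]
    # Dicts preserve insertion order, so splits appear in the order given.
    lines += [f"{indent}- {split}: {count}" for split, count in num_lines.items()]
    return "\n".join(lines)


print(render_num_lines({"train": 67349, "dev": 872, "test": 1821}))
```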