dataset¶
Dataset package.
mnist¶
MNIST dataset.
This module will download the dataset from http://yann.lecun.com/exdb/mnist/ and parse the training set and test set into paddle reader creators.
paddle.dataset.mnist.train()[source]
MNIST training set creator.
It returns a reader creator; each sample in the reader is image pixels in [-1, 1] and a label in [0, 9].
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.mnist.test()[source]
MNIST test set creator.
It returns a reader creator; each sample in the reader is image pixels in [-1, 1] and a label in [0, 9].
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.mnist.convert(path)[source]
Converts the dataset to recordio format.
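The "reader creator" pattern used throughout this package can be sketched in plain Python: a creator is a zero-argument callable that, when called, returns a fresh generator over samples. This sketch is illustrative only; `make_reader` is a hypothetical helper, not part of paddle.

```python
def make_reader(samples):
    """A minimal reader creator: calling the returned function
    yields a fresh pass over the samples each time."""
    def reader():
        for sample in samples:
            yield sample
    return reader

# Hypothetical stand-in for mnist.train(): (pixels, label) pairs.
train_reader = make_reader([([-1.0, 0.5, 1.0], 7), ([0.0, 0.0, 0.0], 3)])

first_pass = list(train_reader())
second_pass = list(train_reader())  # a new generator over the same data
```

Because the creator returns a new generator on every call, the same reader can be iterated once per training epoch.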
cifar¶
CIFAR dataset.
This module will download the dataset from https://www.cs.toronto.edu/~kriz/cifar.html and parse the train/test set into paddle reader creators.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The CIFAR-100 dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
paddle.dataset.cifar.train100()[source]
CIFAR-100 training set creator.
It returns a reader creator; each sample in the reader is image pixels in [0, 1] and a label in [0, 99].
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.cifar.test100()[source]
CIFAR-100 test set creator.
It returns a reader creator; each sample in the reader is image pixels in [0, 1] and a label in [0, 99].
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.cifar.train10(cycle=False)[source]
CIFAR-10 training set creator.
It returns a reader creator; each sample in the reader is image pixels in [0, 1] and a label in [0, 9].
- Parameters
cycle (bool) – whether to cycle through the dataset
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.cifar.test10(cycle=False)[source]
CIFAR-10 test set creator.
It returns a reader creator; each sample in the reader is image pixels in [0, 1] and a label in [0, 9].
- Parameters
cycle (bool) – whether to cycle through the dataset
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.cifar.convert(path)[source]
Converts the dataset to recordio format.
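The `cycle` flag on `train10`/`test10` can be understood with a plain-Python sketch (the helper name is hypothetical, and this is not paddle's implementation): when `cycle=True`, the reader restarts from the beginning instead of stopping after one pass.

```python
import itertools

def make_cyclic_reader(samples, cycle=False):
    """Reader creator that optionally repeats the dataset forever."""
    def reader():
        it = itertools.cycle(samples) if cycle else iter(samples)
        for sample in it:
            yield sample
    return reader

finite = list(make_cyclic_reader([1, 2, 3])())  # one pass, then stops
# A cycling reader never stops, so take a bounded slice.
looped = list(itertools.islice(make_cyclic_reader([1, 2, 3], cycle=True)(), 7))
```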
conll05¶
Conll05 dataset. The paddle semantic role labeling Book and demo use this dataset as an example. Because Conll05 is not freely available, the default download URL points to the Conll05 test set (which is public). Users can change the URL and MD5 to point to their own Conll dataset. A pre-trained word vector model based on a Wikipedia corpus is used to initialize the SRL model.
paddle.dataset.conll05.get_dict()[source]
Get the word, verb and label dictionaries of the Wikipedia corpus.
paddle.dataset.conll05.get_embedding()[source]
Get the trained word vectors based on the Wikipedia corpus.
paddle.dataset.conll05.test()[source]
Conll05 test set creator.
Because the training dataset is not freely available, the test dataset is used for training. It returns a reader creator; each sample in the reader consists of nine features, including the sentence sequence, predicate, predicate context, predicate context flag and tagged sequence.
- Returns
Training reader creator.
- Return type
callable
imdb¶
IMDB dataset.
This module downloads the IMDB dataset from http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. The module also provides an API for building a word dictionary.
paddle.dataset.imdb.build_dict(pattern, cutoff)[source]
Build a word dictionary from the corpus. Keys of the dictionary are words, and values are the zero-based IDs of these words.
paddle.dataset.imdb.train(word_idx)[source]
IMDB training set creator.
It returns a reader creator; each sample in the reader is a zero-based ID sequence and a label in [0, 1].
- Parameters
word_idx (dict) – word dictionary
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.imdb.test(word_idx)[source]
IMDB test set creator.
It returns a reader creator; each sample in the reader is a zero-based ID sequence and a label in [0, 1].
- Parameters
word_idx (dict) – word dictionary
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.imdb.convert(path)[source]
Converts the dataset to recordio format.
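A word dictionary with a frequency cutoff, as `build_dict` describes, can be sketched without paddle. The helper below is hypothetical (paddle's `build_dict` reads files matching `pattern`); it shows only the dictionary-building idea: count words, drop rare ones, and assign zero-based IDs with the most frequent words first.

```python
from collections import Counter

def build_word_dict(corpus, cutoff):
    """Map each word occurring more than `cutoff` times to a
    zero-based ID, most frequent words first."""
    counts = Counter(word for sentence in corpus for word in sentence)
    kept = [w for w, c in counts.most_common() if c > cutoff]
    return {word: idx for idx, word in enumerate(kept)}

corpus = [["the", "movie", "was", "great"],
          ["the", "plot", "was", "thin"]]
word_idx = build_word_dict(corpus, cutoff=1)  # keep words seen twice or more
```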
imikolov¶
imikolov’s simple dataset.
This module will download the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parse the training set and test set into paddle reader creators.
paddle.dataset.imikolov.build_dict(min_word_freq=50)[source]
Build a word dictionary from the corpus. Keys of the dictionary are words, and values are the zero-based IDs of these words.
paddle.dataset.imikolov.train(word_idx, n, data_type=1)[source]
imikolov training set creator.
It returns a reader creator; each sample in the reader is a word ID tuple.
- Parameters
word_idx (dict) – word dictionary
n (int) – sliding window size if the data type is ngram, otherwise the maximum length of a sequence
data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.imikolov.test(word_idx, n, data_type=1)[source]
imikolov test set creator.
It returns a reader creator; each sample in the reader is a word ID tuple.
- Parameters
word_idx (dict) – word dictionary
n (int) – sliding window size if the data type is ngram, otherwise the maximum length of a sequence
data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.imikolov.convert(path)[source]
Converts the dataset to recordio format.
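The ngram mode of the imikolov readers corresponds to a sliding window over the word-ID sequence: each sample is one length-n window. A plain-Python sketch (the helper is hypothetical, not paddle's code):

```python
def ngrams(ids, n):
    """Yield every length-n window over a sequence of word IDs."""
    for i in range(len(ids) - n + 1):
        yield tuple(ids[i:i + n])

# A 3-gram window over five word IDs produces three samples.
windows = list(ngrams([0, 1, 2, 3, 4], n=3))
```

In a language-model setting the first n-1 IDs of each tuple are the context and the last ID is the word to predict.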
movielens¶
Movielens 1-M dataset.
The Movielens 1-M dataset contains one million ratings from 6000 users on 4000 movies and was collected by GroupLens Research. This module will download the dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip and parse the training set and test set into paddle reader creators.
paddle.dataset.movielens.get_movie_title_dict()[source]
Get the movie title dictionary.
paddle.dataset.movielens.max_movie_id()[source]
Get the maximum value of movie id.
paddle.dataset.movielens.max_user_id()[source]
Get the maximum value of user id.
paddle.dataset.movielens.max_job_id()[source]
Get the maximum value of job id.
paddle.dataset.movielens.movie_categories()[source]
Get the movie categories dictionary.
paddle.dataset.movielens.user_info()[source]
Get the user info dictionary.
paddle.dataset.movielens.movie_info()[source]
Get the movie info dictionary.
paddle.dataset.movielens.convert(path)[source]
Converts the dataset to recordio format.
class paddle.dataset.movielens.MovieInfo(index, categories, title)[source]
Movie id, title and categories information are stored in MovieInfo.
class paddle.dataset.movielens.UserInfo(index, gender, age, job_id)[source]
User id, gender, age, and job information are stored in UserInfo.
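Records like MovieInfo and UserInfo are plain field containers. A sketch of the same shape using namedtuples (these stand-ins are hypothetical; the actual classes are defined in paddle.dataset.movielens):

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the fields documented above.
Movie = namedtuple("Movie", ["index", "categories", "title"])
User = namedtuple("User", ["index", "gender", "age", "job_id"])

m = Movie(index=1, categories=["Animation", "Comedy"], title="Toy Story")
u = User(index=42, gender="M", age=25, job_id=3)
```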
sentiment¶
This script fetches and preprocesses the movie_reviews dataset provided by NLTK.
TODO(yuyang18): Complete dataset.
paddle.dataset.sentiment.get_word_dict()[source]
Sort the words by the frequency with which they occur in the samples.
- Returns
words_freq_sorted
paddle.dataset.sentiment.train()[source]
Default training set reader creator.
paddle.dataset.sentiment.test()[source]
Default test set reader creator.
paddle.dataset.sentiment.convert(path)[source]
Converts the dataset to recordio format.
uci_housing¶
UCI Housing dataset.
This module will download the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and parse the training set and test set into paddle reader creators.
paddle.dataset.uci_housing.train()[source]
UCI Housing training set creator.
It returns a reader creator; each sample in the reader is a normalized feature vector and the house price.
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.uci_housing.test()[source]
UCI Housing test set creator.
It returns a reader creator; each sample in the reader is a normalized feature vector and the house price.
- Returns
Test reader creator.
- Return type
callable
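"Features after normalization" means each feature column is rescaled before being served. One common scheme, assumed here for illustration (the exact formula paddle uses may differ), subtracts the column mean and divides by the column range:

```python
def normalize(column):
    """Rescale a feature column to (x - mean) / (max - min).
    One common normalization scheme, assumed for illustration."""
    lo, hi = min(column), max(column)
    mean = sum(column) / len(column)
    return [(x - mean) / (hi - lo) for x in column]

norm = normalize([0.0, 5.0, 10.0])
```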
wmt14¶
WMT14 dataset. The original WMT14 dataset is too large, so a small subset is provided. This module will download the dataset from http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz and parse the training set and test set into paddle reader creators.
paddle.dataset.wmt14.train(dict_size)[source]
WMT14 training set creator.
It returns a reader creator; each sample in the reader is a source language word ID sequence, a target language word ID sequence and a next word ID sequence.
- Returns
Training reader creator.
- Return type
callable
paddle.dataset.wmt14.test(dict_size)[source]
WMT14 test set creator.
It returns a reader creator; each sample in the reader is a source language word ID sequence, a target language word ID sequence and a next word ID sequence.
- Returns
Test reader creator.
- Return type
callable
paddle.dataset.wmt14.convert(path)[source]
Converts the dataset to recordio format.
wmt16¶
ACL2016 Multimodal Machine Translation. Please see this website for more details: http://www.statmt.org/wmt16/multimodal-task.html#task1
If you use the dataset created for this task, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.
@inproceedings{elliott-EtAl:2016:VL16,
  author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
  title     = {Multi30K: Multilingual English-German Image Descriptions},
  booktitle = {Proceedings of the 6th Workshop on Vision and Language},
  pages     = {70--74},
  year      = {2016}
}
paddle.dataset.wmt16.train(src_dict_size, trg_dict_size, src_lang='en')[source]
WMT16 train set reader.
This function returns the reader for the train data. Each sample the reader returns is made up of three fields: the source language word index sequence, the target language word index sequence and the next word index sequence.
NOTE: The original link for the training data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz
paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with moses's tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
- Parameters
src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
src_lang (string) – A string indicating which language is the source language. Available options are: "en" for English and "de" for German.
- Returns
The train reader.
- Return type
callable
paddle.dataset.wmt16.test(src_dict_size, trg_dict_size, src_lang='en')[source]
WMT16 test set reader.
This function returns the reader for the test data. Each sample the reader returns is made up of three fields: the source language word index sequence, the target language word index sequence and the next word index sequence.
NOTE: The original link for the test data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz
paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with moses's tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
- Parameters
src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
src_lang (string) – A string indicating which language is the source language. Available options are: "en" for English and "de" for German.
- Returns
The test reader.
- Return type
callable
paddle.dataset.wmt16.validation(src_dict_size, trg_dict_size, src_lang='en')[source]
WMT16 validation set reader.
This function returns the reader for the validation data. Each sample the reader returns is made up of three fields: the source language word index sequence, the target language word index sequence and the next word index sequence.
NOTE: The original link for the validation data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz
paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with moses's tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
- Parameters
src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
src_lang (string) – A string indicating which language is the source language. Available options are: "en" for English and "de" for German.
- Returns
The validation reader.
- Return type
callable
paddle.dataset.wmt16.get_dict(lang, dict_size, reverse=False)[source]
Return the word dictionary for the specified language.
- Parameters
lang (string) – A string indicating which language the dictionary is for. Available options are: "en" for English and "de" for German.
dict_size (int) – Size of the specified language dictionary.
reverse (bool) – If reverse is False, the returned dictionary maps words to indices; if reverse is True, it maps indices to words.
- Returns
The word dictionary for the specified language.
- Return type
dict
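The reverse flag of get_dict simply swaps keys and values of the word dictionary. A plain-Python sketch (the helper is hypothetical, not paddle's implementation):

```python
def word_dict(words, reverse=False):
    """Build a word->ID mapping, or an ID->word mapping when reverse=True."""
    forward = {w: i for i, w in enumerate(words)}
    if reverse:
        return {i: w for w, i in forward.items()}
    return forward

vocab = ["<s>", "<e>", "<unk>", "ein"]  # special tokens first, as documented
fwd = word_dict(vocab)
rev = word_dict(vocab, reverse=True)
```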
paddle.dataset.wmt16.fetch()[source]
Download the entire dataset.
paddle.dataset.wmt16.convert(path, src_dict_size, trg_dict_size, src_lang)[source]
Converts the dataset to recordio format.