fluid.dataset¶
DatasetFactory¶
class paddle.fluid.dataset.DatasetFactory [source]
DatasetFactory is a factory that creates a dataset by name. You can create a "QueueDataset", an "InMemoryDataset", or a "FileInstantDataset"; the default is "QueueDataset".
Example
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
create_dataset(datafeed_class='QueueDataset')
Create a "QueueDataset", an "InMemoryDataset", or a "FileInstantDataset". The default is "QueueDataset".
- Parameters
datafeed_class (str) – datafeed class name, QueueDataset or InMemoryDataset. Default is QueueDataset.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
InMemoryDataset¶
class paddle.fluid.dataset.InMemoryDataset [source]
InMemoryDataset loads data into memory and can shuffle the data before training. Instances of this class should be created by DatasetFactory.
Example
    import paddle.fluid
    dataset = paddle.fluid.DatasetFactory().create_dataset("InMemoryDataset")
load_into_memory()
Load data into memory.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
local_shuffle()
Shuffle the data loaded into memory locally (within the current process).
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
    dataset.local_shuffle()
global_shuffle(fleet=None)
Global shuffle. Global shuffle can only be used in distributed mode, i.e. when multiple processes on a single machine or multiple machines train together. If you run in distributed mode, pass the fleet singleton instead of None.
Examples
    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
    dataset.global_shuffle(fleet)
- Parameters
fleet (Fleet) – fleet singleton. Default None.
release_memory()
Release the in-memory data of the InMemoryDataset once the data will not be used again.
Examples
    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
    dataset.global_shuffle(fleet)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    exe.train_from_dataset(fluid.default_main_program(), dataset)
    dataset.release_memory()
get_memory_data_size(fleet=None)
Get the in-memory data size. Users can call this function to get the number of instances across all workers after the data has been loaded into memory.
Note
This function may hurt performance because it contains a barrier.
- Parameters
fleet (Fleet) – Fleet Object.
- Returns
The size of memory data.
Examples
    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
    print(dataset.get_memory_data_size(fleet))
get_shuffle_data_size(fleet=None)
Get the shuffled data size. Users can call this function to get the number of instances across all workers after a local or global shuffle.
Note
This function may hurt the performance of local shuffle because it contains a barrier. It does not affect global shuffle.
- Parameters
fleet (Fleet) – Fleet Object.
- Returns
The size of shuffle data.
Examples
    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    filelist = ["a.txt", "b.txt"]
    dataset.set_filelist(filelist)
    dataset.load_into_memory()
    dataset.global_shuffle(fleet)
    print(dataset.get_shuffle_data_size(fleet))
desc()
Returns a protobuf message for this DataFeedDesc.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    print(dataset.desc())
- Returns
A string message
set_batch_size(batch_size)
Set the batch size. It takes effect during training.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_batch_size(128)
- Parameters
batch_size (int) – batch size
set_filelist(filelist)
Set the file list for the current worker.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_filelist(['a.txt', 'b.txt'])
- Parameters
filelist (list) – file list
set_hdfs_config(fs_name, fs_ugi)
Set the HDFS configuration: fs name and ugi (an illustrative sketch of typical values follows the parameter list below).
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_hdfs_config("my_fs_name", "my_fs_ugi")
- Parameters
fs_name (str) – fs name
fs_ugi (str) – fs ugi
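As a hedged illustration only (not part of the API reference): fs_name is typically the HDFS namenode address and fs_ugi a comma-separated "user,password" pair, though the exact values depend on your HDFS deployment. The values below are placeholders.
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    # Placeholder values: a namenode address and a "user,password" ugi pair.
    dataset.set_hdfs_config("hdfs://my.namenode.example:9000", "my_user,my_password")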
set_pipe_command(pipe_command)
Set the pipe command of the current dataset. A pipe command is a UNIX pipeline command that the dataset streams its input data through before training (see the sketch after the parameter list below).
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_pipe_command("python my_script.py")
- Parameters
pipe_command (str) – pipe command
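A minimal sketch of what such a script could look like, assuming the dataset streams each input file through the command's stdin and consumes its stdout. my_script.py is the hypothetical name used in the example above, and the script's output must already be in the record format expected by the configured data feed.
    # my_script.py -- hypothetical preprocessing script used as a pipe command.
    # Raw input arrives on stdin; whatever is written to stdout is consumed by
    # the data feed, so it must be in the expected record format.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # ... transform the raw fields into feed-ready records here ...
        sys.stdout.write(" ".join(fields) + "\n")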
set_thread(thread_num)
Set the number of threads, i.e. the number of readers.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_thread(12)
- Parameters
thread_num (int) – thread num
set_use_var(var_list)
Set the Variables that will be used.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    data = fluid.layers.data(name="data", shape=[1], dtype="int64", lod_level=1)
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    dataset.set_use_var([data, label])
- Parameters
var_list (list) – variable list
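Putting the pieces together, the following is a minimal sketch of a complete InMemoryDataset workflow assembled from the examples above. The slot names, the tiny embedding/softmax network, and the input files a.txt/b.txt are illustrative assumptions, not part of the API reference.
    import paddle.fluid as fluid

    # Illustrative network over two assumed int64 slots, "data" and "label".
    data = fluid.layers.data(name="data", shape=[1], dtype="int64", lod_level=1)
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    emb = fluid.layers.embedding(input=data, size=[10000, 16], is_sparse=True)
    pool = fluid.layers.sequence_pool(input=emb, pool_type="sum")
    predict = fluid.layers.fc(input=pool, size=2, act="softmax")
    avg_cost = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_cost)

    # Configure the dataset: variables, batch size, readers, pipe command, files.
    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    dataset.set_use_var([data, label])
    dataset.set_batch_size(128)
    dataset.set_thread(4)
    dataset.set_pipe_command("cat")           # pass the data through unchanged
    dataset.set_filelist(["a.txt", "b.txt"])  # assumed local files

    # Load into memory, shuffle, train, then release the memory.
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    dataset.load_into_memory()
    dataset.local_shuffle()
    exe.train_from_dataset(fluid.default_main_program(), dataset)
    dataset.release_memory()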
QueueDataset¶
class paddle.fluid.dataset.QueueDataset [source]
QueueDataset processes data in a streaming fashion.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
local_shuffle()
Local shuffle data.
Local shuffle is not supported in QueueDataset; a NotImplementedError will be raised.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.local_shuffle()
global_shuffle(fleet=None)
Global shuffle data.
Global shuffle is not supported in QueueDataset; a NotImplementedError will be raised.
Examples
    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet
    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.global_shuffle(fleet)
desc()
Returns a protobuf message for this DataFeedDesc.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    print(dataset.desc())
- Returns
A string message
set_batch_size(batch_size)
Set the batch size. It takes effect during training.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_batch_size(128)
- Parameters
batch_size (int) – batch size
set_filelist(filelist)
Set the file list for the current worker.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_filelist(['a.txt', 'b.txt'])
- Parameters
filelist (list) – file list
set_hdfs_config(fs_name, fs_ugi)
Set the HDFS configuration: fs name and ugi.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_hdfs_config("my_fs_name", "my_fs_ugi")
- Parameters
fs_name (str) – fs name
fs_ugi (str) – fs ugi
set_pipe_command(pipe_command)
Set the pipe command of the current dataset. A pipe command is a UNIX pipeline command that the dataset streams its input data through before training.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_pipe_command("python my_script.py")
- Parameters
pipe_command (str) – pipe command
set_thread(thread_num)
Set the number of threads, i.e. the number of readers.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_thread(12)
- Parameters
thread_num (int) – thread num
set_use_var(var_list)
Set the Variables that will be used.
Examples
    import paddle.fluid as fluid
    dataset = fluid.DatasetFactory().create_dataset()
    data = fluid.layers.data(name="data", shape=[1], dtype="int64", lod_level=1)
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    dataset.set_use_var([data, label])
- Parameters
var_list (list) – variable list
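For reference, a minimal end-to-end QueueDataset sketch assembled from the examples above; the slot names, the tiny network, and the input files are illustrative assumptions. Unlike InMemoryDataset, there is no load or shuffle step: the files are streamed directly during training.
    import paddle.fluid as fluid

    # Illustrative slots and a tiny network; names and shapes are assumptions.
    data = fluid.layers.data(name="data", shape=[1], dtype="int64", lod_level=1)
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    emb = fluid.layers.embedding(input=data, size=[10000, 16], is_sparse=True)
    pool = fluid.layers.sequence_pool(input=emb, pool_type="sum")
    predict = fluid.layers.fc(input=pool, size=2, act="softmax")
    avg_cost = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_cost)

    # Configure the streaming dataset.
    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.set_use_var([data, label])
    dataset.set_batch_size(128)
    dataset.set_thread(4)
    dataset.set_pipe_command("cat")
    dataset.set_filelist(["a.txt", "b.txt"])  # assumed local files

    # Train directly from the streamed files.
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    exe.train_from_dataset(fluid.default_main_program(), dataset)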