Take Numpy Array as Training Data¶
PaddlePaddle Fluid supports configuring a data layer with fluid.layers.data().
You can then feed a Numpy Array, or a C++ fluid.LoDTensor created directly from Python, to fluid.Executor or fluid.ParallelExecutor through Executor.run(feed=...).
Configure Data Layer¶
With fluid.layers.data(), you can configure a data layer in the neural network. Details are as follows:
import paddle.fluid as fluid
image = fluid.layers.data(name="image", shape=[3, 224, 224])
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
# use image/label as layer input
prediction = fluid.layers.fc(input=image, size=1000, act="softmax")
loss = fluid.layers.cross_entropy(input=prediction, label=label)
...
In the code above, image and label are two input data layers created by fluid.layers.data. image is float data of shape [3, 224, 224]; label is int data of shape [1]. Note that:

1. In Fluid, -1 represents the batch size dimension by default, and -1 is prepended to the first dimension of shape automatically. Therefore in the code above, it is fine to feed a numpy array of shape [32, 3, 224, 224] to image. If you want to customize the position of the batch size dimension, set fluid.layers.data(append_batch_size=False). Please refer to the tutorial in the advanced user guide: Customize the BatchSize dimension.

2. The data type of category labels in Fluid is int64, and labels start from 0. For the supported data types, please refer to Data types supported by Fluid.
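The implicit batch dimension can be checked with plain numpy: a shape of [3, 224, 224] declared in fluid.layers.data matches any array whose trailing dimensions agree, with only the leading (batch) dimension free to vary. A minimal numpy-only sketch of that rule:

```python
import numpy

# Shape declared in fluid.layers.data(name="image", shape=[3, 224, 224]);
# Fluid prepends -1 for the batch dimension, giving [-1, 3, 224, 224].
declared_shape = (3, 224, 224)

# A mini-batch of 32 samples: the batch size occupies the first dimension.
batch = numpy.random.random(size=(32,) + declared_shape).astype('float32')

# The trailing dimensions must match the declared shape exactly.
assert batch.shape[1:] == declared_shape
assert batch.shape[0] == 32
```

The same array would therefore be accepted for the image layer regardless of whether the batch holds 32 samples or any other number.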
Transfer Train Data to Executor¶
Both Executor.run and ParallelExecutor.run accept a parameter feed.
The parameter is a Python dict. Its keys are the names of the data layers, such as image in the code above, and its values are the corresponding numpy arrays.
For example:
import numpy
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
# initialize the startup Program
exe.run(fluid.default_startup_program())
exe.run(feed={
    "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
    # labels are int64 and start from 0; randint draws valid class ids
    "label": numpy.random.randint(0, 1000, size=(32, 1)).astype('int64')
})
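Before calling Executor.run, it can help to sanity-check the feed dict against the declared layers. A numpy-only sketch; the expected_specs mapping is a hypothetical bookkeeping structure that mirrors the image/label declarations above, not a Fluid API:

```python
import numpy

# Hypothetical record of the declarations made with fluid.layers.data:
# name -> (shape without the batch dimension, dtype string)
expected_specs = {
    "image": ((3, 224, 224), 'float32'),
    "label": ((1,), 'int64'),
}

feed = {
    "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
    "label": numpy.zeros((32, 1), dtype='int64'),
}

# Every entry must match its declared trailing shape and dtype.
for name, array in feed.items():
    shape, dtype = expected_specs[name]
    assert array.shape[1:] == shape, name
    assert array.dtype == numpy.dtype(dtype), name
```

A mismatch in either shape or dtype caught here would otherwise surface as an error inside Executor.run.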
Advanced Usage¶
How to feed Sequence Data¶
Sequence data is a unique data type supported by PaddlePaddle Fluid. You can use LoDTensor as the input data type. To do so, you need to:

1. Feed all the data of a mini-batch to be trained.
2. Provide the length of each sequence.

You can use fluid.create_lod_tensor to create a LoDTensor.
When feeding sequence information, you must set the sequence nesting depth lod_level.
For instance, if the training data are sentences consisting of words, lod_level=1; if the training data are paragraphs consisting of sentences, which in turn consist of words, lod_level=2.
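The nesting can be checked with plain numpy: at lod_level=2 the outer lengths count sentences per paragraph and the inner lengths count words per sentence, so their sums must be mutually consistent. A sketch under those assumptions (the example lengths are illustrative):

```python
import numpy

# Two paragraphs: the first has 2 sentences, the second has 1.
# Those 3 sentences have 3, 2, and 4 words respectively.
recursive_seq_lens = [[2, 1], [3, 2, 4]]

# Flattened word ids for all 9 words, shaped (-1, 1) as Fluid expects.
data = numpy.arange(9, dtype='int64').reshape(-1, 1)

# Consistency conditions for a valid lod_level=2 layout:
# the outer lengths add up to the number of inner sequences,
# and the inner lengths add up to the number of data rows.
assert sum(recursive_seq_lens[0]) == len(recursive_seq_lens[1])
assert sum(recursive_seq_lens[1]) == data.shape[0]
```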
For example:
sentence = fluid.layers.data(name="sentence", dtype="int64", shape=[1], lod_level=1)
...
exe.run(feed={
    "sentence": fluid.create_lod_tensor(
        data=numpy.array([1, 3, 4, 5, 3, 6, 8], dtype='int64').reshape(-1, 1),
        recursive_seq_lens=[[4, 1, 2]],
        place=fluid.CPUPlace()
    )
})
The training data sentence contains three sequences, whose lengths are 4, 1, and 2 respectively. They are data[0:4], data[4:5], and data[5:7].
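The slices quoted above follow mechanically from the sequence lengths: cumulative sums of [4, 1, 2] give the offsets [0, 4, 5, 7], and consecutive offsets delimit each sequence. A numpy-only sketch:

```python
import numpy

data = numpy.array([1, 3, 4, 5, 3, 6, 8], dtype='int64').reshape(-1, 1)
seq_lens = [4, 1, 2]

# Offsets are the running totals of the sequence lengths: [0, 4, 5, 7].
offsets = numpy.concatenate(([0], numpy.cumsum(seq_lens)))

# Recover each sequence by slicing between consecutive offsets.
sequences = [data[offsets[i]:offsets[i + 1]] for i in range(len(seq_lens))]

assert [len(s) for s in sequences] == seq_lens
assert (sequences[0] == data[0:4]).all()
```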
How to prepare training data for every device in ParallelExecutor¶
When you feed data to ParallelExecutor.run(feed=...), you can explicitly assign the data for every training device (such as a GPU): feed a list in which each element is a dict. The keys of each dict are the names of the data layers and the values are the values of those layers.
For example:
parallel_executor = fluid.ParallelExecutor(use_cuda=True)
parallel_executor.run(
    feed=[
        {
            "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
            "label": numpy.random.randint(0, 1000, size=(32, 1)).astype('int64')
        },
        {
            "image": numpy.random.random(size=(16, 3, 224, 224)).astype('float32'),
            "label": numpy.random.randint(0, 1000, size=(16, 1)).astype('int64')
        },
    ]
)
In the code above, GPU0 will train 32 samples and GPU1 will train 16 samples.
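The per-device feed list can be prepared from one combined batch with plain numpy. The sketch below mirrors the 32/16 division above; the split sizes are purely illustrative:

```python
import numpy

# One combined batch of 48 samples.
images = numpy.random.random(size=(48, 3, 224, 224)).astype('float32')
labels = numpy.random.randint(0, 1000, size=(48, 1)).astype('int64')

# Split it unevenly across two devices: 32 samples, then 16.
split_points = [32]
feed_list = [
    {"image": img, "label": lab}
    for img, lab in zip(numpy.split(images, split_points),
                        numpy.split(labels, split_points))
]

assert [d["image"].shape[0] for d in feed_list] == [32, 16]
```

Each element of feed_list is then a complete feed dict for one device, in the shape ParallelExecutor.run(feed=...) expects.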
Customize the BatchSize dimension¶
By default in PaddlePaddle Fluid, the batch size is the first dimension of the data, indicated by -1. In advanced usage, however, the batch size can be fixed, or represented by another dimension or by multiple dimensions. This is achieved by setting fluid.layers.data(append_batch_size=False).

1. Fixed BatchSize dimension

image = fluid.layers.data(name="image", shape=[32, 784], append_batch_size=False)

Here image is always a matrix of size [32, 784].

2. Batch size expressed by another dimension

sentence = fluid.layers.data(name="sentence", shape=[80, -1, 1], append_batch_size=False, dtype="int64")

Here the middle dimension of sentence is the batch size. This data layout is used in fixed-length recurrent neural networks.
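A time-major layout like [80, -1, 1] can be illustrated with plain numpy: time steps occupy the first dimension and the batch occupies the middle one, so one time step for the whole batch is a contiguous slice. A sketch under that assumption (the batch size of 16 is illustrative):

```python
import numpy

time_steps, batch_size = 80, 16

# Time-major layout matching shape=[80, -1, 1]:
# dimension 0 is the fixed sequence length, dimension 1 is the batch.
sentence = numpy.zeros((time_steps, batch_size, 1), dtype='int64')

# Step t for the whole batch is a contiguous (batch_size, 1) slice.
step = sentence[3]
assert step.shape == (batch_size, 1)
```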
Data types supported by Fluid¶
The data types supported by PaddlePaddle Fluid include:

float16: supported by some operations
float32: the major data type for real numbers
float64: a minor data type for real numbers, supported by most operations
int32: a minor data type for labels
int64: the major data type for labels
uint64: a minor data type for labels
bool: the data type of control flow data
int16: a minor data type for labels
uint8: an input data type, used for image pixels