Introduction to C++ Inference API¶
To make the deployment of inference models more convenient, Fluid provides a set of high-level APIs that hide the diverse low-level optimization processes.
The inference library contains:
- the header file paddle_inference_api.h, which defines all interfaces
- the library file libpaddle_fluid.so or libpaddle_fluid.a

Details are as follows:
PaddleTensor¶
PaddleTensor defines the basic format of input and output data for inference. Common fields are as follows:
- name is used to indicate the name of the model variable that corresponds to the input data
- shape represents the shape of the tensor
- data is stored contiguously in a PaddleBuf, which can either wrap external data or allocate (malloc) memory on its own; refer to the associated definitions in the header file
- dtype represents the data type of the tensor
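A minimal sketch of filling in these fields, matching the full example later in this document ("input0" is just an illustrative name):

paddle::PaddleTensor tensor;
tensor.name = "input0";                     // name of the corresponding model variable
tensor.shape = std::vector<int>({4, 1});    // a 4 x 1 tensor
int64_t ids[4] = {1, 2, 3, 4};
tensor.data.Reset(ids, sizeof(ids));        // PaddleBuf wraps the external buffer
tensor.dtype = paddle::PaddleDType::INT64;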
Use Config to create different engines¶
Underneath the high-level API are various optimized execution paths, called engines. Switching between engines is done by passing a different Config, as sketched after the list below.
- NativeConfig: the native engine, consisting of the native forward operators of Paddle; it naturally supports all models trained by Paddle.
- AnalysisConfig: the TensorRT mixed engine. It is used to speed up GPU inference and supports TensorRT via subgraphs. This engine supports all Paddle models and automatically offloads parts of the computation graph to TensorRT for acceleration (WIP). For specific usage, please refer to here.
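A minimal sketch of how the engine is selected by the Config type passed to CreatePaddlePredictor ("xxx" is a placeholder model directory, as in the example below):

#include "paddle_inference_api.h"
// native engine: configure with NativeConfig
paddle::NativeConfig native_config;
native_config.model_dir = "xxx";
auto native_predictor =
    paddle::CreatePaddlePredictor<paddle::NativeConfig>(native_config);
// analysis engine: the same call, parameterized with AnalysisConfig
paddle::contrib::AnalysisConfig analysis_config;
analysis_config.SetModel("xxx");
auto analysis_predictor =
    paddle::CreatePaddlePredictor<paddle::contrib::AnalysisConfig>(analysis_config);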
Process of Inference Deployment¶
In general, the steps are:
- Use an appropriate configuration to create a PaddlePredictor
- Create a PaddleTensor for the input and pass it to the PaddlePredictor
- Fetch the output PaddleTensor
The complete process for implementing a simple model is shown below, with some details omitted.
#include "paddle_inference_api.h"
// create a config and modify associated options
paddle::NativeConfig config;
config.model_dir = "xxx";
config.use_gpu = false;
// create a native PaddlePredictor
auto predictor =
paddle::CreatePaddlePredictor<paddle::NativeConfig>(config);
// create input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
// create output tensor whose memory is reusable
std::vector<paddle::PaddleTensor> outputs;
// run inference
CHECK(predictor->Run({tensor}, &outputs));
// fetch outputs ...
At compile time, link against libpaddle_fluid.a or libpaddle_fluid.so.
Advanced Usage¶
Memory management of input and output¶
The data field of PaddleTensor is a PaddleBuf, which manages a section of memory into which data is copied.
There are two modes of memory management in PaddleBuf:
- Automatic allocation and management of memory

int some_size = 1024;
PaddleTensor tensor;
tensor.data.Resize(some_size);
- Passing in external memory

int some_size = 1024;
// You can allocate the memory outside; it must stay available while the PaddleTensor is in use
char* memory = new char[some_size];
tensor.data.Reset(memory, some_size);
// ...
// You need to release the memory manually to avoid a memory leak
delete[] memory;
Of the two modes, the first is more convenient, while the second gives strict control over memory management and facilitates integration with tcmalloc and other libraries.
Improve performance with contrib::AnalysisConfig¶
AnalysisConfig is at the pre-release stage and is protected by the contrib namespace; it may be adjusted in the future.
Similar to NativeConfig, AnalysisConfig can create a high-performance inference engine after a series of optimizations, including analysis and optimization of the computation graph as well as fusion and adjustment of some important Ops, which greatly improves the performance of models that use ops such as While, LSTM and GRU.
The usage of AnalysisConfig is similar to that of NativeConfig, but at present the former only supports CPU, with GPU support being added gradually.
AnalysisConfig config;
config.SetModel(dirname); // set the directory of the model
config.EnableUseGpu(100, 0 /*gpu id*/); // use GPU, or
config.DisableGpu();                    // use CPU
config.SwitchSpecifyInputNames(true);   // the name of each input needs to be specified
config.SwitchIrOptim();                 // turn on the optimization switch; a sequence of optimizations will be executed at runtime
Note that the name of each input PaddleTensor needs to be set. The previous example needs to be revised as follows:
auto predictor =
paddle::CreatePaddlePredictor<paddle::contrib::AnalysisConfig>(config); // it needs AnalysisConfig here
// create input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
tensor.name = "input0"; // the name needs to be set here
The subsequent execution process is exactly the same as with NativeConfig.
Variable-length sequence input¶
When dealing with variable-length sequence input, you need to set the LoD of the PaddleTensor.
// Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order.
tensor.lod = {{0,
/*0 + 3=*/3,
/*3 + 2=*/5,
/*5 + 4=*/9,
/*9 + 1=*/10,
/*10 + 2=*/12,
/*12 + 3=*/15}};
For more specific examples, please refer to the LoD-Tensor Instructions.
Suggestions for Performance¶
- If the CPU type permits, it is best to use the versions with support for AVX and MKL.
- Reuse the input and output PaddleTensors to avoid frequent memory allocation, which hurts performance (see the sketch after this list).
- Try to replace NativeConfig with AnalysisConfig to perform optimization for CPU or GPU inference.
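A minimal sketch of reusing the input tensor and output vector across Run calls; num_batches and FillNextBatch are illustrative placeholders, and predictor is created as in the examples above:

// allocate the input tensor and output vector once, then reuse them for every batch
paddle::PaddleTensor tensor;
tensor.name = "input0";                     // required when input names must be specified (AnalysisConfig)
tensor.shape = std::vector<int>({4, 1});
tensor.dtype = paddle::PaddleDType::INT64;
std::vector<paddle::PaddleTensor> outputs;  // output memory is reused across runs
int64_t batch[4];
for (int i = 0; i < num_batches; ++i) {     // num_batches: placeholder loop bound
  FillNextBatch(batch);                     // hypothetical helper that writes the next batch into `batch`
  tensor.data.Reset(batch, sizeof(batch));  // point PaddleBuf at the existing buffer; no reallocation
  CHECK(predictor->Run({tensor}, &outputs));
  // consume outputs ...
}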