Introduction to C++ Inference API¶
To make deploying inference models more convenient, Fluid provides a set of high-level APIs that hide the diverse low-level optimization details.
The inference library contains:
- the header file paddle_inference_api.h, which defines all the interfaces
- the library file libpaddle_fluid.so or libpaddle_fluid.a
Details are as follows:
PaddleTensor¶
PaddleTensor defines the basic format of the input and output data for inference. The common fields are as follows:
- name, used to indicate the name of the variable in the model that the input data corresponds to.
- shape, the shape of the tensor.
- data, stored contiguously in a PaddleBuf. A PaddleBuf can either wrap external data or allocate memory by itself; see the related definitions in the header file.
- dtype, the data type of the tensor.
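As a minimal sketch of filling these fields by hand (the variable name "firstw" is only an assumed example from a hypothetical model):
#include <cstdint>
#include <vector>
#include "paddle_inference_api.h"

// Sketch: fill the common PaddleTensor fields by hand.
// "firstw" is a hypothetical variable name from the deployed model.
paddle::PaddleTensor MakeInput() {
  static int64_t ids[4] = {1, 2, 3, 4};        // must stay alive while the predictor uses it
  paddle::PaddleTensor input;
  input.name = "firstw";                       // name of the model variable this input feeds
  input.shape = std::vector<int>({4, 1});      // shape of the tensor
  input.data.Reset(ids, sizeof(ids));          // wrap external memory without copying
  input.dtype = paddle::PaddleDType::INT64;    // element type
  return input;
}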
Use Config to create different engines¶
Underneath the high-level API are various optimization pipelines, called engines. Switching between engines is done by passing a different Config.
- NativeConfig: the native engine, consisting of the native forward operators of Paddle. It naturally supports all models trained by Paddle.
- AnalysisConfig: the TensorRT mixed engine. It is used to speed up GPU inference and supports TensorRT with subgraphs; it supports all Paddle models and automatically slices parts of the computation graph into TensorRT subgraphs to speed them up (WIP). For specific usage, please refer to here.
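The engine is selected simply by the type of Config passed to CreatePaddlePredictor; a brief sketch (the model directory "xxx" is a placeholder):
#include <memory>
#include "paddle_inference_api.h"

// The Config type given to CreatePaddlePredictor determines which engine runs.
std::unique_ptr<paddle::PaddlePredictor> CreateNativePredictor() {
  paddle::NativeConfig config;              // native engine
  config.model_dir = "xxx";                 // placeholder model directory
  config.use_gpu = false;
  return paddle::CreatePaddlePredictor<paddle::NativeConfig>(config);
}

std::unique_ptr<paddle::PaddlePredictor> CreateAnalysisPredictor() {
  paddle::contrib::AnalysisConfig config;   // analysis engine (TensorRT mixed, WIP)
  config.SetModel("xxx");                   // placeholder model directory
  config.SwitchIrOptim();                   // turn on graph optimizations
  return paddle::CreatePaddlePredictor<paddle::contrib::AnalysisConfig>(config);
}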
Process of Inference Deployment¶
In general, the steps are:
- Create a PaddlePredictor with an appropriate configuration.
- Create PaddleTensors for the input and pass them to the PaddlePredictor.
- Fetch the output PaddleTensors from the PaddlePredictor.
The complete process for a simple model is shown below, with some details omitted.
#include "paddle_inference_api.h"
// create a config and modify associated options
paddle::NativeConfig config;
config.model_dir = "xxx";
config.use_gpu = false;
// create a native PaddlePredictor
auto predictor =
paddle::CreatePaddlePredictor<paddle::NativeConfig>(config);
// create input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
// create output tensor whose memory is reusable
std::vector<paddle::PaddleTensor> outputs;
// run inference
CHECK(predictor->Run({tensor}, &outputs));
// fetch outputs ...
At compile time, link the program against libpaddle_fluid.a or libpaddle_fluid.so.
Advanced Usage¶
Memory management of input and output¶
The data field of PaddleTensor is a PaddleBuf, which manages a section of memory holding the tensor data.
PaddleBuf supports two memory-management modes:
Automatic memory allocation and management

int some_size = 1024;
PaddleTensor tensor;
tensor.data.Resize(some_size);
Passing in external memory

int some_size = 1024;
// You can allocate the memory outside and keep it alive while the PaddleTensor is in use
char* memory = new char[some_size];
tensor.data.Reset(memory, some_size);
// ...
// You need to release the memory manually to avoid a memory leak
delete[] memory;
Of the two modes, the first is more convenient, while the second gives strict control over memory management, making it easier to integrate with tcmalloc and other memory allocators.
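As a variation on the second mode (a sketch; the helper name FillFromVector is an assumption), any buffer whose lifetime you already manage, for example one owned by a std::vector, can be handed to PaddleBuf:
#include <cstdint>
#include <vector>
#include "paddle_inference_api.h"

// Sketch: a std::vector owns the memory and PaddleBuf only references it.
// The vector must stay alive for as long as the PaddleTensor is used.
void FillFromVector(paddle::PaddleTensor* tensor, std::vector<int64_t>* ids) {
  tensor->data.Reset(ids->data(), ids->size() * sizeof(int64_t));
  tensor->dtype = paddle::PaddleDType::INT64;
  tensor->shape = std::vector<int>({static_cast<int>(ids->size()), 1});
}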
Upgrade performance based on contrib::AnalysisConfig¶
AnalysisConfig is at the pre-release stage and is protected by the namespace contrib, so it may be adjusted in the future.
Similar to NativeConfig, AnalysisConfig creates a high-performance inference engine after a series of optimizations, including analysis and optimization of the computation graph as well as fusion and tuning of some important operators such as While, LSTM, and GRU, which greatly improves model performance.
The usage of AnalysisConfig is similar to that of NativeConfig, but at present it only supports CPU; GPU support is being extended.
AnalysisConfig config;
config.SetModel(dirname); // set the directory of the model
config.EnableUseGpu(100, 0 /*gpu id*/); // use GPU, or
config.DisableGpu();              // use CPU
config.SwitchSpecifyInputNames(true); // you need to specify the name of your input
config.SwitchIrOptim();           // turn on the optimization switch; a sequence of optimizations will run during inference
Note that the name of the input PaddleTensor needs to be specified. The previous example needs to be revised as follows:
auto predictor =
paddle::CreatePaddlePredictor<paddle::contrib::AnalysisConfig>(config); // it needs AnalysisConfig here
// create input tensor
int64_t data[4] = {1, 2, 3, 4};
paddle::PaddleTensor tensor;
tensor.shape = std::vector<int>({4, 1});
tensor.data.Reset(data, sizeof(data));
tensor.dtype = paddle::PaddleDType::INT64;
tensor.name = "input0"; // name need to be set here
The subsequent execution process is exactly the same as with NativeConfig.
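For completeness, a sketch of that subsequent step, continuing the snippet above (it assumes the first output holds float data; adjust to your model):
// run inference exactly as in the NativeConfig example
std::vector<paddle::PaddleTensor> outputs;
CHECK(predictor->Run({tensor}, &outputs));

// read the first output; the element type depends on your model
const paddle::PaddleTensor& out = outputs.front();
const float* result = static_cast<const float*>(out.data.data());
size_t num_elements = out.data.length() / sizeof(float);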
Variable-length sequence input¶
When dealing with variable-length sequence input, you need to set the LoD of the PaddleTensor.
// Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order.
tensor.lod = {{0,
/*0 + 3=*/3,
/*3 + 2=*/5,
/*5 + 4=*/9,
/*9 + 1=*/10,
/*10 + 2=*/12,
/*12 + 3=*/15}};
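As a sketch of how such a batch might be assembled end to end (the input name "words" and the id values are assumptions for illustration):
// the data buffer holds as many ids as the sum of the sequence lengths
// (3 + 2 + 4 + 1 + 2 + 3 = 15)
std::vector<int64_t> word_ids(15, 1);                 // illustrative ids
paddle::PaddleTensor tensor;
tensor.name = "words";                                // hypothetical input name
tensor.shape = std::vector<int>({15, 1});
tensor.data.Reset(word_ids.data(), word_ids.size() * sizeof(int64_t));
tensor.dtype = paddle::PaddleDType::INT64;
tensor.lod = {{0, 3, 5, 9, 10, 12, 15}};              // the cumulative offsets shown above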
For more specific examples, please refer to LoD-Tensor Instructions.
Suggestion for Performance¶
If the CPU type permits, it’s best to use the versions with support for AVX and MKL.
- Reuse the input and output PaddleTensors to avoid frequent memory allocations, which degrade performance (see the sketch below).
- Try to replace NativeConfig with AnalysisConfig to perform optimization for CPU or GPU inference.
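A sketch of the first suggestion: reuse the same input PaddleTensor and output vector across batches (num_batches, batch_size and FillBatch are assumed placeholders):
// reuse the same input tensor and output vector so that the underlying
// PaddleBuf memory can be recycled instead of being reallocated every run
paddle::PaddleTensor input;
input.dtype = paddle::PaddleDType::INT64;
std::vector<paddle::PaddleTensor> outputs;

for (int batch = 0; batch < num_batches; ++batch) {
  input.shape = std::vector<int>({batch_size, 1});
  input.data.Resize(batch_size * sizeof(int64_t));              // allocate or grow the managed buffer
  FillBatch(static_cast<int64_t*>(input.data.data()), batch);   // hypothetical data-filling helper
  CHECK(predictor->Run({input}, &outputs));                     // outputs' memory is reused as well
}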