Save, Load Models or Variables & Incremental Learning

Model variable classification

In PaddlePaddle Fluid, all model variables take fluid.Variable() as their base class. Under this base class, model variables can be divided into the following categories:

  1. Model parameters

Model parameters are the variables trained and learned in a deep learning model. During training, the framework computes the current gradient of each model parameter with the back-propagation algorithm, and the optimizer updates the parameters according to their gradients. The training process of a model can essentially be seen as the continuous, iterative updating of its model parameters. In PaddlePaddle Fluid, model parameters are represented by fluid.framework.Parameter, a derived class of fluid.Variable(). Besides all the properties of fluid.Variable(), fluid.framework.Parameter can also be configured with its own initialization method, update rate and other properties (the sketch after this classification shows how these are set via fluid.ParamAttr).

  2. Persistable variables

Persistable variables are variables that persist throughout the training process and are not destroyed at the end of an iteration, such as a dynamically adjusted global learning rate. In PaddlePaddle Fluid, a persistable variable is represented by setting the persistable property of fluid.Variable() to True. All model parameters are persistable variables, but not all persistable variables are model parameters.

  3. Temporary variables

All model variables that do not belong to the above two categories are temporary variables. Such a variable exists only within one training iteration: after each iteration all temporary variables are destroyed, and a new set is constructed before the next iteration begins. In general, most of the variables in a model belong to this category, such as the input training data, the output of an ordinary layer, and so on.
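As a rough sketch (the small network and names below are only for illustration, and fluid.ParamAttr is used here to set the parameter-specific properties mentioned above), the three categories can be told apart by checking whether a variable is a fluid.framework.Parameter and whether its persistable flag is set:

import paddle.fluid as fluid

# Build a tiny network; fluid.ParamAttr gives the fc weight its own
# initialization method and update rate (names here are illustrative).
img = fluid.layers.data(name='img', shape=[784], dtype='float32')
w_attr = fluid.ParamAttr(name='fc_w',
                         initializer=fluid.initializer.Xavier(),
                         learning_rate=0.5)
hidden = fluid.layers.fc(input=img, size=100, act='relu', param_attr=w_attr)

# Classify every variable in the program into the three categories above.
prog = fluid.default_main_program()
for var in prog.list_vars():
    if isinstance(var, fluid.framework.Parameter):
        kind = 'model parameter'        # e.g. the fc weight 'fc_w'
    elif var.persistable:
        kind = 'persistable variable'   # survives across iterations
    else:
        kind = 'temporary variable'     # recreated for every iteration
    print(var.name, kind)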

How to save model variables

The model variables we need to save differ by application. For example, if we only want to save the model for future prediction, saving the model parameters is enough. But if we need to save a checkpoint from which the current training can later be recovered, we should save all the persistable variables, and even record the current epoch and step id, because some model variables, although not parameters, are still essential for model training.
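As a hedged sketch of such a checkpoint, assuming an existing executor exe and training program prog, and using an illustrative helper name and metadata file name:

import json
import os
import paddle.fluid as fluid

def save_checkpoint(exe, prog, dirname, epoch, step):
    # Save every persistable variable of the training program.
    fluid.io.save_persistables(exe, dirname, prog)
    # Record the training progress alongside the saved variables.
    with open(os.path.join(dirname, "meta.json"), "w") as f:
        json.dump({"epoch": epoch, "step": step}, f)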

Save the model to make predictions for new samples

If we save the model in order to make predictions for new samples, saving only the model parameters is sufficient. We can use the fluid.io.save_params() interface to save model parameters.

For example:

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
param_path = "./my_paddle_model"
prog = fluid.default_main_program()
# main_program=None means the default main program, i.e. prog, is used.
fluid.io.save_params(executor=exe, dirname=param_path, main_program=None)

In the example above, by calling the fluid.io.save_params function, PaddlePaddle Fluid scans all model variables in the default fluid.Program, i.e. prog, and picks out all model parameters. All these model parameters are saved to the directory specified by param_path.

How to load model variables

Corresponding to the saving of model variables, we provide two sets of APIs: one to load the model parameters and one to load the persistable variables of a model.

Load the model to make predictions for new samples

For models saved with fluid.io.save_params , you can load them with fluid.io.load_params.

For example:

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
param_path = "./my_paddle_model"
prog = fluid.default_main_program()
# The startup program must already have been run by the executor before this call.
fluid.io.load_params(executor=exe, dirname=param_path,
                     main_program=prog)

In the above example, by calling the fluid.io.load_params function, PaddlePaddle Fluid will scan all the model variables in prog, filter out all the model parameters, and try to load them from param_path .

It is important to note that the prog used here must be exactly the same as the forward part of the prog used when calling fluid.io.save_params, and must not contain any parameter-update operations. If the two are inconsistent, some variables may not be loaded correctly; if parameter-update operations are mistakenly included, the parameters may be changed during ordinary prediction. The relationship between these two fluid.Program s is similar to that between a training fluid.Program and a test fluid.Program; see: Evaluate model while training.
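As a minimal sketch of this constraint (the tiny network and loss below are only for illustration), the forward part can be cloned for testing before the optimizer appends its parameter-update operators, and the clone is then passed when loading parameters for prediction:

import paddle.fluid as fluid

# A tiny network just for illustration.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
pred = fluid.layers.fc(input=x, size=1)
avg_loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))

# Clone the forward-only program before the optimizer adds update operators.
test_program = fluid.default_main_program().clone(for_test=True)
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())  # run startup before loading
fluid.io.load_params(executor=exe, dirname="./my_paddle_model",
                     main_program=test_program)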

In addition, special care must be taken: fluid.default_startup_program() must be run by the executor before fluid.io.load_params is called. If it is run afterwards, it may overwrite the loaded model parameters and cause errors.

Saving the models and parameters used for prediction

The inference engine provides two interfaces: fluid.io.save_inference_model for saving the prediction model, and fluid.io.load_inference_model for loading it. A minimal usage sketch follows the list below.

  • fluid.io.save_inference_model: Please refer to the Inference Engine documentation.

  • fluid.io.load_inference_model: Please refer to the Inference Engine documentation.
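A minimal usage sketch of these two interfaces, assuming an illustrative network whose feed variable is named 'img' and whose output is predict:

import paddle.fluid as fluid

# A tiny network just for illustration.
img = fluid.layers.data(name='img', shape=[784], dtype='float32')
predict = fluid.layers.fc(input=img, size=10, act='softmax')

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
path = "./infer_model"

# Save: prune the program to the prediction part and store it with its parameters.
fluid.io.save_inference_model(dirname=path,
                              feeded_var_names=['img'],
                              target_vars=[predict],
                              executor=exe)

# Load: recover the pruned program, its feed names and its fetch targets.
[inference_program, feed_target_names, fetch_targets] = (
    fluid.io.load_inference_model(dirname=path, executor=exe))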

Incremental training

Incremental training means that a learning system can continuously learn new knowledge from new samples while preserving most of what it has learned before. It therefore involves two points: saving the parameters that need to be persisted at the end of one training run, and loading those saved persistable parameters at the beginning of the next. Accordingly, incremental training uses the following APIs: fluid.io.save_persistables and fluid.io.load_persistables.

Single-node incremental training

The general steps of incremental training on a single node are as follows:

  1. At the end of training, call fluid.io.save_persistables to save the persistable parameters to the specified location.

  2. After the training startup_program has been executed by the executor Executor, call fluid.io.load_persistables to load the previously saved persistable parameters.

  3. Continue training with the executor Executor or ParallelExecutor.

Example:

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
prog = fluid.default_main_program()
fluid.io.save_persistables(exe, path, prog)

In the above example, by calling the fluid.io.save_persistables function, PaddlePaddle Fluid will find all persistable variables among the model variables in the default fluid.Program, i.e. prog, and save them to the specified path directory.

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
startup_prog = fluid.default_startup_program()
exe.run(startup_prog)
# Load the previously saved persistable variables; the startup program has already run.
fluid.io.load_persistables(exe, path, startup_prog)
main_prog = fluid.default_main_program()
# In practice main_prog is then run in a training loop with feed/fetch to continue training.
exe.run(main_prog)

In the above example, by calling the fluid.io.load_persistables function, PaddlePaddle Fluid will find the persistable variables among the model variables in startup_prog and load them one by one from the specified path directory, so that training can continue.

The general steps for multi-node incremental training (without distributed large-scale sparse matrices)

There are several differences between multi-node incremental training and single-node incremental training:

  1. At the end of training, when fluid.io.save_persistables is called to save the persistable parameters, it is not necessary for all trainers to call this method; usually only trainer 0 calls it.

  2. The parameters for multi-node incremental training are loaded on the PServer side; the trainer side does not need to load them. After the PServers have fully started, the trainers will synchronize the parameters from the PServers.

The general steps for multi-node incremental training (without distributed large-scale sparse matrices) are:

  1. At the end of training, trainer 0 calls fluid.io.save_persistables to save the persistable parameters to the specified path.

  2. Share all the parameters saved by trainer 0 with all PServers through HDFS or other means (each PServer needs the complete set of parameters).

  3. After the training startup_program has been executed by the executor ( Executor ), each PServer calls fluid.io.load_persistables to load the persistable parameters saved by trainer 0.

  4. Each PServer then starts its PServer program via the executor Executor.

  5. All trainer nodes carry out the training process normally through the executor Executor or ParallelExecutor.

For a trainer that saves parameters during training, for example:

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
trainer_id = 0  # in a real job the trainer id comes from the distributed environment
if trainer_id == 0:
    prog = fluid.default_main_program()
    fluid.io.save_persistables(exe, path, prog)

In the above example, trainer 0 calls the fluid.io.save_persistables function. By calling this function, PaddlePaddle Fluid finds all persistable variables among the model variables of the default fluid.Program, i.e. prog, and saves them to the specified path directory. The stored model is then uploaded to a location accessible to all PServers, for example via a third-party file system such as HDFS, as in the sketch below.
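A hedged sketch of that upload step, assuming the HDFS command-line client is available; the destination path is an illustrative placeholder:

import subprocess

path = "./models"
# Push the saved directory to a shared HDFS location that all PServers can read.
subprocess.check_call(["hadoop", "fs", "-put", "-f", path, "/shared/models"])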

For a PServer that loads parameters during training, for example:

import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
    pserver_endpoints = "127.0.0.1:1001,127.0.0.1:1002"
    trainers = 4
    Training_role == "PSERVER"
    config = fluid.DistributeTranspilerConfig()
    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True)

    if training_role == "PSERVER":
            current_endpoint = "127.0.0.1:1001"
            pserver_prog = t.get_pserver_program(current_endpoint)
            pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)

            exe.run(pserver_startup)
            fluid.io.load_persistables(exe, path, pserver_startup)
            exe.run(pserver_prog)
    if training_role == "TRAINER":
            main_program = t.get_trainer_program()
                            exe.run(main_program)

In the above example, each PServer obtains the parameters saved by trainer 0 (e.g. by calling HDFS commands) and obtains its own fluid.Program through the transpiler configuration. PaddlePaddle Fluid then finds all persistable variables among the model variables of this fluid.Program, i.e. pserver_startup, and loads them from the specified path directory.