fluid.optimizer¶
Adadelta¶
-
paddle.fluid.optimizer.
Adadelta
alias of
paddle.fluid.optimizer.AdadeltaOptimizer
Adagrad¶
-
paddle.fluid.optimizer.
Adagrad
alias of
paddle.fluid.optimizer.AdagradOptimizer
AdagradOptimizer¶
-
class
paddle.fluid.optimizer.
AdagradOptimizer
(learning_rate, epsilon=1e-06, regularization=None, name=None, initial_accumulator_value=0.0)[source] Adaptive Gradient Algorithm (Adagrad)
The update is done as follows:
\[ \begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added here in our implementation as also proposed here: http://cs231n.github.io/neural-networks-3/#ada for numerical stability to avoid the division by zero error.
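As a rough illustration only, the update above can be written in plain NumPy as follows; all names here (param, grad, moment) are local to the sketch and are not fluid variables.

import numpy as np

learning_rate, epsilon = 0.2, 1e-06
param = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
moment = np.zeros_like(param)

# moment_out = moment + grad * grad
moment_out = moment + grad * grad
# param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)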
- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
epsilon (float) – a small float value for numerical stability.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
initial_accumulator_value (float) – Initial value for moment accumulator.
Examples
import paddle.fluid as fluid
import numpy as np

np_inp = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
inp = fluid.layers.data(
    name="inp", shape=[2, 2], append_batch_size=False)
out = fluid.layers.fc(inp, size=3)
out = fluid.layers.reduce_sum(out)
optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
optimizer.minimize(out)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.run(feed={"inp": np_inp}, fetch_list=[out.name])
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
import paddle.fluid as fluid

loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
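No example is given for apply_optimize in the original text; the following minimal sketch (assuming a hypothetical network() that builds a loss Variable, as in the apply_gradients example) shows how it mirrors what minimize() does internally: backward() followed by apply_optimize().

import paddle.fluid as fluid

loss = network()
optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
params_grads = optimizer.backward(loss)
# startup_program=None matches the default used by minimize()
optimize_ops = optimizer.apply_optimize(
    loss, startup_program=None, params_grads=params_grads)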
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)
        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
            [x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array(
            [x[1] for x in data]).astype('int64').reshape(128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True

        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
            mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables("save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
    assert (optimizer2._learning_rate.__dict__ ==
            optimizer_load2._learning_rate.__dict__)
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
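A minimal sketch of the return value described above (again assuming a hypothetical network() that builds a loss Variable):

import paddle.fluid as fluid

loss = network()
optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
# minimize() returns the appended operators and the (param, grad) pairs
optimize_ops, params_grads = optimizer.minimize(loss)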
Adam¶
-
paddle.fluid.optimizer.
Adam
alias of
paddle.fluid.optimizer.AdamOptimizer
Adamax¶
-
paddle.fluid.optimizer.
Adamax
alias of
paddle.fluid.optimizer.AdamaxOptimizer
AdamaxOptimizer¶
-
class
paddle.fluid.optimizer.
AdamaxOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)[source] We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.
Adamax updates:
\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align} \]The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.
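For intuition only, the Adamax update above can be sketched in plain NumPy; every name below is local to the snippet, not a fluid variable.

import numpy as np

beta1, beta2, epsilon, base_lr = 0.9, 0.999, 1e-08, 0.001
param = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
moment = np.zeros_like(param)
inf_norm = np.zeros_like(param)
t = 1

moment_out = beta1 * moment + (1 - beta1) * grad
inf_norm_out = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
lr = base_lr / (1 - beta1 ** t)
param_out = param - lr * moment_out / inf_norm_out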
Examples
import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    adam = fluid.optimizer.Adamax(learning_rate=0.2)
    adam.minimize(loss)

# Run the startup program once and only once.
exe.run(startup_program)

x = numpy.random.random(size=(10, 1)).astype('float32')
outs = exe.run(program=train_program,
               feed={'X': x},
               fetch_list=[loss.name])
- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
beta1 (float) – The exponential decay rate for the 1st moment estimates.
beta2 (float) – The exponential decay rate for the 2nd moment estimates.
epsilon (float) – a small float value for numerical stability.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Notes
Currently, AdamaxOptimizer doesn’t support sparse parameter optimization.
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
AdamOptimizer¶
-
class
paddle.fluid.optimizer.
AdamOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None, lazy_mode=False)[source] This implements the Adam optimizer from Section 2 of the Adam paper : https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.
Adam updates:
\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \ \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon}\end{aligned}\end{align} \]- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
beta1 (float) – The exponential decay rate for the 1st moment estimates.
beta2 (float) – The exponential decay rate for the 2nd moment estimates.
epsilon (float) – a small float value for numerical stability.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
lazy_mode (bool) – Defaults to False. The official Adam algorithm keeps two moving-average accumulators, and every element of both accumulators is updated at every step. If the parameter is very large, this dense update may be very slow. In lazy mode only the elements that have a gradient in the current mini-batch are updated, which can be much faster; however, this differs from the original Adam algorithm and may lead to different results.
Examples
import paddle
import paddle.fluid as fluid

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    adam_optimizer = fluid.optimizer.AdamOptimizer(0.01)
    adam_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
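For intuition only, the Adam update equations above can also be sketched in plain NumPy (a sketch, not how the fluid op executes):

import numpy as np

beta1, beta2, epsilon, base_lr = 0.9, 0.999, 1e-08, 0.001
param = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
m, v, t = np.zeros_like(param), np.zeros_like(param), 1

m_out = beta1 * m + (1 - beta1) * grad
v_out = beta2 * v + (1 - beta2) * grad * grad
lr = base_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
param_out = param - lr * m_out / (np.sqrt(v_out) + epsilon)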
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
DecayedAdagrad¶
-
paddle.fluid.optimizer.
DecayedAdagrad
alias of
paddle.fluid.optimizer.DecayedAdagradOptimizer
DecayedAdagradOptimizer¶
-
class
paddle.fluid.optimizer.
DecayedAdagradOptimizer
(learning_rate, decay=0.95, epsilon=1e-06, regularization=None, name=None)[source] Decayed Adagrad Optimizer
The original paper: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
The update is done as follows:
\[ \begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.
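As a rough NumPy sketch of the decayed update above (local names only):

import numpy as np

learning_rate, decay, epsilon = 0.2, 0.95, 1e-06
param = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
moment = np.zeros_like(param)

moment_out = decay * moment + (1 - decay) * grad * grad
param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)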
- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
decay (float) – decay rate.
epsilon (float) – a small float value for numerical stability.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.optimizer import DecayedAdagrad

x = layers.data(name='x', shape=[-1, 10], dtype='float32')
trans = layers.fc(x, 100)
cost = layers.reduce_mean(trans)
optimizer = fluid.optimizer.DecayedAdagrad(learning_rate=0.2)
optimizer.minimize(cost)
Notes
Currently, DecayedAdagradOptimizer doesn’t support sparse parameter optimization.
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
DGCMomentumOptimizer¶
-
class
paddle.fluid.optimizer.
DGCMomentumOptimizer
(learning_rate, momentum, rampup_begin_step, rampup_step=1, sparsity=[0.999], use_nesterov=False, local_grad_clip_norm=None, num_trainers=None, regularization=None, name=None)[source] Original paper is https://arxiv.org/abs/1712.01887
DGC reduces the communication bandwidth by sending only the important gradients (sparse update): only gradients larger than a threshold are transmitted.
To avoid losing information, DGC accumulates the rest of the gradients locally.
Eventually, these gradients become large enough to be transmitted.
Thus, DGC sends the large gradients immediately but eventually sends all of the gradients over time.
To ensure no loss of accuracy, DGC employs momentum correction and local gradient clipping on top of the gradient sparsification to maintain model performance.
DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.
This optimizer does two things (see the sparsification sketch below):
Compress the gradient by taking the top-k important values from the gradient tensor and using them for allreduce, which reduces network bandwidth.
Call momentum to optimize on the cost.
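The sparsification step alone can be sketched as follows; the function name and shapes are hypothetical, and the real op additionally applies the momentum correction and gradient clipping described above.

import numpy as np

def topk_sparsify(grad, residual, sparsity):
    acc = residual + grad                          # accumulate locally
    k = max(1, int(acc.size * (1.0 - sparsity)))   # keep top-k important elements
    threshold = np.sort(np.abs(acc).ravel())[-k]
    mask = np.abs(acc) >= threshold
    to_send = np.where(mask, acc, 0.0)             # transmitted (sparse update)
    new_residual = np.where(mask, 0.0, acc)        # rest is kept for later steps
    return to_send, new_residual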
- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
momentum (float) – Momentum factor.
rampup_begin_step (int) – The beginning step from which gradient compression is implemented.
rampup_step (int) – Number of steps over which the sparsity schedule is applied. Default is 1. For example, if the sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999] and rampup_step is 5, 0.75 is used at step 0, 0.9375 at step 1, and so on; once the end of the sparsity array is reached, 0.999 is used from then on.
sparsity (list[float]) – Sparsity schedule; at each step the top (1 - current sparsity) fraction of the gradient tensor's elements is kept.
use_nesterov (bool) – Enables Nesterov momentum. True means use nesterov.
local_grad_clip_norm (float) – Clip norm value if needed.
num_trainers – The number of training nodes.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle.fluid as fluid

optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.0001,
    momentum=0.9,
    rampup_step=1000,
    rampup_begin_step=1252,
    sparsity=[0.999, 0.999])
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
ExponentialMovingAverage¶
-
class
paddle.fluid.optimizer.
ExponentialMovingAverage
(decay=0.999, thres_steps=None, name=None)[source] Compute the moving average of parameters with exponential decay. Given a parameter \(\theta\), its exponential moving average (EMA) will be
\[ \begin{align}\begin{aligned}\text{EMA}_0 & = 0\\\text{EMA}_t & = \text{decay} * \text{EMA}_{t-1} + (1 - \text{decay}) * \theta_t\end{aligned}\end{align} \]The average results calculated by update() method will be saved in temporary variables which are created and maintained by the object, and can be applied to parameters of current model by calling apply() method. And the restore() method is used to restore the parameters.
Bias correction. All EMAs are initialized to \(0\) and hence they are zero biased, which can be corrected by dividing by a factor \((1 - \text{decay}^t)\), i.e., the actual EMAs applied to parameters when calling the apply() method would be
\[\widehat{\text{EMA}}_t = \frac{\text{EMA}_t}{1 - \text{decay}^t}\]Decay rate scheduling. A large decay rate very close to 1 would result in that the averages move very slowly. And a better strategy is to set a relative smaller decay rate in the very beginning. The argument thres_steps allows users to pass a Variable to schedule the decay rate, in this case, the actual decay rate becomes
\[\min(\text{decay}, \frac{1 + \text{thres_steps}}{10 + \text{thres_steps}})\]Usually thres_steps can be the global training steps.
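A small self-contained sketch of the bias-corrected EMA above (independent of the fluid implementation):

import numpy as np

decay = 0.999
ema = 0.0
for t, theta in enumerate(np.random.rand(5), start=1):
    ema = decay * ema + (1 - decay) * theta
    corrected_ema = ema / (1 - decay ** t)  # bias-corrected value applied by apply()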
- Parameters
decay (float) – The exponential decay rate, usually close to 1, such as 0.999, 0.9999, … .
thres_steps (Variable|None) – If not None, schedule the decay rate.
name (str|None) – An optional name prefix.
Examples
import numpy
import paddle
import paddle.fluid as fluid

data = fluid.layers.data(name='x', shape=[5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

test_program = fluid.default_main_program().clone(for_test=True)

optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(cost)

global_steps = fluid.layers.learning_rate_scheduler._decay_step_counter()
ema = fluid.optimizer.ExponentialMovingAverage(0.999, thres_steps=global_steps)
ema.update()

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

for pass_id in range(3):
    for batch_id in range(6):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=fluid.default_main_program(),
                feed={'x': data},
                fetch_list=[cost.name])

    # usage 1
    with ema.apply(exe):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])

    # usage 2
    with ema.apply(exe, need_restore=False):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])
    ema.restore(exe)
-
update
() Update Exponential Moving Average. Should only call this method in train program.
-
apply
(executor, need_restore=True) Apply moving average to parameters for evaluation.
- Parameters
executor (Executor) – The Executor to execute applying.
need_restore (bool) – Whether to restore parameters after applying.
-
restore
(executor) Restore parameters.
- Parameters
executor (Executor) – The Executor to execute restoring.
Ftrl¶
-
paddle.fluid.optimizer.
Ftrl
alias of
paddle.fluid.optimizer.FtrlOptimizer
FtrlOptimizer¶
-
class
paddle.fluid.optimizer.
FtrlOptimizer
(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, regularization=None, name=None)[source] FTRL (Follow The Regularized Leader) Optimizer.
The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf)
\[ \begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate * param}\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - accum^{-lr\_power}}{learning\_rate * param}\\ &x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align} \]- Parameters
learning_rate (float|Variable) – global learning rate.
l1 (float) – L1 regularization strength.
l2 (float) – L2 regularization strength.
lr_power (float) – Learning Rate Power.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
- Raises
ValueError
– If learning_rate, rho, epsilon, momentum are None.
Examples
import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    ftrl_optimizer = fluid.optimizer.Ftrl(learning_rate=0.1)
    ftrl_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
Notes
Currently, FtrlOptimizer doesn’t support sparse parameter optimization.
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
LambOptimizer¶
-
class
paddle.fluid.optimizer.
LambOptimizer
(learning_rate=0.001, lamb_weight_decay=0.01, beta1=0.9, beta2=0.999, epsilon=1e-06, regularization=None, exclude_from_weight_decay_fn=None, name=None)[source] LAMB (Layer-wise Adaptive Moments optimizer for Batching training) Optimizer.
LAMB Optimizer is designed to scale up the batch size of training without losing accuracy, which supports adaptive element-wise updating and accurate layer-wise correction. For more information, please refer to Large Batch Optimization for Deep Learning: Training BERT in 76 minutes .
The updating of parameters follows:
\[ \begin{align}\begin{aligned}m_t &= \beta_1 m_{t - 1}+ (1 - \beta_1)g_t \\\v_t &= \beta_2 v_{t - 1} + (1 - \beta_2)g_t^2 \\\r_t &= \frac{m_t}{\sqrt{v_t}+\epsilon} \\\w_t &= w_{t-1} -\eta_t \frac{\left \| w_{t-1}\right \|}{\left \| r_t + \lambda w_{t-1}\right \|} (r_t + \lambda w_{t-1})\end{aligned}\end{align} \]where \(m\) is the 1st moment, and \(v\) the 2nd moment, \(\eta\) the learning rate, \(\lambda\) the LAMB weight decay rate.
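A NumPy sketch of one LAMB step following the equations above; every name is local to this snippet, not a fluid variable.

import numpy as np

beta1, beta2, epsilon, eta, lamb_weight_decay = 0.9, 0.999, 1e-06, 0.001, 0.01
w = np.array([0.5, -0.3])
g = np.array([0.1, 0.2])
m, v = np.zeros_like(w), np.zeros_like(w)

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g * g
r = m / (np.sqrt(v) + epsilon)
update = r + lamb_weight_decay * w
# layer-wise trust ratio ||w|| / ||r + lambda * w||
w = w - eta * (np.linalg.norm(w) / np.linalg.norm(update)) * update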
- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
lamb_weight_decay (float) – The LAMB weight decay rate.
beta1 (float) – The exponential decay rate for the 1st moment estimates.
beta2 (float) – The exponential decay rate for the 2nd moment estimates.
epsilon (float) – A small float value for numerical stability.
regularization (Regularizer) – A Regularizer, such as fluid.regularizer.L1DecayRegularizer.
exclude_from_weight_decay_fn (function) – Exclude a parameter from weight decay when exclude_from_weight_decay_fn(parameter) returns true.
name (str|None) – An optional name prefix.
Examples
import paddle.fluid as fluid

data = fluid.layers.data(name='x', shape=[5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

def exclude_fn(param):
    return param.name.endswith('.b_0')

optimizer = fluid.optimizer.Lamb(learning_rate=0.002,
                                 exclude_from_weight_decay_fn=exclude_fn)
optimizer.minimize(cost)
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
LarsMomentum¶
-
paddle.fluid.optimizer.
LarsMomentum
alias of
paddle.fluid.optimizer.LarsMomentumOptimizer
LarsMomentumOptimizer¶
-
class
paddle.fluid.optimizer.
LarsMomentumOptimizer
(learning_rate, momentum, lars_coeff=0.001, lars_weight_decay=0.0005, regularization=None, name=None)[source] Momentum optimizer with LARS support
The update equations are as follows:
\[ \begin{align}\begin{aligned}& local\_learning\_rate = learning\_rate * lars\_coeff * \ \frac{||param||}{||gradient|| + lars\_weight\_decay * ||param||}\\& velocity = mu * velocity + local\_learning\_rate * (gradient + lars\_weight\_decay * param)\\& param = param - velocity\end{aligned}\end{align} \]- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
momentum (float) – momentum factor
lars_coeff (float) – defines how much we trust the layer to change its weights.
lars_weight_decay (float) – weight decay coefficient for decaying using LARS.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle.fluid as fluid

# `cost` is assumed to be a loss Variable built beforehand.
optimizer = fluid.optimizer.LarsMomentum(
    learning_rate=0.2, momentum=0.1, lars_weight_decay=0.001)
optimizer.minimize(cost)
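For intuition, the LARS update equations above can be sketched in NumPy (local names only, not the fluid op):

import numpy as np

learning_rate, mu = 0.2, 0.1
lars_coeff, lars_weight_decay = 0.001, 0.0005
param = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
velocity = np.zeros_like(param)

local_lr = learning_rate * lars_coeff * np.linalg.norm(param) / (
    np.linalg.norm(grad) + lars_weight_decay * np.linalg.norm(param))
velocity = mu * velocity + local_lr * (grad + lars_weight_decay * param)
param = param - velocity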
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer, including its learning rate decay state, in dygraph mode. Returns None.
- Parameters
stat_dict – the dict returned by the load_persistables method.
Examples:
See the example under AdagradOptimizer.load.
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines interface backward() and apply_gradients() into one.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables should be ignored.
grad_clip (GradClipBase|None) – Gradient clip strategy
- Returns
(optimize_ops, params_grads) which are, list of operators appended; and list of (param, grad) Variables pair for optimization.
- Return type
tuple
ModelAverage¶
-
class
paddle.fluid.optimizer.
ModelAverage
(average_window_rate, min_average_window=10000, max_average_window=10000, regularization=None, name=None)[source] Accumulate the average of parameters within sliding window. The average result will be saved in temporary variables which can be applied to parameter variables of current model by calling ‘apply()’ method. And the ‘restore()’ method is used to restore the parameter values of current model.
The size of average window is determined by average_window_rate, min_average_window, max_average_window and current update times.
- Parameters
average_window_rate – The rate of average window.
min_average_window – The minimum size of average window.
max_average_window – The maximum size of average window.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    # build net
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
    optimizer.minimize(loss)

    # build ModelAverage optimizer
    model_average = fluid.optimizer.ModelAverage(0.15,
                                                 min_average_window=10000,
                                                 max_average_window=20000)

    exe.run(startup_program)
    x = numpy.random.random(size=(10, 1)).astype('float32')
    outs = exe.run(program=train_program,
                   feed={'X': x},
                   fetch_list=[loss.name])

    # apply ModelAverage
    with model_average.apply(exe):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        exe.run(program=train_program,
                feed={'X': x},
                fetch_list=[loss.name])
-
apply
(executor, need_restore=True) Apply average values to parameters of current model.
- Parameters
executor (fluid.Executor) – current executor.
need_restore (bool) – If you finally need to do restore, set it to True. Default is True.
-
restore
(executor) Restore parameter values of current model.
- Parameters
executor (fluid.Executor) – current executor.
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
See the example under AdagradOptimizer.apply_gradients.
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
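A minimal sketch of using backward() and apply_optimize() separately, based on the documented signature (startup_program left as None, matching minimize()'s default); network() is the same hypothetical helper used in the apply_gradients example:
import paddle.fluid as fluid

loss = network()  # hypothetical helper returning a loss Variable
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# apply_optimize appends the optimization ops for the (param, grad) pairs
optimize_ops = optimizer.apply_optimize(loss, startup_program=None, params_grads=params_grads)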
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer with learning-rate decay in dygraph mode. Returns None.
- Parameters
stat_dict – the dict loaded by the load_persistables method.
Examples:
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)
        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    # train for a few batches, then save parameters and optimizer state
    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
            [x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array(
            [x[1] for x in data]).astype('int64').reshape(128, 1)
        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
            mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    # reload parameters and restore the optimizer, including its learning-rate decay state
    parameters, optimizers = fluid.dygraph.load_persistables("save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
    assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines the backward() and apply_gradients() interfaces into one call.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
grad_clip (GradClipBase|None) – gradient clipping strategy.
- Returns
(optimize_ops, params_grads): a list of operators appended to the current program and a list of (param, grad) Variable pairs for optimization.
- Return type
tuple
Momentum¶
-
paddle.fluid.optimizer.
Momentum
alias of
paddle.fluid.optimizer.MomentumOptimizer
MomentumOptimizer¶
-
class
paddle.fluid.optimizer.
MomentumOptimizer
(learning_rate, momentum, use_nesterov=False, regularization=None, name=None)[source] Simple Momentum optimizer with velocity state
This optimizer has a flag for Nesterov momentum.
The update equations are as follows:
\[ \begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align} \]- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
momentum (float) – momentum factor
use_nesterov (bool) – enables Nesterov momentum.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    moment_optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
    moment_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
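For intuition, a minimal NumPy sketch of a single momentum update step following the equations above (illustrative values only, not PaddlePaddle API):
import numpy as np

learning_rate, mu = 0.1, 0.9
param = np.array([1.0, 2.0])
velocity = np.zeros_like(param)
gradient = np.array([0.5, -0.5])

# velocity = mu * velocity + gradient
velocity = mu * velocity + gradient

use_nesterov = False
if use_nesterov:
    # param = param - (gradient + mu * velocity) * learning_rate
    param = param - (gradient + mu * velocity) * learning_rate
else:
    # param = param - learning_rate * velocity
    param = param - learning_rate * velocity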
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
import paddle.fluid as fluid loss = network() optimizer = fluid.optimizer.SGD(learning_rate=0.1) params_grads = optimizer.backward(loss) # you may append operations for params_grads here # ... optimizer.apply_gradients(params_grads)
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer with learning-rate decay in dygraph mode. Returns None.
- Parameters
stat_dict – the dict loaded by the load_persistables method.
Examples:
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)
        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    # train for a few batches, then save parameters and optimizer state
    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
            [x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array(
            [x[1] for x in data]).astype('int64').reshape(128, 1)
        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
            mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    # reload parameters and restore the optimizer, including its learning-rate decay state
    parameters, optimizers = fluid.dygraph.load_persistables("save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
    assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines the backward() and apply_gradients() interfaces into one call.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
grad_clip (GradClipBase|None) – gradient clipping strategy.
- Returns
(optimize_ops, params_grads): a list of operators appended to the current program and a list of (param, grad) Variable pairs for optimization.
- Return type
tuple
PipelineOptimizer¶
-
class
paddle.fluid.optimizer.
PipelineOptimizer
(optimizer, cut_list=None, place_list=None, concurrency_list=None, queue_size=30, sync_steps=1, start_cpu_core_id=0)[source] Pipeline Optimizer
Train with pipeline mode. The program will be split according to cut_list.
If the length of cut_list is k, the whole program (including the backward part) will be split into 2*k-1 sections, so place_list and concurrency_list must also have length 2*k-1. For example, cut_list=[[emb_x, emb_y], [loss]] (k=2) yields 3 sections, as in the example below.
Note: Although asynchronous mode is used in pipeline training for speed, the final performance depends heavily on the training progress of each pipeline; a synchronous mode may be tried in the future.
- Parameters
optimizer (Optimizer) – The base optimizer, such as SGD.
cut_list (list of Variable lists) – The cut variables of the main_program.
place_list (list of Place) – The place where each section will run.
concurrency_list (list of int) – The concurrency degree of each section.
queue_size (int) – Each section consumes scopes from its in-scope queue and produces scopes to its out-scope queue; this parameter specifies the size of these queues. [Optional. Default: 30].
sync_steps (int) – The synchronization steps between different cards. [Optional. Default: 1].
start_cpu_core_id (int) – Specifies the first CPU core id. [Optional. Default: 0].
Examples
import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = fluid.layers.data(name='x', shape=[1], dtype='int64', lod_level=0)
y = fluid.layers.data(name='y', shape=[1], dtype='int64', lod_level=0)
emb_x = layers.embedding(input=x, param_attr=fluid.ParamAttr(name="embx"),
                         size=[10, 2], is_sparse=False)
emb_y = layers.embedding(input=y, param_attr=fluid.ParamAttr(name="emby", learning_rate=0.9),
                         size=[10, 2], is_sparse=False)
concat = layers.concat([emb_x, emb_y], axis=1)
fc = layers.fc(input=concat, name="fc", size=1, num_flatten_dims=1, bias_attr=False)
loss = layers.reduce_mean(fc)

optimizer = fluid.optimizer.SGD(learning_rate=0.5)
optimizer = fluid.optimizer.PipelineOptimizer(
    optimizer,
    cut_list=[[emb_x, emb_y], [loss]],
    place_list=[fluid.CPUPlace(), fluid.CUDAPlace(0), fluid.CPUPlace()],
    concurrency_list=[1, 1, 4],
    queue_size=2,
    sync_steps=1)
optimizer.minimize(loss)

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

filelist = []  # you should set your own filelist, e.g. filelist = ["dataA.txt"]
batch_size = 32  # any positive batch size
dataset = fluid.DatasetFactory().create_dataset("FileInstantDataset")
dataset.set_use_var([x, y])
dataset.set_batch_size(batch_size)
dataset.set_filelist(filelist)

exe.train_from_dataset(
    fluid.default_main_program(),
    dataset,
    thread=2,
    debug=False,
    fetch_list=[],
    fetch_info=[],
    print_period=1)
RMSPropOptimizer¶
-
class
paddle.fluid.optimizer.
RMSPropOptimizer
(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, centered=False, regularization=None, name=None)[source] Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning-rate method, originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .
The original equation is as follows:
\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align} \]The first equation computes a moving average of the squared gradient for each weight; the gradient is then divided by \(\sqrt{r(w,t)}\).
In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:
\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]If centered is True:
\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\g(w, t) & = \rho g(w, t-1) + (1 - \rho)\nabla Q_{i}(w)\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]where \(\rho\) is a hyperparameter with typical values such as 0.9 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.
- Parameters
learning_rate (float) – global learning rate.
rho (float) – \(\rho\) in the equation; 0.95 by default.
epsilon (float) – \(\epsilon\) in the equation, a smoothing term to avoid division by zero; 1e-6 by default.
momentum (float) – \(\beta\) in the equation, the momentum term; 0.0 by default.
centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
- Raises
ValueError
– If learning_rate, rho, epsilon, or momentum is None.
Examples
import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    rms_optimizer = fluid.optimizer.RMSProp(learning_rate=0.1)
    rms_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
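For intuition, a minimal NumPy sketch of one non-centered RMSProp update with momentum, following the equations above (illustrative values only, not PaddlePaddle API):
import numpy as np

eta, rho, beta, eps = 0.01, 0.95, 0.9, 1e-6
w = np.array([1.0, 2.0])
grad = np.array([0.5, -0.5])
r = np.zeros_like(w)   # moving average of squared gradients
v = np.zeros_like(w)   # momentum accumulator

# r(w, t) = rho * r(w, t-1) + (1 - rho) * grad^2
r = rho * r + (1.0 - rho) * grad ** 2
# v(w, t) = beta * v(w, t-1) + eta / sqrt(r(w, t) + eps) * grad
v = beta * v + eta / np.sqrt(r + eps) * grad
# w = w - v(w, t)
w = w - v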
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
import paddle.fluid as fluid

loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer with learning-rate decay in dygraph mode. Returns None.
- Parameters
stat_dict – the dict loaded by the load_persistables method.
Examples:
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)
        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    # train for a few batches, then save parameters and optimizer state
    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
            [x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array(
            [x[1] for x in data]).astype('int64').reshape(128, 1)
        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
            mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    # reload parameters and restore the optimizer, including its learning-rate decay state
    parameters, optimizers = fluid.dygraph.load_persistables("save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
    assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines the backward() and apply_gradients() interfaces into one call.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
grad_clip (GradClipBase|None) – gradient clipping strategy.
- Returns
(optimize_ops, params_grads): a list of operators appended to the current program and a list of (param, grad) Variable pairs for optimization.
- Return type
tuple
SGD¶
-
paddle.fluid.optimizer.
SGD
alias of
paddle.fluid.optimizer.SGDOptimizer
SGDOptimizer¶
-
class
paddle.fluid.optimizer.
SGDOptimizer
(learning_rate, regularization=None, name=None)[source] Optimizer of the stochastic gradient descent algorithm.
\[param\_out = param - learning\_rate * grad\]- Parameters
learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name – An optional name prefix.
Examples
import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
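The update rule above amounts to a single vectorized step; a minimal NumPy illustration (illustrative values only, not PaddlePaddle API):
import numpy as np

learning_rate = 0.001
param = np.array([1.0, 2.0])
grad = np.array([0.3, -0.1])
# param_out = param - learning_rate * grad
param_out = param - learning_rate * grad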
-
apply_gradients
(params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
Examples
import paddle.fluid as fluid

loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
-
apply_optimize
(loss, startup_program, params_grads) Second part of minimize, appending optimization operators for given params_grads pairs.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
params_grads (list) – list of (param, grad) pair to do optimization.
- Returns
A list of operators appended to the current program.
- Return type
list
-
backward
(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None) First part of minimize, do auto-diff to append backward ops for the current program.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
- Returns
list of (param, grad) pair, grad is the output of backward.
- Return type
list
Examples
See examples in apply_gradients.
-
load
(stat_dict) Load an optimizer with learning-rate decay in dygraph mode. Returns None.
- Parameters
stat_dict – the dict loaded by the load_persistables method.
Examples:
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)
        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    # train for a few batches, then save parameters and optimizer state
    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
            [x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array(
            [x[1] for x in data]).astype('int64').reshape(128, 1)
        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
            mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    # reload parameters and restore the optimizer, including its learning-rate decay state
    parameters, optimizers = fluid.dygraph.load_persistables("save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
    assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
-
minimize
(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) Add operations to minimize loss by updating parameter_list.
This method combines the backward() and apply_gradients() interfaces into one call.
- Parameters
loss (Variable) – loss variable to run optimizations.
startup_program (Program) – startup_program for initializing parameters in parameter_list.
parameter_list (list) – list of Variables to update.
no_grad_set (set|None) – set of Variables that should be ignored.
grad_clip (GradClipBase|None) – gradient clipping strategy.
- Returns
(optimize_ops, params_grads): a list of operators appended to the current program and a list of (param, grad) Variable pairs for optimization.
- Return type
tuple