AdamW SGD optimization
adamW(
stepsize = 0.05,
beta1 = 0.9,
beta2 = 0.999,
lambda = 0.01,
epsilon = 1e-08
)
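stepsize: step size (learning rate) for SGD
beta1: decay rate of the first-moment (mean) estimate in AdamW
beta2: decay rate of the second-moment (squared-gradient) estimate in AdamW
lambda: weight decay coefficient for AdamW
epsilon: small constant added to the denominator for numerical stability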
a list of control variables for optimization (used in the control_opt function)
The update rule for AdamW is:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = m_t / (1 - \beta_1^t)$$
$$\hat{v}_t = v_t / (1 - \beta_2^t)$$
$$x_{t+1} = x_t - \text{stepsize} \cdot \left( \lambda x_t + \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \right)$$
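As an illustration only, the update rule above can be written out as a single step in plain R. The helper name `adamw_step` and the toy quadratic objective are assumptions made for this sketch. They are not part of the package, and the package's internal implementation may differ. The sketch takes g_t to be the gradient at x_t and applies the square and the division element-wise.

```r
# Hypothetical helper (not part of the package): one AdamW step that
# follows the update rule term by term.
adamw_step <- function(x, g, m, v, t,
                       stepsize = 0.05, beta1 = 0.9, beta2 = 0.999,
                       lambda = 0.01, epsilon = 1e-08) {
  m <- beta1 * m + (1 - beta1) * g      # first-moment estimate m_t
  v <- beta2 * v + (1 - beta2) * g^2    # second-moment estimate v_t (element-wise square)
  m_hat <- m / (1 - beta1^t)            # bias-corrected first moment
  v_hat <- v / (1 - beta2^t)            # bias-corrected second moment
  # Weight decay (lambda * x) is added next to the adaptive term,
  # not folded into the gradient g.
  x <- x - stepsize * (lambda * x + m_hat / (sqrt(v_hat) + epsilon))
  list(x = x, m = m, v = v)
}

# Toy problem (assumed for illustration): minimise sum((x - 3)^2),
# whose gradient is 2 * (x - 3).
x <- c(0, 10)
m <- numeric(length(x))
v <- numeric(length(x))
for (t in 1:2000) {
  g <- 2 * (x - 3)
  s <- adamw_step(x, g, m, v, t)
  x <- s$x
  m <- s$m
  v <- s$v
}
print(x)
```

Because the decay term lambda * x_t sits outside the ratio of moment estimates, it shrinks every coordinate by the same relative amount regardless of the gradient scale. On the toy problem this pulls the iterates slightly toward zero rather than exactly onto the minimiser.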