AdamW SGD optimization

adamW(
  stepsize = 0.05,
  beta1 = 0.9,
  beta2 = 0.999,
  lambda = 0.01,
  epsilon = 1e-08
)

Arguments

stepsize

Step size (learning rate) that scales each parameter update.

beta1

Exponential decay rate for the first-moment (mean) estimate of the gradient.

beta2

Exponential decay rate for the second-moment (squared gradient) estimate.

lambda

Weight decay coefficient, applied directly to the parameters rather than added to the gradient.

epsilon

Small constant added to the denominator of the update for numerical stability.

Value

A list of control variables for optimization, used by the control_opt function.

Details

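The update rule for AdamW is:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = m_t / (1 - \beta_1^t)$$
$$\hat{v}_t = v_t / (1 - \beta_2^t)$$
$$x_{t+1} = x_t - \text{stepsize} \cdot \left( \lambda x_t + \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \right)$$

Here \(g_t\) is the gradient at iteration \(t\) and \(g_t^2\) is its elementwise square, \(m_t\) and \(v_t\) are the running first- and second-moment estimates, \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected versions, and \(x_t\) is the parameter vector.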
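As an illustration only, the AdamW update rule can be sketched in a few lines of base R. The helper adamw_step, the state list, and the toy quadratic objective are assumptions made for this example; they are not part of the package and may differ from its internal implementation. The moment estimates are assumed to start at zero, and the defaults are those shown in the usage above.

```r
# Hypothetical helper (not part of the package): one AdamW update.
# x: parameter vector, g: gradient at x,
# state: list with running moments m, v and iteration counter t.
adamw_step <- function(x, g, state, stepsize = 0.05, beta1 = 0.9,
                       beta2 = 0.999, lambda = 0.01, epsilon = 1e-08) {
  t <- state$t + 1
  # Running first- and second-moment estimates
  m <- beta1 * state$m + (1 - beta1) * g
  v <- beta2 * state$v + (1 - beta2) * g^2
  # Bias correction
  m_hat <- m / (1 - beta1^t)
  v_hat <- v / (1 - beta2^t)
  # Parameter update with decoupled weight decay (lambda * x)
  x_new <- x - stepsize * (lambda * x + m_hat / (sqrt(v_hat) + epsilon))
  list(x = x_new, state = list(m = m, v = v, t = t))
}

# Toy problem (assumption): minimize sum((x - target)^2),
# whose gradient is 2 * (x - target).
target <- c(1, -2)
x <- c(0, 0)
state <- list(m = c(0, 0), v = c(0, 0), t = 0)
for (i in 1:500) {
  g <- 2 * (x - target)
  out <- adamw_step(x, g, state)
  x <- out$x
  state <- out$state
}
print(x)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly stepsize. The weight decay term pulls the iterates slightly toward zero, so the final x is expected to be near target, not exactly equal to it.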