
I think the Adam optimizer is designed such that it automatically adjusts the learning rate. But there is also an option to explicitly specify the decay in the Adam parameter options in Keras. I want to clarify the effect of decay on the Adam optimizer in Keras. If we compile the model with decay = 0.01 on lr = 0.001, and then fit the model for 50 epochs, does the learning rate get reduced by a factor of 0.01 after each epoch?

Is there any way to specify that the learning rate should decay only after running for a certain number of epochs?

In PyTorch there is a different implementation called AdamW, which is not present in the standard Keras library. Is this the same as varying the decay after every epoch, as mentioned above?

Thanks in advance for the reply.

2 Answers


From the source code, decay adjusts lr per iteration according to

lr = lr * (1. / (1. + decay * iterations))  # simplified

This is epoch-independent. iterations is incremented by 1 on each batch fit (e.g. each time train_on_batch is called, or for however many batches are in x for model.fit(x) - usually len(x) // batch_size batches).
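
To see what this means in practice, here is a quick back-of-the-envelope sketch (the dataset size and batch size are assumed purely for illustration) of where lr ends up after 50 epochs with lr=0.001, decay=0.01:

lr, decay = 0.001, 0.01
batches_per_epoch = 1000 // 32        # len(x) // batch_size, assuming 1000 samples, batch_size=32
iterations = 50 * batches_per_epoch   # 1550 weight updates after 50 epochs

effective_lr = lr * (1. / (1. + decay * iterations))
print(effective_lr)  # ~6.1e-05 - the lr has shrunk roughly 16x, and it shrinks per iteration, not per epoch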

To implement what you've described, you can use a callback as below:

from keras.callbacks import LearningRateScheduler
def decay_schedule(epoch, lr):
    # multiply lr by 0.1 every 5 epochs; change `% 5` to `% 1` to decay after every epoch
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr

lr_scheduler = LearningRateScheduler(decay_schedule)
model.fit(x, y, epochs=50, callbacks=[lr_scheduler])

The LearningRateScheduler takes a function as an argument, and that function is fed the epoch index and lr at the beginning of each epoch by .fit. It then updates lr according to the function - so on the next epoch, the function is fed the updated lr.
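
For completeness, here is a minimal end-to-end sketch (toy model and random data, purely illustrative) showing where the scheduler sits relative to model.compile():

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.callbacks import LearningRateScheduler

# Toy data and model, just to make the example self-contained
x = np.random.rand(256, 10)
y = np.random.rand(256, 1)
model = Sequential([Dense(16, activation='relu', input_shape=(10,)), Dense(1)])

def decay_schedule(epoch, lr):
    if (epoch % 5 == 0) and (epoch != 0):
        lr = lr * 0.1
    return lr

# The optimizer (and its initial lr) is set in compile();
# the scheduler then overrides that lr at the start of each epoch
model.compile(optimizer=Adam(lr=0.001), loss='mse')
model.fit(x, y, epochs=50, callbacks=[LearningRateScheduler(decay_schedule)])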

Also, there is a Keras implementation of AdamW, NadamW, and SGDW, by me - Keras AdamW.



Clarification: the very first call to .fit() invokes on_epoch_begin with epoch = 0 - if we don't wish lr to be decayed immediately, we should add an epoch != 0 check in decay_schedule. Then, epoch denotes how many epochs have already passed - so when epoch = 5, the decay is applied.
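
On the AdamW part of the question: AdamW does not vary the lr per epoch - it decouples weight decay from the gradient-based update. A conceptual sketch of a single AdamW-style step (plain NumPy, simplified, and not the actual code of the repo above; names like wd and beta1 are illustrative):

import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # Standard Adam moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: subtracted directly from the weights,
    # not folded into the gradient - this is what distinguishes AdamW
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v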

  • Just to clarify again: if I use the standard Adam optimizer in Keras, Adam(lr=xx, decay=yy), does the lr now reduce after each batch and after each epoch?
    – Arjun
    Feb 2, 2020 at 23:04
  • Also what is the difference between this method and AdamW?
    – Arjun
    Feb 2, 2020 at 23:04
  • @Arjun AdamW only concerns itself with weight decay - whereas AdamWR uses cyclic learning rates; see my repo's README for a concise overview of both. You may also find this thread useful. As for decay, in general I advise against it, as most of training is simply spent with a very small fraction of the original lr, eventually decaying entirely to zero.
    Feb 2, 2020 at 23:49
  • @Arjun Since decay is independent of epoch - yes, it'll apply both at epoch end and at batch-fit end, since "epoch end" happens at a "batch end". (But no, it doesn't "stack", i.e. happen twice on epoch end.)
    Feb 2, 2020 at 23:51
  • To me, this answer, like similar others, has a major drawback: where and how should we specify the optimizer inside the model's .compile() method? In your example above you specify the LearningRateScheduler, which is fine, and the model.fit(). But where is the model.compile() statement with the initialization of the Adam optimizer? Explicitly using only the code above won't start the training process.
    – NikSp
    Dec 29, 2020 at 11:17

Internally, the learning rate decays after each batch, not after each epoch as is commonly believed.

You can read more about it here: https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/

However, you can also implement your own learning rate scheduler via a custom callback:

    import tensorflow

    def learning_rate_scheduler(epoch, lr):
        # Say you want to divide the lr by 5 every 10 epochs
        # (epoch + 1) since epoch counting starts from 0
        if (epoch + 1) % 10 == 0:
            lr = lr / 5
        return lr

    callbacks = [
        tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)
    ]

    model.fit(..., callbacks=callbacks, ...)

The above method works for all types of optimizers, not only Adam.
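
For instance, a minimal sketch (toy model and random data, names assumed purely for illustration) of the same schedule driving SGD instead of Adam:

    import numpy as np
    import tensorflow

    # Toy data and model purely for illustration
    x = np.random.rand(128, 4)
    y = np.random.rand(128, 1)
    model = tensorflow.keras.Sequential([tensorflow.keras.layers.Dense(1, input_shape=(4,))])

    def learning_rate_scheduler(epoch, lr):
        # Same schedule as above: divide the lr by 5 every 10 epochs
        if (epoch + 1) % 10 == 0:
            lr = lr / 5
        return lr

    # The same callback now drives SGD's learning rate instead of Adam's
    model.compile(optimizer=tensorflow.keras.optimizers.SGD(learning_rate=0.01), loss='mse')
    model.fit(x, y, epochs=20, callbacks=[
        tensorflow.keras.callbacks.LearningRateScheduler(learning_rate_scheduler, verbose=1)
    ])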

  • Actually, never mind the latter half of my comment - it only applied to the old Keras API; from the source code, the callback does indeed apply recursively, so your original, except the conditional check, was fine - also updated my answer.
    Feb 2, 2020 at 19:00
  • Pardon the mishaps - epoch + 1 doesn't quite work either; to avoid an overly complicated expression, I just coded the condition explicitly.
    Feb 2, 2020 at 19:16
  • Yes, I thought that maybe I didn't remember it well and re-updated it; thanks for pointing it out again.
    Feb 2, 2020 at 19:16
