Optimization as a Model for Few-Shot Learning
Posted on 19/04/2019, in Paper.
- Overview: This paper trains an LSTM meta-learner network to replace the hand-designed optimization algorithm used to train a neural network, and applies this idea to few-shot learning.
- Meta-learning task: Instead of one single train/valid/test split, this paper deals with a meta-set consisting of many regular dataset splits (D_train_i, D_test_i), where each D_train_i can be as small as 5 labeled samples.
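The episodic data layout can be sketched as follows; the function name, the dict-based meta-set, and the default sizes are illustrative assumptions, not the paper's actual data pipeline:

```python
import random

def sample_episode(meta_set, n_way=5, k_shot=1, n_query=15):
    """Sample one few-shot episode: a tiny D_train_i and its D_test_i.

    meta_set: dict mapping class label -> list of examples.
    With n_way=5, k_shot=1, D_train_i has only 5 labeled samples.
    """
    classes = random.sample(list(meta_set.keys()), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(meta_set[cls], k_shot + n_query)
        d_train += [(x, label) for x in examples[:k_shot]]
        d_test += [(x, label) for x in examples[k_shot:]]
    return d_train, d_test
```

Repeatedly drawing such episodes from the meta-set is what turns "learning to optimize on 5 samples" into a trainable problem.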
- LSTM: By comparing the SGD gradient update with the LSTM cell-state update (governed by the input and forget gates), the authors propose that an LSTM can be viewed as a parametric optimization strategy, i.e. a meta-learner, with the cell state playing the role of the classifier parameters. The LSTM parameters are shared across all parameter coordinates.
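The analogy can be written down in a few lines. In the sketch below, the scalar gate pre-activations `w_f` and `w_i` are illustrative stand-ins for the LSTM's learned functions of the gradient, loss, and previous state; they are not the paper's actual parameterization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_style_update(theta, grad, w_f, w_i):
    """One update in the paper's analogy:
        c_t = f_t * c_{t-1} + i_t * (-grad),
    where the cell state c_t plays the role of the parameter theta,
    the forget gate f_t acts like a weight-decay factor, and the
    input gate i_t acts like a learned, state-dependent learning rate.
    """
    f_t = sigmoid(w_f)  # f_t close to 1 keeps the old parameter value
    i_t = sigmoid(w_i)  # i_t plays the role of the step size
    return f_t * theta + i_t * (-grad)
```

With f_t → 1 and i_t = alpha fixed, this recovers plain SGD, theta - alpha * grad; the meta-learner instead learns both gates as functions of the optimization state.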
- Meta-learning Training: There are two sets of parameters: those of the classifier and those of the meta-learner. The former are updated during training on D_train_i for T epochs, and then the latter are updated with the loss of the adapted classifier on the corresponding D_test_i. Notice that whenever another (D_train_i, D_test_i) pair is drawn from the meta-set, the classifier parameters are re-initialized. Also notice that batch normalization needs to be set up carefully to avoid information leakage across episodes.
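The two-level training loop above can be sketched with a toy scalar problem; the class and function names are illustrative, and a fixed learned step size stands in for the full LSTM meta-learner:

```python
class ToyMetaLearner:
    """Stand-in for the LSTM meta-learner (illustrative, not the paper's
    architecture): it carries a learned initialization c_0 and a learned
    step size, and emits updates theta <- theta - step * grad."""
    def __init__(self, init=0.0, step=0.2):
        self.init, self.step = init, step

    def initial_params(self):
        # The classifier is re-initialized from c_0 for every new episode.
        return self.init

    def update(self, theta, grad):
        return theta - self.step * grad

def run_episode(meta, d_train, d_test, T=5):
    """One episode: adapt theta on D_train_i for T steps with the
    meta-learner's update rule, then evaluate on D_test_i. A scalar
    least-squares fit stands in for the real classifier."""
    theta = meta.initial_params()
    for _ in range(T):
        grad = sum(2 * (theta - y) for y in d_train) / len(d_train)
        theta = meta.update(theta, grad)
    test_loss = sum((theta - y) ** 2 for y in d_test) / len(d_test)
    return theta, test_loss
```

In the actual method, `test_loss` is then backpropagated through the whole adaptation trajectory to update the meta-learner's own parameters (initialization and gates), which is omitted here.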
- Result: The model achieves SOTA on mini-ImageNet. Visualization of the gate values indicates the model adopts different learning strategies for different datasets and different meta-learning policies for different tasks (1-shot vs. 5-shot).
- Literature review: This paper provides several good pointers to early meta-learning literature (before 1995) in section 4.1.
Some meta-learning work this paper cites several times:
- Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016. URL http://arxiv.org/abs/1606.04474.