Optimization as a Model for Few-Shot Learning

Posted on 19/04/2019, in Paper.
  • Overview: This paper trains an LSTM meta-learner network to replace the standard optimization algorithm in deep learning, and applies this idea to few-shot meta-learning.
  • Meta-learning task: Instead of a single train/validation/test split, this paper works with a meta-set consisting of multiple regular dataset splits, each with its own D_train and D_test. Each D_train may contain as few as 5 labeled examples.
  • LSTM: By comparing the SGD gradient update with the LSTM cell-state update (input/forget gates), the authors propose that an LSTM module can be viewed as a parametric optimization strategy, i.e. a meta-learner. The LSTM parameters are shared across all coordinates.
  • Meta-learning training: There are two sets of parameters: those of the classifier and those of the meta-learner. The former are updated during training on D_train_i for T epochs, and then the latter are updated with the loss on the corresponding D_test_i. Note that whenever another pair D_train_i, D_test_i is drawn from the meta-set, the classifier's parameters are re-initialized. Batch normalization also needs to be set up carefully to avoid information leakage.
  • Result: The model achieves SOTA on mini-ImageNet. Visualization of the gate values indicates the model learns different strategies for different datasets and adopts different meta-learning policies for different tasks (1-shot vs. 5-shot).
  • Literature review: This paper provides several good pointers to early meta-learning literature (before 1995) in Section 4.1.
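The LSTM/SGD analogy above can be made concrete with a minimal sketch (pure Python, scalar case): the LSTM cell-state update reduces to an SGD step for particular gate values. The function names here are illustrative, not from the paper.

```python
# A minimal sketch of the paper's core analogy: with f_t = 1, i_t = lr,
# and the candidate cell state set to the negative gradient, the LSTM
# cell-state update coincides with a plain gradient-descent step.
# (Names lstm_cell_update / sgd_step are illustrative assumptions.)

def lstm_cell_update(c_prev, c_tilde, f_gate, i_gate):
    """Standard LSTM cell-state update: c_t = f_t * c_{t-1} + i_t * c~_t."""
    return f_gate * c_prev + i_gate * c_tilde

def sgd_step(theta_prev, g, lr):
    """Plain gradient-descent step: theta_t = theta_{t-1} - lr * g."""
    return theta_prev - lr * g

theta, g, lr = 0.8, 0.3, 0.1
# The two updates agree for this gate setting; letting the LSTM *learn*
# its gates instead yields a parametric optimizer (the meta-learner).
assert abs(lstm_cell_update(theta, -g, f_gate=1.0, i_gate=lr)
           - sgd_step(theta, g, lr)) < 1e-12
```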
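The episodic training procedure above can also be sketched as a toy loop: the learner's parameters are re-initialized at the start of every episode, trained for T inner steps on that episode's D_train_i, and the meta-learner is then updated from the loss on D_test_i. Here the "meta-learner" is just a learnable log step size standing in for the paper's LSTM, and the task is a 1-D quadratic; all names and the toy task are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def loss(theta, target):
    # Toy per-task objective: a 1-D quadratic with a task-specific minimum.
    return (theta - target) ** 2

def grad(theta, target):
    return 2.0 * (theta - target)

def inner_loop(target, lr, T):
    """Train the learner from a fresh initialization for T steps (D_train_i)."""
    theta = 0.0  # classifier parameters are re-initialized every episode
    for _ in range(T):
        theta -= lr * grad(theta, target)
    return theta

def meta_train(num_episodes=200, T=5, meta_lr=0.01, seed=0):
    rng = random.Random(seed)
    log_lr = -2.0  # meta-parameter (kept in log space so lr stays positive)
    for _ in range(num_episodes):
        target = rng.uniform(-1.0, 1.0)  # one task drawn from the meta-set
        theta = inner_loop(target, math.exp(log_lr), T)
        # Meta-update from the D_test_i loss, via a finite-difference
        # gradient with respect to the meta-parameter.
        eps = 1e-4
        theta_eps = inner_loop(target, math.exp(log_lr + eps), T)
        meta_grad = (loss(theta_eps, target) - loss(theta, target)) / eps
        log_lr -= meta_lr * meta_grad
    return math.exp(log_lr)
```

On this toy task the meta-update gradually increases the step size, since larger steps (up to a point) reduce the D_test_i loss after T inner steps; the paper's LSTM plays the same role but with far more expressive, input-dependent updates.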

Some meta-learning works this paper cites multiple times:

  • Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016.