Overview: This paper proposes a model-agnostic meta-learning (MAML) procedure that sits on top of any model trained with gradient descent (supervised learning, RL, etc.).

Few-shot learning: The goal of the few-shot learning problem is framed as: a model that can quickly adapt to a new task using only a few data points and training iterations.

MAML algorithm: Instead of looking for an architecture or policy that is more transferable than others (as was done in Optimization as a Model for Few-Shot Learning), the algorithm attacks the problem explicitly: the meta-optimizer moves $\theta$ in the direction that minimizes the expected loss after one gradient descent step (with step size $\alpha$) on a sampled task. The induced objective function can be written as:
\begin{equation}
\min_{\theta} \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta'}) = \min_{\theta} \sum_{T_i \sim p(T)} L_{T_i}\left(f_{\theta-\alpha \nabla_\theta L_{T_i}(f_{\theta})}\right)
\end{equation}
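
As a rough illustration of this objective, here is a minimal sketch on toy scalar tasks, assuming each task has a quadratic loss $L_i(\theta) = \frac{1}{2} a_i (\theta - c_i)^2$ so that the gradient and Hessian are known in closed form. The task family, the outer step size `beta`, and all function names are assumptions made for this example, not taken from the paper; the paper also samples tasks and uses fresh data for the outer loss, which this sketch skips.

```python
# Toy MAML on scalar quadratic tasks: L_i(theta) = 0.5 * a_i * (theta - c_i)**2.
# Everything here (task family, names, step sizes) is illustrative only.
from typing import List, Tuple

Task = Tuple[float, float]  # (curvature a_i, task optimum c_i)


def loss(theta: float, task: Task) -> float:
    a, c = task
    return 0.5 * a * (theta - c) ** 2


def grad(theta: float, task: Task) -> float:
    a, c = task
    return a * (theta - c)


def inner_update(theta: float, task: Task, alpha: float) -> float:
    # theta' = theta - alpha * grad L_i(theta): one adaptation step on the task.
    return theta - alpha * grad(theta, task)


def meta_objective(theta: float, tasks: List[Task], alpha: float) -> float:
    # Sum over tasks of the loss evaluated AFTER the inner update.
    return sum(loss(inner_update(theta, t, alpha), t) for t in tasks)


def meta_grad(theta: float, tasks: List[Task], alpha: float) -> float:
    total = 0.0
    for task in tasks:
        a, _ = task
        theta_prime = inner_update(theta, task, alpha)
        # Chain rule: d theta' / d theta = 1 - alpha * Hessian = 1 - alpha * a.
        total += (1.0 - alpha * a) * grad(theta_prime, task)
    return total


def meta_train(tasks: List[Task], alpha: float = 0.1, beta: float = 0.05,
               steps: int = 500, theta: float = 0.0) -> float:
    # Outer loop: plain gradient descent on the meta-objective.
    for _ in range(steps):
        theta -= beta * meta_grad(theta, tasks, alpha)
    return theta


TASKS: List[Task] = [(1.0, -2.0), (4.0, 3.0), (0.5, 1.0)]
ALPHA = 0.1
theta_star = meta_train(TASKS, alpha=ALPHA)
print(theta_star)
```

In this toy setting the meta-optimum is a weighted average of the task optima with weights $a_i (1 - \alpha a_i)^2$, which differs from the weights $a_i$ that plain joint training would give; the initialization is chosen for where it lands after adaptation, not for where it starts.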

Hessian matrix: Because the meta-gradient differentiates through the inner gradient descent step, it involves second derivatives (Hessian terms) of the task loss, which are computationally expensive. The authors report that a first-order approximation, which drops these second-derivative terms, gives nearly the same performance with roughly a 33% speed-up.
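
To see where the Hessian comes from, one can apply the chain rule to the post-update loss. The following is a sketch for a single task and a single inner step, writing $\theta' = \theta - \alpha \nabla_\theta L_{T_i}(f_\theta)$; the identity matrix $I$ is notation introduced here for the derivation.
\begin{equation}
\nabla_\theta L_{T_i}(f_{\theta'}) = \left(I - \alpha \nabla^2_\theta L_{T_i}(f_{\theta})\right) \nabla_{\theta'} L_{T_i}(f_{\theta'}) \approx \nabla_{\theta'} L_{T_i}(f_{\theta'})
\end{equation}
The approximation on the right treats the Jacobian of the inner update as the identity, so only an ordinary gradient evaluated at $\theta'$ remains.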

RL: The algorithm also applies to RL models trained with policy gradients. I am not that into this part.

Result: For few-shot image classification (Omniglot and MiniImagenet), the paper reports state-of-the-art results at the time of publication.

Not quite sure I understand this statement in section 2.2: "For simplicity of notation, we will consider one gradient update for the rest of this section, but using multiple gradient updates is a straightforward extension."

Updated 0918: I understand the above point now. Computationally it will be much more expensive though. I also need to catch up on RL to understand section 5.3.
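
To make the multi-step point concrete, here is a small sketch of an inner loop with several gradient updates on a single toy task, assuming a scalar quadratic loss $L(\theta) = \frac{1}{2} a (\theta - c)^2$ so the Hessian is just `a`. The function names and the toy loss are assumptions for illustration, not the paper's implementation.

```python
# Multi-step inner loop on one scalar quadratic task:
# L(theta) = 0.5 * a * (theta - c)**2, so grad = a * (theta - c), Hessian = a.
# Hypothetical helper names; a sketch only.
from typing import Tuple


def adapt(theta: float, a: float, c: float, alpha: float,
          num_steps: int) -> Tuple[float, float]:
    """Run num_steps inner updates; return (theta_K, d theta_K / d theta)."""
    jac = 1.0
    for _ in range(num_steps):
        theta = theta - alpha * a * (theta - c)
        # Each inner step contributes one more (1 - alpha * Hessian) factor.
        jac *= 1.0 - alpha * a
    return theta, jac


def post_update_loss(theta: float, a: float, c: float, alpha: float,
                     num_steps: int) -> float:
    theta_k, _ = adapt(theta, a, c, alpha, num_steps)
    return 0.5 * a * (theta_k - c) ** 2


def multi_step_meta_grad(theta: float, a: float, c: float, alpha: float,
                         num_steps: int) -> float:
    # Gradient of the post-update loss with respect to the initial theta.
    theta_k, jac = adapt(theta, a, c, alpha, num_steps)
    return jac * a * (theta_k - c)


print(multi_step_meta_grad(0.0, a=2.0, c=1.0, alpha=0.1, num_steps=3))
```

In a real network each of those factors is a Hessian-vector product through one inner step, which is why the cost grows with the number of inner updates.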