Large-Scale Long-Tailed Recognition in an Open World

Posted on 21/05/2019, in Paper.
  • Overview: This paper formulates the task of Open Long-Tailed Recognition (OLTR), which combines many-shot, medium-shot, few-shot, and open-set classification into one. The task demands recognition robustness and open-set sensitivity at the same time, which are to some extent competing goals. The paper proposes a unified solution that leverages self-attention, a memorized meta feature, a cosine classifier, and a large-margin loss.

  • OLTR: There have been several image classification tasks spanning the spectrum: imbalanced classification, few-shot learning, and open-set recognition, each of them focusing on one aspect of the classification problem, i.e. one particular range of training-sample frequency per class. The authors propose the Open Long-Tailed Recognition (OLTR) task, where one integrated algorithm is asked to solve them all at once, dealing with classes that have no training samples (zero-shot/open-set), around twenty (few-shot), all the way to hundreds or more (many-shot).
  • Dynamic Meta-Embedding: The first important module proposed is the dynamic meta-embedding. When an incoming image is from a new category, or a category with few samples, the naively trained encoder usually does not extract good enough features ($v^{direct}$). One way to fix this is to augment the feature vector with the encoding of other concepts. Here the class centroids ($M = \{c_i\}_{i=1}^K$) are used as the memory embedding. A concept selector and a hallucinator re-weight the memory embedding across classes as well as across the dimensions of $v^{direct}$: \begin{equation} v^{reweighted} = v^{direct} + \tanh(T_{sel}(v^{direct})) \otimes T_{hal}(v^{direct})^T M \end{equation} We can further leverage the memory by scaling the embedding with the distance to the nearest centroid: \begin{equation} v^{meta} = \frac{1}{\gamma} v^{reweighted}, \, \gamma := \min_i ||v^{direct} - c_i||_2 \end{equation} Therefore, if the direct embedding is far away from every class centroid, $v^{meta}$ will be very small, and in the extreme case nothing is passed to the classifier; this gives the downstream module a chance to switch between few-shot and zero-shot (open-set) behavior.
  • Modulated Attention: To improve $v^{meta}$, we can plug self-attention in on top of the generated features: \begin{equation} f^{att} = f + MA(f) \otimes SA(f) \end{equation} where $MA(\cdot)$ and $SA(\cdot)$ are not explained in detail but are inherited from the Attention Is All You Need paper.
  • Classification: This paper chooses cosine similarity as the classifier to mitigate1 the huge differences in sample numbers across classes. It normalizes $v^{meta}$ to be slightly shorter than 1 and the weights to be of unit norm before calculating the dot product. In most of the experiments, if the maximum softmax score is less than 0.1 the image is flagged as novel.
  • Result: The authors constructed three long-tailed datasets by sampling from existing datasets according to a Pareto distribution. The model achieves consistently better performance across all regions of OLTR. In the ablation study, the dynamic meta-embedding appears to be the biggest driver.
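The dynamic meta-embedding above can be sketched in PyTorch. This is a minimal sketch under my own assumptions: the module names (`t_sel`, `t_hal`), the softmax over centroid coefficients, and the fixed centroid buffer are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMetaEmbedding(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # Memory M: one centroid per class; held as a fixed buffer here,
        # whereas the paper computes them from the training features.
        self.register_buffer("centroids", torch.zeros(num_classes, feat_dim))
        # T_sel: concept selector over the dimensions of the memory feature.
        self.t_sel = nn.Linear(feat_dim, feat_dim)
        # T_hal: hallucinator producing coefficients over the K centroids.
        self.t_hal = nn.Linear(feat_dim, num_classes)

    def forward(self, v_direct: torch.Tensor) -> torch.Tensor:
        # Coefficients over centroids: (B, K); softmax is an assumption.
        coeff = F.softmax(self.t_hal(v_direct), dim=1)
        # Memory feature: weighted combination of class centroids, (B, D).
        v_memory = coeff @ self.centroids
        # v^reweighted = v^direct + tanh(T_sel(v)) ⊗ (memory feature).
        v_reweighted = v_direct + torch.tanh(self.t_sel(v_direct)) * v_memory
        # gamma: distance to the nearest centroid, so far-away (open-set)
        # samples are scaled down toward zero.
        gamma = torch.cdist(v_direct, self.centroids).min(dim=1).values
        return v_reweighted / gamma.clamp_min(1e-6).unsqueeze(1)
```

Note how the `1/gamma` scaling directly implements the few-shot/zero-shot switch: a sample near a known centroid keeps a sizable embedding, while a truly novel sample is pushed toward the zero vector.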
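The modulated attention $f^{att} = f + MA(f) \otimes SA(f)$ can be sketched as follows. Since the post notes the paper does not spell these operators out, both choices here are my guesses: $SA$ is written as an embedded-Gaussian non-local block, and $MA$ as a simple sigmoid spatial gate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)  # query
        self.phi = nn.Conv2d(channels, channels // 2, 1)    # key
        self.g = nn.Conv2d(channels, channels // 2, 1)      # value
        self.out = nn.Conv2d(channels // 2, channels, 1)    # back to C channels
        self.gate = nn.Conv2d(channels, 1, 1)               # MA: spatial gate

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        q = self.theta(f).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(f).flatten(2)                     # (B, C/2, HW)
        v = self.g(f).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn = F.softmax(q @ k, dim=-1)                # (B, HW, HW)
        sa = self.out((attn @ v).transpose(1, 2).reshape(b, c // 2, h, w))
        ma = torch.sigmoid(self.gate(f))               # (B, 1, H, W)
        return f + ma * sa                             # f + MA(f) ⊗ SA(f)
```

The residual form means the block can fall back to the plain feature map when the gate closes, which keeps it safe to drop into a pretrained backbone.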
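The cosine classifier and the novelty threshold can be sketched together. The squashing of the input to a norm slightly below 1 and the 0.1 rejection threshold follow the description above; the temperature `tau` and the weight initialization are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, tau: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.tau = tau  # assumed temperature for the scaled cosine logits

    def forward(self, v_meta: torch.Tensor) -> torch.Tensor:
        # Squash the input norm to just under 1: (||v||/(1+||v||)) * v/||v||.
        norm = v_meta.norm(dim=1, keepdim=True)
        v = (norm / (1.0 + norm)) * (v_meta / norm.clamp_min(1e-6))
        w = F.normalize(self.weight, dim=1)   # unit-norm class weights
        return self.tau * (v @ w.t())         # cosine-similarity logits

def is_novel(logits: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    # Flag a sample as open-set when its max softmax score is below threshold.
    return F.softmax(logits, dim=1).max(dim=1).values < threshold
```

Because both sides of the dot product are norm-bounded, head classes cannot dominate simply by growing large weight norms, which is the point of using cosine similarity under class imbalance.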

1: See Dynamic Few-Shot Visual Learning without Forgetting for reference.

There is a lot of engineering in the paper, but the ambition of unifying classification tasks is inspiring. It leverages many recent techniques that are worth reading:

  • Attention Is All You Need
  • Large-Margin Softmax Loss for Convolutional Neural Networks
  • Non-local Neural Networks
  • Dynamic Few-Shot Visual Learning without Forgetting