### Large-Scale Long-Tailed Recognition in an Open World

Posted on 21/05/2019, in Paper. **Overview**: This paper formulates the task of Open Long-Tailed Recognition (OLTR), which combines many-shot, medium-shot, few-shot and open-set classification into one problem. It demands recognition robustness and open-set sensitivity at the same time, two goals that are to some extent in competition. The paper proposes a unified solution that leverages self-attention, a memorized meta feature, a cosine classifier and a large-margin loss.

**OLTR**: Several image classification tasks span the spectrum of per-class sample frequency: imbalanced classification, few-shot learning, and open-set recognition. Each of them focuses on solving one aspect of the classification problem, i.e. one particular range of train/test samples per class. The authors propose the Open Long-Tailed Recognition (OLTR) task, where one integrated algorithm is asked to solve them all at once, dealing with classes that have no training data (zero-shot/open-set), around 20 samples (few-shot), all the way up to hundreds or more (many-shot).

**Dynamic Meta-Embedding**: The first important module proposed is the dynamic meta-embedding. When an incoming image comes from a new category, or from a category with few samples, the naively trained encoder usually does not extract good enough features ($v^{direct}$). One way to fix this is to augment the feature vector with the encodings of other concepts. Here the class centroids ($M = \{c_i\}_{i=1}^{K}$) are used as the memory embedding. A `Concept Selector` and a `Hallucinator` re-weight the memory embedding across classes as well as across the dimensions of $v^{direct}$:

\begin{equation} v^{reweighted} = v^{direct} + \tanh(T_{sel}(v^{direct})) \otimes T_{hal}(v^{direct})^T M \end{equation}

We can further leverage the memory by scaling the embedding with the distance to the nearest centroid:

\begin{equation} v^{meta} = \frac{1}{\gamma} v^{reweighted}, \quad \gamma := \min_i ||v^{direct} - c_i||_2 \end{equation}

Therefore, if the direct embedding is far away from every class centroid, $v^{meta}$ will be very small, and in the extreme case nothing is passed to the classifier; this gives the downstream module a chance to switch between few-shot and zero-shot behavior.

**Modulated Attention**: To improve $v^{meta}$, self-attention is plugged in on top of the generated features:

\begin{equation} f^{att} = f + MA(f) \otimes SA(f) \end{equation}

where $MA(\cdot)$ and $SA(\cdot)$ are not explained in detail but are inherited from the `Attention Is All You Need` paper.
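The dynamic meta-embedding can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the learned `T_sel` and `T_hal` modules are replaced by hypothetical random linear maps, and the softmax over centroids is an assumed normalization for the hallucinator coefficients.

```python
import numpy as np

def dynamic_meta_embedding(v_direct, centroids, w_sel, w_hal):
    # Hallucinator: attention coefficients over the K class centroids
    # (softmax normalization is an assumption of this sketch).
    logits = w_hal @ v_direct
    logits = logits - logits.max()                 # numerically stable softmax
    coeffs = np.exp(logits) / np.exp(logits).sum()
    memory = coeffs @ centroids                    # memory feature, shape (d,)

    # Concept selector: per-dimension gate in (-1, 1), matching
    # tanh(T_sel(v_direct)) in the paper's equation.
    gate = np.tanh(w_sel @ v_direct)
    v_reweighted = v_direct + gate * memory

    # Reachability: distance to the nearest centroid scales the output,
    # so inputs far from all known classes are shrunk toward zero.
    gamma = np.min(np.linalg.norm(centroids - v_direct, axis=1))
    return v_reweighted / gamma

rng = np.random.default_rng(0)
d, K = 8, 4
centroids = rng.normal(size=(K, d))
w_sel = rng.normal(size=(d, d)) / np.sqrt(d)       # stand-in for learned T_sel
w_hal = rng.normal(size=(K, d)) / np.sqrt(d)       # stand-in for learned T_hal

near = centroids[0] + 0.01 * rng.normal(size=d)    # looks like a known class
far = centroids[0] + 100.0 * rng.normal(size=d)    # open-set-like input

v_near = dynamic_meta_embedding(near, centroids, w_sel, w_hal)
v_far = dynamic_meta_embedding(far, centroids, w_sel, w_hal)
```

The reachability term does the open-set work here: the `near` input has a tiny $\gamma$ and keeps a large embedding, while the `far` input is divided by a large $\gamma$ and arrives at the classifier heavily attenuated.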

paper. **Classification**: The paper chooses cosine similarity as the classifier to mitigate^{1} the huge difference in sample counts per class. It normalizes $v^{meta}$ to a length slightly shorter than 1 and the classifier weights to unit length before calculating the dot product. In most of the experiments, if the top softmax score is less than 0.1, the image is flagged as novel. **Result**: The authors constructed three long-tailed datasets by sampling existing datasets according to a Pareto distribution. The model achieves consistently better performance across all regions of OLTR. In the ablation study, the dynamic meta-embedding appears to be the biggest driver.
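A minimal sketch of the cosine classifier and the 0.1 novelty threshold. The squashing $v / (\lVert v \rVert + 1)$ (which yields a length slightly below 1) and the `scale` temperature are assumptions of this sketch; the paper's exact normalization may differ.

```python
import numpy as np

def cosine_classify(v_meta, weights, scale=16.0, tau=0.1):
    # Squash the embedding to length ||v|| / (||v|| + 1), i.e. slightly
    # shorter than 1 (hypothetical form of the paper's normalization).
    v = v_meta / (np.linalg.norm(v_meta) + 1.0)
    # Normalize each per-class weight vector to unit length.
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = scale * (w @ v)                 # scaled cosine similarities
    logits = logits - logits.max()           # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    pred = int(np.argmax(probs))
    return pred, probs, bool(probs[pred] < tau)  # novel if top score < tau

rng = np.random.default_rng(1)
d, K = 16, 20
weights = rng.normal(size=(K, d))            # stand-in classifier weights

# An embedding aligned with class 3's weight vector: confident, not novel.
pred, probs, novel = cosine_classify(3.0 * weights[3], weights)

# A near-zero meta-embedding (what reachability scaling produces for an
# input far from every centroid): the logits are all ~0, the softmax is
# uniform (1/20 < 0.1), so the input is flagged as novel.
pred0, probs0, novel0 = cosine_classify(np.zeros(d), weights)
```

This shows how the pieces compose: when the reachability term has already shrunk $v^{meta}$ toward zero, the cosine logits collapse toward a uniform softmax, and the 0.1 threshold fires.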

1: See `Dynamic Few-Shot Visual Learning without Forgetting` for reference. ↩

There is a lot of engineering in the paper, but the ambition of unifying classification tasks is inspiring. It leverages many recent techniques that are worth reading:

- `Attention Is All You Need`
- `Large-Margin Softmax Loss for Convolutional Neural Networks`
- `Non-local Neural Networks`
- `Dynamic Few-Shot Visual Learning without Forgetting`