Meta-Learning for Low-Resource Neural Machine Translation

Posted on 19/09/2019, in Paper.
  • Overview: This paper frames low-resource NMT as a meta-learning problem and shows that the proposed approach outperforms multilingual, transfer-learning-based baselines.
  • Low-resource translation: There are two ways to handle insufficient training data in NMT: (a) use unlabeled data; (b) share knowledge between low- and high-resource languages. Previous multilingual NMT work jointly trains a multilingual system; this does not necessarily yield a universal representation space.
  • Meta-learning for NMT: MAML provides a way to learn a good parameter initialization. In this paper, each meta-learning episode first samples a task $T^k$, then samples two subsets of training examples independently from the chosen task: one (the support set $D$) simulates the learning process, the other (the query set $D'$) evaluates the result. The inner learning process also penalizes the length of the update (i.e. $||\theta - \theta^0||$), and the outer loop uses the following first-order approximation of the meta-gradient: \begin{equation} \nabla_\theta L^{D'}(\theta') \approx \nabla_{\theta'} L^{D'}(\theta') \end{equation}
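The episode structure above can be sketched on a toy problem. The snippet below is a minimal first-order MAML loop on scalar linear regression (tasks differ by slope), not the paper's NMT setup: it adapts on a support set, then uses the query-set gradient at the adapted parameters directly as the meta-gradient, and includes the hypothetical update-length penalty `lam * (theta_k - theta)` mentioned in the note. All names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(theta, X, y):
    """MSE loss of the scalar linear model y_hat = theta * x, and its gradient."""
    err = theta * X - y
    return np.mean(err ** 2), np.mean(2 * err * X)

def sample_task(slope, n=20):
    """Sample support (D) and query (D') sets independently from one task."""
    Xs, Xq = rng.normal(size=n), rng.normal(size=n)
    return Xs, slope * Xs, Xq, slope * Xq

def meta_step(theta, tasks, alpha=0.05, beta=0.1, lam=0.01, inner_steps=3):
    """One meta-update: adapt on D per task, then apply the first-order
    approximation grad_theta L^{D'}(theta') ~= grad_{theta'} L^{D'}(theta')."""
    meta_grad = 0.0
    for Xs, ys, Xq, yq in tasks:
        theta_k = theta
        for _ in range(inner_steps):
            _, g = loss_and_grad(theta_k, Xs, ys)
            # inner update, penalizing the update length ||theta_k - theta||^2
            theta_k -= alpha * (g + lam * 2 * (theta_k - theta))
        _, gq = loss_and_grad(theta_k, Xq, yq)  # query grad at adapted params
        meta_grad += gq
    return theta - beta * meta_grad / len(tasks)

theta = 0.0
for _ in range(200):
    tasks = [sample_task(s) for s in rng.uniform(1.0, 3.0, size=4)]
    theta = meta_step(theta, tasks)
```

After training, the initialization settles near the center of the task distribution, so a few inner steps suffice to fit any sampled slope.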
  • Unified Lexical Representation (ULR): One challenge in multilingual NLP is the mismatch of vocabulary spaces $V_k$. ULR starts with embeddings trained separately on each language $k$, $\epsilon_{\text{query}}^k$. The universal embedding system is represented as key-value pairs $(\epsilon_{\text{key}}, \epsilon_u)$. We can choose $\epsilon_u$ to be inherited from one main language, e.g. English. The embedding for token $x$ in language $k$ is then generated as: \begin{equation} e[x] = \sum_{i=1}^M \alpha_i \epsilon_{u,i}, \, \text{where} \, \alpha_i \propto \exp \left(\frac{1}{\tau}\, \epsilon_{\text{key},i}^\top A\, \epsilon_{\text{query}}^k[x] \right) \end{equation} The shared parameters $\epsilon_u$, $\epsilon_{\text{key}}$ and $A$ are updated in the outer loop.
    • During the fine-tuning stage, only an incremental portion of the embedding is updated.
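The ULR equation above is just softmax attention over the universal key-value table. A minimal sketch, with made-up dimensions and randomly initialized matrices standing in for the trained embeddings and the learned alignment matrix `A`:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ulr_embedding(query, keys, universal, A, tau=0.05):
    """e[x] = sum_i alpha_i * eps_u_i, alpha_i ∝ exp((1/tau) key_i^T A query)."""
    scores = keys @ A @ query / tau   # shape (M,): one score per universal token
    alpha = softmax(scores)
    return alpha @ universal          # convex mixture of universal embeddings

# toy setup: M universal tokens, d-dimensional embeddings (hypothetical sizes)
rng = np.random.default_rng(0)
M, d = 8, 4
keys = rng.normal(size=(M, d))       # eps_key, shared across languages
universal = rng.normal(size=(M, d))  # eps_u, e.g. inherited from English
A = np.eye(d)                        # learned alignment matrix (identity here)
query = rng.normal(size=d)           # eps_query^k[x] from language-k embeddings

e_x = ulr_embedding(query, keys, universal, A)
```

With a low temperature `tau`, the mixture concentrates on the few universal tokens whose keys align best with the query, which is what lets unseen low-resource vocabularies map into the shared space.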
  • Result: The experiments mainly use Europarl, WMT'16 & '17 and the Korean Parallel Dataset, with Romanian (Ro), Latvian (Lv), Finnish (Fi), Turkish (Tr) and Korean (Ko) as the low-resource languages. The proposed method beats multilingual transfer learning, especially in zero-shot settings or when the training set is small. Adding more source tasks generally helps, but the gain depends on the relatedness between languages. Meta-learning is also harder to overfit to the data.


  • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks