Comparable corpora as cure-all remedy: possibilities for mapping languages without parallel resources

Serge Sharoff

Parallel corpora have become a staple of modern Machine Translation. However, the typical sources of truly parallel data are institutional repositories, such as those of the UN or the European Parliament, which are limited in the language pairs and domains they cover. In this talk, I will present my experience of working with less strictly parallel resources, such as comparable corpora, which can still provide sufficient data for building translation models. In particular, I will discuss the possibilities for constructing cross-lingual word embedding spaces between related languages.

