Bilingual distributed word representations from document-aligned comparable data
From MaRDI portal
Publication: 2800962
DOI: 10.1613/JAIR.4986 · zbMATH Open: 1352.68264 · arXiv: 1509.07308 · OpenAlex: W2229725139 · Wikidata: Q129489800 · Scholia: Q129489800 · MaRDI QID: Q2800962 · FDO: Q2800962
Authors: Ivan Vulić, Marie-Francine Moens
Publication date: 19 April 2016
Published in: Journal of Artificial Intelligence Research (JAIR)
Abstract: We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data, as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.
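The abstract's core idea (learning a shared bilingual space from document-aligned pairs alone, then using it for lexicon extraction) can be sketched very roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the shuffle-based merging strategy, and the toy two-dimensional vectors are all assumptions introduced here for illustration.

```python
import math
import random


def merge_and_shuffle(doc_l1, doc_l2, seed=0):
    """Merge one document-aligned pair (two token lists) into a single
    pseudo-bilingual document by shuffling the combined tokens.

    Training any monolingual embedding model (e.g. skip-gram) on a corpus
    of such merged documents places words of both languages in one shared
    vector space, because translation-equivalent words now co-occur in
    shared contexts. No dictionary or sentence alignment is needed.
    """
    merged = list(doc_l1) + list(doc_l2)
    random.Random(seed).shuffle(merged)  # fixed seed for reproducibility
    return merged


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def extract_lexicon(src_vecs, tgt_vecs):
    """Bilingual lexicon extraction: for every source word, return the
    nearest target word by cosine similarity in the shared space."""
    return {
        s: max(tgt_vecs, key=lambda t: cosine(sv, tgt_vecs[t]))
        for s, sv in src_vecs.items()
    }


# Toy shared-space vectors (hypothetical, for illustration only).
en = {"dog": (1.0, 0.1), "cat": (0.1, 1.0)}
nl = {"hond": (0.9, 0.2), "kat": (0.2, 0.95)}
print(extract_lexicon(en, nl))  # -> {'dog': 'hond', 'cat': 'kat'}
```

In a real pipeline the merged pseudo-bilingual documents would be fed to an off-the-shelf embedding trainer, and the resulting vectors would replace the toy vectors above.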
Full work available at URL: https://arxiv.org/abs/1509.07308
Recommendations
- A survey of cross-lingual word embedding models
- A statistical view on bilingual lexicon extraction. From parallel corpora to nonparallel corpora
- scientific article; zbMATH DE number 1949705
- scientific article; zbMATH DE number 1949716
- Topically-informed bilingually-constrained recursive autoencoders for statistical machine translation
MSC classification: Learning and adaptive systems in artificial intelligence (68T05); Natural language processing (68T50)
Cited In (5)
- A statistical view on bilingual lexicon extraction. From parallel corpora to nonparallel corpora
- Topically-informed bilingually-constrained recursive autoencoders for statistical machine translation
- NASARI: integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities
- Unsupervised translation disambiguation based on web indirect association of a bilingual word
- A survey of cross-lingual word embedding models