Multilingual distributed word representation using deep learning

Sohsah, Gihad

Date

2016

Author

Sohsah, Gihad

Abstract

In this study, the problem of extracting meaningful multilingual representations is examined with a focus on morphologically rich languages. To achieve multilinguality, a data-driven method that uses sentence-aligned texts is employed. This method is extended hierarchically so that it considers not only bare words but also word parts, affixes, and other morphemes. In addition, different architectures and composition functions for building word and sentence representations from their parts are examined and compared. The composition function for a language is chosen according to the properties of that language. Three tests, one of which serves as a quick check, are used to compare the different methods and to determine whether the representations are fit for purpose. The paraphrase test, built as the quick check, is used both to verify that the models learn from the texts and to set the parameters. The second test, t-SNE, is a visual test of whether the model brings equivalent words and sentences together. The third test, which gives the most meaningful result, is a cross-lingual document classification (CLDC) test. This test concerns evaluating a supervised classifier, trained in a resource-rich language, on a low-resource language without loss of performance. Model performance is reported as the F1 score, a measure that takes both recall and precision into account. This study focuses on using word representations to transfer knowledge from a resource-rich language such as English to a low-resource language such as Turkish, so that models trained on English achieve adequate performance when classifying Turkish documents. In this test, an averaged-perceptron classifier is trained using the TED-Corpus. Once trained, the classifier must assign a document to one of 14 classes according to its representation. This document representation is formed by adding up the representations of all its words. Because the averaged perceptron is a binary classifier, it is turned into a 14-class classifier using the one-vs-all technique. According to the experiments, the multilingual method that extracts word representations jointly for Turkish and English and uses additive composition functions gave the best CLDC F1 scores at both the word and sentence levels. Different data processing methods are also examined. Sentences can be split into words on whitespace using a tokenizer or, for languages with rich morphology, into morphemes using a morphological analyzer. According to the experimental results, using words for English and morphemes for Turkish gave the best results.
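
The classification setup described in the abstract above (a document vector formed by summing word vectors, and a binary averaged perceptron extended to several classes with one-vs-all) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the two-dimensional toy embeddings, the two class labels, and the training settings are hypothetical and are not taken from the thesis, which uses 14 classes and the TED-Corpus.

```python
import random

def doc_vector(tokens, embeddings, dim):
    """Sum the embeddings of all tokens that have one; unknown tokens are skipped."""
    vec = [0.0] * dim
    for tok in tokens:
        emb = embeddings.get(tok)
        if emb is not None:
            for i, x in enumerate(emb):
                vec[i] += x
    return vec

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_binary_averaged_perceptron(X, y, epochs=10, seed=0):
    """Binary averaged perceptron; labels in y are +1 or -1.
    Returns the average of the weight vector over all steps (last entry is the bias)."""
    rng = random.Random(seed)
    dim = len(X[0]) + 1
    w = [0.0] * dim
    total = [0.0] * dim
    steps = 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = X[i] + [1.0]              # append a constant bias feature
            if y[i] * dot(w, x) <= 0:     # misclassified: perceptron update
                for j in range(dim):
                    w[j] += y[i] * x[j]
            for j in range(dim):          # accumulate weights for averaging
                total[j] += w[j]
            steps += 1
    return [t / steps for t in total]

def train_one_vs_all(X, labels, classes, epochs=10):
    """Train one binary classifier per class: that class (+1) against all others (-1)."""
    return {
        c: train_binary_averaged_perceptron(
            X, [1 if label == c else -1 for label in labels], epochs)
        for c in classes
    }

def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    x = x + [1.0]
    return max(models, key=lambda c: dot(models[c], x))

# Toy shared English/Turkish embedding space (invented values, for illustration only).
embeddings = {
    "science": [1.0, 0.0], "bilim": [0.9, 0.1],
    "music":   [0.0, 1.0], "müzik": [0.1, 0.9],
}
classes = ["technology", "art"]

# Train on English documents only ("talk" has no embedding and is skipped).
train_docs = [["science", "talk"], ["music", "talk"]]
train_labels = ["technology", "art"]
X_train = [doc_vector(d, embeddings, 2) for d in train_docs]
models = train_one_vs_all(X_train, train_labels, classes)

# Classify Turkish documents with the English-trained models.
pred_bilim = predict(models, doc_vector(["bilim"], embeddings, 2))
pred_muzik = predict(models, doc_vector(["müzik"], embeddings, 2))
print(pred_bilim, pred_muzik)
```

The sketch only works across languages because the toy embeddings place translation pairs close together; producing such a shared space is the part the thesis itself addresses.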

In this work, the problem of extracting meaningful multilingual word embeddings is studied, with a special focus on morphologically rich languages. To achieve multilingualism, a data-driven method that makes use of a sentence-aligned parallel corpus is used. This method is extended hierarchically to take into account word parts, tokens, or morphemes, rather than treating raw words as the only basic language units. Various architectures and aggregation functions for constructing word and sentence embeddings from their parts are also studied and compared. The aggregation function for a specific language is chosen according to the nature of that language. To evaluate the different methods, one sanity check is used, along with two further tests of the quality of the resulting representations. The sanity check is the paraphrase test, which is mainly used to make sure that the models are learning from the corpus and is also used for parameter tuning. The second test, t-SNE, is a visual test that gives insight into the model's ability to bring semantically equivalent words and sentences close to each other in the embedding space. The third test, which gives the most meaningful measure, is the cross-lingual document classification (CLDC) task. The CLDC task is concerned with training a supervised classifier on documents from one language, the resource-rich one, and testing it on another, the low-resource one, while maintaining satisfactory performance. The performance of the models is reported in terms of the F1 score. As the experiments show, a multilingual framework that extracts word embeddings jointly for English and Turkish and uses additive functions at both the sentence and word levels gives the best F1 score on the CLDC task. Various data preprocessing methods are also studied: words can be extracted by simply splitting sentences on spaces, by using a tokenizer, or, for morphologically rich languages, by using a morphological analyzer. The experiments showed that using the raw data format for English and morphemes as the basic language unit for Turkish yields the best results.
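
The hierarchical additive composition mentioned in the abstract (word vectors built from their parts, sentence vectors built from word vectors) could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the hand-written segmentation of the Turkish word "evlerde" into ev + ler + de, and the toy two-dimensional embedding values are hypothetical, and no real morphological analyzer or trained model is involved.

```python
def add_vectors(vectors, dim):
    """Additive aggregation: element-wise sum of a list of vectors."""
    total = [0.0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            total[i] += x
    return total

def word_vector(units, unit_emb, dim):
    """Word level: sum the embeddings of the word's units (morphemes, or the word itself)."""
    return add_vectors([unit_emb[u] for u in units if u in unit_emb], dim)

def sentence_vector(segmented_words, unit_emb, dim):
    """Sentence level: sum the vectors of the words in the sentence."""
    return add_vectors([word_vector(w, unit_emb, dim) for w in segmented_words], dim)

# Toy morpheme embeddings (invented values, for illustration only).
morpheme_emb = {
    "ev":   [1.0, 0.0],    # "house"
    "ler":  [0.0, 0.5],    # plural suffix
    "de":   [0.5, 0.0],    # locative suffix
    "kedi": [0.25, 0.25],  # "cat"
}

# "evlerde" ("in the houses"), segmented by hand here; in practice a
# morphological analyzer would supply the segmentation.
evlerde = word_vector(["ev", "ler", "de"], morpheme_emb, 2)

# A two-word input, each word given as its list of units.
sentence = sentence_vector([["kedi"], ["ev", "ler", "de"]], morpheme_emb, 2)
print(evlerde, sentence)
```

For a language treated at the word level, such as English in the reported best setup, each word would simply be passed as a one-element list of units.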

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/631585

Collections

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess