Learning word-vector quantization: A study in morphological disambiguation of Turkish

Arslan, Enis

dc.contributor.advisor	Orhan, Umut
dc.contributor.author	Arslan, Enis
dc.date.accessioned	2020-12-29T09:03:56Z
dc.date.available	2020-12-29T09:03:56Z
dc.date.submitted	2020
dc.date.issued	2020-02-13
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/376405
dc.description.abstract	The success of NLP applications depends on accurate morphological analysis of words, the basic units of languages. Roots, part-of-speech tags, and morphological features are the basic units of a word. Morphologically complex languages such as Turkish have rich feature sets. Given the derivationally productive structure of Turkish, thousands of words can be produced from a single root word, which leads to sparsity. Morphological analyzers are tools that perform the morphological analysis of a root word. They can produce more than one parse for a single word, which indicates ambiguity. Disambiguation is a very difficult task for morphologically complex languages such as Turkish. Although studies addressing this problem have achieved high accuracy, there is still room for improvement. Sparsity and the lack of large amounts of supervised data can lead to longer running times and lower accuracy. Recent work on morphological disambiguation is generally carried out with neural learning models. To the best of our knowledge, no method has yet been proposed for Turkish that resolves morphological ambiguity by training words and positioning them in a vector space. Motivated by this gap, this thesis develops and implements a vector-space model that resolves morphological ambiguity by placing the correct candidates of an ambiguous word next to its unambiguous neighbors. The model, named learning word-vector quantization (LWQ), is a derivative of learning vector quantization (LVQ), a well-known learning algorithm. LWQ achieves better success rates than the other algorithms presented in the literature.
dc.description.abstract	Most NLP applications today depend on accurate morphological analysis of the basic units of language: words. Root words, part-of-speech (POS) tags, and morphological features are the basic units of a word. Morphologically complex languages such as Turkish have rich feature sets. Combined with productive inflectional and derivational morphology, this means that thousands of words can be produced from a single root word, which leads to sparsity. Morphological analyzers are the tools that perform the morphological analysis of a word. They can produce multiple parses for a single word, which indicates ambiguity. Disambiguation, the process of removing this ambiguity, is a much more complicated task for morphologically complex languages such as Turkish. Although studies on this task have reported high accuracy, challenges remain. Sparsity and the lack of large volumes of supervised data cause longer running times and accuracy loss. Recent studies on morphological disambiguation are generally based on neural learning models. To the best of our knowledge, a disambiguation method that takes advantage of training words in a vector space has not been proposed. Motivated by this gap, in this thesis we have developed and implemented a vector-space model that resolves morphological ambiguity by locating the correct candidates of ambiguous words near their unambiguous neighbors. The model, named learning word-vector quantization (LWQ), is an adaptation of a well-known learning algorithm, learning vector quantization (LVQ). LWQ outperforms the algorithms presented in the literature for the morphological disambiguation of Turkish.	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Learning word-vector quantization: A study in morphological disambiguation of Turkish
dc.title.alternative	Sözcük vektörü nicelleştirme öğrenmesi: Türkçe için biçimbirimsel belirsizlik giderme çalışması
dc.type	doctoralThesis
dc.date.updated	2020-02-13
dc.contributor.department	Bilgisayar Mühendisliği Anabilim Dalı
dc.subject.ytm	Natural language processing
dc.identifier.yokid	10317550
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	ÜSKÜDAR ÜNİVERSİTESİ
dc.identifier.thesisid	609281
dc.description.pages	128
dc.publisher.discipline	Other

Files in this item

Name: yokAcikBilim_10317550.pdf
Size: 3.451Mb
Format: PDF
Description: File_10317550

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess