Unsupervised morphological analysis using tries

Ak, Koray

Date

2011

Author

Ak, Koray

Abstract

Morphological analysis, or decomposition, examines the structure, formation, and function of words, identifies the morphemes (the smallest meaning-bearing units within words), and attempts to derive a model of the language. It is used in areas such as speech processing, machine translation, information retrieval, text understanding, and statistical language modeling. Because a text contains many word forms, morphological analysis is both difficult and necessary for most languages. In inflected languages, thousands of different word forms can belong to the same root, which makes it hard to build lists of inflected word forms. Given that natural language processing applications work with large volumes of data, having linguists do this work by hand is not feasible in terms of complexity and real-time processing. The task therefore has to be carried out by automated morphological analysis algorithms. In this context, raw text corpora can be processed by systems that use unsupervised morphological analyzers. In this study, an unsupervised learning algorithm is proposed that extracts information about text corpora and the model of the language. The algorithm builds tries from the words occurring in the text corpus and tries to find the roots and affixes of the given words according to how frequently the words occur. After the roots are extracted, the algorithm builds affix tries from the remaining word parts and recursively finds all affixes. The algorithm was tested on Finnish, English, and Turkish and gave better results than most previous work.

Morphological analysis, or decomposition, studies the structure, formation, and function of words, identifies the morphemes (the smallest meaning-bearing elements) of the language, and attempts to formulate rules that model the language. It is widely used in areas such as speech recognition, machine translation, information retrieval, text understanding, and statistical language modeling. Because natural language processing applications deal with large amounts of data, it is not feasible to have linguists analyze a text corpus by hand; the complexity and real-time processing requirements call for automated morphological analysis. As an alternative to hand-made systems, there exist algorithms that work in an unsupervised manner and autonomously perform morphological analysis of the words in an unannotated text corpus. In this thesis, an unsupervised learning algorithm is proposed to extract information about the text corpus and the model of the language. The proposed algorithm constructs a trie whose nodes hold characters and the occurrence counts of the words. The algorithm then detects the roots of the given words by examining the occurrences along the path of each word. Once the root is revealed, the algorithm creates a new trie from the affix parts left after the root of each word. The algorithm continues recursively until there is no affix left to process. Experimental results on three languages (Finnish, English, and Turkish) show that our novel algorithm performs better than most of the previous algorithms in the field and gives promising results.
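
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not state the exact rule used to detect a root from the occurrence counts, so the cut rule below (split where the count along the path falls below a fraction of the previous count) is an assumption, as are the names `TrieNode`, `build_trie`, `split_point`, `segment` and the parameters `min_root` and `ratio`.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.count = 0      # summed frequency of words whose path passes through this node


def build_trie(word_counts):
    """Build a character trie that stores occurrence counts on every node."""
    root = TrieNode()
    for word, freq in word_counts.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.count += freq
    return root


def split_point(trie, word, min_root, ratio):
    """Return the index at which the assumed root of `word` ends.

    Assumed heuristic (not taken from the thesis): cut at the first position
    at or after `min_root` where the count drops below `ratio` times the
    count of the previous character on the path.
    """
    node = trie
    prev = 0
    for i, ch in enumerate(word):
        node = node.children[ch]
        if i >= min_root and node.count < ratio * prev:
            return i
        prev = node.count
    return len(word)


def segment(word_counts, min_root=3, ratio=0.5):
    """Map each word to a list of segments: root first, then affixes."""
    trie = build_trie(word_counts)
    heads = {w: w[:split_point(trie, w, min_root, ratio)] for w in word_counts}

    # Collect what is left after each root; these parts form the next trie.
    rest = Counter()
    for word, freq in word_counts.items():
        tail = word[len(heads[word]):]
        if tail:
            rest[tail] += freq
    if not rest:  # no affix left to process
        return {w: [w] for w in word_counts}

    # Recurse on the affix parts; an affix may be a single character.
    tails = segment(rest, min_root=1, ratio=ratio)
    return {w: [heads[w]] + tails.get(w[len(heads[w]):], []) for w in word_counts}


corpus = Counter(["walk", "walks", "walked", "walking",
                  "talk", "talks", "talked", "talking"])
print(segment(corpus)["walking"])  # ['walk', 'ing']
```

On this toy corpus the counts stay at 4 along `w-a-l-k` and drop to 1 at the next character, so the sketch cuts after `walk`. Real corpora would need a more careful criterion than a single fixed ratio.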

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/93871

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess