Power of frequencies: N-grams and semi-supervised morphological segmentation in Turkish

Kiliç, Özkan

dc.contributor.advisor	Bozşahin, Hüseyin Cem
dc.contributor.author	Kiliç, Özkan
dc.date.accessioned	2020-12-10T09:13:59Z
dc.date.available	2020-12-10T09:13:59Z
dc.date.submitted	2013
dc.date.issued	2018-08-06
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/225455
dc.description.abstract	Türkçe serbest sözcük dizimine sahip bitişimli bir dildir. İletişim sırasında, Türkçedeki kelimelerin yapısal bölümlerine ayrılması gereklidir; çünkü Türkçenin biçimbilimsel sözdizimi karışıktır ve bu durum anlamsal çözümlemede merkezi bir rol oynar. Sözcük-altı parçacıkların ayrıştırılması aslında çocuklar tarafından şaşırtıcı bir başarıyla gerçekleştirilen bir biçimbirim bölme işlemidir. Bu çalışmada, Türkçe kelimelerin biçimbirim ayrıştırılması bir yarı-denetimli Gizli Markov Modeli ile gösterilmiştir. Model, tekrarların ve dizilimlerin gücünü dil ediniminde doğrudan (veya dolaylı olumsuz) kanıt olarak vurgulamaktadır. Yöntem, ODTÜ Türkçe Derlemi ve ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi tarafından eğitildikten sonra .88, .92 ve .90 (duyarlık, doğruluk, f-değeri) ölçümlerine ulaşmıştır. Ayrıca, bileşik sözcük tanımlama ve bölme için istatistiksel yaklaşımlar önerilmiştir. Bilişsel bilimlerde sıklıkların kullanımını desteklemek amacıyla, Türkçe sıfat pekiştirme ve sahte kelimelerin kabul edilebilirliği ile ilgili deneysel çalışmalar ve ilgili istatistiksel modeller bu çalışmada önerilmiştir. Bu çalışma şunu göstermektedir; çocukları yönlendiren konuşmalarda olası kelime formları ve muhtemel olmayan biçimbirim sıralarına yönelik çarpık bir olasılık yığını olduğu için, bu yığın çeşitli istatistiksel modeller tarafından insan düzeyinde dilbilimsel yetenekleri taklit etmede kullanılabilir. Ayrıca, insanlar istatistiksel bir öğrenme yeteneğine sahiptir ve bu yetenek doğalcıların iddia ettiği gibi dil yetisine has değildir fakat genel bilişsel yeteneklere dahildir. Bu durum dili analiz edecek hesaplamalı ve istatistiksel modellerin anlamlı ve geçerli kullanımlarına olanak sağlamaktadır. Böyle tahminsel modeller dilin derinlemesine anlaşılmasına izin vermektedir. Anahtar Kelimeler: Biçimbirim Bölme; Dolaylı Olumsuz Delil; Yarı-denetimli Öğrenme
dc.description.abstract	Turkish is an agglutinating language with a non-rigid word order. When communicating, the word internal structure in Turkish is required to be segmented because Turkish morphosyntax is tortuous and it plays a central role in semantic analysis. Distinguishing a sub-word unit actually means performing a morph segmentation task, which is accomplished by children at an astonishing success rate. In this study, morph segmentation of Turkish words was demonstrated with a semi-supervised Hidden Markov Model, which emphasized the power of frequencies and sequences as direct (or indirect negative) evidence for language acquisition. The method achieved .88, .92 and .90 (precision, recall and f-score) measures after being trained by the METU Corpus and the METU-Sabancı Turkish Treebank. Additionally, statistical approaches were offered for compound word recognition and segmentation. In order to corroborate the use of frequencies in the cognitive studies, the experimental studies and the corresponding statistical models in Turkish emphatic reduplication and the acceptability of nonce words were also proposed in this study. This study shows that since the probability mass in child-directed speech is skewed toward possible word forms and unlikely morph sequences, this mass can be used by various models to mimic human-level linguistic capabilities. Furthermore, human beings have a statistical learning ability and it is not specific to the faculty of language as claimed by nativists but to general cognition. This allows the plausible and valid use of computational and statistical models to analyze language. Such predictive models can allow a deeper understanding of language. Keywords: Indirect Negative Evidence; Morph Segmentation; Semi-supervised Learning	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.subject	Dilbilim	tr_TR
dc.subject	Linguistics	en_US
dc.title	Power of frequencies: N-grams and semi-supervised morphological segmentation in Turkish
dc.title.alternative	Tekrarların gücü: Türkçe'de N-gramlar ve yarı-denetimli biçimbilimsel bölme
dc.type	doctoralThesis
dc.date.updated	2018-08-06
dc.contributor.department	Bilişsel Bilim Anabilim Dalı
dc.identifier.yokid	10005431
dc.publisher.institute	Enformatik Enstitüsü
dc.publisher.university	ORTA DOĞU TEKNİK ÜNİVERSİTESİ
dc.identifier.thesisid	343082
dc.description.pages	159
dc.publisher.discipline	Diğer

Files in this item

Name: yokAcikBilim_10005431.pdf
Size: 1.563Mb
Format: PDF
Description: File_10005431

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess