Word context and token representations from paradigmatic relations and their application to part-of-speech induction

Sert, Enis Rifat

Date

2013

Author

Sert, Enis Rifat

Abstract

Representing words as dense real vectors in Euclidean space makes it possible to define the relatedness between words in terms of distance and angle. The regions occupied by word representations reflect the syntactic and semantic properties of the words. In addition, word representations can be added to natural language processing algorithms as features. In this thesis, we generate word representations in an unsupervised manner using paradigmatic relations, that is, the substitutability of words. By running the Euclidean embedding algorithm S-CODE, we obtain word context and word token representations in addition to word type representations. Because word context and word token representations are not restricted to a single representation per word type, they are able to cope with words that have multiple syntactic categories. We cluster the word type, word context and word token representations with the k-means algorithm and apply them to the part-of-speech induction problem. We obtain type based and token based part-of-speech inductions for the Wall Street Journal section of the Penn Treebank corpus, which is annotated with 45 part-of-speech tags. With these inductions we obtain Many-To-One mapping accuracies of 0.8025 for the type based case and 0.8039 for the token based case. To the best of our knowledge, these results make our techniques the state of the art in the field. We also introduce the 'Gold-standard-tag Perplexity' measure to quantify ambiguity and show that our token based part-of-speech inductions are successful on words with multiple syntactic categories.

Representing words as dense real-valued vectors in Euclidean space provides an intuitive definition of relatedness in terms of the distance or angle between them. The regions occupied by these word representations reveal syntactic and semantic traits of the words. In addition, word representations can be incorporated into other natural language processing algorithms as features.

In this thesis, we generate word representations in an unsupervised manner by utilizing paradigmatic relations, which concern the substitutability of words. We employ a Euclidean embedding algorithm (S-CODE) to generate word context and word token representations from the substitute word distributions, in addition to word type representations. Word context and word token representations can handle the syntactic category ambiguities of word types because they are not restricted to a single representation per word type.

We apply the word type, word context and word token representations to the part-of-speech induction problem by clustering the representations with the k-means algorithm, and obtain type based and token based part-of-speech inductions for the Wall Street Journal section of the Penn Treebank with 45 gold-standard tags. To the best of our knowledge, these results are the state of the art for both type based and token based part-of-speech induction, with Many-To-One mapping accuracies of 0.8025 and 0.8039, respectively. We also introduce a measure of ambiguity, Gold-standard-tag Perplexity, which we use to show that our token based part-of-speech induction is indeed successful at inducing the part-of-speech categories of ambiguous word types.
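The Many-To-One evaluation mentioned in the abstract can be illustrated with a minimal sketch. It maps each induced cluster to the gold tag it co-occurs with most often and reports the fraction of tokens labeled correctly under that mapping. The function names and the toy data are hypothetical and not taken from the thesis. The `tag_perplexity` helper is only one plausible reading of "Gold-standard-tag Perplexity" (2 raised to the entropy of a word type's gold-tag distribution); the abstract does not give the definition, so treat it as an assumption.

```python
from collections import Counter, defaultdict
from math import log2


def many_to_one_accuracy(clusters, gold_tags):
    """Many-to-one accuracy: map every induced cluster to its most
    frequent gold tag, then score the fraction of tokens that match."""
    if len(clusters) != len(gold_tags):
        raise ValueError("clusters and gold_tags must have equal length")
    if not gold_tags:
        raise ValueError("at least one token is required")

    # Count gold tags inside each induced cluster.
    tag_counts_by_cluster = defaultdict(Counter)
    for cluster_id, tag in zip(clusters, gold_tags):
        tag_counts_by_cluster[cluster_id][tag] += 1

    # Tokens carrying the majority tag of their cluster count as correct.
    correct = sum(max(counts.values())
                  for counts in tag_counts_by_cluster.values())
    return correct / len(gold_tags)


def tag_perplexity(tags_of_one_word_type):
    """ASSUMPTION: perplexity taken as 2 ** entropy of the gold-tag
    distribution observed for one word type. The thesis may define
    its measure differently."""
    counts = Counter(tags_of_one_word_type)
    total = sum(counts.values())
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return 2 ** entropy


# Hypothetical toy data: six tokens, three induced clusters.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_gold = ["NN", "NN", "VB", "VB", "VB", "DT"]
toy_accuracy = many_to_one_accuracy(toy_clusters, toy_gold)
```

Under this reading, a word type that always carries the same gold tag has perplexity 1, and a word type split evenly between two tags has perplexity 2, which is why a measure of this kind can separate ambiguous from unambiguous word types.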

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/168962

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess