Development of a novel text representation model by combining Latent Dirichlet Allocation and Word2vec methods

Çelenli, Halil İbrahim

Date

2020

Author

Çelenli, Halil İbrahim

Abstract

Recently, with the growth in the amount of available data, deep learning has become the most popular field of machine learning. This growth has also led to the development of new methods in Natural Language Processing. In traditional approaches, textual data is represented with word representation methods such as the Bag of Words model. Newer approaches, however, have begun to use word embedding methods for speed and efficiency. The most popular word embedding method is Word2vec. Word2vec uses artificial neural networks to learn a vector representation for each word from the statistics of the contexts in which the word occurs. To represent documents, Doc2vec, a method based on word embeddings, is used. Topic modeling techniques, in turn, treat documents as formed by words drawn at random from topic probability distributions. The most commonly used topic model is Latent Dirichlet Allocation (LDA). The LDA model produces two distributions: the distribution of topics over documents and the distribution of words over topics.

In this thesis, a new word embedding vector was developed by combining the Word2vec method with the LDA model distributions, which allows documents to be represented better. The developed method was compared with the Doc2vec methods used for document representation by means of classification algorithms. The classification results show that the developed method improves results and reduces model complexity.
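The abstract does not state how the Word2vec vectors and the LDA distributions are combined. Purely as an illustration, the minimal sketch below assumes one simple option: concatenating each word's embedding with its per-topic weights and averaging the result over a document. The toy values and the names `word_vectors`, `word_topics`, `combined_word_vector`, and `document_vector` are hypothetical and are not taken from the thesis; in practice the inputs would come from trained Word2vec and LDA models.

```python
# Hypothetical toy inputs (not from the thesis). In a real pipeline these
# would be read from a trained Word2vec model and a trained LDA model.
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.3],
    "vote": [0.1, 0.9],
}
# Assumed per-word topic weights, one entry per topic.
word_topics = {
    "goal": [0.9, 0.1],
    "match": [0.7, 0.3],
    "vote": [0.2, 0.8],
}

# Length of a combined vector: embedding size plus number of topics.
COMBINED_DIM = len(word_vectors["goal"]) + len(word_topics["goal"])


def combined_word_vector(word):
    """Concatenate a word's embedding with its topic weights (assumed scheme)."""
    return word_vectors[word] + word_topics[word]


def document_vector(tokens):
    """Average the combined vectors of all in-vocabulary tokens."""
    known = [combined_word_vector(t) for t in tokens if t in word_vectors]
    if not known:
        # No known words: fall back to a zero vector of the right length.
        return [0.0] * COMBINED_DIM
    return [sum(v[i] for v in known) / len(known) for i in range(COMBINED_DIM)]


if __name__ == "__main__":
    print(document_vector(["goal", "match", "unseen"]))
```

A document vector built this way could then be passed to any standard classifier, which matches the kind of comparison against Doc2vec that the abstract describes. Concatenation is only one candidate; weighting embeddings by topic probabilities would be an equally plausible reading.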

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/410902

Collections

Theses

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess