Dokümanların anlamsal benzerliklerine dayalı özgün bir konu modelleme yöntemi

Ekinci, Ekin

dc.contributor.advisor	İlhan Omurca, Sevinç
dc.contributor.author	Ekinci, Ekin
dc.date.accessioned	2020-12-29T12:52:49Z
dc.date.available	2020-12-29T12:52:49Z
dc.date.submitted	2019
dc.date.issued	2019-11-07
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/414743
dc.description.abstract	The Web, which offers billions of pieces of structured and unstructured content to us, its users, has become one of today's important data sources. The content on offer grows with each passing day, and newer, more effective methods must be developed to extract the desired information from it automatically and to organize, analyze and understand the extracted information. Topic models stand out as a powerful and successful method for carrying out these tasks. Among topic models, which first emerged in 1990, the newest and most successful is Latent Dirichlet Allocation (LDA). LDA, a generative graphical method used to model discrete data such as documents and to uncover the topics that make up a document, takes into account only the co-occurrence of words in the document collection. It does not, however, take into account the semantic information they contain. This is a significant disadvantage. In this thesis, two topic models are proposed that incorporate semantic information, in the form of concepts and named entities, into LDA, with the aim of obtaining topics that are semantically related, coherent, able to capture details, and more meaningful. In the first method, called Concept-LDA, the bag-of-words approach that is LDA's basic assumption is extended into a bag of {word+concept+named entity}, with semantic enrichment as the goal. Concept-LDA is a domain-independent method. In the second method, called NET-LDA, semantically similar documents are merged, and the semantic similarity information obtained in the merging step is included in the model as a new adaptive parameter. NET-LDA is independent of both domain and language, and both methods extract successful topics. Babelfy, a graph-based approach, is used to obtain the semantic information. The performance of the developed methods is evaluated both quantitatively and qualitatively. English user reviews of twelve different products are used to evaluate Concept-LDA; user reviews of thirteen different products, one in Turkish and the other twelve in English, are used to evaluate NET-LDA. In addition, the developed methods are compared, both quantitatively and qualitatively, with the results obtained from three baseline methods. The experiments show that including semantic information in the model yields topics that are semantically related, coherent, able to capture details, and more meaningful. The experiments also demonstrate that the developed methods are quite successful compared with the baseline methods.
dc.description.abstract	The Web, which provides its users with billions of pieces of structured and unstructured content, has become one of today's important data sources. This content grows day by day, so newer and more effective methods are needed to extract the desired information from it automatically and to organize, analyze and understand the extracted information. Topic models are a powerful and successful method for performing these tasks. Among topic models, which first appeared in 1990, Latent Dirichlet Allocation (LDA) is the most recent and successful. LDA, a generative graphical method used to model discrete data such as documents and to reveal the topics that compose them, considers only how words co-occur in the documents. It does not consider the semantic information the documents contain, which is a significant drawback. In this thesis, two topic models are devised that incorporate semantic knowledge, in the form of concepts and named entities, into LDA in order to obtain semantically related, coherent, detailed and more meaningful topics. In the first method, called Concept-LDA, the bag-of-words representation that is the basic assumption of LDA is expanded into a bag of {words+concepts+named entities}, with the aim of semantic enrichment. The proposed Concept-LDA is independent of domain. In the second method, called NET-LDA, semantically similar documents are merged, and the semantic similarity obtained in the merging step is injected into the model as a new adaptive parameter. NET-LDA is independent of both domain and language. Semantic knowledge is obtained with Babelfy, a graph-based approach. The performance of the proposed methods is evaluated both quantitatively and qualitatively. Concept-LDA is evaluated on user reviews from twelve different domains; NET-LDA is evaluated on user reviews from thirteen different domains, one in Turkish and the other twelve in English. In addition, the proposed methods are compared, both quantitatively and qualitatively, with the results obtained from three baselines. The experiments show that incorporating semantic knowledge into the model yields semantically related, coherent, detailed and more meaningful topics. The experiments also demonstrate that the proposed methods are fairly successful compared to the baselines.	en_US
dc.language	Turkish
dc.language.iso	tr
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Dokümanların anlamsal benzerliklerine dayalı özgün bir konu modelleme yöntemi
dc.title.alternative	An original topic model method based on semantic similarity of documents
dc.type	doctoralThesis
dc.date.updated	2019-11-07
dc.contributor.department	Bilgisayar Mühendisliği Anabilim Dalı
dc.identifier.yokid	10241909
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	KOCAELİ ÜNİVERSİTESİ
dc.identifier.thesisid	575122
dc.description.pages	109
dc.publisher.discipline	Other
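
The two ideas described in the abstract, enriching the bag of words with concept and named-entity tokens (Concept-LDA) and merging semantically similar documents (NET-LDA), can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `TOY_ANNOTATIONS` table is a hypothetical stand-in for Babelfy output, and the `concept:`/`entity:` prefixes, the Jaccard measure and the 0.5 threshold are assumptions chosen only for illustration. The adaptive similarity parameter that NET-LDA adds to the model is not shown.

```python
from collections import Counter

# Hypothetical stand-in for a semantic annotator such as Babelfy.
# Toy data for illustration only, not real annotator output.
TOY_ANNOTATIONS = {
    "battery": [("concept", "electric_battery")],
    "iphone": [("entity", "Apple_iPhone")],
    "screen": [("concept", "display_device")],
}

def enrich(tokens):
    """Return the tokens plus one prefixed pseudo-token per annotation."""
    bag = list(tokens)
    for tok in tokens:
        for kind, label in TOY_ANNOTATIONS.get(tok, []):
            bag.append(f"{kind}:{label}")
    return bag

def semantic_labels(bag):
    """Collect the concept/entity pseudo-tokens of an enriched bag."""
    return {t for t in bag if t.startswith(("concept:", "entity:"))}

def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_similar(bags, threshold=0.5):
    """Greedily merge each bag into the first group with enough label overlap."""
    groups = []  # each group holds a merged bag and its label set
    for bag in bags:
        labels = semantic_labels(bag)
        for group in groups:
            if jaccard(labels, group["labels"]) >= threshold:
                group["bag"].extend(bag)
                group["labels"] |= labels
                break
        else:
            groups.append({"bag": list(bag), "labels": set(labels)})
    return [group["bag"] for group in groups]

docs = [
    ["iphone", "battery", "drains"],
    ["battery", "life", "iphone"],
    ["screen", "cracked"],
]
enriched = [enrich(d) for d in docs]       # bag of {words+concepts+named entities}
merged = merge_similar(enriched)           # semantically similar reviews merged
counts = [Counter(bag) for bag in merged]  # term counts per merged document
```

Each merged bag could then be passed as a single document to a standard LDA implementation; the first two toy reviews share both annotations and are merged, while the third stays separate.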

Files in this item

Name: yokAcikBilim_10241909.pdf
Size: 4.414Mb
Format: PDF
Description: File_10241909

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess