English document classification using text mining (Metin madenciliği kullanarak ingilizce doküman sınıflama)

Özdoğan, Ahmet Görkem

dc.contributor.advisor	Turan, Metin
dc.contributor.author	Özdoğan, Ahmet Görkem
dc.date.accessioned	2020-12-04T18:10:10Z
dc.date.available	2020-12-04T18:10:10Z
dc.date.submitted	2019
dc.date.issued	2020-01-09
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/95170
dc.description.abstract	Today, classifying text-based documents is of serious importance, especially where corporate correspondence and digital documentation are produced in large volumes. In this study, the well-known cosine similarity and Jaccard similarity were compared with the Pointwise Mutual Information (PMI) association measure, and the results were observed. Gestalt theory with the Helmholtz principle was used for feature selection. This method has been used in text mining for tasks such as feature extraction and summarization. The document data set used in the study covers sports and education themes, with a total of 14 sub-concepts determined in advance. The cosine, Jaccard, and PMI similarity measures were compared for documents with predetermined concepts. Classification based on the averages of each document's similarity coefficients yielded different performance depending on the percentage of meaningful words. In this respect, the PMI similarity measure adapted to the distribution of meaningful words, whereas no improvement was observed for the cosine or Jaccard similarity measures. In the next part of the study, clustering results were observed by applying the PMI similarity measure to the K-Means model. To improve the results, percentage thresholds were applied to the representation of the clustered document vectors so that similar words would have a more pronounced effect. Clustering in this range achieved a success rate of up to about 70%.
dc.description.abstract	Nowadays, the classification of text-based documents is very important, especially where corporate correspondence and digital documentation are produced in large volumes. Sorting similar texts out of large piles increases productivity, and text mining offers various approaches to this problem. In this study, we compared cosine similarity and Jaccard similarity with the PMI (Pointwise Mutual Information) criterion and observed the results. Gestalt theory with the Helmholtz principle was used to identify meaningful words. This method has been used in text mining for tasks such as feature extraction and text summarization. The document data set used for the study covered sports and education themes, and a total of 14 sub-concepts were predetermined. The cosine, Jaccard, and PMI similarity criteria were compared for documents with predetermined concepts. Averaged over all documents, the similarity rate was 75% for cosine similarity, 40% for Jaccard similarity, and 55% for PMI similarity. In terms of accuracy, cosine similarity reached 80%, Jaccard similarity 65%, and PMI similarity 65%. When classification was based on the averages of each document's similarity coefficients, performance varied with the percentage of meaningful words. In this respect, the PMI similarity criterion adapted to the distribution of meaningful words, whereas no improvement was observed for the cosine or Jaccard similarity criteria. In the next part of the study, clustering results were observed by applying the PMI similarity criterion to the K-Means model. In the clustering study for randomly selected classes, 20 randomly selected documents were observed to be assigned to different classes, given that the initial random classes had been assigned different topics.
To improve the results, percentage thresholds were applied to the clustered document vectors so that words with common similarities would have a more pronounced effect. Threshold values between 25% and 75% were evaluated, and the most successful interval was 60-65%. In this range, clustering success reached up to 70%.	en_US
dc.language	Turkish
dc.language.iso	tr
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Metin madenciliği kullanarak ingilizce doküman sınıflama
dc.title.alternative	English document classification using text mining
dc.type	masterThesis
dc.date.updated	2020-01-09
dc.contributor.department	Department of Computer Engineering
dc.subject.ytm	Natural language processing
dc.subject.ytm	Text mining
dc.identifier.yokid	10306118
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	İSTANBUL TİCARET ÜNİVERSİTESİ
dc.identifier.thesisid	600821
dc.description.pages	98
dc.publisher.discipline	Other
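
Illustrative sketch (not taken from the thesis): the abstract compares cosine similarity, Jaccard similarity, and PMI. The Python snippet below shows the textbook definitions of the three measures on a toy corpus. The function names, the toy documents, and the choice to estimate PMI probabilities from document-level co-occurrence are all assumptions made for illustration; the thesis may compute these quantities differently, in particular after selecting meaningful words with the Helmholtz principle.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def pmi(word_x, word_y, documents):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))).

    Assumption: probabilities are estimated as the fraction of documents
    that contain the word (or both words).
    """
    n = len(documents)
    doc_sets = [set(d) for d in documents]
    count_x = sum(word_x in d for d in doc_sets)
    count_y = sum(word_y in d for d in doc_sets)
    count_xy = sum(word_x in d and word_y in d for d in doc_sets)
    if count_xy == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))


# Hypothetical toy corpus: two sports documents and two education documents.
docs = [
    "the team won the football match".split(),
    "the football team lost the match".split(),
    "students passed the school exam".split(),
    "the school exam was hard".split(),
]

print(cosine_similarity(docs[0], docs[1]))
print(jaccard_similarity(docs[0], docs[1]))
print(pmi("football", "match", docs))
print(pmi("football", "exam", docs))
```

Cosine similarity weighs repeated terms, Jaccard similarity looks only at which words are shared, and PMI scores how much more often two words co-occur than chance would predict, which is one way to read the abstract's remark that PMI adapts to the distribution of meaningful words.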

Files in this item

Name: yokAcikBilim_10306118.pdf
Size: 2.004Mb
Format: PDF
Description: File_10306118

This item appears in the following Collection(s)

TEZLER (Theses)
