Metin madenciliği ile doküman demetleme

M.Taha, Syolai

dc.contributor.advisor	Özdemir, Suat
dc.contributor.author	M.Taha, Syolai
dc.date.accessioned	2020-12-29T08:26:01Z
dc.date.available	2020-12-29T08:26:01Z
dc.date.submitted	2011
dc.date.issued	2018-08-06
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/362989
dc.description.abstract	Günümüzde, büyük miktardaki veri Internet ortamında yer alan dokümanlar şeklinde saklanmaktadır. Buradaki esas problem bu verilerden önemli bilgileri çıkarmak ve keşfedilmemiş örüntüleri bulmaktır. Bu problemin çözümü için kullanılabilecek yöntemlerden birisi de kümeleme teknikleri ile dokümanlar arasındaki ilişkileri gruplayarak, farklı gruplar arasındaki ilişkileri ve örüntüleri bulmaktır. Kümeleme analizi, nesnelerin sınıflandırılmasını detaylı bir şekilde açıklamak hedefiyle geliştirilmiştir. Bu hedefe yönelik olarak, elemanlar içlerindeki benzerliklere göre gruplara ayrılır. Diğer bir hedef ise, benzer elemanların gruplanmasıyla veri setini küçültmektir. Bu çalışmanın amacı bölünmeli kümeleme teknikleri kullanarak İngilizce ve Türkçe metinlerde bulunan verileri belirli başlıklar altında kümeleyerek gerekli bilgiyi elde etmektir. Çalışmada metinlerin tümü Terim Frekansı - Ters Doküman Frekansı (TF-IDF) vektörleri ile ifade edilmiştir. Daha sonra metin madenciliği konusunda, geleneksel bilgiye erişim çalışmalarının eksiklerini gideren Latent Semantic Indexing (LSI) yöntemi kullanılmıştır. LSI yöntemi K-Means ve K-Median algoritmalarını kullanarak gerek metinlerden gerekse bu metinlerde geçen terimlerden temel kavram vektörleri oluşturup her bir metnin ve terimin bu vektörler üzerindeki iz düşümünü hesaplar. Çalışmada TF, TF-IDF ve LSI kullanıldığında K-Means ve K-Median algoritmalarının başarıları karşılaştırılmıştır. K-Means algoritmasının kümeleme başarısı, K-Median algoritmasından daha iyi çıkmıştır. Veri seti olarak bu çalışmada oluşturulan Milliyet gazetesi veri seti ve literatürde sıklıkla kullanılan R8 ve WebKB-4 veri setleri kullanılmıştır. Milliyet gazetesi veri setinde sağlık, siyaset ve futbol adlı üç alt başlık bulunmaktadır. R8 veri seti Reuters-21578 içinde bulunmakta ve sekiz sınıf içermektedir. WebKB-4 veri seti farklı üniversitelerin bilgisayar bilimleri bölümlerinden toplanan web sayfaları kullanılarak oluşturulmuş ve dört sınıf içermektedir. Çalışma Microsoft .NET ortamında C# dili kullanılarak gerçekleştirilmiştir.
dc.description.abstract	Today, large amounts of data are stored as documents on the Internet. The main problem is to extract important information from these data and to discover previously unknown patterns. One approach to this problem is to use clustering techniques to group the relationships among documents and thereby find the relationships and patterns between different groups. Cluster analysis was developed to describe the classification of objects in detail. To this end, elements are divided into groups according to their similarities. Another goal is to reduce the size of the data set by grouping similar elements. The aim of this study is to obtain the required information by clustering the data in English and Turkish texts under specific topics using partitional clustering techniques. In the study, all texts are represented as Term Frequency-Inverse Document Frequency (TF-IDF) vectors. The Latent Semantic Indexing (LSI) method, which addresses shortcomings of traditional information retrieval approaches in text mining, is then applied. Using the K-Means and K-Median algorithms, the LSI method builds basic concept vectors from both the texts and the terms occurring in them, and computes the projection of each text and term onto these vectors. The study compares the performance of the K-Means and K-Median algorithms when TF, TF-IDF, and LSI are used. K-Means achieved better clustering performance than K-Median. The data sets used are the Milliyet newspaper data set created in this study and the R8 and WebKB-4 data sets, which are frequently used in the literature. The Milliyet newspaper data set has three subtopics: health, politics, and football. The R8 data set is a subset of Reuters-21578 and contains eight classes. The WebKB-4 data set was built from web pages collected from the computer science departments of different universities and contains four classes. The study was implemented in C# on the Microsoft .NET platform.	en_US
dc.language	Turkish
dc.language.iso	tr
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Metin madenciliği ile doküman demetleme
dc.title.alternative	Document clustering using text mining
dc.type	masterThesis
dc.date.updated	2018-08-06
dc.contributor.department	Bilgisayar Bilimleri Anabilim Dalı
dc.identifier.yokid	418098
dc.publisher.institute	Bilişim Enstitüsü
dc.publisher.university	GAZİ ÜNİVERSİTESİ
dc.identifier.thesisid	316607
dc.description.pages	100
dc.publisher.discipline	Diğer

Files in this item

Name: yokAcikBilim_418098.pdf
Size: 1.919Mb
Format: PDF
Description: File_418098

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess