Document clustering with text mining (Metin madenciliği ile doküman demetleme)

M.Taha, Syolai

Date

2011

Author

M.Taha, Syolai

Abstract

Today, large amounts of data are stored as documents on the Internet. The main problem is extracting important information from these data and finding undiscovered patterns. One approach to this problem is to group documents with clustering techniques and then find the relationships and patterns among the resulting groups. Cluster analysis was developed to describe the classification of objects in detail. To that end, elements are divided into groups according to their similarities. A further goal is to reduce the size of the data set by grouping similar elements. The aim of this study is to obtain the required information by clustering English and Turkish texts under specific topics using partitional clustering techniques. All texts in the study are represented as Term Frequency-Inverse Document Frequency (TF-IDF) vectors. The Latent Semantic Indexing (LSI) method, which addresses shortcomings of traditional information retrieval in text mining, is then applied. Using the K-Means and K-Median algorithms, the LSI method builds basic concept vectors from both the texts and the terms occurring in them, and computes the projection of each text and term onto these vectors. The study compares the performance of the K-Means and K-Median algorithms when TF, TF-IDF and LSI are used. K-Means achieved better clustering performance than K-Median. The data sets used are the Milliyet newspaper data set, created in this study, and the R8 and WebKB-4 data sets, which are frequently used in the literature. The Milliyet newspaper data set has three categories: health, politics and football. The R8 data set is part of Reuters-21578 and contains eight classes. The WebKB-4 data set was built from web pages collected from the computer science departments of different universities and contains four classes. The study was implemented in C# on the Microsoft .NET platform.
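The pipeline the abstract describes (TF-IDF vectors followed by partitional clustering) can be illustrated with a minimal sketch. This is not the thesis's C# implementation. It is a Python toy written under stated assumptions: the four tokenized documents, the function names, the choice of the first k vectors as initial centers, and the pairing of squared Euclidean distance with K-Means and Manhattan distance with K-Median are all illustrative choices that do not come from the record. The LSI step is left out because it needs a singular value decomposition.

```python
import math
from collections import Counter
from statistics import median


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[t] / len(doc) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def mean(values):
    return sum(values) / len(values)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def partition(vectors, k, center_fn, dist_fn, iters=20):
    """Generic partitional clustering loop.

    Assumption: the first k vectors are used as initial centers, which keeps
    the toy example deterministic. center_fn is applied per coordinate.
    """
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: dist_fn(v, centers[c])) for v in vectors
        ]
        # Update step: recompute each center from its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [center_fn(column) for column in zip(*members)]
    return labels


# Hypothetical toy corpus with two topics (football and health).
docs = [
    "football match goal team".split(),
    "health doctor hospital patient".split(),
    "football team goal league".split(),
    "health hospital doctor treatment".split(),
]

vocab, vectors = tfidf_vectors(docs)
kmeans_labels = partition(vectors, 2, mean, squared_euclidean)
kmedian_labels = partition(vectors, 2, median, manhattan)
print("K-Means labels: ", kmeans_labels)
print("K-Median labels:", kmedian_labels)
```

On this toy corpus both variants separate the football documents from the health documents. The only differences between the two variants here are the center update (coordinate-wise mean versus coordinate-wise median) and the distance function, which is why a single loop can serve both.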

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/362989

Collections

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess