Eksik değerleri en olası değer ile doldurmanın sınıflandırma algoritmaları üzerinden karşılaştırılması

Keklik, Çağdaş

dc.contributor.advisor	Örencik, Cengiz
dc.contributor.author	Keklik, Çağdaş
dc.date.accessioned	2021-05-09T09:42:08Z
dc.date.available	2021-05-09T09:42:08Z
dc.date.submitted	2017
dc.date.issued	2018-08-06
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/709470
dc.description.abstract	In today's information age, data mining stands out as one of the most fundamental machine learning methods. As computers become steadily cheaper and their performance keeps rising, very large amounts of data can be stored on them. Data mining is regarded as almost the only way to obtain meaningful information from data of this volume and variety. For this reason, being able to use methods that can process large amounts of data is of vital importance. The main purpose of data mining is to find and extract the patterns and trends hidden in large bodies of data. Although the data held in very large data warehouses may seem worthless when used individually, they become meaningful when gathered and used toward a goal. The main aim is to turn data into useful information, and this is achieved through data mining. In short, the essence of data mining is the method by which data are processed. Data mining is the way to make the most effective use of the data that keep growing worldwide and have reached enormous volumes. As in other fields, this has attracted great interest in medicine. One of the most basic problems encountered in data mining is preparing the data being worked on. Some rows of the data may contain missing values. When these values are missing, that data cannot be processed or compared with other values. This thesis analyzes how filling these missing values with the most suitable probable values, and then processing the data, affects the results. 
Three scenarios were tested separately and their relative merits evaluated: ignoring all rows that contain missing values; ignoring rows that contain more missing values than a given threshold and filling the remaining values with probable estimates; and including all data in the analysis by filling every missing value with the most suitable probable value. Cancer data were chosen as the sample test set for these analyses. Starting with the definition of data mining, the thesis aims to examine cancer within this scope using data mining techniques and algorithms, to support early diagnosis, and to compare the performance of these algorithms based on the output obtained with the Weka program. The cancer data in the Wisconsin data set are examined. Models were built using J48, a decision tree algorithm; Naive Bayes, a Bayesian classification algorithm; logistic regression, a regression-based algorithm; and KStar, an instance-based classification algorithm, and the performance of the resulting models was compared.
dc.description.abstract	In the information age, data mining stands out as one of the most fundamental machine learning methods. The continuous increase in the computing power and storage capacity of computers has driven rapid development in data analytics and data mining, producing a large body of research and methods in the field. The main aim of data mining is to extract valuable knowledge from large amounts of diverse data for use in decision making. Data mining can be applied to tasks such as predicting future events, describing interesting patterns, or clustering similar data elements. While individual data elements have little or no value, large amounts of data collected together become quite valuable. Valuable, goal-oriented knowledge can be extracted from such data through data mining methods. The continuous rise of data production worldwide requires efficient data mining tools to handle these huge amounts of data. As in several other fields, data mining has therefore become an essential part of medical research. One of the fundamental problems in data mining is preparing and preprocessing the data for the mining operation. In this context, missing values are an important issue. The collected data may contain missing fields, and records with null values cannot be compared on those fields. A possible solution is to fill the missing values with the best-fitting value. In this thesis, we compare three scenarios: in the first, we omit all rows that contain any missing value; in the second, we omit the rows whose number of missing values exceeds a threshold and fill the rest with best-fitting values; and in the third, we fill all missing values with the best fit. 
We then compare the success rates of these scenarios using different algorithms and different success metrics. A cancer data set is used as the test set in these analyses. Starting from the definition of data mining, we explain some well-known data mining algorithms. Next, we apply those techniques to publicly available health record data to predict cancer-related diseases, and we analyze and compare the performance of the different methods using the Weka software. In this thesis, the breast cancer data of the Wisconsin data set is used as the publicly available health record data. For the algorithms, we select J48 as a decision-tree-based approach, Naive Bayes as a Bayesian classification approach, logistic regression as a regression-based approach, and the KStar algorithm as an instance-based classification method. The performance of each test scenario is compared according to accuracy and efficiency metrics.	en_US
dc.language	Turkish
dc.language.iso	tr
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Eksik değerleri en olası değer ile doldurmanın sınıflandırma algoritmaları üzerinden karşılaştırılması
dc.title.alternative	Comparison of filling missing values with the best fit over classification algorithms
dc.type	masterThesis
dc.date.updated	2018-08-06
dc.contributor.department	Bilgisayar Mühendisliği Ana Bilim Dalı
dc.identifier.yokid	10054553
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	BEYKENT ÜNİVERSİTESİ
dc.identifier.thesisid	459256
dc.description.pages	102
dc.publisher.discipline	Bilgisayar Mühendisliği Bilim Dalı
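
The three missing-value scenarios described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the thesis works in Weka, whereas this sketch uses plain Python; "best-fitting value" is assumed here to mean the most frequent observed value in a column; and the names `MISSING`, `column_modes`, `drop_incomplete`, `drop_then_fill`, `fill_all`, and `max_missing` are hypothetical.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a missing cell


def column_modes(rows):
    """Most frequent observed value per column (assumed 'best fit')."""
    modes = []
    for col in zip(*rows):
        observed = [v for v in col if v is not MISSING]
        # Ties resolve to the value seen first; an all-missing column stays missing.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return modes


def drop_incomplete(rows):
    """Scenario 1: discard every row that has any missing value."""
    return [r for r in rows if all(v is not MISSING for v in r)]


def fill_all(rows):
    """Scenario 3: keep every row and fill each gap with the column mode."""
    modes = column_modes(rows)
    return [[modes[i] if v is MISSING else v for i, v in enumerate(r)] for r in rows]


def drop_then_fill(rows, max_missing):
    """Scenario 2: drop rows with more than max_missing gaps, fill the rest.

    Assumption: fill values are computed from the rows that survive the drop.
    """
    kept = [r for r in rows if sum(v is MISSING for v in r) <= max_missing]
    return fill_all(kept)


rows = [
    [1, 2, 1],
    [1, MISSING, 2],
    [MISSING, MISSING, 2],
    [3, 2, 2],
]
print(drop_incomplete(rows))    # [[1, 2, 1], [3, 2, 2]]
print(drop_then_fill(rows, 1))  # [[1, 2, 1], [1, 2, 2], [3, 2, 2]]
print(fill_all(rows))           # [[1, 2, 1], [1, 2, 2], [1, 2, 2], [3, 2, 2]]
```

Each resulting table would then be passed to the same classifiers so that accuracy can be compared across scenarios.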

Files in this item

Name:: yokAcikBilim_10054553.pdf
Size:: 2.747Mb
Format:: PDF
Description:: File_10054553

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess