Evaluation of performance metrics and test techniques on different data sets in machine learning classification methods

Alan, Abdullah

Date

2020

Abstract

One of the most important problems in building a model with machine learning classification methods is selecting the best classifier. For that selection, it is essential that the data set be split correctly into a training portion, used to build the model, and a test portion, used to test it. In this study, the hold-out method and 10-fold cross-validation were used for the split. Once a model has been built, a set of metrics is used to evaluate the classifier's performance. In this thesis, nine classifiers were applied to 32 data sets that differ from one another in data distribution and in decision-class distribution. The models were built with the open-source Python programming language and the Sklearn, Pandas, Numpy, Seaborn and Matplotlib libraries, using both hold-out and cross-validation with each classifier. Classifier performance was evaluated with the confusion matrix, from which accuracy, precision, recall, F1, MCC and AUC values were calculated for each model. On imbalanced data sets, MCC and AUC were observed to give more accurate results for classifier selection. Although the hold-out method gave better results than cross-validation on twenty of the data sets, cross-validation was preferred for model selection, because with hold-out the model may face, during training or testing, a data distribution it has never seen. The models built with the travel insurance and pH recognition data sets show that, in some cases, a high performance value is actually worse than a low one.
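The evaluation procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the synthetic imbalanced data set, the 70/30 stratified hold-out split, the choice of logistic regression, and the helper names `evaluate`, `hold_out` and `cross_validate_10` are all assumptions made for illustration. A majority-class baseline is included to show how accuracy can look high on imbalanced data while MCC and AUC stay at chance level.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, train_test_split

# Assumption: a synthetic data set with roughly 90% / 10% classes stands in
# for the imbalanced data sets used in the thesis.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training part and compute the six metrics on the test part."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # positive-class score for AUC
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }


def hold_out(model, X, y, test_size=0.3, seed=0):
    """Single stratified train/test split (the 70/30 ratio is an assumption)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    return evaluate(clone(model), X_tr, y_tr, X_te, y_te)


def cross_validate_10(model, X, y, seed=0):
    """10-fold stratified cross-validation; each metric is averaged over folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = [evaluate(clone(model), X[tr], y[tr], X[te], y[te])
             for tr, te in skf.split(X, y)]
    return {name: float(np.mean([f[name] for f in folds])) for name in folds[0]}


baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
classifier = LogisticRegression(max_iter=1000)        # assumed example classifier

base_cv = cross_validate_10(baseline, X, y)
lr_ho = hold_out(classifier, X, y)
lr_cv = cross_validate_10(classifier, X, y)

for label, scores in [("baseline, 10-fold CV", base_cv),
                      ("logistic regression, hold-out", lr_ho),
                      ("logistic regression, 10-fold CV", lr_cv)]:
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

In this sketch the baseline never predicts the minority class, so its accuracy simply mirrors the class ratio while its MCC is 0 and its AUC is 0.5. The hold-out scores depend on one particular split, whereas the cross-validation scores average over ten splits in which every sample is used for testing exactly once.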

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/400966

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess