Row and column selection algorithm for SVR model estimation on large scale business problems

Yaman, Kübra

dc.contributor.advisor	Ali, Fatma Özden
dc.contributor.author	Yaman, Kübra
dc.date.accessioned	2020-12-08T08:02:18Z
dc.date.available	2020-12-08T08:02:18Z
dc.date.submitted	2010
dc.date.issued	2018-08-06
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/170066
dc.description.abstract	Bu çalışmada çok büyük veri setleri için Destek Vektör Regresyon (DVR) modelinin kurulabilmesini mümkün kılmak için önemli nokta ve değişkenleri seçen bir algoritma geliştirilmiştir. İki aşamalı bu yöntemde, yani satır ve sütun seçme algoritmasında, hem satır hem de sütun seçiminde L1-norm düzenlemeli ε-DVR modelleri kurulmuştur. İlk aşama, eğitim veri setinin destek vektörlerinin ağırlıklarını cezalandırarak veri setinin önemli noktalarından en az sayıda destek vektörünü seçer ve bu seçilen noktaları yeni eğitim veri setine dahil eder. Seçilen bu destek vektörlerinden oluşturulan yeni eğitim veri seti daha sonra ikinci aşamada değişken ağırlıklarını cezalandırarak eğitim veri setinde tutulacak olan değişken alt kümesinin seçiminde kullanılır. Seçilen satır ve tüm değişkenleri içeren eğitim veri seti ile çalıştırılıp kurulan Radyal Tabanlı İşlev (RTİ) çekirdekli DVR modellerinin test veri seti üzerindeki doğruluğunun, karşılaştırma yapılan yani seçilen satır sayısı kadar satırla tüm değişkenleri içeren rassal örneklem veri setinden ve SVMTorch algoritması ile oluşturulan modellerden önemli ölçüde daha iyi olduğu gözlenmiştir. Bu tezin katkısı oldukça büyük veri setlerini kullanarak doğru ve düşük karmaşıklık içeren DVR modellerinin kurulmasını kolaylaştıran bir algoritma geliştirmesidir. Bu çalışmada önerilen algoritma veri setlerinin önemli gözlem ve değişkenlerini seçip onları tahmin modelinde kullanmayı mümkün kılmıştır. Deneysel sonuçlar satır ve sütun seçme algoritmasının etkili bir şekilde çalıştığını ve gereksiz değişkenlerin varlığında değişken sayısını önemli ölçüde azaltırken RTİ-DVR modellerinin genelleme hatasını iyileştirdiğini kanıtlamıştır. Bu çalışmada ayrıca seçilen noktaların diğer noktalardan nasıl farklı olduğunu anlayabilmek için seçilen noktaların tahmin çizgisine olan uzaklıklarına, hedef değere ve veri kümesinin değişkenlerine göre nasıl dağıldıkları analiz edilmiştir. 
Yapılan analizler sonucunda, L1-normlu ε-DVR'nin standart ε-DVR'ye göre çok daha seyrek bir çözüm sunduğu gözlenmiştir. Ayrıca L1-normlu ε-DVR'de uç noktalardaki hedef değerlere sahip olan gözlemlerin seçilmesi ortalama hedef değerlere sahip olan gözlemlerden daha olasıdır. Standart ε-DVR'nin aksine, L1-normlu ε-DVR algoritmasının destek vektörleri ε-tüpünün içinde ve dışında olabilir. Bunlara ilaveten, seçilen sütunlar arasındaki düşük çoklu doğrusal bağıntı algoritmamızın ikinci kısmını oluşturan değişken seçimi prosedürünün doğru bir şekilde çalıştığını desteklemektedir. Son olarak, seçilen noktalarla değişken değerleri arasındaki ilişki incelenmiş ve bu analizin sonucunda satır ve sütun seçme algoritmasının nokta seçimini literatürdeki bazı ön bilgilere dayalı yaptığı gözlenmiştir.
dc.description.abstract	This study introduces an algorithm that selects important observations and variables so that SVR models can be estimated on very large data sets. In this two-stage methodology, the Row and Column Selection Algorithm, ε-SVR models with L1-norm regularization are used to select both rows and columns. The first stage penalizes the support vector weights to identify a small number of support vectors as the important points to include in the training data set. These support vectors are then used in the second stage, which penalizes the variable weights to select the subset of variables to keep in the training data. On a holdout test set, RBF-SVR models trained on the selected rows with all variables are significantly more accurate than the same model trained on the benchmark, a randomly sampled data set of the same size with all variables, and than models built with SVMTorch. The contribution of this thesis is an algorithm that facilitates estimating accurate, low-complexity SVR models on very large data sets. With the proposed algorithm, it is possible to select the important observations and variables and use them for estimation. The experimental results show that the resulting training data set works effectively, dramatically reducing the number of variables while improving the generalization error of the RBF-SVR models in the presence of redundant variables. Furthermore, we investigate how the selected points differ from the others by analyzing their distribution with respect to their distance from the prediction line, the target values, and the input variables of the data set. This analysis demonstrates that L1-norm ε-SVR provides a much sparser solution than standard ε-SVR. Further, observations with extreme target values are more likely to be selected than observations with average target values. 
Interestingly, in contrast to standard ε-SVR, the support vectors of L1-norm ε-SVR can be located both inside and outside the ε-tube. Moreover, the low multicollinearity among the selected columns lends face validity to the variable selection procedure, which is the second stage of the proposed algorithm. Lastly, we identify which points are selected with respect to the variables' values. The result of this analysis indicates that the row and column selection algorithm selects observations in a way that is consistent with background knowledge from the literature.	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Endüstri ve Endüstri Mühendisliği	tr_TR
dc.subject	Industrial and Industrial Engineering	en_US
dc.title	Row and column selection algorithm for SVR model estimation on large scale business problems
dc.title.alternative	Büyük veri setlerinde destek vektör regresyonu için sütun ve satır seçme yöntemi
dc.type	masterThesis
dc.date.updated	2018-08-06
dc.contributor.department	Endüstri Mühendisliği Anabilim Dalı
dc.subject.ytm	Variable selection
dc.subject.ytm	Sampling
dc.subject.ytm	Sales forecast
dc.subject.ytm	Data mining
dc.identifier.yokid	384336
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	KOÇ ÜNİVERSİTESİ
dc.identifier.thesisid	276945
dc.description.pages	73
dc.publisher.discipline	Diğer
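
The two-stage procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the common linear-programming form of L1-norm ε-SVR (minimize the L1 norm of the weights plus C times the ε-insensitive slacks), an RBF kernel expansion in the first stage, and a linear model in the second stage. The function names (rbf_kernel, l1_eps_svr, select_rows_and_columns), the parameter values, and the tolerance tol are hypothetical choices made for illustration only.

```python
import numpy as np
from scipy.optimize import linprog


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def l1_eps_svr(Phi, y, C=10.0, eps=0.1):
    """Assumed LP form of L1-norm eps-SVR:
    minimize ||beta||_1 + C * sum(xi + xi_star)
    subject to |y - Phi @ beta - b| <= eps + slack, slack >= 0."""
    n, m = Phi.shape
    # Non-negative variables: beta_plus (m), beta_minus (m), b_plus, b_minus,
    # xi (n), xi_star (n). beta = beta_plus - beta_minus, b = b_plus - b_minus.
    cost = np.concatenate([np.ones(2 * m), [0.0, 0.0], C * np.ones(2 * n)])
    ones = np.ones((n, 1))
    eye = np.eye(n)
    zeros = np.zeros((n, n))
    # y - f(x) <= eps + xi       rewritten as  -f(x) - xi      <= eps - y
    upper = np.hstack([-Phi, Phi, -ones, ones, -eye, zeros])
    # f(x) - y <= eps + xi_star  rewritten as   f(x) - xi_star <= eps + y
    lower = np.hstack([Phi, -Phi, ones, -ones, zeros, -eye])
    res = linprog(
        cost,
        A_ub=np.vstack([upper, lower]),
        b_ub=np.concatenate([eps - y, eps + y]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    beta = res.x[:m] - res.x[m:2 * m]
    b = res.x[2 * m] - res.x[2 * m + 1]
    return beta, b


def select_rows_and_columns(X, y, C=10.0, eps=0.1, gamma=0.1, tol=1e-6):
    # Stage 1 (rows): penalize the kernel expansion weights; observations with
    # a nonzero weight are kept as the important points.
    beta, _ = l1_eps_svr(rbf_kernel(X, X, gamma), y, C, eps)
    rows = np.flatnonzero(np.abs(beta) > tol)
    # Stage 2 (columns): on the selected rows only, penalize the variable
    # weights of a linear model (an assumption); nonzero weights mark the
    # variables to keep.
    w, _ = l1_eps_svr(X[rows], y[rows], C, eps)
    cols = np.flatnonzero(np.abs(w) > tol)
    return rows, cols


if __name__ == "__main__":
    # Synthetic data: only the first two of five variables drive the target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.05 * rng.normal(size=40)
    rows, cols = select_rows_and_columns(X, y)
    print("selected rows:", rows)
    print("selected columns:", cols)
```

In the workflow the abstract describes, an RBF-SVR model would then be trained on the reduced data set. Solving one dense linear program over all observations, as this sketch does, does not scale to very large data sets, so it should be read only as a small-scale illustration of the selection idea.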

Files in this item

Name: yokAcikBilim_384336.pdf
Size: 413.2Kb
Format: PDF
Description: File_384336

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess