Machine learning approach for external fraud detection

Mubalaike, Aji

dc.contributor.advisor	Karaçuha, Ertuğrul
dc.contributor.advisor	Adalı, Eşref
dc.contributor.author	Mubalaike, Aji
dc.date.accessioned	2020-12-07T10:00:26Z
dc.date.available	2020-12-07T10:00:26Z
dc.date.submitted	2018
dc.date.issued	2019-05-07
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/127920
dc.description.abstract	The first computers were standalone machines with no connection to the outside world, so they had no security problems. Over time, computer networks, the Internet, and wireless networks became serious threats to information and computer security. The Internet in particular has opened significant gaps in information and computer security. The main reason is that the Internet grew out of a closed network structure (ARPANET) that was opened up to everyone. Attacks on computer systems are generally divided into two classes: internal attacks and external attacks. This thesis focuses on the detection of external attacks. The first attack on computers was carried out in 1971 with the Creeper virus. Around 1985, as personal computers began to spread, attacks targeting them started to appear. The Michelangelo virus, developed in 1991, was intended to corrupt the DOS operating system. Melissa (1999) and I Love You (2000), which came later, are examples of widespread and effective attacks. Attacks target not only servers and personal computers but also systems. For example, Stuxnet, created to obstruct Iran's nuclear program, caused serious damage to a SCADA system (2010). The virus, introduced into the SCADA system through a USB connection, took over all of the system's resources and disabled the alarm systems. Because the operators were unaware of the situation, they could not understand what was happening, and the facility suffered serious damage as a result. The Cryptolocker virus, which appeared in 2013, encrypted files stored on servers and blocked their use.
The key needed to lift the block was then handed over in exchange for payment. Attacks intended to leave Turkish institutions and organizations unable to provide services, carried out in retaliation for the downing of a Russian warplane, were seen recently. Given that, according to 2014 figures, 25 billion US dollars were spent on protection against these programs aimed at damaging information systems, and that the damage caused by such software was 491 billion US dollars, it is clear how important external attacks on information systems are. When classified by purpose, external attacks fall into three classes: 1) damaging the system, 2) preventing the system from operating, and 3) gaining a benefit. The attack on the Iranian SCADA system belongs to the first class. The DDoS attacks against Turkish institutions and organizations belong to the second class. Many examples can be given for the third class; among them, attacks on banks and theft of money from customer accounts are the most common. The scope of this thesis is the examination and detection of threats and possible attacks coming from outside. To reach this goal, external attacks were first examined and the damage they can cause was evaluated qualitatively and quantitatively. In the second stage, the question of how external attacks can be detected was addressed. Methods and algorithms developed to detect threats and attacks were examined. In the third stage, an attempt was made to build a data set on external attacks. Work began with the PaySim mobile money simulator data set and continued with the NSL-KDD data set. However, these two data sets were not considered sufficient, so the data set prepared by the Canadian Institute for Cybersecurity was adopted. Six algorithms were tried on all three data sets.
These algorithms are K-Nearest Neighbor (KNN), Random Forest (RF), AdaBoost, Logistic Regression (LR), Multinomial Naive Bayes (MNB), and Stochastic Gradient Descent (SGD). The experiments showed that the RF algorithm gave the best results. Comparisons were made by computing accuracy and the F1 measure. For the same data set, the accuracy values (out of 100) were RF: 100, AdaBoost: 99.99, KNN: 99.97, LR: 98.1, MNB: 96.79, and SGD: 96.87. The F1 values (out of 100) were RF: 99.97, AdaBoost: 99.85, KNN: 99.64, LR: 62.35, MNB: 49.03, and SGD: 32.80. After the best algorithm was identified, feature selection was carried out to make the algorithm run faster. The 79 features in the data set were reduced to 14. The selected features are: Destination Port, Init_Win_bytes_forward, Init_Win_bytes_backward, Flow IAT Min, Fwd IAT Min, Bwd IAT Min, Average Packet Size, Bwd Packet Length Std, Fwd Packet Length Std, Packet Length Std, Total Backward Packets, Total Length of Bwd Packets, Min_seg_size_forward, and Label. Using the selected features, the performance of the different algorithms was then measured and the results were compared with the values reported by other researchers. For each method, recall, precision, and the F1 measure were computed during the comparison.
Values are given as percentages. Precision, other researchers: KNN: 96, RF: 98, AdaBoost: 77, NB: 88, MLP: 77, and ID3: 98; this thesis: KNN: 99.9, RF: 99, AdaBoost: 99.9, LR: 95.8, MNB: 66.9, and SGD: 96.8. Recall, other researchers: KNN: 96, RF: 97, AdaBoost: 84, NB: 04, MLP: 83, and ID3: 98; this thesis: KNN: 99, RF: 99.4, AdaBoost: 97, LR: 97, MNB: 66.8, and SGD: 89. F1 measure, other researchers: KNN: 96, RF: 97, AdaBoost: 77, NB: 04, MLP: 76, and ID3: 98; this thesis: KNN: 99.6, RF: 99.9, AdaBoost: 99.8, LR: 62.3, MNB: 49, and SGD: 32.8. In conclusion, the results obtained in this thesis are shown to be more successful than those of the studies carried out by other researchers.
dc.description.abstract	A brief timeline of noteworthy fraudulent incidents contains hundreds of entries from the last few decades. One less well-known fact is that the first computer virus, the Creeper virus, appeared in 1971, some years after the invention of the first general-purpose electronic computer, ENIAC, in 1945. In 1991 the Michelangelo virus, designed to infect DOS systems, was perceived at the time as a digital apocalypse. Fortunately, its impact was smaller than feared, but it was one of the first widespread viruses that the general public learned about and came to fear. A few years later, in 1999, came the Melissa worm, an email-based worm that targeted Microsoft Outlook and spread at great speed. One year later, in 2000, the ILoveYou worm, similar in nature and scope, became one of the most damaging and fastest-spreading incidents of all time: within the first few hours of its release it had spread to millions of computers around the world. A decade later, in 2010, the malicious worm Stuxnet was uncovered. It was created to attack SCADA systems and caused substantial damage to Iran's nuclear program. It initially entered the network through an infected USB drive and from there quickly mapped the internal network and its resources. The operators never knew that the equipment was spinning out of control because the alarms had been disabled. In this way it destroyed a large part of the infrastructure of the nuclear facility and set the program back many years. In 2013 came the Cryptolocker virus, which caused millions of dollars in losses to various companies.
When Cryptolocker is activated, the malware uses RSA public-key cryptography to encrypt certain types of document files stored on the drives of a local network, while the private key is kept only on the malware's control servers. The attackers offer to decrypt the data only if the payment they demand is made by a stated deadline, and they threaten to delete the private key once the deadline passes. More recently, in 2016, came the Locky ransomware, which is very similar to Cryptolocker and has more than 60 derivatives, wreaking financial havoc on systems, companies large and small, law enforcement agencies, and others. Fraudulent incidents in the form of malware, whether viruses, worms, or other kinds, have thus been around for over four decades, so it is essential to detect and prevent such large-scale breaches efficiently. The cost of malware infections must also be considered. In 2014, $491 billion was spent on recovery from malware infections and $25 billion was spent by consumers as a result of security threats. These are enormous sums with a large impact on the global economy. Perhaps even more striking, specialists spent 1.2 billion hours dealing with the after-effects of malware infections. Fraud is therefore not just an annoyance. It is big business, and it costs companies and consumers a great deal of money to combat malware infections. Nor is it merely hackers and scripts meant to inconvenience people: it is criminal organizations running a highly organized and intentional process.
Whether it takes the form of Cryptolocker and other ransomware or of stealing information, proprietary secrets, and competitive advantages from inside companies, malware is a large and costly business. To put an organization and its IT infrastructure in the best position to ward off malware infections, and ideally to prevent them from occurring in the first place, it is important to understand how malware can affect a PC or security systems such as an IDS, how it can enter the network, how it can affect the organization, and what must be done in response. It is vital that everyone understands the true nature of this threat and takes appropriate measures to mitigate or minimize the risks. In addition to these staggering costs, all kinds of malware should be detected and prevented as quickly and efficiently as possible. Anti-virus and anti-malware software is generally considered fairly effective, but once installed it must be kept up to date, and users must still take precautions. Unlike a firewall, it cannot prevent a system from being hacked or intruded upon. Intrusion detection and prevention systems (IDS and IPS) are another cost-effective countermeasure designed to detect, prevent, or block fraudulent malicious activity across the network. When an IDS or IPS identifies abnormal traffic, it writes the suspicious activity to log files, sends event notifications, and takes preventive measures.
However, serious drawbacks, such as the misclassification of genuine traffic as anomalous and the inability to recognize unknown attacks, make intrusion detection and prevention systems inefficient. For these reasons, this research uses a combination of two types of IDS, network-based and anomaly-based, as its methodology. Network-based IDSs are positioned within the network and detect abnormal malicious traffic by examining passing network transactions. Anomaly-based IDSs handle unknown attack traffic: because they do not rely on signatures, they can detect unknown external frauds. Several properties of this combined IDS can contribute to detecting external frauds that have not been identified previously and to minimizing the false positive rate. Machine learning, a subset of artificial intelligence, has gained significant attention in the past few decades. With machine learning techniques, a tremendous amount of network traffic data can be analyzed quickly and with high performance, and a reliable model for external fraud detection and classification can be built. Taking all these factors into account, this thesis presents a comprehensive review of external fraudulent attacks and the corresponding detection systems, and reports a set of experiments analyzing the performance of supervised machine learning techniques.	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Machine learning approach for external fraud detection
dc.title.alternative	Dış saldırıların belirlenmesi için makine öğrenimi yaklaşımı
dc.type	masterThesis
dc.date.updated	2019-05-07
dc.contributor.department	Bilişim Uygulamaları Anabilim Dalı
dc.identifier.yokid	10228449
dc.publisher.institute	Bilişim Enstitüsü
dc.publisher.university	İSTANBUL TEKNİK ÜNİVERSİTESİ
dc.identifier.thesisid	529094
dc.description.pages	85
dc.publisher.discipline	Bilgi Güvenliği Mühendisliği ve Kriptografi Bilim Dalı

Files in this item

Name: yokAcikBilim_10228449.pdf
Size: 2.183Mb
Format: PDF
Description: File_10228449

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess