An open-source, machine learning based intrusion detection system

Arslan Tüver, Zemre

dc.contributor.advisor	Özdemir, Enver
dc.contributor.author	Arslan Tüver, Zemre
dc.date.accessioned	2020-12-07T09:55:42Z
dc.date.available	2020-12-07T09:55:42Z
dc.date.submitted	2019
dc.date.issued	2019-12-13
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/127247
dc.description.abstract	When we examine current forecasts of global network demand, an exponential growth trend is undeniable. These high-volume data demands oblige network operators to design their systems to respond to changing demand. As a result, securing such complex network systems, or deciding and acting quickly during an attack, becomes a harder task. Although network operators today are better prepared against intruders, attackers do not stand idle either; they develop new methods and discover new ways to find the vulnerabilities of target systems. One of the best-known attack types is the Denial-of-Service (DoS) attack. DoS attacks are carried out by attacking the vulnerabilities of the target system through enslaved, internet-connected devices. Denial-of-service attacks aim to make the resources of the target system unreachable for legitimate users. A denial-of-service attack is carried out by devices enslaved by a Command and Control server. Distributed Denial-of-Service attacks are a kind of denial-of-service attack, carried out by generating a large amount of traffic with multiple slave devices from various locations in order to exhaust the resources of the target system. The damage caused by these attacks can be irreversible, especially for companies that must provide service continuity. If companies have not defined a protection strategy against DoS attacks, the economic scale of the damage can be very large, because DoS attacks can lead to data theft and therefore to a loss of user trust. The benefits of identifying or predicting a possible attack in advance are indisputable. Unfortunately, however, it is nearly impossible to identify the true source of an attack, especially while the attack is under way. 
The reason is that the Internet Protocol (IP) addresses of the attacking computers are hidden. Solutions such as installing firewall devices and increasing network capacity are clearly among the actions recommended to protect against attacks. However, a good understanding of incoming traffic trends can be a good way to guard against possible attacks. This can be achieved by continuously monitoring incoming and outgoing traffic and determining which types of traffic can be classified as normal or abnormal. In addition, it has been argued that if a baseline can be derived for general traffic behavior, unwanted traffic can be blocked at the very outset. Although the methods mentioned can be considered strong protections, systems can always become victims of DDoS attacks. This can be overcome by intelligently designed, protective early-detection systems tailored to the target system. In this work, we propose an intelligent, Machine Learning based system for detecting potential attacks. We believe that this system will let network operators obtain some clues in advance about the location of the attack source. We are designing our system as a live traffic monitoring system that will listen to and monitor incoming and outgoing traffic, learn from this data, and flag problematic clients as abnormal. In this way, network operators can gain broad knowledge of what is happening in the existing network. As mentioned earlier, during a distributed attack it is nearly impossible to find the true attack source in a heavily loaded network, because the sources can disguise themselves as real users. For this reason, it is not easy to cut the connection with the location the attack comes from. However, if the location of the attack can be known or estimated, operators can cut the connection and reduce the load. 
During this time, they can also look for solutions to keep the attack from doing irreversible damage to their systems. Our intrusion detection system is designed to be open-source, scalable and distributed in order to process large amounts of data. The system is designed to be customizable and scalable because we want the proposed system to be easy to set up in front of any kind of network system, whether it carries heavy or sparse traffic. At specific time intervals, the system will collect raw network data in Apache Kafka topics; Kafka consumers will process this raw data and send the processed data to the Machine Learning Engine. The Machine Learning Engine will apply unsupervised learning algorithms to the incoming data and write the results back to the related topics in Kafka. In addition, the Intrusion Classifier, which we introduce as a Kafka consumer, will decide whether the suspicious data is really malicious. After the Intrusion Classifier's decision, the results will be written to Elasticsearch for use by system administrators. Kibana interfaces will be used to present and visualize the results. In our scenario, we assume that there are N regions permitted to communicate with our system. Each region has its own Digital Subscriber Line Access Multiplexer (DSLAM) device. The model we propose will log incoming traffic together with its region. Using machine learning algorithms, our system will learn the model from the logged incoming traffic data and flag the regions that host potential slave devices. We believe that machine learning techniques can give us broad insight into suspicious clients. For a sample proof of concept, working with a dataset that contains real slave-device behavior will yield more realistic results. 
The datasets used for detecting slave computers are generally created synthetically in a computer environment. We chose not to use a computer-generated dataset, because we do not believe that studies carried out with such synthetic data yield realistic results. In our study we used the CTU-13 dataset, which was prepared in the laboratories of the Czech Technical University and contains multiple types of malware and attacks. We customized this dataset to our own needs and ran machine learning algorithms on it. To test the Machine Learning Engine, which will be positioned as the brain of the system, we introduced some additional features to be computed before the machine learning algorithms are applied: average inter-arrival time, average packet length, average data rate, ratio of same-length packets, and number of distinct protocol types. These features helped the Machine Learning Engine achieve better accuracy.
dc.description.abstract	The exponential growth in global data demand [3] has led network operators and system administrators to set up more complex infrastructures to satisfy ever-changing data requirements. As a consequence, securing a complex network system, or acting promptly during an attack, has become a difficult task. While network administrators are becoming better prepared against intruders, intruders are also changing their methods and finding new ways to attack target systems. One of the best-known attack types is the Denial-of-Service (DoS) attack, which exploits vulnerabilities of the target system by way of compromised slave computers, or botnets. DoS attacks aim to make the target system's resources unavailable to its legitimate users. A DoS attack originates from a source known as a Command and Control (C&C) server, which controls the slave (zombie) computers and uses them to consume the resources of the target system. The Distributed Denial-of-Service (DDoS) attack is a type of DoS attack in which the attackers use multiple botnets from various locations to exhaust the system with heavy traffic. The damage might be irreversible for most companies that have no DoS protection or strategy. The financial consequences of such an attack might be large, since it might lead to data theft and eventually to the loss of users' trust. Given the unexpected costs and consequences of an attack, it would be truly helpful if a possible future attack could be detected or predicted at an early phase. 
However, detecting the true source during an attack is difficult, even nearly impossible, since the Internet Protocol (IP) addresses of the attacker's devices are disguised as legitimate sources or spoofed. Preventive actions such as deploying firewalls or increasing the bandwidth capacity of the system are commonly suggested against malicious attacks. In addition to these precautions, as stated in [4], a good understanding of the usual traffic trend might be more helpful in avoiding possible attacks. This can be achieved by continuously monitoring the incoming and outgoing traffic of the system and determining which traffic can be classified as normal and which as abnormal. It is also proposed in [4] that, if a baseline trend can be deduced for the usual traffic behavior of a system, illegitimate traffic can be rejected beforehand. Although this is a robust way to block unwanted traffic at an early stage, a system can still fall victim to a DDoS attack, so there is always room for more intelligent preventive solutions built on custom intrusion detection systems. As mentioned before, it is nearly impossible to find the true attack sources under heavy load during a distributed attack, since the identifying information of the slave computers is spoofed. Therefore, it is not easy to identify the sources and cut the connection between the attacker and the victim system. 
However, if the location of the attacker can be estimated, the administrators can cut the connection, thereby decreasing the load, and meanwhile look for alternative solutions that keep their system from being exhausted. In this work, we propose an intelligent, Machine Learning (ML) based intrusion detection system, which we believe will give network administrators specific clues about where the botnets might be located during an attack. We will model our system as a live monitoring system that collects traffic data and learns from it. It will mark and score some hosts in the incoming traffic as abnormal, and thus give the administrators a general insight into the present situation and into what they might expect in the future. Our intrusion detection system is novel, open-source, scalable and distributed, in order to handle huge volumes of data. We designed the system to be customizable and scalable, so that it can easily be set up in front of any network system to cope with intrusion attacks. The system will collect raw network data in Apache Kafka topics in specific time windows; consumers will process this raw data in chunks and send the processed data to the Machine Learning Engine. The Machine Learning Engine will apply unsupervised learning algorithms to the incoming data and write the results back to the related Kafka topics. The Intrusion Classifier (IC), which we also introduce as a Kafka consumer, will decide whether the suspicious data is actually malicious. After the IC's decision, the results will be written to Elasticsearch for use by system administrators. Kibana interfaces will be used to serve and visualize the results. In our scenario, N regions are allowed to send traffic to our system. 
Each region has a DSLAM (Digital Subscriber Line Access Multiplexer) device, and our proposed model is able to monitor and log the incoming traffic together with the region it comes from. To achieve this, our proposed system will be positioned in front of the internal network devices to catch the incoming requests from the DSLAM devices. Our system will collect this region-based information, learn from it with unsupervised learning algorithms, and mark potentially infected regions in real time. With the help of the machine learning algorithms implemented in the open-source scikit-learn library, we believe we can provide broad insights about intrusion suspects. To build a proof of concept of the Machine Learning Engine, the actual brain of our design, we ran experiments on real botnet data to see whether we could detect real botnets with the help of the open-source scikit-learn library. To detect intrusions, it is better to work on a dataset that contains real botnet behaviour than on script-generated synthetic data. Therefore, in our study, we decided to work with CTU-13, a well-known dataset generated in the laboratories of the Czech Technical University. The dataset provides multiple scenarios, each simulating various malware infections. We took this dataset as a baseline, cleaned and customized the data according to our scenario, and applied ML algorithms to it to see how well we could detect the intruders. To test the ML Engine, which will serve as the brain of the system, we introduced some additional features to be calculated before the ML algorithms are applied: average inter-arrival time, average packet length, average data rate, ratio of same-length packets, and number of different protocol types. These features helped the ML Engine to obtain better accuracy.	en_US
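The five additional features listed at the end of the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Packet` record, its field names, the `window_features` helper, and the reading of "same-length ratio" as the share of packets that have the most common length are all assumptions made for illustration, and the sample packets are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; field names are assumptions, not CTU-13 columns."""
    timestamp: float  # arrival time in seconds
    length: int       # packet size in bytes
    protocol: str     # e.g. "tcp" or "udp"


def window_features(packets):
    """Compute the five features named in the abstract for one time window."""
    if not packets:
        raise ValueError("at least one packet is required")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    n = len(ordered)
    duration = ordered[-1].timestamp - ordered[0].timestamp
    total_bytes = sum(p.length for p in ordered)
    # Mean gap between consecutive packets; 0.0 when there is a single packet.
    avg_iat = duration / (n - 1) if n > 1 else 0.0
    # Bytes per second over the observed span; 0.0 when the span has no width.
    avg_rate = total_bytes / duration if duration > 0 else 0.0
    # Assumed definition: share of packets that have the most common length.
    most_common_count = Counter(p.length for p in ordered).most_common(1)[0][1]
    return {
        "avg_inter_arrival_time": avg_iat,
        "avg_packet_length": total_bytes / n,
        "avg_data_rate": avg_rate,
        "same_length_ratio": most_common_count / n,
        "protocol_count": len({p.protocol for p in ordered}),
    }


# Invented sample window for a single source host.
sample = [
    Packet(0.0, 60, "tcp"),
    Packet(0.5, 60, "tcp"),
    Packet(1.0, 60, "udp"),
    Packet(2.0, 120, "tcp"),
]
features = window_features(sample)
```

In a pipeline like the one described, a dictionary of this kind would be computed per source host or region for each time window and passed to the unsupervised learner as one feature vector.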
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	An open-source, machine learning based intrusion detection system
dc.title.alternative	Makina öğrenmesi tabanlı açık kaynak kodlu saldırı tespit sistemi
dc.type	masterThesis
dc.date.updated	2019-12-13
dc.contributor.department	Hesaplamalı Bilimler ve Mühendislik Anabilim Dalı
dc.subject.ytm	Cyber attack
dc.subject.ytm	Machine learning
dc.identifier.yokid	10300867
dc.publisher.institute	Bilişim Enstitüsü
dc.publisher.university	İSTANBUL TEKNİK ÜNİVERSİTESİ
dc.identifier.thesisid	594071
dc.description.pages	77
dc.publisher.discipline	Hesaplamalı Bilim ve Mühendislik Bilim Dalı

Files in this item

Name: yokAcikBilim_10300867.pdf
Size: 2.465Mb
Format: PDF
Description: File_10300867

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess