Air pollution prediction using data mining methods
Abstract
Air pollution has a significant effect on environmental conditions in large cities. The availability of air pollution control technologies developed to protect against pollution risks depends on accurate prediction of air pollution. Indicator pollutants are used to determine air pollution. Particulate matter smaller than 10 μm (PM10), particulate matter smaller than 2.5 μm (PM2.5), nitrogen oxides (NO, NO2, NOX), ozone (O3), sulphur oxides (SO2) and carbon oxides (CO) are the main indicator pollutants.

Air pollution has significant adverse effects on city residents, particularly on members of vulnerable groups such as children and people with heart or respiratory failure. Prolonged exposure to high concentrations of PM10 can cause premature death, impaired cardiovascular function and respiratory tract infections. To date, none of the air pollution prediction studies for İstanbul has worked with data sets that have an imbalanced class distribution. To fill this gap in the literature, we propose the Two Layer Air Pollution Estimation Model, which can cope with the imbalanced data distribution problem and predicts air pollution for İstanbul through the density of the PM10 pollutant.

The proposed air pollution estimation model consists of two layers. In the first layer, the PM10 classification problem is treated as a binary classification problem on imbalanced data coded as harmless class (1) and dangerous class (0). As solutions to the imbalanced data problem, sampling approaches and the tuning of algorithm parameters for imbalanced data are considered. In the sampling part of the solution, the balanced versions of the data produced by two Down Sampling methods, Random Sampling and the versions of Near-Miss sampling, did not yield satisfactory results.
In the algorithmic part, the performances of ensemble models that stand out for their positive effects on imbalanced learning problems, namely the Random Forest Classifier (RFC), Extra Tree Classifier (ETC) and Gradient Boosting Classifier (GBC), and of the kernel-based algorithms polynomial Support Vector Machine (poly-SVM) and rbf Support Vector Machine (rbf-SVM), were compared in terms of AUROC. The model proposed for binary classification uses all training set samples and predicts with RFC.

For the second layer, the initial binary-labeled PM10 data set is split in two according to the true labels: dangerous class samples (label 0) form one data group and harmless class samples (label 1) form another. Independent regression models that predict PM10 density are trained on these data groups. At prediction time, the Two Layer Air Pollution Estimation Model first uses the leading binary classifier from the first stage to decide which class the sample belongs to. The regression model of that class is then used to obtain the density of the PM10 pollutant. The performance of the Two Layer Air Pollution Estimation Model was compared with pure regression models according to the Mean Absolute Error (MAE) and Mean Square Error (MSE) metrics, using data from nine measurement points in İstanbul. On hourly meteorological and pollution data covering August 2011 to February 2018 from the Aksaray, Alibeyköy, Beşiktaş, Esenler, Kartal, Sarıyer, Silivri, Üsküdar and Yenibosna measurement points, the proposed model was found to cope better with the imbalanced data problem.

Air pollution has a significant effect on environmental conditions in many large cities. The availability of air pollution control technologies developed to protect against the risks of pollution depends on the accurate estimation of air pollution.
Accurate air pollution prediction is particularly helpful in ensuring economic and social development in developing countries. Indicator pollutants are used for risk assessment and epidemiological analysis in air pollution studies. Particulate matter under 10 μm (PM10), particulate matter under 2.5 μm (PM2.5), sulphur oxides (SO2), nitrogen oxides (NO, NO2, NOX), carbon oxides (CO) and ozone (O3) are the indicator pollutants frequently seen in this area.

Air pollution has a significant impact on city inhabitants, particularly members of vulnerable groups such as children and people with heart failure or respiratory failure. Prolonged exposure to high concentrations of PM10 may cause premature death, impaired cardiovascular function, and respiratory tract infections. Considering the threats posed to human health by particulate matter, we focus on PM10 density estimation in this study.

To safeguard the quality of life in urban and metropolitan centers, it is necessary to estimate changes in air pollution concentrations. In line with this need, air quality estimation models have been developed to predict, before pollution occurs, the times at which air quality will be low and pollution high at regional and local scales; these models take into account the characteristics of atmospheric pollution and the negative effects of air pollution on the standard of living.

When the data sets of current studies in this area are examined, meteorological data is seen to be used predominantly. Meteorological conditions are critical in determining the concentrations of pollutants in the air. Lower than normal ambient temperature and incoming solar radiation slow down photochemical reactions and cause secondary air pollutants, such as O3, to be present in smaller amounts in the air. Increased wind speed can increase or decrease air pollutant concentrations. Strong winds can create dust storms by lifting particles from the ground.
High humidity is often associated with high concentrations of some pollutants in the air (PM, CO and SO2), but may also result in low concentrations of others (such as NO2 and O3). One reason for this is that high humidity is an indicator of rainfall events. Besides studies that use only meteorological data for air quality estimation, there are studies that use pollutant data or both meteorological and pollution data. Data set selection is limited with respect to pollution data because pollutant measuring stations are more difficult and expensive to install and operate than meteorological stations, they are located in only a small number of areas, and their data are difficult to obtain.

The aim of this thesis is to estimate the intensity of air pollution for İstanbul through the PM10 indicator pollutant. For the study, which includes meteorological and pollution data covering the period between August 2011 and February 2018, the pollutant measurement stations in İstanbul were examined, and the Aksaray, Alibeyköy, Beşiktaş, Esenler, Kartal, Sarıyer, Silivri, Üsküdar and Yenibosna stations, which have the most historical data, were selected. Data from the meteorological stations closest to the pollution measurement stations were obtained from the Turkish State Meteorological Service.

The data used in air pollution estimation have some special characteristics that may cause difficulties for urban air quality estimation. First, building and operating a station in the city is costly, so the number of measuring stations is limited. Accordingly, it is also difficult to obtain labeled data in this field. Secondly, data loss may occur in the event of a technical failure at a station.
Generally, there is only one measuring device at each station, so data for the affected time range is lost not only when the device has a problem but also when it is calibrated or maintained. Another problem is that urban air pollution data vary depending on the technology used in the measuring stations. For example, the number of stations that can measure PM2.5 in İstanbul is considerably smaller than the number of stations that can measure CO.

An accurate regression model can stand in for air quality monitoring stations, which are limited in number and coverage within a city. A sensitive classification model can provide valuable information to protect people from harm due to air pollution. Generally, both classification and regression provide solutions that support air pollution control, and in this way can have both social and scientific effects.

In the air pollution estimation problem for İstanbul, when the density of the target pollutant PM10 is classified using the EPA limit values, six classes are obtained. After this transformation some classes have only a few samples, which are not sufficient for the prediction model to learn the relevant class; this shows that the problem cannot be treated as a six-class classification problem. Alternatively, when the problem is transformed into a binary classification problem and the class distributions in the data set are examined, the negative cases (samples of the minority class) are seen to be far fewer than the positive cases (samples of the dominant class). In classification, an imbalanced distribution of samples across classes is an important problem that makes it difficult for prediction models to learn.

In this thesis, we propose a Two Layer Air Pollution Estimation Model that predicts PM10 density using machine learning algorithms and can cope with an imbalanced distribution of data.
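
The class imbalance noted above can be checked with a few lines of code. The following is a minimal sketch, not the thesis's procedure: the 50 µg/m³ cut-off and the sample readings are hypothetical (the abstract does not state the threshold used), and only the coding of harmless as 1 and dangerous as 0 follows the text.

```python
from collections import Counter

# Hypothetical cut-off in µg/m³; the abstract does not give the value used.
THRESHOLD = 50.0


def to_binary_label(pm10):
    """Return 1 (harmless) below the threshold, 0 (dangerous) otherwise."""
    return 1 if pm10 < THRESHOLD else 0


def class_distribution(pm10_values):
    """Count samples per class and return the minority-class share."""
    counts = Counter(to_binary_label(v) for v in pm10_values)
    total = sum(counts.values())
    minority = min(counts.get(0, 0), counts.get(1, 0))
    return counts, minority / total


# Made-up hourly PM10 readings, for illustration only.
readings = [12.0, 30.5, 44.0, 18.2, 75.0, 25.1, 33.3, 120.4, 41.0, 9.8]
counts, minority_share = class_distribution(readings)
print(counts[1], counts[0], minority_share)
```

A small minority share is the signal that plain accuracy would be misleading and that a metric such as AUROC is a more informative basis for comparing classifiers.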
In the first layer, the PM10 classification problem is treated as a binary classification problem on an imbalanced data set coded as harmless class (1) and dangerous class (0). Sampling approaches and algorithmic approaches are addressed as solutions to the imbalanced binary classification problem. In the sampling part, the balanced versions of the data generated by two Down Sampling approaches, Random Sampling and Near-Miss sampling (three different versions), did not yield satisfactory results. In the algorithmic part of the solution, the performances of ensemble models that stand out for their positive effects on imbalanced learning problems, namely the Random Forest Classifier (RFC), Extra Tree Classifier (ETC) and Gradient Boosting Classifier (GBC), and of the kernel-based algorithms polynomial Support Vector Machine (poly-SVM) and rbf-SVM, were compared through the Area Under the ROC Curve (AUROC). The proposed model for binary classification uses all instances of the training set and predicts via RFC.

For the second layer, the initial binary-labeled PM10 data set is divided into two groups: dangerous class samples (labeled 0) and harmless class samples (labeled 1). Independent regression models that predict PM10 density are trained on these data groups. At prediction time, the leading binary classifier from the first layer determines which class the test sample belongs to. The regression model of that class is then used to obtain the density of the PM10 pollutant. The performance of the Two Layer Air Pollution Estimation Model was compared with pure regression models according to the Mean Absolute Error (MAE) and Mean Square Error (MSE) metrics using data from nine measurement points in İstanbul. The proposed model was found to handle the imbalanced data problem better on hourly meteorological and pollution data from the Aksaray, Alibeyköy, Beşiktaş, Esenler, Kartal, Sarıyer, Silivri, Üsküdar and Yenibosna measurement points, covering the period August 2011 to February 2018.
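
The two-layer prediction flow and the two error metrics can be sketched as follows. This is an illustrative outline under stated assumptions, not the thesis's implementation: the class name `TwoLayerModel`, the toy wind-speed rule standing in for the trained RFC, the two linear functions standing in for the per-class regressors, and all numbers are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """MSE: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class TwoLayerModel:
    """Layer 1 picks a class; layer 2 applies that class's regressor."""

    def __init__(self, classifier, regressors):
        self.classifier = classifier  # callable: features -> 0 or 1
        self.regressors = regressors  # dict: label -> callable(features) -> float

    def predict(self, features):
        label = self.classifier(features)        # 0 = dangerous, 1 = harmless
        return self.regressors[label](features)  # PM10 density estimate


# Hypothetical stand-ins for the trained models (toy rules, not fitted).
def toy_classifier(f):
    return 0 if f["wind_speed"] < 2.0 else 1


def toy_dangerous_regressor(f):
    return 90.0 - 5.0 * f["wind_speed"]


def toy_harmless_regressor(f):
    return 40.0 - 2.0 * f["wind_speed"]


model = TwoLayerModel(
    toy_classifier,
    {0: toy_dangerous_regressor, 1: toy_harmless_regressor},
)

# Made-up test samples and true PM10 values, for illustration only.
samples = [{"wind_speed": 1.0}, {"wind_speed": 5.0}]
y_true = [80.0, 32.0]
y_pred = [model.predict(s) for s in samples]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(y_pred, mae, mse)
```

Keeping the classifier and regressors as interchangeable callables mirrors the structure described in the abstract: each class gets its own regressor, so a regressor trained on the rare dangerous samples is not dominated by the far more numerous harmless ones.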