Novel merging based height-balanced histogram computation for big data

Büyüktanir, Tolga

Date

2017

Author

Büyüktanir, Tolga

Abstract

The amount of data generated and stored in cloud systems grows exponentially every day. Examples include user-generated data, machine-generated data, and data crawled from the Internet. Frameworks of proven efficiency, such as the Apache Hadoop ecosystem tools and some NoSQL frameworks, exist for storing and processing petabyte-scale data. These tools are widely used in industry and are therefore the subject of various research efforts. To be practical, proposed data processing techniques should be compatible with the frameworks listed above. One important data operation is building equi-depth histograms, because they are vital for understanding the statistical properties of data in many applications, including query optimization. This thesis studies the construction of approximate equi-depth histograms for big data and develops a new histogram-merging-based method that builds the equi-depth histogram of a given time interval, together with a framework that uses this method. The framework builds an approximate equi-depth histogram by merging exactly computed equi-depth histograms of data partitions. A maximum error bound is guaranteed on the number of items in any bucket of the resulting histogram, and also on any range of the histogram. We also present Apache Pig and web applications of the proposed method.

The amount of data generated and stored in cloud systems has been increasing exponentially. Examples include user-generated data, machine-generated data, and data crawled from the Internet. Several frameworks with proven efficiency exist for storing and processing petabyte-scale data, such as the Apache Hadoop ecosystem tools and several NoSQL frameworks. These systems are widely used in industry and are therefore the subject of much research. To be practical, proposed data processing techniques should be compatible with these frameworks. One of the key data operations is deriving equi-depth histograms, which are crucial for understanding the statistical properties of the underlying data and have many applications, including query optimization. In this thesis, we focus on approximate equi-depth histogram construction for big data and propose a novel merge-based histogram construction method, together with a histogram processing framework that constructs an equi-depth histogram for a given time interval. The proposed method constructs approximate equi-depth histograms by merging exact equi-depth histograms of partitioned data, guaranteeing a maximum error bound on the number of items in a bucket (the bucket size) as well as on any range of the histogram. We also test Apache Pig User Defined Functions implementing the proposed method.
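
To make the merging idea concrete, the following is a minimal Python sketch, not the thesis's algorithm. It builds an exact equi-depth histogram for each partition, then merges them with a simple sweep over the per-partition bucket boundaries. The function names `exact_equi_depth` and `merge_equi_depth`, the (boundaries, counts) representation, and the sweep rule are all assumptions chosen for illustration; the sketch does not implement or reproduce the error bound that the thesis guarantees.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: T + 1 boundaries and T bucket counts.
Histogram = Tuple[List[float], List[int]]


def exact_equi_depth(values: Sequence[float], num_buckets: int) -> Histogram:
    """Exact equi-depth histogram of one partition (sorts the data)."""
    data = sorted(values)
    n = len(data)
    if n < num_buckets:
        raise ValueError("need at least one item per bucket")
    boundaries = [data[0]]
    counts = []
    prev_end = 0
    for k in range(1, num_buckets + 1):
        end = (k * n) // num_buckets      # number of items in the first k buckets
        boundaries.append(data[end - 1])  # largest value in bucket k
        counts.append(end - prev_end)
        prev_end = end
    return boundaries, counts


def merge_equi_depth(histograms: Sequence[Histogram], num_buckets: int) -> Histogram:
    """Approximate equi-depth histogram from per-partition histograms.

    Illustrative sweep: treat each input bucket as (upper boundary, count),
    sort by boundary, and close a merged bucket whenever the running count
    reaches the next target depth k * total / num_buckets. Coarse inputs can
    yield fewer than num_buckets merged buckets.
    """
    pieces = sorted(
        (upper, count)
        for boundaries, counts in histograms
        for upper, count in zip(boundaries[1:], counts)
    )
    total = sum(count for _, count in pieces)
    merged_bounds = [min(boundaries[0] for boundaries, _ in histograms)]
    merged_counts = []
    running = 0  # items seen so far in the sweep
    closed = 0   # items already assigned to closed merged buckets
    k = 1        # index of the next depth target
    for i, (upper, count) in enumerate(pieces):
        running += count
        is_last = i == len(pieces) - 1
        if is_last or running * num_buckets >= k * total:
            merged_bounds.append(upper)
            merged_counts.append(running - closed)
            closed = running
            # Skip any targets this piece has already passed.
            while k * total <= running * num_buckets:
                k += 1
    return merged_bounds, merged_counts


if __name__ == "__main__":
    part_a = exact_equi_depth(range(0, 100), 10)
    part_b = exact_equi_depth(range(50, 150), 10)
    bounds, counts = merge_equi_depth([part_a, part_b], 4)
    print(bounds, counts)
```

Because the merge sees only bucket boundaries and counts, not the raw items, the merged boundaries can differ from those of an exact histogram over the combined data. Bounding that difference is the contribution the abstract describes.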

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/702275

Collections

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess