Estimating the selectivity of Sql Like queries

Aytimur, Mehmet

dc.contributor.advisor	Çakmak, Ali
dc.contributor.author	Aytimur, Mehmet
dc.date.accessioned	2021-05-08T07:33:20Z
dc.date.available	2021-05-08T07:33:20Z
dc.date.submitted	2017
dc.date.issued	2020-01-08
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/631475
dc.description.abstract	Metin tabanlı verilerin miktarında internet kaynaklarının da etkisiyle gözlemlenen dramatik artış nedeniyle, bu tür verileri esnek arama kalıpları ile sorgulama giderek yaygınlaşmaktadır. İlişkisel veritabanları, SQL LIKE operatörünü kullanarak esnek ﬁltrelerle metinsel verileri sorgulamaya izin vermektedir (örn., LIKE `Sub%`, `Sub` ile başlayan tüm satırlar ile eşleşir). Metinsel verilerin büyük boyutlu olmasından dolayı, bu tür sorguları en verimli şekilde yürütmek, veritabanı performansı için oldukça kritik önem taşır. Verilen bir LIKE sorgusu için en verimli yürütme planını oluştururken, sorgu eniyileyici, sorguda yer alan esnek sorgu yüklemleri için seçicilik tahminine ihtiyaç duyar. Bu tez çalışmasında, bir metin verisi sütundaki veri dağılımını özetlemek üzere yeni bir desen tabanlı histogram yapısı kullanan, SPH ve P-SPH isimleri altında iki yeni algoritma geliştirilmiştir. Daha spesiﬁk olarak, önce bir metin veritabanındaki dizi motiﬂeri (SPH) veya konumsal dizi motiﬂeri (P-SPH) hesaplanır. Sonra bu motiﬂerden özel bir histogram oluşturulur. Sorgu eniyileme sürecinde, bir LIKE sorgu yükleminin seçiciliğini tahmin etmek için bu histogramlar kullanılır. DBLP bibliyografya verileri üzerinde yapılan deney sonuçları, önerdiğimiz tekniklerin, genel LIKE ifadeleri için literatürdeki son teknikten daha az tahmin hatasına sahip olduğunu göstermektedir. Dahası , önerdiğimiz histogram bazlı teknikler geçmişte önerilen en gelişmiş teknikten iki kat daha az hafıza ve bir kat daha az zaman gerektirir.
dc.description.abstract	With the dramatic increase in the amount of the text-based data which commonly contains misspellings and other typos, querying such data with ﬂexible search patterns becomes more and more commonplace. Relational databases support SQL LIKE operator to allow searching with a particular wild-card predicate (e.g., LIKE `Sub%`, which matches all strings starting with `Sub`). Due to large size of text data, executing such queries in the most optimal way is quite critical for database performance. While building the most eﬃcient execution plan for a LIKE query, the query optimizer requires the selectivity estimate of the corresponding ﬂexible wild-card query predicate(s). To this end, we propose novel algorithms to estimate the selectivity of LIKE query predicates. We ﬁrst introduce SPH (Sequential Pattern-based Histogram) algorithm that is based on a new type of pattern-based histogram structure to summarize the data distribution in a particular text column. More speciﬁcally, we ﬁrst mine sequential patterns in a given string database, and then construct a special histogram out of the mined patterns oﬄine, i.e., before query processing starts. Then, during query optimization time, pattern-based histograms are exploited to estimate the selectivity of a LIKE predicate. The experimental results on a real dataset from DBLP show that the proposed techniques outperform the state of the art for generic LIKE queries like %s1%s2%...%sn% where si represents one or more characters. What is more, the proposed histogram structure requires more than two orders of magnitude smaller memory space, and the estimation time is almost an order of magnitude less in comparison to the state of the art. Next, we introduce another LIKE query selectivity estimation algorithm, called P-SPH. P-SPH extends SPH in three distinct ways: (i) it extends regular sequence patterns into more speciﬁc positional sequence patterns, and use them in its histograms, (ii) it introduces a slider-based more ﬂexible matching scheme, and (iii) it employs an information-theoretic redundant pattern elimination mechanism. We experimentally demonstrate that P-SPH further improves the accuracy of SPH at the expense of using more memory.	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Estimating the selectivity of Sql Like queries
dc.title.alternative	Sql Like sorgularının seçiciliğini tahmin etme
dc.type	masterThesis
dc.date.updated	2020-01-08
dc.contributor.department	Bilgisayar Ana Bilim Dalı
dc.identifier.yokid	10165277
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	İSTANBUL ŞEHİR ÜNİVERSİTESİ
dc.identifier.thesisid	498110
dc.description.pages	68
dc.publisher.discipline	Diğer

Files in this item

Name:: yokAcikBilim_10165277.pdf
Size:: 108.7Mb
Format:: PDF
Description:: File_10165277

View/Open

This item appears in the following Collection(s)

TEZLER

Show simple item record

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess