A template-independent content extraction approach for news web pages

Yeniçağ, Ahmet

dc.contributor.advisor	Can, Fazlı
dc.contributor.author	Yeniçağ, Ahmet
dc.date.accessioned	2020-12-29T08:01:37Z
dc.date.available	2020-12-29T08:01:37Z
dc.date.submitted	2012
dc.date.issued	2018-08-06
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/353228
dc.description.abstract	İnternet haber sayfaları, reklamlar, bağlantılar, ve kullanıcı yorumları gibi fazladan elemanlar içermektedirler. Bu elemanlar, haber içeriklerinin çıkartılmasını zorlu kılmaktadırlar.Günümüzdeki haber içeriği çıkartma (HİÇ) yöntemleri genellikle şablon bağımlı olarak çalışmaktadırlar. Haber sağlayıcılar, internet sayfası şablonlarını sıklıkla değiştirdikleri için,bu yöntemler düzenli bakım gerektirmektedirler. Bu nedenle, haber içeriklerini internet sayfası şablonlarına bağımlı olmaksızın doğru bir şekilde çıkartabilecek HİÇ yöntemlerine gereksinim duyulmaktadır. Bu tez çalışmasında, bir şablon bağımsız haber içeriği çıkartma yöntemi (N-EXT) önerilmiştir. N-EXT ilk olarak, bir haber sayfasını HTML etiketlerine göre bloklara ayrıştırır. Daha sonra haber içeriğinin çoğunluğunu ya da tamamını içeren bloğu tespit etmek için ayrıştırdığı tüm blokları inceler. Bu amaçla, bloklara metinsel boyutlarını ve haber başlığına olan benzerliklerini göz önünde tutarak birer ağırlık tahsis eder. Bu iki ağırlık bileşenlerinin önemini belirlemek için k-kat çapraz doğrulama yaklaşımı ve olası farklı benzerlik ölçülerinin etkilerini değerlendirmek için de tek yönlü varyans analizi (ANOVA) ve Scheffe çoklu karşılaştırma testi birlikte kullanılmıştır. En yüksek ağırlığa sahip blok, haber bloğu olarak düşünülür. Haber bloğu içerisinde yer alan fakat haber içeriğiyle ilgisi olmayan cümleler, önerilen yöntem tarafından haber bloğuna olan benzerlikleri değerlendirilerek haber bloğundan elenir. Son olarak, önerilen yöntem olası haber içeriği kalıntılarını tespit etmek için, haber bloğu dışındaki blokları da inceler. Farklı haber sitelerinin internet sayfalarını içeren iki farklı deney koleksiyonu üzerinde yapılan deneylerce, önerilen yöntemin doğruluğu ve dayanıklılığı gösterilmiştir.
dc.description.abstract	News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make the extraction of news contents a challenging task. Current news content extraction (NCE) methods are usually template-dependent. They require regular maintenance, since news providers frequently change their web page templates. Therefore, there is a need for NCE methods that extract news contents accurately without depending on web page templates. In this thesis, a template-independent News content EXTraction approach, called N-EXT, is introduced. It first parses a web page into its blocks according to the HTML tags. Then, it examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns weights to the blocks by considering both their textual sizes and similarities to the news title. For quantifying the importance of these two weight components, we use the k-fold cross validation approach; and for assessing the impact of different possible similarity measures, we use a one-way Analysis of Variance (ANOVA) with a Scheff/'{e} comparison. The block with the highest weight is considered as the news block. Our approach eliminates the sentences in the news block that are not related to the news content by considering similarities of sentences to the news block. Finally, it also examines other blocks to detect the rest of the news content. The experimental results show the accuracy and robustness of our method by using two test collections whose web pages are obtained from several different news websites.	en_US
dc.language	English
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgi ve Belge Yönetimi	tr_TR
dc.subject	Information and Records Management	en_US
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	A template-independent content extraction approach for news web pages
dc.title.alternative	Haber internet sayfaları için şablon-bağımsız içerik çıkartma yöntemi
dc.type	masterThesis
dc.date.updated	2018-08-06
dc.contributor.department	Bilgisayar Mühendisliği Anabilim Dalı
dc.subject.ytm	News
dc.subject.ytm	Information extraction
dc.subject.ytm	Text detection
dc.identifier.yokid	442005
dc.publisher.institute	Mühendislik ve Fen Bilimleri Enstitüsü
dc.publisher.university	İHSAN DOĞRAMACI BİLKENT ÜNİVERSİTESİ
dc.identifier.thesisid	315163
dc.description.pages	95
dc.publisher.discipline	Diğer

Files in this item

Name:: yokAcikBilim_442005.pdf
Size:: 1.883Mb
Format:: PDF
Description:: File_442005

View/Open

This item appears in the following Collection(s)

TEZLER

Show simple item record

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess