A proposed extended architecture for topic detection and tracking in restricted domains (Sınırlı alanlarda konu tespit ve takibi için genişletilmiş bir mimari yapı önerisi)

Köse, Güven

Date

2014

Abstract

With the volume of information on the Internet reaching enormous proportions, the Internet has become the first choice of users looking for information. This strong interest has further increased the importance of both search engines and information retrieval systems. Users who search with a limited number of keywords rely heavily on search engines, whereas users who need more specific and in-depth information use specialized information retrieval systems. In recent years, research on specialized information retrieval systems has concentrated on the `Topic Detection and Tracking` program, which can also be described as news detection and tracking. What most distinguishes work in this program from traditional information retrieval is that query-document matching is replaced by document-document matching. Its most important components are the `story link detection` task, which tries to determine whether two independent news stories arriving at the system are on the same topic, and the `topic tracking` task, which aims to catch newly published stories on a predefined topic.

This study investigates the effects of different retrieval functions and document representation techniques on performance in story link detection and topic tracking. For story link detection, the vector space model and the relevance model were used as retrieval functions. In addition, the terms with the highest tf.idf values were selected for document representation, the performance tests were repeated with these terms, and the most appropriate number of terms was determined for each method. For topic tracking, the study examined the effects on performance of choosing an appropriate threshold value and of using the vector space model, the relevance model, and the k-means method as retrieval functions. All performance tests for both tasks were carried out on the BilCol-2005 Turkish news corpus, which has been used in similar academic studies.

According to the f-measure results on this corpus, the vector space model performed much better than the relevance model in story link detection. For document representation, the highest f-measure values were reached with 30 terms in the vector space model and 4 terms in the relevance model. In topic tracking, selecting as the threshold the value at the point where recall and precision are highest proved to be the most successful approach, and k-means was found to be the most successful method.

In light of these results, a Turkish topic tracking system is proposed for situations in which no training documents are available. In this system, the AND combination of the vector space and relevance models is proposed for creating and enriching topic models, and the k-means method should be used to determine whether newly arriving news stories are related to a topic model. The proposed architecture is expected to make it possible to build an effective topic tracking system for Turkish.

Keywords: Topic detection and tracking, story link detection, topic tracking, information retrieval systems, Turkish topic tracking system.
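As an illustration only, the vector space side of the story link detection setup described in the abstract can be sketched as follows: documents become tf.idf vectors, each vector is cut down to its highest-weighted terms, and two stories are judged to be linked when their cosine similarity reaches a threshold. The record does not give the thesis's tokenization, weighting formula, or threshold values, so everything here is an assumption: the function names are hypothetical, the smoothed idf is one common variant, and picking the threshold by maximizing the f-measure is only one possible reading of the threshold rule mentioned above. The relevance model and k-means components are not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} tf.idf vectors.

    Assumption: raw term frequency times a smoothed idf,
    log((1 + N) / (1 + df)) + 1. The thesis may use another variant.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def top_terms(vector, k):
    """Keep only the k highest-weighted terms of a vector."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def linked(a, b, k, threshold):
    """Judge two stories as on the same topic if similarity >= threshold."""
    return cosine(top_terms(a, k), top_terms(b, k)) >= threshold


def best_threshold(scores, labels):
    """Return (threshold, f_measure) maximizing F1 over the observed scores.

    Assumption: this is one interpretation of the threshold selection rule,
    not necessarily the procedure used in the thesis.
    """
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            f = 0.0
        else:
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


if __name__ == "__main__":
    # Toy, hand-made token lists; not taken from BilCol-2005.
    stories = [
        ["deprem", "izmir", "hasar", "deprem"],
        ["izmir", "deprem", "artçı", "hasar"],
        ["futbol", "maç", "gol", "lig"],
    ]
    vecs = tfidf_vectors(stories)
    print(linked(vecs[0], vecs[1], k=30, threshold=0.2))
    print(linked(vecs[0], vecs[2], k=30, threshold=0.2))
```

The term cutoff `k` and the similarity threshold are left as parameters because the abstract reports that the best number of terms differs by method (30 for the vector space model, 4 for the relevance model) and that threshold choice affects tracking performance.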

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/464564

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess