Author-book recognition with natural language processing techniques (Doğal dil işleme teknikleriyle yazar-kitap tanıma)

Kaya, Samet

Date

2018

Abstract

Since the invention of writing, humanity has produced a great many written documents in different ways. Every text bears the traces of the author who produced it. The author's vocabulary, way of thinking, logical inferences, incorrect or incomplete knowledge, and writing habits are reflected in the text. From this point of view, we can say that every written document is a textual fingerprint of its author. However, as with a real fingerprint, extracting the author-specific features contained in the trace exceeds human ability. Extracting the personal characteristics of a text was a very difficult task before the computer revolution, whereas computers can now perform it. For author recognition, various author features are identified from training texts belonging to the author, and the similarity between another text fed into the system and the characteristic vector extracted during training is then computed. Some of the author features found in a text are vocabulary, spelling errors, and character and word n-gram traces. Thanks to computers, we can extract such features from a text and determine whether a document belongs to an author.

In this thesis, author recognition was performed on 120 different Turkish books written by 20 Turkish authors in different distributions. Character n-grams were used as the author's stylometric feature, and classification was carried out with the Naive Bayes classifier. First, 120 Turkish books were collected and converted to txt format. Then all books were pre-processed: spaces, character errors, numeric and non-alphabetic expressions, punctuation, and non-Turkish characters were removed from the text. After preprocessing, the 120 books were randomly split into two groups: 20 training books for the 20 authors and 100 test books. The training books carry author labels.

Bi-gram, tri-gram, and quadri-gram features were extracted from the training books by computing their frequencies, and the 200 most frequent were used to build the author's stylometric vector space. At this point the system is ready for author recognition. To test the system, we fed each test book into it one at a time without an author label. As with the training books, bi-gram, tri-gram, and quadri-gram features were extracted from each test book, and the 200 most frequent were taken as author features. Finally, the vector extracted from a test book was classified against the author features stored in the system using the Naive Bayes classifier. We measured and recorded the system's success by comparing the actual author of each test book with the author predicted by the system. The thesis compares the performance of the different n-gram vector spaces under the Naive Bayes classifier. According to the observations, the bi-gram vector space was unsuccessful, whereas tri-grams and quadri-grams gave good results. The best performance, 82% accuracy, was achieved by quadri-grams. All results and the confusion matrix are given at the end of the thesis.

The topic is important for examining electronic documents, whose number has grown exponentially with the internet age, for purposes such as plagiarism detection and forensic investigation. Although there are many studies on English in this field, studies on Turkish are quite few. In the computer age, work is being carried out on computers understanding and producing human language. Such studies are important so that Turkish does not fall behind other languages. In this respect, the thesis contributes to Turkish natural language processing.

Keywords: Text classification, Author recognition, Naive Bayes classification, N-gram
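
The preprocessing and feature-extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis code: the function names (`turkish_lower`, `preprocess`, `top_ngrams`), the lowercasing step, and the exact 29-letter character set are assumptions, and whitespace is assumed to be removed entirely along with digits and punctuation.

```python
from collections import Counter

# Assumption: only the 29 lowercase letters of the Turkish alphabet are kept.
# The abstract does not list the exact character set used in the thesis.
TURKISH_LETTERS = set("abcçdefgğhıijklmnoöprsştuüvyz")


def turkish_lower(text):
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # str.lower() alone would map 'I' to 'i' and 'İ' to 'i' plus a combining dot.
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(text):
    """Drop whitespace, digits, punctuation, and non-Turkish characters."""
    return "".join(ch for ch in turkish_lower(text) if ch in TURKISH_LETTERS)


def top_ngrams(text, n, k=200):
    """Return the k most frequent character n-grams as (ngram, count) pairs."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(k)


if __name__ == "__main__":
    cleaned = preprocess("Işık, 123 İyi bir kitap!")
    for n in (2, 3, 4):  # bi-gram, tri-gram, quadri-gram
        print(n, top_ngrams(cleaned, n, k=5))
```

Each training or test book would be passed through `preprocess` once and then through `top_ngrams` with `n` set to 2, 3, or 4, giving one 200-entry profile per book and n-gram size.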

Since the invention of writing, humanity has produced many written documents in various ways. Every text carries traces of its author: the author's vocabulary, way of thinking and reasoning, incorrect or incomplete knowledge, and spelling habits all appear in the text. From this point of view, a written document can be seen as the textual fingerprint of the person who wrote it. However, just as with fingerprints, these features are difficult to detect with human abilities alone. Determining the personal characteristics of a text was difficult before the computer revolution, but it can be achieved with computers today. For author recognition, various author features are extracted from training texts and then compared with the features of other texts to look for similarities. Some of the author features found in a text are vocabulary, typographical errors, character n-gram traces, and word n-gram traces. Thanks to computers, these kinds of features can be extracted from a text and used to analyze whether a document belongs to a given writer.

In this thesis, author detection was studied on 120 different Turkish books written by 20 Turkish authors in different distributions. Character n-grams are used as stylometric features, and a Naïve Bayes classifier is used to recognize the author of a text. First, we gathered 120 Turkish novels and converted them to txt format. Then all books were pre-processed: gaps were deleted, erroneous characters were corrected, and numeric and other non-alphabetic characters, punctuation, and non-Turkish characters were removed. After preprocessing, the books were divided into two groups: 20 training books and 100 test books. The training books carry the author name as a label, and the authors' stylometric features are extracted from them. These features are bi-grams, tri-grams, and quadri-grams, each handled separately. We calculate the frequency of these features and take the 200 most frequent to form the author's stylometric vector space.

At this point the system is ready for author recognition. For testing, we take the test books one by one and extract bi-grams, tri-grams, and quadri-grams as authorship features, as we do for the training books. The 200 most frequent n-grams are then passed to the Naïve Bayes classifier, which decides who wrote the book among the authors previously introduced to the system. The label estimated by the system is compared with the real author of the book, and the results are recorded. The thesis compares the performance of the different n-grams under the Naïve Bayes method and reports how well each n-gram vector space detects authors. According to our observations, the bigram vector space failed, while trigrams and quadri-grams gave good accuracy. The best performance, 82% accuracy, belongs to the quadri-gram. The remaining results and the confusion matrix are given in the thesis.

With recent developments in computer architectures, the number of proceedings and articles on the understanding of human languages by computers has increased enormously. Unfortunately, most of these papers concern the English language. From this perspective, the thesis contributes to natural language processing for Turkish and provides motivation for further studies in the field.

Keywords: Text classification, Author detection, Naïve Bayesian approach, N-gram.
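
The classification and evaluation step might look something like the sketch below. The abstract does not say which Naive Bayes variant or smoothing was used, so this assumes a multinomial model with add-one (Laplace) smoothing over the union of the authors' top-200 n-grams and a uniform prior (one training book per author). The names `train`, `predict`, and `accuracy` are hypothetical, and the texts are assumed to be already pre-processed.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train(author_texts, n=4, k=200):
    """Build per-author log-probabilities from each author's top-k n-grams.

    author_texts maps an author name to that author's training text.
    """
    profiles = {a: dict(char_ngrams(t, n).most_common(k))
                for a, t in author_texts.items()}
    vocab = set().union(*profiles.values())  # union of all top-k n-grams
    model = {}
    for author, counts in profiles.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        model[author] = {g: math.log((counts.get(g, 0) + 1) / total)
                         for g in vocab}
    return model


def predict(model, text, n=4, k=200):
    """Return the author with the highest log-likelihood for the text."""
    feats = char_ngrams(text, n).most_common(k)

    def score(author):
        # N-grams never seen in any training profile are ignored.
        logp = model[author]
        return sum(c * logp[g] for g, c in feats if g in logp)

    return max(model, key=score)


def accuracy(model, labeled_tests, n=4, k=200):
    """Fraction of (true_author, text) pairs that are predicted correctly."""
    hits = sum(predict(model, text, n, k) == author
               for author, text in labeled_tests)
    return hits / len(labeled_tests)
```

Running `train` and `accuracy` once each with `n` set to 2, 3, and 4 would reproduce the kind of bi-gram versus tri-gram versus quadri-gram comparison the abstract reports, although the numbers from this simplified sketch should not be expected to match the thesis.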

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/623741

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess