Author-book recognition with natural language processing techniques

Kaya, Samet

dc.contributor.advisor	Güneş, Ali
dc.contributor.author	Kaya, Samet
dc.date.accessioned	2021-05-08T06:40:51Z
dc.date.available	2021-05-08T06:40:51Z
dc.date.submitted	2018
dc.date.issued	2018-10-10
dc.identifier.uri	https://acikbilim.yok.gov.tr/handle/20.500.12812/623741
dc.description.abstract	Since the invention of writing, humanity has produced a great many written documents in different ways. Every piece of writing carries traces of the author who produced it. The author's vocabulary, way of thinking, logical inferences, incorrect or incomplete knowledge, and writing habits are reflected in the text. From this point of view, we can say that every document is the textual fingerprint of its author. However, as with a real fingerprint, extracting the author-specific features contained in that trace exceeds human ability. Extracting the personal characteristics of a text was a very difficult task before the computer revolution, whereas computers are able to carry it out. For author recognition, various author features are identified from training texts belonging to the author, and then the similarity between another text fed into the system and the characteristic vector extracted during the earlier training is computed. Some of the author features found in a text are vocabulary, spelling errors, and character and word n-gram traces. Thanks to computers, we can extract such features from a text and determine whether a document belongs to an author. In this thesis, author recognition was performed. The study used 120 different Turkish books written by 20 Turkish authors in different distributions. Character n-grams were used as the author's stylometric feature, and classification was performed with the Naive Bayes classifier. Within the thesis, 120 Turkish books were first collected and converted to txt format. Next, all books were preprocessed, removing whitespace, character errors, numeric and non-alphabetic expressions, punctuation, and non-Turkish characters from the text. After preprocessing, the 120 books were randomly split into two groups: 20 training books for the 20 authors and 100 test books. The training books carry author labels. 
As author features, bi-gram, tri-gram, and quadri-gram features were extracted from the training books by computing their frequencies, and the 200 most frequent ones formed the author's stylometric vector space. At this point the system is ready for author recognition. To test the system, we fed each test book into it one by one without an author label. As with the training books, bi-gram, tri-gram, and quadri-gram features were extracted from each test book, and the 200 most frequent were taken as author features. Finally, we classified the vector extracted from each test book against the author features stored in the system using the Naive Bayes classifier and obtained the classification results. We measured and recorded the system's success by comparing each test book's actual author with the author predicted by the system. The thesis compares the performance of different n-grams with the Naive Bayes classifier, and the author recognition performance of each n-gram vector space was measured. According to the observations, the bi-gram vector space was unsuccessful, whereas tri-grams and quadri-grams gave good results. The best performance, 82% accuracy, was obtained with quadri-grams. All results and the confusion matrix are given at the end of the thesis. The topic is important for examining electronic documents, whose number has grown exponentially with the internet age, in areas such as plagiarism and forensic investigation. Although there are many studies in this field for English, studies for Turkish are quite few. In the computer age, research is being carried out on computers understanding and producing human language. Such work is important so that Turkish does not fall behind other languages. In this respect, the thesis contributes to Turkish natural language processing. Keywords: Text classification, Author recognition, Naive Bayes classification, N-gram
dc.description.abstract	Since the invention of writing, humanity has produced many written documents in various ways. Every text carries traces of its author. The author's vocabulary, way of thinking and reasoning, incorrect or incomplete knowledge, and spelling habits are reflected in the text. From this point of view, we can say that a written document is the textual fingerprint of the person who wrote it. However, just as with fingerprints, these features are difficult to detect in a text with human abilities alone. Determining the personal characteristics of a text was difficult before the computer revolution, but it can be achieved with computers today. For author recognition, various author features are determined from training texts and then compared with the features of other texts to measure similarity. Some of the author features found in a text are vocabulary, typographical errors, character n-gram traces, and word n-gram traces. Thanks to computers, these kinds of features can be extracted from a text and used to analyze whether a document belongs to a given writer. In this thesis, author recognition was studied. The study covers 120 different Turkish books written by 20 Turkish authors in different distributions. Character n-grams are used as stylometric features, and a Naïve Bayes classifier is used to recognize the author of a text. First, we gathered 120 Turkish novels and converted them to txt format. Then all books were preprocessed: gaps were deleted, erroneous characters were corrected, and numeric and non-alphabetic characters, punctuation, and non-Turkish characters were removed. After preprocessing, the books were divided into two groups: 20 training books and 100 test books. The training books carry an author name label, and the author's stylometric features are extracted from them. These stylometric features are bi-grams, tri-grams, and quadri-grams, each handled separately. We calculate the frequency of these features and take the 200 most frequent for the author's stylometry vector space. 
At this point, our system is ready for the author recognition process. For testing, we take the test books one by one and extract bi-grams, tri-grams, and quadri-grams as authorship features, as we do for the training books. The 200 most frequent extracted n-grams are then passed to the Naïve Bayes classifier, which decides who wrote the book from among the authors previously introduced to the system. The system's predicted label and the real author of the book are compared, and the results are recorded. The thesis compares the performance of the different n-grams under the Naïve Bayes method and reports the author recognition performance of each n-gram vector space. According to our observations, the bigram vector space failed, whereas trigrams and quadri-grams gave good accuracy. The best performance, 82% accuracy, belongs to the quadri-gram. The other results and the confusion matrix are given in the thesis. With recent developments in computer architectures, the number of proceedings and articles on computer understanding of human languages has increased enormously. Unfortunately, a large share of these papers concern the English language. From this perspective, the thesis contributes to Natural Language Processing for the Turkish language and provides motivation for further studies in the field. Keywords: Text classification, Author detection, Naïve Bayesian approach, N-gram.	en_US
dc.language	Turkish
dc.language.iso	tr
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Attribution 4.0 United States	tr_TR
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	tr_TR
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.title	Doğal dil işleme teknikleriyle yazar-kitap tanıma
dc.title.alternative	Author-book recognition with natural language processing techniques
dc.type	masterThesis
dc.date.updated	2018-10-10
dc.contributor.department	Bilgisayar Mühendisliği Ana Bilim Dalı
dc.identifier.yokid	10198055
dc.publisher.institute	Fen Bilimleri Enstitüsü
dc.publisher.university	İSTANBUL AYDIN ÜNİVERSİTESİ
dc.identifier.thesisid	511589
dc.description.pages	89
dc.publisher.discipline	Bilgisayar Mühendisliği Bilim Dalı
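
Illustrative sketch (not taken from the thesis): the pipeline described in the abstract (preprocessing, taking the 200 most frequent character n-grams, and Naive Bayes classification) could look roughly like the Python below. The function names, the choice to collapse whitespace to single spaces rather than delete it, and the multinomial formulation with Laplace smoothing and a uniform author prior are all assumptions; the record does not state how the thesis implemented these steps.

```python
import math
import re
from collections import Counter

# The 29 lowercase letters of the Turkish alphabet; everything else is dropped.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"


def preprocess(text):
    """Lowercase with Turkish casing rules; keep only Turkish letters and single spaces."""
    # Python's default lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so the two Turkish-specific cases are handled explicitly first.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    text = re.sub(f"[^{TURKISH_LETTERS}\\s]", "", text)
    # Assumption: runs of whitespace are collapsed to one space, not removed entirely.
    return re.sub(r"\s+", " ", text).strip()


def ngram_profile(text, n, k=200):
    """Return the k most frequent character n-grams of text with their counts."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(counts.most_common(k))


def train(books_by_author, n, k=200):
    """Build one n-gram profile per author from labelled training text."""
    return {author: ngram_profile(preprocess(text), n, k)
            for author, text in books_by_author.items()}


def classify(profiles, text, n, k=200):
    """Return the author with the highest multinomial Naive Bayes log-likelihood."""
    vocab = set().union(*profiles.values())  # all n-grams seen in any author profile
    test = ngram_profile(preprocess(text), n, k)
    best_author, best_score = None, -math.inf
    for author, profile in profiles.items():
        total = sum(profile.values()) + len(vocab)  # Laplace-smoothed denominator
        score = sum(count * math.log((profile.get(gram, 0) + 1) / total)
                    for gram, count in test.items() if gram in vocab)
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```

Under these assumptions, a call such as classify(train(books_by_author, n=4), test_text, n=4) would correspond to the quadri-gram setting, and repeating it with n=2 and n=3 would mirror the bi-gram and tri-gram comparison mentioned in the abstract.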

Files in this item

Name:: yokAcikBilim_10198055.pdf
Size:: 4.391Mb
Format:: PDF
Description:: File_10198055

This item appears in the following Collection(s)

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess