Detecting hate speech in Turkish texts

Hüsünbeyi, Zehra Melce

Date

2020

Abstract

Prejudiced and discriminatory language is widely used and spread through many channels, including print and social media. Discriminatory language, and in particular hate speech as its more aggressive, degrading, and openly targeting form, threatens the values of democracy and human rights and is a global problem that needs an urgent solution. Because detection is an important part of the fight against hate speech, we developed a model to detect it. For this purpose, we built a dataset by retrieving, from the website of the PRNet media monitoring company, print media news articles that the Hrant Dink Foundation had systematically annotated for hate speech. To the best of our knowledge, this study produces the first model for Turkish that works on a labeled dataset. Most hate speech in print media news relies on context and implication, so detecting it requires a system that can pick up shifting discursive cues and understand the context that forms around them. We therefore examined the Hierarchical Attention Network (HAN) model, which aims to capture the changing meanings of expressions by using the hierarchical structure of the text, with different word representations. We assessed how well the model suits the problem by comparing it with Convolutional Neural Networks (CNN), which have produced important results in text processing, and with machine learning models. To improve the study, we developed problem-specific linguistic features based on critical discourse analysis techniques and enriched the HAN model with them. Our results show that performance improves with the feature set that signals the use of 'othering language'. We believe that these feature sets, created for the Turkish language, will encourage new studies in the quantitative analysis of hate speech.
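
The hierarchical idea behind a HAN can be illustrated with a minimal sketch: word vectors are pooled by attention into sentence vectors, and sentence vectors are pooled by attention into a document vector. This is not the thesis implementation. It omits the recurrent encoders and learned projections of a full HAN, and the names `attention_pool`, `encode_document`, `word_context`, `sentence_context`, and `extra_features` are hypothetical. Appending hand-crafted linguistic features to the document vector is an assumption, since the abstract does not say how the features were combined with the model.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: Sequence[Vector], context: Vector) -> Vector:
    """Return the attention-weighted sum of `vectors`.

    Each vector is scored by the dot product of tanh(vector) with a context
    vector. A full HAN applies a learned projection before the tanh; this
    sketch skips it to stay dependency-free.
    """
    scores = [sum(math.tanh(x) * c for x, c in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def encode_document(
    doc: Sequence[Sequence[Vector]],
    word_context: Vector,
    sentence_context: Vector,
    extra_features: Sequence[float] = (),
) -> Vector:
    """Encode a document given as a list of sentences of word vectors."""
    # Level 1: pool the words of each sentence into one sentence vector.
    sentence_vectors = [attention_pool(sent, word_context) for sent in doc]
    # Level 2: pool the sentence vectors into one document vector.
    doc_vector = attention_pool(sentence_vectors, sentence_context)
    # Assumption: hand-crafted features are simply appended to the vector.
    return doc_vector + list(extra_features)


# Toy document: two sentences with 2-dimensional word vectors.
toy_doc = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
toy_encoding = encode_document(toy_doc, [1.0, 0.0], [0.0, 1.0], extra_features=[1.0])
```

In this sketch the resulting vector would be passed to a classifier. The context vectors, fixed by hand here, are learned parameters in a trained network.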

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/72101

Collections

TEZLER (Theses)

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess