A Test Dataset of Offensive Malay Language by a Cyberbullying Detection Model on Instagram Using Support Vector Machine

Social media services have become a prevalent communication tool because they allow users to share information instantly with a large number of people for free. However, social media also facilitates cyberbullying, and studies have shown that cyberbullying on social media has a more severe impact than on other platforms. In some cases, cyberbullying leads to tragic outcomes, such as suicide. The information shared on social media services provides a massive amount of textual data, which can be used to explore patterns of human behavior, including cyberbullying. This paper aims to build a dataset of offensive language for research on cyberbullying in the Malay language and reports a series of baseline experiments with SVM classifiers. These preliminary experiments helped to gauge the performance of automatic tools that mine for abusive language within a corpus of Malay texts. To achieve these objectives, social media extraction methods and new crawling technologies were developed to monitor the Instagram accounts of popular Malaysian celebrities. The resulting collection contains 165,239 real-world comments associated with 27 public Instagram accounts. A sample of this corpus was manually labelled in terms of cyberbullying categories. After cleaning, normalization, and vectorization, the resulting dataset comprised 527 comments. Following a standard training (70%) and test (30%) split, an SVM classifier was developed and evaluated. These initial experiments produced a model accuracy of 75% and F1-scores of around 75%.
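
To make the evaluation protocol concrete, the following is a minimal Python sketch of a 70%/30% split over 527 items and of how accuracy and the F1-score are computed from predictions. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the random seed, the unstratified shuffle, and the toy labels are all hypothetical, and the SVM itself is not reproduced here.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of `items` and split it into (train, test).

    Assumption: a plain random shuffle; the abstract does not say
    whether the split was stratified by class.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 527 comment indices stand in for the cleaned dataset.
train_ids, test_ids = train_test_split(range(527))

# Toy labels (1 = offensive, 0 = not offensive), purely illustrative.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_f1 = f1_score(toy_true, toy_pred)
```

With these assumptions the split yields 369 training and 158 test comments. In a full pipeline, the predictions passed to `accuracy` and `f1_score` would come from the trained SVM applied to the vectorized test comments.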