Comparing Ranking-based and Naive Bayes Approaches to Language Detection on Tweets

This article describes two systems that participated in the TweetLID 2014 shared task on language detection in tweets. The systems are based on two different strategies: ranked dictionaries and Naive Bayes classifiers. The results show that the ranked-dictionary approach performs better with small training corpora whose language distribution is similar to that of the test dataset, while a Naive Bayes classifier achieves higher scores with large models even when the training data are unbalanced with respect to the test dataset. The experiments also show that models based on word unigrams outperform those based on character n-grams. In the final evaluation, the Naive Bayes classifier ranked first among the unconstrained systems (those trained with external data) participating in the competition.

Keywords: Language Identification, Short Text, Naive Bayes Classifier, Dictionary-Based Models
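As an illustration of the word-unigram Naive Bayes strategy the abstract refers to, the sketch below shows a minimal multinomial classifier with add-one smoothing. It is not the authors' system: the class name, the whitespace tokenization, and the training examples are all assumptions made for this example; a real tweet-oriented system would use a dedicated tokenizer and far larger training corpora.

```python
import math
from collections import Counter, defaultdict


class UnigramNB:
    """Minimal multinomial Naive Bayes over word unigrams (add-one smoothing).

    Illustrative sketch only; naive whitespace tokenization is assumed.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # language -> word frequency
        self.doc_counts = Counter()              # language -> number of tweets
        self.vocab = set()                       # shared vocabulary

    def train(self, text, lang):
        words = text.lower().split()
        self.word_counts[lang].update(words)
        self.doc_counts[lang] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, float("-inf")
        for lang, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[lang] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


nb = UnigramNB()
nb.train("hola que tal amigo", "es")       # toy training tweets (assumed data)
nb.train("buenos dias a todos", "es")
nb.train("hello how are you", "en")
nb.train("good morning everyone", "en")
print(nb.predict("hola buenos amigo"))
```

Replacing word unigrams with character n-grams would only require changing the tokenization step, which makes this setup convenient for the kind of feature comparison the article reports.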