Project description

This Google Summer of Code 2018 project was devoted to sentiment analysis annotation (with an emphasis on figurative language) and the interpretation of political discourse in anglophone countries.  

From a linguistic perspective, figurative language is hugely prevalent in tweets: it is concise, catchy, and gets the point across. From a computational perspective, idioms and metaphors are notoriously difficult to classify due to their heterogeneous nature. However, careful annotation and methods like the MWE tokenizer (NLTK) can successfully tackle this challenge.
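As a rough illustration of how a multi-word expression (MWE) tokenizer can help, the sketch below uses NLTK's `MWETokenizer` to merge known idioms into single tokens before classification. The idiom list, the `merge_idioms` helper, and the sample sentence are hypothetical and chosen only for illustration; they are not taken from the project's actual lexicon or pipeline.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical idiom list for illustration only; the project's real
# inventory of figurative expressions is not reproduced here.
IDIOMS = [
    ("drain", "the", "swamp"),
    ("witch", "hunt"),
]

# Each idiom is merged into one token, with its words joined by "_".
tokenizer = MWETokenizer(IDIOMS, separator="_")


def merge_idioms(text):
    """Lowercase the text, split on whitespace, and merge known idioms."""
    return tokenizer.tokenize(text.lower().split())


tokens = merge_idioms("Time to drain the swamp and end this witch hunt")
print(tokens)
# ['time', 'to', 'drain_the_swamp', 'and', 'end', 'this', 'witch_hunt']
```

Treating an idiom as a single token lets a downstream classifier learn a weight for the whole expression rather than for its individual words, which often carry a different literal sentiment. Whitespace splitting is a simplification here; real tweets would likely need a tokenizer that handles hashtags, mentions, and punctuation.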

The corpus currently comprises over 7000 politically related tweets and contains several categories (for instance, metadata, polarity, mood, and speech act).

The report includes a detailed description of the methodology devised in the process of manual annotation. Three categories of results are presented: a comprehensive statistical summary of the corpus, quantitative results (yielded by the automatic classifiers trained on the dataset), and qualitative results. The ensuing discussion is largely concerned with possible reasons for the relative scarcity of figurative expressions in the corpus. Finally, the conclusions briefly explain the limitations of the study and offer several ideas for future research involving the corpus.


A screenshot of the MAGA corpus

