Grimmer and Stewart's mantra to “validate, validate, validate” is well known in the social sciences, but how can a researcher strike an ideal balance between rigor and efficiency?
At the CLARIN Annual Conference 2019 in Leipzig, SSHOC partners organised a masterclass for political and social scientists with an interest in using large text collections in their research. This event contributed to two major SSHOC objectives: developing relevant and applicable tools for specific user communities and empowering those communities to actively use such tools. The masterclass addressed the challenges that political and social scientists encounter when they need to validate findings obtained through quantitative analysis of text corpora.
The masterclass was offered by Prof. Dr. Andreas Blätte, head of the PolMine project and developer of the polmineR R-package, and Christoph Leonhardt (both University of Duisburg-Essen). They presented common research strategies, talked about why implementing validation remains a technological frontier, mapped out various validation requirements and offered suggestions on how to satisfy the need for validation.
Andreas Blätte elaborated on the required integration of quantitative and qualitative approaches to corpus analysis, and suggested that the combination of the two approaches be described by a new term: quanlification. Although validation by quanlification is needed to achieve valid and sound research results, Blätte noted that such validation is inhibited by technical restrictions. A set of scenarios and workflows implemented with his polmineR R-package was therefore presented as a potential way forward. The workflows covered counting, co-occurrence analysis, sentiment analysis, text classification, and Latent Dirichlet Allocation (LDA) topic modelling.
Given the various disciplinary backgrounds of the attendees – ranging from computer science to the humanities – these workflows were introduced with a focus on the validation of their output rather than on the production of code. However, participants were given ample opportunity to experiment with the polmineR R-package in order to gain experience with implementing validation strategies.
Over the course of the day, the participants discussed the possibilities and limits of validation in depth. A shared understanding emerged that integrating quantitative and qualitative approaches to corpus analysis is central to these endeavours. Validating the algorithmically derived findings of quantitative approaches against the initial text is necessary for a more complete insight into both the data and what a method actually measures, which in turn helps ensure intersubjective and valid research.
So, when counting words, their contexts have to be taken into account. When calculating co-occurrences, the output should be filtered by its actual semantic meaning. Sentiment analyses should take into account the complex nature and ambiguity of human speech and hence be evaluated carefully. And machine learning approaches need to be checked by looking back at the initial data.
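The first of these points, checking a word count against the contexts in which the word occurs, can be illustrated with a minimal sketch. It is written in plain Python and does not use polmineR or reproduce its API; the function names (`tokenize`, `count_term`, `kwic`) and the three example sentences are invented for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())

def count_term(tokens, term):
    """Quantitative step: how often does the term occur?"""
    return sum(1 for tok in tokens if tok == term)

def kwic(tokens, term, window=3):
    """Qualitative step: return (left context, term, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Toy documents, made up for this sketch (not taken from any real corpus).
speeches = [
    "The integration of refugees requires language courses.",
    "European integration has stalled since the treaty.",
    "We tested numerical integration of the model equations.",
]

tokens_per_doc = [tokenize(s) for s in speeches]
total = sum(count_term(toks, "integration") for toks in tokens_per_doc)
concordance = [line for toks in tokens_per_doc for line in kwic(toks, "integration")]

print("hits:", total)
for left, node, right in concordance:
    # Reading the contexts shows that the raw count mixes three different senses.
    print(f"{left:>25} | {node} | {right}")
```

The count alone reports three hits, while the concordance lines make visible that the hits refer to migration policy, European politics and mathematics respectively. That is the kind of check a researcher would want before interpreting the number.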
The polmineR R-package is a tool that has the philosophy of quanlification at its core. It offers both qualitative and quantitative approaches to corpus analysis and always allows the full text to be reconstructed. The discussion at the end of the session offered a great opportunity to elaborate on the package’s design by presenting workflows which live up to these standards.
In an upcoming webinar planned for spring 2020, Andreas Blätte will present the potential of polmineR for quanlification. If you are struggling to implement validation for results derived from large text collections, or simply want to try out a new tool, sign up for the SSHOC newsletter and be the first to know about the webinar and other SSHOC activities.
Photo: Andreas Blätte talking about text analysis