- Monday | 02/09/2019 to Wednesday | 04/09/2019
- All Day
European Symposium on Societal Challenges in Computational Social Science (Euro CSS) in Zurich
Challenges and Opportunities in Automated Coding of Contentious Political Events
Erdem Yörük (Sociology, Koç University and University of Oxford)
Ali Hürriyetoğlu (Computational Linguistics, Koç University)
Çağrı Yoltar (Anthropology, Koç University)
Fırat Durusan (Political Science, Ankara University and Koç University)
Osman Mutlu (Computer Science, Koç University)
Aline Villavicencio (Computer Science, University of Essex)
Planned duration of the workshop: One day
Collecting protest and conflict event information from news sources enables historical and
comparative studies of social movements in the social and political sciences. As event data
collection covers more countries, longer time periods, and finer detail and granularity, which are
more abundant in local than in international sources, its utility in social science
applications multiplies. Given the excessive time and human effort that manual data collection would
require, there is an increasing tendency to rely on machine learning and natural language processing
(NLP) methods to develop automated classification and extraction tools that may cope
better with the volume and variety of data to be collected.
As an interdisciplinary team of researchers (composed of computer scientists, computational linguists
and social scientists), we have been working on automated protest information collection for over two
years. This is a sub-task of our ERC-funded research (https://emw.ku.edu.tr) which seeks to explain
welfare system changes in emerging market economies in relation to the shifting trends of local protest
movements. In an effort to capture these events, we are building protest databases for six case
countries (Brazil, China, India, Mexico, South Africa and Turkey) by extracting protest information
from local and international news sources. In this endeavor, we rely on automated methods, and use
machine learning and generalizable NLP. While building our protest information collection models
and designing our methodology, we have encountered many of the well-known challenges of
automated event extraction, ranging from the problem of source selection to the concerns about
completeness and validity of the data to the issues of generalizability (Wang et al. 2016). This
workshop will address these issues and potential methods to tackle them with new methodologies and
The need for collecting protest or conflict data has been addressed by manual, semi-automatic and
automatic approaches. However, the results these approaches have yielded to date are
either of insufficient quality or require tremendous effort to replicate on new data. Recent reviews
point to major causes for concern in existing protest databases, such as insufficient validity and
reliability, inconsistencies within and between corpora, and lack of generalizability in terms of
methodologies and results. On the one hand, manual or semi-automatic methods require high-quality
human effort; on the other hand, text classification and information extraction systems tend not to
perform as well on corpora from a setting that differs from the one used for training.
The aforementioned shortcomings stem mainly from the lack of regard for the variable nature of
contentious politics, which takes slightly different forms in different countries and time periods, in line
with the spatial and temporal variation of sociopolitical phenomena. Those who attempt to tackle this
problem usually resort to methods that are not fully automated, such as key term-based filtering of
sources, which makes variability more manageable but sacrifices recall, so that an
undetermined amount of information is missed from the outset. Also, models trained on a single
case or on filtered data yield static tools that are less capable of comparable recall
and precision when applied to contexts different from those they were trained on. This is also a
significant factor in the validity, reliability and consistency problems facing existing protest databases.
Protest event ontologies and automated tools should be developed in a way that can handle the dynamism
of the context and source variability.
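The recall cost of key term-based filtering can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the headlines, their labels, and the keyword list are invented and do not come from any project's data or tooling. It only shows how a filter built on a fixed term list can miss protest reports that use other vocabulary while also admitting irrelevant reports.

```python
# Toy illustration: all headlines, labels, and keywords below are invented.
KEYWORDS = {"protest", "strike", "rally"}

# (headline, is_protest_event) pairs; hypothetical hand-labelled examples.
DOCS = [
    ("Workers strike over unpaid wages", True),
    ("Thousands rally against new tax law", True),
    ("Farmers block highway with tractors", True),
    ("Students occupy university building", True),
    ("Lightning strike damages power station", False),
    ("Central bank holds interest rates", False),
]


def keyword_filter(text, keywords=KEYWORDS):
    """Return True if any keyword appears as a whitespace-separated token."""
    tokens = set(text.lower().split())
    return bool(tokens & keywords)


def precision_recall(docs, predict):
    """Compute precision and recall of a binary filter on labelled docs."""
    tp = sum(1 for text, label in docs if predict(text) and label)
    fp = sum(1 for text, label in docs if predict(text) and not label)
    fn = sum(1 for text, label in docs if not predict(text) and label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = precision_recall(DOCS, keyword_filter)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented sample the filter keeps only two of the four protest reports, because the road blockade and the occupation use none of the listed terms, and it also admits the unrelated "lightning strike" story.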
This workshop will work to develop solutions for these methodological issues in a collective manner.
In general, there is a lack of scientific collaboration among academic groups working on event-coding
programs (Wang et al. 2016; Lorenzini et al. 2016), and the most important objective of this
workshop is to fill this gap and connect researchers. With this workshop, we hope to contribute
to the formation of a collaborative environment for automated event coding, which has
increasingly become the target of ever-growing scientific interest, budgets and efforts.
A description of the proposed event format and a detailed list of proposed activities:
We propose to organize four different activities in the workshop.
1. Individual participants present their automated event coding projects. We will ask them to
describe
a. The method of the project
b. News sources
c. Theoretical perspectives that define their event definition
2. We will ask each participating project to address each of the following major
concerns in automated event coding, mostly organized around the “completeness” and
“validity” of the data. They will discuss how they tackle these issues in 15 minutes and in
two PowerPoint slides.
Source selection problem: A major question is whether international or local news
sources should be used in data collection, and which specific sources. This problem,
which concerns us to an extent, stems from the well-known biases of reporting (or
representation in general): some sources cover protest news more broadly than others,
and different sources report the same events differently. This is a serious issue
especially for manual or semi-automated coding, because such projects have to limit their
corpora to a single source or at most a couple of sources. Given these limitations,
selecting the source(s) that would provide the most extensive and reliable coverage
becomes quite a challenge.
Inconsistent corpora over time: The number and variety of sources used may change
over time, leading to inconsistency in the amount of processed data across different
time periods as well as unreliable protest-event trends.
Report selection problem: The problem of choosing articles that are relevant for the
coding of political protest exposes the well-known limitations of the keyword search
method for filtering and motivates the use of ML/NLP classifiers.
Information Extraction problem: In addition to classification (or report selection),
protest database building requires an effort to identify key event characteristics. The
workshop will discuss strategies to resolve these issues such as how to identify event
characteristics (place, participant, organizer, etc.) without relying on pre-prepared lists
of keywords about actors, places, organizations etc.
Data loss: Since event occurrences do not partition perfectly into sentences, a
focus on individual sentences prevents us from controlling for the context.
Duplication Problem: Available protest databases (manual or automated) try to
de-duplicate the events in an effort to provide an accurate number of “real-world” events
in designated contexts. Is this necessary and how should it be conducted?
(Training) Data collection/annotation processes: These processes directly affect
data quality. As EMW, we work meticulously on our annotation and adjudication
processes, and the structure we have developed may serve as
an example for other teams.
Scientific Tools: Is there a reliability problem in the literature? Do projects using
similar coding rules produce similar results? Is there a problem of outdated
technology? The text-processing systems used in event coding are still similar to ones
developed more than 20 years ago while a range of tools in the field of text processing
has been developed. Is this a big challenge?
Copyright issues: Obtaining the permissions from news sources to share annotated
materials in shared tasks or in event databases is a common challenge that most
researchers face in the field.
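As one possible starting point for the duplication discussion above, the following is a minimal sketch of flagging candidate duplicate event records. It is not the method of any existing database: the record fields, the sample events, the token-overlap (Jaccard) measure, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical extracted event records; all values are invented.
EVENTS = [
    {"id": 1, "date": "2019-05-01", "place": "Istanbul",
     "text": "workers march on labour day in istanbul"},
    {"id": 2, "date": "2019-05-01", "place": "Istanbul",
     "text": "labour day march by workers in istanbul"},
    {"id": 3, "date": "2019-05-01", "place": "Ankara",
     "text": "workers march on labour day in ankara"},
    {"id": 4, "date": "2019-06-12", "place": "Istanbul",
     "text": "students protest tuition fees"},
]


def jaccard(a, b):
    """Token-set overlap between two strings, between 0.0 and 1.0."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def candidate_duplicates(events, threshold=0.6):
    """Return id pairs that share date and place and have similar text."""
    pairs = []
    for e1, e2 in combinations(events, 2):
        same_key = e1["date"] == e2["date"] and e1["place"] == e2["place"]
        if same_key and jaccard(e1["text"], e2["text"]) >= threshold:
            pairs.append((e1["id"], e2["id"]))
    return pairs


pairs = candidate_duplicates(EVENTS)
print(pairs)
```

Requiring an exact match on date and place is a deliberately strict choice in this sketch. Whether reports of simultaneous marches in different cities count as one event or several is exactly the kind of definitional question the discussion would need to settle.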
3. Annotation manual discussions:
There will be a discussion of the main characteristics of
annotation manuals, with the purpose of reaching a common denominator that spans all
annotation manuals and thereby enables collaboration in annotation efforts.
4. Special issue:
We will propose to edit a special issue out of the workshop, to be published in a
top computational linguistics or political science journal. In this section, participants will
discuss the specific topic of this special issue and possible journals for consideration.
Lorenzini, J., Makarov, P., Kriesi, H., & Wueest, B. (2016). Towards a Dataset of Automatically
Coded Protest Events from English-language Newswire Documents. Paper presented at the
Amsterdam Text Analysis Conference (http://bruno-
Wang, W., Kennedy, R., Lazer, D., & Ramakrishnan, N. (2016). Growing pains for global
monitoring of societal events. Science, 353(6307), 1502-1503.
Weidmann, N. B., & Rød, E. G. (forthcoming). The Internet and Political Protest in Autocracies,
Chapter 4. Oxford University Press.