WP2 will create the first comparable protest event database on emerging markets, using local news sources and computer-based methods.
There will be three subtasks under WP2:
WP2.1: Acquisition of newspaper data: The project will use two newspapers from each country.
WP2.2: Coding of the training data: The machine-learning algorithm will (i) classify protest-related news and (ii) extract the components of protest information (participants, place, ethnicity, etc.) into a database. The algorithm can function only after it has been trained on real data coded by human coders; it will learn and imitate the manual coding process. Therefore, for each country case, the project team will classify 1,000 protest-related news items and manually extract the different components of protest data.
WP2.3: Programming of the natural language processing and machine-learning algorithm and processing the data: Information extraction, a field of computer science, deals with the problem of automatically extracting information from unstructured text. Over the last two decades, many different research communities have contributed techniques from machine learning, databases, information retrieval and computational linguistics to various aspects of the information extraction problem (Sunita, 2008). Computational techniques clearly reduce the workload involved in selecting and coding articles and remove inter-coder and intra-coder reliability problems (Krippendorff 2004, Wüest et al. 2013). Considering the amount of data to analyse (in the Turkish case, an average of 3 protest events per day over the last 30 years yielded a 30-thousand-entry database), using computational methods seems inevitable. Still, automated protest content analysis techniques are certainly underutilized in the scholarly community (Hutter 2014), and this project will be the first study to create a comparable protest dataset for emerging market countries.
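To illustrate the train-then-classify logic of WP2.2 and WP2.3, the following is a minimal sketch of step (i), the classification of protest-related news, under stated assumptions. It uses a small multinomial Naive Bayes classifier written in plain Python; the proposal does not specify which algorithm the project will use, so this choice, the function names (`tokenize`, `train_naive_bayes`, `classify`) and the six example sentences are all hypothetical and stand in for the manually coded training set. Step (ii), the extraction of participants, place and other components, is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labelled_articles):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    doc_counts = Counter()   # number of training articles per label
    word_counts = {}         # label -> Counter of token frequencies
    vocabulary = set()       # every token seen during training
    for text, label in labelled_articles:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return {
        "doc_counts": doc_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
    }


def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior of the label
        for token in tokenize(text):
            if token in model["vocabulary"]:   # skip words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-in for the manually coded training data.
TRAINING = [
    ("Thousands of workers joined a strike outside the factory", "protest"),
    ("Students held a rally against tuition fees in the capital", "protest"),
    ("Demonstrators marched to parliament demanding reform", "protest"),
    ("The central bank kept interest rates unchanged", "other"),
    ("The football club signed a new striker this season", "other"),
    ("Heavy rain is expected across the region this weekend", "other"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, "Workers held a strike and a rally"))
print(classify(model, "The central bank raised interest rates"))
```

With only six training sentences the model is a toy; the point is only that the classifier's behaviour is determined entirely by the human-coded examples it is given, which is why the manual coding in WP2.2 precedes the programming in WP2.3.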
The collected data will be analysed both as a time-series indicator and as an independent variable in a pooled cross-sectional time-series multivariate regression analysis, in order to establish causal relations between protest waves and welfare policy changes as part of WP3.
The protest database will count the number of events such as strikes, rallies, boycotts, protests, riots, and demonstrations, i.e. the “repertoire of contention” (Tarrow 1994, Tilly 1984).
WP2 does not intend to produce an exhaustive count of all, or even most, political events, since newspapers report on only a fraction of them (Davenport 2009, Earl et al. 2004, Ortiz et al. 2005, Silver 2003). Instead, it intends to create a measure of the changing levels of grassroots political events over time and space.
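The kind of measure described above can be sketched as a simple aggregation of coded events into counts per country and year. The record layout (`country`, `year`, `event_type`), the function names and the sample records below are hypothetical and serve only as an illustration; they are not the project's database schema or data.

```python
from collections import Counter


def count_by_type(events):
    """Count coded events per (country, year, event_type) cell."""
    return Counter((e["country"], e["year"], e["event_type"]) for e in events)


def count_by_country_year(events):
    """Count all coded events per (country, year), ignoring the event type."""
    return Counter((e["country"], e["year"]) for e in events)


# Hypothetical coded events, one record per reported protest event.
EVENTS = [
    {"country": "Country A", "year": 2013, "event_type": "strike"},
    {"country": "Country A", "year": 2013, "event_type": "rally"},
    {"country": "Country A", "year": 2013, "event_type": "strike"},
    {"country": "Country A", "year": 2014, "event_type": "boycott"},
    {"country": "Country B", "year": 2013, "event_type": "demonstration"},
]

by_type = count_by_type(EVENTS)
totals = count_by_country_year(EVENTS)

# Print the country-year series in a stable order.
for (country, year), n in sorted(totals.items()):
    print(country, year, n)
```

Counts of this form track relative changes in reported events across countries and years rather than the true number of events, which is consistent with the stated aim of measuring changing levels rather than producing an exhaustive count.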
Protest event analysis is frequently used by social movement scholars because it is an unobtrusive and context-sensitive technique that can convert unstructured material into large volumes of data suitable for cross-national, cross-time and cross-issue comparison (Hutter 2014, Krippendorff 2004).
Newspaper archives are the most reliable source for creating protest event databases, as they offer access, selectivity, reliability, continuity over time and ease of coding (Franzosi 2004). Previously, a limited number of similar studies used newspapers that report on global events, such as The New York Times or The Times (Silver 2006), but these studies were interested in making global arguments.
This study will draw its indicators directly from local news sources in order to increase explanatory power at the national level.