MY459 - Quantitative Text Analysis

Lent Term 2021


Course Information

No lectures or classes will take place during School Reading Week (Week 6).

Week  Topic                                              Instructor
1     Overview and Fundamentals                          BM
2     Descriptive Statistical Methods for Text Analysis  BM
3     Automated Dictionary Methods                       BM
4     Machine Learning for Texts                         BM
5     Supervised Scaling Models for Texts                BM
6     Reading Week                                       -
7     Unsupervised Models for Scaling Texts              BM
8     Similarity and Clustering Methods                  BM
9     Topic Models                                       FG
10    Word Embeddings                                    FG
11    Current Topics                                     FG

Course Description

The course surveys methods for systematically extracting quantitative information from political text for social scientific purposes. It moves from classical content analysis and dictionary-based methods, through classification methods, to state-of-the-art scaling methods and topic models that estimate quantities from text using statistical techniques. The course lays a theoretical foundation for text analysis but mainly takes a practical, applied approach, so that students learn how to apply these methods in actual research. All of the methods can be reduced to a three-step process: first, identifying texts and units of texts for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course covers these methods in a logical progression, applying each technique to real texts with appropriate software.
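As a rough illustration of the three-step process, the following minimal sketch builds a tiny document-feature matrix and computes one quantity from it. It is written in plain Python purely for illustration (the course itself works in R with quanteda); the two toy texts and the names `texts`, `tokenize`, `dfm`, and `relative_frequency` are hypothetical and not part of the course materials.

```python
import re
from collections import Counter

# Step 1: identify the texts and units of analysis (each toy document is one unit).
texts = {
    "doc_a": "Taxes should fall. Lower taxes help growth.",
    "doc_b": "Public spending should rise to help schools.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: extract features (word counts) and arrange them as a matrix
# with one row per document and one column per feature.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
features = sorted(set().union(*counts.values()))
dfm = {name: [c[f] for f in features] for name, c in counts.items()}


# Step 3: analyse the matrix; here, simply the share of tokens that are a given word.
def relative_frequency(doc, feature):
    row = dfm[doc]
    return row[features.index(feature)] / sum(row)


for name in dfm:
    print(name, round(relative_frequency(name, "taxes"), 3))
```

Real applications differ mainly in scale and in the choice of features at step two; the overall shape of the pipeline stays the same.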


The course is also designed to cover many fundamental issues in quantitative text analysis such as inter-coder agreement, reliability, validation, accuracy, and precision. It focuses on methods of converting texts into quantitative matrices of features, and then analysing those features using statistical methods. The course briefly covers the qualitative technique of human coding and annotation but only for the purposes of creating a validation set for automated approaches. These automated approaches include dictionary construction and application, classification and machine learning, scaling models, and topic models. For each topic, we will systematically cover published applications and examples of these methods, from a variety of disciplinary and applied fields but focusing on political science. Lessons will combine theoretical grounding in content analysis approaches and techniques with hands-on analysis of real texts using content analytic and statistical software.


Students must have completed Applied Regression Analysis (MY452) or equivalent.

Students in this course will strongly benefit from prior experience with the R statistical package. All methods will be implemented in R, using primarily the R package quanteda, available from CRAN.


Summative Assignments

Problem sets will be assigned at the beginning of each lab session. They involve computer exercises applied to texts supplied by the instructor, must be submitted via GitHub Classroom by their due date, and together count for 60% of the course grade.

Summative Project

A final project of 3,000 words (5,000 words for MY559 students) will be due at the beginning of ST (on May 4th at 5pm) and will count for 40% of the course grade. This will be an original analysis of texts using some of the methods covered in class, and may focus on replicating or extending a published work. Additional guidelines are available here.

Assessment criteria

Assignments will be marked using the following criteria:

Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly.

No single textbook covers computerized or quantitative text analysis well. Although it is not an ideal fit for our core purpose, Krippendorff’s classic Content Analysis (recently updated) is a good primer for manual methods of content analysis and covers some of the same fundamentals faced in quantitative text analysis.

Other readings will consist of articles and book excerpts, as listed below, which will either be made available via Moodle or through the links below.

Cheat Sheets

Cheat sheets contain useful code examples to get you started. Please refer to these materials before you book office hours!

Regular Expressions


A large proportion of the materials were adapted from content developed by Prof. Kenneth Benoit and Dr. Pablo Barberá for previous versions of this course. Some of the assignments were developed by Christian Mueller and Akitaka Matsuo.


Week 1: Overview and fundamentals

This session will cover fundamentals, including the continuum from traditional (non-computer assisted) content analysis to fully automated quantitative text analysis. We will cover the conceptual foundations of content analysis and quantitative content analysis, discuss the objectives, the approach to knowledge, and the particular view of texts when performing quantitative analysis.


Further Reading:

Week 2: Descriptive statistical methods for text analysis

Here we focus on quantitative methods for describing texts, in particular summary measures that highlight specific characteristics of documents and allow documents to be compared. We will also discuss issues including where to obtain textual data; formatting and working with text files; indexing and meta-data; units of analysis; and definitions of features and measures commonly extracted from texts, including stemming and stop-words.
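One example of such a summary measure is the type-token ratio, the number of distinct words (types) divided by the total number of words (tokens). The sketch below is a minimal plain-Python illustration, not course code; the stop-word list `STOPWORDS` is a deliberately tiny, hypothetical list, and the function names are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text, remove_stopwords=False):
    """Return distinct word types divided by total tokens (0.0 for empty input)."""
    toks = tokens(text)
    if remove_stopwords:
        toks = [t for t in toks if t not in STOPWORDS]
    if not toks:
        return 0.0
    return len(set(toks)) / len(toks)


sentence = "The cat and the dog chased the cat"
print(type_token_ratio(sentence))                         # 5 types / 8 tokens
print(type_token_ratio(sentence, remove_stopwords=True))  # 3 types / 4 tokens
```

The two calls show how a pre-processing choice such as stop-word removal changes a descriptive statistic for the same text.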


Further Reading:

Seminar Materials: Click here to access seminar materials when instructed.

Week 3: Automated dictionary methods

Automatic dictionary-based methods involve association of pre-defined word lists with particular quantitative values assigned by the researcher for some characteristic of interest. This topic covers the design model behind dictionary construction, including guidelines for testing and refining dictionaries. Hands-on work will cover commonly used dictionaries such as LIWC, RID, and the Harvard IV-4, with applications. We will also review a variety of text pre-processing issues and textual data concepts such as word types, tokens, and equivalencies, including word stemming and trimming of words based on term and/or document frequency.
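The core mechanics of applying a dictionary can be sketched in a few lines. The example below is an assumption-laden illustration in plain Python: `TOY_DICTIONARY` is an invented two-category word list (it is not LIWC, RID, or the Harvard IV-4), and the "net tone" score is just one simple way a researcher might turn category counts into a single value.

```python
import re
from collections import Counter

# Invented toy dictionary: category -> set of words (not a published dictionary).
TOY_DICTIONARY = {
    "positive": {"good", "great", "success"},
    "negative": {"bad", "crisis", "failure"},
}


def apply_dictionary(text, dictionary):
    """Count tokens matching each category; return (category counts, total tokens)."""
    toks = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in toks:
        for category, words in dictionary.items():
            if tok in words:
                counts[category] += 1
    return {category: counts[category] for category in dictionary}, len(toks)


def net_tone(text, dictionary):
    """Positive minus negative matches, as a share of all tokens."""
    counts, n_tokens = apply_dictionary(text, dictionary)
    if n_tokens == 0:
        return 0.0
    return (counts["positive"] - counts["negative"]) / n_tokens


speech = "The reform was a great success despite the crisis."
print(apply_dictionary(speech, TOY_DICTIONARY))
print(net_tone(speech, TOY_DICTIONARY))
```

Exact-match lookup like this ignores stemming and wildcard patterns, which is one reason the pre-processing decisions discussed in this session matter for dictionary results.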


Further Reading:

Week 4: Machine Learning for Texts

Classification methods permit the automatic classification of texts in a test set once a model has been trained on a labelled training set. We will introduce machine learning methods for classifying documents, including one of the most popular classifiers, the Naive Bayes model. The topic also introduces validation and reporting methods for classifiers and discusses where these methods are applicable.
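To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python rather than the R tools used in the course. The four training documents, their labels, and the function names are invented for illustration; words never seen in training are simply ignored at prediction time, which is one common convention rather than the only option.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (tokens, label) pairs from the training set."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    vocab = set()
    for toks, label in labelled_docs:
        class_docs[label] += 1
        word_counts[label].update(toks)
        vocab.update(toks)
    return class_docs, word_counts, vocab


def predict(toks, model):
    """Return the class with the highest log posterior for the token list."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for tok in toks:
            if tok in vocab:  # ignore words unseen in training
                # add-one smoothed log likelihood of the word given the class
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


training = [
    ("cut taxes now".split(), "right"),
    ("lower taxes and less spending".split(), "right"),
    ("raise spending on schools".split(), "left"),
    ("more spending on health".split(), "left"),
]
model = train_naive_bayes(training)
print(predict("cut taxes".split(), model))
print(predict("spending on schools".split(), model))
```

In practice the predictions would be compared against held-out labels to report accuracy, precision, and recall, which is the validation step this session introduces.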


Further Reading:

Seminar Materials: Click here to access seminar materials when instructed.

Week 5: Supervised Scaling Models for Texts

Building on the Naive Bayes classifier, we introduce the “Wordscores” method of Laver, Benoit and Garry (2003) for scaling latent traits, and show the link between classification and scaling.
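The basic scoring step of Wordscores can be sketched as follows. Each word receives a score equal to the average of the reference-text positions, weighted by the probability that the word comes from each reference text; a new ("virgin") text is then scored as the frequency-weighted mean of its word scores. This plain-Python sketch is a simplified illustration only: it omits the rescaling of virgin-text scores and the uncertainty estimates, and the toy texts, positions, and function names are invented.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Share of each word among all tokens in one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Score each word as the P(reference | word)-weighted mean of reference positions."""
    freqs = {ref: relative_frequencies(toks) for ref, toks in reference_texts.items()}
    vocab = set().union(*freqs.values())
    scores = {}
    for word in vocab:
        total = sum(freqs[ref].get(word, 0.0) for ref in freqs)
        scores[word] = sum(
            freqs[ref].get(word, 0.0) / total * reference_positions[ref] for ref in freqs
        )
    return scores


def score_text(tokens, scores):
    """Mean word score over the tokens that appear in the reference texts."""
    scored = [tok for tok in tokens if tok in scores]
    if not scored:
        raise ValueError("no scored words in this text")
    return sum(scores[tok] for tok in scored) / len(scored)


# Invented reference texts with assumed positions of -1 (left) and +1 (right).
references = {
    "left": "spending spending schools taxes".split(),
    "right": "taxes taxes markets spending".split(),
}
positions = {"left": -1.0, "right": 1.0}

scores = word_scores(references, positions)
print(scores["taxes"], scores["spending"])
print(score_text("taxes markets taxes unknown".split(), scores))
```

The weighting by P(reference | word) is what links the method to the Naive Bayes classifier: both rest on how strongly each word is associated with each labelled text.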


Further Reading:

Week 7: Unsupervised Models for Scaling Texts

This session continues text scaling using unsupervised scaling methods, based on parametric approaches modelling features as Poisson distributed (Wordfish and Wordshoal) or non-parametric approaches such as correspondence analysis.


Further Reading:

Seminar Materials: Click here to access seminar materials when instructed.

Week 8: Similarity and clustering methods

This session covers vector representations of documents, measures of distance and similarity, and hierarchical and k-means clustering. It also revisits feature selection and weighting methods, especially tf-idf.
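A short sketch may help connect weighting and similarity. The example below, in plain Python, weights raw term counts by log(N / document frequency), which is one common tf-idf variant among several, and then compares documents with cosine similarity. The three toy documents and the function names are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term count by log(N / document frequency); docs maps name -> tokens."""
    n_docs = len(docs)
    doc_freq = Counter()
    for toks in docs.values():
        doc_freq.update(set(toks))  # count each word once per document
    weighted = {}
    for name, toks in docs.items():
        term_freq = Counter(toks)
        weighted[name] = {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(word, 0.0) for word, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "a": "tax cut tax".split(),
    "b": "tax cut growth".split(),
    "c": "school funding growth".split(),
}
weights = tf_idf(docs)
print(round(cosine(weights["a"], weights["b"]), 3))  # documents sharing vocabulary
print(round(cosine(weights["a"], weights["c"]), 3))  # documents sharing no vocabulary
```

Note that under this weighting a word appearing in every document receives a weight of zero, so it contributes nothing to similarity; clustering methods such as k-means operate on exactly these kinds of weighted vectors.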


Further Reading:

Week 9: Topic models

This session will teach how to automatically classify documents into unknown categories using topic models. We will learn how to run the parametric Latent Dirichlet Allocation (LDA) model and the Structural Topic Model (STM), which allows researchers to use covariates to learn about the prevalence and content of topics.


Further Reading:

Seminar Materials: Click here to access seminar materials when instructed.

Week 10: Word embeddings

This week will cover vector representation of words as an alternative way to construct document-feature matrices, with particular attention to word embeddings as a popular type of vector space representation.


Further Reading:

Week 11: Current topics

This week will give a high-level outlook on some neural-network-based models that go beyond the bag-of-words assumption. In addition, we will introduce the Twitter API. In coding examples we will look at a topic that combines both areas: trying to predict whether a sentence expresses approval or disapproval. We obtain training data through the Twitter API, process the data, and then train a range of machine learning classifiers.


Further Reading:

Seminar Materials: Click here to access seminar materials when instructed. Download datasets here.


Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23(1):76–91. doi: 10.1093/pan/mpu011.

Beauchamp, N. 2017. “Predicting and Interpolating State‐Level Polls Using Twitter Textual Data.” American Journal of Political Science, 61(2), 490-503.

Beil, F, M Ester and X Xu. 2002. Frequent term-based text clustering. In Eighth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 436–442.

Benoit, K. and M. Laver. 2008. “Compared to What? A Comment on ‘A Robust Transformation Procedure for Interpreting Political Text’ by Martin and Vanberg.” Political Analysis 16(1):101–111. doi: 10.1093/pan/mpm020.

Benoit, Kenneth and Paul Nulty. 2013. “Classification Methods for Scaling Latent Political Traits.” Presented at the Annual Meeting of the Midwest Political Science Association, April 11–14, Chicago.

Blei, David M. 2012. “Probabilistic topic models.” Communications of the ACM 55(4):77. doi: 10.1145/2133806.2133826.

Blei, D.M., A.Y. Ng and M.I. Jordan. 2003. “Latent Dirichlet allocation.” The Journal of Machine Learning Research 3:993–1022.

Caliskan, A., Bryson, J.J., and Narayanan, A. 2017. “Semantics derived automatically from language corpora contain human-like biases”, Science.

Chang, J., J. Boyd-Graber, S. Gerrish, C. Wang and D. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.

Choi, Seung-Seok, Sung-Hyuk Cha and Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics and Informatics 8(1):43–48.

Clinton, J., S. Jackman and D. Rivers. 2004. “The statistical analysis of roll call data.” American Political Science Review 98(2):355–370. doi: 10.1017/s0003055404001194.

Corley, Courtney and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment - EMSEE ’05.

Däubler, Thomas, Kenneth Benoit, Slava Mikhaylov and Michael Laver. 2012. “Natural Sentences as Valid Units for Coded Political Texts.” British Journal of Political Science 42(4):937–951. doi: 10.1017/S0007123412000105.

DuBay, William. 2004. The Principles of Readability. Costa Mesa, California.

Dunning, Ted. 1993. “Accurate methods for the statistics of surprise and coincidence.” Computational Linguistics 19:61–74.

Evans, Michael, Wayne McIntosh, Jimmy Lin and Cynthia Cates. 2007. “Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research.” Journal of Empirical Legal Studies 4(4):1007–1039.

Gilardi, F., Shipan, C. R., & Wueest, B. 2017. “Policy Diffusion: The Issue-Definition Stage.” Working paper, University of Zurich.

Ginsberg, Jeremy, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski and Larry Brilliant. 2008. “Detecting influenza epidemics using search engine query data.” Nature 457(7232):1012–1014.

Grimmer, Justin and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3):267–297. doi: 10.1093/pan/mps028.

Gurciullo, S. and Mikhaylov, S. 2017. “Detecting policy preferences and dynamics in the UN general debate with neural word embeddings”, 2017 International Conference on the Frontiers and Advances in Data Science.

James, Gareth, Daniela Witten, Trevor Hastie and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer Science & Business Media.

Jürgens, Pascal and Andreas Jungherr. 2016. “A Tutorial for Using Twitter Data in the Social Sciences: Data Collection, Preparation, and Analysis.”

Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. 2016. “Measuring public opinion with social media data.” In The Oxford Handbook of Polling and Survey Methods.

Krippendorff, Klaus. 2013. Content Analysis: An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA: Sage.

Lampos, Vasileios, Daniel Preotiuc-Pietro and Trevor Cohn. 2013. A user-centric model of voting intention from Social Media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Lantz, Brett. 2013. Machine Learning with R. Packt Publishing Ltd.

Laver, M. and J. Garry. 2000. “Estimating policy positions from political texts.” American Journal of Political Science 44(3):619–634. doi: 10.2307/2669268.

Laver, Michael, Kenneth Benoit and John Garry. 2003. “Estimating the policy positions of political actors using words as data.” American Political Science Review 97(2):311–331. doi: 10.1017/S0003055403000698.

Loughran, Tim and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66(1):35–65.

Lowe, W. 2008. “Understanding Wordscores.” Political Analysis 16(4):356–371. doi: 10.1093/pan/mpn004.

Lowe, William and Kenneth Benoit. 2013. “Validating Estimates of Latent Traits From Textual Data Using Human Judgment as a Benchmark.” Political Analysis 21(3):298–313. doi: 10.1093/pan/mpt002.

Lowe, William, Kenneth Benoit, Slava Mikhaylov and Michael Laver. 2011. “Scaling Policy Preferences From Coded Political Texts.” Legislative Studies Quarterly 36(1):123–155. doi: 10.1111/j.1939-9162.2010.00006.x.

Lucas, C., Nielsen, R. A., Roberts, M. E., Stewart, B. M., Storer, A., & Tingley, D. 2015. “Computer-assisted text analysis for comparative politics.” Political Analysis, 23(2), 254-277.

Manning, C. D., P. Raghavan and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Martin, L. W. and G. Vanberg. 2007. “A robust transformation procedure for interpreting political text.” Political Analysis 16(1):93–100. doi: 10.1093/pan/mpm010.

Metaxas, Panagiotis T., Eni Mustafaraj and Daniel Gayo-Avello. 2011. How (not) to predict elections. In Privacy, security, risk and trust (PASSAT), 2011 IEEE third international conference on social computing (SocialCom).

Mikolov, T., Chen, K., Corrado, G., & Dean, J. 2013. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781.

Neuendorf, K. A. 2002. The Content Analysis Guidebook. Thousand Oaks CA: Sage.

Pennebaker, J. W. and C. K. Chung. 2008. Computerized text analysis of al-Qaeda transcripts. In The Content Analysis Reader, ed. K. Krippendorff and M. A. Bock. Thousand Oaks, CA: Sage.

Pomeroy, C, Dasandi, N. and S. Mikhaylov. 2018. “Disunited Nations? A Multiplex Network Approach to Detecting Preference Affinity Blocs using Texts and Votes”

Roberts, Margaret E, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson and David G Rand. 2014. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58(4):1064–1082. doi: 10.1111/ajps.12103.

Rooduijn, Matthijs and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34(6):1272–1283.

Ruths, D., & Pfeffer, J. 2014. “Social media for large studies of behavior.” Science, 346(6213), 1063-1064.

Schonhardt-Bailey, C. 2008. “The Congressional Debate on Partial-Birth Abortion: Constitutional Gravitas and Moral Passion.” British Journal of Political Science 38(3):383–410.

Seale, Clive, Sue Ziebland and Jonathan Charteris-Black. 2006. “Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups.” Social Science & Medicine 62(10):2577–2590.

Slapin, Jonathan B. and Sven-Oliver Proksch. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52(3):705–722. doi: 10.1111/j.1540-5907.2008.00338.x.

Spirling, A. and Rodriguez, P.L. 2019. “Word Embeddings. What works, what doesn’t, and how to tell the difference for applied research.”

Steinert-Threlkeld, Z. 2018. “Twitter as Data.” Cambridge University Press.

Tausczik, Y R and James W Pennebaker. 2010. “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods.” Journal of Language and Social Psychology 29(1):24–54.

Young, L., and Soroka, S. 2012. “Affective news: The automated coding of sentiment in political texts.” Political Communication, 29(2), 205-231.

Yu, B., S. Kaufmann and D. Diermeier. 2008. “Classifying Party Affiliation from Political Speech.” Journal of Information Technology and Politics 5(1):33–48.

Zumel, Nina and John Mount. 2014. Practical Data Science with R. Manning Publications.