UNIVERSITY PARK, Pa. — Social scientists rely on data to study social problems. However, data from traditional surveys can be difficult and time-consuming to collect, and can also be inaccurate, since not all factors can be measured well. A National Science Foundation-funded Penn State project will evaluate how accurately Twitter data represents populations across different demographic groups.
According to principal investigator Guangqing Chi, associate professor of rural sociology and demography and public health sciences in the Department of Agricultural Economics, Sociology, and Education, and a Social Science Research Institute co-funded faculty member, Twitter data is generated by a large number of people in real time, is rapidly growing and easily accessible, and is drawing interest from many research disciplines.
“Twitter data has great potential for understanding population dynamics; however, the use of the data has been resisted by social scientists, largely because we know little about the users’ demographic characteristics,” said Chi.
Chi also serves as director of the Computational and Spatial Analysis Core of the Social Science Research Institute and Population Research Institute. He and his team aim to make Twitter data useful for social science research by evaluating how well Twitter users represent, or misrepresent, the population. They will also develop and test data weights that, when applied to Twitter data, will make the results more representative of the population as a whole.
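To illustrate the general idea of data weights, the sketch below uses simple post-stratification: each group's weight is its share of the population divided by its share of the sample. This is a generic textbook technique, not the method the team will develop, and the age groups, counts and population shares are hypothetical numbers chosen only for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Return one weight per group: population share / sample share.

    sample_counts: dict mapping group -> number of sampled users (hypothetical)
    population_shares: dict mapping group -> share of the population (sums to 1)
    """
    total = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        if count == 0:
            # A group with no sampled users cannot be reweighted.
            raise ValueError(f"no sampled users in group {group!r}")
        sample_share = count / total
        weights[group] = population_shares[group] / sample_share
    return weights


# Hypothetical example: younger users are overrepresented in the sample.
sample_counts = {"18-29": 600, "30-49": 300, "50+": 100}
population_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

weights = poststratification_weights(sample_counts, population_shares)

# After weighting, each group's share of the sample matches the population.
weighted_total = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted_shares = {
    g: sample_counts[g] * weights[g] / weighted_total for g in sample_counts
}
```

In this made-up case, overrepresented groups receive weights below 1 and underrepresented groups receive weights above 1, so the weighted sample mirrors the population's age distribution.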
The three-year, $500,000 project will compile geotagged tweets from 2014 to 2017 and compare the data with U.S. county-level census data. The team will refine existing methods for determining users' demographics, such as age, sex and race/ethnicity, and use these values to predict county-wide characteristics. The team will also determine whether Twitter data can be used to estimate migration at the county level by comparing it with Internal Revenue Service migration data, as well as with estimates of migration from Puerto Rico to the continental United States after Hurricane Maria.
“If Twitter data can achieve high levels of validity, it will be a breakthrough for using Twitter data for population research and will significantly advance population science,” said Jennifer Van Hook, Roy C. Buck Professor of Sociology and Demography. The work will be documented so that the researchers’ methods can be applied to other forms of social media.
The project will also help demographers and sociologists strengthen research in many other social science disciplines that use demographic data. It could have a direct and significant impact on small-area population estimation and forecasting by providing real-time estimates of population demographics for small-scale geographies. Such estimates could have many applications, such as enhancing emergency management and disaster response.
Seed funding for the project was provided by the Social Science Research Institute, Population Research Institute, and Institute for CyberScience. Other researchers on the project include Eric Plutzer, professor of political science; Jennifer Van Hook, Roy C. Buck Professor of Sociology and Demography; Heng Xu, associate professor of information science and technology; Junjun Yin, research associate; and Don Miller, research analyst/programmer, all at Penn State.