Twitter Sentiment Geographical Index (TSGI) dataset: A global high-frequency dataset for monitoring Subjective Well-Being

Introduction
Promoting well-being is one of the key targets of the United Nations Sustainable Development Goals. Many national and city governments worldwide are incorporating subjective well-being (SWB) indicators into their agendas to complement traditional objective development and economic metrics. In this study, we develop the Twitter Sentiment Geographical Index (TSGI), a proxy for SWB built by applying natural language processing techniques to a comprehensive archive of 7.4 billion geotagged tweets. In contrast to previous work on SWB, TSGI is not limited to a specific topic, period, or location. Using this data, we construct a high-frequency, multi-year database with global coverage that enables the evaluation of SWB in 163 countries and regions over one decade. Its core product is a detailed sentiment index spanning time and geography, which opens up a wide range of SWB-related research topics. To the best of our knowledge, it is the first SWB dataset at this scale and granularity. The TSGI is a collaborative project between the MIT Sustainable Urbanization Lab and Harvard's Center for Geographic Analysis to study the effect of climate change on human well-being using social-media data. More information can be found on the TSGI website and TSGI presentation here.


Data Availability
This dataset is open to the public and can be accessed on TSGI’s Dataverse repository here. Researchers can also access the national indices, which are updated monthly with new data, at this link.

Data Source
The raw tweet data used to produce the global sentiment and geography index dataset (GSGD) come from the CGA Geotweet Archive v2.0, a global collection of geotagged tweets spanning time, geography, and language, maintained by the Harvard Center for Geographic Analysis. The Archive extends from 2010 to the present and is updated daily; it contains approximately 10 billion tweets. More information on the Archive is available here.

Methodology
The sentiment index for global geotagged tweets is produced in the following steps:

  • First, we vectorize the text of each tweet into a 768-dimensional vector.
  • Then, we feed the vector into a trained neural classifier to obtain a single sentiment score.
  • Finally, we aggregate the scores within each administrative area to represent local subjective well-being.
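
The three steps above can be illustrated with a minimal Python sketch. It is not the TSGI implementation: the source does not name the text encoder or the classifier, so `embed` is a hypothetical hashing-based stand-in that only reproduces the 768-dimensional output size, and `make_toy_classifier` is a hypothetical logistic unit with random, untrained weights. The record layout (area, date, text) and the use of a simple mean for aggregation are likewise assumptions made for illustration.

```python
import hashlib
import math
import random
from collections import defaultdict
from statistics import mean

EMBED_DIM = 768  # vector size stated in the methodology


def embed(text):
    """Step 1 (stand-in): hash tokens into a 768-dimensional unit vector.

    Assumption: this replaces the real text encoder, which the source
    does not specify.
    """
    vec = [0.0] * EMBED_DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % EMBED_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def make_toy_classifier(seed=0):
    """Step 2 (stand-in): logistic unit with random, untrained weights.

    Assumption: a real pipeline would load a trained neural classifier.
    """
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

    def score(vec):
        z = sum(w * v for w, v in zip(weights, vec))
        return 1.0 / (1.0 + math.exp(-z))  # single score in (0, 1)

    return score


def aggregate(records, score):
    """Step 3: average per-tweet scores within each (area, date) bucket."""
    buckets = defaultdict(list)
    for area, date, text in records:
        buckets[(area, date)].append(score(embed(text)))
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical example records: (administrative area, date, tweet text).
tweets = [
    ("US", "2020-01-01", "what a lovely morning"),
    ("US", "2020-01-01", "stuck in traffic again"),
    ("FR", "2020-01-01", "great coffee with friends"),
]
index = aggregate(tweets, make_toy_classifier())
for (area, date), value in sorted(index.items()):
    print(area, date, round(value, 3))
```

Because the weights are random, the printed values carry no meaning; the sketch only shows how per-tweet scores flow into an area-by-date index.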


Applications/Publications
The dataset is being used in several applications, including but not limited to Global Sentiment and Climate Change and Global Sentiment during COVID-19. The following publications are currently under review:

  • Chai, Y., Kakkar, D., Palacios, J., & Zheng, S. (2022). Twitter Sentiment Geographical Index: A global high-frequency dataset for monitoring Subjective Well-Being. Nature Scientific Data (Under Review).
  • Wang, J., Guetta-Jeanrenaud, N., Palacios, J., Fan, Y., Kakkar, D., Obradovich, N., & Zheng, S. (2022). A global nonlinear effect of temperature on human sentiment. Nature Human Behaviour (Under Review).

Questions/Comments
Any questions or comments on this dataset can be sent to Devika Kakkar.