Twitter Sentiment Geographic Index (TSGI) | Center for Geographic Analysis

Introduction
Promoting well-being is one of the key targets of the United Nations Sustainable Development Goals. Many national and city governments worldwide are incorporating subjective well-being (SWB) indicators into their agendas to complement traditional objective development and economic metrics. In this study, we develop the Twitter Sentiment Geographical Index (TSGI), a proxy for SWB, by applying natural language processing techniques to a comprehensive archive of 7.4 billion geotagged tweets. In contrast to previous work on SWB, TSGI is not limited to a specific topic, period, or location. Using this data, we construct a high-frequency, multi-year database with global coverage that enables the evaluation of SWB in 163 countries and regions over one decade. Its core product is a detailed sentiment index spanning time and geography, which opens up a wide range of research questions related to SWB. To the best of our knowledge, it is the first SWB dataset at this scale and granularity. The TSGI is a collaborative project between the MIT Sustainable Urbanization Lab and Harvard's Center for Geographic Analysis to study the effect of climate change on human well-being using social media data. More information can be found on the TSGI website.

Data Availability
This dataset is open to the public and can be accessed through TSGI's Dataverse repository. Researchers can access the national indices there, which are updated monthly with new data. The dataset has been selected as part of the United Nations' Sustainable Development Goals (SDGs) Today dataset and can be explored using the SDGs Today dashboard.

Data Source
The raw tweet data used to produce the global sentiment and geography index dataset (GSGD) come from the CGA Geotweet Archive v2.0, a global collection of geotagged tweets spanning time, geography, and language, maintained by the Harvard Center for Geographic Analysis. The Archive extends from 2010 to the present, is updated daily, and contains approximately 10 billion tweets. More information on the Archive is available from the Center for Geographic Analysis.

Methodology
The sentiment index for global geotagged tweets is produced in the following steps:

First, we vectorize the text of each tweet into a 768-dimensional vector.
Then, we feed the vector into a trained neural classifier to obtain a single sentiment score.
Finally, we aggregate the scores within different administrative areas to represent local subjective well-being.
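
The three steps above can be illustrated with a minimal Python sketch. It is not the TSGI implementation: the hashing-based `embed` function and the logistic `sentiment_score` function are hypothetical stand-ins for the real text encoder and trained neural classifier, and the daily mean used in `aggregate` is an assumed aggregation rule. Only the 768-dimensional vector size and the three-stage structure come from the description above.

```python
import hashlib
import math
from collections import defaultdict

EMBEDDING_DIM = 768  # vector size stated in the methodology


def embed(text):
    """Hypothetical stand-in for the text encoder: hashed bag-of-words vector."""
    vec = [0.0] * EMBEDDING_DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % EMBEDDING_DIM
        vec[index] += 1.0
    return vec


def sentiment_score(vec, weights, bias=0.0):
    """Hypothetical stand-in for the trained classifier: logistic score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))


def aggregate(scored_tweets):
    """Average scores per (area, date); the mean is an assumed aggregation rule."""
    sums = defaultdict(lambda: [0.0, 0])
    for area, date, score in scored_tweets:
        acc = sums[(area, date)]
        acc[0] += score
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def toy_weights(positive_words, negative_words):
    """Illustrative weights only: +1 for positive tokens, -1 for negative tokens."""
    pos, neg = embed(positive_words), embed(negative_words)
    return [p - n for p, n in zip(pos, neg)]


if __name__ == "__main__":
    # Made-up example tweets: (administrative area, date, text).
    tweets = [
        ("US", "2020-01-01", "what a great sunny day"),
        ("US", "2020-01-01", "stuck in awful traffic"),
        ("FR", "2020-01-01", "great coffee this morning"),
    ]
    weights = toy_weights("great sunny happy", "awful sad terrible")
    scored = [(area, date, sentiment_score(embed(text), weights))
              for area, date, text in tweets]
    for key, value in sorted(aggregate(scored).items()):
        print(key, round(value, 3))
```

In practice, `embed` and `sentiment_score` would be replaced by the actual encoder and classifier, and the grouping key in `aggregate` would be whichever administrative level and time window the index is reported at.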

Use-cases/Publications
The dataset is being used for several use cases, including but not limited to Global Sentiment and Climate Change and Global Sentiment during COVID-19. The following publications describe or use the dataset; their status is noted where they are not yet published:

Chai, Y., Kakkar, D., Palacios, J. et al. Twitter Sentiment Geographical Index Dataset. Scientific Data 10, 684 (2023).
Wang, J., Guetta-Jeanrenaud, N., Palacios, J., Fan, Y., Kakkar, D., Obradovich, N., & Zheng, S. (2022). A global nonlinear effect of temperature on human sentiment. Nature Human Behaviour (Under Review).
Zou, Lei, Wanhe Li, Mingzheng Yang, Binbin Lin, and Joynal Abedin. 2023. “The Impact of Social Isolation on Sleep Disturbances – Evidence from Geospatial Big Data during COVID-19.” Abstracts of the ICA 6: 1–2.
Wenting Jiang, Mengxi Zhang, Connor Y.H. Wu, Weichuan Dong (Accepted). "Rural-Urban Differences in the Determinants of Subjective Well-being among X/Twitter Users in the United States." Population, Space and Place.