Harvard CGA and MIT SUL joined forces in 2021 to research how social media data can be used to study the effects of climate change on people’s well-being. To that end, every tweet in Harvard CGA’s Geotweet archive was enriched with two variables, sentiment and geography (Admin 2 boundaries), using GIS data science (the GPU database OmniSci) and machine learning (the BERT algorithm) on Harvard’s FAS Research Computing cluster. This is the first time such enrichment has been achieved at this scale, which involved processing about 10 billion tweets. We believe the enriched archive will be useful for a wide range of research applications that draw on sentiment and geography in Twitter data.
For the next phase of the collaboration, the two teams plan to develop a high-performance system for collecting and processing data from Sina-Weibo, the biggest social media platform in China. As groundwork, MIT SUL has constructed a unique dataset of geotagged posts from Sina-Weibo. This panel dataset contains 9.95 million posts generated by a cohort of 447 thousand active users between January 1, 2018 and June 30, 2021. More information on this dataset can be found here.
The scripts can be found on our GitHub here.
Questions or comments about this project can be sent to Devika Kakkar.