Harvard CGA Geotweet Archive v2.0

Publication information:

2026. “Harvard CGA Geotweet Archive V2.0”. doi:10.7910/DVN/3NCMB6

Abstract

The Harvard Center for Geographic Analysis (CGA) maintains the Geotweet Archive, a global record of tweets spanning time, geography, and language. The primary purpose of the Archive is to make a comprehensive collection of geo-located tweets available to the academic research community. The Archive extends from 2010 to July 12, 2023, when Twitter stopped allowing free access to its API and moved API access to a paid model. The collection totals approximately 10 billion tweets and is stored on Harvard University’s High Performance Computing (HPC) cluster.

The Geotweet Archive consists of tweets that carry two types of geospatial signature: 1) GPS-based longitude/latitude generated by the originating device, and 2) place-name-centroid-based longitude/latitude derived from the bounding box provided by Twitter, based on the user-defined place designation (typically a town name). Any tweet that carries one or both of these signatures is included in the Archive. Approximately 1-2% of all tweets contain such geographic coordinates (this percentage needs verification and may vary over time).
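
To make the two signature types concrete, here is a minimal Python sketch of how a place-name centroid could be derived from a bounding box and how the inclusion rule could be checked. It is an illustration only, not the Archive's actual processing code: the function names (bbox_centroid, geo_signatures, in_archive) and the input layout (a GPS point as a (lon, lat) pair, a bounding box as a list of (lon, lat) corners) are assumptions, and the centroid is taken as the simple midpoint of the box, which does not handle boxes that cross the 180° meridian.

```python
from typing import Dict, Optional, Sequence, Tuple

LonLat = Tuple[float, float]  # (longitude, latitude), assumed ordering


def bbox_centroid(corners: Sequence[LonLat]) -> LonLat:
    """Midpoint of the axis-aligned box spanned by the corner points.

    Assumption: the box does not cross the antimeridian (180 degrees).
    """
    lons = [lon for lon, _ in corners]
    lats = [lat for _, lat in corners]
    return ((min(lons) + max(lons)) / 2.0, (min(lats) + max(lats)) / 2.0)


def geo_signatures(
    gps: Optional[LonLat],
    place_bbox: Optional[Sequence[LonLat]],
) -> Dict[str, LonLat]:
    """Collect whichever of the two signatures a tweet carries."""
    found: Dict[str, LonLat] = {}
    if gps is not None:
        found["gps"] = gps  # device-generated coordinates
    if place_bbox:
        found["place_centroid"] = bbox_centroid(place_bbox)
    return found


def in_archive(
    gps: Optional[LonLat],
    place_bbox: Optional[Sequence[LonLat]],
) -> bool:
    """A tweet qualifies if it carries one or both signatures."""
    return bool(geo_signatures(gps, place_bbox))


# Hypothetical town-sized bounding box, given as four (lon, lat) corners.
example_bbox = [(-72.0, 42.0), (-71.0, 42.0), (-71.0, 43.0), (-72.0, 43.0)]
example = geo_signatures(gps=None, place_bbox=example_bbox)
```

In this sketch a tweet with only a place designation still receives a point location, but that point is the center of the place's box rather than the position of the device, so the two signature types differ considerably in spatial precision.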

The current version of the Archive is Version 2.0. The original Version 1.0 archive began in 2012 as part of a project started by Ben Lewis of CGA and then-Harvard graduate student Todd Mostak to develop a GPU-powered spatial database. The database needed an interesting, large, spatio-temporal dataset to show off its capabilities, so Todd and Ben built a harvester with the goal of collecting all tweets containing GPS coordinates coming out of the Twitter firehose.

The first version of the GPU database was called GEOPS. It powered TweetMap, which ran within WorldMap and represented the first vector-based geospatial big data query and display platform. When GEOPS was ultimately not made open source, Ben Lewis, Merce Crosas, and Apache developer David Smiley developed an open-source version of GEOPS built on 2D faceting within Solr, called "The BOP" (https://gis.harvard.edu/billion-object-platform-bop).

Later, GEOPS formed the basis for the technology startup MapD Technologies, which became OmniSci and then Heavy.ai. Heavy.ai software now runs on Harvard’s High Performance Computing (HPC) environment to support interactive exploration and analytics with the Geotweet Archive and other large datasets.

Version 2.0 of the Geotweet Archive is the result of merging the CGA archive with an archive developed by the Department of Geoinformatics at the University of Salzburg in Austria, as well as other archives, including one created by Ryan Qi Wang. Clemens Havas and Bernd Resch at the University of Salzburg worked with Devika Jain and Ben Lewis of Harvard CGA to deploy Version 2.0.