Network Analysis on Geospatial Big Data in Brazil

Network analysis is a common task in GIS. Researchers are increasingly working with big geospatial datasets that contain millions of records. At this scale, traditional GIS methods of network analysis fall short, and new approaches are needed. In this blog post, we describe the procedure we used to calculate the shortest driving distance between each of 3.5 million patients and their nearest hospital in Brazil. Several tools can calculate shortest distances; the most common among them are PostGIS routing and OSMnx. OSMnx is a Python package that retrieves, models, analyzes, and visualizes street networks and other spatial data from OpenStreetMap. In addition to calculating shortest paths, OSMnx can download a road network with one line of code, which simplifies our work. After evaluating both tools, we decided to use OSMnx because it is simpler to use and easier to script than a database.
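To make the idea concrete, here is a minimal sketch of a nearest-hospital lookup on a street graph. It is not our production script: the helper names (load_drive_network, nearest_hospital) and the toy graph are assumptions made for illustration. The download helper uses OSMnx's graph_from_place, and the routing step uses plain networkx, which is the graph type OSMnx returns.

```python
import networkx as nx


def load_drive_network(place):
    """Download the drivable street network for a place name (needs network access)."""
    import osmnx as ox  # imported lazily; only this helper needs OSMnx

    return ox.graph_from_place(place, network_type="drive")


def nearest_hospital(G, patient_node, hospital_nodes, weight="length"):
    """Return (hospital_node, distance) for the closest reachable hospital.

    Returns (None, inf) when no hospital can be reached from the patient node.
    """
    # One Dijkstra run from the patient gives the distance to every reachable node.
    dist = nx.single_source_dijkstra_path_length(G, patient_node, weight=weight)
    reachable = [h for h in hospital_nodes if h in dist]
    if not reachable:
        return None, float("inf")
    best = min(reachable, key=dist.get)
    return best, dist[best]


# Toy directed graph standing in for a street network; edge "length" is in metres.
G = nx.MultiDiGraph()
G.add_edge("patient", "a", length=100.0)
G.add_edge("a", "hospital_1", length=250.0)
G.add_edge("patient", "hospital_2", length=900.0)
G.add_edge("hospital_3", "patient", length=10.0)  # one-way street, wrong direction

print(nearest_hospital(G, "patient", ["hospital_1", "hospital_2", "hospital_3"]))
```

On a real OSMnx graph, patient and hospital coordinates would first be snapped to graph nodes, for example with ox.distance.nearest_nodes(G, X=lon, Y=lat). Because the graph is directed, one-way streets are respected, which is why hospital_3 is not considered reachable in the toy example.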

We encountered several challenges while working on this big data project. First, it was impossible to download the entire street network of Brazil using OSMnx because of the large amount of RAM required. Therefore, we divided the dataset into smaller subsets that can be processed efficiently. We divided the data by state (27 states) and calculated the distances between origins and destinations within each state. The data could instead be divided by city, but that would mean 5,570 subsets, which would be difficult to manage. We further split each state's data into separate datasets by province so that the calculations can be run province by province. The segmentation is performed with the GeoPandas clip function (geopandas.clip), using Brazil's administrative boundaries as the mask.
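The segmentation step can be sketched roughly as follows. This is an illustrative example only: the helper split_by_region, the column names, and the square "states" are made up for the demo, and real boundaries would come from an administrative boundary file.

```python
import geopandas as gpd
from shapely.geometry import Point, box


def split_by_region(points, regions, name_col):
    """Return {region name: points that fall inside that region's polygon}."""
    subsets = {}
    for _, region in regions.iterrows():
        # gpd.clip keeps only the features that intersect the mask geometry.
        subsets[region[name_col]] = gpd.clip(points, region["geometry"])
    return subsets


# Toy data: two square "states" and three patient locations (lon/lat, EPSG:4326).
regions = gpd.GeoDataFrame(
    {"state": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)
patients = gpd.GeoDataFrame(
    {"patient_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(0.2, 0.8)],
    crs="EPSG:4326",
)

subsets = split_by_region(patients, regions, "state")
for name, subset in subsets.items():
    print(name, sorted(subset["patient_id"]))
```

Each subset can then be written to its own file and processed independently, which keeps the memory needed for any single run bounded by the largest region rather than by the whole country.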

The processing ran on Harvard’s High Performance Computing Cluster (FASRC) on a Jupyter Notebook server with 2 CPUs and 64 GB of RAM. OSMnx offers multiprocessing, which lets us use multiple cores on the FASRC server and speeds up our calculations. The calculation for each state takes between 20 minutes and 7 hours. The most time-consuming step is downloading the road network (20 minutes to 4 hours), so we cache it to save time between runs. Calculating all 3.5 million distances in Brazil took about 4 days with this approach. Our solution made it possible to calculate shortest driving distances on this big dataset in a cost- and time-efficient manner. This will enable researchers to perform key analyses on various aspects of public health in Brazil.
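One simple way to cache a downloaded network between runs is sketched below. The post does not say which caching mechanism we used, so treat this as an assumption: the helper names (load_network, download_with_osmnx), the pickle file layout, and the stand-in downloader are all hypothetical, chosen so that the example runs without network access.

```python
import pickle
import tempfile
from pathlib import Path


def download_with_osmnx(place):
    """Download a drivable network with OSMnx (needs network access)."""
    import osmnx as ox  # lazy import: only needed on a cache miss

    return ox.graph_from_place(place, network_type="drive")


def load_network(place, cache_dir, download=download_with_osmnx):
    """Return the network for `place`, downloading it only if it is not cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (place.replace(",", "").replace(" ", "_") + ".pkl")
    if cache_file.exists():
        # Cache hit: skip the slow download entirely.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    graph = download(place)
    with cache_file.open("wb") as f:
        pickle.dump(graph, f)
    return graph


# Demo with a stand-in downloader so the sketch runs offline.
calls = []


def fake_download(place):
    calls.append(place)
    return {"place": place}


with tempfile.TemporaryDirectory() as tmp:
    first = load_network("Acre, Brazil", tmp, download=fake_download)
    second = load_network("Acre, Brazil", tmp, download=fake_download)

print(len(calls))  # the downloader ran only once
```

OSMnx also provides its own options, such as ox.save_graphml and ox.load_graphml for storing a graph on disk and the ox.settings.use_cache flag for caching HTTP responses. For the routing itself, ox.shortest_path accepts a cpus argument to spread many origin-destination pairs across cores.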

The scripts for this analysis are in the process of being published to GitHub.

Questions/comments on the project can be sent to Devika Kakkar, Xiaokang Fu, and Jeff Blossom.

[Figure: Brazil_network]