Center for Geographic Analysis, Harvard University
Big Data Geocoding and Overlay Analysis: A combined approach using ArcGIS and PostgreSQL

By Devika Kakkar, Jeff Blossom, and Giovanni Zambotti

In a recent project we were tasked with geocoding approximately 53 million U.S. addresses (in CSV format) and determining the 2000 and 2010 U.S. Census block group FIPS codes for each address.  This blog post discusses the workflow used to accomplish this big data processing task, detailing each step taken to optimize the process, including the scripts written.  The final process ended up being a combined approach using ArcGIS, PostgreSQL, and CSV files: the full job took 96 hours to geocode in ArcGIS and 14 hours for block group determination in PostgreSQL.  All processing was done on a Dell PC running the 64-bit Windows 7 Professional SP1 operating system, with an i7-6700 CPU @ 3.40GHz × 2 and 64 GB of RAM.

First, the CSV file containing the 53 million addresses was geocoded using ArcGIS for Desktop 10.3 and Esri Business Analyst 2015. The processing took about 96 hours to run, and the results were saved into a file geodatabase that ended up being 16.86 GB in size. The geocoding rate started out at 200,000 addresses per hour, then steadily increased to as much as 900,000 per hour, averaging about 550,000 per hour for the duration. 99% of the addresses were matched. To improve batch geocoding performance, refer to the link below:

https://blogs.esri.com/esri/arcgis/2011/02/09/tuning-a-locator-for-improved-performance/
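As a quick sanity check, the quoted average rate follows directly from the run's totals. A minimal sketch, using only the figures reported above:

```python
# Throughput check for the geocoding run described above
addresses = 53_000_000   # total U.S. addresses geocoded
hours = 96               # total ArcGIS geocoding time

rate_per_hour = addresses / hours
print(round(rate_per_hour))  # roughly 552,000 addresses per hour
```

This agrees with the observed average of about 550,000 addresses per hour.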

Next, determining the census block group 2000 and 2010 FIPS codes for each address was necessary. ArcGIS has several tools to accomplish this, including Spatial Join, Identity, and Intersect. The 2000 and 2010 block group polygon boundaries were copied from the Esri Data and Maps datasets into the file geodatabase holding the geocoded addresses; each block group layer has just over 200,000 polygons. The Spatial Join, Identity, and Intersect tools were each run on the addresses and the 200,000 block group polygons, and each time ArcGIS threw an "Out of Memory" error. Having run into this error before (ArcGIS running out of memory when attempting analyses on datasets numbering in the millions), a Python multiprocessing script (included in the "Python Multiprocessing Script" section below) running simultaneous processes was used to process a subset (~11.7 million addresses) of the full dataset against the block groups. The script uses psycopg2, the Python adapter for PostgreSQL, documented at the link below:

http://pythonhosted.org/psycopg2/

Python Multiprocessing Script:

import multiprocessing
from datetime import datetime

import psycopg2

startTime = datetime.now()
print(startTime)

# IDs to process, one task each (range taken from the original run)
census_block_list = range(1, 11722278)

def update_points_in_census_block(census_id):
    # Each worker opens its own connection: a psycopg2 connection
    # cannot be shared safely across forked processes.
    conn = psycopg2.connect("dbname=postgres user=postgres password=postgres")
    cur = conn.cursor()
    # Parameterized query: tag every point falling inside this polygon
    cur.execute(
        "UPDATE ba2000 AS points SET fipsout = polygons.id "
        "FROM bg2000 AS polygons "
        "WHERE polygons.ogc_fid = %s "
        "AND ST_Intersects(polygons.wkb_geometry, points.wkb_geometry);",
        (census_id,))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    pool.map(update_points_in_census_block, census_block_list)
    pool.close()
    pool.join()
    print(datetime.now() - startTime)

Though this approach was extremely easy to implement, it was not efficient, with a processing time of roughly 2.5 hours per block group for the subset used. Therefore, we turned to a solution using PostgreSQL directly. PostgreSQL version 9.4 can be downloaded and installed using the link below:

http://www.enterprisedb.com/products-services-training/pgdownload

The PostGIS extension was enabled, and QGIS was used to connect to the database. Importing the addresses and census block group polygons into PostgreSQL as a geodatabase took 30 hours, so an alternative method using a CSV file was tried instead. First, the geocoded geodatabase was exported to a CSV file.  Then the CSV file was copied into a PostgreSQL table, which took only 6 minutes, far less than importing the geodatabase. After the import was complete, an ST_Intersects query was run to find the block group for each address. Processing the 2000 and 2010 block group determination for all 53 million address locations took about 14 hours.  Once complete, the results were exported back to CSV. See the "PostgreSQL workflow" section below, where each command is listed.

In conclusion, by experimenting with various processing techniques available in ArcGIS and PostgreSQL, an efficient process was developed to handle this big data geocoding and spatial overlay analysis task.  

PostgreSQL workflow:

# Connect to the database, QGIS connection is used

# Create an empty table with the same fields as the csv file

CREATE TABLE public.geocoding_result
(
  "OBJECTID" oid,
  loc_name character varying,
  status character varying,
  match_addr character varying,
  x double precision,
  y double precision,
  altpatid double precision,
  address character varying,
  city character varying,
  state character varying,
  zip_code character varying
)
WITH (OIDS=FALSE);

ALTER TABLE public.geocoding_result
  OWNER TO postgres;

# Import addresses, copy csv file to the table in PostgreSQL

COPY geocoding_result FROM 'C:\Temp\geocoded_final.csv' DELIMITER ',' CSV;
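Because COPY aborts the entire load if any row has the wrong number of fields, it can be worth validating the exported CSV before copying it in. A minimal Python sketch; the 11-field count matches the geocoding_result table created above, and the helper name and sample row are ours:

```python
import csv
import io

EXPECTED_COLUMNS = 11  # OBJECTID through zip_code in geocoding_result

def first_bad_row(csv_text, expected=EXPECTED_COLUMNS):
    """Return the 1-based line number of the first row whose field count
    differs from `expected`, or None if every row is consistent."""
    for lineno, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != expected:
            return lineno
    return None

# A well-formed 11-field row passes; a short row is flagged
sample = "1,Loc,M,1 Main St,-71.11,42.37,7,1 Main St,Cambridge,MA,02138\n"
print(first_bad_row(sample))            # None
print(first_bad_row(sample + "a,b,c"))  # 2
```

In practice the check would stream the file from disk rather than a string; the logic is the same.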

# Add Geometry field to the table

ALTER TABLE public.geocoding_result ADD COLUMN geom geometry(Point,4326);

# Update geometry with x,y values

UPDATE public.geocoding_result SET geom = ST_SetSRID(ST_MakePoint(x, y), 4326);

# Import census block groups from the file geodatabase into PostgreSQL using ogr2ogr

ogr2ogr -overwrite -f "PostgreSQL" PG:"host=localhost user=postgres dbname=postgres password=postgres" "~/Census.gdb" "blkgrp2010"
ogr2ogr -overwrite -f "PostgreSQL" PG:"host=localhost user=postgres dbname=postgres password=postgres" "~/Census.gdb" "blkgrp2000"

# Create a spatial index on the point geometry to speed up the intersections
# (ogr2ogr builds indexes on the imported block group tables automatically,
# but the CSV-loaded table needs one)

CREATE INDEX geocoding_result_gist
ON geocoding_result
USING GIST (geom);

# Intersection for the 2010 boundary (make sure the addresses and block
# groups are in the same coordinate system)

CREATE TABLE geofips AS
SELECT geocoding_result.*, blkgrp2010.fips
FROM geocoding_result, blkgrp2010
WHERE ST_Intersects(geocoding_result.geom, blkgrp2010.wkb_geometry);

# Rename the new field "fips" to fips2010

ALTER TABLE geofips RENAME COLUMN fips TO fips2010;

# Intersection for the 2000 boundary

CREATE TABLE geofipsall AS
SELECT geofips.*, blkgrp2000.fips
FROM geofips, blkgrp2000
WHERE ST_Intersects(geofips.geom, blkgrp2000.wkb_geometry);

# Rename the new field "fips" to fips2000

ALTER TABLE geofipsall RENAME COLUMN fips TO fips2000;

# Add columns fips2010, fips2000 to geocoding_result so the unmatched
# records share the same schema as geofipsall

ALTER TABLE geocoding_result ADD COLUMN fips2010 varchar(12);
ALTER TABLE geocoding_result ADD COLUMN fips2000 varchar(12);

# Select the non-geocoded (unmatched, status 'U') records

CREATE TABLE Ustatus AS SELECT * FROM geocoding_result WHERE status='U';

# Union the matched and unmatched records back together (the two sets are
# disjoint, so UNION ALL would also work and skip the duplicate check)

CREATE TABLE geocoding_final_result AS (SELECT * FROM geofipsall UNION SELECT * FROM ustatus);

# Export result to csv

COPY geocoding_final_result(loc_name,status,x,y,match_addr,altpatid,address,
city,state,zip_code,fips2010,fips2000) TO 'C:\Temp\geocoding_final_result_new.csv' DELIMITER ',' CSV HEADER;

# Export a sample result for reviewing

COPY (SELECT loc_name,status,x,y,match_addr,altpatid,address,city,
state,zip_code,fips2010,fips2000 FROM geocoding_final_result LIMIT 1000) TO 'C:\Temp\geocoding_final_result_1000.csv' DELIMITER ',' CSV HEADER;

# End