Infogroup US Historical Business Dataset Analysis

This project involved creating geospatial measures for ~2,000 public firms from the Infogroup US Historical Business Dataset. One task was to calculate the following variables at the census block group level for each of the 23 years of data (1997–2019).

1. Businesses per office size type (Office_Size_Code)
2. Businesses per sales volume (Location_Sales_Volume_Code)
3. Businesses per employee size (Location_Employee_Size_Code)
4. Businesses per Business_Status_Code
5. Number of establishments by age, calculated from the Year_Established field:
a. 1 year old
b. 2 years old
c. 3 years old
d. 4 years old
e. 5 years old
f. 6-10 years old
g. 11 and older
6. Total, average, median, top 90 and lower 10 percentiles, min, and max of Employee_Size_Location
7. Total, average, median, top 90 and lower 10 percentiles, min, and max of Sales_Volume_Location
8. Businesses per IDCode
9. Businesses per NAICS
10. Businesses per SIC
11. Total, average, median, top 90 and lower 10 percentiles, min, and max of SALESVOL
12. Businesses per HDBRCH
13. Businesses per EMPNUM
14. Businesses per SQFTCODE
15. Census block group
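
The variables above fall into three patterns: counts of businesses per category code, counts per age bucket, and summary statistics of a numeric field. The following is a minimal pandas sketch of those three patterns, not the project's actual notebook. The field names Office_Size_Code, Year_Established and Employee_Size_Location come from the list above; the "block_group" column name, the toy values, the age definition (data year minus Year_Established) and the reading of "top 90 / lower 10" as the 90th and 10th percentiles are all assumptions made for illustration.

```python
import pandas as pd

# Toy records standing in for one year of data. "block_group" is a
# hypothetical column name and every value here is made up.
records = pd.DataFrame({
    "block_group": ["A", "A", "A", "B", "B"],
    "Office_Size_Code": ["1", "1", "2", "1", "2"],
    "Year_Established": [2018, 2015, 1990, 2010, 2017],
    "Employee_Size_Location": [5, 10, 200, 20, 3],
})
data_year = 2019  # the year this slice of the dataset describes

# Pattern 1: businesses per category code in each block group
# (one row per block group, one column per code).
office_counts = (
    records.groupby(["block_group", "Office_Size_Code"])
    .size()
    .unstack(fill_value=0)
)

# Pattern 2: establishments per age bucket.
# Assumption: age = data year - Year_Established.
age = data_year - records["Year_Established"]
age_bucket = pd.cut(
    age,
    bins=[0, 1, 2, 3, 4, 5, 10, float("inf")],  # right-closed intervals
    labels=["1", "2", "3", "4", "5", "6-10", "11+"],
)
age_counts = (
    records.assign(age_bucket=age_bucket)
    .groupby(["block_group", "age_bucket"], observed=False)
    .size()
    .unstack(fill_value=0)
)

# Pattern 3: summary statistics of a numeric field per block group.
# Assumption: "top 90" and "lower 10" mean the 90th and 10th percentiles.
grouped = records.groupby("block_group")["Employee_Size_Location"]
emp_stats = grouped.agg(["sum", "mean", "median", "min", "max"])
emp_stats["p90"] = grouped.quantile(0.9)
emp_stats["p10"] = grouped.quantile(0.1)

print(office_counts)
print(age_counts)
print(emp_stats)
```

The same groupby-then-unstack step would apply to any of the other code fields in the list by swapping the column name; the numeric summary would apply to Sales_Volume_Location and SALESVOL in the same way.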

The project was implemented using Pandas in Python on Harvard's High Performance Computing cluster. A customised Jupyter notebook was designed for processing each year of the dataset and can be easily reused for other projects with this dataset.
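
As a rough illustration of year-by-year processing, the sketch below applies one summary function to each year's records and stacks the results into a single table. It is an assumption about how such a workflow could look, not the notebook itself: the function name summarise_year, the "block_group" and "n_businesses" column names, and the toy data are hypothetical, and in practice each year's frame would be loaded from that year's file rather than built in memory.

```python
import pandas as pd


def summarise_year(records: pd.DataFrame, year: int) -> pd.DataFrame:
    """Count businesses per block group for one year and label rows with the year."""
    counts = (
        records.groupby("block_group")
        .size()
        .rename("n_businesses")
        .reset_index()
    )
    counts.insert(0, "year", year)
    return counts


# Toy stand-ins for two yearly extracts (hypothetical values).
yearly = {
    1997: pd.DataFrame({"block_group": ["A", "A", "B"]}),
    1998: pd.DataFrame({"block_group": ["A", "B", "B", "B"]}),
}

# Process each year independently, then combine into one long table.
panel = pd.concat(
    [summarise_year(df, year) for year, df in yearly.items()],
    ignore_index=True,
)
print(panel)
```

Keeping each year's processing independent means only one year of records needs to be in memory at a time, and separate years can be run as separate jobs on a cluster.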

The scripts can be found on our GitHub here.

Questions/comments on the project can be sent to Devika Kakkar and Jeff Blossom.