How to Use Census Bureau Data
There is an enormous amount of information about Census Bureau data, so much that it is hard to sort through it all. Furthermore, a lot of it assumes that you already have some understanding of how to work with the data, and/or that you are using a commercial software package (e.g., ArcInfo).
If you want to massage the data yourself -- without going through a commercial tool -- you will need to learn more than you ever wanted to know about the data. This document is what I've taught myself. I'm not certain that I've got it all completely correct, but it's the best I could do.
Terminology
Census 2000 Geographic Terms and Concepts is a good start at explaining the census data terminology. There are a few things, however, that it does not make clear:
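- What is the relationship between geographical entities? Are blocks subsets of tracts, for example? (They are: blocks nest within block groups, block groups within tracts, and tracts within counties.)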
Note that tracts are numbered uniquely only within a county. Tract 2031 in Santa Clara County is different from tract 2031 in Calaveras County.
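Because a tract number alone is ambiguous, any table you build yourself needs the county (and state) in the key. Here is a minimal sketch of that idea. The population figures are made up, and the helper name tract_key is just something I picked for illustration; the FIPS codes (06 for California, 085 for Santa Clara County, 009 for Calaveras County) are the standard ones as far as I know.

```python
# Hypothetical population counts keyed by (state FIPS, county FIPS, tract).
# The counts are invented purely for illustration.
tract_population = {
    ("06", "085", "2031"): 4100,  # tract 2031, Santa Clara County, CA
    ("06", "009", "2031"): 1250,  # tract 2031, Calaveras County, CA
}


def tract_key(state_fips, county_fips, tract):
    """Build a lookup key that stays unique across counties."""
    return (state_fips, county_fips, tract)


santa_clara = tract_population[tract_key("06", "085", "2031")]
calaveras = tract_population[tract_key("06", "009", "2031")]
```

Keying on the tract number alone would have silently overwritten one county's entry with the other's.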
Data files
The Census Bureau data is split into many files. If you only care about tabulating data, you need only a data file; if you want to draw maps, you will also need a shapefile set.
Shapefile sets
Shapefiles hold information about regions on maps, e.g., the outlines of states, counties, and census tracts. Shapefiles define the outlines in terms of points, and the points are given as longitude/latitude pairs. A shapefile is actually a set of three files: a .shp file, a .shx file, and a .dbf file. The .shp file holds the shapes themselves, the .shx file is an index that records where each shape starts inside the .shp file, and the .dbf file is a table of attributes (names, codes, and so on) with one record per shape.
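If you want to peek inside a .shp file without any library, its first 100 bytes are a fixed header holding a magic number (9994), the file length, the shape type, and a bounding box. The sketch below parses that header using only the standard library. The function name read_shp_header and the keys of the dictionary it returns are my own choices, not part of any official API.

```python
import struct


def read_shp_header(data):
    """Parse the fixed 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # The file code and file length are big-endian integers.
    (file_code,) = struct.unpack(">i", data[0:4])
    (length_words,) = struct.unpack(">i", data[24:28])
    if file_code != 9994:
        raise ValueError("this does not look like a .shp file")
    # The version, shape type, and bounding box are little-endian.
    version, shape_type = struct.unpack("<ii", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    return {
        "version": version,
        "shape_type": shape_type,        # e.g. 5 means polygon
        "file_bytes": length_words * 2,  # length is stored in 16-bit words
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

To use it, open the file in binary mode and pass in the first 100 bytes, e.g. read_shp_header(open(path, "rb").read(100)), where path is whatever .shp file you downloaded.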
One place to get shapefiles is ESRI (the makers of ArcInfo). Download from http://arcdata.esri.com/data/tiger2000/tiger_download.cfm. They have documented it here.
Note that shapefiles are so big and unwieldy that the ESRI shapefiles are split into multiple pieces. Generally, each file covers a specific region (a county or state) and a specific category of shape. There are lots of different types of shapes -- census tracts, cities, voting districts, etc.
Example: If I request California, then "Census Tracts 2000" from the ESRI download page, it will ask me which counties I want. If I say all counties, I'll receive a zip file that, when unzipped, contains a bunch of zip files, one for each California county. Unzipping each of those gives me one .dbf file, one .shp file, and one .shx file.
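Unzipping fifty-odd county archives by hand gets old fast. The following is a rough sketch of how the zip-inside-a-zip layout described above could be unpacked in one go with Python's standard zipfile module. The function name extract_nested_zips is mine, and I'm assuming the outer archive holds the county zips at its top level, as in my download.

```python
import io
import zipfile


def extract_nested_zips(outer_zip, dest_dir):
    """Unpack every .zip member of outer_zip into dest_dir.

    outer_zip may be a path or a file-like object.
    Returns the sorted names of the files that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(outer_zip) as outer:
        for member in outer.namelist():
            if not member.lower().endswith(".zip"):
                continue  # skip anything that is not an inner archive
            # Read the inner archive into memory and open it in place.
            inner_bytes = io.BytesIO(outer.read(member))
            with zipfile.ZipFile(inner_bytes) as inner:
                inner.extractall(dest_dir)
                extracted.extend(inner.namelist())
    return sorted(extracted)
```

Each county archive is read into memory before being opened, which is fine for files of this size but would want rethinking for anything huge.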
On the other hand, you can get boundary files directly from the Census Bureau. These have an entire state's worth of data in them, though perhaps they aren't as well documented.
Shapefile utilities
Shapelib is a wonderful thing, and it has a Python binding.
Its Shape API is very nice and will let you pull out individual fields. In order to figure out what the different fields are, you need to query the associated .dbf file. Use Shapelib's DBF API for that.
The Census Bureau docs are really lame at telling you whether a field is char or int or double. Never fear -- use the Shapelib distro's dbfdump with the -h flag, and it will tell you what you need to know.
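If you don't have dbfdump handy, the field list is easy enough to read straight out of the .dbf header. This is not Shapelib's API, just a small standard-library sketch based on my understanding of the dBASE layout: a 32-byte file header, then one 32-byte descriptor per field, ended by a 0x0D byte. The function name read_dbf_fields is my own.

```python
import struct


def read_dbf_fields(data):
    """Return (name, type, length, decimals) for each field in a .dbf file.

    data is the raw bytes of the file; the header portion is enough.
    """
    # Bytes 8-9 of the file header hold the total header length.
    (header_len,) = struct.unpack("<H", data[8:10])
    fields = []
    offset = 32  # field descriptors start after the 32-byte file header
    while offset + 32 <= header_len and data[offset] != 0x0D:
        # 11-byte name, 1-byte type code, 4 skipped bytes, length, decimals.
        raw_name, ftype, flen, fdec = struct.unpack(
            "<11sc4xBB", data[offset:offset + 18])
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        fields.append((name, ftype.decode("ascii"), flen, fdec))
        offset += 32
    return fields
```

The type code is a single letter: C is character data and N is numeric. As far as I can tell, an N field with zero decimals is what shows up as an integer, and one with decimals as a double.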
Data files
One of the juiciest data files that you can get is the SF1 data file. (SF1 stands for Summary File 1.) As the SF1 documentation shows, it has ALL KINDS of yummy population information broken down sixteen ways from Sunday.
The SF1 file is also a .dbf file, so you can use the DBF API to extract that information, as above.
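For simple tabulation you can also walk the records of a .dbf file without Shapelib. The sketch below again leans on my reading of the dBASE layout: every record is fixed-width and starts with a one-byte flag, where "*" marks a deleted record. The names iter_dbf_records and total_field are my own, and the field name you pass in has to come from the SF1 documentation; I am not assuming any particular one here.

```python
import struct


def iter_dbf_records(data):
    """Yield each non-deleted record of a .dbf file as a dict of strings."""
    # Record count, header length, and record length sit at bytes 4-11.
    num_records, header_len, record_len = struct.unpack("<IHH", data[4:12])
    fields = []  # (name, width) for each field descriptor
    offset = 32
    while offset + 32 <= header_len and data[offset] != 0x0D:
        # 11-byte name, 5 skipped bytes (type code + reserved), then width.
        raw_name, width = struct.unpack("<11s5xB", data[offset:offset + 17])
        fields.append((raw_name.split(b"\x00", 1)[0].decode("ascii"), width))
        offset += 32
    for i in range(num_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":  # the first byte flags a deleted record
            continue
        pos = 1
        row = {}
        for name, width in fields:
            row[name] = record[pos:pos + width].decode("latin-1").strip()
            pos += width
        yield row


def total_field(data, field_name):
    """Sum an integer-valued field over all non-deleted records."""
    return sum(int(row[field_name]) for row in iter_dbf_records(data))
```

Values come back as stripped strings, so it is up to you to convert them once you know each field's type.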