How to Use Census Bureau Data

There is an enormous amount of information about Census Bureau data, so much that it is hard to sort through it all. Furthermore, much of it assumes that you already have some understanding of how to work with the data, or that you are using a commercial software package (e.g. ArcInfo).

If you want to massage the data yourself -- without going through a commercial tool -- you will need to learn more than you ever wanted to know about the data. This document is what I've taught myself. I'm not certain that I've got it all completely correct, but it's the best I could do.

Terminology

Census 2000 Geographic Terms and Concepts is a good start at explaining the census data terminology. There are a few things, however, that it does not make clear:

What is the relationship between geographical entities? Are blocks subsets of tracts, for example?

Blocks nest within block groups, block groups within tracts, and tracts within counties. Note that tract numbers are unique only within a county: tract 2031 in Santa Clara County is different from tract 2031 in Calaveras County.
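Because a tract number means nothing without its county, it helps to carry the state and county codes along whenever you tabulate by tract. Here is a minimal sketch of that idea. The helper name tract_key and the population dictionary are made up for illustration; the FIPS codes are the ones for California (06), Santa Clara County (085), and Calaveras County (009).

```python
# Hypothetical helper: build a key that identifies a tract unambiguously
# by combining it with its state and county FIPS codes.

def tract_key(state_fips, county_fips, tract):
    """Return a (state, county, tract) tuple usable as a dict key."""
    return (str(state_fips).zfill(2), str(county_fips).zfill(3), str(tract))

# Two tracts with the same number in different counties stay separate.
population = {}  # hypothetical per-tract tally
population[tract_key("06", "085", "2031")] = 0  # Santa Clara County
population[tract_key("06", "009", "2031")] = 0  # Calaveras County
```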

Data files

The Census Bureau data is split into many files. If you only care about tabulating data, you need only a data file; if you want to draw maps, you will also need a shapefile set.

Shapefile sets

Shapefiles hold information about regions on maps, e.g. the outlines of states, counties, census tracts, etc. Shapefiles define the outlines in terms of points, and the points are given in latitude/longitude pairs. A shapefile is actually a set of three files: a .shp file, a .shx file, and a .dbf file. The .shp file holds the shape geometry, the .shx file is an index of record offsets into the .shp file, and the .dbf file holds one row of attributes (in dBASE format) for each shape.
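To get a feel for what is inside a .shp file, you can read its fixed 100-byte header with nothing but the Python standard library. The sketch below is not Shapelib; it is a hand-rolled parser based on the published shapefile layout, and the function name read_shp_header is my own.

```python
import struct

def read_shp_header(data):
    """Parse the 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # File code and file length are big-endian; length is in 16-bit words.
    file_code, = struct.unpack(">i", data[0:4])
    length_words, = struct.unpack(">i", data[24:28])
    if file_code != 9994:
        raise ValueError("not a shapefile (bad file code)")
    # Version, shape type, and bounding box are little-endian.
    version, shape_type = struct.unpack("<ii", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    return {
        "version": version,
        "shape_type": shape_type,          # e.g. 5 means polygon
        "length_bytes": length_words * 2,
        "bbox": (xmin, ymin, xmax, ymax),  # longitude/latitude extent
    }
```

To use it, open the file in binary mode and pass in the first 100 bytes, e.g. read_shp_header(f.read(100)).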

One place to get shapefiles is from ESRI (the makers of ArcInfo). Download from http://arcdata.esri.com/data/tiger2000/tiger_download.cfm . They have documented it here.

Note that shapefiles are so big and unwieldy that the ESRI shapefiles are split into multiple pieces. Generally, each file covers a specific region (county or state) and a specific category of shape. There are lots of different types of shapes -- census tracts, cities, voting districts, etc.

Example: If I request California, then "Census Tracts 2000" from the ESRI download page, it will ask me which counties I want. If I say all counties, I'll receive a single zip file which, when unzipped, contains a bunch of zip files, one for each California county. Unzipping each of those gives me one .dbf file, one .shp file, and one .shx file.
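Unzipping one archive per county by hand gets old quickly. The following is a minimal sketch of doing it with Python's zipfile module, assuming the download is one zip whose members are themselves zips, as described above. The function name extract_nested_zips is made up.

```python
import io
import zipfile

def extract_nested_zips(outer_zip, dest_dir):
    """Extract outer_zip into dest_dir, unpacking any inner .zip members too.

    outer_zip may be a path or a file-like object.
    Returns the names of the files that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(outer_zip) as outer:
        for name in outer.namelist():
            if name.lower().endswith(".zip"):
                # Read the inner archive into memory and unpack it.
                inner_bytes = io.BytesIO(outer.read(name))
                with zipfile.ZipFile(inner_bytes) as inner:
                    inner.extractall(dest_dir)
                    extracted.extend(inner.namelist())
            else:
                outer.extract(name, dest_dir)
                extracted.append(name)
    return extracted
```

Everything lands in one directory, so if two inner archives happened to reuse a file name, the later one would overwrite the earlier one.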

Alternatively, you can get boundary files directly from the Census Bureau. These have an entire state's worth of data in them, though they are perhaps not as well documented.

Shapefile utilities

Shapelib is a wonderful thing, and it has a Python binding.

Its Shape API is very nice and will let you pull out individual fields. In order to figure out what the different fields are, you need to query the associated .dbf file. Use Shapelib's DBF API for that.

The Census Bureau docs are really lame at telling you whether a field is char or int or double. Never fear -- use dbfdump from the Shapelib distro with the -h flag, and it will tell you what you need to know.
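If you would rather not build Shapelib just to look at field types, the same information sits in the .dbf header and can be read with the standard library. This is a sketch based on the dBASE file layout, not Shapelib's DBF API; the function name dbf_fields is my own.

```python
import struct

def dbf_fields(data):
    """Return (name, type, length, decimals) for each field in a .dbf file."""
    # Bytes 4-11: record count, header length, record length (little-endian).
    num_records, header_len, record_len = struct.unpack("<IHH", data[4:12])
    fields = []
    pos = 32  # field descriptors start right after the 32-byte file header
    end = min(header_len, len(data))
    while pos < end and data[pos] != 0x0D:  # 0x0D terminates the list
        name = data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii")
        ftype = chr(data[pos + 11])          # 'C' = char, 'N' = numeric, ...
        length, decimals = data[pos + 16], data[pos + 17]
        fields.append((name, ftype, length, decimals))
        pos += 32                            # each descriptor is 32 bytes
    return fields
```

A type of C is a character field. A type of N is numeric; with zero decimals it is usually safe to treat it as an integer, otherwise as a double.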

Data files

One of the juiciest data files that you can get is the SF1 (Summary File 1) data file. As the SF1 documentation shows, it has ALL KINDS of yummy population information broken down sixteen ways from Sunday.

The SF1 file is also a .dbf file, so you can use the DBF API to extract that information, as above.
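For simple tabulation you can also walk the .dbf records directly. The sketch below reads every non-deleted record into a dictionary of strings using only the standard library. It is based on the dBASE file layout rather than Shapelib's DBF API, and the function name dbf_records is made up.

```python
import struct

def dbf_records(data):
    """Yield one dict per non-deleted record, values as stripped strings."""
    num_records, header_len, record_len = struct.unpack("<IHH", data[4:12])
    # Collect (name, width) pairs from the 32-byte field descriptors.
    fields = []
    pos = 32
    while pos < header_len and data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii")
        fields.append((name, data[pos + 16]))
        pos += 32
    for i in range(num_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":      # '*' marks a deleted record
            continue
        row, offset = {}, 1         # byte 0 is the deletion flag
        for name, width in fields:
            raw = record[offset:offset + width]
            row[name] = raw.decode("latin-1").strip()
            offset += width
        yield row
```

Summing a numeric column is then a one-liner such as sum(int(r["POP"]) for r in dbf_records(data)), where POP stands in for whatever field name the header actually reports.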

Topic revision: r1 - 2005-11-18 - DuckySherwood