Locate Your Users, Part 1: Intro and Data

I want to know where visitors to my web site and users of my web apps come from. I want to do this in order to profile my site audience and provide real-time app localization for users based on where they are located in the world. Rather than intrude on the user experience and ask where they are from I thought I should be able to get that information based on their IP address. A few weeks ago I was playing with the MaxMind GeoLite City database and found that it would be a good starting point for the geographic information I want to obtain per IPv4 address.

GeoCity Lite is available as a high-performance binary file that is lightning fast at performing lookups or as two CSV files. For this exercise I will be using the CSV files in order to load the data into a MySQL database. The GeoLiteCity-Location.csv contains the location information for a block of IPv4 addresses. This information has a column labeled loc_id which we will use to tie IP blocks to it. For each location there is an associated city, region, country, latitude, longitude, postal code, area code and DMA code.

The other half of the CSV package is GeoLiteCity-Blocks.csv. This file contains a comprehensive list of IPv4 address blocks and the location id that the block maps to in the Locations.csv file. The format of this file has a block start IP number , block end IP number and location id. Notice that I did not say the file contains IP addresses, but rather IP numbers. This presented a minor problem for me to overcome, namely converting a string representation of an IP address in the form of “192.168.0.1″ to it’s long representation 3232235521, but that’s a story for the Wednesday’s entry. Tonight we’re just concerned with loading the data.

If you’d like to follow along first download the MaxMind GeoLite City database and unzip it where you see fit. Edit each file from the zip and remove the first few lines containing the copyright and column names. I have created a script that can be used to create the schema and load the data from the files. The file is located in geoip.sql in the EAR file attached to this post.

use geoip;

–drop table blocks;
–drop table locations;

create table locations (
loc_id integer not null primary key,
country varchar(2) not null,
region varchar(2),
city varchar(255),
postal_code varchar(10),
latitude double,
longitude double,
dma_code integer,
area_code integer) type InnoDB;

create table blocks (
start_ip_num bigint not null,
end_ip_num bigint not null,
loc_id integer not null) type InnoDB;

alter table locations
add unique index loc_idx (loc_id asc);

alter table blocks
add primary key (start_ip_num, end_ip_num);

alter table blocks
add foreign key blk_loc_fk (loc_id) references locations (loc_id);

alter table blocks
add unique index start_ip_idx (start_ip_num desc),
add unique index end_ip_idx (end_ip_num desc);

load data infile ‘/home/anthony/GeoLiteCity_20060901/GeoLiteCity-Location.csv’ into table locations fields terminated by ‘,’ enclosed by ‘”‘;
commit;

load data infile ‘/home/anthony/GeoLiteCity_20060901/GeoLiteCity-Blocks.csv’ into table blocks fields terminated by ‘,’ enclosed by ‘”‘;
commit;

This schema script is specific to MySQL. You will have to modify it in order to use a different DBMS. The first thing we do is tell MySQL to use the geoip database. The script assumes you have already created an empty database and assigned the appropriate permissions to a user. The next two lines are commented out, but you will need to uncomment them if you decide to reload the data or recreate the tables from scratch. The steps to create the tables and indexes are straight forward for anyone familiar with SQL so I will skip explaining them in depth here. Just be sure to change the path in the load data lines to the path on your machine where the CSV files are located. Once you’re done with that you can save the file and run it. I ran it by doing:

mysql -u geoip -p < geoip.sql

Don’t panic if it takes a while. It’s loading a lot of data and the speed depends greatly on your MySQL configuration and hardware. On my Pentium 4 2.4 GHz the CPU was steady at about 50% utilization and MySQL used up to 418 MB of RAM. The total time to load the data was just over 9 minutes. Once you’re done with this step you’ve got a database full of geographic data for just about every IPv4 address block. It won’t be much use to you until we discuss converting an IP address to an IP number though. That’s the next topic of discussion.

You can download the IPAddressLocator.EAR file to get an idea of where this project is going.  It’s not the latest version because I’m going to post about the current shortcomings and what I’ve done to remedy them.

Feel free to leave questions or comments and I will respond to them. If you have suggestions, improvements or ideas for where this type of data may be useful please let me know. I look forward to hearing your feedback.

1 comment so far ↓

#1 Dan on 12.05.07 at 10:53 pm

Thanks for the post. =)

In case you’re still delving into this stuff here are some things I found that greatly reduced the query time.

In `blocks` i have start_ip_num and end_ip_num as “int(10) unsigned not null” instead of big ints;

I’ve also got start_ip_num as the primary key, with end_ip_num as a unique key.

loc_id is the same…

Take the query for a melbourne ip address:
select loc_id from blocks where start_ip_num >= 3405775035 and end_ip_num <= 3405775035

Lookup time went from approx 0.90 seconds to 0.00 seconds.

=)

Leave a Comment