NFL Stats in Perl

I like to do statistical analysis, and over the years I've grown less interested in drudge work and more interested in letting my tools work for me. Sports are no different, and I've long had an interest in analyzing them. Finding out which teams are good isn't that hard, but with teams having byes, the stats in the paper sometimes aren't enough.

The issue is pretty simple. Teams have bye weeks these days, and while the Sunday papers and sports sites can give you cumulative performance, they don't really give you performance rates. When comparing a team that has played 7 games to a team that has played 8, what you want to know are per-game figures, such as winning percentage and points scored and allowed per game.

Statistics like these change the way you can look at teams. They show what to expect in the next game and how a team fares relative to other teams. They make it possible to compare teams across byes and to better understand what is coming next.

One of the fundamental issues in doing this kind of work is tools. In this case I've been using Perl a great deal. It is an interpreted language, and it handles text, databases, and the Internet extremely well. For calculation, it isn't a speed demon, but it's more than fast enough for what we're doing here. We'll be using MySQL as the database piece, and since we're trying to keep our work to a minimum, we'll be using web sites as the source of our information.

The Gory Details

What we're going to start with is a piece of code that can parse data from NFL.com. We don't want the whole site; we just want game scores. If you take a look at nfl.com, you can see that they have a page of scores, that the scores appear to be enclosed in little boxes of HTML "stuff", and that the boxes follow a repetitive pattern, used over and over again. What we're going to do is scrape NFL.com for game scores. The code for our scraper is here.

We're going to use the module LWP::UserAgent to fetch the data and then the module HTML::TreeBuilder to break the data down into usable chunks. The TreeBuilder module allows us to identify useful segments of markup, and thanks to the clean design of the page, TreeBuilder makes it easy to find the data.
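As a rough illustration of the fetching half, here is a minimal sketch built on LWP::UserAgent. The sub name fetch_page, the timeout, and the agent string are my own choices for illustration, and the URL is taken from the command line rather than hard-coded, since the exact address of the scores page is not given here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a page and return its decoded HTML, or die with the HTTP status line.
# fetch_page is a hypothetical helper name, not part of the original scraper.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        timeout => 30,                        # assumed value
        agent   => 'nfl-scraper-sketch/0.1',  # assumed agent string
    );
    my $response = $ua->get($url);
    die "Fetch of $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is supplied on the command line.
if (@ARGV) {
    my $html = fetch_page( $ARGV[0] );
    printf "Fetched %d characters\n", length $html;
}
```

The returned string is what would then be handed to HTML::TreeBuilder for parsing.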

The data on the scores pages of nfl.com are set up as follows: each game is kept within a div tag assigned to a class named scoreBox. All the game content is within this tag. So we first search for all div tags with an attribute, class, whose value is scoreBox. At this point we have all the stuff we need to dig deeper. Want to know if the game is finished? There is a div whose class is either scoreBoxHeader gameOver or scoreBoxHeader gameOn. We look for this div next. Scores are kept within a div whose class is named scoresBoxTopTeamScore, and you can find an abbreviated team name in an anchor (a) tag kept within the div whose class is named scoresBoxTopTeamLogo. Knowing this, we can search the tree for all our scoreBox divs and then peruse the div contents to get our scores.
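The search just described can be sketched with HTML::TreeBuilder's look_down method. The HTML below is a hypothetical fragment I wrote from the class names in the paragraph above, not a copy of the real page, and the team abbreviations and scores in it are made up. The sub name parse_scores is also mine, and the sketch reads only the "top" team because those are the only classes named here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup assembled from the class names described above.
my $sample = <<'HTML';
<div class="scoreBox">
  <div class="scoreBoxHeader gameOver">Final</div>
  <div class="scoresBoxTopTeamLogo"><a href="#">AAA</a></div>
  <div class="scoresBoxTopTeamScore">38</div>
</div>
<div class="scoreBox">
  <div class="scoreBoxHeader gameOn">In progress</div>
  <div class="scoresBoxTopTeamLogo"><a href="#">BBB</a></div>
  <div class="scoresBoxTopTeamScore">14</div>
</div>
HTML

# Return one hash per scoreBox: finished flag, team abbreviation, score.
sub parse_scores {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @games;
    # A plain string criterion matches the whole attribute value exactly,
    # so the header divs are not mistaken for scoreBox divs.
    for my $box ( $tree->look_down( _tag => 'div', class => 'scoreBox' ) ) {
        my $header = $box->look_down( _tag => 'div', class => qr/^scoreBoxHeader\b/ );
        my $logo   = $box->look_down( _tag => 'div', class => 'scoresBoxTopTeamLogo' );
        my $score  = $box->look_down( _tag => 'div', class => 'scoresBoxTopTeamScore' );
        next unless $header && $logo && $score;
        my $link = $logo->look_down( _tag => 'a' );
        push @games, {
            finished => ( $header->attr('class') =~ /\bgameOver\b/ ? 1 : 0 ),
            team     => ( $link ? $link->as_trimmed_text : '' ),
            score    => $score->as_trimmed_text,
        };
    }
    $tree->delete;    # free the parse tree
    return @games;
}

my @games = parse_scores($sample);
for my $g (@games) {
    printf "%-4s %3s %s\n", $g->{team}, $g->{score},
        $g->{finished} ? 'final' : 'in progress';
}
```

In scalar context look_down returns only the first match, which is why it works for the single header, logo, and score inside each box.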

Our first program is a start, but it doesn't do you any good unless you can store your data someplace. For now I'm using MySQL. It's simple enough to install the MySQL database and then log in to MySQL. Type in create database nfl_2007; and then grant all on nfl_2007.* to default_user@'localhost' identified by 'default_pass'; Now we have a database with a username and password.

I don't like storing usernames and passwords in my programs if I don't have to, so I'm going to add another module to the mix. Config::Simple allows me to create Windows-style (INI) configuration files and keep all my site-dependent variables in there. So, with that in mind, I end up writing:

#
# Sample configuration file
#
[nfl_2007]
table = games
user = default_user
pass = default_pass

[nfl_2006]
table = games
user = default_user
pass = default_pass

I save the file as "nfl.config". For now this keeps passwords out of the code itself. Next, we need to add the ability to make database connections to our set of programs. That uses yet another module, the DBI module. Using DBI, we'll write a short piece of code to create our table for us. It's a simple table; we're not trying to be exceptionally fancy on a first cut. But the code is here.
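To show how the pieces fit together, here is a minimal sketch that reads nfl.config with Config::Simple and creates a table through DBI. The column names are hypothetical, chosen only for illustration, and are not the schema of the actual program. The helper names create_table_sql and create_games_table are likewise mine. The modules are loaded with require inside the sub so that the SQL builder can be exercised without a database present.

```perl
use strict;
use warnings;

# Build the CREATE TABLE statement. The columns are an assumed, illustrative
# schema, not the one used by the original program.
sub create_table_sql {
    my ($table) = @_;
    die "Unsafe table name: $table\n" unless $table =~ /^\w+$/;
    return <<"SQL";
CREATE TABLE IF NOT EXISTS $table (
    week        INT NOT NULL,
    home_team   VARCHAR(3) NOT NULL,
    away_team   VARCHAR(3) NOT NULL,
    home_score  INT NOT NULL,
    away_score  INT NOT NULL
)
SQL
}

# Read the block named after the database (e.g. [nfl_2007]) and create
# the table it names.
sub create_games_table {
    my ( $config_file, $database ) = @_;
    require Config::Simple;
    require DBI;

    my $cfg = Config::Simple->new($config_file)
        or die Config::Simple->error();

    # INI-style keys are addressed as "block.key".
    my $table = $cfg->param("$database.table");
    my $user  = $cfg->param("$database.user");
    my $pass  = $cfg->param("$database.pass");
    die "No [$database] block in $config_file\n" unless defined $table;

    my $dbh = DBI->connect( "DBI:mysql:database=$database;host=localhost",
        $user, $pass, { RaiseError => 1, PrintError => 0 } );
    $dbh->do( create_table_sql($table) );
    $dbh->disconnect;
    return $table;
}

# Example: perl create_table_sketch.pl nfl_2007
if (@ARGV) {
    my $table = create_games_table( 'nfl.config', $ARGV[0] );
    print "Created table $table in $ARGV[0]\n";
}
```

Setting RaiseError means a failed connection or statement dies with the driver's message instead of needing a check after every call.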

We now also add some tools to handle command line parameters (Getopt::Long) and to print usage messages from POD, or plain old documentation (Pod::Usage). We use these to enhance our scraper. We add the ability to handle week ranges, so that we can load several weeks of data at once. And the final product is the program shown here.
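As a sketch of how a week range might be handled, the following uses Getopt::Long to read a --range option of the form FIRST-LAST followed by a season, and Pod::Usage to print the synopsis on error. The sub name parse_args, the script name in the POD, and the --help option are assumptions for illustration; the real program's option handling may differ.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
use Pod::Usage;

# Parse "[--range FIRST-LAST] SEASON" and return a hash reference.
# first and last stay undefined when no range is given (whole season).
sub parse_args {
    my @args = @_;
    my %opt = ( help => 0 );
    GetOptionsFromArray( \@args, \%opt, 'range=s', 'help' )
        or die "Unrecognized options\n";
    return { help => 1 } if $opt{help};

    my ( $first, $last );
    if ( defined $opt{range} ) {
        ( $first, $last ) = $opt{range} =~ /^(\d+)-(\d+)$/
            or die "--range expects FIRST-LAST, for example 3-6\n";
        die "--range must not run backwards\n" if $first > $last;
    }

    my ($season) = @args;
    die "A four-digit season is required\n"
        unless defined $season && $season =~ /^\d{4}$/;

    return { help => 0, season => $season, first => $first, last => $last };
}

if (@ARGV) {
    my $opt = eval { parse_args(@ARGV) };
    pod2usage( -message => $@, -exitval => 2 ) unless $opt;
    pod2usage( -verbose => 1, -exitval => 0 ) if $opt->{help};
    printf "Season %s, weeks %s\n", $opt->{season},
        defined $opt->{first} ? "$opt->{first}-$opt->{last}" : 'all';
}

=head1 SYNOPSIS

  scraper_sketch.pl [--range FIRST-LAST] SEASON

=cut
```

Keeping the parsing in its own sub, working on a copy of the argument list, makes it easy to check without running the whole script.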

After the above we can worry about programs to examine downloaded data and calculate ratings. Again, this program is a start, a template we can use to do fancier things. For now the program rates teams on four criteria:

  1. winning percentage
  2. point spread per game
  3. points scored per game
  4. points allowed per game

These criteria are enough to begin analysis, and enough to separate closely ranked teams. The program supports a syntax that allows you to analyze whole seasons or parts of them. If I wanted to analyze just the 3rd through 6th games of the 2007 season, I'd say:

rate_teams.pl --range 3-6 2007

Otherwise, to analyze, say, the whole 2006 season, you would use:

rate_teams.pl 2006
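The four criteria listed above can be sketched in a few lines of plain Perl. This is not the rating program itself. The game records are made-up data with placeholder team names, the hash keys are my own, "point spread per game" is interpreted here as average scoring margin, and ties are counted as half a win. All of these are assumptions.

```perl
use strict;
use warnings;

# Made-up results for illustration; CCC has played one game fewer than AAA.
my @games = (
    { home => 'AAA', away => 'BBB', home_score => 24, away_score => 10 },
    { home => 'CCC', away => 'AAA', home_score => 17, away_score => 20 },
    { home => 'BBB', away => 'CCC', home_score => 14, away_score => 14 },
    { home => 'BBB', away => 'AAA', home_score => 21, away_score => 27 },
);

# Turn cumulative results into per-game rates for every team.
sub rate_teams {
    my @results = @_;
    my %total;
    for my $g (@results) {
        # Record each game once from each team's point of view.
        my @sides = (
            [ $g->{home}, $g->{home_score}, $g->{away_score} ],
            [ $g->{away}, $g->{away_score}, $g->{home_score} ],
        );
        for my $side (@sides) {
            my ( $team, $for, $against ) = @$side;
            my $s = $total{$team}
                ||= { games => 0, wins => 0, ties => 0, pf => 0, pa => 0 };
            $s->{games}++;
            $s->{wins}++ if $for > $against;
            $s->{ties}++ if $for == $against;
            $s->{pf} += $for;
            $s->{pa} += $against;
        }
    }
    my %rating;
    for my $team ( keys %total ) {
        my $s = $total{$team};
        my $n = $s->{games};
        $rating{$team} = {
            games      => $n,
            win_pct    => ( $s->{wins} + 0.5 * $s->{ties} ) / $n,
            spread_pg  => ( $s->{pf} - $s->{pa} ) / $n,
            scored_pg  => $s->{pf} / $n,
            allowed_pg => $s->{pa} / $n,
        };
    }
    return \%rating;
}

my $rating = rate_teams(@games);
for my $team (
    sort { $rating->{$b}{win_pct} <=> $rating->{$a}{win_pct}
        || $rating->{$b}{spread_pg} <=> $rating->{$a}{spread_pg} }
    keys %$rating )
{
    my $r = $rating->{$team};
    printf "%-4s G=%d  pct=%.3f  spread=%+.2f  for=%.2f  against=%.2f\n",
        $team, @{$r}{qw(games win_pct spread_pg scored_pg allowed_pg)};
}
```

Because every figure is divided by games played, a team coming off a bye can be compared directly with one that has played an extra game.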

Modules needed

We'll list modules used by these programs, at least the ones in Ubuntu. In Windows you can install ActiveState Perl, and the modules used are either part of the standard ActiveState distribution or can be downloaded with the ppm utility that ActiveState provides. In Ubuntu, the modules that don't ship with Perl itself are packaged as libwww-perl (LWP::UserAgent), libhtml-tree-perl (HTML::TreeBuilder), and libconfig-simple-perl (Config::Simple).

If you install MySQL, then libdbi-perl and libdbd-mysql-perl get installed for you.