Google Maps for library publication info

Where do books come from?

Summary

This program takes publication information from library records, stored in OPACs and retrieved online or imported from export files, and plots that information on maps. A given location is marked with some kind of aggregate value, such as the number of chosen books that come from there or the earliest date of publication.
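As a rough illustration of the two aggregate values mentioned above, the following sketch counts records per place and finds the earliest publication year per place. The `PubRecord` and `PlaceAggregates` classes are hypothetical names made up for this example, not the program's actual classes, and the place names are assumed to be already resolved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical holder for one record: a resolved place and a publication year. */
class PubRecord {
    final String place;
    final int year;

    PubRecord(String place, int year) {
        this.place = place;
        this.year = year;
    }
}

class PlaceAggregates {
    /** Number of records published at each place. */
    static Map<String, Integer> countByPlace(List<PubRecord> records) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (PubRecord r : records) {
            Integer old = counts.get(r.place);
            counts.put(r.place, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Earliest publication year seen at each place. */
    static Map<String, Integer> earliestByPlace(List<PubRecord> records) {
        Map<String, Integer> earliest = new TreeMap<String, Integer>();
        for (PubRecord r : records) {
            Integer old = earliest.get(r.place);
            if (old == null || r.year < old) {
                earliest.put(r.place, r.year);
            }
        }
        return earliest;
    }

    public static void main(String[] args) {
        List<PubRecord> records = Arrays.asList(
            new PubRecord("Cambridge, MA", 1999),
            new PubRecord("Cambridge, MA", 1962),
            new PubRecord("Oxford", 1987));
        System.out.println(countByPlace(records));
        System.out.println(earliestByPlace(records));
    }
}
```

Either map can then be used to pick the value shown on a marker for that place.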

Unlike OCLC WorldMap, which displays large amounts of mostly static data with authoritative geographical information at a coarse country level, this program tries to display moderate amounts of data arrived at dynamically through a query and to refine geographical positions down to the city starting from more free-form data.

The aim is to use as much of the available bibliographic information as possible to determine, as accurately as possible, which place a record refers to. For instance, "Cambridge" in the human-readable record is ambiguous between Cambridge, Massachusetts and Cambridge, Cambridgeshire. But given an additional machine-readable record that says MA or UK, we can disambiguate. Lacking that, we can use the knowledge that the publisher is the Harvard University Press or the Cambridge University Press. In fact, given no more information than Harvard University Press, as might happen with Amazon data, we can make a good bet that it is from greater Boston. This is done with a user-extended rule-based system.
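The fallback order described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PlaceResolver`, its method, and the hard-coded rule tables are hypothetical, and the real rules live in a user-extended database rather than in code. The machine-readable codes used here are the MARC country codes `mau` (Massachusetts) and `enk` (England).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PlaceResolver {
    // Rule table 1 (hypothetical): place name plus MARC country code -> locality.
    private static final Map<String, String> BY_PLACE_AND_CODE = new HashMap<String, String>();
    // Rule table 2 (hypothetical): publisher name -> locality.
    private static final Map<String, String> BY_PUBLISHER = new HashMap<String, String>();

    static {
        BY_PLACE_AND_CODE.put("cambridge|mau", "Cambridge, Massachusetts");
        BY_PLACE_AND_CODE.put("cambridge|enk", "Cambridge, Cambridgeshire");
        BY_PUBLISHER.put("harvard university press", "Cambridge, Massachusetts");
        BY_PUBLISHER.put("cambridge university press", "Cambridge, Cambridgeshire");
    }

    /**
     * Tries the most specific rule first, then falls back to the publisher.
     * Returns null when no rule applies, so the caller can hand the record
     * to an external geocoder or ask the user for a new mapping.
     */
    static String resolve(String place, String marcCode, String publisher) {
        if (place != null && marcCode != null) {
            String key = place.trim().toLowerCase(Locale.ROOT) + "|" + marcCode.trim();
            String hit = BY_PLACE_AND_CODE.get(key);
            if (hit != null) {
                return hit;
            }
        }
        if (publisher != null) {
            return BY_PUBLISHER.get(publisher.trim().toLowerCase(Locale.ROOT));
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("Cambridge", "mau", null));
        System.out.println(resolve(null, null, "Harvard University Press"));
    }
}
```

A bare "Cambridge" with no code and no publisher stays unresolved in this sketch, which is the case where a user-supplied rule or a geocoder would have to step in.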

The approach can also make up for cataloging anomalies.

Samples

Background

This began as a further home cataloging exercise. The idea was enhanced based on some of the recent "mashup" discussions on the LibraryThing blogs.

Methodology

The major pieces of the implementation are as follows.
  1. Access to publication bibliographic records. These are Java classes.
  2. Static data on authoritative geographical locales, such as mappings between MARC codes and ISO country codes, together with latitude and longitude. This information is easily tracked down on the web, and turned into flat files with editor macros for loading into a database, where it can be updated from time to time as needed.
  3. Pattern matching to take the semi-structured information from step 1 and split it up further by understanding some common cataloging conventions. This could be separate code written in a pattern-oriented scripting language like Perl. But right now it is Java code using regular expressions.
  4. A database mapping place names as they occur in publication info to actual localities with coordinates. What is interesting about this is that it could be an online distributed effort, with users entering new mappings to refine their results. Note that the number of places where books get published is significantly smaller than the number of books or the number of places overall. For now, I have only put together some JSP web pages as a proof of concept.
  5. An online geocoder to aid in getting coordinates and to automate simple matches. I am using MetaCarta GeoParser and Google Maps. The former can return multiple matches. The latter is not very forgiving about format and tends to give up in the face of ambiguity. A much better general-purpose geocoder that took everything known from the record free-form would obviate the need for better versions of some of the other parts. There may well be other ones available; I didn't look very hard.
  6. XSLT and Javascript to convert the XML results data into HTML for interactive display in a browser.
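To give a feel for the pattern matching in step 3, here is a minimal sketch that splits a publication statement written with the usual "Place : Publisher, Date." punctuation. The regular expression and the `Imprint` and `ImprintParser` classes are assumptions made for this example; the program's own patterns are not shown here and certainly handle more conventions than this one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical result type for a parsed publication statement. */
class Imprint {
    final String place;
    final String publisher;
    final int year;

    Imprint(String place, String publisher, int year) {
        this.place = place;
        this.publisher = publisher;
        this.year = year;
    }
}

class ImprintParser {
    // place : publisher, [c]year.   (brackets around the year are optional)
    private static final Pattern IMPRINT = Pattern.compile(
        "^\\s*(.+?)\\s*:\\s*(.+?)\\s*,\\s*\\[?c?(\\d{4})\\]?\\.?\\s*$");

    /** Returns the parsed parts, or null if the text does not fit the pattern. */
    static Imprint parse(String text) {
        Matcher m = IMPRINT.matcher(text);
        if (!m.matches()) {
            return null;
        }
        // Catalogers put supplied places in square brackets; drop them.
        String place = m.group(1).replaceAll("[\\[\\]]", "").trim();
        return new Imprint(place, m.group(2), Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        Imprint i = parse("Cambridge, Mass. : Harvard University Press, 1999.");
        System.out.println(i.place + " | " + i.publisher + " | " + i.year);
    }
}
```

Statements that do not match come back as null, so they can be left for a more lenient pattern or passed free-form to a geocoder.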

Challenges

LibraryThing data obtained from Amazon does not include place of publication. I cleaned up my own data to add it; I suspect others would prefer to do so as well. Another possibility, mentioned above, is to add patterns to the mappings database that match the publisher name together with an unknown place. This would certainly work for the major publishers in New York City, like Random House under its various brands, and a representative place could be chosen for ones like the Oxford University Press.

Client-side XSLT rendering does not quite work with Firefox. The problem is that the document object model presented to Javascript still has parts of the original XML DOM tree in addition to the transformed HTML DOM tree. Different stylesheet settings can get it close to working, although they have problems with accented characters. But even those fail deep inside the Google Maps API.

There is no technical reason that mapping could not be to street level locations, rather than just city. More realistically, one might want to distinguish that Harvard University Press is 02138 and MIT Press is 02139. (Although ZIP codes can be problematic. I live in one that has parts in three different municipalities, which are in turn in three different counties.)

It is somewhat inconvenient to have to wave the mouse around to see the associated value information. But many label windows all stacked on top of one another is a mess, too. The attempt to have some slight color gradation for the values doesn't seem quite good enough, either. Perhaps something more garish really is needed.

One of the ideas that came up in the online discussions is displaying the oldest book for a number of locations. Doing this without any further qualification on the books yields too many records to process; many servers refuse to return result sets larger than 10,000. It should still be possible by asking the query server to sort the records ascending by date and then taking just the first record and ignoring the rest. This query would then be run once for each of a set of country codes or city matches. Unfortunately, I was unable to find an unsecured Z39.50 server that supports sorting.

Related to the above, I thought that a map of sources of incunabula would be interesting. But I am unable to find a server that supports a bib-1 relation attribute other than the default. I may not have figured out the necessary other attributes properly. The alternative is to have about fifty separate queries, each matching one year from 1450 to 1500. Or use an entirely different search; Yale tags incunabula with a 690 local subject.
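The one-query-per-year workaround could be generated along these lines. This sketch assumes the queries are written in PQF (the prefix query notation used by the YAZ toolkit) and that the server indexes bib-1 use attribute 31, date of publication; neither is confirmed for any particular library, and `YearQueries` is a hypothetical name. No relation attribute is given, so the server's default applies.

```java
import java.util.ArrayList;
import java.util.List;

class YearQueries {
    // Bib-1 use attribute for date of publication.
    static final int DATE_OF_PUBLICATION = 31;

    /** One PQF query per year, inclusive at both ends. */
    static List<String> perYear(int firstYear, int lastYear) {
        List<String> queries = new ArrayList<String>();
        for (int year = firstYear; year <= lastYear; year++) {
            queries.add("@attr 1=" + DATE_OF_PUBLICATION + " " + year);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : perYear(1450, 1500)) {
            System.out.println(q);
        }
    }
}
```

Counting both end years, the range 1450 to 1500 yields 51 queries, whose results would then be merged before aggregating by place.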

The place group feature is somewhat less useful than it might be because the bib-1 use attributes do not have anything to match the MARC 008/15-17 field, only the 260$a field. So one cannot count books by state, for instance.

There are issues with character sets. USMARC records should be in the MARC-8 character set, an extension of ANSEL. But they are sometimes in the ISO-8859-1 character set.

Getting a deep link from a record in a query result for later display is library-specific. OCLC registers how to do this given an ISBN. (Or not, although I have found that the non-ISBN searches often don't work.) In the case where multiple books from the same place / publisher are aggregated into a count greater than one, the info window should have a link to a search that combines the original query with a place condition. I could not get this to work with any of the libraries I used. LOC comes close, but there is no SRW bath. field for place of publication, only publisher.

Rules for publisher-only records could be automated by recording the place as complete library records pass through the system.
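One way this might look, purely as a sketch: remember which place each publisher was seen with in complete records, and suggest the most frequent one when a record has a publisher but no place. The class `PublisherPlaceLearner` and its in-memory maps are hypothetical; a real version would presumably write to the mappings database instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PublisherPlaceLearner {
    // publisher (lower-cased) -> place -> number of complete records seen
    private final Map<String, Map<String, Integer>> seen =
        new HashMap<String, Map<String, Integer>>();

    private static String key(String publisher) {
        return publisher.trim().toLowerCase(Locale.ROOT);
    }

    /** Record one complete library record's publisher and place. */
    void observe(String publisher, String place) {
        if (publisher == null || place == null) {
            return;
        }
        Map<String, Integer> places = seen.get(key(publisher));
        if (places == null) {
            places = new HashMap<String, Integer>();
            seen.put(key(publisher), places);
        }
        Integer n = places.get(place);
        places.put(place, n == null ? 1 : n + 1);
    }

    /** Most frequently observed place for this publisher, or null if never seen. */
    String suggest(String publisher) {
        Map<String, Integer> places = seen.get(key(publisher));
        if (places == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : places.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Suggestions from such a learner are only a bet, so it would make sense to let the user confirm them before they become rules.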

Running

All server queries and map generation can be done standalone from the command line with a minimum of requirements (pretty much just the Java Runtime). To refine the rules to deal with data anomalies or gaps in the external geocoders' knowledge, a simple browser interface runs against the server (which can be local). A good sense of how much (or how little) this interface does can be gathered from the associated help.

Standalone

One time setup:
Either one of:
Precompiled
  • Install the Java Runtime Environment (JRE) version 5 if you have not already.
  • Download the Dependent Libraries archive and unzip it into zlibmap/lib.
  • Open a shell / command prompt window connected to the zlibmap directory.
  • Run bin\createdb to initialize the place database.
  • Test by running simplequery or runquery (see below).
Compile yourself
  • Install the Java SE Development Kit (JDK) version 5 if you have not already.
  • Install Apache Ant if you have not already.
  • Install all the dependent toolkits listed below.
  • Run ant copy-external all create-database import-database in zlibmap. If there are any problems, try these targets one at a time.
  • Test by running ant test-simple or test-query.
Commands:

Some shell / command prompt scripts are included to make running standalone a little easier. All they do is invoke the Java programs.

The .cmd files are for Windows (tested on XP) and the .sh files are for Unix (tested on Debian Linux).

Local web server

These instructions are for Apache Tomcat, but everything is fairly basic and standard.
Either one of:
Precompiled
Compile yourself
  • Compile standalone using Ant as above.
  • Run ant tomcat-install in zlibmap.

Dependencies

The following FOSS pieces or web services were used.

Terms

This software is released as open source under The MIT License. You use it at your own risk. See license.txt for details.