Vectorized Hathi Trust Features

A recent paper of mine described a new method for turning the full digital library into a vectorized set of features, based on word counts, that ordinary computing hardware can handle.1

To get a sense of what sorts of textual properties these vectors give insight into, you can read (in addition to the second half of the paper above) my visual bibliography of 13 million Hathi books or my discussion of 130,000 works of fiction.

This page is a guide for anyone who actually wants to use them for exploration or research. They can be useful in a variety of cases beyond the ones I describe in the article.

Simple route: python and half-precision features.

If you want to try exploring these features, I’d recommend the following setup.

First, install the python package to work with SRP files. This is pretty easy: you just type pip install git+git:// into a terminal window, and then import SRP next time you’re in python. The python module exposes a number of ways to work with the files, including a simple interface for iterating through them one row at a time that you can use to create any extracts you like. (The format is the same one used by Google’s word2vec files, so code written for those files will work as well.2)
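To make the row-at-a-time idea concrete, here is a minimal sketch of reading a word2vec-style binary file using only the standard library. It is not the SRP package’s own interface. The layout is an assumption for illustration: a text header giving the row count and dimensions, then for each row an identifier, a space, the packed numbers, and a newline. The names write_rows and iter_rows and the identifiers vol.a and vol.b are hypothetical.

```python
import io
import struct


def write_rows(handle, rows, dims, code="e"):
    """Write (identifier, vector) pairs in a word2vec-style binary layout.

    code="e" packs half-precision floats; code="f" would pack the
    single-precision floats that word2vec itself uses.
    """
    rows = list(rows)
    handle.write(f"{len(rows)} {dims}\n".encode("utf-8"))
    for identifier, vector in rows:
        handle.write(identifier.encode("utf-8") + b" ")
        handle.write(struct.pack(f"<{dims}{code}", *vector))
        handle.write(b"\n")


def iter_rows(handle, code="e"):
    """Yield (identifier, vector) pairs one row at a time."""
    n_rows, dims = (int(part) for part in handle.readline().split())
    layout = f"<{dims}{code}"
    width = struct.calcsize(layout)
    for _ in range(n_rows):
        # The identifier runs up to the first space.
        identifier = bytearray()
        while True:
            char = handle.read(1)
            if char == b"":
                raise EOFError("file ended in the middle of a row")
            if char == b" ":
                break
            identifier.extend(char)
        vector = struct.unpack(layout, handle.read(width))
        handle.read(1)  # skip the newline that ends the row
        yield identifier.decode("utf-8"), vector


# Round-trip two toy rows through an in-memory file.
buffer = io.BytesIO()
toy = [("vol.a", (0.5, -1.25, 2.0)), ("vol.b", (1.0, 0.0, -0.75))]
write_rows(buffer, toy, dims=3)
buffer.seek(0)
rows = list(iter_rows(buffer))
for identifier, vector in rows:
    print(identifier, vector)
```

Because each row is read and yielded on its own, an extract can be built by filtering inside the loop without holding the whole file in memory.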

Second, download a copy of the features from zenodo. This link takes you not to the full 1280-dimensional features I used in the paper but to a more compact version. That means you can download a pretty good representation of the entire Hathi Trust, about 1 kilobyte of information per book, in 17GB of data. There are also segmentations by language and year if you just want to look at, say, French books. Because of the half-precision floats, this set can only be read with the python package above. The python package can read these binary files into a variety of more useful formats.

Why the strange binary format?

The big challenge here is that size matters, a lot. There are now 18 million books in the Hathi Trust, and to get a useful vector representation of any one of them you need at least a few hundred vectorized points, let’s say 640. If I tried to distribute these as numbers in a text file, where each point takes ten characters including spaces (e.g., -2.398139), 15 million books would take up 640 * 10 * 15,000,000 = 96GB of space.
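The arithmetic is easy to check, and to rerun with a different book count or dimensionality. This is only a back-of-the-envelope sketch; the constant names are mine, and it ignores the space taken by identifiers and line breaks.

```python
DIMENSIONS = 640          # vectorized points per book
CHARS_PER_NUMBER = 10     # e.g. "-2.398139" plus a separating space
N_BOOKS = 15_000_000      # book count used in the calculation

# One character is one byte in a plain ASCII text file.
text_bytes = DIMENSIONS * CHARS_PER_NUMBER * N_BOOKS
print(f"{text_bytes / 1e9:.0f} GB")  # 96 GB
```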

The binary format has only 640 dimensions (which probably keeps about 70% of the information at half the size); it also stores numbers as half-precision floats, which represent each number in just two bytes, if only to a few significant digits.
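To see what half precision costs, the sketch below packs the sample number from the text example into two bytes with Python’s struct module and reads it back. It illustrates the number format only, not the SRP package’s own reader, and the variable names are mine.

```python
import struct

DIMENSIONS = 640
value = -2.398139

# "e" is the struct code for an IEEE 754 half-precision float.
packed = struct.pack("<e", value)
restored = struct.unpack("<e", packed)[0]

print(len(packed))               # 2 bytes instead of ten characters
print(restored)                  # -2.3984375
print(DIMENSIONS * len(packed))  # 1280 bytes of numbers per book
```

The restored value agrees with the original to about three significant digits, which is the trade being made for the smaller file.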

Example operations

I have included a number of examples in the docs for the python package of tasks you might want to perform, such as:

  1. Taking a subset of the full Hathi collection (100,000 works of fiction) based on identifiers, and exploring the major clusters within fiction.
  2. Creating a new SRP representation of text files and plotting dimensionality reductions of them by language and time
  3. Searching for copies of one set of books in the full HathiTrust collection, and using Hathi metadata to identify duplicates and find errors in local item descriptions.
  4. Training a classifier based on library metadata using TensorFlow, and then applying that classification to other sorts of text.
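The first and third tasks share a common core: pick out rows by identifier, then rank candidates by how close their vectors are. The sketch below shows that core in plain Python on made-up four-dimensional vectors. It is not taken from the package docs; cosine similarity is an assumed choice of measure here, and the function names and identifiers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def subset(rows, wanted_ids):
    """Keep only the rows whose identifier is in wanted_ids."""
    wanted = set(wanted_ids)
    return {ident: vec for ident, vec in rows if ident in wanted}


def closest_matches(query, candidates, top_n=3):
    """Return the top_n (similarity, identifier) pairs, best first."""
    scored = [(cosine_similarity(query, vec), ident)
              for ident, vec in candidates.items()]
    return sorted(scored, reverse=True)[:top_n]


# Toy stand-ins for rows read from a feature file.
toy_rows = [
    ("vol.a", (1.0, 0.0, 2.0, -1.0)),
    ("vol.b", (0.9, 0.1, 2.1, -1.0)),
    ("vol.c", (-2.0, 1.0, 0.0, 0.5)),
    ("vol.d", (0.0, 3.0, -1.0, 0.0)),
]

fiction = subset(toy_rows, ["vol.a", "vol.b", "vol.c"])
local_copy = (1.0, 0.0, 2.0, -1.0)  # vector for a locally held book
matches = closest_matches(local_copy, fiction, top_n=2)
for score, ident in matches:
    print(f"{ident}: {score:.3f}")
```

At the scale of the full collection a loop like this would be slow, so a real search would more likely use a vectorized array library; the logic stays the same.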

There are some other promising avenues that I haven’t followed up on at length that I’m happy to talk to anyone about.

  1. SRP features are not compact representations; they waste information bandwidth in order to preserve space for other languages or vocabularies that might exist in the future. That’s by design. But a good autoencoder design might be able to preserve most of the data for English SRP features while reducing the size by another 30-60%.
  2. The classification probabilities–like those produced by the fourth example above–may be quite useful in themselves as features for measuring cultural change.

Full data: 1280-dimensional features.

The full data are available, in pieces, from Northeastern’s digital repository. This dataset is a little large for normal handling: I have worked with it as a single 64 gigabyte file, but to make downloading feasible I have chopped it into several different 2GB files by language and year. You can download just the files that are useful to you. You can also contact me if you want the full set through some other medium.

R and JavaScript libraries.

I have put somewhat less effort into optimizing libraries for R and JavaScript. The JavaScript library is especially useful, though, in building interactive websites where users shouldn’t upload long texts directly, whether for reasons of computational efficiency or privacy.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781, 2013.

Schmidt, Benjamin. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics, 2018.

  1. Benjamin Schmidt, “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries,” Journal of Cultural Analytics, 2018.

  2. Tomas Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv Preprint arXiv:1301.3781, 2013.