Board of Supervisors: finding related matters with tf-idf

Tags: data

Last time, we scraped San Francisco’s Legistar to get all the “matters” (resolutions, hearings, etc.) for San Francisco’s Board of Supervisors. Now we’ll use that data for something somewhat interesting: querying for matters related to a given matter. For example, file #990712 reads as follows:

Resolution authorizing the San Francisco Public Library to accept and expend $63,647 awarded by the California Library Services Board of the California State Library as part of the California library literacy service to help fund the adult literacy program conducted by the San Francisco Public Library.

Let’s try to find matters similar to this one. We’ll start by using tf-idf in order to convert our matters’ titles into vectors. First, some imports and data loading:

import numpy as np
import nltk
import json
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

rows = []
with open("cleaned.json") as fp:
    rows = [json.loads(line) for line in fp]
titles = [row["Title"] for row in rows]

Next, we’ll add a function to stem and tokenize our titles before shoving them into TfidfVectorizer:

stemmer = SnowballStemmer("english")

def tokenizer(text):
    return [
        stemmer.stem(token.lower())
        for s in nltk.sent_tokenize(text)
        for token in nltk.word_tokenize(s)
        if token.isalpha()
    ]

Now let’s invoke the TfidfVectorizer:

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8,
    min_df=0.001,
    stop_words="english",
    use_idf=True,
    tokenizer=tokenizer,
    ngram_range=(1, 3),
)
tfidf_matrix = tfidf_vectorizer.fit_transform(titles)
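Before moving on, it’s worth sanity-checking the property we’ll rely on in a moment: with TfidfVectorizer’s default norm="l2", every row of the fitted matrix is a unit vector, so a plain dot product between rows is a cosine similarity. Here’s a quick check on a made-up toy corpus (the titles below are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy titles -- default tokenizer, no nltk needed here.
toy_titles = [
    "resolution authorizing the library grant",
    "hearing on the adult literacy program",
    "resolution on the literacy grant program",
]
vec = TfidfVectorizer()
m = vec.fit_transform(toy_titles)

# Each row's L2 norm should be 1 (norm="l2" is the default).
norms = np.sqrt(m.multiply(m).sum(axis=1))
assert np.allclose(norms, 1.0)
```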

That wasn’t that bad. Now we can query for “most similar” by trying to find the vectors with the largest dot product. Note that TfidfVectorizer results in vectors with norm 1, so this is equivalent to cosine similarity. For example, to find the 10 documents most similar to document #1234, we can run:

from sklearn.metrics.pairwise import linear_kernel

k = 10
doc_id = 1234
# Negate so that the most similar documents have the smallest values.
sim = linear_kernel(-tfidf_matrix[doc_id], tfidf_matrix)
sim.argpartition(k)[0][:k]

There’s a somewhat subtle and probably useless micro-optimization here: we’re using numpy.argpartition to avoid sorting the entire array. Since argpartition partitions in ascending order, we want the most similar vectors to have the smallest values. That’s why we have the negative sign in -tfidf_matrix[doc_id], so that we’re computing the negation of the cosine similarity. Another subtle point is that this includes the document itself in the “most similar” list, since it has a cosine similarity of 1 with itself; that’s easy to fix if desired.
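To see the argpartition trick in isolation, here’s a toy sketch with made-up similarity scores rather than the real matrix:

```python
import numpy as np

# Made-up similarity scores for 6 documents (higher = more similar).
sims = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.5])
k = 3

# Negate so the k *most similar* documents become the k *smallest* values;
# argpartition then places those k indices (in arbitrary order) up front,
# without sorting the whole array.
top_k = np.argpartition(-sims, k)[:k]

# Same index set as a full descending sort, just cheaper for large arrays.
assert set(top_k) == set(np.argsort(-sims)[:k])
```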

In the spirit of using Datasette for things we are probably not supposed to use it for, let’s shove this logic into a SQLite UDF so that we can query it from there. First, we’ll need to construct a map from the row indexes in tfidf_matrix to the corresponding FileNumbers in the SQLite database. We’ll also serialize this all via pickle, so that we can easily unpickle it later:

import pickle

filenumbers = [row["FileNumber"] for row in rows]
with open("tfidf.data", "wb") as fp:
    pickle.dump((filenumbers, tfidf_matrix), fp)
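As a quick sanity check of the round trip (with stand-in data, since the real matrix is large): the load side has to use the same filename and unpack the tuple in the same order it was dumped.

```python
import pickle

# Stand-ins for the real objects.
filenumbers = ["990712", "990713"]
tfidf_matrix = [[0.0, 1.0], [1.0, 0.0]]  # placeholder for the sparse matrix

with open("tfidf.data", "wb") as fp:
    pickle.dump((filenumbers, tfidf_matrix), fp)

# Unpack in the same order the tuple was dumped.
with open("tfidf.data", "rb") as fp:
    loaded_filenumbers, loaded_matrix = pickle.load(fp)

assert loaded_filenumbers == filenumbers
```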

Now we’ll pivot to creating a custom Datasette plugin:

from datasette import hookimpl
from sklearn.metrics.pairwise import linear_kernel
import json
import pickle

filenumbers, tfidf_matrix = pickle.load(
    open("tfidf.data", "rb")
)
filenumber_to_idx = {
    v: i for i, v in enumerate(filenumbers)
}


def find_similar(filenumber, k):
    doc_id = filenumber_to_idx.get(filenumber)
    if doc_id is None:
        return "[]"
    v = -tfidf_matrix[doc_id]
    sims = linear_kernel(v, tfidf_matrix)
    result = [
        filenumbers[d] for d in sims.argpartition(k)[0][:k]
    ]
    return json.dumps(result)


@hookimpl
def prepare_connection(conn):
    conn.create_function(
        "find_similar", 2, find_similar, deterministic=True
    )

This creates a (deterministic) SQLite function find_similar, which takes two arguments: (1) the FileNumber to find similar matters for and (2) the number of results requested. In order to return multiple results, we serialize the output using json.dumps, then pull out the individual values via SQLite’s JSON1 functions: json_each turns a JSON array into a table. For example, we can find the top 10 matters similar to a given :filenumber:

select
  legistar.*
from
  json_each(find_similar(:filenumber, 10)) relevant,
  legistar
where
  legistar.FileNumber = relevant.value
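Outside Datasette, the same join shape can be exercised with Python’s built-in sqlite3 module and a stand-in UDF (the file numbers and titles below are made up, and json_each assumes a SQLite build with JSON1, which modern builds include):

```python
import json
import sqlite3

# A stand-in for the real UDF: it ignores its input and returns a fixed
# JSON array of (hypothetical) file numbers, just to exercise json_each.
def fake_find_similar(filenumber, k):
    return json.dumps(["990712", "990713"][:k])

conn = sqlite3.connect(":memory:")
conn.create_function("find_similar", 2, fake_find_similar, deterministic=True)
conn.execute("create table legistar (FileNumber text, Title text)")
conn.executemany(
    "insert into legistar values (?, ?)",
    [
        ("990712", "Library grant resolution"),       # made-up titles
        ("990713", "Adult literacy program hearing"),
        ("000001", "Something unrelated"),
    ],
)

rows = conn.execute(
    """
    select legistar.*
    from json_each(find_similar(:filenumber, 10)) relevant, legistar
    where legistar.FileNumber = relevant.value
    """,
    {"filenumber": "990712"},
).fetchall()
# rows now contains only the two "similar" matters.
```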

We can also add this as a Datasette canned query for maximum ease of use, by creating a metadata.json and including it in there:

{
  "databases": {
    "legistar": {
      "queries": {
        "find_similar": {
          "sql": "select legistar.* from json_each(find_similar(:filenumber, 10)) relevant, legistar where legistar.FileNumber = relevant.value",
          "hide_sql": true,
          "title": "Find similar matters"
        }
      }
    }
  }
}
A demo of the “Find similar matters” canned query.

As a closing note, Legistar also has a “Relevant Matters” field, but I suspect it’s manually populated since it’s often incomplete. As a cross-referencing & validation step, it’d be nice to compare how often tf-idf is also able to identify these “relevant” matters. Of course, that requires making ~2^15 (roughly 32,000) HTTP requests to scrape them from Legistar, but maybe it’s worth it. A project for another day. :)

Posted on 2022-03-30