Board of Supervisors: finding related matters with tf-idf
Tags: data
Last time, we scraped San Francisco’s Legistar to get all the “matters” (resolutions, hearings, etc.) for San Francisco’s Board of Supervisors. Now we’ll use this data to do something somewhat interesting: adding the ability to query for matters related to a given matter. For example, file #990712 reads as follows:
Resolution authorizing the San Francisco Public Library to accept and expend $63,647 awarded by the California Library Services Board of the California State Library as part of the California library literacy service to help fund the adult literacy program conducted by the San Francisco Public Library.
Let’s try to find matters similar to this one. We’ll start by using tf-idf in order to convert our matters’ titles into vectors. First, some imports and data loading:
import numpy as np
import nltk
import json

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

with open("cleaned.json") as fp:
    rows = [json.loads(line) for line in fp]
titles = [row["Title"] for row in rows]
Next, we’ll add a function to stem and tokenize our titles before shoving them into TfidfVectorizer:
def tokenizer(text):
    return [
        stemmer.stem(token.lower())
        for s in nltk.sent_tokenize(text)
        for token in nltk.word_tokenize(s)
        if token.isalpha()
    ]
Now let’s invoke the TfidfVectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8,
    min_df=0.001,
    stop_words="english",
    use_idf=True,
    tokenizer=tokenizer,
    ngram_range=(1, 3),
)
tfidf_matrix = tfidf_vectorizer.fit_transform(titles)
That wasn’t that bad. Now we can query for “most similar” by trying to find the vectors with the largest dot product. Note that TfidfVectorizer
results in vectors with norm 1, so this is equivalent to cosine similarity. For example, to find the 10 documents most similar to document #1234, we can run:
from sklearn.metrics.pairwise import linear_kernel

k = 10
doc_id = 1234
sim = linear_kernel(-tfidf_matrix[doc_id], tfidf_matrix)
sim.argpartition(k)[0][:k]
There’s a somewhat subtle and probably useless micro-optimization here: we’re using numpy.argpartition to avoid sorting the entire array. Since argpartition puts the smallest values first, we’ll want the most similar vectors to have the smallest values. That’s why we have the negative sign in -tfidf_matrix[doc_id]: we’re computing the negation of the cosine similarity. Another subtle point is that this includes the document itself in the “most similar” list, since it has a cosine similarity of 1; that’s easy to fix if desired.
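To see the dot-product and argpartition tricks in isolation, here’s a toy sketch with made-up unit vectors (nothing here comes from the real data):

```python
import numpy as np

# Three toy "documents" as unit (norm-1) vectors, so a dot product
# is exactly the cosine similarity.
docs = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
])

query = docs[0]       # find documents similar to document 0
sims = docs @ -query  # negated similarities: most similar => smallest value

k = 2
top = np.argpartition(sims, k)[:k]  # indices of the k smallest, unsorted
print(sorted(top.tolist()))         # prints [0, 1]: document 0 itself and its neighbor
```

Note that document 0 itself shows up in the results, exactly as described above.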
In the spirit of using Datasette for things we are probably not supposed to use it for, let’s shove this logic into a SQLite UDF so that we can query it from there. First, we’ll need to construct a map from the row indexes in tfidf_matrix to the corresponding FileNumbers in the SQLite database. We’ll also serialize all of this via pickle, so that we can easily unpickle it later:
import pickle

filenumbers = [row["FileNumber"] for row in rows]
with open("tfidf.data", "wb") as fp:
    pickle.dump((filenumbers, tfidf_matrix), fp)
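As a quick sanity check, the round trip looks like this (using a stand-in list and matrix in place of the real filenumbers and tf-idf matrix, and a temporary file in place of tfidf.data):

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins for the real data: a list of file numbers and a small matrix.
filenumbers = ["990712", "990713"]
matrix = np.eye(2)

path = os.path.join(tempfile.mkdtemp(), "tfidf.data")
with open(path, "wb") as fp:
    pickle.dump((filenumbers, matrix), fp)

# Later (e.g. in the Datasette plugin below), unpickle the pair back.
with open(path, "rb") as fp:
    loaded_filenumbers, loaded_matrix = pickle.load(fp)

print(loaded_filenumbers == filenumbers)      # True
print(np.array_equal(loaded_matrix, matrix))  # True
```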
Now we’ll pivot to creating a custom Datasette plugin:
from datasette import hookimpl
from sklearn.metrics.pairwise import linear_kernel
import json
import pickle

with open("tfidf.data", "rb") as fp:
    filenumbers, tfidf_matrix = pickle.load(fp)

filenumber_to_idx = {
    v: i
    for i, v in enumerate(filenumbers)
}


def find_similar(filenumber, k):
    doc_id = filenumber_to_idx.get(filenumber)
    if doc_id is None:
        return "[]"
    v = -tfidf_matrix[doc_id]
    sims = linear_kernel(v, tfidf_matrix)
    result = [
        filenumbers[d]
        for d in sims.argpartition(k)[0][:k]
    ]
    return json.dumps(result)


@hookimpl
def prepare_connection(conn):
    conn.create_function(
        "find_similar", 2, find_similar, deterministic=True
    )
This creates a (deterministic) SQLite function find_similar, which takes two arguments: (1) the filenumber to find similar matters for and (2) the number of results requested. In order to return multiple results, we’ll serialize our output using json.dumps. We’ll pull out the individual values via the JSON1 functions: json_each will create a SQLite table from a JSON array. For example, we can find the top 10 matters similar to a given :filenumber:
select
  legistar.*
from
  json_each(find_similar(:filenumber, 10)) relevant,
  legistar
where
  legistar.FileNumber = relevant.value
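We can exercise this query shape outside of Datasette with the standard library’s sqlite3 module, which is what Datasette uses to register the UDF. The find_similar below is a stub with a hard-coded neighbor list (the real one would consult tfidf_matrix), and the table contents are made up:

```python
import json
import sqlite3

# Stub UDF: a hard-coded neighbor list instead of the real tf-idf lookup.
def find_similar(filenumber, k):
    neighbors = {"990712": ["990712", "991234"]}
    return json.dumps(neighbors.get(filenumber, [])[:k])

conn = sqlite3.connect(":memory:")
conn.create_function("find_similar", 2, find_similar, deterministic=True)

# A tiny stand-in legistar table.
conn.execute("create table legistar (FileNumber text, Title text)")
conn.executemany(
    "insert into legistar values (?, ?)",
    [
        ("990712", "Adult literacy grant"),
        ("991234", "Library grant"),
        ("000001", "Unrelated matter"),
    ],
)

# json_each expands the JSON array returned by the UDF into rows,
# which we then join against legistar.
rows = conn.execute(
    """
    select legistar.*
    from json_each(find_similar(:filenumber, 10)) relevant, legistar
    where legistar.FileNumber = relevant.value
    """,
    {"filenumber": "990712"},
).fetchall()

print(sorted(rows))
# prints [('990712', 'Adult literacy grant'), ('991234', 'Library grant')]
```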
We can also add this as a Datasette canned query for maximum ease of use, by creating a metadata.json
and including it in there:
{
"databases": {
"legistar": {
"queries": {
"find_similar": {
"sql": "select legistar.* from json_each(find_similar(:filenumber, 10)) relevant, legistar where legistar.FileNumber = relevant.value",
"hide_sql": true,
"title": "Find similar matters"
}
}
}
}
}

As a closing note, Legistar also has a “Relevant Matters” field, but I suspect it’s manually populated, since it’s often incomplete. As a cross-referencing and validation step, it’d be nice to compare how often tf-idf is also able to identify these “relevant” matters. Of course, that requires making ~2^15 HTTP requests to scrape them from Legistar, but maybe it’s worth it. A project for another day. :)