Board of Supervisors: finding related matters with tf-idf
Tags: data
Last time, we scraped San Francisco’s Legistar
to get all the “matters” (resolutions, hearings, etc.) for San
Francisco’s Board of Supervisors. Now we’ll use this to do something
somewhat interesting: adding the ability to query for matters related
to a certain matter. For example, file #990712
reads as follows:
Resolution authorizing the San Francisco Public Library to accept and expend $63,647 awarded by the California Library Services Board of the California State Library as part of the California library literacy service to help fund the adult literacy program conducted by the San Francisco Public Library.
Let’s try to find matters similar to this one. We’ll start by using tf-idf in order to convert our matters’ titles into vectors. First, some imports and data loading:
import numpy as np
import nltk
import json
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
rows = []
with open("cleaned.json") as fp:
    rows = [json.loads(line) for line in fp]
titles = [row["Title"] for row in rows]
Next, we’ll add a function to stem and tokenize our titles before shoving them into TfidfVectorizer:
stemmer = SnowballStemmer("english")

def tokenizer(text):
    return [
        stemmer.stem(token.lower())
        for s in nltk.sent_tokenize(text)
        for token in nltk.word_tokenize(s)
        if token.isalpha()
    ]
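As a quick aside, the Snowball stemmer itself needs no NLTK data downloads, so it’s easy to poke at in isolation (the sample words below are just illustrative):

```python
from nltk.stem.snowball import SnowballStemmer

# The stemmer collapses inflected forms to a common stem, so e.g.
# "libraries" and "library" end up indistinguishable to tf-idf.
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))  # → "run"
print(stemmer.stem("authorizing"))
```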
Now let’s invoke the TfidfVectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8,
    min_df=0.001,
    stop_words="english",
    use_idf=True,
    tokenizer=tokenizer,
    ngram_range=(1, 3),
)
tfidf_matrix = tfidf_vectorizer.fit_transform(titles)
That wasn’t that bad. Now we can query for “most similar” by finding the vectors with the largest dot product. Note that TfidfVectorizer produces vectors with norm 1, so this is equivalent to cosine similarity.
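If you want to convince yourself of the norm-1 claim, here’s a quick check (the toy titles below are made up, not from the scraped data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so dot products between rows are exactly cosine similarities.
docs = [
    "resolution authorizing the public library grant",
    "hearing on the adult literacy program",
    "ordinance amending the planning code",
]
matrix = TfidfVectorizer().fit_transform(docs)
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(row_norms)  # each entry ≈ 1.0
```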
For example, to find the 10 documents most similar to document #1234, we
can run:
from sklearn.metrics.pairwise import linear_kernel

k = 10
doc_id = 1234
sim = linear_kernel(-tfidf_matrix[doc_id], tfidf_matrix)
sim.argpartition(k)[0][:k]
There’s a somewhat subtle and probably useless micro-optimization here: we’re using numpy.argpartition to avoid sorting the entire array. Since argpartition partitions in ascending order, we want the most similar vectors to have the smallest values. That’s why we have the negative sign in -tfidf_matrix[doc_id]: we’re computing the negation of the cosine similarity. Another subtle point is that this includes the document itself in the “most similar” list, since it has cosine similarity of 1; that’s easy to fix if desired.
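One sketch of that fix: fetch k + 1 candidates and filter out the query document itself (the scores below are toy values; also note that argpartition leaves the top-k in arbitrary order, so you may want a final sort for display):

```python
import numpy as np

# Toy negated-similarity scores for 4 documents; index 0 is the query
# document itself (cosine similarity 1, hence a negated score of -1.0).
sim = np.array([[-1.0, -0.2, -0.9, -0.1]])
doc_id, k = 0, 2

candidates = sim.argpartition(k + 1)[0][: k + 1]  # k + 1 smallest, unordered
top = [d for d in candidates if d != doc_id][:k]
print(sorted(top))  # → [1, 2]
```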
In the spirit of using Datasette for things we are probably not supposed to use it for, let’s shove this logic into a SQLite UDF so that we can query it from there. First, we’ll need to construct a map from the row indexes in tfidf_matrix to the corresponding FileNumbers in the SQLite database. We’ll also serialize this all via pickle, so that we can easily unpickle it later:
import pickle
= [row["FileNumber"] for row in rows]
filenumbers with open("tfidf.data", "wb") as fp:
pickle.dump((filenumbers, tfidf_matrix), fp)
Now we’ll pivot to creating a custom Datasette plugin:
from datasette import hookimpl
from sklearn.metrics.pairwise import linear_kernel
import json
import pickle
with open("tfidf.data", "rb") as fp:
    filenumbers, tfidf_matrix = pickle.load(fp)
filenumber_to_idx = {
    v: i for i, v in enumerate(filenumbers)
}

def find_similar(filenumber, k):
    doc_id = filenumber_to_idx.get(filenumber)
    if doc_id is None:
        return "[]"
    v = -tfidf_matrix[doc_id]
    sims = linear_kernel(v, tfidf_matrix)
    result = [
        filenumbers[d]
        for d in sims.argpartition(k)[0][:k]
    ]
    return json.dumps(result)

@hookimpl
def prepare_connection(conn):
    conn.create_function(
        "find_similar", 2, find_similar, deterministic=True
    )
This creates a (deterministic) SQLite function find_similar, which takes two arguments: (1) the file number to find similar matters for and (2) the number of results requested. In order to return multiple results, we serialize our output using json.dumps and pull the individual values back out via the JSON1 functions: json_each creates a SQLite table from a JSON array. For example, we can find the top 10 matters similar to a given :filenumber:
select
  legistar.*
from
  json_each(find_similar(:filenumber, 10)) relevant,
  legistar
where
  legistar.FileNumber = relevant.value
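If json_each is unfamiliar, it’s easy to poke at from Python’s sqlite3 module directly (assuming a SQLite build with the JSON1 functions, which is standard in modern Python; the second file number in the array is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# json_each exposes a JSON array as a virtual table; the "value"
# column holds each array element in turn.
rows = conn.execute(
    "select value from json_each(?)", ('["990712", "991234"]',)
).fetchall()
print(rows)  # → [('990712',), ('991234',)]
```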
We can also add this as a Datasette canned query for maximum ease of use, by creating a metadata.json and including it in there:
{
"databases": {
"legistar": {
"queries": {
"find_similiar": {
"sql": "select legistar.* from json_each(find_similar(:filenumber, 10)) relevant, legistar where legistar.FileNumber = relevant.value",
"hide_sql": true,
"title": "Find similar matters"
}
}
}
}
}

As a closing note, Legistar also has a “Relevant Matters” field, but I suspect it’s manually populated, since it’s often incomplete. As a cross-referencing and validation step, it’d be nice to measure how often tf-idf is also able to identify these “relevant” matters. Of course, that requires making ~2^15 HTTP requests to scrape them from Legistar, but maybe it’s worth it. A project for another day. :)