University Age Using SPARQL and Wikipedia

Tags: data

I wanted to know: “of the universities that students attend today, when were they founded?” It turns out Wikidata (from the makers of Wikipedia) can answer this question. We can write a SPARQL query to aggregate “total students by university year of founding”:

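# Outer query: add up the student counts for each founding year.
SELECT ?year (SUM(?st) AS ?total_students) WHERE {
  {
    # Inner query: one row per university, with its earliest inception
    # year and a single (arbitrarily sampled) student count.
    SELECT (MIN(YEAR(?date)) AS ?year) (SAMPLE(?students) AS ?st) WHERE {
      # Instance of (P31) university (Q3918) or any subclass of it (P279).
      ?university (wdt:P31/(wdt:P279*)) wd:Q3918;
        wdt:P571 ?date;        # inception
        wdt:P17 wd:Q30;        # country: United States
        wdt:P2196 ?students.   # students count
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    GROUP BY ?university
  }
}
GROUP BY ?year
ORDER BY (?year)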

I won’t attempt to explain this query – Wikidata itself has a lovely help page which does a better job than I could, and also contains a lot of other examples. If you’d like to play with this query further, you can open it in the query editor.
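
If you'd rather run the query from a script than from the editor, here's a minimal sketch of how that might look. It assumes the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results shape; the names buildQueryUrl, flattenBindings, and runQuery are my own hypothetical helpers, not part of any Wikidata library.

```javascript
// Public Wikidata Query Service endpoint.
const ENDPOINT = "https://query.wikidata.org/sparql";

// Build a GET URL for a SPARQL query, asking for JSON results.
function buildQueryUrl(sparql) {
  const params = new URLSearchParams({ query: sparql, format: "json" });
  return `${ENDPOINT}?${params}`;
}

// Flatten the standard SPARQL JSON results shape
//   { results: { bindings: [ { someVar: { value: "..." }, ... } ] } }
// into plain rows like { someVar: "..." }. All values stay strings.
function flattenBindings(json) {
  return json.results.bindings.map((binding) =>
    Object.fromEntries(
      Object.entries(binding).map(([name, cell]) => [name, cell.value])
    )
  );
}

// Hypothetical helper: fetch and flatten in one step (needs network access
// and a runtime with a global fetch, such as a browser or recent Node.js).
async function runQuery(sparql) {
  const response = await fetch(buildQueryUrl(sparql), {
    headers: { Accept: "application/sparql-results+json" },
  });
  if (!response.ok) throw new Error(`Query failed: HTTP ${response.status}`);
  return flattenBindings(await response.json());
}
```

Calling runQuery with the query text above should resolve to an array of rows with year and total_students keys, which is roughly the same flat shape the editor's JSON download gives you.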

Of course, the overall provenance of this data is a little questionable – we’re essentially aggregating a bunch of Wikipedia infoboxes. Some of the data may be uncited, and some of the data we need may be missing. It’s not a coherent dataset either: the different rows in the result measure student enrollment in different years. I don’t think I would use this for a serious data project, but it’s fun for these quick little “what if” questions.

Once you have the query, it’s not too hard to create a Vega-Lite visualization. If you click “Download”, you can download the result as JSON in a format which works directly with Vega-Lite.

Anyway, here’s what the Vega-Lite spec looks like. It’s pretty bog-standard stuff. Perhaps the only interesting thing is that I’ve used continuousBandSize to stop the bars from overlapping in a weird way.

const spec = {
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "values": []
  },
  "width": "container",
  "height": 600,
  "title": {
    "text": "US student enrollment by college inception year",
    "subtitle": "Source: Wikidata",
    "fontSize": 24
  },
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "year",
      "type": "temporal",
      "axis": {
        "labelAngle": 0
      },
      "title": "Year university was founded"
    },
    "y": {
      "field": "total_students",
      "type": "quantitative",
      "title": "Total students"
    }
  },
  "config": {
    "bar": {
      "continuousBandSize": 2
    }
  }
};
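
The spec above has an empty values array, so the query result still has to be dropped in. Here's a rough sketch of one way to do that, assuming the rows are a flat array of objects keyed by year and total_students, and that the counts may arrive as strings. withRows is a hypothetical helper of mine, and the "#chart" selector in the usage comment is just a placeholder.

```javascript
// Return a copy of a Vega-Lite spec with the given rows as inline data.
// The original spec object is left untouched.
function withRows(spec, rows) {
  const values = rows.map((row) => ({
    // Passed through unchanged for the x encoding.
    year: row.year,
    // Convert counts to numbers in case they arrive as strings.
    total_students: Number(row.total_students),
  }));
  return { ...spec, data: { values } };
}

// Usage in a page that has loaded vega-embed (not executed here):
//   vegaEmbed("#chart", withRows(spec, rows));
```

Copying the spec rather than mutating it keeps the original reusable if you want to chart a second query result with the same settings.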
Posted on 2022-06-01