Static Analysis Using Google BigQuery

Github provides public access to a massive amount of open-source code. Github analyzes this source code in order to create security alerts. I investigate using Google’s BigQuery dataset to search for possible security issues, namely XXE attacks.

XML External Entity (XXE)

The XML External Entity attack (CWE-611) is an attack when a misconfigured XML parser takes in untrusted XML input. The untrusted input then contains a reference to an “external entity”, which causes the parser to do bad things (from a security perspective).

If you have done any web development, you’ve probably seen XML entities. One example is the entity &lt;, which an XML parser expands to the literal <. All XML parsers begin by expanding these entities.

XML also allows you to declare custom entities, for example:

<!ENTITY euro "&#8364;">

This declares the entity &euro; to expand to the entity &#8364; which corresponds to the unicode character .

However, the spec also allows for external entities. For example, one can write:

<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<foo>&xxe;</foo>

This indicates that the entity &xxe; should be expanded to the contents of the file /etc/passwd. The entity is external, since it references something outside of the XML file. If the contents of the XML file are returned to the attacker directly, they can use this to read the contents of any file. But it turns out that XXEs can be abused to leak the contents of any file even if the attacker does not see the parsed XML! Explaining this out-of-band XXE variant would take us a little off-topic, but you can read the details here.

Unfortunately, Java XML libraries are typically vulnerable as their default settings parse external entities, which is rarely what the application developer intends. Detecting XXEs via looking at the code is very easy since ther are some patterns which are obviously vulnerable. Things like command injection and improper deserializers would also be good candidates for our basic static analysis.

BigQuerying the Data

The database of all Github code is a whopping 2 TB of data! I ran the following query, which I developed based on a test suite:

-- Determines if NEEDLE is a substring of HAYSTACK
CREATE TEMPORARY FUNCTION STRSTR(haystack STRING, needle STRING) AS
  (CAST(STRPOS(haystack, needle) AS BOOL));

SELECT c.id, c.content, f.repo_name, f.ref, f.path
FROM `bigquery-public-data.github_repos.contents` AS c
INNER JOIN `bigquery-public-data.github_repos.files` AS f
ON c.id = f.id
WHERE
  -- Test00
  ENDS_WITH(f.path, ".java") AND
  -- Test01
  (STRSTR(c.content, "SAXParserFactory.newInstance") AND 
   NOT STRSTR(c.content, "disallow-general-entities")) OR
  -- Test02
  (STRSTR(c.content, "DocumentBuilderFactory.newInstance") AND 
   NOT STRSTR(c.content, "disallow-doctype-decl")) OR
  -- Test03
  (STRSTR(c.content, "XMLInputFactory.newFactory") AND 
   NOT STRSTR(c.content, "XMLInputFactory.SUPPORT_DTD")) OR
  -- Test04
  (STRSTR(c.content, "TransformerFactory.newInstance") AND 
   NOT STRSTR(c.content, "XMLConstants.ACCESS_EXTERNAL_DTD")) OR
  -- Test05 and Test06
  (STRSTR(c.content, "SchemaFactory.newInstance") AND 
   NOT STRSTR(c.content, "XMLConstants.ACCESS_EXTERNAL_DTD")) OR
  -- Test07
  (STRSTR(c.content, "XMLReaderFactory.createXMLReader") AND 
   NOT STRSTR(c.content, "disallow-doctype-decl")) OR
  -- Test08
  (STRSTR(c.content, "SAXTransformerFactory.newInstance") AND 
   NOT STRSTR(c.content, "XMLConstants.ACCESS_EXTERNAL_DTD")) OR
  -- Test09
  (STRSTR(c.content, "SAXReader") AND 
   NOT STRSTR(c.content, "disallow-doctype-decl")) OR
  -- Test10
  (STRSTR(c.content, "SAXBuilder") AND 
   NOT STRSTR(c.content, "disallow-doctype-decl"))
;

The query is extremely stupid, but it gets the job done! Essentially, it looks through each possible XXE vector and sees if the corresponding patch is present. A more intelligent method would be to use BigQuery’s capability to run arbitrary Javascript code, as which has been used to run static analysis tools before, and use that for more complicated detection. I opted for this method instead for simplicity and speed.

Fortunately and unfortunately, I got many results from the query: 163424 files contained possible XXEs!

The detector is obviously imperfect, and leads to a fair amount of false positives which must be manually sorted out. I thumbed through some of the results manually. In a lot of cases, even if it was vulnerable to an XXE, it was rarely an issue in practice. XML is commonly used for configuration files, since most configuration files let you do worse things. This made classification quite laborous, and it would be a good thing to improve before more widescale use.