
Moving to a "Known Web"

Simon Thompson · May 14th, 2018

Imagine that, instead of crawling the whole internet to discover new domains, you were simply told about them in near-realtime as soon as they launched. Even better: imagine that this was available to anybody, not just the huge tech companies. What would the possibilities be, and what could be built?

A while back, I posted a tweet with a musing I'd had after reading up on the Certificate Transparency project. I wondered whether the underlying technology - designed to allow the auditing and monitoring of SSL certificate issuance - could serve a wider purpose in the area of web crawling, and whether it (or something similar) could trigger a shift in the way that crawlers operate in the distant future. In this post, I want to share my line of thinking.


Before we look at Certificate Transparency itself, let's recap the various methods by which an existing crawler (be it for a search engine, or any other purpose) might discover previously-unseen domains to process:

- Following hyperlinks from pages it has already crawled
- Sitemaps and manual submissions from site owners
- DNS zone files and domain registration data
- Third-party sources such as referrer logs, social media and link-sharing sites

Each of these has its own merits and drawbacks, but they all have one thing in common: you need to go looking for new sites to crawl. This is where Certificate Transparency comes in…

What is Certificate Transparency?

Certificate Transparency is a framework designed to allow monitoring and auditing of SSL certificate issuance. When a trusted Certificate Authority (such as Let's Encrypt) issues a new certificate, it pushes an entry with its details to a number of cryptographically-verifiable, append-only public logs, which can then be read by any number of consumers. An example consumer is the crt.sh tool, which allows us to view all of the certificates issued for this domain (simon-thompson.me).
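As a quick illustration of the "consumer" idea, here's a rough sketch of pulling the certificates crt.sh has logged for a domain via its JSON output, using Python and the requests library. The output=json parameter and the name_value/not_before response fields reflect how crt.sh behaves as far as I'm aware, so treat them as assumptions rather than a stable API.

    # Sketch: list the certificates crt.sh has logged for a domain.
    # Assumes crt.sh's output=json endpoint and its name_value / not_before
    # response fields - check crt.sh itself before relying on these.
    import requests

    def certificates_for(domain):
        resp = requests.get(
            "https://crt.sh/",
            params={"q": domain, "output": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for entry in certificates_for("simon-thompson.me"):
            print(entry.get("not_before"), entry.get("name_value"))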

Needless to say, this additional layer of transparency is a good thing for security on the web. It allows site owners to detect mis-issued certificates which could be impersonating them, and it allows rogue CAs to be identified easily. If you'd like to read a bit more about CT or want more technical detail, I recommend Scott Helme's introductory post, plus the official site.

Over the past few years, CT has increasingly become a requirement. For instance, Chrome now requires that an SSL certificate is logged via CT, otherwise it simply won't trust it - a move which has encouraged CAs to adopt the technology. Combine that with the broader shift to HTTPS as the default (e.g. Chrome's UI changes and Google's use of HTTPS as a ranking signal), and we're headed towards a web where the majority of public-facing sites use an SSL certificate and, by extension, get logged into a CT log (all good things, might I add).

Use Cases & Caveats

So we've got a near-realtime stream of domains, but what can we actually use it for? Some examples I can think of are:

- Search engines and other crawlers picking up brand-new sites almost as soon as their certificates are issued
- Security teams spotting likely phishing or typosquatting domains that borrow well-known brand names (a toy sketch of this follows after this list)
- Companies monitoring for certificates issued against their own domains, whether mis-issued or simply forgotten about
- Researchers and vulnerability scanners finding staging or pre-launch sites before they're linked to from anywhere

Of course, this list is non-exhaustive. The thing which interests me about all of this, though, is that Certificate Transparency essentially democratises the list of sites on the web, so people can build whatever they want on top of it.
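To make the brand-monitoring idea above concrete, the core of it can be as simple as checking each newly-seen domain against a watchlist of keywords. A toy sketch - the watchlist and the plain substring match are purely illustrative assumptions, not a real detection heuristic:

    # Toy sketch of the brand-monitoring idea: flag newly-seen domains that
    # contain a watched keyword. The watchlist and the substring match are
    # illustrative assumptions, not a production heuristic.
    WATCHLIST = {"paypal", "appleid", "mybank"}

    def is_suspicious(domain):
        host = domain.lower().lstrip("*.")  # drop a leading wildcard label
        return any(keyword in host for keyword in WATCHLIST)

    if __name__ == "__main__":
        for domain in ("paypal-login.example.com", "simon-thompson.me"):
            print(domain, is_suspicious(domain))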

As always, there are some caveats to the data available:

- Only sites using HTTPS (and therefore obtaining a certificate from a publicly-trusted CA) will ever appear
- You only get a domain name - there's no guarantee the site is live yet, or that it will ever serve anything interesting
- Wildcard certificates can hide the specific subdomains actually in use
- The volume of entries is huge, so you still need infrastructure to filter the stream and act on it

How to access the logs

Now, you may be wondering how easy it is (or isn't) to tap into the logs to try this out for yourself. It is possible to read them directly; however, for the scope of this blog post and for experimenting, there's an easier (and currently free) option - namely certstream - which abstracts away the work of parsing the huge logs and turns them into one simple stream.
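If you do want to go directly to the source, each CT log exposes a small HTTP API defined in RFC 6962 (get-sth, get-entries and so on). Below is a rough sketch of fetching a log's signed tree head and its first few raw entries - note that the log URL is a placeholder (pick a real one from a current known-logs list), and the entries come back as base64-encoded structures that you'd still need to parse.

    # Sketch of talking to a CT log directly via the RFC 6962 HTTP API.
    # LOG_URL is a placeholder - substitute a log from a current known-logs list.
    # Entries are returned as base64-encoded Merkle tree leaves, so real use
    # means parsing the TLS-encoded structures (left out here for brevity).
    import requests

    LOG_URL = "https://ct.example-log.org"  # placeholder, not a real log

    def get_tree_head():
        # Returns the log's current size and root hash.
        return requests.get(f"{LOG_URL}/ct/v1/get-sth", timeout=30).json()

    def get_entries(start, end):
        # Returns raw entries [start, end] inclusive; logs cap the batch size.
        resp = requests.get(
            f"{LOG_URL}/ct/v1/get-entries",
            params={"start": start, "end": end},
            timeout=30,
        )
        return resp.json()["entries"]

    if __name__ == "__main__":
        sth = get_tree_head()
        print("log contains", sth["tree_size"], "entries")
        print("fetched", len(get_entries(0, 9)), "raw entries")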

Using their code samples, you can very quickly get something up and running which looks like the below.
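(The snippet below is adapted from the certstream Python client's own example; the callback signature and message fields come from its README, so check there for the current API.)

    # Minimal certstream consumer, adapted from the project's README example
    # (pip install certstream). Prints every domain seen on newly-logged certs.
    import certstream

    def print_callback(message, context):
        # Each "certificate_update" message carries the domains on the new cert.
        if message["message_type"] == "certificate_update":
            for domain in message["data"]["leaf_cert"]["all_domains"]:
                print(domain)

    certstream.listen_for_events(print_callback, url="wss://certstream.calidog.io/")

Run it and, assuming the certstream service is up, new domains should start appearing almost immediately - that's the entire barrier to entry.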

Summary

I'm not so naive as to believe that things will change overnight, especially given that a large portion of the web (including some major sites) is still not on HTTPS, but I can't help feeling that we'll see a move towards a "known web", where a large proportion of domains can be discovered easily and with less overhead than is required today.

This could mean sites being picked up and crawled more quickly, vulnerabilities being discovered before a site's even made public, and information being disclosed that was assumed to be private (although, if you're relying on security by obscurity, that's probably not ideal anyway!).

Overall, though, I'm excited to see what unexpected use cases emerge as a side effect of Certificate Transparency as a technology.