
Moving to a "Known Web"

Simon Thompson · May 14th, 2018

Imagine that, instead of crawling the whole internet to discover new domains, you were simply told about them in near-realtime as soon as they launched. Even better: imagine that this was available to anybody, not just the huge tech companies. What would the possibilities be, and what could be built?

A while back, I posted a tweet with a musing I'd had after reading up on the Certificate Transparency project. I wondered whether the underlying technology - designed to allow the auditing and monitoring of SSL certificate issuance - could serve a wider purpose in the area of web crawling, and whether it (or something similar) could trigger a shift in the way that crawlers operate in the distant future. In this post, I want to share my line of thinking.


Before we look at Certificate Transparency itself, let's recap the various methods by which an existing crawler (be it for a search engine, or any other purpose) might discover previously-unseen domains to process:

- Following hyperlinks from pages it has already crawled
- Sitemaps and manual submissions from site owners
- DNS zone files and domain registration data
- Third-party sources such as referrer logs, social media and link-sharing sites

Each of these has its own merits and drawbacks, but they all have one thing in common: you need to go looking for new sites to crawl. This is where Certificate Transparency comes in…

What is Certificate Transparency?

Certificate Transparency is a framework designed to allow monitoring and auditing of SSL certificate issuance. When a trusted Certificate Authority (such as Let's Encrypt) issues a new certificate, it pushes an entry with its details to a number of cryptographically-verifiable, append-only public logs, which can then be read by any number of consumers. An example consumer is the crt.sh tool, which allows us to view all of the certificates issued for this domain (simon-thompson.me).
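As a quick illustration of the "consumer" idea, here's a rough sketch of pulling the certificates crt.sh has logged for a domain via its JSON output, using Python and the requests library. The output=json parameter and the name_value/not_before response fields reflect how crt.sh behaves as far as I'm aware, so treat them as assumptions rather than a stable API.

    # Sketch: list the certificates crt.sh has logged for a domain.
    # Assumes crt.sh's output=json endpoint and its name_value / not_before
    # response fields - check crt.sh itself before relying on these.
    import requests

    def certificates_for(domain):
        resp = requests.get(
            "https://crt.sh/",
            params={"q": domain, "output": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for entry in certificates_for("simon-thompson.me"):
            print(entry.get("not_before"), entry.get("name_value"))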

Needless to say, this additional layer of transparency is a good thing for security on the web. It allows site owners to detect mis-issued certificates which could be impersonating them, and it allows rogue CAs to be identified easily. If you'd like to read a bit more about CT or want more technical detail, I recommend Scott Helme's introductory post, plus the official site.

Over the past few years, CT has increasingly become a requirement. For instance, Chrome now requires that an SSL certificate is logged via CT, otherwise it simply won't trust it - a move which has encouraged CAs to adopt the technology. Combine that with the broader shift to HTTPS as the default (e.g. Chrome's UI changes and Google's use of HTTPS as a ranking signal), and we're headed towards a web where the majority of public-facing sites use an SSL certificate and, by extension, get logged into a CT log (all good things, might I add).

Use Cases & Caveats

So we've got a near-realtime stream of domains, but what can we actually use it for? Some examples I can think of are:

- Search engines and other crawlers picking up brand-new sites almost as soon as their certificates are issued
- Security teams spotting likely phishing or typosquatting domains that borrow well-known brand names (a toy sketch of this follows after this list)
- Companies monitoring for certificates issued against their own domains, whether mis-issued or simply forgotten about
- Researchers and vulnerability scanners finding staging or pre-launch sites before they're linked to from anywhere

Of course, this list is non-exhaustive. The thing which interests me about all of this, though, is that Certificate Transparency essentially democratises the list of sites on the web, so people can build whatever they want on top of it.
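To make the brand-monitoring idea above concrete, the core of it can be as simple as checking each newly-seen domain against a watchlist of keywords. A toy sketch - the watchlist and the plain substring match are purely illustrative assumptions, not a real detection heuristic:

    # Toy sketch of the brand-monitoring idea: flag newly-seen domains that
    # contain a watched keyword. The watchlist and the substring match are
    # illustrative assumptions, not a production heuristic.
    WATCHLIST = {"paypal", "appleid", "mybank"}

    def is_suspicious(domain):
        host = domain.lower().lstrip("*.")  # drop a leading wildcard label
        return any(keyword in host for keyword in WATCHLIST)

    if __name__ == "__main__":
        for domain in ("paypal-login.example.com", "simon-thompson.me"):
            print(domain, is_suspicious(domain))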

As always, there are some caveats to the data available:

- Only sites using HTTPS (and therefore obtaining a certificate from a publicly-trusted CA) will ever appear
- You only get a domain name - there's no guarantee the site is live yet, or that it will ever serve anything interesting
- Wildcard certificates can hide the specific subdomains actually in use
- The volume of entries is huge, so you still need infrastructure to filter the stream and act on it

How to access the logs

Now, you may be wondering how easy it is (or isn't) to tap into the logs to try this out for yourself. It is possible to read them directly; however, for the scope of this blog post and for experimenting, there's an easier (and currently free) option - namely certstream - which abstracts away the work of parsing the huge logs and turns them into one simple stream.
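If you do want to go directly to the source, each CT log exposes a small HTTP API defined in RFC 6962 (get-sth, get-entries and so on). Below is a rough sketch of fetching a log's signed tree head and its first few raw entries - note that the log URL is a placeholder (pick a real one from a current known-logs list), and the entries come back as base64-encoded structures that you'd still need to parse.

    # Sketch of talking to a CT log directly via the RFC 6962 HTTP API.
    # LOG_URL is a placeholder - substitute a log from a current known-logs list.
    # Entries are returned as base64-encoded Merkle tree leaves, so real use
    # means parsing the TLS-encoded structures (left out here for brevity).
    import requests

    LOG_URL = "https://ct.example-log.org"  # placeholder, not a real log

    def get_tree_head():
        # Returns the log's current size and root hash.
        return requests.get(f"{LOG_URL}/ct/v1/get-sth", timeout=30).json()

    def get_entries(start, end):
        # Returns raw entries [start, end] inclusive; logs cap the batch size.
        resp = requests.get(
            f"{LOG_URL}/ct/v1/get-entries",
            params={"start": start, "end": end},
            timeout=30,
        )
        return resp.json()["entries"]

    if __name__ == "__main__":
        sth = get_tree_head()
        print("log contains", sth["tree_size"], "entries")
        print("fetched", len(get_entries(0, 9)), "raw entries")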

Using their code samples, you can very quickly get something up and running which looks like the below.
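(The snippet below is adapted from the certstream Python client's own example; the callback signature and message fields come from its README, so check there for the current API.)

    # Minimal certstream consumer, adapted from the project's README example
    # (pip install certstream). Prints every domain seen on newly-logged certs.
    import certstream

    def print_callback(message, context):
        # Each "certificate_update" message carries the domains on the new cert.
        if message["message_type"] == "certificate_update":
            for domain in message["data"]["leaf_cert"]["all_domains"]:
                print(domain)

    certstream.listen_for_events(print_callback, url="wss://certstream.calidog.io/")

Run it and, assuming the certstream service is up, new domains should start appearing almost immediately - that's the entire barrier to entry.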

Summary

I'm not so naive as to believe that things will change overnight, especially given that a large portion of the web (including some major sites) is still not on HTTPS, but I can't help feeling that we'll see a move towards a "known web", where a large proportion of domains can be discovered easily and with less overhead than is required today.

This could mean sites being picked up and crawled more quickly, vulnerabilities being discovered before a site's even made public, and information being disclosed that was assumed to be private (although, if you're relying on security by obscurity, that's probably not ideal anyway!).

Overall, though, I'm excited to see what unexpected use cases emerge as a side effect of Certificate Transparency as a technology.