
Moving to a “Known Web” – Using Certificate Transparency to crawl the internet

Imagine that, instead of crawling the whole internet to discover new domains, you were just told about them as soon as they launched, in near-realtime. Even better: imagine that this was available to anybody, not just the huge tech companies. What could the possibilities be, and what could be built?

A while back now, I posted a tweet with a musing I’d had after reading up on the Certificate Transparency project. I was wondering if the underlying technologies – which are designed to allow for the auditing and monitoring of SSL certificate issuance – could serve a wider purpose in the area of web crawling, and whether it (or something similar) could trigger a shift in the way that crawlers operate in the distant future. In this post, I wanted to share my line of thinking.

Before we start to look at Certificate Transparency itself, let’s recap on the various methods by which an existing crawler (be it for a search engine, or any other purpose) might discover previously-unseen domains to process;

  • Standard Crawling – Relies on sites being linked to in order to discover them, and can be resource intensive (you may need to crawl an entire site just to discover one new domain).
  • Manual Submission – Relies on people manually submitting their sites to you.
  • Parsing TLD Zone Files – Only discovers the apex domain – you’ll be missing any subdomains.

Each of these has its own merits and drawbacks, but one thing they have in common is this: you need to somehow go looking for new sites to crawl. This is where Certificate Transparency comes in…

What is Certificate Transparency?

Certificate Transparency is a framework designed to allow monitoring and auditing of SSL certificate issuance. When a trusted Certificate Authority (such as Let’s Encrypt or Cloudflare) issues a new certificate, they push an entry containing its details to a number of cryptographically-verified public logs, which can then be read by any number of consumers. An example consumer is the crt.sh tool, which allows us to view all of the certificates generated for this domain (simon-thompson.me).
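If you want to poke at this data programmatically, crt.sh also exposes a JSON output. The sketch below is a minimal example, assuming the output=json query parameter and the field names crt.sh returned at the time of writing, run somewhere fetch is available (a browser console or a recent Node.js):

```js
// A rough sketch of pulling certificate records for a domain from crt.sh's
// JSON output. The query parameters and field names (name_value, not_before)
// are assumptions based on crt.sh's behaviour at the time of writing --
// inspect the response yourself before relying on them.
const domain = 'simon-thompson.me';

// %25 is a URL-encoded '%' wildcard, so this also matches subdomains.
fetch(`https://crt.sh/?q=%25.${domain}&output=json`)
  .then(response => response.json())
  .then(records => {
    records.forEach(record => {
      // name_value holds the domain name(s) the certificate was issued for.
      console.log(record.not_before, record.name_value);
    });
  })
  .catch(err => console.error('crt.sh lookup failed:', err));
```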

Needless to say, this additional layer of transparency is a good thing for security on the web. It allows site owners to detect mis-issued certificates which could be impersonating them, and it allows rogue CAs to be identified easily. If you’d like to read up a bit more on CT or want more technical details, I recommend Scott Helme’s introductory post, plus the official site itself.

Over the past few years, CT has increasingly become a requirement. For instance, Chrome now requires that an SSL certificate is logged via CT, otherwise it simply won’t trust it – a move which has encouraged CAs to adopt the technology. When you combine that with the increasing shift to HTTPS as the default (e.g. Chrome’s UI changes and Google’s incorporation of HTTPS as a ranking signal), we’re heading towards a point where the majority of the public-facing web will be using an SSL certificate and, by extension, getting logged into a CT log (all good things, might I add).

Use Cases & Caveats

So we’ve got a near-realtime stream of domains, but what can we actually use it for? Some examples I can think of are;

  • Search Engines – Albeit more limited by the caveats I’ll detail momentarily, search engines could choose to use Certificate Transparency logs as a source of new domains for their crawlers.
  • Phishing Detection – Facebook (and others) have already done some work in this area; the stream makes it possible to rapidly detect potential phishing attacks.
  • Vulnerability Analysis – Given that most people won’t be aware that the certificates they generate are being logged, a large number of staging, development and otherwise hidden environments can be exposed.

As an aside, if you’re hoping to protect against this, you should look to implement security measures – such as HTTP basic auth for staging sites – before generating an SSL certificate, or as soon after as possible. Assume that, within a matter of hours at most, it will be public knowledge and being poked at by vulnerability scanners. There are also some options around redaction in CT which you may wish to review.

Of course, this list is non-exhaustive. The thing which interests me about all of this, though, is that Certificate Transparency essentially democratises the list of sites on the web, so people can build whatever they want on top of it.

As always, there are some caveats to the data available;

  1. By definition, you’re only going to discover sites which have had a certificate generated and are on HTTPS. This will be an increasingly large portion of the web as time goes by, but you’ll still be missing non-secure sites and might need to detect them through other means.
  2. A lot of these domains are probably not designed to be public. For example, a huge number of certs are for things like webmail, staging, or control panel subdomains which, depending on your use case, aren’t worth pursuing. You could filter these out, of course.
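As a purely illustrative sketch of that kind of filtering, you could drop hostnames whose first label looks like an internal service – the prefix list below is an assumption to adapt to your own use case:

```js
// A naive, purely illustrative filter for hostnames that probably aren't
// intended to be public-facing. The prefix list is an assumption -- tune it
// (or replace it entirely) for your own use case.
const internalPrefixes = ['webmail', 'mail', 'cpanel', 'staging', 'dev', 'test', 'autodiscover'];

function looksPublic(hostname) {
  const firstLabel = hostname.toLowerCase().split('.')[0];
  return !internalPrefixes.includes(firstLabel);
}

console.log(looksPublic('www.example.com'));     // true
console.log(looksPublic('staging.example.com')); // false
```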

How to access the logs

Now, you may be wondering how easy it is (or isn’t) to tap into the logs to try this out for yourself. It is possible to read them directly; however, for the scope of this blog post and for experimenting, there’s an easier (and currently free) option – namely certstream – which abstracts away the work of parsing the huge logs and turns them into one simple stream.

Using their code samples, you can very quickly get something up and running which looks like the below.
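The sketch below is a rough JavaScript equivalent rather than a copy of their official sample – it connects straight to certstream’s public WebSocket endpoint, and both the endpoint URL and the message shape are assumptions based on their documentation:

```js
// A minimal sketch of consuming the certstream feed by connecting straight
// to its public WebSocket endpoint. The endpoint URL and message shape are
// assumptions based on certstream's documentation -- check their docs and
// official client libraries before relying on this.
const WebSocket = require('ws'); // npm install ws

const socket = new WebSocket('wss://certstream.calidog.io/');

socket.on('message', raw => {
  const message = JSON.parse(raw.toString());

  // certstream also sends heartbeat messages; we only care about new certs.
  if (message.message_type !== 'certificate_update') return;

  // all_domains lists every hostname the certificate covers (CN + SANs).
  message.data.leaf_cert.all_domains.forEach(domain => {
    console.log('Newly logged domain:', domain);
  });
});

socket.on('error', err => console.error('certstream connection error:', err));
```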

Summary

I’m not so naive as to believe that things will change overnight, especially given that a large portion of the web (including some major sites) is still not on HTTPS, but I can’t help but feel that we’ll see a move towards a “known web” where a large number of domains can be discovered quite easily, with less overhead than is required today.

This could mean sites being picked up and crawled more quickly, vulnerabilities being discovered before a site’s even made public, and information being disclosed that was assumed to be private (although, if you’re relying on security by obscurity, that’s probably not ideal anyway!).

Overall though, I’m excited to see what unexpected use cases come out as a side effect of Certificate Transparency as a technology.

Simple DOM Manipulation via jQuery in Cloudflare Workers

I've recently been trying out Cloudflare Workers as part of another write-up which I'll be sharing soon, and I'm really excited about their potential. For those who aren't familiar with them, Cloudflare Workers allow you to write custom JavaScript code which runs "on the edge" (i.e. in Cloudflare's data centres) and can modify a user's request on its way to your origin server, or the response on the way back. This is pretty exciting, as it opens up a range of options in the realms of security, performance and customisation (among others).

The Problem

The particular use-case I'm looking at involves lots of DOM manipulation – i.e. changing page content – something which is quite tricky with Cloudflare Workers currently, as your only options are string manipulation and regex, which can get unwieldy very quickly. Ideally we'd be able to use the same tools that we use with JavaScript in a browser – like document.querySelectorAll to find elements matching a CSS selector – but Workers aren't browsers, so unfortunately these methods aren't available to us.

After a number of tests, I emailed the Cloudflare Developer Help team for their opinion. They let me know that they're planning improvements in this area in the near future, but that in the meantime another user had managed to incorporate some DOM functionality by browserifying the Node.js dom-parser module and including it in their Worker, so that might be an option to investigate. I tinkered with this and got it working pretty quickly, but it got me thinking: this is a good option for getting data out of the page, but what if we could include something like jQuery in a Cloudflare Worker? This would reduce the complexity of DOM manipulation massively, potentially even for non-developers, and allow us to easily modify the response before sending it to the client.

The Solution

It turns out, the cheerio module for Node.js provides exactly what we need – a server-side implementation of jQuery.

After a few hours of testing and tweaking, I've managed to get a proof of concept working which embeds Cheerio (jQuery) into a Cloudflare Worker. If you'd like to give it a go yourself, you can play around with the code in the Playground I've put together (alternatively the source is available in this gist). Feel free to make use of either in your own projects!

In the example below, I'm using jQuery to modify the response from the server and change the content of all h1 tags to be "¯\_(ツ)_/¯".
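(The snippet below is a condensed sketch of that proof of concept rather than the exact Playground code – it assumes cheerio has already been bundled into the Worker script, as covered in the bundling steps further down.)

```js
// A condensed sketch of the Worker. It assumes `cheerio` is already available
// in the bundled script -- this is not the exact Playground code, just the
// shape of it.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  // Fetch the original response from the origin server.
  const response = await fetch(request);
  const html = await response.text();

  // Load the HTML into cheerio to get a jQuery-style API on the edge.
  const $ = cheerio.load(html);

  // Change the content of every h1 on the page.
  $('h1').text('¯\\_(ツ)_/¯');

  // Return the modified HTML to the client, keeping the original status
  // and headers from the origin.
  return new Response($.html(), {
    status: response.status,
    headers: response.headers
  });
}
```

To the browser, this looks like the origin simply served the modified page – the status and headers are passed through untouched.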

You can also apply CSS styles, as seen in the example below (note that all of the changes happen "on the edge" before the response is sent to the client – no JavaScript is running in the client / browser);
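(Again, this is a sketch continuing on from the handler above – cheerio's .css() helper writes an inline style attribute, and if the build you bundle doesn't expose it, setting the style attribute directly does the same job.)

```js
// Continuing the sketch above: styling is just another chained call inside
// handleRequest(). cheerio's .css() writes an inline style attribute.
$('h1')
  .text('¯\\_(ツ)_/¯')
  .css('color', 'red');

// Equivalent fallback using the style attribute directly:
// $('h1').attr('style', 'color: red');
```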

The possibilities here are huge!

Bundling npm modules

If you're a developer interested in how to bundle NPM modules into a Worker script, the steps are roughly as follows;

  1. Install Browserify globally
  2. Create a new node project with npm init, and npm install the module(s) you need
  3. Create a file main.js, and add a require('module') for each module
  4. Run browserify main.js -o bundle.js
  5. Look through bundle.js and find function(require,module,exports){…} – your code will go inside of this function. Drop the sample code from the Playground site in that function, paste it back into the Playground, and check the console to see if you get any errors
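As a minimal illustration of steps 3 and 4, main.js can be as small as the file below (using cheerio from the earlier example – swap in whichever modules you actually need):

```js
// main.js -- the entry point you hand to Browserify (steps 3 and 4 above).
// One require() per module you want available inside the bundle; cheerio is
// used here purely as the example from earlier in the post.
const cheerio = require('cheerio');

// Nothing else needs to live in this file -- your Worker code gets pasted
// into the bundled function(require,module,exports){...} wrapper later
// (step 5), where the same require() calls will resolve.

// Build with: browserify main.js -o bundle.js
```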

As an additional step, you can minify the output of Step 4 by using an ES6-compatible minifier (like https://skalman.github.io/UglifyJS-online/) – this will give you some tidier code to work with. I did find that this required a bit of tweaking of minifier settings to get it working correctly, so if you're struggling – give me a shout via Twitter and I'm happy to chat!

Year In Review: 2017

2017 has been a great year for me in many ways, so I wanted to pull together some of my highlights in a quick post.

Writing & Research

Thanks to friendly nudges and support from colleagues, 2017 was the first year where I’ve started publicly sharing some of the R&D work I get to do alongside my day job. The initial catalyst was the reception to a tweet Chris put out about a piece of research we worked on together involving hourly rank tracking.

Given the evident community interest around this, I wrote up a bit more about our findings on the then newly released StrategiQ Medium blog, and went along to the talk Chris gave at Search London about it.

Luckily, as a follow-up to this I was also able to write about a (perhaps obvious) side-effect of the hourly rank tracking which we’d noticed in Google Search Console.

And finally, I polished up and released something I’d been experimenting with for quite a while – a way to view referring Twitter users in Google Analytics.

Work

2017 saw another full year at StrategiQ with a number of site launches which I’m particularly proud of. We’ve also made great strides with our development standards and hosting infrastructure, which is something I’m keen to share more about in 2018 to try and help other agencies put better processes in place too.

Open Source & Code

Mostly tied in with the blog posts above, I was able to release a few small open source projects over on GitHub. These were;

  • ghks – A Node key/value store, which uses GitHub gists for persistent storage.
  • twitlytics-server – A Node app which resolves a t.co referring URL to the original tweet / tweeter.
  • WPVersion – A JS Module and PHP Class for detecting the version of WordPress being used on a given site.

I’ve also finished up Louise’s personal website, something we’ve wanted to do for ages now, so if you happen to be looking for Music Tuition in Witham, check out her site!

Plans for 2018

Whilst I’m not one for setting specific “New Year’s Resolutions”, I’m definitely hoping to share more blog posts both here and through work in 2018, plus start actually shipping a few of the side-projects I’ve had bubbling away – so watch this space!

Back to WordPress

Until today this site ran on a script called “statik” which I wrote to turn markdown files into a website. Whilst it’s served a purpose, it didn’t do particularly well when it came to blogging – something I’m hoping to do more of.

Having had the chance to see “WordPress Done Right” over the past months, I’ve opted to switch this site over to WordPress.

Over at StrategiQ we’ve got a pretty decent setup for our sites, so I’m taking the same approach here: DNS routed through Cloudflare (with caching enabled), pointing via CNAME to hosting with WPEngine. Whilst slightly more pricey than just running a VPS, the reliability is worth it in my mind.

Hopefully this will be a motivator to write more, but also serves as a good test bed for the plugins and research I’ll be doing over the coming months.