Tracking published crawler IP ranges
Simon Thompson
A key part of ensuring a site's online visibility is making sure that good crawlers (e.g. Googlebot, Bingbot) aren't being blocked by rate limits or bot protections in your CDNs or other security systems.
Alongside a reliable regular expression to verify distinct user agents, it's important to maintain an up-to-date list of IP ranges that each crawler operates from. These ranges change frequently, so keeping track of them all can prove tricky.
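As a rough illustration of that first check, here's a minimal user-agent test for Googlebot. The pattern is illustrative rather than exhaustive; a production rule set should cover every crawler you care about, and user-agent strings are trivially spoofable on their own, which is exactly why the IP-range verification below matters:

```python
import re

# Illustrative pattern: matches documented Googlebot user-agent strings such as
# "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)".
GOOGLEBOT_UA = re.compile(r"\bGooglebot/\d+\.\d+\b")

def looks_like_googlebot(user_agent: str) -> bool:
    """Cheap first-pass check; pair it with IP-range verification,
    since the user-agent header can be set to anything."""
    return bool(GOOGLEBOT_UA.search(user_agent))

print(looks_like_googlebot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # True
```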
Fortunately, keeping track of these ranges is becoming simpler thanks to an emerging de-facto standard: the major crawlers now publish their IP prefixes in a common JSON format. For instance, here's a sample of what Google provides for Googlebot:
```json
{
  "creationTime": "2025-07-18T14:46:17.000000",
  "prefixes": [
    { "ipv6Prefix": "2001:4860:4801:10::/64" },
    { "ipv4Prefix": "192.178.4.0/27" }
    /** Truncated **/
  ]
}
```
These feeds can be consumed periodically by your security systems or tooling to generate a fresh allowlist, so good bots don't get blocked.
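As a sketch of what that consumption step might look like, here's a minimal Python example that fetches Google's published Googlebot feed (the other vendors' feeds share the same shape), parses the prefixes, and checks an incoming IP against them. Error handling, caching, and scheduling are left out:

```python
import ipaddress
import json
import urllib.request

# Google's published Googlebot feed; the other feeds use the same JSON shape.
FEED_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_allowlist(url: str = FEED_URL):
    """Fetch a feed and return its prefixes as ipaddress network objects."""
    with urllib.request.urlopen(url) as resp:
        feed = json.load(resp)
    return [
        ipaddress.ip_network(entry.get("ipv4Prefix") or entry.get("ipv6Prefix"))
        for entry in feed["prefixes"]
    ]

def is_allowed(ip: str, allowlist) -> bool:
    """Check whether a client IP falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in allowlist)

allowlist = load_allowlist()
print(is_allowed("192.178.4.1", allowlist))  # True for the sample prefix above
```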
To help keep track of changes to these feeds, I've set up a repo that uses Simon Willison's git scraping technique: it records any change to each feed in the ./src/ directory and also generates a combined feed covering all sources.
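At its core, the git scraping setup is just a scheduled job that fetches each feed, writes it into the repo, and commits only when something changed. Here's a stripped-down Python sketch of that loop; the feed list is a one-entry placeholder rather than the repo's full set, and in practice a scheduler such as cron or GitHub Actions runs it on an interval:

```python
import pathlib
import subprocess
import urllib.request

# Placeholder subset; the real repo tracks one file per feed in the table below.
FEEDS = {
    "src/googlebot.json": "https://developers.google.com/static/search/apis/ipranges/googlebot.json",
}

def scrape() -> None:
    for path, url in FEEDS.items():
        target = pathlib.Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as resp:
            target.write_bytes(resp.read())
    # Stage everything, then commit only if the staged tree differs;
    # `git diff --cached --quiet` exits non-zero when changes are staged.
    subprocess.run(["git", "add", "-A"], check=True)
    if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
        subprocess.run(["git", "commit", "-m", "Update crawler IP feeds"], check=True)

if __name__ == "__main__":
    scrape()
```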
Below is a list of the feeds being tracked at the time of writing:
| Vendor | Source |
|---|---|
| Google | googlebot.json |
| Google | special-crawlers.json |
| Google | user-triggered-fetchers.json |
| Google | user-triggered-fetchers-google.json |
| OpenAI | searchbot.json |
| OpenAI | chatgpt-user.json |
| OpenAI | gptbot.json |
| Perplexity | perplexitybot.json |
| Perplexity | perplexity-user.json |
| Microsoft | bingbot.json |
| DuckDuckGo | duckduckbot.json |
| DuckDuckGo | duckassistbot.json |
| Apple | applebot.json |
| Mistral | mistralai-user-ips.json |
| CommonCrawl | ccbot.json |
You can view the repository using the link below:
If you have any questions, please feel free to reach out!