
Cloudflare Cache For Crawlers

Why we cache bots

Search and AI crawlers make thousands of anonymous GET/HEAD requests against tenant storefronts. Every request used to traverse Caddy → Cloudflare → API Gateway → farfalla, forcing us to render full HTML repeatedly. With caching enabled at Cloudflare, those bots now receive a copy cached for four hours, while regular browsers bypass the cache because their user agents do not match the crawler allow-list.

Current rule

  • Ruleset: http_request_cache_settings → rule 243fedab25584185a62473f4b68b16c9
  • Match: http.host contains "farfalla-entry-point.publica.la" and UA matches one of Amazonbot|Anchor Browser|Applebot|archive.org_bot|bingbot|Bytespider|CCBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Claude-User|DuckAssistBot|FacebookBot|Googlebot|Google-CloudVertexBot|GPTBot|meta-externalagent|meta-externalfetcher|MistralAI-User|Novellum|OAI-SearchBot|PerplexityBot|Perplexity-User|PetalBot|ProRataInc|Timpibot
  • Action: set_cache_settings
    • cache: true, origin_cache_control: false, origin_error_page_passthru: true
    • edge_ttl.mode: override_origin, edge_ttl.default: 14400 seconds (4 h)
    • Cache key uses the origin host/path/query + Geo dimension (per-country). Language dimension is currently disabled, so multilingual HTML should include locale in the URL itself.

Because Caddy always proxies tenant domains to farfalla-entry-point.publica.la, this filter applies to every storefront hostname even though Cloudflare technically only sees the internal host header.
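
For reference, the rule object as stored in the ruleset should look roughly like the sketch below, reconstructed from the settings above. Field names follow Cloudflare's rulesets API schema and the UA regex is abbreviated, so treat it as illustrative rather than a verbatim dump.

{
  "id": "243fedab25584185a62473f4b68b16c9",
  "action": "set_cache_settings",
  "expression": "http.host contains \"farfalla-entry-point.publica.la\" and http.user_agent matches \"(Amazonbot|Anchor Browser|...)\"",
  "action_parameters": {
    "cache": true,
    "origin_cache_control": false,
    "origin_error_page_passthru": true,
    "edge_ttl": { "mode": "override_origin", "default": 14400 },
    "cache_key": {
      "custom_key": {
        "user": { "geo": true, "lang": false }
      }
    }
  }
}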

Operational notes

  1. Testing: Use the CF-Cache-Status header while hitting a tenant domain with one of the bot user agents (e.g., curl -A GPTBot https://tenant-domain/library). Expect MISS on the first request and HIT on repeats.
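A minimal check (tenant-domain is a placeholder for any storefront hostname):
curl -s -o /dev/null -D - -A "GPTBot" "https://tenant-domain/library" | grep -i cf-cache-status
curl -s -o /dev/null -D - -A "GPTBot" "https://tenant-domain/library" | grep -i cf-cache-status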
  2. Purging: Invalidate specific URLs when content changes faster than every four hours using curl -X POST .../purge_cache with a files payload, or rely on farfalla’s existing purge hooks.
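A purge sketch, assuming the $CF_API_TOKEN and $ZONE_ID set up in note 3 and an illustrative URL:
curl -s -X POST \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  --data '{"files":["https://tenant-domain/library"]}'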
  3. API snapshot: Fetch the live JSON anytime:
export CF_API_TOKEN=...
ZONE_ID="$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones?name=publica.la" | jq -r '.result[0].id')"
curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096" |
jq '.result.rules[] | select(.id=="243fedab25584185a62473f4b68b16c9")'
  4. Adjusting TTL: Increase edge_ttl.default if the cache hit ratio is high and purge automation is reliable. Drop it if crawlers complain about stale HTML.
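A sketch of one way to bump the TTL, assuming the rulesets API exposes a per-rule PATCH that accepts an updated rule body (verify before relying on it):
# Pull the current rule, raise edge_ttl.default to 28800 (8 h), resend it.
RULE="$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096" |
  jq '.result.rules[] | select(.id=="243fedab25584185a62473f4b68b16c9")')"
curl -s -X PATCH \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096/rules/243fedab25584185a62473f4b68b16c9" \
  --data "$(jq '.action_parameters.edge_ttl.default = 28800' <<<"$RULE")"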
  5. Future exclusions: If bots can reach any authenticated or dynamic views (admin, checkout, preview, API), extend the rule expression with not starts_with(http.request.uri.path, "/admin"), etc., before enabling cache there.
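A sketch of the extended expression (excluded paths are illustrative; the UA list is abbreviated):
http.host contains "farfalla-entry-point.publica.la"
  and not starts_with(http.request.uri.path, "/admin")
  and not starts_with(http.request.uri.path, "/checkout")
  and http.user_agent matches "(Amazonbot|Anchor Browser|...)"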

Troubleshooting checklist

  • Cache miss for bots → confirm user agent string exactly matches one of the listed tokens.
  • Bots still blocked → ensure WAF/firewall rules are not blocking those UAs on the proxy hostname.
  • Wrong language served → ensure locale lives in the path/query when user.lang is not part of the cache key.
  • Logged-in pages cached → confirm the browser’s user agent (or an extension) is not spoofing one of the crawler identifiers; if it is, the request will match the cache rule regardless of cookies.

JS smoke test

Run this Node.js (18+) snippet locally to exercise the rule. It hits every target twice with the bot user agent and once with the browser one, so you can observe MISS then HIT for bots and DYNAMIC for normal browsers. Hashing the body lets you confirm /library renders different HTML per hostname.

import crypto from 'node:crypto';

// Storefront URLs to probe; /library is the cacheable HTML entry point.
const targets = [
  { label: 'La Tercera /library', url: 'https://kiosco.latercera.com/library' },
  { label: 'Bajalibros /library', url: 'https://ar.bajalibros.com/library' },
  { label: 'Bajalibros publication', url: 'https://ar.bajalibros.com/library/publication/horoscopo-chino-2026' },
];

// The bot UA runs twice so the second pass can surface a cache HIT;
// the browser UA should always come back DYNAMIC.
const agents = [
  { label: 'bot', ua: 'Googlebot', repeat: 2 },
  { label: 'browser', ua: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', repeat: 1 },
];

for (const target of targets) {
  for (const agent of agents) {
    for (let i = 1; i <= agent.repeat; i += 1) {
      const res = await fetch(target.url, { headers: { 'User-Agent': agent.ua } });
      const body = await res.text();
      // SHA-1 of the body distinguishes per-hostname HTML without dumping it.
      const hash = crypto.createHash('sha1').update(body).digest('hex');
      console.log(
        `${agent.label.toUpperCase()} pass ${i} | ${target.label} | status=${res.status} | cf-cache-status=${res.headers.get('cf-cache-status')} | digest=${hash.slice(0, 10)}`,
      );
    }
  }
}
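
Save it with an .mjs extension so top-level await works under Node’s ESM loader, then run it directly, e.g. node crawler-cache-smoke.mjs (filename illustrative).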