
Cloudflare Cache For Crawlers

Why we cache bots

Search and AI crawlers make thousands of anonymous GET/HEAD requests against tenant storefronts. Every request used to traverse Caddy → Cloudflare → API Gateway → farfalla, forcing us to render full HTML repeatedly. With caching enabled at Cloudflare, those bots now receive a copy cached for four hours, while regular browsers bypass the cache because their user agents do not match the crawler allow-list.

Current rule

  • Ruleset: http_request_cache_settings → rule 243fedab25584185a62473f4b68b16c9
  • Match: http.host contains "farfalla-entry-point.publica.la" and UA matches one of Amazonbot|Anchor Browser|Applebot|archive.org_bot|bingbot|Bytespider|CCBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Claude-User|DuckAssistBot|FacebookBot|Googlebot|Google-CloudVertexBot|GPTBot|meta-externalagent|meta-externalfetcher|MistralAI-User|Novellum|OAI-SearchBot|PerplexityBot|Perplexity-User|PetalBot|ProRataInc|Timpibot
  • Action: set_cache_settings
    • cache: true, origin_cache_control: false, origin_error_page_passthru: true
    • edge_ttl.mode: override_origin, edge_ttl.default: 14400 seconds (4 h)
    • Cache key uses the origin host/path/query + Geo dimension (per-country). Language dimension is currently disabled, so multilingual HTML should include locale in the URL itself.

Because Caddy always proxies tenant domains to farfalla-entry-point.publica.la, this filter applies to every storefront hostname even though Cloudflare technically only sees the internal host header.
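
For reference, the rule object as stored in the ruleset should look roughly like the sketch below, reconstructed from the settings above. Field names follow Cloudflare's rulesets API schema and the UA regex is abbreviated, so treat it as illustrative rather than a verbatim dump.

{
  "id": "243fedab25584185a62473f4b68b16c9",
  "action": "set_cache_settings",
  "expression": "http.host contains \"farfalla-entry-point.publica.la\" and http.user_agent matches \"(Amazonbot|Anchor Browser|...)\"",
  "action_parameters": {
    "cache": true,
    "origin_cache_control": false,
    "origin_error_page_passthru": true,
    "edge_ttl": { "mode": "override_origin", "default": 14400 },
    "cache_key": {
      "custom_key": {
        "user": { "geo": true, "lang": false }
      }
    }
  }
}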

Operational notes

  1. Testing: Use the CF-Cache-Status header while hitting a tenant domain with one of the bot user agents (e.g., curl -A GPTBot https://tenant-domain/library). Expect MISS on the first request and HIT on repeats.
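A minimal check (tenant-domain is a placeholder for any storefront hostname):
curl -s -o /dev/null -D - -A "GPTBot" "https://tenant-domain/library" | grep -i cf-cache-status
curl -s -o /dev/null -D - -A "GPTBot" "https://tenant-domain/library" | grep -i cf-cache-status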
  2. Purging: Invalidate specific URLs when content changes faster than every four hours using curl -X POST .../purge_cache with a files payload, or rely on farfalla’s existing purge hooks.
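A purge sketch, assuming the $CF_API_TOKEN and $ZONE_ID set up in note 3 and an illustrative URL:
curl -s -X POST \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  --data '{"files":["https://tenant-domain/library"]}'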
  3. API snapshot: Fetch the live JSON anytime:
export CF_API_TOKEN=...
ZONE_ID="$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones?name=publica.la" | jq -r '.result[0].id')"
curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096" |
jq '.result.rules[] | select(.id=="243fedab25584185a62473f4b68b16c9")'
  4. Adjusting TTL: Increase edge_ttl.default if the cache hit ratio is high and purge automation is reliable. Drop it if crawlers complain about stale HTML.
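A sketch of one way to bump the TTL, assuming the rulesets API exposes a per-rule PATCH that accepts an updated rule body (verify before relying on it):
# Pull the current rule, raise edge_ttl.default to 28800 (8 h), resend it.
RULE="$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096" |
  jq '.result.rules[] | select(.id=="243fedab25584185a62473f4b68b16c9")')"
curl -s -X PATCH \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096/rules/243fedab25584185a62473f4b68b16c9" \
  --data "$(jq '.action_parameters.edge_ttl.default = 28800' <<<"$RULE")"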
  5. Future exclusions: If bots can reach any authenticated or dynamic views (admin, checkout, preview, API), extend the rule expression with not starts_with(http.request.uri.path, "/admin"), etc., before enabling cache there.
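A sketch of the extended expression (excluded paths are illustrative; the UA list is abbreviated):
http.host contains "farfalla-entry-point.publica.la"
  and not starts_with(http.request.uri.path, "/admin")
  and not starts_with(http.request.uri.path, "/checkout")
  and http.user_agent matches "(Amazonbot|Anchor Browser|...)"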

Troubleshooting checklist

  • Cache miss for bots → confirm user agent string exactly matches one of the listed tokens.
  • Bots still blocked → ensure WAF/firewall rules are not blocking those UAs on the proxy hostname.
  • Wrong language served → ensure locale lives in the path/query when user.lang is not part of the cache key.
  • Logged-in pages cached → confirm the browser’s user agent (or an extension) is not spoofing one of the crawler identifiers; if it is, the request will match the cache rule regardless of cookies.

JS smoke test

Run this Node.js (18+) snippet locally to exercise the rule. It hits every target twice with the bot user agent and once with the browser one, so you can observe MISS then HIT for bots and DYNAMIC for normal browsers. Hashing the body lets you confirm /library renders different HTML per hostname.

import crypto from 'node:crypto';

// Storefront URLs to probe; /library is the cacheable HTML entry point.
const targets = [
  { label: 'La Tercera /library', url: 'https://kiosco.latercera.com/library' },
  { label: 'Bajalibros /library', url: 'https://ar.bajalibros.com/library' },
  { label: 'Bajalibros publication', url: 'https://ar.bajalibros.com/library/publication/horoscopo-chino-2026' },
];

// The bot UA runs twice so the second pass can surface a cache HIT;
// the browser UA should always come back DYNAMIC.
const agents = [
  { label: 'bot', ua: 'Googlebot', repeat: 2 },
  { label: 'browser', ua: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', repeat: 1 },
];

for (const target of targets) {
  for (const agent of agents) {
    for (let i = 1; i <= agent.repeat; i += 1) {
      const res = await fetch(target.url, { headers: { 'User-Agent': agent.ua } });
      const body = await res.text();
      // SHA-1 of the body distinguishes per-hostname HTML without dumping it.
      const hash = crypto.createHash('sha1').update(body).digest('hex');
      console.log(
        `${agent.label.toUpperCase()} pass ${i} | ${target.label} | status=${res.status} | cf-cache-status=${res.headers.get('cf-cache-status')} | digest=${hash.slice(0, 10)}`,
      );
    }
  }
}
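
Save it with an .mjs extension so top-level await works under Node’s ESM loader, then run it directly, e.g. node crawler-cache-smoke.mjs (filename illustrative).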