Result cache follow-up — issue #283

Status: Open Owner: backbone maintainers Tracking issue: #291 Source PR: #290

This is the working document for the open items left behind by the result-cache landing (issue #283 / PR #290). The cache schema, service, API integration, worker hooks, beat tasks, Prometheus counter, and pre-warm config schema are all shipped on main. What remains is operationalising the feature: running the beat process, populating the curated pre-warm list, validating the hit rate in production, and documenting the version-bump invariant where operators will see it. Each section below says what to do, when to do it, and how to know it is done.


Item 1 — Run a Celery beat process in deployment

When: Before the cache can deliver any value. Without beat, the schedule in src/climate_lama/worker/celery_app.py (prewarm_result_cache at 03:00 UTC, purge_orphan_cache_entries at 04:00 UTC) is decorative — the worker container only consumes from the queue; it does not wake itself up on a cron.

Effort: Low — one new compose service plus a deployment runbook update.

Steps

  1. Add a worker-beat service to docker-compose.yml next to worker. It must use the same image but a different command:
worker-beat:
  image: ghcr.io/cortomaltese3/climate-lama-worker:${CLIMATE_LAMA_TAG:-latest}
  # or `build: { context: ., dockerfile: docker/worker.Dockerfile }` for dev
  environment:
    <<: *worker-env  # reuse the worker's anchored env block
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  command: celery -A climate_lama.worker.celery_app beat --loglevel=info
  restart: unless-stopped

Exactly one beat process must run per environment. Two beats running against the same broker will dispatch every scheduled task twice.

  2. Pick the schedule storage. Default Celery beat uses PersistentScheduler, which keeps its state in a local shelve file (celerybeat-schedule). For a single-node deployment that is fine. For a redundant deployment, replace it with redbeat (Redis-backed) so the schedule is shared and only one beat fires at a time:
uv add celery-redbeat

Then in celery_app.py:

celery_app.conf.beat_scheduler = "redbeat.RedBeatScheduler"
celery_app.conf.redbeat_redis_url = settings.celery_broker_url

Decision driver: are we running ≥2 worker hosts? If yes → redbeat. If self-host single-VM → leave the default.

  3. Add a Helm CronJob or Deployment for beat when the managed-hosting chart lands (see docs/ops/release-pipeline-followup.md Item 5). One replica, no HPA, restart policy Always.

  4. Smoke-test locally by starting the new service and watching logs:

docker compose up -d worker-beat
docker compose logs -f worker-beat

You should see beat: Starting... immediately; the Scheduler: Sending due task prewarm-result-cache-nightly lines will only appear once the nightly crontab actually fires. To force a firing immediately for verification, temporarily change the crontab to crontab(minute="*") on a branch, observe one tick, then revert.
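
For orientation, the schedule being exercised here has roughly this shape in celery_app.py (a sketch restating the times from this item, not a copy of the file; the purge entry name is an assumption, only prewarm-result-cache-nightly is confirmed by the log line above):

# Sketch of the nightly beat schedule (celery_app.py is authoritative).
from celery.schedules import crontab

celery_app.conf.beat_schedule = {
    "prewarm-result-cache-nightly": {
        "task": "climate_lama.worker.tasks.prewarm_result_cache",
        "schedule": crontab(hour=3, minute=0),   # 03:00 UTC, assuming timezone="UTC"
        # for the smoke test above, temporarily: "schedule": crontab(minute="*")
    },
    "purge-orphan-cache-entries-nightly": {      # assumed entry name
        "task": "climate_lama.worker.tasks.purge_orphan_cache_entries",
        "schedule": crontab(hour=4, minute=0),   # 04:00 UTC
    },
}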

Verify

  • docker compose ps shows worker-beat as running (healthy).
  • Worker logs at 03:00/04:00 UTC contain prewarm_result_cache: dispatched=N already_cached=M and purge_orphan_cache_entries: deleted=N log lines respectively.
  • climate_lama_celery_task_runs_total{task_name="climate_lama.worker.tasks.prewarm_result_cache"} Prometheus counter increments once per night.

Item 2 — Curate config/prewarm.yaml

When: After Item 1 ships and at least one production-shaped dataset is loaded. Curating against demo data wastes time — the entries reference real hazard_dataset_id / exposure_dataset_id values that must exist in the target environment.

Effort: Medium — data-ops task, mostly catalog research, not code.

Goal: ~20 scenarios chosen for coverage of the highest-value combinations: river flood + tropical cyclone × the most-requested exposure sets × baseline + 2030 + 2050 horizons × discount-rate presets (Stern, Nordhaus). The exact list is judgment-based; the schema is documented in config/prewarm.yaml.
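
As a sanity check on where "~20" comes from, the matrix above multiplies out roughly like this (the hazard/exposure labels are placeholders, not catalog ids, and the discount rates are illustrative values, not the project's presets):

# Illustration only: enumerate the combination matrix sketched above.
from itertools import product

hazards = ["river_flood", "tropical_cyclone"]
exposures = ["exposure_set_a", "exposure_set_b"]        # placeholder "most-requested" sets
horizons = [None, 2030, 2050]                           # None = baseline
discount_presets = {"stern": 0.014, "nordhaus": 0.043}  # illustrative rates

combos = list(product(hazards, exposures, horizons, discount_presets))
print(len(combos))  # 2 x 2 x 3 x 2 = 24; trim to the ~20 highest-value entries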

Steps

  1. List the candidate datasets:
gh api repos/CortoMaltese3/climate-lama/issues --jq '.[] | select(.title | contains("dataset"))'
# plus the live catalog:
curl https://<env>/v1/hazards | jq '.data[] | {id, name, haz_type, region}'
curl https://<env>/v1/exposures | jq '.data[] | {dataset_id, name}'
  2. For each candidate, derive the cache-key inputs (a sketch of scripting these lookups follows these steps):

     • hazard_dataset_id, hazard_dataset_sha256 — from GET /v1/hazards/{id}. The sha column is hazard_datasets.sha256 (not dataset_sha256); the value in YAML is whatever that column holds.
     • exposure_dataset_id, exposure_dataset_sha256 — from GET /v1/exposures?dataset_id=... (read dataset_sha256 off any row).
     • impact_function_id, impact_function_version — from GET /v1/impact-functions. Pin the version explicitly so a future coefficient change implicitly invalidates these entries.
     • scenario_inputs — match the saved scenarios users actually run; if none exist yet, seed with {"year": 2030, "discount_rate": 0.02, "growth_rate": 0.01} plus a 2050 variant.

  3. Replace the commented example block in config/prewarm.yaml with the curated list. Keep the schema header — it's the only docs operators read for this file.

  4. Bake it into the worker image. The worker already reads config/prewarm.yaml relative to the project root; docker/worker.Dockerfile COPY needs to include the file (verify during this work — if not, add it).

  5. Trigger one manual run to confirm the entries are well-formed:

docker compose exec worker celery -A climate_lama.worker.celery_app \
  call climate_lama.worker.tasks.prewarm_result_cache

The task returns {"dispatched": N, "already_cached": M}. Check that dispatched + already_cached equals the number of YAML entries — any shortfall means a malformed entry was logged and skipped (grep prewarm: skipping malformed entry).

  6. Verify the resulting jobs land in jobs and produce result_cache rows:
SELECT count(*) FROM result_cache WHERE created_at > now() - interval '1 hour';

Must be ≥ dispatched once those pre-warm jobs complete.
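
A minimal sketch of scripting the step-2 lookups, assuming the response envelopes match the curl/jq examples in step 1 and the field names listed in step 2 (everything else here, such as the helper name and the entry layout, is illustrative and must be reconciled with the schema header in config/prewarm.yaml):

# Hypothetical helper: assemble one pre-warm entry from the live catalog.
# Response envelopes are assumed to match the curl/jq examples above (.data).
import requests
import yaml

BASE = "https://<env>/v1"  # replace with the target environment

def build_entry(hazard_id, exposure_dataset_id, impact_function_id, scenario_inputs):
    hazard = requests.get(f"{BASE}/hazards/{hazard_id}", timeout=30).json()["data"]
    exposures = requests.get(
        f"{BASE}/exposures", params={"dataset_id": exposure_dataset_id}, timeout=30
    ).json()["data"]
    functions = requests.get(f"{BASE}/impact-functions", timeout=30).json()["data"]
    fn = next(f for f in functions if f["id"] == impact_function_id)
    return {
        "hazard_dataset_id": hazard_id,
        "hazard_dataset_sha256": hazard["sha256"],
        "exposure_dataset_id": exposure_dataset_id,
        "exposure_dataset_sha256": exposures[0]["dataset_sha256"],  # any row
        "impact_function_id": impact_function_id,
        "impact_function_version": fn["version"],  # pin explicitly
        "scenario_inputs": scenario_inputs,
    }

entry = build_entry("<hazard-id>", "<exposure-dataset-id>", "<impact-fn-id>",
                    {"year": 2030, "discount_rate": 0.02, "growth_rate": 0.01})
print(yaml.safe_dump([entry], sort_keys=False))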

Verify

  • config/prewarm.yaml contains ~20 production-shaped entries (no placeholder deadbeef shas).
  • Manual prewarm_result_cache run logs dispatched=N already_cached=0 on first run, dispatched=0 already_cached=N on the second run.
  • Cold-start request that matches a pre-warm entry returns 200 with cache_hit: true end-to-end.

Item 3 — Validate cache hit rate in production

When: ~2 weeks after Items 1+2 are running in production with real traffic. Earlier, the denominator is too small to interpret.

Effort: Low — a Grafana panel query and a one-paragraph readout.

Goal: Confirm the cache is actually delivering value. If hit rate is materially below expectations, decide whether to (a) curate more pre-warm entries, (b) widen the cache-key tolerance (e.g. round growth_rate to 4 dp), or (c) declare the feature low-leverage and de-prioritise the follow-ups.
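
If option (b) is ever chosen, the change belongs wherever the service builds the cache key; purely to illustrate what "widen the tolerance" means (this is not the project's key builder):

# Illustration only: canonicalise near-identical inputs to one key, e.g. by
# rounding float scenario inputs before hashing.
import hashlib
import json

def canonical_cache_key(parts: dict, float_precision: int = 4) -> str:
    def canon(value):
        if isinstance(value, float):
            return round(value, float_precision)  # 0.0100001 and 0.01 now collide
        if isinstance(value, dict):
            return {k: canon(v) for k, v in sorted(value.items())}
        return value
    payload = json.dumps(canon(parts), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()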

Steps

  1. Open the climate-lama dashboard in Grafana. The result-cache panel should already exist via the label-pattern query climate_lama_result_cache_hits_total{result=~"hit|miss"} — if it doesn't, add it.

  2. Run the hit-rate query for a 7-day window:

sum(rate(climate_lama_result_cache_hits_total{result="hit"}[7d]))
/
sum(rate(climate_lama_result_cache_hits_total[7d]))
  3. Interpret:

     • ≥30% — feature is delivering. Move on to Item 4.
     • 10–30% — likely under-curated pre-warm list. Add 5–10 more entries focused on the highest-traffic miss patterns (next step).
     • <10% — investigate. Could be: cache key too granular (every request truly different), impact_functions.version getting bumped too often (legitimate or accidental), or the pre-warm task is silently failing.

  4. Identify miss patterns by checking jobs.params against result_cache:

SELECT params->>'hazard_dataset_id' AS haz, count(*)
FROM jobs
WHERE job_type = 'impact_calc'
  AND params ? 'cache_key'
  AND created_at > now() - interval '7 days'
  AND params->>'cache_key' NOT IN (SELECT cache_key FROM result_cache)
GROUP BY 1 ORDER BY 2 DESC LIMIT 20;

The top rows are candidates for pre-warm curation.

  5. Write a one-paragraph readout in this file under "Outcomes" with the hit rate, the period, and the decision (more curation, widen key, or accept).
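
To pull the number for that readout without copying it out of a Grafana panel, a sketch against the Prometheus HTTP API (the PROM_URL value and network reachability are assumptions; the query is the same one as step 2):

# Sketch: evaluate the 7-day hit rate via Prometheus and print a readout skeleton.
import datetime
import requests

PROM_URL = "http://prometheus:9090"  # assumption; point at the local Prometheus
QUERY = (
    'sum(rate(climate_lama_result_cache_hits_total{result="hit"}[7d])) / '
    "sum(rate(climate_lama_result_cache_hits_total[7d]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
result = resp.json()["data"]["result"]
hit_rate = float(result[0]["value"][1]) if result else float("nan")
print(f"{datetime.date.today()}: 7-day hit rate = {hit_rate:.1%}; decision: <fill in>")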

Verify

  • Hit-rate query returns a number ≥0 (the metric exists and is being scraped).
  • Readout paragraph is added to this file under "Outcomes" with the date.
  • If hit rate is <30%, a follow-up curation task is filed against config/prewarm.yaml.

Item 4 — Audit purge_orphan_cache_entries logs

When: Quarterly, or whenever the result_cache row count drifts from expectations. Cheap to run, valuable as a smoke test that the cleanup beat task is actually firing.

Effort: Low — one log query and one SQL count.

Rationale: Cache rows have ON DELETE CASCADE on the FK, so the purge_orphan_cache_entries task should rarely delete anything (only when an impact_results row was hard-deleted via raw SQL bypassing the FK action). A consistently non-zero deleted=N is a signal someone is hard-deleting results out-of-band; a permanently zero count plus growing table size is a signal the beat task is not running at all.

Steps

  1. Pull the last 30 days of purge logs:
# Loki / structlog query — adjust to local stack
docker compose logs worker --since 720h | \
  grep -E "purge_orphan_cache_entries: deleted="
  2. Cross-check the row count and orphan count directly:
SELECT count(*) AS total_cache_rows FROM result_cache;
SELECT count(*) AS dangling FROM result_cache rc
LEFT JOIN impact_results ir ON ir.id = rc.result_id
WHERE ir.id IS NULL;

dangling should be zero. Anything else means purge_orphan_cache_entries is failing, beat is not running, or rows are being created faster than the daily purge can keep up (very unlikely — orphans only appear via hard-delete bypassing FK).

  3. If dangling > 0: trigger a manual purge to confirm the task itself works:
docker compose exec worker celery -A climate_lama.worker.celery_app \
  call climate_lama.worker.tasks.purge_orphan_cache_entries

Re-run the SQL query — dangling must now be 0. If it is not, the task itself is buggy (file a bug); if it is, the task works and beat is the problem (return to Item 1).

  4. Record the audit date and the deleted-count distribution in this file under "Outcomes".
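
A small sketch for turning the grep output from step 1 into the distribution asked for here (the deleted=N format is the one quoted in Item 1's Verify list; the rest is glue):

# Sketch: summarise deleted=N counts from worker logs.
# Usage: docker compose logs worker --since 720h | python summarise_purge.py
import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    match = re.search(r"purge_orphan_cache_entries: deleted=(\d+)", line)
    if match:
        counts[int(match.group(1))] += 1

print(f"runs seen: {sum(counts.values())}")
for deleted, runs in sorted(counts.items()):
    print(f"deleted={deleted}: {runs} run(s)")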

Verify

  • dangling count is 0 after the audit.
  • purge_orphan_cache_entries log lines exist for at least 75% of days in the audited window (some days may legitimately be missing if worker-beat was restarted).
  • Audit entry recorded in this file.

Item 5 — Document the impact_functions.version bump invariant

When: Immediately after Item 1 ships. The invariant is already in docs/CLIMATE_LAMA.md under "Result Cache" but operators won't read that doc when they sit down to edit a curve. Put the warning where they're actively working.

Effort: Trivial — two doc edits.

Rationale: The cache key includes impact_functions.version, so any change to MDD/PAA coefficients or intensity_unit that affects result values must be accompanied by a version bump. A bug fix that silently changes the curve without bumping the version will serve stale cached results forever (until manually purged).

Steps

  1. Add a banner to the impact-function admin runbook. If the runbook doesn't exist yet (verify under docs/admin/), the banner goes in:
     • the OpenAPI description for PUT /v1/impact-functions/{id},
     • any CLI / migration helper that mutates the table.

  2. Add a SQL trigger as a backstop. The trigger raises a WARNING (not an error — too disruptive) when MDD/PAA columns change without version also changing in the same UPDATE:

CREATE OR REPLACE FUNCTION warn_impact_function_version_bump()
RETURNS trigger AS $$
BEGIN
  IF (NEW.mdd_x IS DISTINCT FROM OLD.mdd_x
      OR NEW.mdd_y IS DISTINCT FROM OLD.mdd_y
      OR NEW.paa_x IS DISTINCT FROM OLD.paa_x
      OR NEW.paa_y IS DISTINCT FROM OLD.paa_y
      OR NEW.intensity_unit IS DISTINCT FROM OLD.intensity_unit)
     AND NEW.version = OLD.version
  THEN
    RAISE WARNING 'impact_functions(%) coefficients changed without bumping version — result_cache will serve stale data',
      NEW.id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER warn_impact_function_version_bump
  BEFORE UPDATE ON impact_functions
  FOR EACH ROW EXECUTE FUNCTION warn_impact_function_version_bump();

Ship as a new Alembic migration. WARNING is intentional — silently blocking the UPDATE breaks legitimate flows (e.g. attribution-only metadata changes); the warning shows up in logs where the audit log already lives.

  3. Add a unit test asserting the trigger fires on a coefficient change without a version bump (use caplog or equivalent to capture the warning; a sketch follows these steps).

  4. Update docs/CLIMATE_LAMA.md with a one-line cross-reference to this trigger so the contract and the enforcement mechanism live next to each other.
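
For step 3, one way to capture the server-side warning is the driver's notice collection rather than caplog. A sketch assuming psycopg2, a test_dsn fixture for a migrated database with at least one impact_functions row, a text intensity_unit column, and an integer version column (adjust all of these to the real schema and fixtures):

# Sketch only: exercises both trigger paths via psycopg2's notice capture.
import psycopg2
import pytest

@pytest.fixture
def conn(test_dsn):
    connection = psycopg2.connect(test_dsn)
    yield connection
    connection.rollback()  # discard the test's UPDATE
    connection.close()

def test_warning_when_version_not_bumped(conn):
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE impact_functions SET intensity_unit = intensity_unit || '?' "
            "WHERE id = (SELECT id FROM impact_functions LIMIT 1)"
        )
    assert any("without bumping version" in notice for notice in conn.notices)

def test_silent_when_version_bumped(conn):
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE impact_functions "
            "SET intensity_unit = intensity_unit || '?', version = version + 1 "
            "WHERE id = (SELECT id FROM impact_functions LIMIT 1)"
        )
    assert not any("without bumping version" in notice for notice in conn.notices)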

Verify

  • Coefficient-change UPDATE without version bump emits a WARNING in postgres logs.
  • Coefficient-change UPDATE with version bump is silent.
  • Unit test exercises both paths.
  • Both the runbook (or OpenAPI docstring) and docs/CLIMATE_LAMA.md reference each other.

Definition of done (all items)

Item | Done when
1 — Beat process | worker-beat container running; nightly logs show prewarm + purge firings
2 — Curate prewarm.yaml | ~20 production entries; manual run shows dispatched=N → already_cached=N on retry
3 — Validate hit rate | Grafana panel populated; readout paragraph added; decision recorded
4 — Purge audit | dangling count is 0; audit recorded in this file
5 — Version-bump invariant | Trigger ships in a migration; runbook + CLIMATE_LAMA.md cross-reference

Item 1 must ship first — Items 2-4 require a running beat process to be meaningful. Item 5 is independent and can be picked up at any time.


Outcomes

(Append validation readouts and audit notes here as items close.)