Result cache follow-up — issue #283¶
Status: Open · Owner: backbone maintainers · Tracking issue: #291 · Source PR: #290
This is the working document for the open items left behind by the result-cache
landing (issue #283 / PR #290). The cache schema, service, API integration,
worker hooks, beat tasks, Prometheus counter, and pre-warm config schema are
all shipped on main. What remains is operationalising the feature:
running the beat process, populating the curated pre-warm list, validating the
hit rate in production, and documenting the version-bump invariant where
operators will see it. Each section below says what to do, when to do it, and
how to know it is done.
Item 1 — Run a Celery beat process in deployment¶
When: Before the cache can deliver any value. Without beat, the schedule
in `src/climate_lama/worker/celery_app.py` (`prewarm_result_cache` at
03:00 UTC, `purge_orphan_cache_entries` at 04:00 UTC) is decorative — the
worker container only consumes from the queue; it does not wake itself up on
a cron.
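For reference, that schedule has roughly the following shape. The authoritative definition lives in `celery_app.py`; the second entry name below is an assumption (the first matches the log line quoted in the smoke test later in this item):

```python
# Illustrative sketch only; see src/climate_lama/worker/celery_app.py for the real schedule.
from celery.schedules import crontab

celery_app.conf.beat_schedule = {
    "prewarm-result-cache-nightly": {
        "task": "climate_lama.worker.tasks.prewarm_result_cache",
        "schedule": crontab(hour=3, minute=0),  # 03:00 UTC
    },
    "purge-orphan-cache-entries-nightly": {  # entry name assumed
        "task": "climate_lama.worker.tasks.purge_orphan_cache_entries",
        "schedule": crontab(hour=4, minute=0),  # 04:00 UTC
    },
}
```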
Effort: Low — one new compose service plus a deployment runbook update.
Steps¶
- Add a `worker-beat` service to `docker-compose.yml` next to `worker`. It must use the same image but a different command:
```yaml
worker-beat:
  image: ghcr.io/cortomaltese3/climate-lama-worker:${CLIMATE_LAMA_TAG:-latest}
  # or `build: { context: ., dockerfile: docker/worker.Dockerfile }` for dev
  environment:
    <<: *worker-env  # reuse the worker's anchored env block
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  command: celery -A climate_lama.worker.celery_app beat --loglevel=info
  restart: unless-stopped
```
Exactly one beat process must run per environment. Two beats running against the same broker will dispatch every scheduled task twice.
- Pick the schedule storage. Default Celery beat uses a file-backed `PersistentScheduler` keyed by a local file (`celerybeat-schedule.db`). For a single-node deployment that is fine. For a redundant deployment, replace it with `redbeat` (Redis-backed) so the schedule is shared and only one beat fires at a time:
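RedBeat ships as a separate package; a minimal sketch of pulling it in, with no assumption about where the project pins its Python dependencies (add it to the worker's dependency file rather than installing ad hoc in production):

```bash
# PyPI package that provides redbeat.RedBeatScheduler
pip install celery-redbeat
```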
Then in `celery_app.py`:
```python
celery_app.conf.beat_scheduler = "redbeat.RedBeatScheduler"
celery_app.conf.redbeat_redis_url = settings.celery_broker_url
```
Decision driver: are we running ≥2 worker hosts? If yes → `redbeat`. If it is a single-VM self-host → leave the default.
- Add a Helm `CronJob` or `Deployment` for beat when the managed-hosting chart lands (see `docs/ops/release-pipeline-followup.md`, Item 5). One replica, no HPA, restart policy `Always`.
- Smoke-test locally by starting the new service and watching the logs:
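A minimal local smoke test, assuming the `worker-beat` service name from the compose snippet above:

```bash
docker compose up -d worker-beat
docker compose logs -f worker-beat
```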
You should see `beat: Starting...` in the logs. The nightly entries only fire at their scheduled times; to force a firing immediately for verification, temporarily change the crontab to `crontab(minute="*")` on a branch, watch for one `Scheduler: Sending due task prewarm-result-cache-nightly` tick, then revert.
Verify¶
- `docker compose ps` shows `worker-beat` as `running (healthy)`.
- Worker logs at 03:00/04:00 UTC contain `prewarm_result_cache: dispatched=N already_cached=M` and `purge_orphan_cache_entries: deleted=N` lines respectively.
- The `climate_lama_celery_task_runs_total{task_name="climate_lama.worker.tasks.prewarm_result_cache"}` Prometheus counter increments once per night.
Item 2 — Curate config/prewarm.yaml¶
When: After Item 1 ships and at least one production-shaped dataset is
loaded. Curating against demo data wastes time — the entries reference real
`hazard_dataset_id` / `exposure_dataset_id` values that must exist in the
target environment.
Effort: Medium — data-ops task, mostly catalog research, not code.
Goal: ~20 scenarios chosen for coverage of the highest-value
combinations: river flood + tropical cyclone × the most-requested exposure
sets × baseline + 2030 + 2050 horizons × discount-rate presets (Stern,
Nordhaus). The exact list is judgment-based; the schema is documented in `config/prewarm.yaml`.
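For orientation, a hedged sketch of what one entry might look like, using the fields named in the steps below. The authoritative schema is the header comment in `config/prewarm.yaml` itself; every value here is a placeholder, and whether entries sit under a top-level key is whatever that header dictates:

```yaml
# Placeholder entry; copy real ids/shas from the catalog endpoints listed under Steps.
- hazard_dataset_id: <id from GET /v1/hazards>
  hazard_dataset_sha256: <hazard_datasets.sha256 for that dataset>
  exposure_dataset_id: <dataset_id from GET /v1/exposures>
  exposure_dataset_sha256: <dataset_sha256 from the same response>
  impact_function_id: <id from GET /v1/impact-functions>
  impact_function_version: <pinned version for that curve>
  scenario_inputs:
    year: 2030
    discount_rate: 0.02
    growth_rate: 0.01
```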
Steps¶
- List the candidate datasets:
```bash
gh api repos/CortoMaltese3/climate-lama/issues --jq '.[] | select(.title | contains("dataset"))'
# plus the live catalog:
curl https://<env>/v1/hazards | jq '.data[] | {id, name, haz_type, region}'
curl https://<env>/v1/exposures | jq '.data[] | {dataset_id, name}'
```
- For each candidate, derive the cache-key inputs:
  - `hazard_dataset_id`, `hazard_dataset_sha256` — from `GET /v1/hazards/{id}`. The sha column is `hazard_datasets.sha256` (not `dataset_sha256`); the value in YAML is whatever that column holds.
  - `exposure_dataset_id`, `exposure_dataset_sha256` — from `GET /v1/exposures?dataset_id=...` (read `dataset_sha256` off any row).
  - `impact_function_id`, `impact_function_version` — from `GET /v1/impact-functions`. Pin the version explicitly so a future coefficient change implicitly invalidates these entries.
  - `scenario_inputs` — match the saved scenarios users actually run; if none exist yet, seed with `{"year": 2030, "discount_rate": 0.02, "growth_rate": 0.01}` plus a 2050 variant.
- Replace the commented example block in `config/prewarm.yaml` with the curated list (see the entry sketch above for the field shape). Keep the schema header — it's the only docs operators read for this file.
- Bake it into the worker image. The worker already reads `config/prewarm.yaml` relative to the project root; the `docker/worker.Dockerfile` COPY needs to include the file (verify during this work — if not, add it).
- Trigger one manual run to confirm the entries are well-formed:
```bash
docker compose exec worker celery -A climate_lama.worker.celery_app \
  call climate_lama.worker.tasks.prewarm_result_cache
```
The task returns `{"dispatched": N, "already_cached": M}`. Check that
`dispatched + already_cached` equals the number of YAML entries — any
shortfall means a malformed entry was logged and skipped (grep for
`prewarm: skipping malformed entry`).
- Verify the resulting jobs land in `jobs` and produce `result_cache` rows; the count must be ≥ `dispatched` once those pre-warm jobs complete (query sketch below).
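A sketch of that check, using table and column names that appear elsewhere in this document; the one-hour window assumes the manual run just happened, so adjust as needed:

```sql
-- Jobs dispatched by the manual pre-warm run
SELECT status, count(*)
FROM jobs
WHERE job_type = 'impact_calc'
  AND created_at > now() - interval '1 hour'
GROUP BY status;

-- Cached results; must be >= dispatched once those jobs complete
SELECT count(*) AS total_cache_rows FROM result_cache;
```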
Verify¶
- `config/prewarm.yaml` contains ~20 production-shaped entries (no placeholder `deadbeef` shas).
- A manual `prewarm_result_cache` run logs `dispatched=N already_cached=0` on the first run and `dispatched=0 already_cached=N` on the second run.
- A cold-start request that matches a pre-warm entry returns `200` with `cache_hit: true` end-to-end.
Item 3 — Validate cache hit rate in production¶
When: ~2 weeks after Items 1+2 are running in production with real traffic. Earlier, the denominator is too small to interpret.
Effort: Low — a Grafana panel query and a one-paragraph readout.
Goal: Confirm the cache is actually delivering value. If hit rate is
materially below expectations, decide whether to (a) curate more
pre-warm entries, (b) widen the cache-key tolerance (e.g. round
`growth_rate` to 4 dp; see the sketch below), or (c) declare the feature low-leverage and
de-prioritise the follow-ups.
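For option (b), a sketch of what widening the tolerance could look like. `normalise_scenario_inputs` is a hypothetical helper; where (or whether) such a hook exists in the shipped cache-key builder is an assumption to verify against the cache service:

```python
def normalise_scenario_inputs(inputs: dict) -> dict:
    """Round rate-like fields so near-identical requests map to the same cache key."""
    normalised = dict(inputs)
    for field in ("discount_rate", "growth_rate"):
        if isinstance(normalised.get(field), (int, float)):
            normalised[field] = round(float(normalised[field]), 4)  # 4 dp, per option (b)
    return normalised
```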
Steps¶
- Open the climate-lama dashboard in Grafana. The result-cache panel should already exist via the label-pattern query `climate_lama_result_cache_hits_total{result=~"hit|miss"}` — if it doesn't, add it.
- Run the hit-rate query for a 7-day window:
```promql
sum(rate(climate_lama_result_cache_hits_total{result="hit"}[7d]))
/
sum(rate(climate_lama_result_cache_hits_total[7d]))
```
- Interpret:
  - ≥30% — feature is delivering. Move on to Item 4.
  - 10–30% — likely an under-curated pre-warm list. Add 5–10 more entries focused on the highest-traffic miss patterns (next step).
  - <10% — investigate. Could be: cache key too granular (every request truly different), `impact_functions.version` getting bumped too often (legitimate or accidental), or the pre-warm task silently failing.
- Identify miss patterns by joining `jobs.params` with the cache:
```sql
SELECT params->>'hazard_dataset_id' AS haz, count(*)
FROM jobs
WHERE job_type = 'impact_calc'
  AND params ? 'cache_key'
  AND created_at > now() - interval '7 days'
  AND params->>'cache_key' NOT IN (SELECT cache_key FROM result_cache)
GROUP BY 1 ORDER BY 2 DESC LIMIT 20;
```
The top rows are candidates for pre-warm curation.
- Write a one-paragraph readout in this file under "Outcomes" with the hit rate, the period, and the decision (more curation, widen key, or accept).
Verify¶
- Hit-rate query returns a number ≥0 (the metric exists and is being scraped).
- Readout paragraph is added to this file under "Outcomes" with the date.
- If hit rate is <30%, a follow-up curation task is filed against `config/prewarm.yaml`.
Item 4 — Audit purge_orphan_cache_entries logs¶
When: Quarterly, or whenever the `result_cache` row count drifts from
expectations. Cheap to run, valuable as a smoke test that the cleanup beat
task is actually firing.
Effort: Low — one log query and one SQL count.
Rationale: Cache rows have `ON DELETE CASCADE` on the FK, so the
`purge_orphan_cache_entries` task should rarely delete anything (only when
an `impact_results` row was hard-deleted via raw SQL bypassing the FK
action). A consistently non-zero `deleted=N` is a signal that someone is
hard-deleting results out-of-band; a permanently zero count plus growing
table size is a signal that the beat task is not running at all.
Steps¶
- Pull the last 30 days of purge logs:
```bash
# Loki / structlog query — adjust to local stack
docker compose logs worker-beat --since 720h | \
  grep -E "purge_orphan_cache_entries: deleted="
```
- Cross-check the row count and orphan count directly:
```sql
SELECT count(*) AS total_cache_rows FROM result_cache;

SELECT count(*) AS dangling FROM result_cache rc
LEFT JOIN impact_results ir ON ir.id = rc.result_id
WHERE ir.id IS NULL;
```
`dangling` should be zero. Anything else means
`purge_orphan_cache_entries` is failing, beat is not running, or rows
are being created faster than the daily purge can keep up (very
unlikely — orphans only appear via hard-deletes that bypass the FK).
- If `dangling > 0`: trigger a manual purge to confirm the task itself works:
```bash
docker compose exec worker celery -A climate_lama.worker.celery_app \
  call climate_lama.worker.tasks.purge_orphan_cache_entries
```
Re-run the SQL query — `dangling` must now be 0. If it is not, the task is buggy
(file a bug); if it is, the orphans accumulated because beat was not running (return to Item 1).
- Record the audit date and the deleted-count distribution in this file under "Outcomes".
Verify¶
- `dangling` count is 0 after the audit.
- `purge_orphan_cache_entries` log lines exist for at least 75% of days in the audited window (some days may legitimately be missing if worker-beat was restarted).
- Audit entry recorded in this file.
Item 5 — Document the impact_functions.version bump invariant¶
When: Immediately after Item 1 ships. The invariant is already in
docs/CLIMATE_LAMA.md under "Result Cache" but
operators won't read that doc when they sit down to edit a curve. Put the
warning where they're actively working.
Effort: Trivial — two doc edits.
Rationale: The cache key includes `impact_functions.version`, so
any change to MDD/PAA coefficients or `intensity_unit` that affects
result values must be accompanied by a version bump. A bug fix that
silently changes the curve without bumping the version will serve stale
cached results forever (until manually purged).
Steps¶
- Add a banner to the impact-function admin runbook. If the runbook doesn't exist yet (verify under `docs/admin/`), the banner goes in:
  - the OpenAPI description for `PUT /v1/impact-functions/{id}`,
  - any CLI / migration helper that mutates the table.
- Add a SQL trigger as a backstop. The trigger raises a WARNING (not an error — blocking would be too disruptive) when MDD/PAA columns change without `version` also changing in the same UPDATE:
```sql
CREATE OR REPLACE FUNCTION warn_impact_function_version_bump()
RETURNS trigger AS $$
BEGIN
    IF (NEW.mdd_x IS DISTINCT FROM OLD.mdd_x
        OR NEW.mdd_y IS DISTINCT FROM OLD.mdd_y
        OR NEW.paa_x IS DISTINCT FROM OLD.paa_x
        OR NEW.paa_y IS DISTINCT FROM OLD.paa_y
        OR NEW.intensity_unit IS DISTINCT FROM OLD.intensity_unit)
       AND NEW.version = OLD.version
    THEN
        RAISE WARNING 'impact_functions(%) coefficients changed without bumping version — result_cache will serve stale data',
            NEW.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER warn_impact_function_version_bump
BEFORE UPDATE ON impact_functions
FOR EACH ROW EXECUTE FUNCTION warn_impact_function_version_bump();
```
Ship as a new Alembic migration. WARNING is intentional — silently blocking the UPDATE breaks legitimate flows (e.g. attribution-only metadata changes); the warning shows up in logs where the audit log already lives.
- Add a unit test asserting the trigger fires on a coefficient change without a version bump (use `caplog` or an equivalent mechanism to capture the warning; a sketch follows this list).
- Update `docs/CLIMATE_LAMA.md` with a one-line cross-reference to this trigger so the contract and the enforcement mechanism live next to each other.
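A hedged sketch of the warning-path test, using psycopg2's `connection.notices` as the "equivalent" capture mechanism. `db_raw_conn` and `impact_function_id` are assumed fixtures (a raw psycopg2 connection and the id of a seeded `impact_functions` row); the column edited is `intensity_unit` because its type is text regardless of how the curve arrays are stored:

```python
def test_coefficient_change_without_version_bump_warns(db_raw_conn, impact_function_id):
    """Changing a cache-relevant column without bumping version must raise the trigger's WARNING."""
    del db_raw_conn.notices[:]  # psycopg2 accumulates server NOTICE/WARNING messages here
    with db_raw_conn.cursor() as cur:
        cur.execute(
            "UPDATE impact_functions "
            "SET intensity_unit = intensity_unit || ' (edited)' "
            "WHERE id = %s",
            (impact_function_id,),
        )
    db_raw_conn.rollback()  # leave the test database untouched; the captured notices survive rollback
    assert any("without bumping version" in notice for notice in db_raw_conn.notices)
```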
Verify¶
- Coefficient-change UPDATE without version bump emits a WARNING in postgres logs.
- Coefficient-change UPDATE with version bump is silent.
- Unit test exercises both paths.
- Both the runbook (or OpenAPI docstring) and `docs/CLIMATE_LAMA.md` reference each other.
Definition of done (all items)¶
| Item | Done when |
|---|---|
| 1 — Beat process | `worker-beat` container running; nightly logs show prewarm + purge firings |
| 2 — Curate `prewarm.yaml` | ~20 production entries; manual run shows `dispatched=N` → `already_cached=N` on retry |
| 3 — Validate hit rate | Grafana panel populated; readout paragraph added; decision recorded |
| 4 — Purge audit | `dangling` count is 0; audit recorded in this file |
| 5 — Version-bump invariant | Trigger ships in a migration; runbook + `CLIMATE_LAMA.md` cross-reference |
Item 1 must ship first — Items 2-4 require a running beat process to be meaningful. Item 5 is independent and can be picked up at any time.
Outcomes¶
(Append validation readouts and audit notes here as items close.)