adopt pre-modelrevert clearpilot tree (d639e28) as the new head

Discard the modelrevert tree adoption (8b4b7e0) and the in-process park
short-circuits / cached-output / dashcam-idle work that came with it
(0dc8002, 37e095e). Restore the clearpilot tree as it stood at d639e28 —
the parked-controlsd manager-process split, the GPS-disable in locationd,
the controlsd UI hooks, the boardd ignition-edge safety_setter_thread
fix. After a full /data/params/d wipe and re-calibration drive, the
modelrevert-tree variant overcorrected on turns; reverting to the
parked-controlsd architecture (which Brian had previously vetted and
documented in 887b9c9 + 27cad05) and starting fresh.

Single new commit, no merge — file state matches d639e28 byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:17:25 -05:00
parent 7a584a7e39
commit ab9158bfb7
22 changed files with 955 additions and 236 deletions
@@ -0,0 +1,355 @@
# Session: 2026-04-26 — Baseline Revert + Parked-Controlsd Mode
## Context
This session was driven by a regression: the steering wheel "feels like
it pulls right" during normal driving, with no clear smoking gun. The
suspicion was that one of the variable-FPS / standstill-throttling
changes (added to reduce parked-state fan noise and CPU load) bled into
on-road driving behavior in a hard-to-isolate way.
Strategy: revert all driving-relevant logic to a known-good baseline
captured in `/projects/openpilot/archive/clearpilot` (HEAD `980f0aa`,
July 2024), keep all of the ClearPilot UI/dashcam/telemetry/bench-mode
infrastructure intact on top, then attack the parked-fan-noise problem
fresh from a different angle that doesn't touch driving logic at all.
Three commits landed in this order on branch `clearpilot`:
| SHA | Title |
|---|---|
| `47321e3` | restore driving logic to pre-variable-fps baseline |
| `f7e602c` | controlsd: re-wire UI hooks on top of restored baseline |
| `887b9c9` | parked-controlsd mode: shut down heavy stack while ignition+park |
Pre-revert tip was `62a403d`. The on-device agent should treat
**`62a403d` as "the broken version"** when looking at history.
---
## Commit 1: `47321e3` — Baseline restore
Reverted the following files wholesale to their `980f0aa` archive copy:
- `selfdrive/controls/controlsd.py`
- `selfdrive/controls/lib/events.py`
- `selfdrive/controls/lib/longitudinal_planner.py`
- `selfdrive/modeld/modeld.py`
- `selfdrive/modeld/dmonitoringmodeld.py`
- `selfdrive/locationd/calibrationd.py`
- `selfdrive/locationd/paramsd.py`
- `selfdrive/locationd/torqued.py`
- `selfdrive/car/interfaces.py`
- `selfdrive/car/hyundai/carstate.py` (CAN-FD telemetry preserved as a
commented block at the bottom of `update_canfd` — re-enable by
uncommenting; the `tlog` import is also commented out).
- `selfdrive/monitoring/dmonitoringd.py`
- `selfdrive/frogpilot/controls/frogpilot_planner.py`
- `common/realtime.py`
### Intentionally NOT restored (kept as the post-`62a403d` version)
- `selfdrive/thermald/*` — fan/power tuning kept as-is.
- `selfdrive/car/hyundai/carcontroller.py` and `hyundaicanfd.py`:
reviewed; the only delta vs baseline hoists the
`no_lat_lane_change` Params read out of the 100Hz hot path
(~5% carcontroller CPU) by passing the bit as an argument. This is
a perf-only change with no behavioral effect outside lane changes.
- `cereal/services.py`, `cereal/custom.capnp` — additive only.
`custom.capnp` adds `latRequested @3` and `noLatLaneChange @4` fields
to `FrogPilotCarControl`; capnp tag numbers are append-only so this is
safe to leave in even though baseline code doesn't write to them.
- `selfdrive/manager/*`, `common/params.cc` — heavy ClearPilot
infrastructure (bench mode, log dir, dashcamd, gpsd, ClearPilot params).
- All `selfdrive/ui/`, `selfdrive/clearpilot/`, `system/clearpilot/`.
### Things removed by the restore (no longer in the tree)
- Standstill frame skipping in modeld (was: skip GPU inference 19/20
frames at standstill, report 0 dropped frames to fool controlsd).
- Standstill frame skipping in dmonitoringmodeld.
- Model standby logic + `ModelStandby`/`ModelStandbyTs` reads in
controlsd's comm-issue suppression path.
- Parked-cycle skip in `state_control()` (10Hz vs 100Hz when in Park).
- Calibrationd validity decoupling from `sm.all_checks()`.
- Post-engage 2s commIssue/location/params suppression window.
- Per-cycle carstate write-gating for `CarSpeedLimit`/`CarIsMetric` in
carstate.py.
- The diff-based carstate telemetry calls (preserved commented out).
These were the candidates for the steering-pull regression. The on-device
agent should **not re-introduce any of these** without a deliberate plan
and a test session. The user's intent is to get baseline driving feel
confirmed first, then re-introduce optimizations one at a time and drive
each.
---
## Commit 2: `f7e602c` — UI hooks on top of baseline
Baseline `controlsd.py` doesn't have the UI plumbing the ClearPilot UI
expects. This commit re-adds only the things needed for the existing UI
to keep working — pure params-write plumbing, no actuator effect.
### Changes (all in `selfdrive/controls/controlsd.py`)
1. **Import `SpeedState`** from `openpilot.selfdrive.clearpilot.speed_logic`.
2. **`Controls.__init__`**:
- `params_memory.put_bool("ScreenDisplayMode", 0)` →
`params_memory.put_int("ScreenDisplayMode", 0)` (UI reads it as int).
- Added `self.speed_state = SpeedState()`, `self.speed_state_frame = 0`,
`self.was_driving_gear = False`.
3. **SubMaster** — added `gpsLocation` to the subscriber list, with
`ignore_alive` / `ignore_avg_freq` / `ignore_valid` so missing GPS
doesn't trigger commIssue.
4. **`clearpilot_state_control(...)`** rewritten from a simple 3-state
cycle into the documented 5-state ScreenDisplayMode machine:
- Auto-wake on park→drive edge if currently in screen-off (state 3).
- LFA button transitions in drive: `0→4, 1→2, 2→3, 3→4, 4→2`.
- LFA button transitions outside drive: any except 3 → 3, state 3 → 0.
- Speed/cruise-warning overlay tick at ~2Hz (`speed_state.update(...)`)
reading `gpsLocation`, `CarSpeedLimit` param, `self.is_metric`,
`CS.cruiseState`. This is what writes
`ClearpilotSpeedDisplay`/`ClearpilotSpeedLimitDisplay`/
`ClearpilotCruiseWarning` for the UI overlay.
5. **Lane-change suppression sync** — at the existing baseline
`clearpilot_disable_lat_on_lane_change` block (around line 687), in
addition to the existing `params_memory.put_bool("no_lat_lane_change", ...)`,
also set `self.frogpilot_variables.no_lat_lane_change = True/False`.
This is required because the kept (post-`62a403d`) carcontroller
reads off `frogpilot_variables`, not Params.
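The LFA button transitions from item 4 can be written as a small lookup. This is a minimal sketch, not the code in `controlsd.py`: the names `DRIVE_TRANSITIONS`, `SCREEN_OFF`, and `next_display_mode` are hypothetical, and the auto-wake on the park→drive edge is left out because the notes do not say which state it wakes to.

```python
# Sketch of the LFA-button transitions listed in item 4. The function and
# constant names are hypothetical; only the state numbers come from the notes.
SCREEN_OFF = 3

# Button press while in a driving gear: current state -> next state.
DRIVE_TRANSITIONS = {0: 4, 1: 2, 2: 3, 3: 4, 4: 2}


def next_display_mode(current: int, in_drive: bool) -> int:
    """Return the ScreenDisplayMode after one LFA button press."""
    if in_drive:
        return DRIVE_TRANSITIONS[current]
    # Outside drive: screen-off wakes to state 0, every other state turns the screen off.
    return 0 if current == SCREEN_OFF else SCREEN_OFF


if __name__ == "__main__":
    state = 0
    for _ in range(4):
        state = next_display_mode(state, in_drive=True)
        print(state)
```

Keeping the in-drive table as data makes it easy to diff against the documented `0→4, 1→2, 2→3, 3→4, 4→2` list.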
### What does NOT change
No edits to lateral or longitudinal control paths. No new actuator-side
behavior. The UI features (nightrider/screen-off/auto day-night, speed
overlay, cruise warning chime) get wired back up purely through param
writes.
---
## Commit 3: `887b9c9` — Parked-controlsd mode
Architectural fix for the original problem this whole session was
chasing: while ignition is on but the car is in Park, the entire onroad
stack (modeld, planner, control, locationd, calibrationd, paramsd,
torqued, dmonitoring*, soundd, loggerd) is running and burning CPU/fan
even though none of it is needed.
Solution: redefine "onroad" as **ignition AND not parked** instead of
just **ignition**. Reuse the existing `started`-based process gating in
manager. Add a tiny second controlsd variant that runs while parked,
just to keep CAN flowing so thermald can see when gear leaves Park.
### Files
#### NEW: `selfdrive/controls/controlsd_parked.py`
Minimal entry point. Roughly:
```python
def main():
    config_realtime_process(4, Priority.CTRL_HIGH)
    card = CarD()                      # blocks until first CAN, fingerprints car
    card.initialize()
    fv = _make_default_frogpilot_variables()   # safe False/0 SimpleNamespace
    while True:
        card.state_update(fv)          # publishes carState/carOutput/carParams
```
`CarD.state_update` blocks via `drain_sock_raw(wait_for_one=True)`, so
the loop is paced by CAN traffic — no extra sleep, no CPU spin.
The default `frogpilot_variables` sets these to safe values so
`CarInterfaceBase.update` doesn't `AttributeError`:
`conditional_experimental_mode`, `experimental_mode_via_distance`,
`traffic_mode`, `sport_plus`, `long_pitch`, `no_lat_lane_change` — all
False.
#### `selfdrive/thermald/thermald.py`
- `onroad_conditions` now also has `"not_parked"`. Initialized to
`False` (assume parked at boot) so the heavy stack waits for carState
to confirm gear has left Park before spinning up.
- New module-level loop variables: `is_parked = True`,
`parked_since: float | None = None`, `PARKED_HYSTERESIS_S = 1.5`,
`ignition_param_prev: bool | None = None`.
- New block right after the panda-disconnect check (around line 258 in
current state) reads `sm['carState'].gearShifter`:
- Gear == Park: latch `parked_since`, flip `is_parked = True` after
1.5s of continuous Park (hysteresis).
- Gear != Park: clear `parked_since`, `is_parked = False` immediately
(no hysteresis going out).
- Reverse is treated as **not parked** — driver is moving.
- `onroad_conditions["not_parked"] = not is_parked` every tick.
- New `IgnitionOn` Params write, edge-driven (only on change of
`onroad_conditions["ignition"]`) so we don't hammer the persistent
filesystem 2x/sec.
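The asymmetric hysteresis described above (1.5s to enter Park, immediate exit) can be sketched in isolation. `ParkLatch` is a hypothetical name, and passing the clock in as an argument is an assumption made so the logic can be exercised without a real car; thermald keeps the equivalent state in loop variables.

```python
from typing import Optional

PARKED_HYSTERESIS_S = 1.5  # value from the notes


class ParkLatch:
    """Debounce entry into Park; leave Park immediately. Class name is hypothetical."""

    def __init__(self) -> None:
        self.is_parked = True                      # assume parked at boot
        self.parked_since: Optional[float] = None

    def update(self, gear_is_park: bool, now: float) -> bool:
        if gear_is_park:
            if self.parked_since is None:
                self.parked_since = now            # latch the first Park sample
            if now - self.parked_since >= PARKED_HYSTERESIS_S:
                self.is_parked = True
        else:
            self.parked_since = None
            self.is_parked = False                 # no hysteresis going out
        return self.is_parked
```

With this shape, `onroad_conditions["not_parked"]` would simply be `not latch.update(...)` on each tick, and a brief pass through Park while shifting between Reverse and Drive does not tear down the onroad stack.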
#### `selfdrive/manager/process_config.py`
- New predicate:
```python
def parked_only(started, params, CP):
    return params.get_bool("IgnitionOn") and not started
```
- New process entry directly after the existing `controlsd` entry:
```python
PythonProcess("controlsd_parked", "selfdrive.controls.controlsd_parked", parked_only),
```
- `controlsd` is unchanged (still `only_onroad`). The two variants are
mutually exclusive: `only_onroad` is true only when `started` is true,
while `parked_only` requires `not started`.
#### `selfdrive/manager/manager.py`
- Single line in `manager_init()` seeding `IgnitionOn=False` so the
predicate evaluates correctly before thermald's first tick.
#### `common/params.cc`
- New entry in the alphabetical I-block:
```cpp
{"IgnitionOn", CLEAR_ON_MANAGER_START},
```
- Manager predicates can only see persistent Params (not pandaStates
or `/dev/shm/params`), which is why thermald has to expose ignition
this way.
### State machine summary
| Ignition | Gear | Predicates true | What runs |
|----------|-------------|-----------------------------|-------------------------------|
| off | any | none | always_run only (ui, thermald, pandad, deleter, …) |
| on | Park (>1.5s)| parked_only | always_run + controlsd_parked |
| on | not Park | only_onroad (=> all `started`) | full onroad stack |
The transition between rows 2 and 3 is purely manager process
swapping driven by predicate flips — no IPC handshake, no
self-termination. Thermald sees gear change, flips `not_parked`,
`should_start = all(onroad_conditions.values())` flips, manager kills
the wrong variant and spawns the right one on its next tick.
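The table can be checked mechanically with stand-ins for the two predicates. This is a sketch under assumptions: `FakeParams` is a hypothetical stub so the example runs without openpilot, `should_start` is reduced to just the ignition and `not_parked` conditions, and `running_variant` is an illustrative helper rather than anything in manager.

```python
# Stand-ins for the manager predicates; FakeParams is a hypothetical stub so
# the example runs without openpilot installed.
class FakeParams:
    def __init__(self, values):
        self._values = dict(values)

    def get_bool(self, key):
        return bool(self._values.get(key, False))


def only_onroad(started, params, CP):
    return started


def parked_only(started, params, CP):
    return params.get_bool("IgnitionOn") and not started


def running_variant(ignition, not_parked):
    """Map (ignition, gear state) to the controlsd variant manager would keep alive."""
    started = ignition and not_parked          # simplified should_start
    params = FakeParams({"IgnitionOn": ignition})
    if only_onroad(started, params, None):
        return "controlsd"
    if parked_only(started, params, None):
        return "controlsd_parked"
    return None


for ignition in (False, True):
    for not_parked in (False, True):
        print(ignition, not_parked, running_variant(ignition, not_parked))
```

Because one predicate requires `started` and the other requires `not started`, no input combination can select both variants at once.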
---
## What the on-device agent needs to know / debug
### Build prerequisites
A new param key was added (`IgnitionOn`), so the C++ params whitelist
needs a fresh build. Per `CLAUDE.md` "Adding New Params":
```bash
chown -R comma:comma /data/openpilot
rm -f /data/openpilot/prebuilt /data/openpilot/common/params.o /data/openpilot/common/libcommon.a
su - comma -c "bash /data/openpilot/build_only.sh"
```
`build_only.sh` already deletes `prebuilt` but **does not** delete
`params.o` / `libcommon.a` — verify those are gone before building or
the new key won't be picked up and `Params().put_bool("IgnitionOn", …)`
will throw `UnknownKeyName` in manager_init or thermald.
### How to verify the swap is working
```bash
# 1. Watch which controlsd variant is running
watch -n 1 'ps -ef | grep -E "controlsd(_parked)?" | grep -v grep'
# 2. Watch the gating signals
watch -n 1 'echo "IgnitionOn:"; cat /data/params/d/IgnitionOn 2>/dev/null; \
echo; echo "deviceState.started:" ; \
python3 -c "import cereal.messaging as m; \
s=m.sub_sock(\"deviceState\", timeout=1000); \
print(m.recv_one(s).deviceState.started)"'
# 3. Watch gear via carState
python3 -c "
import cereal.messaging as m
s = m.sub_sock('carState', timeout=2000)
while True:
    msg = m.recv_one(s)
    if msg: print(msg.carState.gearShifter)
"
```
Expected behavior:
- Ignition off: neither variant in `ps`. `IgnitionOn` is `0`/missing.
- Ignition on, in Park: `controlsd_parked` in `ps`, `controlsd` is not.
`started` is False. After ~1.5s of confirmed Park, modeld and friends
should have stopped.
- Shift to Drive: `controlsd_parked` disappears within ~500ms, full
`controlsd` appears, all the onroad processes spin up. There will be
a brief carState gap during the swap (~0.5–2s).
### What to suspect first if something breaks
1. **Manager crash on startup** with `UnknownKeyName: IgnitionOn`. Means
`params.o`/`libcommon.a` weren't rebuilt. Delete them and rebuild.
2. **`controlsd_parked` keeps respawning / dying.** Check
`/data/log2/current/controlsd_parked.log`. Most likely either:
- `CarD.__init__` is hanging on `get_one_can` because pandad isn't
up yet — should only matter on the very first boot.
- A `frogpilot_variables` attribute we missed defaulting; add it to
`_make_default_frogpilot_variables` in `controlsd_parked.py`.
3. **Full controlsd never spawns after shift to Drive.** Check
`IgnitionOn` (should be `1`), check carState.gearShifter (should not
be `park`/`unknown`), check thermald.log for `should_start` logic.
Also check that `carState` is actually being published by
`controlsd_parked`.
4. **Steering still pulls right.** The whole point of commit 1 is to
rule out the variable-FPS work as the cause. If the symptom persists
on baseline-restored driving logic, the suspect list shifts to:
- Anything still on the kept side (carcontroller's
`no_lat_lane_change` plumbing, thermald-related changes,
custom.capnp additions).
- Hardware: a calibration that drifted, an alignment issue, panda
CAN bus issue, or torque tuning that was modified outside of these
files.
- The custom driving model selected (`Params("Model")`). Confirm
which `.thneed`/`.onnx` is loaded — none of the model files
themselves were changed in this session.
5. **Panda safety alerts during park transition.** If panda logs a
"lost heartbeat" or drops to NOOUTPUT mode in the swap window, we
need controlsd_parked to issue a no-op carcontrol heartbeat to keep
panda happy. Not implemented in this session — flag for follow-up.
### Open follow-up items (NOT done in this session)
- **Cold-start latency on shift-from-Park.** Modeld load + calibration
warmup may produce a noticeable gap before lateral/long are ready.
Anticipatory wake on `CS.brakePressed && in_park` is the planned
mitigation if needed.
- **Methodically reintroduce optimizations.** Once baseline driving
feels right, the standstill optimizations (modeld 1fps, fan clamps,
etc.) can come back one at a time, each with a drive test.
- **The CAN-FD telemetry block in `selfdrive/car/hyundai/carstate.py`**
is preserved as a commented block at the bottom of `update_canfd()`.
Re-enabling requires uncommenting + restoring the `tlog` import at
the top of the file.
---
## Key file index for fast navigation
```
selfdrive/controls/controlsd.py # full controlsd, restored to baseline + UI hooks re-added
selfdrive/controls/controlsd_parked.py # NEW: parked-only CAN listener
selfdrive/controls/clearpilot_state_control # in controlsd.py, ~line 1255 — 5-state ScreenDisplayMode + speed_state tick
selfdrive/thermald/thermald.py # gear-aware not_parked + IgnitionOn writer (~line 258)
selfdrive/manager/process_config.py # parked_only predicate + new entry
selfdrive/manager/manager.py # IgnitionOn seed in manager_init
common/params.cc # IgnitionOn registered (CLEAR_ON_MANAGER_START)
selfdrive/car/card.py # CarD class — used by both controlsd variants, unchanged
selfdrive/clearpilot/speed_logic.py # SpeedState class — unchanged, called from controlsd.clearpilot_state_control
selfdrive/car/hyundai/carstate.py # restored baseline + commented telemetry block at end of update_canfd
```
## Reproducing the diff per commit
```bash
git show 47321e3 # baseline restore
git show f7e602c # UI hooks
git show 887b9c9 # parked mode
git diff 62a403d..887b9c9 # full session delta
```
@@ -0,0 +1,358 @@
# Session: 2026-04-26 — GPS disabled in locationd; calibrationd-still-stale notes
## Context
Follow-up session to `2026-04-26-0914-baseline-revert-and-parked-mode`.
After the baseline restore, the manager wouldn't start cleanly and the car
exhibited a "drifting right on straight roads, model rescues us mid-curve"
symptom. This session unblocked the startup chain end-to-end so the car can
boot and run, disabled GPS as an input to locationd (the actual fix that
made the drift go away), and pinned down — but did **not** solve — why
`liveCalibration.valid` is still stuck at `False` and what that latently
breaks downstream.
Pre-session tip: `27cad05`. Single combined commit covering four code
changes plus this README:
- `selfdrive/boardd/boardd.cc`: `safety_setter_thread` on ignition edge
- `selfdrive/controls/controlsd.py` — drop unregistered
`no_lat_lane_change` Params write, wire `FPCC.noLatLaneChange` for UI
- `cereal/services.py` — deviceState/managerState back to 2Hz to match
restored `DT_TRML`
- `selfdrive/locationd/locationd.cc` — ignore GPS as Kalman input
(reversible flag)
---
## Commit 1: boardd — safety_setter_thread on ignition edge
### Symptom
Manager started fine but `controlsd_parked` blocked indefinitely at:
```
set_obd_multiplexing (selfdrive/car/fw_versions.py:236)
fingerprint (selfdrive/car/car_helpers.py:149)
get_car (selfdrive/car/car_helpers.py:210)
__init__ (selfdrive/car/card.py:60)
main (selfdrive/controls/controlsd_parked.py:41)
```
`set_obd_multiplexing` calls `params.get_bool("ObdMultiplexingChanged", block=True)`,
which waits for **boardd's `safety_setter_thread`** to ack. That thread spawns
only on the **rising edge of `IsOnroad`** (`boardd.cc:476`). `IsOnroad` is set
by manager from the `started` flag (`helpers.py:49`). With the parked-mode split
from the prior session, `started` requires `not_parked`, which requires thermald
to see `carState.gearShifter != park`, which requires `controlsd_parked` to
publish carState — which it can't do until OBD multiplexing is acked.
Classic deadlock: ignition rising no longer implies IsOnroad rising.
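The circular wait is easier to see as a dependency chain. The sketch below is purely illustrative: the node labels paraphrase the steps above and `find_cycle` is a hypothetical helper, not part of the codebase.

```python
# Each key needs its value to happen first. Labels paraphrase the chain above.
REQUIRES = {
    "obd_multiplexing_ack": "safety_setter_thread",
    "safety_setter_thread": "IsOnroad_rising",
    "IsOnroad_rising": "started",
    "started": "not_parked",
    "not_parked": "carState_published",
    "carState_published": "obd_multiplexing_ack",
}


def find_cycle(requires, start):
    """Follow the single-dependency chain from start; return the loop if one exists."""
    seen = []
    node = start
    while node in requires:
        if node in seen:
            return seen[seen.index(node):]
        seen.append(node)
        node = requires[node]
    return []


# Re-pointing the thread at the ignition edge breaks the loop.
FIXED = dict(REQUIRES, safety_setter_thread="ignition_rising")

print(find_cycle(REQUIRES, "started"))
print(find_cycle(FIXED, "started"))
```

With the original edges every node is part of the loop; once `safety_setter_thread` depends on an input that has no prerequisites of its own, the chain terminates.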
### Fix
`selfdrive/boardd/boardd.cc:476` — change the trigger from IsOnroad rising
edge to **ignition rising edge**. Adds `bool ignition_last = false;` next to
`is_onroad_last`, swaps the gate variable, sets `ignition_last = ignition;`
after. Restores stock openpilot's intent: set safety as soon as the bus is
alive. Both controlsd variants need it; the thread's phase 2 (waiting on
`ControlsReady`) is harmless in parked mode — it just sits.
`is_onroad`/`is_onroad_last` are left in place. They no longer gate the
thread, but `params.getBool("IsOnroad")` is still read in case any
future logic wants it.
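The gate itself is an ordinary rising-edge detector. The real change is in C++ inside `boardd.cc`; the Python below only illustrates the pattern, and `RisingEdge` is a hypothetical name.

```python
class RisingEdge:
    """Report True only on a False -> True transition of the input."""

    def __init__(self) -> None:
        self.last = False

    def update(self, value: bool) -> bool:
        fired = value and not self.last
        self.last = value
        return fired


ignition_edge = RisingEdge()
samples = [False, True, True, False, True]
spawns = [ignition_edge.update(s) for s in samples]
print(spawns)  # [False, True, False, False, True]
```

The thread would be spawned once per ignition cycle, regardless of how long the car then sits in Park.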
---
## Commit 2: controlsd — drop unregistered Params write
### Symptom
After the boardd fix, controlsd_parked progressed but full controlsd
crashed at first `state_control` cycle:
```
File "selfdrive/controls/controlsd.py", line 693, in state_control
self.params_memory.put_bool("no_lat_lane_change", False)
common.params_pyx.UnknownKeyName: b'no_lat_lane_change'
```
The baseline-restored controlsd unconditionally writes `no_lat_lane_change`
to memory params on every state_control cycle. That key was never
registered in `common/params.cc`. Pre-revert (`62a403d`) controlsd didn't
write the param at all — it set `self.FPCC.noLatLaneChange` (capnp field).
The baseline brought back a code path the fork's params.cc never matched.
### Fix
In `selfdrive/controls/controlsd.py:687-694`, remove the two
`params_memory.put_bool("no_lat_lane_change", ...)` calls and replace with
`self.FPCC.noLatLaneChange = True/False` (matches what the kept UI code in
`onroad.cc:897` and `ui.cc:122` is reading from cereal). The
`frogpilot_variables.no_lat_lane_change` writes were already there — those
are what the kept Hyundai carcontroller actually reads at 100Hz.
No actuator change. Pure plumbing.
---
## Commit 3: services.py — restore deviceState/managerState rates
### Symptom
After fixing the controlsd crash, controlsd booted but immediately fired
**continuous `commIssue`** with:
```
"not_freq_ok": ["deviceState", "managerState"]
```
### Diagnosis
`common/realtime.py` was reverted to `DT_TRML = 0.5` (thermald → 2Hz) in
the baseline restore. But `cereal/services.py` still declared `deviceState`
and `managerState` at 5Hz from earlier 4Hz-fan-control work. The freq window
is `[0.8 × min, 1.2 × max]` — declared 5Hz means [4.0, 6.0]Hz. Thermald
publishing at 2Hz fell well below 4.0 → freq_ok=False every cycle →
commIssue every cycle.
Already documented as a known footgun in `CLAUDE.md` ("Changing a Service's
Publish Rate").
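The window arithmetic can be reproduced in a few lines. This is a simplified sketch that treats the declared rate as both the minimum and maximum of the window; `freq_window` and `freq_ok` are hypothetical helpers, not SubMaster's actual implementation.

```python
def freq_window(declared_hz, low=0.8, high=1.2):
    """Accepted band around a single declared rate, per the window described above."""
    return declared_hz * low, declared_hz * high


def freq_ok(declared_hz, observed_hz):
    lo, hi = freq_window(declared_hz)
    return lo <= observed_hz <= hi


# thermald at 2Hz against the stale 5Hz declaration, then against a 2Hz declaration
print(freq_ok(5.0, 2.0))
print(freq_ok(2.0, 2.0))
```

A 2Hz publisher sits well below the lower bound for a 5Hz declaration, and inside the band once the declaration is set back to 2Hz.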
### Fix
`cereal/services.py:33,73` — set both back to 2Hz with a comment pointing
at `DT_TRML`. After this, `not_freq_ok=[]` and the continuous commIssue
stopped — only one transient commIssue at startup remained (warmup).
---
## Commit 4: locationd — ignore GPS as Kalman input (reversible)
### Why
The car was drifting right on straight roads. Pre-revert, the user
reported this had been working for years; the only practically-new thing
is that `system/clearpilot/gpsd.py` (AT-command-based GPS) had recently
**started actually getting fixes**, where for a long time it wasn't.
Two concrete data problems with the GPS feed for selfdrive purposes:
- `gpsd.py:221` hard-codes `gps.vNED = [0.0, 0.0, 0.0]` while the user is
moving 28 m/s. locationd's `handle_gps` derives `OBSERVATION_ECEF_VEL`
from this; the Kalman gets "GPS says you're stopped" while accelerometer
says otherwise.
- `gpsd.py:216,222` populate horizontalAccuracy/verticalAccuracy from
`hdop * 5` (a rough conversion); `bearingAccuracyDeg = 10.0` and
`speedAccuracy = 1.0` are constants. None of these match what a real
GNSS chip reports.
These flow into the latcontrol_torque pipeline indirectly through
`liveLocationKalman.angularVelocityCalibrated` (used as
`actual_curvature_llk = ... / CS.vEgo` blended with steering-angle-derived
curvature) and through `liveLocationKalman.calibratedOrientationNED`
(pitch). When the Kalman has wrong velocity observations, those derived
fields go wrong — and on a straight crowned road, the controller's
"actual_curvature" picture is off-center, biasing torque output.
### What the user does NOT want disabled
gpsd publishes are still consumed by:
- UI speed indicator and the ClearPilot status overlay
- `dashcamd` for `.srt` subtitle sidecars (lat/lon per segment)
- `timed.py` for system-clock setting via `unixTimestampMillis`
- `gpsd.py` itself for sunset/sunrise → `IsDaylight` (auto night mode)
- `telemetryd.py` for the `gps` group in the CSV
So we cannot just stop publishing — only locationd should ignore.
### Fix
`selfdrive/locationd/locationd.cc:310-323` — add a `clearpilot_disable_gps`
const at the top of `Localizer::handle_gps` and OR it into the existing
reject condition. With it true, every gpsLocation message falls through to
`determine_gps_mode(current_time)` (openpilot's stock no-GPS path:
`input_fake_gps_observations` once position uncertainty grows past
`SANE_GPS_UNCERTAINTY`, otherwise nothing). `last_gps_msg` never updates,
`is_gps_ok()` returns False, `liveLocationKalman.gpsOK = false`.
Effect verified by user: drift improved noticeably. The torque controller
is no longer fed contradictory GPS-vs-IMU velocity observations through
the Kalman.
To re-enable GPS as a Kalman input: flip the `clearpilot_disable_gps`
constant to `false` and rebuild. Self-contained edit.
---
## Calibrationd: still stale, root-cause partially understood
### What is happening
`calibrationd` publishes `liveCalibration.valid = sm.all_checks()` on every
cycle, where `sm` polls on `cameraOdometry` and reads `carState` and
`carParams` without blocking. We measured: **120 publishes in 30 seconds, every single one
`valid=False`** with `calStatus=calibrated, calPerc=100, validBlocks=50`.
The body of the calibration is converged — the validity flag is stuck off.
### Why this matters
`liveCalibration.valid=False` cascades:
1. `locationd.cc:715`: `filterInitialized = sm.allAliveAndValid()`.
liveCalibration is in the sub list and not in `ignore_alive` /
`ignore_valid`. So filterInitialized stays False forever.
2. With Kalman uninitialized, `liveLocationKalman` still publishes but the
body fields are empty/default. `liveLocationKalman.status = uninitialized`.
3. `paramsd.py` subscribes to `liveLocationKalman` with `poll='liveLocationKalman'`
and gates its update logic on `sm.all_checks()`. When liveLocationKalman
itself isn't in a sane state, paramsd's `roll`, `angleOffsetDeg`,
`steerRatio`, `stiffnessFactor` either never converge or converge to bad
values.
4. `latcontrol_torque.py:135-136` uses `params.angleOffsetDeg` to compute
`actual_curvature_vm` and `params.roll` for `roll_compensation`. With
`roll=0`, no compensation for a crowned/banked road. With wrong
`angleOffsetDeg`, the closed-loop "actual curvature" measurement is
biased.
So the latent risk is: even with our GPS fix, the controller is running
**without learned roll compensation and without a learned steering-angle
offset**. Symptom-free on straight, level pavement; biased on banked roads.
### Why `valid=False`
Inside calibrationd's SubMaster, `sm.all_checks() = all_alive AND all_freq_ok
AND all_valid`. We measured each:
- `cameraOdometry`: alive=True, valid=True, freq_ok=True ✓
- `carState`: alive flickers True/False, valid=True, **freq_ok=False (every cycle)**
- `carParams`: alive=True after first arrival, valid=True, freq_ok=False
but excluded from `all_freq_ok` because `_check_avg_freq` skips services
with `frequency < 0.99` Hz (carParams is 0.02 Hz declared) — so it doesn't
fail the gate.
The smoking gun is **`carState.freq_ok = False` from inside calibrationd's
`poll='cameraOdometry'` SubMaster**.
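Plugging the measured flags into the gate shows why a single service is enough to hold `valid` at `False`. This is a sketch of the rule as described in these notes, not SubMaster's code: `ServiceHealth` and `all_checks` are hypothetical, and the 20 Hz figure for `cameraOdometry` is an assumption inferred from the 50ms cycle time.

```python
from dataclasses import dataclass


@dataclass
class ServiceHealth:
    alive: bool
    valid: bool
    freq_ok: bool
    declared_hz: float


def all_checks(services):
    """alive AND freq_ok AND valid, skipping the freq test for slow (< 0.99 Hz) services."""
    for s in services.values():
        if not (s.alive and s.valid):
            return False
        if s.declared_hz >= 0.99 and not s.freq_ok:
            return False
    return True


measured = {
    "cameraOdometry": ServiceHealth(True, True, True, 20.0),   # 20 Hz is assumed
    "carState": ServiceHealth(True, True, False, 100.0),
    "carParams": ServiceHealth(True, True, False, 0.02),
}

print(all_checks(measured))
```

Flipping only `carState.freq_ok` to `True` makes the gate pass, while `carParams` never affects it because of the slow-service exemption.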
Direct measurement (`/tmp/test_subs.py`):
| poll arg | carState observed rate | freq_ok |
|---|---:|:-:|
| `None` (all polled) | 97.40 Hz | True |
| `'cameraOdometry'` | **2.24 Hz** | **False** |
| `'carState'` | 98.58 Hz | True |
carState is published at 100 Hz to a 10MB shared-memory MSGQ queue. From
inside `poll='cameraOdometry'`, our non-blocking `recv_one_or_none(carState)`
returns None ~88% of the time. `/tmp/diag_recv.py`:
```
cameraOdometry msgs received: 121
carState recv calls: 121, hits: 14 (11.6% hit rate)
avg cycle duration: 50.0ms
```
So we're calling `recv` 20× per second on carState's queue, and finding
the queue empty 9 out of 10 times — even though carState is being published
at 100 Hz to that same queue with a `conflate=True` socket option.
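The figures from `/tmp/diag_recv.py` are consistent with the rate table. A quick check of the arithmetic, using only the three numbers printed above:

```python
calls = 121            # carState recv attempts, one per cameraOdometry message
hits = 14              # attempts that returned a message
cycle_s = 0.050        # average cycle duration

hit_rate = hits / calls
window_s = calls * cycle_s
effective_hz = hits / window_s
poll_hz = 1 / cycle_s

print(f"{hit_rate:.1%} hit rate, {poll_hz:.0f} polls/s, {effective_hz:.2f} Hz effective")
```

That works out to roughly 2.3 Hz of carState actually seen by the subscriber, in the same range as the 2.24 Hz measured with `poll='cameraOdometry'` and far below the 100 Hz being published.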
### The MSGQ NUM_READERS = 12 hypothesis
`cereal/messaging/msgq.h:9` defines `#define NUM_READERS 12`. When a 13th
subscriber tries to subscribe to a queue, `msgq.cc:182-197` **invalidates
ALL readers simultaneously** (`*q->read_valids[i] = false` for all i)
to "reset and re-register". On the next read, an invalidated reader's
`msgq_msg_recv` jumps to `msgq_reset_reader` (line 347) and ends up with
`read_pointer == write_pointer` (caught up to current), returning size 0.
Subscribers to `carState` in our running system include: controlsd,
plannerd, locationd, calibrationd, paramsd, dmonitoringd, frogpilot_process,
telemetryd, statsd. That's already nine. The introspection scripts
(`/tmp/check_*.py`, `/tmp/measure_freq.py`, `/tmp/diag_recv.py`,
`/tmp/cal_view.py`) and any UI-side subscribers add more, and can easily
push past 12 and trigger global eviction. Once evicted, a long-running
subscriber stays in the "find empty queue / reset / try again" loop, which
is what we measured.
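A toy model makes the hypothesized behavior concrete. It encodes only what is described above (a 13th subscriber invalidates every existing reader) and is not a port of `msgq.cc`; `ToyQueue` and its fields are hypothetical.

```python
NUM_READERS = 12  # from msgq.h, per the notes


class ToyQueue:
    """Toy model of the reader-slot behavior described above, not msgq's real code."""

    def __init__(self) -> None:
        self.valid = {}        # reader name -> still holds a valid slot?
        self.evictions = 0

    def subscribe(self, name: str) -> None:
        active = [r for r, ok in self.valid.items() if ok]
        if len(active) >= NUM_READERS:
            # Slot table is full: invalidate everyone so readers must re-register.
            for reader in self.valid:
                self.valid[reader] = False
            self.evictions += 1
        self.valid[name] = True


q = ToyQueue()
for i in range(13):
    q.subscribe(f"reader{i}")
print(q.evictions, q.valid["reader0"], q.valid["reader12"])
```

In this model, nine long-running daemons plus four debugging scripts are enough to cross the limit, which is why running the introspection scripts may itself disturb the measurement.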
This is **partially confirmed** but not proven definitively. The
investigation was paused before instrumenting the queue header to count
slot churn. The smoking-gun would be: print `q->num_readers` from inside
calibrationd at boot vs. during steady state and watch it tick up to 12+.
### Things to consider for the actual fix
1. **`7ee923b` already solved this exact problem.** It changed
calibrationd's publish to:
```python
# was: calibrator.send_data(pm, sm.all_checks())
# to: calibrator.send_data(pm, calibrator.cal_status == log.LiveCalibrationData.Status.calibrated)
```
The commit message documented the exact failure mode (cascade through
locationd uninitialized → paramsd steerRatio≈0 / stiffnessFactor≈0 →
nonsense curvature commands). It was reverted by `47321e3` as part of
"restore driving logic to pre-variable-fps baseline" — but that revert
was about isolating the drift cause, and the calibrationd change here
is *not* in the variable-FPS family. **Re-applying `7ee923b` is
probably the right next move**, narrowly scoped to calibrationd.py.
2. Less attractive alternatives: bump `NUM_READERS`; switch
calibrationd's SubMaster to `poll=None` so `carState` is polled too
(more cycles per update, higher CPU); add `ignore_avg_freq=['carState']`
to calibrationd's SubMaster (treats freq glitches as benign, but keeps
the cascade for alive/valid).
---
## Steer-fault alert investigation (separate symptom)
User saw "Steering Temporarily Unavailable" alerts during a test drive
even though we hadn't touched lateral-control code. Captured in
`realdata/00000081--528e2aa03a--0/rlog`:
- Faults occur in **brief 50–100 ms pulses** clustered while moving slowly
(`vEgo` 2.7–5.2 m/s ≈ 6–12 mph).
- Each pulse correlates with **large driver wheel torque** (-250, +273,
-187, +119 Nm) — i.e. the user actively turning the wheel during a
parking-lot maneuver.
- `cruise.enabled = False` throughout — openpilot was not engaged.
- The car's MDPS sets `LKA_FAULT` at low speeds when torque is high; that
bit maps directly to `cs_out.steerFaultTemporary` (`carstate.py:259`),
which fires `steerTempUnavailableSilent` regardless of engagement
(`ET.WARNING` displays unconditionally).
User reports they've "never had this issue before" — implying earlier
ClearPilot revisions either gated the fault on speed or used a different
no-lateral path. **Not confirmed which.** Open follow-up.
---
## Open follow-ups (ordered by likely return)
1. **Re-apply `7ee923b`** — gate `liveCalibration.valid` on `calStatus`,
not `sm.all_checks()`. Unblocks locationd init → paramsd convergence →
real `params.roll` and `params.angleOffsetDeg` for the torque
controller. Latent benefit beyond what GPS-disable alone gave us.
2. **Investigate the persistent low-speed `steerTempUnavailable` alert.**
Either (a) gate `steerFaultTemporary` on `vEgo > ~8 m/s` in
`carstate.py:259`, or (b) find what the previous fork did — possibly
stopped sending tester CAN messages on park, possibly suppressed the
alert specifically during a park transition window.
3. **Suppress LKAS fault display when shifting drive → park.** The user
reports the car shows an LKAS-fault icon when openpilot keeps publishing
tester-present CAN messages after entering park. Investigation needed
in `selfdrive/car/hyundai/carcontroller.py` and `hyundaicanfd.py` to
gate tester messages on gear ≠ park.
4. **Wheel-torque headroom edit.** User mentioned a community-known edit
that allows slightly higher steering torque on the Hyundai panda safety
model. Research target: panda safety code for HYUNDAI_CANFD safety
model and the `MAX_TORQUE` / per-cycle delta limits.
5. **Single startup `commIssue` event.** Even with all our fixes, controlsd
logs one transient commIssue right after `controlsd.initialized`
(timeout=true after 6s). The `invalid` set at that moment is
downstream services still warming up (liveCalibration, liveLocationKalman,
liveParameters, liveTorqueParameters, frogpilotPlan, longitudinalPlan,
driverMonitoringState). Most should clear once the calibrationd issue
is fixed; remaining ones are normal warmup.
6. **gpsd.py vNED / accuracy fields.** Out of scope for this session
(we disabled GPS in locationd instead), but if GPS is ever re-enabled,
`gpsd.py:216,221-224` need real values: vNED from
`speed × {cos(bearing), sin(bearing), 0}`, and accuracy fields from
actual modem reports rather than hard-coded constants.
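The vNED conversion in item 6 is a one-liner. A minimal sketch, assuming bearing is reported in degrees clockwise from north and that vertical velocity is unavailable; `vned_from_speed_bearing` is a hypothetical helper, not a function in `gpsd.py`.

```python
import math


def vned_from_speed_bearing(speed_mps: float, bearing_deg: float) -> list:
    """North/East/Down velocity from ground speed and bearing (0 deg = north, clockwise)."""
    bearing = math.radians(bearing_deg)
    return [speed_mps * math.cos(bearing), speed_mps * math.sin(bearing), 0.0]


print(vned_from_speed_bearing(28.0, 0.0))   # heading due north
print(vned_from_speed_bearing(28.0, 90.0))  # heading due east
```

At 28 m/s heading north this yields a north component of 28 m/s rather than the all-zero vector currently published.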