adopt modelrevert tree as the new clearpilot baseline
Replaces clearpilot's working state wholesale with the modelrevert branch's
tree (modelrevert tip cea422b). Discards the parked-controlsd manager-process
split and the two session READMEs that documented it; keeps the simpler
in-process park short-circuits (controlsd state_control, plannerd, frogpilot_process)
and the cached-output decimation (modeld, dmonitoringmodeld) that achieve
the same goal with less moving parts. Also brings in the locationd GPS
ignore, the calibrationd valid=calStatus gate, and the model-revert lineage's
controlsd / paramsd / torqued / events.py / carstate.py / interfaces.py.
This is a single new commit on clearpilot (no merge), so the branch advances
linearly while the file state matches modelrevert exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,358 +0,0 @@
|
||||
# Session: 2026-04-26 — GPS disabled in locationd; calibrationd-still-stale notes
|
||||
|
||||
## Context
|
||||
|
||||
Followup session to `2026-04-26-0914-baseline-revert-and-parked-mode`.
|
||||
After the baseline restore, the manager wouldn't start cleanly and the car
|
||||
exhibited a "drifting right on straight roads, model rescues us mid-curve"
|
||||
symptom. This session unblocked the startup chain end-to-end so the car can
|
||||
boot and run, disabled GPS as an input to locationd (the actual fix that
|
||||
made the drift go away), and pinned down — but did **not** solve — why
|
||||
`liveCalibration.valid` is still stuck at `False` and what that latently
|
||||
breaks downstream.
|
||||
|
||||
Pre-session tip: `27cad05`. Single combined commit covering four code
|
||||
changes plus this README:
|
||||
|
||||
- `selfdrive/boardd/boardd.cc` — `safety_setter_thread` on ignition edge
|
||||
- `selfdrive/controls/controlsd.py` — drop unregistered
|
||||
`no_lat_lane_change` Params write, wire `FPCC.noLatLaneChange` for UI
|
||||
- `cereal/services.py` — deviceState/managerState back to 2Hz to match
|
||||
restored `DT_TRML`
|
||||
- `selfdrive/locationd/locationd.cc` — ignore GPS as Kalman input
|
||||
(reversible flag)
|
||||
|
||||
---
|
||||
|
||||
## Commit 1: boardd — safety_setter_thread on ignition edge
|
||||
|
||||
### Symptom
|
||||
|
||||
Manager started fine but `controlsd_parked` blocked indefinitely at:
|
||||
|
||||
```
|
||||
set_obd_multiplexing (selfdrive/car/fw_versions.py:236)
|
||||
fingerprint (selfdrive/car/car_helpers.py:149)
|
||||
get_car (selfdrive/car/car_helpers.py:210)
|
||||
__init__ (selfdrive/car/card.py:60)
|
||||
main (selfdrive/controls/controlsd_parked.py:41)
|
||||
```
|
||||
|
||||
`set_obd_multiplexing` does `params.get_bool("ObdMultiplexingChanged", block=True)` —
|
||||
it waits for **boardd's `safety_setter_thread`** to ack. That thread spawns
|
||||
only on the **rising edge of `IsOnroad`** (`boardd.cc:476`). `IsOnroad` is set
|
||||
by manager from the `started` flag (`helpers.py:49`). With the parked-mode split
|
||||
from the prior session, `started` requires `not_parked`, which requires thermald
|
||||
to see `carState.gearShifter != park`, which requires `controlsd_parked` to
|
||||
publish carState — which it can't do until OBD multiplexing is acked.
|
||||
|
||||
Classic deadlock: ignition rising no longer implies IsOnroad rising.
|
||||
|
||||
### Fix
|
||||
|
||||
`selfdrive/boardd/boardd.cc:476` — change the trigger from IsOnroad rising
|
||||
edge to **ignition rising edge**. Adds `bool ignition_last = false;` next to
|
||||
`is_onroad_last`, swaps the gate variable, sets `ignition_last = ignition;`
|
||||
after. Restores stock openpilot's intent: set safety as soon as the bus is
|
||||
alive. Both controlsd variants need it; the thread's phase 2 (waiting on
|
||||
`ControlsReady`) is harmless in parked mode — it just sits.
|
||||
|
||||
`is_onroad`/`is_onroad_last` left in place; they're now unused beyond this
|
||||
gate but the read of `params.getBool("IsOnroad")` still happens, in case any
|
||||
future logic wants it.
|
||||
|
||||
---
|
||||
|
||||
## Commit 2: controlsd — drop unregistered Params write
|
||||
|
||||
### Symptom
|
||||
|
||||
After the boardd fix, controlsd_parked progressed but full controlsd
|
||||
crashed at first `state_control` cycle:
|
||||
|
||||
```
|
||||
File "selfdrive/controls/controlsd.py", line 693, in state_control
|
||||
self.params_memory.put_bool("no_lat_lane_change", False)
|
||||
common.params_pyx.UnknownKeyName: b'no_lat_lane_change'
|
||||
```
|
||||
|
||||
The baseline-restored controlsd unconditionally writes `no_lat_lane_change`
|
||||
to memory params on every state_control cycle. That key was never
|
||||
registered in `common/params.cc`. Pre-revert (`62a403d`) controlsd didn't
|
||||
write the param at all — it set `self.FPCC.noLatLaneChange` (capnp field).
|
||||
The baseline brought back a code path the fork's params.cc never matched.
|
||||
|
||||
### Fix
|
||||
|
||||
In `selfdrive/controls/controlsd.py:687-694`, remove the two
|
||||
`params_memory.put_bool("no_lat_lane_change", ...)` calls and replace with
|
||||
`self.FPCC.noLatLaneChange = True/False` (matches what the kept UI code in
|
||||
`onroad.cc:897` and `ui.cc:122` is reading from cereal). The
|
||||
`frogpilot_variables.no_lat_lane_change` writes were already there — those
|
||||
are what the kept Hyundai carcontroller actually reads at 100Hz.
|
||||
|
||||
No actuator change. Pure plumbing.
|
||||
|
||||
---
|
||||
|
||||
## Commit 3: services.py — restore deviceState/managerState rates
|
||||
|
||||
### Symptom
|
||||
|
||||
After fixing the controlsd crash, controlsd booted but immediately fired
|
||||
**continuous `commIssue`** with:
|
||||
|
||||
```
|
||||
"not_freq_ok": ["deviceState", "managerState"]
|
||||
```
|
||||
|
||||
### Diagnosis
|
||||
|
||||
`common/realtime.py` was reverted to `DT_TRML = 0.5` (thermald → 2Hz) in
|
||||
the baseline restore. But `cereal/services.py` still declared `deviceState`
|
||||
and `managerState` at 5Hz from earlier 4Hz-fan-control work. The freq window
|
||||
is `[0.8 × min, 1.2 × max]` — declared 5Hz means [4.0, 6.0]Hz. Thermald
|
||||
publishing at 2Hz fell well below 4.0 → freq_ok=False every cycle →
|
||||
commIssue every cycle.
|
||||
|
||||
Already documented as a known footgun in `CLAUDE.md` ("Changing a Service's
|
||||
Publish Rate").
|
||||
|
||||
### Fix
|
||||
|
||||
`cereal/services.py:33,73` — set both back to 2Hz with a comment pointing
|
||||
at `DT_TRML`. After this, `not_freq_ok=[]` and the continuous commIssue
|
||||
stopped — only one transient commIssue at startup remained (warmup).
|
||||
|
||||
---
|
||||
|
||||
## Commit 4: locationd — ignore GPS as Kalman input (reversible)
|
||||
|
||||
### Why
|
||||
|
||||
The car was drifting right on straight roads. Pre-revert, the user
|
||||
reported this had been working for years; the only practically-new thing
|
||||
is that `system/clearpilot/gpsd.py` (AT-command-based GPS) had recently
|
||||
**started actually getting fixes**, where for a long time it wasn't.
|
||||
|
||||
Two concrete data problems with the GPS feed for selfdrive purposes:
|
||||
|
||||
- `gpsd.py:221` hard-codes `gps.vNED = [0.0, 0.0, 0.0]` while the user is
|
||||
moving 28 m/s. locationd's `handle_gps` derives `OBSERVATION_ECEF_VEL`
|
||||
from this; the Kalman gets "GPS says you're stopped" while accelerometer
|
||||
says otherwise.
|
||||
- `gpsd.py:216,222` populate horizontalAccuracy/verticalAccuracy from
|
||||
`hdop * 5` (a rough conversion); `bearingAccuracyDeg = 10.0` and
|
||||
`speedAccuracy = 1.0` are constants. None of these match what a real
|
||||
GNSS chip reports.
|
||||
|
||||
These flow into the latcontrol_torque pipeline indirectly through
|
||||
`liveLocationKalman.angularVelocityCalibrated` (used as
|
||||
`actual_curvature_llk = ... / CS.vEgo` blended with steering-angle-derived
|
||||
curvature) and through `liveLocationKalman.calibratedOrientationNED`
|
||||
(pitch). When the Kalman has wrong velocity observations, those derived
|
||||
fields go wrong — and on a straight crowned road, the controller's
|
||||
"actual_curvature" picture is off-center, biasing torque output.
|
||||
|
||||
### What the user does NOT want disabled
|
||||
|
||||
gpsd publishes are still consumed by:
|
||||
- UI speed indicator and the ClearPilot status overlay
|
||||
- `dashcamd` for `.srt` subtitle sidecars (lat/lon per segment)
|
||||
- `timed.py` for system-clock setting via `unixTimestampMillis`
|
||||
- `gpsd.py` itself for sunset/sunrise → `IsDaylight` (auto night mode)
|
||||
- `telemetryd.py` for the `gps` group in the CSV
|
||||
|
||||
So we cannot just stop publishing — only locationd should ignore.
|
||||
|
||||
### Fix
|
||||
|
||||
`selfdrive/locationd/locationd.cc:310-323` — add a `clearpilot_disable_gps`
|
||||
const at the top of `Localizer::handle_gps` and OR it into the existing
|
||||
reject condition. With it true, every gpsLocation message falls through to
|
||||
`determine_gps_mode(current_time)` (openpilot's stock no-GPS path:
|
||||
`input_fake_gps_observations` once position uncertainty grows past
|
||||
`SANE_GPS_UNCERTAINTY`, otherwise nothing). `last_gps_msg` never updates,
|
||||
`is_gps_ok()` returns False, `liveLocationKalman.gpsOK = false`.
|
||||
|
||||
Effect verified by user: drift improved noticeably. The torque controller
|
||||
is no longer fed contradictory GPS-vs-IMU velocity observations through
|
||||
the Kalman.
|
||||
|
||||
To re-enable GPS as a Kalman input: flip the `clearpilot_disable_gps`
|
||||
constant to `false` and rebuild. Self-contained edit.
|
||||
|
||||
---
|
||||
|
||||
## Calibrationd: still stale, root-cause partially understood
|
||||
|
||||
### What is happening
|
||||
|
||||
`calibrationd` publishes `liveCalibration.valid = sm.all_checks()` on every
|
||||
cycle, where `sm` polls `cameraOdometry` and non-polls `carState` and
|
||||
`carParams`. We measured: **120 publishes in 30 seconds, every single one
|
||||
`valid=False`** with `calStatus=calibrated, calPerc=100, validBlocks=50`.
|
||||
The body of the calibration is converged — the validity flag is stuck off.
|
||||
|
||||
### Why this matters
|
||||
|
||||
`liveCalibration.valid=False` cascades:
|
||||
|
||||
1. `locationd.cc:715` — `filterInitialized = sm.allAliveAndValid()`.
|
||||
liveCalibration is in the sub list and not in `ignore_alive` /
|
||||
`ignore_valid`. So filterInitialized stays False forever.
|
||||
2. With Kalman uninitialized, `liveLocationKalman` still publishes but the
|
||||
body fields are empty/default. `liveLocationKalman.status = uninitialized`.
|
||||
3. `paramsd.py` subscribes to `liveLocationKalman` with `poll='liveLocationKalman'`
|
||||
and gates its update logic on `sm.all_checks()`. When liveLocationKalman
|
||||
itself isn't in a sane state, paramsd's `roll`, `angleOffsetDeg`,
|
||||
`steerRatio`, `stiffnessFactor` either never converge or converge to bad
|
||||
values.
|
||||
4. `latcontrol_torque.py:135-136` uses `params.angleOffsetDeg` to compute
|
||||
`actual_curvature_vm` and `params.roll` for `roll_compensation`. With
|
||||
`roll=0`, no compensation for a crowned/banked road. With wrong
|
||||
`angleOffsetDeg`, the closed-loop "actual curvature" measurement is
|
||||
biased.
|
||||
|
||||
So the latent risk is: even with our GPS fix, the controller is running
|
||||
**without learned roll compensation and without a learned steering-angle
|
||||
offset**. Symptom-free on straight, level pavement; biased on banked roads.
|
||||
|
||||
### Why `valid=False`
|
||||
|
||||
Inside calibrationd's SubMaster, `sm.all_checks() = all_alive AND all_freq_ok
|
||||
AND all_valid`. We measured each:
|
||||
|
||||
- `cameraOdometry`: alive=True, valid=True, freq_ok=True ✓
|
||||
- `carState`: alive flickers True/False, valid=True, **freq_ok=False (every cycle)**
|
||||
- `carParams`: alive=True after first arrival, valid=True, freq_ok=False
|
||||
but excluded from `all_freq_ok` because `_check_avg_freq` skips services
|
||||
with `frequency < 0.99` Hz (carParams is 0.02 Hz declared) — so it doesn't
|
||||
fail the gate.
|
||||
|
||||
The smoking gun is **`carState.freq_ok = False` from inside calibrationd's
|
||||
`poll='cameraOdometry'` SubMaster**.
|
||||
|
||||
Direct measurement (`/tmp/test_subs.py`):
|
||||
|
||||
| poll arg | carState observed rate | freq_ok |
|
||||
|---|---:|:-:|
|
||||
| `None` (all polled) | 97.40 Hz | True |
|
||||
| `'cameraOdometry'` | **2.24 Hz** | **False** |
|
||||
| `'carState'` | 98.58 Hz | True |
|
||||
|
||||
carState is published at 100 Hz to a 10MB shared-memory MSGQ queue. From
|
||||
inside `poll='cameraOdometry'`, our non-blocking `recv_one_or_none(carState)`
|
||||
returns None ~88% of the time. `/tmp/diag_recv.py`:
|
||||
|
||||
```
|
||||
cameraOdometry msgs received: 121
|
||||
carState recv calls: 121, hits: 14 (11.6% hit rate)
|
||||
avg cycle duration: 50.0ms
|
||||
```
|
||||
|
||||
So we're calling `recv` 20× per second on carState's queue, and finding
|
||||
the queue empty 9 out of 10 times — even though carState is being published
|
||||
at 100 Hz to that same queue with a `conflate=True` socket option.
|
||||
|
||||
### The MSGQ NUM_READERS = 12 hypothesis
|
||||
|
||||
`cereal/messaging/msgq.h:9` defines `#define NUM_READERS 12`. When a 13th
|
||||
subscriber tries to subscribe to a queue, `msgq.cc:182-197` **invalidates
|
||||
ALL readers simultaneously** (`*q->read_valids[i] = false` for all i)
|
||||
to "reset and re-register". On the next read, an invalidated reader's
|
||||
`msgq_msg_recv` jumps to `msgq_reset_reader` (line 347) and ends up with
|
||||
`read_pointer == write_pointer` (caught up to current), returning size 0.
|
||||
|
||||
Subscribers to `carState` in our running system include: controlsd,
|
||||
plannerd, locationd, calibrationd, paramsd, dmonitoringd, frogpilot_process,
|
||||
telemetryd, statsd. That's already nine. The introspection scripts
|
||||
(`/tmp/check_*.py`, `/tmp/measure_freq.py`, `/tmp/diag_recv.py`,
|
||||
`/tmp/cal_view.py`) and any UI-side subscribers add more, can easily
|
||||
push past 12 and trigger global eviction. Once evicted, a long-running
|
||||
subscriber stays in the "find empty queue / reset / try again" loop, which
|
||||
is what we measured.
|
||||
|
||||
This is **partially confirmed** but not proven definitively. The
|
||||
investigation was paused before instrumenting the queue header to count
|
||||
slot churn. The smoking-gun would be: print `q->num_readers` from inside
|
||||
calibrationd at boot vs. during steady state and watch it tick up to 12+.
|
||||
|
||||
### Things to consider for the actual fix
|
||||
|
||||
1. **`7ee923b` already solved this exact problem.** It changed
|
||||
calibrationd's publish to:
|
||||
```python
|
||||
# was: calibrator.send_data(pm, sm.all_checks())
|
||||
# to: calibrator.send_data(pm, calibrator.cal_status == log.LiveCalibrationData.Status.calibrated)
|
||||
```
|
||||
The commit message documented the exact failure mode (cascade through
|
||||
locationd uninitialized → paramsd steerRatio≈0 / stiffnessFactor≈0 →
|
||||
nonsense curvature commands). It was reverted by `47321e3` as part of
|
||||
"restore driving logic to pre-variable-fps baseline" — but that revert
|
||||
was about isolating the drift cause, and the calibrationd change here
|
||||
is *not* in the variable-FPS family. **Re-applying `7ee923b` is
|
||||
probably the right next move**, narrowly scoped to calibrationd.py.
|
||||
2. Less attractive alternatives: bump `NUM_READERS`; switch
|
||||
carState to `poll=None` in calibrationd (more cycles per update,
|
||||
higher CPU); add `ignore_average_freq=['carState']` to calibrationd's
|
||||
SubMaster (treats freq glitches as benign, but keeps the cascade for
|
||||
alive/valid).
|
||||
|
||||
---
|
||||
|
||||
## Steer-fault alert investigation (separate symptom)
|
||||
|
||||
User saw "Steering Temporarily Unavailable" alerts during a test drive
|
||||
even though we hadn't touched lateral-control code. Captured in
|
||||
`realdata/00000081--528e2aa03a--0/rlog`:
|
||||
|
||||
- Faults occur in **brief 50–100 ms pulses** clustered while moving slowly
|
||||
(`vEgo` 2.7–5.2 m/s ≈ 6–12 mph).
|
||||
- Each pulse correlates with **large driver wheel torque** (-250, +273,
|
||||
-187, +119 Nm) — i.e. the user actively turning the wheel during a
|
||||
parking-lot maneuver.
|
||||
- `cruise.enabled = False` throughout — openpilot was not engaged.
|
||||
- The car's MDPS sets `LKA_FAULT` at low speeds when torque is high; that
|
||||
bit maps directly to `cs_out.steerFaultTemporary` (`carstate.py:259`),
|
||||
which fires `steerTempUnavailableSilent` regardless of engagement
|
||||
(`ET.WARNING` displays unconditionally).
|
||||
|
||||
User reports they've "never had this issue before" — implying earlier
|
||||
ClearPilot revisions either gated the fault on speed or used a different
|
||||
no-lateral path. **Not confirmed which.** Open follow-up.
|
||||
|
||||
---
|
||||
|
||||
## Open follow-ups (ordered by likely return)
|
||||
|
||||
1. **Re-apply `7ee923b`** — gate `liveCalibration.valid` on `calStatus`,
|
||||
not `sm.all_checks()`. Unblocks locationd init → paramsd convergence →
|
||||
real `params.roll` and `params.angleOffsetDeg` for the torque
|
||||
controller. Latent benefit beyond what GPS-disable alone gave us.
|
||||
2. **Investigate the persistent low-speed `steerTempUnavailable` alert.**
|
||||
Either (a) gate `steerFaultTemporary` on `vEgo > ~8 m/s` in
|
||||
`carstate.py:259`, or (b) find what the previous fork did — possibly
|
||||
stopped sending tester CAN messages on park, possibly suppressed the
|
||||
alert specifically during a park transition window.
|
||||
3. **Suppress LKAS fault display when shifting drive → park.** The user
|
||||
reports the car shows an LKAS-fault icon when openpilot keeps publishing
|
||||
tester-present CAN messages after entering park. Investigation needed
|
||||
in `selfdrive/car/hyundai/carcontroller.py` and `hyundaicanfd.py` to
|
||||
gate tester messages on gear ≠ park.
|
||||
4. **Wheel-torque headroom edit.** User mentioned a community-known edit
|
||||
that allows slightly higher steering torque on the Hyundai panda safety
|
||||
model. Research target: panda safety code for HYUNDAI_CANFD safety
|
||||
model and the `MAX_TORQUE` / per-cycle delta limits.
|
||||
5. **Single startup `commIssue` event.** Even with all our fixes, controlsd
|
||||
logs one transient commIssue right after `controlsd.initialized`
|
||||
(timeout=true after 6s). The `invalid` set at that moment is
|
||||
downstream services still warming up (liveCalibration, liveLocationKalman,
|
||||
liveParameters, liveTorqueParameters, frogpilotPlan, longitudinalPlan,
|
||||
driverMonitoringState). Most should clear once the calibrationd issue
|
||||
is fixed; remaining ones are normal warmup.
|
||||
6. **gpsd.py vNED / accuracy fields.** Out of scope for this session
|
||||
(we disabled GPS in locationd instead), but if GPS is ever re-enabled,
|
||||
`gpsd.py:216,221-224` need real values: vNED from
|
||||
`speed × {cos(bearing), sin(bearing), 0}`, and accuracy fields from
|
||||
actual modem reports rather than hard-coded constants.
|
||||
Reference in New Issue
Block a user