Four logically-related changes that together get the manager booting
cleanly end-to-end after the prior session's baseline revert + parked-mode
split, plus a session doc.
boardd: trigger safety_setter_thread on ignition rising edge, not on
IsOnroad rising edge. The parked-mode split broke the stock assumption
that ignition rising implies IsOnroad rising — IsOnroad now requires
`started`, which requires thermald to see carState != park, which
requires controlsd_parked to publish carState, which can't happen until
boardd acks OBD multiplexing. Triggering on ignition edge restores the
"set safety as soon as the bus is alive" intent for both controlsd
variants.
controlsd: drop the two `params_memory.put_bool("no_lat_lane_change",
...)` calls. The key was never registered in common/params.cc so
state_control crashed with UnknownKeyName on first cycle. The carcontroller
reads off `frogpilot_variables.no_lat_lane_change` (in-process), which
controlsd already sets; the UI reads `frogpilotCarControl.NoLatLaneChange`
from cereal, which nobody was setting in the restored controlsd. Add
`self.FPCC.noLatLaneChange = True/False` in the same lane-change branch
so the UI lane-edge indicator reflects state. No actuator change.
cereal/services.py: restore deviceState/managerState declarations to 2Hz
to match the restored DT_TRML=0.5 (thermald at 2Hz). Earlier fan-control
work bumped both to 5Hz; the realtime.py revert undid the thermald rate
bump but services.py wasn't reverted, so freq window [4.0, 6.0]Hz failed
on every cycle and controlsd fired commIssue continuously.
locationd: add a `clearpilot_disable_gps` const at the top of handle_gps
and OR it into the existing reject condition. With it true, every
gpsLocation message falls through to determine_gps_mode() — openpilot's
stock no-GPS path. last_gps_msg never updates, is_gps_ok() permanently
false. gpsd is untouched so UI / dashcamd / clock-set / night-mode auto-
switch keep working unchanged. The user's "drift right on straight roads"
symptom went away after this edit; the previous gpsd.py was hard-coding
vNED=[0,0,0] while the car was moving, feeding the Kalman contradictory
GPS-vs-IMU velocity observations that propagated into latcontrol_torque
through liveLocationKalman.angularVelocityCalibrated. Reversible by
flipping the const to false.
sessions/: a single README documenting this session. Includes the
calibrationd-still-stale investigation — liveCalibration.valid stuck at
False because of a Python SubMaster freq_ok issue with carState under
poll='cameraOdometry' (likely MSGQ NUM_READERS=12 eviction with too many
subscribers). 7ee923b already solved this exact failure mode by gating
calibrationd's publish on calStatus instead of sm.all_checks(); that
commit was reverted in 47321e3 as part of the variable-FPS rollback
but is unrelated to that family and is the natural next move.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16 KiB
Session: 2026-04-26 — GPS disabled in locationd; calibrationd-still-stale notes
Context
Followup session to 2026-04-26-0914-baseline-revert-and-parked-mode.
After the baseline restore, the manager wouldn't start cleanly and the car
exhibited a "drifting right on straight roads, model rescues us mid-curve"
symptom. This session unblocked the startup chain end-to-end so the car can
boot and run, disabled GPS as an input to locationd (the actual fix that
made the drift go away), and pinned down — but did not solve — why
liveCalibration.valid is still stuck at False and what that latently
breaks downstream.
Pre-session tip: 27cad05. Single combined commit covering four code
changes plus this README:
selfdrive/boardd/boardd.cc—safety_setter_threadon ignition edgeselfdrive/controls/controlsd.py— drop unregisteredno_lat_lane_changeParams write, wireFPCC.noLatLaneChangefor UIcereal/services.py— deviceState/managerState back to 2Hz to match restoredDT_TRMLselfdrive/locationd/locationd.cc— ignore GPS as Kalman input (reversible flag)
Commit 1: boardd — safety_setter_thread on ignition edge
Symptom
Manager started fine but controlsd_parked blocked indefinitely at:
set_obd_multiplexing (selfdrive/car/fw_versions.py:236)
fingerprint (selfdrive/car/car_helpers.py:149)
get_car (selfdrive/car/car_helpers.py:210)
__init__ (selfdrive/car/card.py:60)
main (selfdrive/controls/controlsd_parked.py:41)
set_obd_multiplexing does params.get_bool("ObdMultiplexingChanged", block=True) —
it waits for boardd's safety_setter_thread to ack. That thread spawns
only on the rising edge of IsOnroad (boardd.cc:476). IsOnroad is set
by manager from the started flag (helpers.py:49). With the parked-mode split
from the prior session, started requires not_parked, which requires thermald
to see carState.gearShifter != park, which requires controlsd_parked to
publish carState — which it can't do until OBD multiplexing is acked.
Classic deadlock: ignition rising no longer implies IsOnroad rising.
Fix
selfdrive/boardd/boardd.cc:476 — change the trigger from IsOnroad rising
edge to ignition rising edge. Adds bool ignition_last = false; next to
is_onroad_last, swaps the gate variable, sets ignition_last = ignition;
after. Restores stock openpilot's intent: set safety as soon as the bus is
alive. Both controlsd variants need it; the thread's phase 2 (waiting on
ControlsReady) is harmless in parked mode — it just sits.
is_onroad/is_onroad_last left in place; they're now unused beyond this
gate but the read of params.getBool("IsOnroad") still happens, in case any
future logic wants it.
Commit 2: controlsd — drop unregistered Params write
Symptom
After the boardd fix, controlsd_parked progressed but full controlsd
crashed at first state_control cycle:
File "selfdrive/controls/controlsd.py", line 693, in state_control
self.params_memory.put_bool("no_lat_lane_change", False)
common.params_pyx.UnknownKeyName: b'no_lat_lane_change'
The baseline-restored controlsd unconditionally writes no_lat_lane_change
to memory params on every state_control cycle. That key was never
registered in common/params.cc. Pre-revert (62a403d) controlsd didn't
write the param at all — it set self.FPCC.noLatLaneChange (capnp field).
The baseline brought back a code path the fork's params.cc never matched.
Fix
In selfdrive/controls/controlsd.py:687-694, remove the two
params_memory.put_bool("no_lat_lane_change", ...) calls and replace with
self.FPCC.noLatLaneChange = True/False (matches what the kept UI code in
onroad.cc:897 and ui.cc:122 is reading from cereal). The
frogpilot_variables.no_lat_lane_change writes were already there — those
are what the kept Hyundai carcontroller actually reads at 100Hz.
No actuator change. Pure plumbing.
Commit 3: services.py — restore deviceState/managerState rates
Symptom
After fixing the controlsd crash, controlsd booted but immediately fired
continuous commIssue with:
"not_freq_ok": ["deviceState", "managerState"]
Diagnosis
common/realtime.py was reverted to DT_TRML = 0.5 (thermald → 2Hz) in
the baseline restore. But cereal/services.py still declared deviceState
and managerState at 5Hz from earlier 4Hz-fan-control work. The freq window
is [0.8 × min, 1.2 × max] — declared 5Hz means [4.0, 6.0]Hz. Thermald
publishing at 2Hz fell well below 4.0 → freq_ok=False every cycle →
commIssue every cycle.
Already documented as a known footgun in CLAUDE.md ("Changing a Service's
Publish Rate").
Fix
cereal/services.py:33,73 — set both back to 2Hz with a comment pointing
at DT_TRML. After this, not_freq_ok=[] and the continuous commIssue
stopped — only one transient commIssue at startup remained (warmup).
Commit 4: locationd — ignore GPS as Kalman input (reversible)
Why
The car was drifting right on straight roads. Pre-revert, the user
reported this had been working for years; the only practically-new thing
is that system/clearpilot/gpsd.py (AT-command-based GPS) had recently
started actually getting fixes, where for a long time it wasn't.
Two concrete data problems with the GPS feed for selfdrive purposes:
gpsd.py:221hard-codesgps.vNED = [0.0, 0.0, 0.0]while the user is moving 28 m/s. locationd'shandle_gpsderivesOBSERVATION_ECEF_VELfrom this; the Kalman gets "GPS says you're stopped" while accelerometer says otherwise.gpsd.py:216,222populate horizontalAccuracy/verticalAccuracy fromhdop * 5(a rough conversion);bearingAccuracyDeg = 10.0andspeedAccuracy = 1.0are constants. None of these match what a real GNSS chip reports.
These flow into the latcontrol_torque pipeline indirectly through
liveLocationKalman.angularVelocityCalibrated (used as
actual_curvature_llk = ... / CS.vEgo blended with steering-angle-derived
curvature) and through liveLocationKalman.calibratedOrientationNED
(pitch). When the Kalman has wrong velocity observations, those derived
fields go wrong — and on a straight crowned road, the controller's
"actual_curvature" picture is off-center, biasing torque output.
What the user does NOT want disabled
gpsd publishes are still consumed by:
- UI speed indicator and the ClearPilot status overlay
dashcamdfor.srtsubtitle sidecars (lat/lon per segment)timed.pyfor system-clock setting viaunixTimestampMillisgpsd.pyitself for sunset/sunrise →IsDaylight(auto night mode)telemetryd.pyfor thegpsgroup in the CSV
So we cannot just stop publishing — only locationd should ignore.
Fix
selfdrive/locationd/locationd.cc:310-323 — add a clearpilot_disable_gps
const at the top of Localizer::handle_gps and OR it into the existing
reject condition. With it true, every gpsLocation message falls through to
determine_gps_mode(current_time) (openpilot's stock no-GPS path:
input_fake_gps_observations once position uncertainty grows past
SANE_GPS_UNCERTAINTY, otherwise nothing). last_gps_msg never updates,
is_gps_ok() returns False, liveLocationKalman.gpsOK = false.
Effect verified by user: drift improved noticeably. The torque controller is no longer fed contradictory GPS-vs-IMU velocity observations through the Kalman.
To re-enable GPS as a Kalman input: flip the clearpilot_disable_gps
constant to false and rebuild. Self-contained edit.
Calibrationd: still stale, root-cause partially understood
What is happening
calibrationd publishes liveCalibration.valid = sm.all_checks() on every
cycle, where sm polls cameraOdometry and non-polls carState and
carParams. We measured: 120 publishes in 30 seconds, every single one
valid=False with calStatus=calibrated, calPerc=100, validBlocks=50.
The body of the calibration is converged — the validity flag is stuck off.
Why this matters
liveCalibration.valid=False cascades:
locationd.cc:715—filterInitialized = sm.allAliveAndValid(). liveCalibration is in the sub list and not inignore_alive/ignore_valid. So filterInitialized stays False forever.- With Kalman uninitialized,
liveLocationKalmanstill publishes but the body fields are empty/default.liveLocationKalman.status = uninitialized. paramsd.pysubscribes toliveLocationKalmanwithpoll='liveLocationKalman'and gates its update logic onsm.all_checks(). When liveLocationKalman itself isn't in a sane state, paramsd'sroll,angleOffsetDeg,steerRatio,stiffnessFactoreither never converge or converge to bad values.latcontrol_torque.py:135-136usesparams.angleOffsetDegto computeactual_curvature_vmandparams.rollforroll_compensation. Withroll=0, no compensation for a crowned/banked road. With wrongangleOffsetDeg, the closed-loop "actual curvature" measurement is biased.
So the latent risk is: even with our GPS fix, the controller is running without learned roll compensation and without a learned steering-angle offset. Symptom-free on straight, level pavement; biased on banked roads.
Why valid=False
Inside calibrationd's SubMaster, sm.all_checks() = all_alive AND all_freq_ok AND all_valid. We measured each:
cameraOdometry: alive=True, valid=True, freq_ok=True ✓carState: alive flickers True/False, valid=True, freq_ok=False (every cycle)carParams: alive=True after first arrival, valid=True, freq_ok=False but excluded fromall_freq_okbecause_check_avg_freqskips services withfrequency < 0.99Hz (carParams is 0.02 Hz declared) — so it doesn't fail the gate.
The smoking gun is carState.freq_ok = False from inside calibrationd's
poll='cameraOdometry' SubMaster.
Direct measurement (/tmp/test_subs.py):
| poll arg | carState observed rate | freq_ok |
|---|---|---|
None (all polled) |
97.40 Hz | True |
'cameraOdometry' |
2.24 Hz | False |
'carState' |
98.58 Hz | True |
carState is published at 100 Hz to a 10MB shared-memory MSGQ queue. From
inside poll='cameraOdometry', our non-blocking recv_one_or_none(carState)
returns None ~88% of the time. /tmp/diag_recv.py:
cameraOdometry msgs received: 121
carState recv calls: 121, hits: 14 (11.6% hit rate)
avg cycle duration: 50.0ms
So we're calling recv 20× per second on carState's queue, and finding
the queue empty 9 out of 10 times — even though carState is being published
at 100 Hz to that same queue with a conflate=True socket option.
The MSGQ NUM_READERS = 12 hypothesis
cereal/messaging/msgq.h:9 defines #define NUM_READERS 12. When a 13th
subscriber tries to subscribe to a queue, msgq.cc:182-197 invalidates
ALL readers simultaneously (*q->read_valids[i] = false for all i)
to "reset and re-register". On the next read, an invalidated reader's
msgq_msg_recv jumps to msgq_reset_reader (line 347) and ends up with
read_pointer == write_pointer (caught up to current), returning size 0.
Subscribers to carState in our running system include: controlsd,
plannerd, locationd, calibrationd, paramsd, dmonitoringd, frogpilot_process,
telemetryd, statsd. That's already nine. The introspection scripts
(/tmp/check_*.py, /tmp/measure_freq.py, /tmp/diag_recv.py,
/tmp/cal_view.py) and any UI-side subscribers add more, can easily
push past 12 and trigger global eviction. Once evicted, a long-running
subscriber stays in the "find empty queue / reset / try again" loop, which
is what we measured.
This is partially confirmed but not proven definitively. The
investigation was paused before instrumenting the queue header to count
slot churn. The smoking-gun would be: print q->num_readers from inside
calibrationd at boot vs. during steady state and watch it tick up to 12+.
Things to consider for the actual fix
7ee923balready solved this exact problem. It changed calibrationd's publish to:The commit message documented the exact failure mode (cascade through locationd uninitialized → paramsd steerRatio≈0 / stiffnessFactor≈0 → nonsense curvature commands). It was reverted by# was: calibrator.send_data(pm, sm.all_checks()) # to: calibrator.send_data(pm, calibrator.cal_status == log.LiveCalibrationData.Status.calibrated)47321e3as part of "restore driving logic to pre-variable-fps baseline" — but that revert was about isolating the drift cause, and the calibrationd change here is not in the variable-FPS family. Re-applying7ee923bis probably the right next move, narrowly scoped to calibrationd.py.- Less attractive alternatives: bump
NUM_READERS; switch carState topoll=Nonein calibrationd (more cycles per update, higher CPU); addignore_average_freq=['carState']to calibrationd's SubMaster (treats freq glitches as benign, but keeps the cascade for alive/valid).
Steer-fault alert investigation (separate symptom)
User saw "Steering Temporarily Unavailable" alerts during a test drive
even though we hadn't touched lateral-control code. Captured in
realdata/00000081--528e2aa03a--0/rlog:
- Faults occur in brief 50–100 ms pulses clustered while moving slowly
(
vEgo2.7–5.2 m/s ≈ 6–12 mph). - Each pulse correlates with large driver wheel torque (-250, +273, -187, +119 Nm) — i.e. the user actively turning the wheel during a parking-lot maneuver.
cruise.enabled = Falsethroughout — openpilot was not engaged.- The car's MDPS sets
LKA_FAULTat low speeds when torque is high; that bit maps directly tocs_out.steerFaultTemporary(carstate.py:259), which firessteerTempUnavailableSilentregardless of engagement (ET.WARNINGdisplays unconditionally).
User reports they've "never had this issue before" — implying earlier ClearPilot revisions either gated the fault on speed or used a different no-lateral path. Not confirmed which. Open follow-up.
Open follow-ups (ordered by likely return)
- Re-apply
7ee923b— gateliveCalibration.validoncalStatus, notsm.all_checks(). Unblocks locationd init → paramsd convergence → realparams.rollandparams.angleOffsetDegfor the torque controller. Latent benefit beyond what GPS-disable alone gave us. - Investigate the persistent low-speed
steerTempUnavailablealert. Either (a) gatesteerFaultTemporaryonvEgo > ~8 m/sincarstate.py:259, or (b) find what the previous fork did — possibly stopped sending tester CAN messages on park, possibly suppressed the alert specifically during a park transition window. - Suppress LKAS fault display when shifting drive → park. The user
reports the car shows an LKAS-fault icon when openpilot keeps publishing
tester-present CAN messages after entering park. Investigation needed
in
selfdrive/car/hyundai/carcontroller.pyandhyundaicanfd.pyto gate tester messages on gear ≠ park. - Wheel-torque headroom edit. User mentioned a community-known edit
that allows slightly higher steering torque on the Hyundai panda safety
model. Research target: panda safety code for HYUNDAI_CANFD safety
model and the
MAX_TORQUE/ per-cycle delta limits. - Single startup
commIssueevent. Even with all our fixes, controlsd logs one transient commIssue right aftercontrolsd.initialized(timeout=true after 6s). Theinvalidset at that moment is downstream services still warming up (liveCalibration, liveLocationKalman, liveParameters, liveTorqueParameters, frogpilotPlan, longitudinalPlan, driverMonitoringState). Most should clear once the calibrationd issue is fixed; remaining ones are normal warmup. - gpsd.py vNED / accuracy fields. Out of scope for this session
(we disabled GPS in locationd instead), but if GPS is ever re-enabled,
gpsd.py:216,221-224need real values: vNED fromspeed × {cos(bearing), sin(bearing), 0}, and accuracy fields from actual modem reports rather than hard-coded constants.