On 15/05/2026 3:11 am, Amir Ayupov wrote:
In a system-wide `perf record -e cs_etm/.../u` capture on aarch64, synthesized samples emitted by `perf script --itrace=il64` are sometimes attributed to the WRONG sample.pid/tid (and to the wrong EL/cpumode) for the chunk of branches that straddle a context-switch boundary on a CPU. A branch actually retired by process A is emitted with sample.pid set to the thread that next ran on the same CPU.
Mechanism:
- ETM emits CONTEXTIDR/EL packets in-stream when the kernel updates CONTEXTIDR_EL1 on context switch / EL change. OpenCSD turns these into OCSD_GEN_TRC_ELEM_PE_CONTEXT elements interleaved with OCSD_GEN_TRC_ELEM_INSTR_RANGE elements for retired branch ranges.
- cs_etm_decoder__buffer_range() queues each INSTR_RANGE into packet_queue->packet_buffer[]; packets carry start/end addrs, instr_count, last-instruction info, etc., but NO owner identity.
- PE_CONTEXT goes through cs_etm_decoder__set_tid() -> cs_etm__set_thread(), which immediately mutates tidq->thread and tidq->el. Queued packets are not drained first; reset_timestamp() is called so the next TIMESTAMP triggers OCSD_RESP_WAIT and a drain.
- By drain time in cs_etm__process_traceid_queue() -> cs_etm__sample(), sample.pid/tid is read from the now-mutated tidq->thread and sample.cpumode from the now-mutated tidq->el. Pre-context INSTR_RANGEs get the post-context owner.
The same race affects branch samples via tidq->prev_packet_thread / tidq->prev_packet_el, captured at packet-swap time from tidq->thread / tidq->el (which may already have flipped).
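The sequence above can be reproduced in a toy model. This is only a minimal sketch, not the perf code: every toy_* name is hypothetical, a plain int stands in for tidq->thread, and the drain simply reports which tid each queued range would be attributed to.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PACKETS 8

/* Toy stand-in for a buffered INSTR_RANGE: addresses only, no owner. */
struct toy_packet {
	unsigned long start_addr;
	unsigned long end_addr;
};

/* Toy stand-in for the per-trace-ID queue state (hypothetical layout). */
struct toy_tidq {
	int cur_tid;	/* plays the role of tidq->thread */
	struct toy_packet buf[TOY_MAX_PACKETS];
	size_t count;
};

/* Queue a range; nothing about the current owner is recorded. */
static void toy_buffer_range(struct toy_tidq *q, unsigned long start,
			     unsigned long end)
{
	assert(q->count < TOY_MAX_PACKETS);
	q->buf[q->count].start_addr = start;
	q->buf[q->count].end_addr = end;
	q->count++;
}

/* PE_CONTEXT analogue: flips the owner at once, without draining first. */
static void toy_set_context(struct toy_tidq *q, int new_tid)
{
	q->cur_tid = new_tid;
}

/*
 * Drain analogue: each queued packet is attributed to whatever owner is
 * current at drain time. Writes one tid per packet into out_tids and
 * returns the number of packets drained.
 */
static size_t toy_drain(struct toy_tidq *q, int *out_tids)
{
	size_t n = q->count;

	for (size_t i = 0; i < n; i++)
		out_tids[i] = q->cur_tid;
	q->count = 0;
	return n;
}
```

Buffering one range as tid 100, switching context to tid 200, buffering a second range and then draining reports tid 200 for both ranges in this model, which is the misattribution described above.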
This is independent of PERF_RECORD_SWITCH_CPU_WIDE, which is deliberately not used to assign sample identity in this path. The bug applies to any cs_etm capture with in-stream CONTEXTIDR (PIDFMT_CTXTID or PIDFMT_CTXTID2).
Effect on downstream tools: branches that should belong to the previous thread on the CPU get attributed to the next thread. When the two threads do not share a binary, the leaked branches' VAs are resolved against the wrong thread's mappings; samples whose IPs land in r-x mappings silently pollute that binary's profile, while samples landing in R-only/RW mappings show up as out-of-range / non-text samples. Either way, AutoFDO/BOLT profiles built from `perf script --itrace` output of system-wide cs_etm captures contain misattributed samples.
Concrete example from `perf script --itrace=il64` of the same captured branch (same timestamp, same IP, same from/to addrs) before and after this fix:
  before:  launcher_multia 2638146/2638146 705897.219172: \
             fffcda6b124c 0xfffcda641958/0xfffcda6b123c
  after:   ws-tcf-sr-io13 2736581/2741587 705897.219172: \
             fffcda6b124c 0xfffcda641958/0xfffcda6b123c
The branch was retired by ws-tcf-sr-io13 (tid 2741587) but, before the fix, was attributed to launcher_multia (the next thread to run on that CPU after the context switch). After the fix, it is correctly attributed to ws-tcf-sr-io13.
Why not "drain on PE_CONTEXT then switch" (deferred-set_thread): tidq->thread has two consumers: sample emission needs the OUTGOING identity for queued packets, but cs_etm__mem_access() needs the CURRENT thread's maps to fetch instruction bytes for OpenCSD. The two needs are temporally inverted; a single tidq->thread cannot serve both. Keeping tidq->thread current and stamping owner identity per packet is the only design that decouples them cleanly.
Fix: capture the owning pid/tid/EL on each buffered packet at cs_etm_decoder__buffer_packet() time (before any subsequent PE_CONTEXT can mutate tidq->thread / tidq->el), and read them at sample emission time.
- struct cs_etm_packet gains pid_t pid, pid_t tid, int el (storing an ocsd_ex_level value; typed as int so the struct does not depend on OpenCSD headers, which are only included inside HAVE_CSTRACE_SUPPORT).
- cs_etm__etmq_get_pid_tid_el() (formerly cs_etm__etmq_get_pid_tid) returns all three.
- cs_etm__synth_instruction_sample() reads sample.pid / sample.tid from tidq->packet->{pid,tid} and derives sample.cpumode from tidq->packet->el.
- cs_etm__synth_branch_sample() reads sample.pid / sample.tid / cpumode from tidq->prev_packet->{pid,tid,el}.
- The separate prev_packet_thread / prev_packet_el bookkeeping in cs_etm__packet_swap() / cs_etm__init_traceid_queue() / cs_etm__free_traceid_queues() is removed; the per-packet stamp on prev_packet now carries that information.
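The idea behind the fix can be sketched in the same toy style. Again, this is an illustration under stated assumptions rather than the patch itself: the sketch_* names are hypothetical, and the real change touches struct cs_etm_packet and the functions listed above. The point is only that the owner is copied into the packet when it is buffered, and the drain reads it back from the packet instead of from the queue's current state.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* pid_t */

#define SKETCH_MAX_PACKETS 8

/* Toy packet that carries its owner, mirroring the added pid/tid/el. */
struct sketch_packet {
	unsigned long start_addr;
	unsigned long end_addr;
	pid_t pid;
	pid_t tid;
	int el;	/* would hold an ocsd_ex_level value in the real struct */
};

/* Toy queue state; cur_* play the role of tidq->thread / tidq->el. */
struct sketch_tidq {
	pid_t cur_pid;
	pid_t cur_tid;
	int cur_el;
	struct sketch_packet buf[SKETCH_MAX_PACKETS];
	size_t count;
};

/* Buffer a range and stamp it with the owner that is current right now. */
static void sketch_buffer_packet(struct sketch_tidq *q, unsigned long start,
				 unsigned long end)
{
	struct sketch_packet *p;

	assert(q->count < SKETCH_MAX_PACKETS);
	p = &q->buf[q->count++];
	p->start_addr = start;
	p->end_addr = end;
	p->pid = q->cur_pid;
	p->tid = q->cur_tid;
	p->el = q->cur_el;
}

/* PE_CONTEXT analogue: still mutates the current owner immediately. */
static void sketch_set_context(struct sketch_tidq *q, pid_t pid, pid_t tid,
			       int el)
{
	q->cur_pid = pid;
	q->cur_tid = tid;
	q->cur_el = el;
}

/* Drain analogue: attribution comes from each packet's own stamp. */
static size_t sketch_drain(struct sketch_tidq *q, struct sketch_packet *out)
{
	size_t n = q->count;

	for (size_t i = 0; i < n; i++)
		out[i] = q->buf[i];
	q->count = 0;
	return n;
}
```

With this layout, a range buffered before a context change keeps the outgoing pid/tid/el even if the drain happens later, while the queue's current owner stays up to date for any consumer that needs the incoming thread.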
Cost: 12 bytes added to struct cs_etm_packet (~12-16 KB per packet_queue with CS_ETM_PACKET_MAX_BUFFER=1024), 16 bytes saved per cs_etm_traceid_queue (one struct thread * + one ocsd_ex_level).
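The arithmetic behind those figures can be checked with sizeof. The structs below are hypothetical stand-ins for just the added and removed fields, and the numbers assume an LP64 Linux target with 4-byte pid_t and int; the upper end of the 12-16 KB range would correspond to the real struct growing by 16 bytes per packet once alignment padding is included.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* pid_t */

/* Mirrors CS_ETM_PACKET_MAX_BUFFER=1024 as quoted in the text. */
#define SKETCH_PACKET_MAX_BUFFER 1024

/* Hypothetical stand-in for the three fields added per packet. */
struct sketch_owner_stamp {
	pid_t pid;
	pid_t tid;
	int el;
};

/* Hypothetical stand-in for the bookkeeping removed per traceid queue. */
struct sketch_removed_fields {
	void *prev_packet_thread;	/* struct thread * in the real code */
	int prev_packet_el;		/* ocsd_ex_level in the real code */
};

/* Bytes added per packet queue, before any extra alignment padding. */
static size_t sketch_added_bytes_per_queue(void)
{
	/* Assumption: 4-byte pid_t and int, as on common Linux ABIs. */
	assert(sizeof(pid_t) == 4 && sizeof(int) == 4);
	return sizeof(struct sketch_owner_stamp) * SKETCH_PACKET_MAX_BUFFER;
}

/* Bytes saved per traceid queue, including the pointer's padding. */
static size_t sketch_saved_bytes_per_tidq(void)
{
	return sizeof(struct sketch_removed_fields);
}
```

Under those assumptions the first helper should return 12 * 1024 bytes and the second 16 bytes, matching the figures quoted above.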
A residual gap: cs_etm__copy_insn() reads sample.insn bytes via cs_etm__mem_access(), which still uses tidq->thread (the current thread), so the inline insn bytes for an outgoing-thread sample may be looked up against the wrong address space. Fixing this requires threading the packet's owner pid through cs_etm__mem_access and is left for a follow-up. sample.ip / sample.pid attribution (what AutoFDO/BOLT consume) is correct.
Hi Amir,
Can you test the patch here to see if it fixes your issue [1]?
We thought it didn't make sense to store the thread on every packet when there is only one active thread for the decoder and one for sample generation. We also fixed the other issue mentioned above about cs_etm__copy_insn() not working.
Thanks,
James
[1]: https://lore.kernel.org/linux-perf-users/20260526-james-cs-context-tracking-...