There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference.
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the same thing with ETF and ETB and cache the PID to which the cs_buffer belongs in alloc_buffer() callback for ETF and ETB as done for ETR. This will also remove the unnecessary function calls (task_pid_nr()) in tmc_enable_etr_sink_perf() and etb_enable_perf().
In addition to this, add a check to validate event->owner before dereferencing it in ETR, ETB and ETF to avoid any possible NULL pointer dereference crashes in their corresponding alloc_buffer callbacks and check for kernel events as well.
Easily reproducible running below:
perf record -e cs_etm/@tmc_etf0/ -N -p <pid>
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 <snip>... Call trace: tmc_enable_etf_sink+0xe4/0x280 coresight_enable_path+0x168/0x1fc etm_event_start+0x8c/0xf8 etm_event_add+0x38/0x54 event_sched_in+0x194/0x2ac group_sched_in+0x54/0x12c flexible_sched_in+0xd8/0x120 visit_groups_merge+0x100/0x16c ctx_flexible_sched_in+0x50/0x74 ctx_sched_in+0xa4/0xa8 perf_event_sched_in+0x60/0x6c perf_event_context_sched_in+0x98/0xe0 __perf_event_task_sched_in+0x5c/0xd8 finish_task_switch+0x184/0x1cc schedule_tail+0x20/0xec ret_from_fork+0x4/0x18
Sai Prakash Ranjan (4): perf/core: Export is_kernel_event() coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf() coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf() coresight: tmc-etr: Fix possible NULL ptr dereference in get_perf_etr_buf_cpu_wide()
drivers/hwtracing/coresight/coresight-etb10.c | 8 +++++++- drivers/hwtracing/coresight/coresight-priv.h | 2 ++ drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 +++++++- drivers/hwtracing/coresight/coresight-tmc-etr.c | 6 +++++- include/linux/perf_event.h | 2 ++ kernel/events/core.c | 3 ++- 6 files changed, 25 insertions(+), 4 deletions(-)
base-commit: f4cb5e9daedf56671badc93ac7f364043aa33886
Export is_kernel_event() to be used by coresight drivers in later changes to check for kernel events and bail out.
Suggested-by: Suzuki K Poulose suzuki.poulose@arm.com Signed-off-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org --- include/linux/perf_event.h | 2 ++ kernel/events/core.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 04a49ccc7beb..230299168f3d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1415,6 +1415,7 @@ extern void perf_event_task_tick(void); extern int perf_event_account_interrupt(struct perf_event *event); extern int perf_event_period(struct perf_event *event, u64 value); extern u64 perf_event_pause(struct perf_event *event, bool reset); +extern bool is_kernel_event(struct perf_event *event); #else /* !CONFIG_PERF_EVENTS: */ static inline void * perf_aux_output_begin(struct perf_output_handle *handle, @@ -1507,6 +1508,7 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset) { return 0; } +static bool is_kernel_event(struct perf_event *event) { return false; } #endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL) diff --git a/kernel/events/core.c b/kernel/events/core.c index 7ed5248f0445..e5db79961a2c 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -173,10 +173,11 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
#define TASK_TOMBSTONE ((void *)-1L)
-static bool is_kernel_event(struct perf_event *event) +bool is_kernel_event(struct perf_event *event) { return READ_ONCE(event->owner) == TASK_TOMBSTONE; } +EXPORT_SYMBOL_GPL(is_kernel_event);
/* * On task ctx scheduling...
There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference.
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the same thing with ETF and cache the PID to which the cs_buffer belongs in tmc_alloc_etf_buffer() as done for ETR. This will also remove the unnecessary function calls(task_pid_nr()) since we are caching the PID. In addition to this, add a check to validate event->owner which will prevent any possible NULL pointer dereferences and check for kernel events.
Easily reproducible running below:
perf record -e cs_etm/@tmc_etf0/ -N -p <pid>
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 <snip>... Call trace: tmc_enable_etf_sink+0xe4/0x280 coresight_enable_path+0x168/0x1fc etm_event_start+0x8c/0xf8 etm_event_add+0x38/0x54 event_sched_in+0x194/0x2ac group_sched_in+0x54/0x12c flexible_sched_in+0xd8/0x120 visit_groups_merge+0x100/0x16c ctx_flexible_sched_in+0x50/0x74 ctx_sched_in+0xa4/0xa8 perf_event_sched_in+0x60/0x6c perf_event_context_sched_in+0x98/0xe0 __perf_event_task_sched_in+0x5c/0xd8 finish_task_switch+0x184/0x1cc schedule_tail+0x20/0xec ret_from_fork+0x4/0x18
Fixes: 880af782c6e8 ("coresight: tmc-etf: Add support for CPU-wide trace scenarios") Suggested-by: Suzuki K Poulose suzuki.poulose@arm.com Signed-off-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org --- drivers/hwtracing/coresight/coresight-priv.h | 2 ++ drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 +++++++- 2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-priv.h b/drivers/hwtracing/coresight/coresight-priv.h index 65a29293b6cb..f5f654ea2994 100644 --- a/drivers/hwtracing/coresight/coresight-priv.h +++ b/drivers/hwtracing/coresight/coresight-priv.h @@ -87,6 +87,7 @@ enum cs_mode { * struct cs_buffer - keep track of a recording session' specifics * @cur: index of the current buffer * @nr_pages: max number of pages granted to us + * @pid: PID this cs_buffer belongs to * @offset: offset within the current buffer * @data_size: how much we collected in this run * @snapshot: is this run in snapshot mode @@ -95,6 +96,7 @@ enum cs_mode { struct cs_buffers { unsigned int cur; unsigned int nr_pages; + pid_t pid; unsigned long offset; local_t data_size; bool snapshot; diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c index 44402d413ebb..86ff0dda0444 100644 --- a/drivers/hwtracing/coresight/coresight-tmc-etf.c +++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c @@ -227,6 +227,7 @@ static int tmc_enable_etf_sink_perf(struct coresight_device *csdev, void *data) unsigned long flags; struct tmc_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent); struct perf_output_handle *handle = data; + struct cs_buffers *buf = etm_perf_sink_config(handle);
spin_lock_irqsave(&drvdata->spinlock, flags); do { @@ -243,7 +244,7 @@ static int tmc_enable_etf_sink_perf(struct coresight_device *csdev, void *data) }
/* Get a handle on the pid of the process to monitor */ - pid = task_pid_nr(handle->event->owner); + pid = buf->pid;
if (drvdata->pid != -1 && drvdata->pid != pid) { ret = -EBUSY; @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf; + struct task_struct *task = READ_ONCE(event->owner); + + if (!task || is_kernel_event(event)) + return NULL;
node = (event->cpu == -1) ? NUMA_NO_NODE : cpu_to_node(event->cpu);
@@ -399,6 +404,7 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, if (!buf) return NULL;
+ buf->pid = task_pid_nr(task); buf->snapshot = overwrite; buf->nr_pages = nr_pages; buf->data_pages = pages;
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
On 2020-10-22 17:02, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
The original issue was the owner being NULL and causing a NULL pointer dereference. I did ask some time back if it is valid for the owner to be NULL [1] and should probably be handled in events core?
[1] https://lore.kernel.org/lkml/c0e1f99a0a2480dfc8d788bb424d3f08@codeaurora.org...
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 <snip>... Call trace: tmc_enable_etf_sink+0xe4/0x280 coresight_enable_path+0x168/0x1fc etm_event_start+0x8c/0xf8 etm_event_add+0x38/0x54 event_sched_in+0x194/0x2ac group_sched_in+0x54/0x12c flexible_sched_in+0xd8/0x120 visit_groups_merge+0x100/0x16c ctx_flexible_sched_in+0x50/0x74 ctx_sched_in+0xa4/0xa8 perf_event_sched_in+0x60/0x6c perf_event_context_sched_in+0x98/0xe0 __perf_event_task_sched_in+0x5c/0xd8 finish_task_switch+0x184/0x1cc schedule_tail+0x20/0xec ret_from_fork+0x4/0x18
Thanks, Sai
On Thu, Oct 22, 2020 at 06:19:37PM +0530, Sai Prakash Ranjan wrote:
On 2020-10-22 17:02, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
The original issue was the owner being NULL and causing a NULL pointer dereference. I did ask some time back if it is valid for the owner to be NULL [1] and should probably be handled in events core?
No, what I asked is why do you care about ->owner to begin with? That seems wrong. A driver should not touch ->owner _at_all_.
On 2020-10-22 19:04, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 06:19:37PM +0530, Sai Prakash Ranjan wrote:
On 2020-10-22 17:02, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
The original issue was the owner being NULL and causing a NULL pointer dereference. I did ask some time back if it is valid for the owner to be NULL [1] and should probably be handled in events core?
No, what I asked is why do you care about ->owner to begin with? That seems wrong. A driver should not touch ->owner _at_all_.
Ah ok, so Suzuki explained that in other reply and if there is some other better way?
Thanks, Sai
On 10/22/20 12:32 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ?
Suzuki
On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
On 10/22/20 12:32 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ?
I thought we added sink identification through perf_event_attr::config2 ?
On 10/22/20 4:06 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
On 10/22/20 12:32 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ?
I thought we added sink identification through perf_event_attr::config2 ?
Correct. attr:config2 identifies the "sink" for the collection. But, that doesn't solve the problem we have here. If two separate perf sessions use the "same sink", we don't want to mix the trace data into the same sink for events from different sessions.
Thus, we need a way to check if a new event starting the tracing on an ETM belongs to the same session as the one already pumping the trace into the sink.
We use event->owner pid for this check and thats where we encountered a NULL event->owner. Looking at the code further, we identified that kernel events could also trigger this issue.
Suzuki
Hi Peter,
On Thu, Oct 22, 2020 at 04:32:36PM +0100, Suzuki Poulose wrote:
On 10/22/20 4:06 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
On 10/22/20 12:32 PM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace.
@@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf;
- struct task_struct *task = READ_ONCE(event->owner);
- if (!task || is_kernel_event(event))
return NULL;
This is *wrong*... why do you care about who owns the events?
This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ?
I thought we added sink identification through perf_event_attr::config2 ?
Correct. attr:config2 identifies the "sink" for the collection. But, that doesn't solve the problem we have here. If two separate perf sessions use the "same sink", we don't want to mix the trace data into the same sink for events from different sessions.
Thus, we need a way to check if a new event starting the tracing on an ETM belongs to the same session as the one already pumping the trace into the sink.
Suzuki's depiction of the usecase is accurate. Using the pid of the process that created the events comes out of a discussion you and I had in the common area by the Intel booth at ELC in Edinburgh in the fall of 2018. At the time I exposed the problem of having multiple events sharing the same HW resources and you advised to proceed this way.
That being said it is plausible that I did not expressed myself clearly enough for you to understand the full extend of the problem. If that is the case we are more than willing to revisit that solution. Do you see a better option than what has currently been implemented?
Many thanks, Mathieu
We use event->owner pid for this check and thats where we encountered a NULL event->owner. Looking at the code further, we identified that kernel events could also trigger this issue.
Suzuki
On Thu, Oct 22, 2020 at 03:20:33PM -0600, Mathieu Poirier wrote:
Suzuki's depiction of the usecase is accurate. Using the pid of the process that created the events comes out of a discussion you and I had in the common area by the Intel booth at ELC in Edinburgh in the fall of 2018. At the time I exposed the problem of having multiple events sharing the same HW resources and you advised to proceed this way.
Bah, I was afraid of that. I desperately tried to find correspondence on it, but alas, verbal crap doesn't end up in the Sent folder :-/
That being said it is plausible that I did not expressed myself clearly enough for you to understand the full extend of the problem. If that is the case we are more than willing to revisit that solution. Do you see a better option than what has currently been implemented?
Moo... that really could've done with a comment I suppose.
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you somehow accessing ->owner later?
As for the kernel events.. why do you care about the actual task_struct * in there? I see you're using it to grab the task-pid, but how is that useful?
Hi Peter
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
On Thu, Oct 22, 2020 at 03:20:33PM -0600, Mathieu Poirier wrote:
Suzuki's depiction of the usecase is accurate. Using the pid of the process that created the events comes out of a discussion you and I had in the common area by the Intel booth at ELC in Edinburgh in the fall of 2018. At the time I exposed the problem of having multiple events sharing the same HW resources and you advised to proceed this way.
Bah, I was afraid of that. I desperately tried to find correspondence on it, but alas, verbal crap doesn't end up in the Sent folder :-/
That being said it is plausible that I did not expressed myself clearly enough for you to understand the full extend of the problem. If that is the case we are more than willing to revisit that solution. Do you see a better option than what has currently been implemented?
Moo... that really could've done with a comment I suppose.
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
One follow up question on the !->owner issue. Given the ->owner is dying, does it prevent events from being scheduled ? Or is there a delay between that and eventually stopping the events. In this case, we hit the issue when :
A A or B ?
event_start() ... event->owner = NULL
READ_ONCE(event->owner);
Is this expected ?
somehow accessing ->owner later?
As for the kernel events.. why do you care about the actual task_struct
- in there? I see you're using it to grab the task-pid, but how is that
useful?
Correct, kernel events are something that the driver didn't account for. May be we could handle this case with a "special pid" and simply disallow sharing (which is fine I believe, given there are not grouping for the kernel created events).
Suzuki
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
One follow up question on the !->owner issue. Given the ->owner is dying, does it prevent events from being scheduled ? Or is there a delay between that and eventually stopping the events. In this case, we hit the issue when :
A A or B ?
event_start() ... event->owner = NULL
READ_ONCE(event->owner);
Is this expected ?
Yeah, teardown is a bit of an effort. Also, you can pass an fd over a unix socket to another process, so this isn't something you can rely on in any case.
The perf tool doesn't do it, but the kernel infra should be able to deal with someone doing a perf-deamon of sorts, where you can request a perf event and recieve a fd from it.
Imagine the fun ;-)
As for the kernel events.. why do you care about the actual task_struct
- in there? I see you're using it to grab the task-pid, but how is that
useful?
Correct, kernel events are something that the driver didn't account for. May be we could handle this case with a "special pid" and simply disallow sharing (which is fine I believe, given there are not grouping for the kernel created events).
Why do you need a pid in the first place? Can't you use the "task_struct *" as a value?
On 10/23/20 10:23 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
One follow up question on the !->owner issue. Given the ->owner is dying, does it prevent events from being scheduled ? Or is there a delay between that and eventually stopping the events. In this case, we hit the issue when :
A A or B ?
event_start() ... event->owner = NULL
READ_ONCE(event->owner);
Is this expected ?
Yeah, teardown is a bit of an effort. Also, you can pass an fd over a unix socket to another process, so this isn't something you can rely on in any case.
The perf tool doesn't do it, but the kernel infra should be able to deal with someone doing a perf-deamon of sorts, where you can request a perf event and recieve a fd from it.
Imagine the fun ;-)
As for the kernel events.. why do you care about the actual task_struct
- in there? I see you're using it to grab the task-pid, but how is that
useful?
Correct, kernel events are something that the driver didn't account for. May be we could handle this case with a "special pid" and simply disallow sharing (which is fine I believe, given there are not grouping for the kernel created events).
Why do you need a pid in the first place? Can't you use the "task_struct *" as a value?
We could. But, without a refcount on the task pointer, that could be tricky, even though we don't dereference it. In the same situation, if the tsk owner dies and is freed and is reallocated to a new perf session task but with different PID, we could be mixing things up again ?
Special pid here could be -1.
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup.
On 10/23/20 10:41 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup.
Sorry for the lack of clarity here, and you are not alone unless you have drowned in the CoreSight topologies ;-)
Typically current generation of systems have the following topology :
CPU0 etm0 \ \ ________ / \ CPU1 / \ etm1 \ \ /------- sink0 CPU2 / etm2 \ / \ ________ / / CPU3 / etm3
i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have used one sink. Even though there could be potential sinks (of different types), none of them are private to the ETMs. So, in a nutshell, a sink can be reached by multiple ETMs. ]
Now, for a session :
perf record -e cs_etm/sinkid=sink0/u workload
We create an event per CPU (say eventN, which are scheduled based on the threads that could execute on the CPU. At this point we have finalized the sink0, and have allocated necessary buffer for the sink0.
Now, when the threads are scheduled on the CPUs, we start the appropriate events for the CPUs.
e.g, CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all the components upto sink0, starting the trace collection in the buffer.
Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread, it will eventually try to turn ON the sink0.Since sink0 is already active tracing event0, we could allow this to go through and collect the trace in the *same hardware buffer* (which can be demuxed from the single AUX record using the TraceID in the packets). Please note that we do double buffering and hardware buffer is copied only when the sink0 is stopped (see below).
But, if the event scheduled on CPU1 doesn't belong to the above session, but belongs to different perf session (say, perf record -e cs_etm/sinkid=sink0/u benchmark),
we can't allow this to succeed and mix the trace data in to that of workload and thus fail the operation.
In a nutshell, since the sinks are shared, we start the sink on the first event and keeps sharing the sink buffer with any event that belongs to the same session (using refcounts). The sink is only released for other sessions, when there are no more events in the session tracing on any of the ETMs.
I know this is fundamentally a topology issue, but that is not something we can fix. But the situation is changing and we are starting to see systems with per-CPU sinks.
Hope this helps.
Suzuki
On Fri, Oct 23, 2020 at 11:34:32AM +0100, Suzuki Poulose wrote:
On 10/23/20 10:41 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup.
Sorry for the lack of clarity here, and you are not alone unless you have drowned in the CoreSight topologies ;-)
Typically current generation of systems have the following topology :
CPU0 etm0 \ \ ________ / \ CPU1 / \ etm1 \ \ /------- sink0 CPU2 / etm2 \ / \ ________ / / CPU3 / etm3
i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have used one sink. Even though there could be potential sinks (of different types), none of them are private to the ETMs. So, in a nutshell, a sink can be reached by multiple ETMs. ]
Now, for a session :
perf record -e cs_etm/sinkid=sink0/u workload
We create an event per CPU (say eventN, which are scheduled based on the threads that could execute on the CPU. At this point we have finalized the sink0, and have allocated necessary buffer for the sink0.
Now, when the threads are scheduled on the CPUs, we start the appropriate events for the CPUs.
e.g, CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all the components upto sink0, starting the trace collection in the buffer.
Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread, it will eventually try to turn ON the sink0.Since sink0 is already active tracing event0, we could allow this to go through and collect the trace in the *same hardware buffer* (which can be demuxed from the single AUX record using the TraceID in the packets). Please note that we do double buffering and hardware buffer is copied only when the sink0 is stopped (see below).
But, if the event scheduled on CPU1 doesn't belong to the above session, but belongs to different perf session (say, perf record -e cs_etm/sinkid=sink0/u benchmark),
we can't allow this to succeed and mix the trace data in to that of workload and thus fail the operation.
In a nutshell, since the sinks are shared, we start the sink on the first event and keeps sharing the sink buffer with any event that belongs to the same session (using refcounts). The sink is only released for other sessions, when there are no more events in the session tracing on any of the ETMs.
I know this is fundamentally a topology issue, but that is not something we can fix. But the situation is changing and we are starting to see systems with per-CPU sinks.
Hope this helps.
I think I'm more confused now :-/
Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails.
If it fails, event creation fails, the end.
On success, we have the sink pointer in our event and we never ever need to look at ->owner ever again.
I'm also not seeing why exactly we need ->owner in the first place.
Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail.
Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to it.
Or are you doing the event->sink mapping every time you do: pmu::add()? That sounds insane.
On 10/23/20 11:54 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 11:34:32AM +0100, Suzuki Poulose wrote:
On 10/23/20 10:41 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
On 10/23/20 8:39 AM, Peter Zijlstra wrote:
So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you
Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing.
I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup.
Sorry for the lack of clarity here, and you are not alone unless you have drowned in the CoreSight topologies ;-)
Typically current generation of systems have the following topology :
CPU0 etm0 \ \ ________ / \ CPU1 / \ etm1 \ \ /------- sink0 CPU2 / etm2 \ / \ ________ / / CPU3 / etm3
i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have used one sink. Even though there could be potential sinks (of different types), none of them are private to the ETMs. So, in a nutshell, a sink can be reached by multiple ETMs. ]
Now, for a session :
perf record -e cs_etm/sinkid=sink0/u workload
We create an event per CPU (say eventN, which are scheduled based on the threads that could execute on the CPU. At this point we have finalized the sink0, and have allocated necessary buffer for the sink0.
Now, when the threads are scheduled on the CPUs, we start the appropriate events for the CPUs.
e.g, CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all the components upto sink0, starting the trace collection in the buffer.
Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread, it will eventually try to turn ON the sink0.Since sink0 is already active tracing event0, we could allow this to go through and collect the trace in the *same hardware buffer* (which can be demuxed from the single AUX record using the TraceID in the packets). Please note that we do double buffering and hardware buffer is copied only when the sink0 is stopped (see below).
But, if the event scheduled on CPU1 doesn't belong to the above session, but belongs to different perf session (say, perf record -e cs_etm/sinkid=sink0/u benchmark),
we can't allow this to succeed and mix the trace data in to that of workload and thus fail the operation.
In a nutshell, since the sinks are shared, we start the sink on the first event and keeps sharing the sink buffer with any event that belongs to the same session (using refcounts). The sink is only released for other sessions, when there are no more events in the session tracing on any of the ETMs.
I know this is fundamentally a topology issue, but that is not something we can fix. But the situation is changing and we are starting to see systems with per-CPU sinks.
Hope this helps.
I think I'm more confused now :-/
Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails.
The event->sink link is established at creation. At event::add(), we check the sink is free (i.e, it is inactive) or is used by an event of the same session (this is where the owner field *was* required. But this is not needed anymore, as we cache the "owner" read pid in the handle->rb->aux_priv for each event and this is compared against the pid from the handle currently driving the hardware)
If it fails, event creation fails, the end.
On success, we have the sink pointer in our event and we never ever need to look at ->owner ever again.
Correct, we don't need it after the event creation "one part of the patch" as explained above.
I'm also not seeing why exactly we need ->owner in the first place.
Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail.
Thats where we are diverging. The sink device doesn't have any fops. It is all managed by the coresight driver transparent to the perf tool. All the perf tool does is, specifying which sink to use (btw, we now have automatic sink selection support which gets rid of this, and uses the best possible sink e.g, in case of per-CPU sinks).
Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to it.
Or are you doing the event->sink mapping every time you do: pmu::add()? That sounds insane.
Sink is already mapped at event create. But yes, the refcount on the sink is managed at start/stop. Thats when we need to make sure that the event being scheduled belongs to the same owner as the one already driving the sink.
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Suzuki
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
On 10/23/20 11:54 AM, Peter Zijlstra wrote:
I think I'm more confused now :-/
Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails.
The event->sink link is established at creation. At event::add(), we check the sink is free (i.e, it is inactive) or is used by an event of the same session (this is where the owner field *was* required. But this is not needed anymore, as we cache the "owner" read pid in the handle->rb->aux_priv for each event and this is compared against the pid from the handle currently driving the hardware)
*groan*.. that's going to be a mess with sinks that are shared between CPUs :/
I'm also not seeing why exactly we need ->owner in the first place.
Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail.
Thats where we are diverging. The sink device doesn't have any fops. It is all managed by the coresight driver transparent to the perf tool. All the perf tool does is, specifying which sink to use (btw, we now have automatic sink selection support which gets rid of this, and uses the best possible sink e.g, in case of per-CPU sinks).
per-CPU sinks sounds a lot better.
I'm really not convinced it makes sense to do what you do with shared sinks though. You'll loose random parts of the execution trace because of what the other CPUs do.
Full exclusive sink access is far more deterministic.
Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to it.
Or are you doing the event->sink mapping every time you do: pmu::add()? That sounds insane.
Sink is already mapped at event create. But yes, the refcount on the sink is managed at start/stop. Thats when we need to make sure that the event being scheduled belongs to the same owner as the one already driving the sink.
pmu::add() I might hope, because pmu::start() is not allowed to fail.
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
And once you have per-CPU sinks like mentioned above, the whole problem goes away.
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
On 10/23/20 11:54 AM, Peter Zijlstra wrote:
I think I'm more confused now :-/
Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails.
The event->sink link is established at creation. At event::add(), we check the sink is free (i.e, it is inactive) or is used by an event of the same session (this is where the owner field *was* required. But this is not needed anymore, as we cache the "owner" read pid in the handle->rb->aux_priv for each event and this is compared against the pid from the handle currently driving the hardware)
*groan*.. that's going to be a mess with sinks that are shared between CPUs :/
I'm also not seeing why exactly we need ->owner in the first place.
Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail.
Thats where we are diverging. The sink device doesn't have any fops. It is all managed by the coresight driver transparent to the perf tool. All the perf tool does is, specifying which sink to use (btw, we now have automatic sink selection support which gets rid of this, and uses the best possible sink e.g, in case of per-CPU sinks).
per-CPU sinks sounds a lot better.
I'm really not convinced it makes sense to do what you do with shared sinks though. You'll loose random parts of the execution trace because of what the other CPUs do.
The ETM trace protocol has in built TraceID to distinguish the packets and thus we could decode the trace streams from the shared buffer. [ But, we don't have buffer overflow interrupts (I am keeping the lid closed on that can, for the sake of keeping sanity ;-) ), and thus any shared session could easily loose data unless we tune the AUX buffer size to a really large buffer ].
Full exclusive sink access is far more deterministic.
Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to it.
Or are you doing the event->sink mapping every time you do: pmu::add()? That sounds insane.
Sink is already mapped at event create. But yes, the refcount on the sink is managed at start/stop. Thats when we need to make sure that the event being scheduled belongs to the same owner as the one already driving the sink.
pmu::add() I might hope, because pmu::start() is not allowed to fail.
Right. If we can't get the sink, we simply truncate the buffer.
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
And once you have per-CPU sinks like mentioned above, the whole problem goes away.
True, until then, this is the best we could do.
Suzuki
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
Only if the hardware has per-CPU sinks can you allow this.
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
I agree with you that --per-thread scenarios are easy to deal with, but to support cpu-wide scenarios events must share a sink (because there is one event per CPU). CPU-wide support can't be removed because it has been around for close to a couple of years and heavily used. I also think using the pid of the process that created the events, i.e perf, is a good idea. We just need to agree on how to gain access to it.
In Sai's patch you objected to the following:
struct task_struct *task = READ_ONCE(event->owner);
if (!task || is_kernel_event(event))
Would it be better to use task_nr_pid(current) instead of event->owner? The end result will be exactly the same. There is also no need to check the validity of @current since it is a user process.
Thanks, Mathieu
[1]. https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L6170
Only if the hardware has per-CPU sinks can you allow this.
Hello guys,
On 2020-10-24 02:07, Mathieu Poirier wrote:
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
I agree with you that --per-thread scenarios are easy to deal with, but to support cpu-wide scenarios events must share a sink (because there is one event per CPU). CPU-wide support can't be removed because it has been around for close to a couple of years and heavily used. I also think using the pid of the process that created the events, i.e perf, is a good idea. We just need to agree on how to gain access to it.
In Sai's patch you objected to the following:
struct task_struct *task = READ_ONCE(event->owner);
if (!task || is_kernel_event(event))
Would it be better to use task_nr_pid(current) instead of event->owner? The end result will be exactly the same. There is also no need to check the validity of @current since it is a user process.
We have devices deployed where these crashes are seen consistently, so for some immediate relief, could we atleast get some fix in this cycle without major design overhaul which would likely take more time. Perhaps my first patch [1] without any check for owner or I can post a new version as Suzuki suggested [2] dropping the export of is_kernel_event(). Then we can always work on top of it based on the conclusion of this discussion, we will atleast not have the systems crash in the meantime, thoughts?
[1] https://lore.kernel.org/patchwork/patch/1318098/ [2] https://lore.kernel.org/lkml/fa6cdf34-88a0-1050-b9ea-556d0a9438cb@arm.com/
Thanks, Sai
On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
Hello guys,
On 2020-10-24 02:07, Mathieu Poirier wrote:
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
That way another session could use the same sink if it is free. i.e
perf record -e cs_etm/@sink0/u --per-thread app1
and
perf record -e cs_etm/@sink0/u --per-thread app2
both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
I agree with you that --per-thread scenarios are easy to deal with, but to support cpu-wide scenarios events must share a sink (because there is one event per CPU). CPU-wide support can't be removed because it has been around for close to a couple of years and heavily used. I also think using the pid of the process that created the events, i.e perf, is a good idea. We just need to agree on how to gain access to it.
In Sai's patch you objected to the following:
struct task_struct *task = READ_ONCE(event->owner);
if (!task || is_kernel_event(event))
Would it be better to use task_nr_pid(current) instead of event->owner? The end result will be exactly the same. There is also no need to check the validity of @current since it is a user process.
We have devices deployed where these crashes are seen consistently, so for some immediate relief, could we atleast get some fix in this cycle without major design overhaul which would likely take more time. Perhaps my first patch [1] without any check for owner or I can post a new version as Suzuki suggested [2] dropping the export of is_kernel_event(). Then we can always work on top of it based on the conclusion of this discussion, we will atleast not have the systems crash in the meantime, thoughts?
For the time being I think [1], exactly the way it is, is a reasonable way forward.
Regards, Mathieu
[1] https://lore.kernel.org/patchwork/patch/1318098/ [2] https://lore.kernel.org/lkml/fa6cdf34-88a0-1050-b9ea-556d0a9438cb@arm.com/
Thanks, Sai
-- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Hi Mathieu,
On 2020-10-30 22:18, Mathieu Poirier wrote:
On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
Hello guys,
On 2020-10-24 02:07, Mathieu Poirier wrote:
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> That way another session could use the same sink if it is free. i.e > > perf record -e cs_etm/@sink0/u --per-thread app1 > > and > > perf record -e cs_etm/@sink0/u --per-thread app2 > > both can work as long as the sink is not used by the other session.
Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
I agree with you that --per-thread scenarios are easy to deal with, but to support cpu-wide scenarios events must share a sink (because there is one event per CPU). CPU-wide support can't be removed because it has been around for close to a couple of years and heavily used. I also think using the pid of the process that created the events, i.e perf, is a good idea. We just need to agree on how to gain access to it.
In Sai's patch you objected to the following:
struct task_struct *task = READ_ONCE(event->owner);
if (!task || is_kernel_event(event))
Would it be better to use task_nr_pid(current) instead of event->owner? The end result will be exactly the same. There is also no need to check the validity of @current since it is a user process.
We have devices deployed where these crashes are seen consistently, so for some immediate relief, could we atleast get some fix in this cycle without major design overhaul which would likely take more time. Perhaps my first patch [1] without any check for owner or I can post a new version as Suzuki suggested [2] dropping the export of is_kernel_event(). Then we can always work on top of it based on the conclusion of this discussion, we will atleast not have the systems crash in the meantime, thoughts?
For the time being I think [1], exactly the way it is, is a reasonable way forward.
Sure, I just checked now and [1] still applies neatly on top of coresight next branch.
[1] https://lore.kernel.org/patchwork/patch/1318098/
Thanks, Sai
On Fri, Oct 30, 2020 at 10:56:09PM +0530, Sai Prakash Ranjan wrote:
Hi Mathieu,
On 2020-10-30 22:18, Mathieu Poirier wrote:
On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
Hello guys,
On 2020-10-24 02:07, Mathieu Poirier wrote:
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
On 10/23/20 2:16 PM, Peter Zijlstra wrote: > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> > That way another session could use the same sink if it is free. i.e > > > > perf record -e cs_etm/@sink0/u --per-thread app1 > > > > and > > > > perf record -e cs_etm/@sink0/u --per-thread app2 > > > > both can work as long as the sink is not used by the other session. > > Like said above, if sink is shared between CPUs, that's going to be a > trainwreck :/ Why do you want that?
That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone.
I understand that the hardware is like that, but why do you want to support this insanity in software?
If you only allow a single sink user (group) at the same time, your problem goes away. Simply disallow the above scenario, do not allow concurrent sink users if sinks are shared like this.
Have the perf-record of app2 above fail because the sink is in-user already.
I agree with you that --per-thread scenarios are easy to deal with, but to support cpu-wide scenarios events must share a sink (because there is one event per CPU). CPU-wide support can't be removed because it has been around for close to a couple of years and heavily used. I also think using the pid of the process that created the events, i.e perf, is a good idea. We just need to agree on how to gain access to it.
In Sai's patch you objected to the following:
struct task_struct *task = READ_ONCE(event->owner);
if (!task || is_kernel_event(event))
Would it be better to use task_nr_pid(current) instead of event->owner? The end result will be exactly the same. There is also no need to check the validity of @current since it is a user process.
We have devices deployed where these crashes are seen consistently, so for some immediate relief, could we atleast get some fix in this cycle without major design overhaul which would likely take more time. Perhaps my first patch [1] without any check for owner or I can post a new version as Suzuki suggested [2] dropping the export of is_kernel_event(). Then we can always work on top of it based on the conclusion of this discussion, we will atleast not have the systems crash in the meantime, thoughts?
For the time being I think [1], exactly the way it is, is a reasonable way forward.
Sure, I just checked now and [1] still applies neatly on top of coresight next branch.
I have applied both patches that were part of the set.
Thanks, Sai
-- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink.
But code path shows that ETB has a similar path as ETF, so there could be possible NULL pointer dereference crash in ETB as well. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in etb_enable_perf() and with owner being NULL, we can get a NULL pointer dereference, so have a similar change as ETF where we cache PID in alloc_buffer() callback which is called as the part of etm_setup_aux(). This will reduce the task_pid_nr() function call overheads as well. In addition to this, add a check to validate event->owner before dereferencing it to fix any possible NULL pointer dereference crashes and check for kernel events.
Fixes: 75d7dbd38824 ("coresight: etb10: Add support for CPU-wide trace scenarios") Suggested-by: Suzuki K Poulose suzuki.poulose@arm.com Signed-off-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org --- drivers/hwtracing/coresight/coresight-etb10.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-etb10.c b/drivers/hwtracing/coresight/coresight-etb10.c index 248cc82c838e..9d2f1ab0e29e 100644 --- a/drivers/hwtracing/coresight/coresight-etb10.c +++ b/drivers/hwtracing/coresight/coresight-etb10.c @@ -176,6 +176,7 @@ static int etb_enable_perf(struct coresight_device *csdev, void *data) unsigned long flags; struct etb_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent); struct perf_output_handle *handle = data; + struct cs_buffers *buf = etm_perf_sink_config(handle);
spin_lock_irqsave(&drvdata->spinlock, flags);
@@ -186,7 +187,7 @@ static int etb_enable_perf(struct coresight_device *csdev, void *data) }
/* Get a handle on the pid of the process to monitor */ - pid = task_pid_nr(handle->event->owner); + pid = buf->pid;
if (drvdata->pid != -1 && drvdata->pid != pid) { ret = -EBUSY; @@ -376,6 +377,10 @@ static void *etb_alloc_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf; + struct task_struct *task = READ_ONCE(event->owner); + + if (!task || is_kernel_event(event)) + return NULL;
node = (event->cpu == -1) ? NUMA_NO_NODE : cpu_to_node(event->cpu);
@@ -383,6 +388,7 @@ static void *etb_alloc_buffer(struct coresight_device *csdev, if (!buf) return NULL;
+ buf->pid = task_pid_nr(task); buf->snapshot = overwrite; buf->nr_pages = nr_pages; buf->data_pages = pages;
There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink but the ETR may also be susceptible to this crash.
Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in get_perf_etr_buf_cpu_wide() and with owner being NULL and no proper validation of event->owner, we can have a possible NULL pointer dereference, so add a check to validate event->owner before dereferencing it to fix any possible NULL pointer dereference crashes and also check for kernel events.
Fixes: 3147da92a8a8 ("coresight: tmc-etr: Allocate and free ETR memory buffers for CPU-wide scenarios") Suggested-by: Suzuki K Poulose suzuki.poulose@arm.com Signed-off-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org --- drivers/hwtracing/coresight/coresight-tmc-etr.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c index 714f9e867e5f..305dfdd5345a 100644 --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c @@ -1381,6 +1381,10 @@ static void *tmc_alloc_etr_buffer(struct coresight_device *csdev, { struct etr_perf_buffer *etr_perf; struct tmc_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent); + struct task_struct *task = READ_ONCE(event->owner); + + if (!task || is_kernel_event(event)) + return NULL;
etr_perf = tmc_etr_setup_perf_buf(drvdata, event, nr_pages, pages, snapshot); @@ -1389,7 +1393,7 @@ static void *tmc_alloc_etr_buffer(struct coresight_device *csdev, return NULL; }
- etr_perf->pid = task_pid_nr(event->owner); + etr_perf->pid = task_pid_nr(task); etr_perf->snapshot = snapshot; etr_perf->nr_pages = nr_pages; etr_perf->pages = pages;
On 2020-10-22 16:27, Sai Prakash Ranjan wrote:
There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference.
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the same thing with ETF and ETB and cache the PID to which the cs_buffer belongs in alloc_buffer() callback for ETF and ETB as done for ETR. This will also remove the unnecessary function calls (task_pid_nr()) in tmc_enable_etr_sink_perf() and etb_enable_perf().
In addition to this, add a check to validate event->owner before dereferencing it in ETR, ETB and ETF to avoid any possible NULL pointer dereference crashes in their corresponding alloc_buffer callbacks and check for kernel events as well.
Easily reproducible running below:
perf record -e cs_etm/@tmc_etf0/ -N -p <pid>
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 <snip>... Call trace: tmc_enable_etf_sink+0xe4/0x280 coresight_enable_path+0x168/0x1fc etm_event_start+0x8c/0xf8 etm_event_add+0x38/0x54 event_sched_in+0x194/0x2ac group_sched_in+0x54/0x12c flexible_sched_in+0xd8/0x120 visit_groups_merge+0x100/0x16c ctx_flexible_sched_in+0x50/0x74 ctx_sched_in+0xa4/0xa8 perf_event_sched_in+0x60/0x6c perf_event_context_sched_in+0x98/0xe0 __perf_event_task_sched_in+0x5c/0xd8 finish_task_switch+0x184/0x1cc schedule_tail+0x20/0xec ret_from_fork+0x4/0x18
Sai Prakash Ranjan (4): perf/core: Export is_kernel_event() coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf() coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf() coresight: tmc-etr: Fix possible NULL ptr dereference in get_perf_etr_buf_cpu_wide()
drivers/hwtracing/coresight/coresight-etb10.c | 8 +++++++- drivers/hwtracing/coresight/coresight-priv.h | 2 ++ drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 +++++++- drivers/hwtracing/coresight/coresight-tmc-etr.c | 6 +++++- include/linux/perf_event.h | 2 ++ kernel/events/core.c | 3 ++- 6 files changed, 25 insertions(+), 4 deletions(-)
base-commit: f4cb5e9daedf56671badc93ac7f364043aa33886
Please ignore this series, I will need to resend.
Thanks, Sai
On 2020-10-22 16:40, Sai Prakash Ranjan wrote:
On 2020-10-22 16:27, Sai Prakash Ranjan wrote:
There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference.
Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the same thing with ETF and ETB and cache the PID to which the cs_buffer belongs in alloc_buffer() callback for ETF and ETB as done for ETR. This will also remove the unnecessary function calls (task_pid_nr()) in tmc_enable_etr_sink_perf() and etb_enable_perf().
In addition to this, add a check to validate event->owner before dereferencing it in ETR, ETB and ETF to avoid any possible NULL pointer dereference crashes in their corresponding alloc_buffer callbacks and check for kernel events as well.
Easily reproducible running below:
perf record -e cs_etm/@tmc_etf0/ -N -p <pid>
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 <snip>... Call trace: tmc_enable_etf_sink+0xe4/0x280 coresight_enable_path+0x168/0x1fc etm_event_start+0x8c/0xf8 etm_event_add+0x38/0x54 event_sched_in+0x194/0x2ac group_sched_in+0x54/0x12c flexible_sched_in+0xd8/0x120 visit_groups_merge+0x100/0x16c ctx_flexible_sched_in+0x50/0x74 ctx_sched_in+0xa4/0xa8 perf_event_sched_in+0x60/0x6c perf_event_context_sched_in+0x98/0xe0 __perf_event_task_sched_in+0x5c/0xd8 finish_task_switch+0x184/0x1cc schedule_tail+0x20/0xec ret_from_fork+0x4/0x18
Sai Prakash Ranjan (4): perf/core: Export is_kernel_event() coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf() coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf() coresight: tmc-etr: Fix possible NULL ptr dereference in get_perf_etr_buf_cpu_wide()
drivers/hwtracing/coresight/coresight-etb10.c | 8 +++++++- drivers/hwtracing/coresight/coresight-priv.h | 2 ++ drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 +++++++- drivers/hwtracing/coresight/coresight-tmc-etr.c | 6 +++++- include/linux/perf_event.h | 2 ++ kernel/events/core.c | 3 ++- 6 files changed, 25 insertions(+), 4 deletions(-)
base-commit: f4cb5e9daedf56671badc93ac7f364043aa33886
Please ignore this series, I will need to resend.
Please ignore my previous ignore request and let me know if exporting is_kernel_event() is ok.
Thanks, Sai