Greetings,
I recorded the program "ls" (statically linked, so a single executable can serve as the memory access image for decode).
I recorded the program using perf, and then extracted the actual raw trace data from the perf.data file using a little tool I wrote. I can use OpenCSD to fully decode the trace produced by perf.
I also recorded the "ls" util using an API I wrote from kernel mode. I published the API here as an [RFC]. Basically, I start and stop recording whenever the __process__ of interest is scheduled in or out. This post is not so much a request for review of my API, but I do have some issues with the trace that is produced by this API, and I'm not quite sure why.
I use OpenCSD directly in my code, and register a decoder callback for every generic trace element. When my callback is called, I simply print the element's string representation (e.g. OCSD_GEN_TRC_ELEM_INSTR_RANGE).
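To make the setup concrete, here is a toy model of that callback flow. This is deliberately not the real OpenCSD C API (whose exact signatures I won't reproduce here); it only shows the shape of the interaction: the decode tree pushes one generic trace element at a time into a user-registered handler, which prints its string form.

```python
# Toy model of the generic-element callback flow (NOT the real OpenCSD API).
seen = []

class ToyDecodeTree:
    def __init__(self):
        self._gen_elem_cb = None

    def set_gen_elem_callback(self, cb):
        # Register a callback invoked once per decoded generic element.
        self._gen_elem_cb = cb

    def process_data(self, elements):
        # Stand-in for feeding raw trace bytes through the decoder.
        for index, elem in enumerate(elements):
            self._gen_elem_cb(index, elem)

def on_gen_elem(index, elem_str):
    # Mirrors the "just print the element string" handler described above.
    seen.append((index, elem_str))
    print(f"Idx:{index}; {elem_str}")

tree = ToyDecodeTree()
tree.set_gen_elem_callback(on_gen_elem)
tree.process_data([
    "OCSD_GEN_TRC_ELEM_TRACE_ON()",
    "OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4148f4:[0x414910] (ISA=A64) E iBR A64:ret )",
])
```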
Now, the weird thing is that perf and the API produce the same generic elements up to a certain point:
OCSD_GEN_TRC_ELEM_TRACE_ON() ... (same elements in both traces) ...
Eventually they diverge from each other. I assume the perf trace is going in the right direction, but my trace simply starts going nuts. The last __common__ generic element is the following:
OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4148f4:[0x414910] (ISA=A64) E iBR A64:ret )
After this element the perf trace goes down a different route, and the API trace immediately produces a very weird instruction range element:
OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x414910:[0x498a20] (ISA=A64) E --- )
There is no way this 0x498a20 address was reached, and I cannot see any evidence for it in the trace itself (using ptm2human). It seems that the decoder keeps decoding and disassembling opcodes until it reaches 0x498a20... my memory callback (the callback that is called when the decoder needs memory that isn't present) is called for the address 0x498a20. From then on, the trace just goes down a very weird path. I can't explain the branches that are taken from here on.
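The symptom is consistent with the decoder disassembling forward in search of the next waypoint (branch) instruction and only stopping when memory runs out. Here is a minimal toy model of that walk - my own sketch, not OpenCSD code; the addresses are the ones from the elements above, and the fake image is all NOPs (i.e. contains no branch at all):

```python
# Toy model of the "runaway" behaviour: an ETM atom only says "branch
# taken / not taken", so the decoder must disassemble opcodes from the
# last known address until it finds the next waypoint (branch). If the
# bytes contain no branch, it keeps walking until its memory-access
# callback reports that memory is unavailable.
A64_RET = 0xD65F03C0
A64_NOP = 0xD503201F

IMG_BASE = 0x400000
IMG_END  = 0x498A20          # first address with no memory behind it
# Fake image: nothing but NOPs, i.e. no waypoint instruction anywhere.
image = {a: A64_NOP for a in range(IMG_BASE, IMG_END, 4)}

callback_addrs = []

def mem_access(addr):
    # Model of the memory-access callback: None => memory not available.
    if addr not in image:
        callback_addrs.append(addr)
        return None
    return image[addr]

def is_waypoint(opcode):
    # Crude A64 waypoint test: RET, or a direct B/BL (top 6 bits x00101).
    return opcode == A64_RET or (opcode >> 26) in (0b000101, 0b100101)

def walk(start):
    addr = start
    while True:
        opcode = mem_access(addr)
        if opcode is None or is_waypoint(opcode):
            return addr      # end of the decoded instruction range
        addr += 4

end = walk(0x414910)         # start of the weird range element above
print(hex(end))              # the improbable range end, 0x498a20
```

The walk stops exactly where the image stops, which matches the memory callback being hit for 0x498a20.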
Any ideas on how to approach this? Input from OpenCSD experts would be appreciated. I have attached the perf and API traces, and the "ls" executable, which is loaded at address 0x400000. I also attached the ETMv4 config for each trace (trace ID, etc.). There is no need to create multiple decoders for different trace IDs; there's only a single ID, so a single decoder is enough.
Thanks, Mike.
Hi Mike,
I've looked at the data you supplied.
I created test snapshot directories so that I could run each of the trace data files through the trc_pkt_lister test program (the attached .tgz file contains these, plus the results).
The two trace files are different sizes. This is explained by the fact that the API trace run had cycle counts switched on, whereas the perf run did not. In addition, the perf run turned off the trace while in kernel calls, while the API left the trace on (though filtering out the kernel), so a certain number of sync packets have come through, adding to the size.
Looking at the results, I cannot see the 0x4148f4 location in either trace dump (perf_ls2.ppl and api_ls2.ppl in the .tgz).
There are no obvious differences I could detect in the results, though they are difficult to compare given the difference in output.
The effect you are seeing does look like some sort of runaway, with the decoder searching for opcodes - possibly in a section of the ls binary file that does not contain executable code - until it happens upon something that looks like an opcode.
At this point I cannot explain the difference between what you and I are seeing, given the data provided. Can you look at the snapshot results and see if there is anything there? You can re-run the tests I ran if you rename ls to ls.bin and put it one level up from the ss-perf or ss-api snapshot directories, which is where the file is referenced from.
Regards
Mike
On 17 September 2018 at 13:44, Mike Bazov mike@perception-point.io wrote:
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
Hi Mike,
I've looked further at this today, and can see a location where a large block appears in both the API and perf trace data on decode using the library test program.
There does appear to be an issue when the decoder is in a "waiting for address" state - i.e. it has lost track, usually because an area of memory is unavailable - and an exception packet is seen. The exception address appears to be used twice: both to complete an address range and as an exception return - hence, in this case, the improbably large block. I need to look into this in more detail and fix it up.
However, I am seeing that the API and perf decodes have already diverged before this point, which suggests an issue elsewhere too, perhaps. I do need to look deeper into this as well. I am not 100% certain that using ls.bin as a full memory image at 0x400000 is working correctly in the snapshot tests - there might be another offset needed to access the correct opcodes for the trace.
I'll let you know if I make further progress.
On 17 September 2018 at 16:53, Mike Leach mike.leach@linaro.org wrote:
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
Hi Mike,
I have looked into this issue further, found my previous assumption to be wrong, and have unfortunately come to the conclusion that either the generated trace is somehow wrong/corrupt, or the supplied image is not what was run when the trace was generated.
If you look at the attached analysis of the trace generated from the ls_api.cs data [analysis001.txt], this is at the very start of the traced image.
The first few packets [raw packets (0)] show the sync and a start at 00000000004003f0 <_start>:, followed by the first 'E' atom that marks the branch to 0x41a158. The next two 'E' atoms get us to 0x41a028.
At this point we get an exception packet, followed by a preferred return address packet [raw packets (2)]. This return address is 0x400630.
The rules from the ETM architecture specification (ETMv4.0-4.4, p6-242) state:
"The Exception packet contains an address. This means that execution has continued from the target of the most recent P0 element, up to, but not including, that address, and a trace analyzer must analyze each instruction in this range."
Thus the decoder is required to analyze from the previous P0 element - the 'E' atom that marked the branch to 0x41a028 - up to the preferred return address. This address is actually lower than the start address, which results in the huge range seen here, and also seen by you in the example you described. The decoder effectively runs off the end of the memory image before it stops.
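The arithmetic can be sketched from the addresses above. This is a toy calculation, on the assumption (my paraphrase of the behaviour described here) that the decoder walks forward from the last P0 target and therefore never meets a range end that lies behind it:

```python
# Figures taken from the analysis above.
last_p0_target = 0x41A028    # target of the last 'E' atom
pref_ret_addr  = 0x400630    # address carried by the exception packet
img_base, img_end = 0x400000, 0x498A20   # load address / first unmapped byte

# Per the quoted ETMv4 rule, the range to analyze is
# [last_p0_target, pref_ret_addr) - but here the end is *below* the start:
behind_by = last_p0_target - pref_ret_addr
assert behind_by > 0

# A decoder walking forward from last_p0_target never reaches that end
# address, and instead runs off the image:
runaway_bytes = img_end - last_p0_target
print(f"range end is {behind_by:#x} bytes behind the start")
print(f"a forward walk covers {runaway_bytes:#x} bytes before memory runs out")
```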
The trace should be indicating an address after, but relatively close to, 0x41a028 - as otherwise an atom would have been emitted by the cbnz at 41a054.
If I examine the start of the perf_ls.cs decode, I see the same 3 'E' atoms followed by the same odd data fault exception.
So for the first few branches at least, the perf and api captures go in the same direction.
Given that it is unlikely that the generated trace packets themselves are incorrect, it seems more likely that the 'ls' image being used for decode is not what generated this trace. Since we have to analyze opcodes to follow the 'E' and 'N' atoms, decode relies on accurate memory images being fed into the decoder. The only actual addresses explicitly stated in the trace are the start, 0x4003f0, and the exception return address, 0x400630. The others are synthesized from the supplied image.
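One way to check whether the image matches the trace is to decode the direct branch at the trace start address and compare its target with the first target the trace reports (0x41a158). A sketch of that check follows - the opcode here is synthetic, since the real bytes of ls are not in this thread; with the real binary you would instead read the 4 little-endian bytes at file offset 0x4003f0 - 0x400000 from ls.bin:

```python
def a64_b_target(pc, opcode):
    """Decode the target of an A64 B/BL immediate branch, else None."""
    if (opcode >> 26) not in (0b000101, 0b100101):  # B / BL top bits
        return None
    imm26 = opcode & 0x03FFFFFF
    if imm26 & 0x02000000:           # sign-extend the 26-bit field
        imm26 -= 1 << 26
    return pc + imm26 * 4

# Synthetic stand-in: encode a B at _start (0x4003f0) whose target is the
# first branch target reported in the trace, then decode it back.
pc, want = 0x4003F0, 0x41A158
opcode = 0x14000000 | (((want - pc) >> 2) & 0x03FFFFFF)
got = a64_b_target(pc, opcode)
print(hex(got))                      # should agree with the trace's target
```

If the value decoded from the real opcode bytes disagrees with 0x41a158, the image (or its assumed load offset) is not what was traced.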
There may be a case for checking, when decoding the exception packet, that the address is not behind the current location, and throwing an error - but beyond that I do not at this point believe that the decoder is at fault.
Regards
Mike
On 18 September 2018 at 19:32, Mike Leach mike.leach@linaro.org wrote: