On Wed, 24 May 2017 12:48:04 -0500 Sebastian Pop <sebpop@gmail.com> wrote:
On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
Are the instructions in the autoFDO section of the HOWTO.md on GitHub sufficient to test this, or is there another way?
Here is how I tested it: (supposing that perf.data contains an ETM trace)
# perf inject -i perf.data -o inj --itrace=il64 --strip
# perf report -i inj -D &> dump
and I inspected the addresses from the last branch stack in the output dump with the addresses of the disassembled program from:
# objdump -d sort
Re-running the AutoFDO process with these two patches still makes the resulting executable perform worse, however:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5306 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5304 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5851 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5889 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5888 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5318 ms
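To put a number on the regression rather than eyeballing the runs, something like the following could be used. It is only a sketch: mean_ms and pct_change are helper names made up for illustration, and it assumes each run prints its time as "<N> ms" on a line of its own, as in the output above.

```shell
#!/bin/sh
# mean_ms (hypothetical helper): read benchmark output on stdin and
# print the mean of every "<N> ms" figure it contains.
mean_ms() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "ms" && $(i - 1) ~ /^[0-9]+$/) { sum += $(i - 1); n++ }
    }
    END { if (n) printf "%.1f\n", sum / n }'
}

# pct_change (hypothetical helper): relative change from $1 to $2, in percent.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f%%\n", (b - a) / a * 100 }'
}

# The figures below are the ones from the runs above.
base=$(printf '5306 ms\n5304 ms\n5318 ms\n' | mean_ms)
fdo=$(printf '5851 ms\n5889 ms\n5888 ms\n' | mean_ms)
echo "O3: $base ms, O3-autofdo: $fdo ms, change: $(pct_change "$base" "$fdo")"
```

With the six runs above this gives a mean of 5309.3 ms for sort-O3 against 5876.0 ms for sort-O3-autofdo, i.e. about 10.7% slower. To feed it live output instead, pipe a loop of taskset runs into mean_ms.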
The gcov file generated from the inj.data (no matter whether it's --itrace=il64 or --itrace=i100usle) still looks wrong:
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
sort_array total:19309128 head:0
0: 0
1: 0
5: 0
6: 0
7.1: 0
7.3: 0
8.3: 0
15: 2
16: 2
17: 2
10: start total:0
1: 0
11: bubble_sort total:19309119
2: 1566
4: 6266668
5: 6071341
7: 6266668
9: 702876
12: stop total:3
2: 0
3: 1
4: 1
5: 1
main total:1 head:0
0: 0
2: 0
4: 1
1: cmd_line total:0
3: 0
4: 0
5: 0
6: 0
Whereas the one generated from an intel-pt run looks correct, showing the swap (11: bubble_sort 7,8) executed fewer times:
kim@juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
sort_array total:105658 head:0
0: 0
5: 0
6: 0
7.1: 0
7.3: 0
8.3: 0
16: 0
17: 0
1: printf total:0
2: 0
10: start total:0
1: 0
11: bubble_sort total:105658
2: 14
4: 28740
5: 28628
7: 9768
8: 9768
9: 28740
12: stop total:0
2: 0
3: 0
4: 0
5: printf total:0
2: 0
15: printf total:0
2: 0
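A quick sanity check on a profile, before rebuilding with it, is to compare the count on the swap line with the count on another line of the loop body in the inlined bubble_sort. The sketch below does that on dump_gcov text output. swap_ratio is a made-up helper name, and using offset 4 as the reference line is an assumption on my part; offset 7 is the swap line mentioned above.

```shell
#!/bin/sh
# swap_ratio (hypothetical helper): read dump_gcov text on stdin and print
# count(offset 7) / count(offset 4) for the inlined bubble_sort body.
# Assumption: offset 7 is the swap and offset 4 is executed on every
# inner-loop iteration.
swap_ratio() {
    awk '
        # Any "name total:N" header opens a new scope; remember whether
        # that scope is bubble_sort.
        /total:/ { inside = ($0 ~ /bubble_sort total:/); next }
        inside && $1 == "4:" { ref = $2 }
        inside && $1 == "7:" { swp = $2 }
        END { if (ref > 0) printf "%.2f\n", swp / ref }
    '
}

# Usage, with the dump_gcov invocation from this thread:
#   ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov | swap_ratio
```

On the two dumps in this mail it prints 1.00 for the ETM-sourced profile (7: 6266668 against 4: 6266668) and 0.34 for the intel-pt one (7: 9768 against 4: 28740), which is the difference I am pointing at: the ETM profile claims the swap runs on every iteration.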
I have to run the 'perf inject' on the x86 host because of the aforementioned:
0x350 [0x50]: failed to process type: 1
problem when trying to run it natively on the aarch64 target.
However, it doesn't matter whether I run create_gcov (like so, btw):
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
on the x86 host or the aarch64 target: I still get the same (negative performance) results.
As Sebastian asked, if I take the .gcov generated from the intel-pt sourced inject onto the target and rebuild sort, the performance improves:
$ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5309 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5310 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
And if I take the ETM-generated gcov and use that to build a new x86_64 binary, it indeed also performs worse on x86_64:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1502 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1500 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1501 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1893 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
Kim