next/master boot: 270 boots: 35 failed, 213 passed with 20 offline, 2 untried/unknown (next-20171207)
Full Boot Summary: https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20171207/
Full Build Summary: https://kernelci.org/build/next/branch/master/kernel/next-20171207/

Tree: next
Branch: master
Git Describe: next-20171207
Git Commit: e40fd8d6b4d9f59b160faa1736f78fc07533ff37
Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Tested: 91 unique boards, 23 SoC families, 36 builds out of 213
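As a quick sanity check on the summary line, the four result counts should add up to the number of boots. The sketch below parses the subject line with a regular expression; the function names and the dictionary keys are illustrative only and are not part of any kernelci.org tooling.

```python
import re

SUMMARY = ("next/master boot: 270 boots: 35 failed, 213 passed with 20 offline, "
           "2 untried/unknown (next-20171207)")


def parse_summary(line):
    """Extract the boot counts from a summary line of the form shown above."""
    pattern = (r"(\d+) boots: (\d+) failed, (\d+) passed with "
               r"(\d+) offline, (\d+) untried/unknown")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a boot summary line")
    total, failed, passed, offline, untried = map(int, match.groups())
    return {"total": total, "failed": failed, "passed": passed,
            "offline": offline, "untried": untried}


def is_consistent(counts):
    """True when failed + passed + offline + untried equals the total."""
    parts = (counts["failed"] + counts["passed"]
             + counts["offline"] + counts["untried"])
    return parts == counts["total"]


if __name__ == "__main__":
    counts = parse_summary(SUMMARY)
    print(counts, "consistent:", is_consistent(counts))
```

For this report the counts are consistent: 35 + 213 + 20 + 2 = 270.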
Boot Regressions Detected:
arm:
    exynos_defconfig:
        exynos5800-peach-pi:
            lab-collabora: new failure (last pass: next-20171205)
    multi_v7_defconfig:
        am437x-gp-evm_rootfs:nfs:
            lab-ti: failing since 2 days (last pass: next-20171204 - first fail: next-20171205)
        dra72-evm_rootfs:nfs:
            lab-ti: failing since 3 days (last pass: next-20171201 - first fail: next-20171204)
        sun4i-a10-cubieboard:
            lab-baylibre-seattle: failing since 55 days (last pass: next-20171009 - first fail: next-20171012)
        tegra124-nyan-big:
            lab-collabora: failing since 33 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_ARM_LPAE=y:
        exynos5410-odroidxu:
            lab-baylibre-seattle: failing since 106 days (last pass: next-20170810 - first fail: next-20170823)
        sun7i-a20-bananapi:
            lab-baylibre-seattle: failing since 107 days (last pass: next-20170811 - first fail: next-20170822)
        sun7i-a20-cubieboard2:
            lab-baylibre-seattle: failing since 107 days (last pass: next-20170811 - first fail: next-20170822)
        sun7i-a20-cubietruck:
            lab-baylibre-seattle: failing since 107 days (last pass: next-20170803 - first fail: next-20170822)
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_EFI=y:
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_EFI=y+CONFIG_ARM_LPAE=y:
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_LKDTM=y:
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_PROVE_LOCKING=y:
        bcm2836-rpi-2-b:
            lab-collabora: failing since 17 days (last pass: next-20171117 - first fail: next-20171120)
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_SMP=n:
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+CONFIG_THUMB2_KERNEL=y+CONFIG_ARM_MODULE_PLTS=y:
        tegra124-nyan-big:
            lab-collabora: failing since 34 days (last pass: next-20171102 - first fail: next-20171103)
    multi_v7_defconfig+kselftest:
        exynos5800-peach-pi:
            lab-collabora: failing since 27 days (last pass: next-20171108 - first fail: next-20171110)
        tegra124-nyan-big:
            lab-collabora: failing since 33 days (last pass: next-20171102 - first fail: next-20171103)
    omap2plus_defconfig:
        am335x-bone_rootfs:nfs:
            lab-ti: new failure (last pass: next-20171206)
        am335x-evmsk_rootfs:nfs:
            lab-ti: new failure (last pass: next-20171206)
        am437x-sk-evm_rootfs:nfs:
            lab-ti: new failure (last pass: next-20171206)
        am572x-idk_rootfs:nfs:
            lab-ti: failing since 1 day (last pass: next-20171205 - first fail: next-20171206)
        dra7-evm_rootfs:nfs:
            lab-ti: failing since 1 day (last pass: next-20171205 - first fail: next-20171206)
        dra72-evm_rootfs:nfs:
            lab-ti: failing since 3 days (last pass: next-20171201 - first fail: next-20171204)
        omap4-sdp-es23plus_rootfs:nfs:
            lab-ti: failing since 1 day (last pass: next-20171205 - first fail: next-20171206)
    sunxi_defconfig:
        sun4i-a10-cubieboard:
            lab-baylibre-seattle: failing since 55 days (last pass: next-20171009 - first fail: next-20171012)
    tegra_defconfig:
        tegra124-nyan-big:
            lab-collabora: failing since 29 days (last pass: next-20171107 - first fail: next-20171108)
x86:
    defconfig+CONFIG_LKDTM=y:
        x86-atom330:
            lab-mhart: new failure (last pass: next-20171205)
    defconfig+kselftest:
        x86-atom330:
            lab-mhart: failing since 1 day (last pass: next-20171205 - first fail: next-20171206)
    x86_64_defconfig:
        x86-atom330:
            lab-mhart: new failure (last pass: next-20171206)
Boot Failures Detected:
x86:
    defconfig+CONFIG_LKDTM=y
        x86-atom330: 1 failed lab
    defconfig+kselftest
        x86-atom330: 1 failed lab
    x86_64_defconfig
        x86-atom330: 1 failed lab
arm:
    multi_v7_defconfig
        am437x-gp-evm_rootfs:nfs: 1 failed lab
        dra72-evm_rootfs:nfs: 1 failed lab
        keystone-k2g-evm_rootfs:nfs: 1 failed lab
        keystone-k2hk-evm_rootfs:nfs: 1 failed lab
        sun4i-a10-cubieboard: 1 failed lab
        tegra124-nyan-big: 1 failed lab
    sunxi_defconfig
        sun4i-a10-cubieboard: 1 failed lab
    multi_v7_defconfig+kselftest
        exynos5800-peach-pi: 1 failed lab
        rk3288-rock2-square: 1 failed lab
        tegra124-nyan-big: 1 failed lab
    multi_v7_defconfig+CONFIG_EFI=y
        tegra124-nyan-big: 1 failed lab
    omap2plus_defconfig
        am335x-bone_rootfs:nfs: 1 failed lab
        am335x-evmsk_rootfs:nfs: 1 failed lab
        am437x-sk-evm_rootfs:nfs: 1 failed lab
        am572x-idk_rootfs:nfs: 1 failed lab
        dra7-evm_rootfs:nfs: 1 failed lab
        dra72-evm_rootfs:nfs: 1 failed lab
        omap4-sdp-es23plus_rootfs:nfs: 1 failed lab
    multi_v7_defconfig+CONFIG_SMP=n
        tegra124-nyan-big: 1 failed lab
    multi_v7_defconfig+CONFIG_PROVE_LOCKING=y
        bcm2836-rpi-2-b: 1 failed lab
        rk3288-rock2-square: 1 failed lab
        tegra124-nyan-big: 1 failed lab
    multi_v7_defconfig+CONFIG_THUMB2_KERNEL=y+CONFIG_ARM_MODULE_PLTS=y
        tegra124-nyan-big: 1 failed lab
    multi_v7_defconfig+CONFIG_EFI=y+CONFIG_ARM_LPAE=y
        tegra124-nyan-big: 1 failed lab
    multi_v7_defconfig+CONFIG_LKDTM=y
        tegra124-nyan-big: 1 failed lab
    exynos_defconfig
        exynos5800-peach-pi: 1 failed lab
    multi_v7_defconfig+CONFIG_ARM_LPAE=y
        exynos5410-odroidxu: 1 failed lab
        sun7i-a20-bananapi: 1 failed lab
        sun7i-a20-cubieboard2: 1 failed lab
        sun7i-a20-cubietruck: 1 failed lab
        tegra124-nyan-big: 1 failed lab
    tegra_defconfig
        tegra124-nyan-big: 1 failed lab
Offline Platforms:
arm64:
    defconfig+CONFIG_RANDOMIZE_BASE=y:
        juno-r2: 1 offline lab
        mt8173-evb: 1 offline lab
    defconfig+CONFIG_CPU_BIG_ENDIAN=y:
        juno-r2: 1 offline lab
        mt8173-evb: 1 offline lab
    defconfig:
        juno-r2: 1 offline lab
        mt8173-evb: 1 offline lab
arm:
    multi_v7_defconfig:
        alpine-db: 1 offline lab
        at91-sama5d3_xplained: 1 offline lab
        at91-sama5d4_xplained: 1 offline lab
        mt7623n-bananapi-bpi-r2: 1 offline lab
        mt8135-evbp1: 1 offline lab
        rk3288-veyron-jerry: 1 offline lab
        socfpga_cyclone5_de0_sockit: 1 offline lab
        tegra124-jetson-tk1: 1 offline lab
    bcm2835_defconfig:
        bcm2835-rpi-b: 1 offline lab
    multi_v7_defconfig+CONFIG_ARM_LPAE=y:
        rk3288-veyron-jerry: 1 offline lab
        tegra124-jetson-tk1: 1 offline lab
    sama5_defconfig:
        at91-sama5d3_xplained: 1 offline lab
        at91-sama5d4_xplained: 1 offline lab
    tegra_defconfig:
        tegra124-jetson-tk1: 1 offline lab
--- For more info write to info@kernelci.org
On Thu, Dec 07, 2017 at 03:54:47PM -0800, kernelci.org bot wrote:
Today's -next failed to boot on peach-pi:
exynos_defconfig: exynos5800-peach-pi: lab-collabora: new failure (last pass: next-20171205)
with details at https://kernelci.org/boot/id/5a2a2e7859b5141bc2afa17c/ (including logs and comparisons with other boots, the last good boot was Wednesday). It looks like it hangs somewhere late on in boot, the last output on the console is:
[ 4.827139] smsc95xx 3-1.1:1.0 eth0: register 'smsc95xx' at usb-xhci-hcd.3.auto-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:00:03:c0
[ 5.781037] dma-pl330 3880000.adma: Loaded driver for PL330 DMAC-241330
[ 5.786247] dma-pl330 3880000.adma: DBUFF-4x8bytes Num_Chans-6 Num_Peri-16 Num_Events-6
[ 5.819200] dma-pl330 3880000.adma: PM domain MAU will not be powered off
[ 64.529228] random: crng init done
and there's failures earlier to instantiate the display.
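The stall described above is visible as a jump in the console timestamps between the last dma-pl330 line and the "random:" line. A minimal sketch of how such a gap could be located automatically follows; the helper name largest_gap is made up for this illustration, and the log text is copied from the excerpt above.

```python
import re

LOG = """\
[ 4.827139] smsc95xx 3-1.1:1.0 eth0: register 'smsc95xx' at usb-xhci-hcd.3.auto-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:00:03:c0
[ 5.781037] dma-pl330 3880000.adma: Loaded driver for PL330 DMAC-241330
[ 5.786247] dma-pl330 3880000.adma: DBUFF-4x8bytes Num_Chans-6 Num_Peri-16 Num_Events-6
[ 5.819200] dma-pl330 3880000.adma: PM domain MAU will not be powered off
[ 64.529228] random: crng init done
"""


def largest_gap(log_text):
    """Return (gap_seconds, time_before_gap, message_after_gap), or None."""
    stamps = []
    for line in log_text.splitlines():
        match = re.match(r"\[\s*(\d+\.\d+)\]\s*(.*)", line)
        if match:
            stamps.append((float(match.group(1)), match.group(2)))
    best = None
    # Compare each timestamped line with the one that follows it.
    for (t0, _), (t1, message) in zip(stamps, stamps[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, t0, message)
    return best


if __name__ == "__main__":
    gap, before, message = largest_gap(LOG)
    print(f"{gap:.3f}s of silence after t={before:.3f}, then: {message}")
```

Here the console is silent for roughly 58.7 seconds before the "random: crng init done" line appears.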
On Fri, Dec 08, 2017 at 12:20:07PM +0000, Mark Brown wrote:
On Thu, Dec 07, 2017 at 03:54:47PM -0800, kernelci.org bot wrote:
Today's -next failed to boot on peach-pi:
exynos_defconfig: exynos5800-peach-pi: lab-collabora: new failure (last pass: next-20171205)
with details at https://kernelci.org/boot/id/5a2a2e7859b5141bc2afa17c/ (including logs and comparisons with other boots, the last good boot was Wednesday). It looks like it hangs somewhere late on in boot, the last output on the console is:
[ 4.827139] smsc95xx 3-1.1:1.0 eth0: register 'smsc95xx' at usb-xhci-hcd.3.auto-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:00:03:c0
[ 5.781037] dma-pl330 3880000.adma: Loaded driver for PL330 DMAC-241330
[ 5.786247] dma-pl330 3880000.adma: DBUFF-4x8bytes Num_Chans-6 Num_Peri-16 Num_Events-6
[ 5.819200] dma-pl330 3880000.adma: PM domain MAU will not be powered off
[ 64.529228] random: crng init done
and there's failures earlier to instantiate the display.
I just noticed that further up the log there's a lockdep splat with a conflict between the genpd and clock API locking - an ABBA issue with genpd->mlock and the clock API prepare_lock.
On Fri, Dec 8, 2017 at 1:27 PM, Mark Brown broonie@kernel.org wrote:
I just noticed that further up the log there's a lockdep splat with a conflict between the genpd and clock API locking - an ABBA issue with genpd->mlock and the clock API prepare_lock.
+Cc Marek Szyprowski,
The lockdep issue and display failures (including the regulator warning) have been present for some time. They also appear in the boot log for next-20171206 (https://storage.kernelci.org/next/master/next-20171206/arm/exynos_defconfig/...). The difference is that 20171208 hangs on "random: crng init done", which did not appear before at all.
The only recent changes in the samsung-soc tree which could affect Exynos5800 are:
- ARM: dts: exynos: Add audio power domain support to Exynos542x SoCs
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux.git/commit/?h=nex...
- ARM: dts: exynos: Add CPU perf counters to Exynos54xx boards
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux.git/commit/?h=nex...
But both of them have been present in next for a few days, so they should have affected 20171206 as well.
Maybe the issue comes from a different subsystem?
Best regards, Krzysztof
Hi Krzysztof,
On 2017-12-08 13:33, Krzysztof Kozlowski wrote:
+Cc Marek Szyprowski,
The lockdep issue and display failures (including the regulator warning) have been present for some time. They also appear in the boot log for next-20171206 (https://storage.kernelci.org/next/master/next-20171206/arm/exynos_defconfig/...). The difference is that 20171208 hangs on "random: crng init done", which did not appear before at all.
"random: crng init done" happens about a minute after boot, so if the board boots correctly to the system prompt before that time, there will be no such message.
The only recent changes in the samsung-soc tree which could affect Exynos5800 are:
- ARM: dts: exynos: Add audio power domain support to Exynos542x SoCs
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux.git/commit/?h=nex...
- ARM: dts: exynos: Add CPU perf counters to Exynos54xx boards
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux.git/commit/?h=nex...
But both of them have been present in next for a few days, so they should have affected 20171206 as well.
Maybe the issue comes from a different subsystem?
The only change that has been recently merged and is related to the hardware available on peach-pit is this patch:
https://www.spinics.net/lists/linux-samsung-soc/msg61232.html
It probably changed the order of driver initialization, but I have no idea what causes the deadlock (from the "random:" message I see that the kernel is somehow still operational).
Best regards
On 12/08, Marek Szyprowski wrote:
Hi Krzysztof,
On 2017-12-08 13:33, Krzysztof Kozlowski wrote:
+Cc Marek Szyprowski,
The lockdep issue and display failures (including the regulator warning) have been present for some time. They also appear in the boot log for next-20171206 (https://storage.kernelci.org/next/master/next-20171206/arm/exynos_defconfig/...). The difference is that 20171208 hangs on "random: crng init done", which did not appear before at all.
I haven't looked at the lockdep splat yet, but is that happening because of runtime PM usage by the clk framework?
Hi Stephen,
On 2017-12-08 17:59, Stephen Boyd wrote:
I haven't looked at the lockdep splat yet, but is that happening because of runtime PM usage by the clk framework?
This is a false positive. Lockdep doesn't distinguish between individual domain instances. Only some instances of the exynos power domains use clocks (as an old workaround for the lack of a way to integrate proper clock rate/topology restoration after a power off/on cycle into the clock provider driver).
The clock controllers that implement runtime PM are assigned to power domains that don't touch clocks at all.
I still have no idea how to fix the code to make lockdep happy.
Best regards
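To illustrate why a validator that tracks lock classes rather than lock instances reports a cycle here, below is a toy order checker. It is only a sketch of the idea and not how lockdep is implemented; the class OrderChecker, the function run, and the key strings are all made up for this example.

```python
from collections import defaultdict


class OrderChecker:
    """Toy lock-order validator keyed by an arbitrary lock 'class' string."""

    def __init__(self):
        self.edges = defaultdict(set)  # key -> keys acquired while holding it
        self.held = []                 # keys currently held, in order

    def _reachable(self, src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, key):
        """Record the acquisition; return False if it closes an order cycle."""
        ok = True
        for outer in self.held:
            if self._reachable(key, outer):
                ok = False
            self.edges[outer].add(key)
        self.held.append(key)
        return ok

    def release(self, key):
        self.held.remove(key)


def run(domain_a_key, domain_b_key):
    """Replay both call paths; return True if no ordering problem was seen."""
    checker = OrderChecker()
    results = []
    # Path 1: power domain A is switched on and its provider touches clocks.
    results.append(checker.acquire(domain_a_key))
    results.append(checker.acquire("prepare_lock"))
    checker.release("prepare_lock")
    checker.release(domain_a_key)
    # Path 2: a clock is prepared and runtime PM resumes power domain B.
    results.append(checker.acquire("prepare_lock"))
    results.append(checker.acquire(domain_b_key))
    checker.release(domain_b_key)
    checker.release("prepare_lock")
    return all(results)


if __name__ == "__main__":
    # One shared class for every domain: the checker sees a cycle.
    print(run("genpd->mlock", "genpd->mlock"))
    # One key per domain instance: the two paths no longer conflict.
    print(run("genpd->mlock:A", "genpd->mlock:B"))
```

With a single shared key the second path closes the cycle mlock -> prepare_lock -> mlock, even though domains A and B are different objects; with per-instance keys no cycle is recorded.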
Hi Shuah,
Do you have a bit of spare time for Exynos kernel development? Could you investigate why Peach-Pi(t) Chromebooks fail to boot with recent kernels? If I remember correctly, you had access to those boards.
The failure itself seems to be caused by the following patch: https://patchwork.kernel.org/patch/10067711/ which got merged as 510353a63796 to v4.15-rc3 and fixed the boot issue on the Snow Chromebook (Exynos 5250 based). However, I don't see any path by which it could deadlock and cause a boot failure on Exynos 5420/5800 Chromebooks. I don't have access to Peach Chromebooks to reproduce it, and our Snow works fine.
Here are some logs:
v4.15-rc3 failure: https://storage.kernelci.org/mainline/master/v4.15-rc3/arm/exynos_defconfig/...
next-20171207 (first next failure): https://storage.kernelci.org/next/master/next-20171207/arm/exynos_defconfig/...
Hi Marek,
On 12/11/2017 03:43 AM, Marek Szyprowski wrote:
Hi Shuah,
Do you have a bit of spare time for Exynos kernel development? Could you investigate why Peach-Pi(t) Chromebooks fail to boot with recent kernels? If I remember correctly, you had access to those boards.
Unfortunately I don't have a Peach-Pi(t) Chromebook.
thanks, -- Shuah
On 12/11, Marek Szyprowski wrote:
Hi Stephen,
On 2017-12-08 17:59, Stephen Boyd wrote:
I haven't looked at the lockdep splat yet, but is that happening because of runtime PM usage by the clk framework?
This is a false positive. Lockdep doesn't distinguish between individual domain instances. Only some instances of the exynos power domains use clocks (as an old workaround for the lack of a way to integrate proper clock rate/topology restoration after a power off/on cycle into the clock provider driver).
The clock controllers that implement runtime PM are assigned to power domains that don't touch clocks at all.
I still have no idea how to fix the code to make lockdep happy.
Right. Once lockdep complains lockdep turns itself off, so we lose the ability to detect other problems. Even if it's a false positive, it's a potential problem on some device so it's concerning that runtime PM usage from clk framework has created this potential problem.
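The cost of a one-shot validator can be shown with a small model: after the first report it stops checking, so a later and possibly real inversion goes unreported. This is a deliberately simplified sketch; OneShotChecker and the lock names passed to it are hypothetical and do not mirror lockdep's internals.

```python
class OneShotChecker:
    """Toy validator that disables itself after its first report."""

    def __init__(self):
        self.enabled = True
        self.order = set()   # observed (outer, inner) acquisition pairs
        self.reports = []    # inversions that were actually reported

    def note(self, outer, inner):
        """Record that 'inner' was taken while 'outer' was held."""
        if not self.enabled:
            return  # checking is off: nothing is recorded or reported
        if (inner, outer) in self.order:
            self.reports.append((outer, inner))
            self.enabled = False
            return
        self.order.add((outer, inner))


def demo():
    checker = OneShotChecker()
    # First inversion (the false positive in this thread's terms).
    checker.note("genpd->mlock", "prepare_lock")
    checker.note("prepare_lock", "genpd->mlock")
    # A second, unrelated inversion that happens afterwards.
    checker.note("lock_x", "lock_y")
    checker.note("lock_y", "lock_x")
    return checker.reports


if __name__ == "__main__":
    print(demo())
```

Only the first inversion ends up in the report list; the lock_x/lock_y inversion is never seen because checking was already switched off.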
Is it possible to remove the clk operations from the exynos power domains? You say it's to deal with the lack of rate/topology restoration so maybe it can be changed. That will at least allow lockdep to continue working here and detect the "real" deadlock here. Otherwise, do we need to revert runtime PM for clk framework and back out all the Samsung changes on top of that? If we need to do that, we need to do it soon.
We'll need to think about how to resolve the cross-subsystem locking problem regardless. We definitely want to have genpd be able to do CCF things, and CCF to use runtime PM and genpds too. It seems that we have a classic AB-BA deadlock potential between the clk prepare lock and the genpd domain mutex. Both frameworks are holding a mutex across the operations of their providers (either clk_ops or genpd power_on/off) so we can't have the CCF call genpd things and genpd call CCF things or lockdep will complain. I was worried about runtime PM usage by CCF causing this problem, but I missed that genpd was behind runtime PM so I didn't consider the locks in that part of the chain. Ugh.
Maybe we can have runtime PM things done outside of the prepare lock in CCF, that way we aren't holding any locks that genpd may need to use. That would fix the problem, but would expose us to clk tree topology changes happening while we enable runtime PM for clks. It would be great if we could drop all framework level locks when we call into provider drivers. I'm not sure how to do that yet, but that's probably the end goal.
Anyway, this needs some thought to figure out how to redesign the CCF locking scheme so this problem doesn't exist.
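The AB-BA pattern and the idea of doing the runtime PM call outside the prepare lock can be sketched by recording which lock is taken while another is held. Everything below runs in a single thread, so nothing can actually deadlock; the Tracer class and the three functions are stand-ins invented for this illustration, not CCF or genpd code.

```python
import threading
from contextlib import contextmanager


class Tracer:
    """Records (held, acquired) pairs so opposite orderings can be spotted."""

    def __init__(self):
        self.held = []
        self.edges = set()

    @contextmanager
    def take(self, name, lock):
        with lock:
            self.edges.update((h, name) for h in self.held if h != name)
            self.held.append(name)
            try:
                yield
            finally:
                self.held.pop()

    def inverted(self):
        return any((b, a) in self.edges for (a, b) in self.edges)


def genpd_power_on(t, prepare_lock, mlock):
    # The domain holds its mutex while its provider touches clocks.
    with t.take("genpd->mlock", mlock):
        with t.take("prepare_lock", prepare_lock):
            pass


def clk_prepare_pm_inside(t, prepare_lock, mlock):
    # Runtime PM resume (and thus the genpd mutex) under the prepare lock.
    with t.take("prepare_lock", prepare_lock):
        with t.take("genpd->mlock", mlock):
            pass


def clk_prepare_pm_outside(t, prepare_lock, mlock):
    # Runtime PM resume first, with no framework lock held, then prepare.
    with t.take("genpd->mlock", mlock):
        pass
    with t.take("prepare_lock", prepare_lock):
        pass


def trace(clk_prepare):
    t = Tracer()
    prepare_lock, mlock = threading.RLock(), threading.Lock()
    genpd_power_on(t, prepare_lock, mlock)
    clk_prepare(t, prepare_lock, mlock)
    return t.inverted()


if __name__ == "__main__":
    print("PM inside prepare lock, inversion:", trace(clk_prepare_pm_inside))
    print("PM outside prepare lock, inversion:", trace(clk_prepare_pm_outside))
```

The second variant removes the prepare_lock -> mlock ordering, but, as noted above, in a real framework the clk tree topology could then change between the runtime PM call and taking the prepare lock; the sketch does not model that.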
Hi Stephen,
On 2017-12-19 21:05, Stephen Boyd wrote:
Right. Once lockdep complains lockdep turns itself off, so we lose the ability to detect other problems. Even if it's a false positive, it's a potential problem on some device so it's concerning that runtime PM usage from clk framework has created this potential problem.
Is it possible to remove the clk operations from the exynos power domains? You say it's to deal with the lack of rate/topology restoration so maybe it can be changed.
Yes, it can be changed. That rate/topology restoration should be done in the exynos5420 clk driver, which should also implement runtime PM. However, there is still one issue to be resolved. The current runtime PM / generic power domain bindings don't allow a device (the clock controller in this case) to be assigned to more than one power domain. To get it working we would need to have a clock controller object (subdevice?) for each power domain.
Such an approach was already rejected by you in the initial Exynos4 clock controller runtime PM patches.
The Exynos4 case could be modeled differently, frankly speaking in a way a bit closer to the real hardware details. The Exynos4 ISP clock registers (the part which is in fact under the power domain) are in a separate address space region, so in the end this was modeled as a separate clock controller, which was easily assigned to the respective power domain.
Exynos5420/5422 (and partially Exynos5250) is much more problematic, because the registers of all clocks are mixed together and the MUXes which lose state after a power domain suspend are in a common register (SRC_TOP3). It cannot be modeled with multiple clock controllers, one per power domain.
That will at least allow lockdep to continue working here and detect the "real" deadlock here. Otherwise, do we need to revert runtime PM for clk framework and back out all the Samsung changes on top of that? If we need to do that, we need to do it soon.
I would like to avoid this if possible.
We'll need to think about how to resolve the cross-subsystem locking problem regardless. We definitely want to have genpd be able to do CCF things, and CCF to use runtime PM and genpds too. It seems that we have a classic AB-BA deadlock potential between the clk prepare lock and the genpd domain mutex. Both frameworks are holding a mutex across the operations of their providers (either clk_ops or genpd power_on/off) so we can't have the CCF call genpd things and genpd call CCF things or lockdep will complain.
One thing to notice is locking granularity. CCF has a single, common mutex with non-standard (thread re-entrant) semantics, while genpd has a standard per-object mutex.
Changing CCF to use a per-clock mutex instead of the global prepare lock will probably solve the possible deadlock, but it will not make lockdep happy, because lockdep doesn't distinguish object instances.
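The difference in lock semantics described above can be shown with Python's standard threading primitives, used here purely as an analogy: threading.RLock plays the role of a single re-entrant framework lock and threading.Lock the role of an ordinary per-object mutex. The Domain class and the helper functions are hypothetical.

```python
import threading

# Stand-in for one framework-wide, thread re-entrant lock.
prepare_lock = threading.RLock()


class Domain:
    """Stand-in for an object that carries its own non-recursive mutex."""

    def __init__(self, name):
        self.name = name
        self.mlock = threading.Lock()


def nested_prepare():
    """The owning thread may take the re-entrant lock again."""
    with prepare_lock:
        with prepare_lock:
            return True


def try_relock(domain):
    """A second acquire of the same plain mutex by its holder must fail."""
    with domain.mlock:
        got = domain.mlock.acquire(blocking=False)
        if got:
            domain.mlock.release()
        return got


def lock_other(first, second):
    """Holding one object's mutex does not block taking another object's."""
    with first.mlock:
        got = second.mlock.acquire(blocking=False)
        if got:
            second.mlock.release()
        return got


if __name__ == "__main__":
    print(nested_prepare())                      # re-entry succeeds
    print(try_relock(Domain("MAU")))             # same mutex twice: refused
    print(lock_other(Domain("A"), Domain("B")))  # different objects: fine
```

The last case is the one a class-based validator cannot tell apart from the second: both involve two acquisitions of "a domain mutex", even though only one of them could ever block.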
I was worried about runtime PM usage by CCF causing this problem, but I missed that genpd was behind runtime PM so I didn't consider the locks in that part of the chain. Ugh.
Maybe we can have runtime PM things done outside of the prepare lock in CCF, that way we aren't holding any locks that genpd may need to use. That would fix the problem, but would expose us to clk tree topology changes happening while we enable runtime PM for clks. It would be great if we could drop all framework level locks when we call into provider drivers. I'm not sure how to do that yet, but that's probably the end goal.
Anyway, this needs some thought to figure out how to redesign the CCF locking scheme so this problem doesn't exist.
The CCF locking scheme already suffers from other issues, like long waits on the common prepare lock and possible AB-BA deadlocks, which were mentioned in the "clk: Add per-controller locks to fix deadlocks" thread: https://lkml.org/lkml/2016/8/16/442
I thought we would be able to continue that work, but sadly there were other, more urgent issues to resolve first.
After some more thought about that patchset, it should be quite easy to switch to per-clock mutexes (instead of per-controller/per-provider). This would, however, not solve all the discussed issues.
Best regards
kernel-build-reports@lists.linaro.org