root@OpenWrt:/# [ 797.612241] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] SMP
[ 797.619934] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet vhost_net vhost pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack lzo vhost_iotlb slhc sfp nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 mdio_i2c lzo_rle lzo_decompress lzo_compress e1000e crc_ccitt atlantic pac1934 emc2301 emc181x emc17xx gpio_pca953x i2c_mux_pca954x i2c_mux i2c_dev sp805_wdt dwmac_sun8i dwmac_rk dwmac_imx nicvf nicpf thunder_bgx thunder_xcv dwmac_generic stmmac_platform stmmac rvu_nicvf rvu_nicpf rvu_af rvu_mbox mvpp2 mvneta vmxnet3 fec fsl_enetc fsl_enetc_mdio fsl_enetc_ierb fsl_dpaa2_eth sctp udp_tunnel libcrc32c ip6_udp_tunnel tun mdio_thunder mdio_cavium mdio_bcm_unimac
[ 797.620117] xgmac_mdio pcs_lynx fsl_mc_dpio genet nls_utf8 pcs_xpcs marvell10g marvell macsec sha512_generic seqiv jitterentropy_rng drbg md5 hmac nls_iso8859_1 nls_cp437 pf_ring rtc_rx8025 vfat fat ptp broadcom bcm_phy_lib aquantia hwmon pps_core phylink
[ 797.729778] CPU: 3 PID: 5078 Comm: qemu-system-aar Not tainted 5.15.137 #0
[ 797.736660] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-ga94e0d21 03/15/2022
[ 797.744582] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 797.751550] pc : 0xffff8000080b3000
[ 797.755039] lr : 0xffff8000080b308c
[ 797.758524] sp : ffff80000ebdb6c0
[ 797.761835] x29: ffff80000ebdb6c0 x28: ffff008028b7bfc0 x27: 0000000000000000
[ 797.768979] x26: 0000000000000000 x25: 0000000000001de8 x24: ffff00835e04dc40
[ 797.776122] x23: ffff800008e38c40 x22: 0000000000000000 x21: 0000000000000000
[ 797.783265] x20: ffff008028b78000 x19: ffff00835e04dc40 x18: 0000000000000000
[ 797.790408] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffc1628a80
[ 797.797549] x14: 027be09ace29c396 x13: 00000000000000a9 x12: 000000000000033f
[ 797.804693] x11: 000000133e7a20f3 x10: ffff008028b780c0 x9 : 0000000000000001
[ 797.811837] x8 : 0000000000000001 x7 : ffff00835e04e6a0 x6 : 00000000ffffffff
[ 797.818979] x5 : ffff0080015a00e8 x4 : ffff800008ce0670 x3 : ffff0080015a0000
[ 797.826122] x2 : 0000000000000000 x1 : ffff008028b78000 x0 : ffff00835e04dc40
[ 797.833265] Call trace:
[ 797.835709] 0xffff8000080b3000
[ 797.838849] 0xffff8000080b308c
[ 797.841988] 0xffff8000080b31e0
[ 797.845127] 0xffff8000080b3c58
[ 797.848265] 0xffff8000080b3e0c
[ 797.851404] 0xffff800008083c14
[ 797.854541] 0xffff800008032a7c
[ 797.857680] 0xffff800008053168
[ 797.860819] 0xffff80000805a488
[ 797.863958] 0xffff8000080599dc
[ 797.867096] 0xffff800008031954
[ 797.870234] 0xffff8000080319cc
[ 797.873373] 0xffff800008044fa8
[ 797.876511] 0xffff800008043c8c
[ 797.879650] 0xffff8000080476c0
[ 797.882790] 0xffff800008040144
[ 797.885928] 0xffff800008035ab8
[ 797.889065] 0xffff80000824ae74
[ 797.892205] 0xffff80000802551c
[ 797.895343] 0xffff800008025630
[ 797.898481] 0xffff800008a8c5b8
[ 797.901619] 0xffff800008a8c968
[ 797.904758] 0xffff800008011638
[ 797.907905] Code: f9000bf3 aa0003f3 f944b803 f9404024 (f9404065)
[ 797.914002] ---[ end trace 2ed853593c9edb35 ]---
[ 797.918622] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception
[ 797.926455] SMP: stopping secondary CPUs
[ 798.970378] SMP: failed to stop secondary CPUs 0-1,3-4
[ 798.975517] Kernel Offset: disabled
[ 798.979000] CPU features: 0x0,00000001,20000846
[ 798.983527] Memory Limit: none
[ 798.986576] Rebooting in 3 seconds..
[ 801.990261] SMP: stopping secondary CPUs
[ 803.034176] SMP: failed to stop secondary CPUs 0-1,3-4
I believe this started after turning the fan speed down from 5500 RPM (loud) to 3000 RPM.
But I’m not sure why that would result in a panic.
My guess is heat, as it happens when I’m running two VMs and they’re both under load.
I suspect something in my driver is causing this. OpenWrt has stripped the symbols from the kernel, so I’ll need to track them down.
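One way to put names on the raw addresses in the trace above, sketched under assumptions: if a System.map from the exact same kernel build is available (for example from the OpenWrt build tree, which is an assumption here), the nearest symbol at or below each address can be found with a plain string comparison. The log reports "Kernel Offset: disabled", so the addresses should not need adjusting for KASLR. The helper name addr2sym is made up for this example, and addresses inside loadable modules will not be listed in System.map.

```shell
#!/bin/sh
# addr2sym (hypothetical helper): print the nearest symbol at or below a
# kernel address, using a System.map from the same build.
# Usage: addr2sym <System.map> <address>
addr2sym() {
    map=$1
    # Normalise the address: lowercase, no 0x prefix.
    addr=$(printf '%s' "$2" | tr 'A-F' 'a-f' | sed 's/^0x//')
    awk -v a="$addr" '
        BEGIN {
            # Left-pad to 16 hex digits so it lines up with System.map entries.
            a = sprintf("%16s", a)
            gsub(/ /, "0", a)
        }
        # Prefixing "x" forces string comparison; equal-length lowercase hex
        # strings sort in the same order as the numbers they represent.
        ("x" $1) <= ("x" a) && ("x" $1) >= ("x" best) { best = $1; name = $3 }
        END {
            if (name != "")
                print name " (symbol starts at 0x" best ")"
            else
                print "no symbol found"
        }
    ' "$map"
}
```

It would be called once per address from the call trace, e.g. `addr2sym /path/to/System.map 0xffff8000080b308c`, where the path is a placeholder. The offset into the function is left out on purpose, since 64-bit subtraction is not reliable in every awk or shell.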
Is the fan spinning down (or trying to) automatically (via the thermal subsystem) or are you setting a specific speed via sysfs?
No, the fan speed doesn’t change until I set it via sysfs. It’s then 1-8 hours until a crash.
For some reason not all of my init scripts run correctly when the system boots, so not everything recovers, but that is separate from this issue.
I’m not sure it’s related to the fan driver. I can now reproduce the crash in one of two ways:
Wireshark + SSH + an rsync file transfer of a few gigabytes to my parents’ place.
or
stress-ng -a 4 inside a qemu VM for about 90 seconds.
Can’t reproduce on a 23.05.5 system.
Linux blackbox 5.15.167 #0 SMP Mon Sep 23 12:34:46 2024 aarch64 GNU/Linux
Do not run the stress-ng tool without a time limit: the VM eventually hung and sent out an NMI, but the host system stayed up!
I believe it’s heat related.
[ 8737.235649] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] SMP
[ 8737.243332] Modules linked in: ath9k ath9k_common xt_connlimit qcserial pppoe ppp_async option nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nf_conncount mt7921e mt7921_common iwlmvm iwldvm cdc_mbim ath9k_hw ath11k_pci ath11k ath10k_pci ath10k_core ath xt_state xt_helper xt_conntrack xt_connmark xt_connbytes xt_CT wireguard vhost_net vhost usb_wwan rndis_host qmi_wwan pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack_netlink nf_conntrack mt792x_lib mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mhi_wwan_mbim mhi_wwan_ctrl mac80211 lzo libchacha20poly1305 iwlwifi iptable_raw iptable_mangle iptable_filter ipt_REJECT ipt_ECN ip_tables ftdi_sio cfg80211 cdc_ncm cdc_ether asix xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent
[ 8737.243513] xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY x_tables wwan vhost_iotlb vhci_hcd usbserial usbnet usbip_host usbip_core smsc slhc sfp sch_cake ravb qrtr_mhi qrtr qmi_helpers poly1305_neon nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 mhi_net mhi mdio_i2c mdio_gpio mdio_bitbang lzo_rle lzo_decompress lzo_compress libcurve25519_generic libcrc32c e1000e compat cls_flower cdc_wdm cdc_acm ax88796b atlantic act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact configs pac1934 emc2301 emc181x emc17xx gpio_pca953x i2c_mux_pca954x i2c_mux i2c_dev sp805_wdt dwmac_rk dwmac_imx nicvf nicpf thunder_bgx thunder_xcv dwmac_generic stmmac_platform stmmac rvu_nicvf rvu_nicpf rvu_af rvu_mbox mvpp2 mvneta vmxnet3 fec fsl_enetc fsl_enetc_mdio fsl_enetc_ierb fsl_dpaa2_eth ifb
[ 8737.331056] ip6_tunnel oid_registry tunnel6 ip_tunnel veth tun mdio_thunder mdio_cavium mdio_bcm_unimac xgmac_mdio pcs_lynx fsl_mc_dpio autofs4 genet nls_utf8 pcs_xpcs marvell10g marvell vxlan udp_tunnel ip6_udp_tunnel macsec sha512_generic sha256_generic libsha256 seqiv jitterentropy_rng drbg michael_mic kpp hmac des_generic libdes cmac authencesn authenc nls_iso8859_1 nls_cp437 uas gpio_keys leds_gpio rtc_rx8025 tpm_i2c_atmel igb vfat fat button_hotplug ptp realtek broadcom bcm_phy_lib aquantia hwmon crc_ccitt pps_core phylink mii tpm
[ 8737.465705] CPU: 3 PID: 9015 Comm: qemu-system-aar Not tainted 5.15.167 #0
[ 8737.472585] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-ga94e0d21 03/15/2022
[ 8737.480504] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 8737.487468] pc : kvm_vcpu_check_block+0xf0/0x110
[ 8737.492098] lr : kvm_vcpu_check_block+0x34/0x110
[ 8737.496718] sp : ffff80001104b9c0
[ 8737.500027] x29: ffff80001104b9c0 x28: 00000000000000a0 x27: ffff0080458bc700
[ 8737.507166] x26: ffff008029254600 x25: 0000000000000258 x24: 0000000000000001
[ 8737.514306] x23: 0000000000000001 x22: ffff008029254600 x21: 0000000000000000
[ 8737.521447] x20: 000007f259d63125 x19: ffff0080458ba700 x18: 0000000000000000
[ 8737.528589] x17: ffff8083550df000 x16: ffff800009190000 x15: 0000000000000000
[ 8737.535730] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 8737.542870] x11: 0000000000000040 x10: ffff800009023140 x9 : 0000000000000001
[ 8737.550011] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000003
[ 8737.557152] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000001
[ 8737.564295] x2 : ffff0080458ba738 x1 : 0000000000000000 x0 : 0000000000000001
[ 8737.571436] Call trace:
[ 8737.573880] kvm_vcpu_check_block+0xf0/0x110
[ 8737.578158] kvm_vcpu_block+0x6c/0x2e0
[ 8737.581908] kvm_handle_wfx+0x80/0xc0
[ 8737.585572] handle_exit+0x60/0x144
[ 8737.589063] kvm_arch_vcpu_ioctl_run+0x1a0/0x7cc
[ 8737.593686] kvm_vcpu_ioctl+0x1f4/0x62c
[ 8737.597524] __arm64_sys_ioctl+0x664/0x14f0
[ 8737.601710] invoke_syscall.constprop.0+0x5c/0x110
[ 8737.606505] do_el0_svc+0x6c/0x150
[ 8737.609909] el0_svc+0x28/0xc0
[ 8737.612969] el0t_64_sync_handler+0xe8/0x114
[ 8737.617242] el0t_64_sync+0x1a4/0x1a8
[ 8737.620910] Code: d50323bf d65f03c0 d5033abf 9100e262 (f9800051)
[ 8737.627005] ---[ end trace 5c1efc0078b7241b ]---
[ 8737.631621] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception
[ 8737.639453] SMP: stopping secondary CPUs
[ 8737.643384] Kernel Offset: disabled
[ 8737.646867] CPU features: 0x0,00000001,20000846
[ 8737.651395] Memory Limit: none
[ 8737.654444] Rebooting in 3 seconds..
Last logged temperatures, in millidegrees Celsius (while true; do sleep 1 && cat /sys/class/hwmon/hwmon0/temp1_input; done):
78375
78875
77500
after approximately 20 minutes of running Geekbench 6 Preview in a Debian chroot on the system.
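For anyone repeating this measurement, a timestamped variant of the loop makes it easier to line readings up with the kernel log. This is only a sketch: hwmon temp*_input files report millidegrees Celsius, so 78375 is 78.375 °C. The function names are made up for this example, and the default sensor path is simply the one used in the loop above.

```shell
#!/bin/sh
# Convert a hwmon reading in millidegrees Celsius to degrees with three
# decimals. Assumes a non-negative integer reading.
milli_to_c() {
    printf '%d.%03d\n' "$(( $1 / 1000 ))" "$(( $1 % 1000 ))"
}

# Print one timestamped reading per second until the sensor cannot be read.
# The first argument is the sensor file; the default is the path from the
# post above and may differ on other boards.
log_temps() {
    sensor=${1:-/sys/class/hwmon/hwmon0/temp1_input}
    while true; do
        raw=$(cat "$sensor") || break
        echo "$(date '+%Y-%m-%d %H:%M:%S') $(milli_to_c "$raw") C"
        sleep 1
    done
}
```

Running `log_temps` in a remote SSH session keeps the last few readings visible on the client side after the board panics.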
Sorry for the late response, I’ve been pretty busy recently.
The temp readings show it failing at a lower temperature than I would expect (the SoC itself will keep running at much higher temperatures), but other things on the board can influence that.
I have a device I’m trying to reproduce the issue on, and I’ve gotten a small number of crashes, but not as frequently as you are seeing.
If you tweak the fan target up a little higher (like 3500), do you still have stability issues?
Ultimately I think the fan targets need to be modified in the device tree so there can be a “forced” spinup when it’s on the edge of the stable zone, and the current trip point (70 °C) in the device tree is too low. (I’ve also found that if you manually set the fan target when you are over the trip point, the kernel won’t return to its ‘ramping’ behavior.)
edit: I’ve had “better” luck getting crash issues (not kernel, but userspace) today with the warmer weather. So device tree tweaks are the next step.
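Before changing the device tree, it can help to confirm which trip points the running kernel actually registered. The sketch below reads the standard thermal sysfs files trip_point_N_temp and trip_point_N_type. The function name list_trips is made up for this example, and the default zone path is an assumption: which thermal zone corresponds to the SoC on this board is not stated in the thread.

```shell
#!/bin/sh
# list_trips (hypothetical helper): print each trip point of a thermal zone
# as "index type temperature_in_C".
# Usage: list_trips [zone_dir]
list_trips() {
    # Default zone is an assumption; pass the real zone directory if it differs.
    zone=${1:-/sys/class/thermal/thermal_zone0}
    for f in "$zone"/trip_point_*_temp; do
        # Skip the unexpanded glob when the zone has no trip points.
        [ -e "$f" ] || continue
        idx=${f##*/trip_point_}
        idx=${idx%_temp}
        kind=$(cat "$zone/trip_point_${idx}_type" 2>/dev/null || echo unknown)
        milli=$(cat "$f")
        # sysfs reports millidegrees Celsius; integer degrees are enough here.
        echo "$idx $kind $(( milli / 1000 ))"
    done
}
```

Calling `list_trips` with no argument prints the trips for the default zone; after a device tree change and reboot, the same call shows whether the new values took effect.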
It happens at 4500 as well - I think it’s just summer heat!