Upgraded 23.05.2 to 23.05.5, no arp from qemu VMs

voltagex · October 1, 2024, 2:32pm

This is mostly a placeholder so I can go to bed, but I’ve reverted to 23.05.2 for now. After upgrading to https://downloads.openwrt.org/releases/23.05.5/targets/armsr/armv8/openwrt-23.05.5-armsr-armv8-generic-ext4-combined.img.gz and reinstalling qemu, kmod-vhost-net and kmod-tun, my VMs started but no ARP made it out onto the wider network.

So: Ten64 → VM 1, worked
My PC → VM 1 - no ping, no SSH, nothing.

mrcheap · October 2, 2024, 1:24am

Sounds similar to the ethernet driver / kernel update issue I had with Proxmox and VMs earlier in the year. Matt was able to provide a patched kernel as a workaround in Debian.

Wonder if an OpenWrt 23.05.5 build with the 23.05.2 kernel would work in the meantime, until Matt can tackle it.

Can you ping out to all devices from the VMs?

mcbridematt · October 2, 2024, 1:32am

Indeed. I’ll have to follow up on the issue again (see this mailing list post for the explanation).

The quick solution is to turn off vhost-net, though this will reduce network->VM performance.
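
For anyone applying that workaround with a hand-written QEMU command line, vhost-net is controlled per tap backend by the `vhost=` option of `-netdev tap`. A minimal sketch, assuming a tap interface called `tap0` and a virtio NIC; the helper name `qemu_net_args`, the `net0` id and the interface name are illustrative and not taken from this thread:

```shell
# Hypothetical helper: print the QEMU network arguments for one tap-backed NIC.
#   $1 = tap interface name (assumed to exist already, e.g. tap0)
#   $2 = "on" or "off" for vhost-net (defaults to "off")
qemu_net_args() {
    tap="$1"
    vhost="${2:-off}"
    printf '%s\n' "-netdev tap,id=net0,ifname=${tap},script=no,downscript=no,vhost=${vhost} -device virtio-net-pci,netdev=net0"
}

# Arguments for a VM on tap0 with vhost-net disabled
qemu_net_args tap0 off
```

The printed arguments would replace the existing `-netdev`/`-device` pair in the `qemu-system-aarch64` invocation. With `vhost=off` the virtio-net traffic is handled by QEMU in userspace rather than by the kernel, which is where the performance cost comes from.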

The kernel patch (revert) is here.

voltagex · October 2, 2024, 2:39am

I can’t re-test this as it was too disruptive - I run DNS inside a VM as well as Home Assistant.

voltagex · November 9, 2024, 4:02am

I need to work out how to set up OpenWrt to build an image with kernel symbols and hopefully retest this plus the crash I’ve been seeing.

mcbridematt · November 15, 2024, 9:22pm

This should work to get full symbols:

CONFIG_KERNEL_KALLSYMS=y

It’s under 'Global build settings → Kernel Build Options'.

I found that lkdtm (in the kmod-lkdtm package) works well for checking if you have full symbols embedded:

$ cat <(echo WRITE_RO) >/sys/kernel/debug/provoke-crash/DIRECT
[   33.392400] pc : lkdtm_WRITE_RO+0x3c/0x54 [lkdtm]
[   33.393061] lr : lkdtm_WRITE_RO+0x24/0x54 [lkdtm]
[   33.393260] sp : ffff80000976bce0
[   33.393460] x29: ffff80000976bce0 x28: ffff000000ace200 x27: 0000000000000000
...
[   33.397726] Call trace:
[   33.398434]  lkdtm_WRITE_RO+0x3c/0x54 [lkdtm]
[   33.398926]  lkdtm_FORTIFIED_STRSCPY+0x404/0x588 [lkdtm]
[   33.399712]  0xffff800000f862a0
[   33.400240]  full_proxy_write+0x60/0xbc
[   33.400776]  vfs_write+0xb8/0x2d0
[   33.400935]  ksys_write+0x5c/0xe0
[   33.401215]  __arm64_sys_write+0x1c/0x30
[   33.401413]  invoke_syscall.constprop.0+0x5c/0x110
[   33.401668]  do_el0_svc+0x6c/0x150
[   33.401984]  el0_svc+0x28/0xc0
[   33.402154]  el0t_64_sync_handler+0xe8/0x114
[   33.402344]  el0t_64_sync+0x1a4/0x1a8
[   33.402734] Code: f2b579a2 f0000000 ca020021 913bc000 (f9030261)
[   33.403359] ---[ end trace 2030d990d44e1844 ]---

Without full symbols, you get something like this:

[  136.366594] Call trace:
[  136.367386]  0xffff800000e45a28 [lkdtm@000000004c000d49+0xa000]
[  136.369253]  0xffff800000e452dc [lkdtm@000000004c000d49+0xa000]
[  136.371130]  0xffff800000e432a0 [lkdtm@000000004c000d49+0xa000]
[  136.372992]  0xffff80000835df54
[  136.374011]  0xffff8000082351a8
[  136.375020]  0xffff80000823552c
[  136.376027]  0xffff8000082355cc
[  136.377032]  0xffff8000080254ac
[  136.378051]  0xffff8000080255cc
[  136.379058]  0xffff800008a94f68
[  136.380064]  0xffff800008a95318
[  136.381069]  0xffff800008011638
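
If you have a saved log and want a quick check without reading every frame, counting the frames that are bare addresses gives a rough answer. This is only a sketch: the function name `count_raw_frames` is made up, and the pattern assumes the timestamped log format shown in the two traces above.

```shell
# Count call-trace frames that are raw addresses (no symbol name).
# Reads kernel log text on stdin. A frame counts as "raw" when the text
# after the [timestamp] starts with a bare 0x... address.
count_raw_frames() {
    grep -cE '^\[ *[0-9]+\.[0-9]+\] +0x[0-9a-f]+( |$)'
}

# Three sample lines copied from the traces above: two raw, one symbolized
count_raw_frames <<'EOF'
[  136.372992]  0xffff80000835df54
[   33.400776]  vfs_write+0xb8/0x2d0
[  136.367386]  0xffff800000e45a28 [lkdtm@000000004c000d49+0xa000]
EOF
# prints 2
```

On a live system the input would come from something like `dmesg | count_raw_frames`. A fully symbolized trace can still show one or two raw frames (the first trace above has one), so a small non-zero count is not a problem. A trace that is almost entirely raw addresses means the symbols are missing.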

On the vhost-net issue, I noticed NXP have made a slight tweak to the buffer handling in their own kernel. That might fix the issue, so I’ll test that first before poking them again.

mrcheap · September 13, 2025, 2:04pm

Hi @mcbridematt any update on the NXP front?

Tried a newer kernel today to see if something had changed, but still the same issue.

mcbridematt · September 15, 2025, 7:56am

Thanks for reminding me, I was going to examine some changes in NXP’s tree but I didn’t get around to it. Sometimes the NXP team are busy with their internal release schedule so they miss what is going on in the kernel lists.

I’m working through the open source backlog at the moment so I’ll take a look at it again soon.

mcbridematt · September 26, 2025, 11:13pm

Update: I have come up with a fix which I’ve been testing for the last week. It seems OK.

Except...
There appears to be another kernel bug introduced around 6.6.103 which means I haven’t been able to test it fully with muvirt. It causes the entire system to lock up and become unresponsive while sending traffic to/from a VM.
I have not seen the issue on any other kernel series (like 6.12 or 6.17-rcX).
If you are using OpenWrt or muvirt for virtualization, don’t use any of our 24.10.x (6.6 kernel) builds unless they are listed on our archive page.

mcbridematt · October 21, 2025, 9:42pm

Sorry this took so long!
A fix has been committed to the kernel netdev tree, but it is not part of any released version (yet). I’ll post again when it is.

The problem with VMs on 6.6.103 and later has been identified as:

This is a real head scratcher.. I’m just trying to rule out any issues on our side before posting to the kernel list about it.