Upgraded 23.05.2 to 23.05.5, no arp from qemu VMs

This is mostly a placeholder so I can go to bed, but I’ve reverted to 23.05.2 for now. After upgrading to https://downloads.openwrt.org/releases/23.05.5/targets/armsr/armv8/openwrt-23.05.5-armsr-armv8-generic-ext4-combined.img.gz and reinstalling qemu, kmod-vhost-net and kmod-tun, my VMs started, but no ARP from them made it out onto the wider network.

So: Ten64 → VM 1, worked
My PC → VM 1 - no ping, no SSH, nothing.

Sounds similar to the ethernet driver / kernel update issue I had with Proxmox and VMs earlier in the year. Matt was able to provide a patched kernel as a workaround in Debian.

I wonder if an OpenWrt 23.05.5 build with the 23.05.2 kernel would work in the meantime, until Matt can tackle it.

Can you ping out to all devices from the VMs?

Indeed. I’ll have to follow up the issue again (see this mailing list post for the explanation).

The quick solution is to turn off vhost-net, though this will reduce network->VM performance.
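
For anyone launching QEMU by hand, here is a minimal sketch of what turning off vhost-net might look like, assuming a tap netdev with a virtio-net device. The `net0` id, the `tap0` interface name and the `tap_netdev` helper are placeholders for illustration, not taken from the setup in this thread; the part that matters is `vhost=off` on the `-netdev tap` option.

```shell
#!/bin/sh
# Hypothetical helper: print the -netdev value for a tap backend.
# $1 = tap interface name (placeholder), $2 = on|off for vhost-net.
tap_netdev() {
    printf 'tap,id=net0,ifname=%s,script=no,downscript=no,vhost=%s\n' "$1" "$2"
}

# Example use (not run here), with the rest of the VM options elided:
#   qemu-system-aarch64 ... \
#       -netdev "$(tap_netdev tap0 off)" \
#       -device virtio-net-pci,netdev=net0
tap_netdev tap0 off
```

If the VMs are started through some other wrapper, the equivalent setting will live wherever that wrapper builds its netdev options.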

The kernel patch (revert) is here.

I can’t re-test this as it was too disruptive - I run DNS inside a VM as well as Home Assistant.

I need to work out how to set up OpenWrt to build an image with kernel symbols and hopefully retest this plus the crash I’ve been seeing.

This should work to get full symbols:

CONFIG_KERNEL_KALLSYMS=y

It’s under 'Global build settings → Kernel Build Options'.

I found that lkdtm (in the kmod-lkdtm package) works well for checking if you have full symbols embedded:

$ cat <(echo WRITE_RO) >/sys/kernel/debug/provoke-crash/DIRECT
[   33.392400] pc : lkdtm_WRITE_RO+0x3c/0x54 [lkdtm]
[   33.393061] lr : lkdtm_WRITE_RO+0x24/0x54 [lkdtm]
[   33.393260] sp : ffff80000976bce0
[   33.393460] x29: ffff80000976bce0 x28: ffff000000ace200 x27: 0000000000000000
...
[   33.397726] Call trace:
[   33.398434]  lkdtm_WRITE_RO+0x3c/0x54 [lkdtm]
[   33.398926]  lkdtm_FORTIFIED_STRSCPY+0x404/0x588 [lkdtm]
[   33.399712]  0xffff800000f862a0
[   33.400240]  full_proxy_write+0x60/0xbc
[   33.400776]  vfs_write+0xb8/0x2d0
[   33.400935]  ksys_write+0x5c/0xe0
[   33.401215]  __arm64_sys_write+0x1c/0x30
[   33.401413]  invoke_syscall.constprop.0+0x5c/0x110
[   33.401668]  do_el0_svc+0x6c/0x150
[   33.401984]  el0_svc+0x28/0xc0
[   33.402154]  el0t_64_sync_handler+0xe8/0x114
[   33.402344]  el0t_64_sync+0x1a4/0x1a8
[   33.402734] Code: f2b579a2 f0000000 ca020021 913bc000 (f9030261)
[   33.403359] ---[ end trace 2030d990d44e1844 ]---

Without, you get something like this:

[  136.366594] Call trace:
[  136.367386]  0xffff800000e45a28 [lkdtm@000000004c000d49+0xa000]
[  136.369253]  0xffff800000e452dc [lkdtm@000000004c000d49+0xa000]
[  136.371130]  0xffff800000e432a0 [lkdtm@000000004c000d49+0xa000]
[  136.372992]  0xffff80000835df54
[  136.374011]  0xffff8000082351a8
[  136.375020]  0xffff80000823552c
[  136.376027]  0xffff8000082355cc
[  136.377032]  0xffff8000080254ac
[  136.378051]  0xffff8000080255cc
[  136.379058]  0xffff800008a94f68
[  136.380064]  0xffff800008a95318
[  136.381069]  0xffff800008011638
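
If you want to check this without eyeballing the log, here is a rough sketch that counts symbolized versus raw-address frames in a "Call trace:" section read from stdin. The `count_frames` name is made up for this example, and it only assumes the frame formats shown in the two traces above (`name+0xoff/0xsize` versus a bare `0x...` address), so treat it as a starting point rather than a robust parser.

```shell
#!/bin/sh
# Count symbolized vs raw-address frames in a kernel "Call trace:" on stdin.
# Symbolized frame: "name+0x3c/0x54"; raw frame: starts with a bare "0x..." address.
count_frames() {
    awk '
        /Call trace:/ { in_trace = 1; next }
        in_trace {
            line = $0
            sub(/^\[[ 0-9.]*\] */, "", line)   # drop the "[  33.398434] " timestamp
            if (line ~ /^0x[0-9a-f]+/)
                raw++
            else if (line ~ /^[A-Za-z_][A-Za-z0-9_.]*\+0x/)
                sym++
            else
                in_trace = 0                   # anything else ends the trace
        }
        END { printf "symbolized=%d raw=%d\n", sym, raw }
    '
}

# Example use on a device after provoking the crash:
#   dmesg | count_frames
```

A build with full symbols should report mostly symbolized frames; a build without them should report only raw ones.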

On the vhost-net issue, I noticed NXP have made a slight tweak to the buffer handling in their own kernel. That might fix the issue, so I’ll test that first before poking them again.