Suspect ten64 is corrupting packets

DrJosh9000 · December 1, 2022, 10:20am

Hi forum,

I’m wondering if I’ve made a config mistake in U-boot or somewhere else, or just have a slightly dodgy unit.

I have some packet captures that show that my ten64 is altering the UDP header checksums of some packets. I’m not sure what the exact circumstances for the corruption are, but I can reliably reproduce it by using Minecraft for Windows to connect to a friend’s server: the ten64 seems to be writing 2 seemingly random octets into the UDP header checksum of 60-byte (on the wire) RakNet ACKs. I haven’t gone looking for more instances of corruption (e.g. for TCP) but I reckon there are some.

Topology:

Windows PC (Intel I225-V) <-> ten64 (OVS / Debian 11) <-> Google Wifi <-> internet

I’ve replicated the problem using both Open vSwitch (both with and without an SDN controller), and iproute2 bridge. I haven’t attempted to make the ten64 a router. Wiresharking on the internet side shows incoming packets are arriving with correct checksums.

Got any hints?

mcbridematt · December 1, 2022, 11:20pm

My first guess is that this is a problem with hardware checksum offloads and/or something in the network stack has incorrectly set the packet size or head/tail pointers.

Can you try disabling the hardware checksum offloads on the interface:

ethtool --offload eth0 rx off tx off

So I can try to reproduce the problem, what kernel version are you using? Is it the Debian standard kernel?

It could be worth trying a quick test with OpenWrt booted from NAND as well: reboot and select OpenWrt/NAND from the boot menu.

DrJosh9000 · December 2, 2022, 4:32am

Thanks for taking a look. I’ll try disabling the checksum offload soon. When I was doing the captures I was playing with linux-image-traverse torvalds (6.0.0-rc6). I’ll go back and try the stock kernel as well.

DrJosh9000 · December 3, 2022, 4:11am

Disabling offload (on the port the PC is attached to) has fixed it.
The behaviour is the same on kernel 5.10.0-19: the problem appears with offload enabled and goes away with it disabled.

mcbridematt · December 5, 2022, 2:12am

Just to narrow things down a bit, can you try disabling scatter/gather egress support on its own:

$ ethtool -K eth0 sg off
Actual changes:
tx-scatter-gather: off
tx-generic-segmentation: off [not requested]

Scatter/gather support was introduced just before 5.10, so that would line up with both kernel versions.

DrJosh9000 · December 5, 2022, 9:35am

Enabling/disabling scatter/gather doesn’t seem to have any effect.

In case it’s any use, here’s a sample of the pcap (just packets with bad checksums): https://drive.google.com/file/d/1Nz3-oekBlAB3IEz1n8qNTN-sjdKpZpBJ/view?usp=sharing

mcbridematt · December 19, 2022, 3:18am

Thanks for the pcap. I’ve just come back from an overseas trip so haven’t been able to look into it until today.

I have been able to reproduce the incorrect checksum issue when replaying the pcap through tcpreplay-edit -C, but not when running Minecraft Bedrock myself.

Looking at your packet capture and a capture from my own Bedrock session, the difference is that the frames in your capture have non-zero Ethernet padding.

So it looks like the invalid checksums are being generated because the non-zero padding is being included in the checksum calculation (anything past the IP and UDP sections should be zero).
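To illustrate why non-zero padding matters, here is a minimal Python sketch of the Internet checksum arithmetic. It is not the driver or hardware code. The addresses, ports, payload and padding bytes are made up, and the idea that the faulty sum simply runs on over the trailing padding is an assumption about the failure mode, not something confirmed from the hardware.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum as used by the Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # odd-length input is padded with one zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_len: int, covered: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus `covered` bytes.

    `covered` should be exactly the UDP header (checksum field zeroed) and payload.
    """
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_len)
    csum = ~ones_complement_sum(pseudo + covered) & 0xFFFF
    return csum or 0xFFFF  # a computed 0 is transmitted as 0xFFFF


# Illustrative values only (documentation-style addresses, arbitrary payload).
SRC = bytes([192, 168, 1, 10])
DST = bytes([203, 0, 113, 5])
payload = bytes(range(1, 8))            # 7 bytes: 14 + 20 + 8 + 7 = 49-byte frame
udp_len = 8 + len(payload)
segment = struct.pack("!HHHH", 50000, 19132, udp_len, 0) + payload

zero_pad = bytes(11)                                  # pads 49 bytes up to 60
dirty_pad = bytes.fromhex("deadbeef010203") + bytes(4)  # hypothetical non-zero padding

correct = udp_checksum(SRC, DST, udp_len, segment)
with_zero_pad = udp_checksum(SRC, DST, udp_len, segment + zero_pad)
with_dirty_pad = udp_checksum(SRC, DST, udp_len, segment + dirty_pad)

print(f"correct:        {correct:#06x}")
print(f"zero padding:   {with_zero_pad:#06x}")
print(f"dirty padding:  {with_dirty_pad:#06x}")
```

Zero padding adds nothing to a one's-complement sum, so summing past the end of the datagram goes unnoticed until a sender leaves non-zero bytes in the padding.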

I’ll have to dig through the Ethernet driver to see how this can be fixed.

DrJosh9000 · December 20, 2022, 11:54pm

Thanks for looking into it, I really appreciate the time and effort.

mcbridematt · January 20, 2023, 4:49am

I’ve been probing this issue again while testing 6.x kernels.
It’s definitely the non-zero padding causing the incorrect UDP checksum calculation.

There isn’t a nice way to solve it in the Ethernet driver, as the frame is always seen by the kernel as either 56 bytes (up to the last non-zero byte in the padding) or 60 bytes instead of 49, so we would need to peek at the IP+UDP headers to determine if there is a potential padding issue.
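As a rough illustration of what "peeking at the headers" involves, the Python sketch below derives the real end of an untagged IPv4 frame from the IP total length field and splits off whatever follows as padding. The function name `split_padding` and the sample frame are hypothetical; this is not driver or XDP code, and it ignores VLAN tags and IPv6.

```python
import struct
from typing import Tuple

ETH_HLEN = 14          # destination MAC + source MAC + EtherType
ETHERTYPE_IPV4 = 0x0800


def split_padding(frame: bytes) -> Tuple[bytes, bytes]:
    """Split an untagged Ethernet II / IPv4 frame into (real frame, trailing padding)."""
    if len(frame) < ETH_HLEN + 20:
        return frame, b""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return frame, b""
    # IPv4 total length lives at bytes 2..3 of the IP header.
    (total_len,) = struct.unpack_from("!H", frame, ETH_HLEN + 2)
    end = ETH_HLEN + total_len
    if end > len(frame):
        return frame, b""  # truncated or malformed; leave untouched
    return frame[:end], frame[end:]


# Hypothetical 60-byte frame carrying a 7-byte UDP payload and dirty padding.
eth = bytes(6) + bytes(6) + struct.pack("!H", ETHERTYPE_IPV4)
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + 8 + 7, 0, 0, 64, 17, 0,
                 bytes([192, 168, 1, 10]), bytes([203, 0, 113, 5]))
udp = struct.pack("!HHHH", 50000, 19132, 8 + 7, 0) + bytes(range(1, 8))
padding = bytes.fromhex("deadbeef010203") + bytes(4)
frame = eth + ip + udp + padding

real, pad = split_padding(frame)
print(len(frame), len(real), len(pad), any(pad))
```

A fix along these lines could either trim the frame to `real` or zero the padding before the checksum is computed; which of those is appropriate here is an open question.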

Also, patches dealing with this issue (in other drivers) have been rejected by the kernel maintainers before.

XDP looks like the best way to ‘patch’ this issue. It would make a nice, simple XDP test case and can hopefully be used across many different kernel versions.

I’ll send my findings to NXP as well; it’s possible that an MC firmware update could fix the issue.