Reboot failure using Debian 6.1.8-1 kernel on one box of two

Continuing the discussion from Reboot command regression from 5.10 to 5.15 kernel:

Moving the discussion over here since it is related but new…

I’ve uploaded complete console captures (from power-on through reboot) for my two boxes:

Working - https://km6g.us/~kpfleming/core-a-clean.txt
Not Working - https://km6g.us/~kpfleming/core-b-clean.txt

I’ve removed all the kernel and userspace timestamps so the files are relatively diff-able. The first thing I notice in the diff output is that during very early boot (the INFO messages) various values differ, but I don’t know of any reason why they should, as the hardware in the boxes is identical.
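For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of stripping kernel timestamps and diffing two captures. It assumes the timestamps look like `[    1.234567] ` at the start of a line; the function names and the sample lines are made up for illustration, and userspace timestamps would need their own pattern.

```python
import difflib
import re

# Assumed kernel timestamp format: "[    1.234567] " at the start of a line.
KERNEL_TS = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Return the lines with line endings and any leading kernel timestamp removed."""
    return [KERNEL_TS.sub("", line.rstrip("\r\n")) for line in lines]


def diff_captures(lines_a, lines_b, name_a="core-a", name_b="core-b"):
    """Unified diff of two console captures, ignoring kernel timestamps."""
    return list(
        difflib.unified_diff(
            strip_timestamps(lines_a),
            strip_timestamps(lines_b),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real captures.
    a = ["[    0.000000] Booting Linux", "[    1.500000] nvme link x2 gen1"]
    b = ["[    0.000000] Booting Linux", "[    1.720000] nvme link x2 gen3"]
    print("\n".join(diff_captures(a, b)))
```

With real captures, read each file with `open(path).readlines()` and pass the two lists to `diff_captures`.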

I also see that the NVMe drive in one box is reported as “x2 gen1” and in the other box as “x2 gen3”; the drives are identical and should be gen3 in both boxes, but curiously the box that works has the drive listed as gen1.

The diff doesn’t show any differences in the versions of BL2, BL3, or U-Boot.

At this point I’d like to find out why there are these various small differences in the early boot information for boxes that should be identical.

The reboot issue has been resolved using the lts-6-1 kernel from the Traverse repository. I’d still like to figure out why these boxes are reporting differences in the various registers/settings reported before BL2 starts.

It appears that 6.1.12 also corrected the PCIe setup problem; both machines are reporting 8.0 GT/s x2 links for the SSDs now.
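A quick way to check the negotiated link from a running system, without reading through dmesg, is the PCI sysfs attributes. This is a minimal sketch: the function name is my own, and the device address has to be supplied by you (for example from `lspci -D`).

```python
import sys
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_status(device, sysfs_root="/sys/bus/pci/devices"):
    """Read negotiated and maximum PCIe link parameters for one device.

    Attributes that the device does not expose are returned as None.
    """
    dev_dir = Path(sysfs_root) / device
    status = {}
    for attr in LINK_ATTRS:
        path = dev_dir / attr
        status[attr] = path.read_text().strip() if path.exists() else None
    return status


if __name__ == "__main__":
    # Pass the full PCI address (domain:bus:device.function) as the argument.
    if len(sys.argv) != 2:
        sys.exit("usage: linkstatus.py <pci-address>")
    for name, value in read_link_status(sys.argv[1]).items():
        print(f"{name}: {value}")
```

If `current_link_speed` is lower than `max_link_speed` after Linux has booted, the link really did train at a reduced rate.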

I now understand why one box had problems rebooting and the other did not: one of them arrived with sfpmode set to legacy (which also meant that the SFP+ slots didn’t actually work). Once I removed the sfpmode variable from the U-Boot environment, the SFP+ slots began working as they should.

I’ll re-do the boot captures to see if any other differences between the boxes remain.

@mcbridematt Should I be concerned at all that the out-of-box configuration on this unit was incorrect?

The sfpmode variable (and gbemode as well) is usually set by baremetal-deploy based on what appliance/distribution you installed. The firmware defaults to managed mode (sfpmode/gbemode not set), though most mainstream distributions need legacy SFP mode because the driver was broken up to kernel 6.2.
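To make the "not set means managed" rule concrete, here is a small Python sketch that takes a dump of the U-Boot environment (the `name=value` lines that `printenv` prints) and reports the effective mode. The function names are hypothetical, and the only rule it encodes is the one described above: an unset variable means managed mode.

```python
def parse_uboot_env(text):
    """Parse `name=value` lines, as printed by U-Boot's printenv, into a dict.

    Lines without an '=' (such as the trailing size summary) are skipped.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            env[name.strip()] = value.strip()
    return env


def port_mode(env, variable):
    """Return the mode for 'sfpmode' or 'gbemode'; unset or empty means managed."""
    return env.get(variable) or "managed"


if __name__ == "__main__":
    # Made-up environment dump for illustration.
    dump = "bootdelay=3\nsfpmode=legacy\n"
    env = parse_uboot_env(dump)
    print("sfpmode:", port_mode(env, "sfpmode"))
    print("gbemode:", port_mode(env, "gbemode"))
```

Comparing this output across two boxes would show straight away whether one of them still carries a setting left by an earlier install.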

Did you move the SSD from the original box (where sfpmode would have been set to legacy) to the other one? The variables are set in the board flash so won’t move with the SSD/media.

If you need to get the SFPs working on a kernel without “working” SFP+ support, usually you just need to activate the TXDISABLE GPIO manually:
https://ten64doc.traverse.com.au/network/sfp/#example-sfp-setup-standard-linux

Also:

It’s not unusual for PCIe devices to be detected at a slower speed in U-Boot; usually they come good when Linux rescans the PCIe bus. In this case it did:

pci 0002:01:00.0: 15.752 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x2 link at 0002:00:00.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)

Note:

  • The SSD/M.2 slot only has two lanes (out of the 4 total possible on the LS1088), though they are PCIe Gen3 (8.0GT/s) capable. So 15.75Gb/s is the maximum possible from the board.
  • The PCIe switch at 0002:01:00.0 always complains about only having 1 lane out of two possible. We picked the two-lane version as it’s easier to work with, but don’t have an extra PCIe lane to give it.
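The figures in that log line follow from the line encoding: 8.0 GT/s links use 128b/130b encoding, while 2.5 and 5.0 GT/s links use 8b/10b. A short Python sketch of the arithmetic is below. It uses integer division per lane, which is my assumption about how the kernel arrives at 15.752 rather than 15.754; the names are my own.

```python
# Usable per-lane rate in Mb/s after line encoding (integer division).
# 2.5 and 5.0 GT/s use 8b/10b encoding; 8.0 GT/s uses 128b/130b.
PER_LANE_MBPS = {
    "2.5 GT/s": 2500 * 8 // 10,      # Gen1
    "5.0 GT/s": 5000 * 8 // 10,      # Gen2
    "8.0 GT/s": 8000 * 128 // 130,   # Gen3
}


def link_bandwidth_gbps(speed, lanes):
    """Available bandwidth in Gb/s for a link of the given speed and width."""
    return PER_LANE_MBPS[speed] * lanes / 1000


if __name__ == "__main__":
    print(link_bandwidth_gbps("8.0 GT/s", 2))  # the M.2 slot: Gen3 x2
    print(link_bandwidth_gbps("8.0 GT/s", 4))  # what a Gen3 x4 link would give
    print(link_bandwidth_gbps("2.5 GT/s", 2))  # the same slot if it trains at Gen1
```

The first two results match the 15.752 Gb/s and 31.504 Gb/s in the log line; the last shows how much is lost when an x2 link only trains at Gen1.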

Some brands of NVMe controller are more likely to exhibit speed issues. I’m aware of issues with more recent Silicon Motion controllers but most other brands have been fine.

We are working through such an issue [PCIe link problems] with one vendor at the moment and might try to improve the PCIe routing in a future PCB revision.

Thanks! I understand what happened now.

On box A, I used baremetal-deploy to install Debian Bookworm onto a USB flash drive. That would have set sfpmode to legacy, presumably. I then booted that and used it to install onto the SSD.

On box B I booted the same USB flash drive and used it to install onto the SSD, but since I didn’t use baremetal-deploy there, the sfpmode setting was unchanged.