Weird Grub behavior with freshly-deployed debian-stable 12

I tried installing debian-stable onto my NVMe drive using baremetal-deploy and ran into a number of issues:

  • Grub doesn’t respect the timeout, booting the default menu item immediately, even if GRUB_TIMEOUT_STYLE=menu and GRUB_TIMEOUT=30
  • Grub config is missing the arm-smmu.disable_bypass=0 parameter, causing it to fail to finish booting
  • Because of the missing parameter, I had to boot recovery and mount the drive to fix it
  • The boot stalled for a long time waiting for systemd-networkd-wait-online. According to the systemd status message, there was no limit to how long it would wait, so I didn’t wait around to see how long it actually waited. I ended up just booting recovery and masking that service instead.
  • The installed OS wouldn’t accept the password I had entered for any user I could guess or for root
  • I had to boot Recovery again to fix the password, at which point I found that it had ignored my attempt to tell it to use eth9 and used eth0 instead. (Maybe I typed eth9 in answer to a yes/no question instead of at the actual prompt asking which device to use?)
  • Grub doesn’t seem to see any serial devices (as shown by terminal_input commands I added). I thought that was a factor in it ignoring the menu, but it’s a red herring.
  • Something (Grub? System firmware?) is trying to load a kernel parameter and garbage as EFI loaders
  • Something (Grub? System firmware?) is waiting for DPMAC links instead of showing Grub’s menu
  • The only way I could get Grub to actually show me the menu was to edit the /etc/grub.d/ scripts to remove the “if” statement guarding the addition of “UEFI Firmware Setup” and then set that as default to make it fail when it tries to forcibly boot the default entry.
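
The manual fixes in the list above (adding the missing kernel parameter and masking the wait-online service from recovery) can be sketched roughly as follows. This is only a sketch under assumptions: it assumes the kernel command line is set through GRUB_CMDLINE_LINUX in /etc/default/grub (the image may set it elsewhere, for example under /etc/default/grub.d/), and that the installed root filesystem is mounted at /mnt in recovery. add_kernel_param and mask_unit are hypothetical helper names, not part of baremetal-deploy.

```shell
#!/bin/sh
# Hypothetical helpers, not part of baremetal-deploy.

# Append a kernel parameter to GRUB_CMDLINE_LINUX in a grub defaults file,
# unless the parameter is already there.
add_kernel_param() {
    file="$1"
    param="$2"
    if grep -q "^GRUB_CMDLINE_LINUX=.*${param}" "$file"; then
        return 0    # already present, nothing to do
    fi
    if grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        # Re-emit the existing value with the new parameter appended.
        sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" "$file"
    else
        # No such line yet: add one.
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$param" >> "$file"
    fi
}

# Mask a systemd unit in an offline root by symlinking it to /dev/null,
# which is what "systemctl mask" does on a running system.
mask_unit() {
    root="$1"
    unit="$2"
    mkdir -p "$root/etc/systemd/system"
    ln -sf /dev/null "$root/etc/systemd/system/$unit"
}

# Example use from recovery (assumes the installed root is mounted at /mnt):
#   add_kernel_param /mnt/etc/default/grub arm-smmu.disable_bypass=0
#   mask_unit /mnt systemd-networkd-wait-online.service
```

After changing /etc/default/grub, grub.cfg still has to be regenerated with update-grub from within the installed system (for example in a chroot) before the new parameter takes effect.
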
Boot console output:

fsl-mc: Booting Management Complex ... SUCCESS
fsl-mc: Management Complex booted (version: 10.36.0, boot status: 0x1)
device 0 offset 0x580000, size 0x40000
SF: 262144 bytes @ 0x580000 Read: OK
default DPL loaded
Autoboot in 5 seconds, press 's' to stop and bring up boot menu
Scanning for bootflows in all bootdevs
Seq  Method       State   Uclass    Part  Name                      Filename
---  -----------  ------  --------  ----  ------------------------  ----------------
Scanning global bootmeth 'efi_mgr':
Hunting with: simple_bus
Found 0 extension board(s).
Scanning bootdev 'nvme#0.blk#1.bootdev':
  0  efi          ready   nvme         f  nvme#0.blk#1.bootdev.part efi/boot/bootaa64.efi
** Booting bootflow 'nvme#0.blk#1.bootdev.part_f' with efi
Working FDT set to 90000000
TEN64: Using legacy fan device tree overlay
Missing TPMv2 device for EFI_TCG_PROTOCOL
INFO:    RNG Desc SUCCESS with status 0
INFO:    result a21aba44f5cb671c
Booting /efi\boot\bootaa64.efi
Failed to open efi\boot\console=ttyS0,115200 - Not Found
Failed to load image 灀?�: Not Found
start_image() returned Not Found, falling back to default loader
Welcome to GRUB!

DPMAC7@qsgmii Waiting for PHY auto negotiation to complete......... TIMEOUT !
DPMAC7@qsgmii: Could not initialize
DPMAC8@qsgmii Waiting for PHY auto negotiation to complete......... TIMEOUT !
DPMAC8@qsgmii: Could not initialize
DPMAC9@qsgmii Waiting for PHY auto negotiation to complete......... TIMEOUT !
DPMAC9@qsgmii: Could not initialize
DPMAC10@qsgmii Waiting for PHY auto negotiation to complete......... TIMEOUT !
DPMAC10@qsgmii: Could not initialize
Active input terminals:
console
Available input terminals:
serial_*
Active output terminals:
console
Available output terminals:
gfxterm serial_*
  Booting `UEFI Firmware Settings'

error: can't find command `fwsetup'.

Press any key to continue...
<finally the menu shows up>

Thanks for the report… can I just check what version of the recovery firmware you have?

Either the one included with the v0.9.1 firmware or the standalone one from archive.traverse.com.au should work:

# With 0.9.1
root@recovery000afa2424fd:~# grep OPENWRT_RELEASE /etc/os-release
OPENWRT_RELEASE="muvirt 22.03-SNAPSHOT+traverse r0+20138-46c43831bb"
# Latest version on archive.traverse.com.au
root@recovery000afa2424fd:/# grep OPENWRT_RELEASE /etc/os-release
OPENWRT_RELEASE="muvirt 23.05-SNAPSHOT r0+23779-5c8244842f"

I am aware of one bug in earlier recovery versions: “Special” characters in the password were not encoded correctly, making the cloud-config impossible to parse. Newer versions now write a shadow-encoded (and salted) SHA512 string instead of the plaintext password.

Recent Fedora releases also require an updated recovery environment, but the manipulator (“post-deploy” script) should tell you if that is the case.

The steps that baremetal-deploy performs after cloning the Debian image to disk are in this Lua script.

It sounds like it has failed to mount the new root filesystem, so none of the steps in that script have been executed.

Is it possible that your NVMe drive already had a partition layout on it that would stop the kernel (in the recovery environment) from loading the new partition table?

For example: I know having LVM daemons active can cause problems, even if you don’t have the LVM volumes activated and mounted.
The best workaround is to wipe the device first, and then reboot to download and write the OS image:

# First boot
blkdiscard -f /dev/nvme0n1
reboot
# Second boot
baremetal-deploy debian-stable /dev/nvme0n1
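
One way to check whether the recovery kernel is still holding on to an old partition table before deploying is to look at /proc/partitions. The helper below is a minimal sketch: list_partitions is a hypothetical name, and the optional second argument exists only so the function can be tried against a saved copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the partitions the kernel currently knows about
# for a given whole-disk device name (e.g. nvme0n1 or sda).
list_partitions() {
    dev="$1"
    table="${2:-/proc/partitions}"
    awk -v d="$dev" '
        # Devices whose name ends in a digit (nvme0n1) use a "p" before
        # the partition number; others (sda) do not.
        BEGIN { sep = (d ~ /[0-9]$/) ? "p" : "" }
        $4 ~ ("^" d sep "[0-9]+$") { print $4 }
    ' "$table"
}

# Example use (not run here):
#   list_partitions nvme0n1
```

If this still prints entries after the blkdiscard, the kernel presumably has not dropped the old table yet, which is what the reboot between the two steps takes care of.
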

I agree, this one is definitely annoying!
Because of the way GRUB works, it doesn’t actually know the exact device it was loaded from, so it does a device search to find the file system with the configured UUID (to grab the full grub.cfg). Unfortunately, this search cycles through network devices first, which is why it sits waiting on the DPMAC links before the menu appears.

cat /boot/efi/EFI/debian/grub.cfg
search.fs_uuid 2f498f01-7f4f-4b4a-9efd-fd33122e2c73 root
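
To see which file system that stub points at, the UUID can be pulled out of the file and looked up with blkid. A small sketch, where stub_uuid is a hypothetical helper name and the path is the one shown above:

```shell
#!/bin/sh
# Hypothetical helper: print the UUID from the first search.fs_uuid line
# of a GRUB stub config.
stub_uuid() {
    awk '$1 == "search.fs_uuid" { print $2; exit }' "$1"
}

# Example use on the installed system (not run here):
#   blkid -U "$(stub_uuid /boot/efi/EFI/debian/grub.cfg)"
# blkid -U prints the block device carrying that file system UUID, if any.
```

If blkid finds no device for the UUID, GRUB's search would not find the full grub.cfg either.
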

It’s been on my wishlist for a while to prevent this from happening, by removing network devices from the default boot list.

The U-Boot version we have is old and has a few issues with EFI binaries passing parameters between each other. I am working on an update at the moment which should make all those issues go away (the most recent U-Boot releases have an almost ‘complete’ EFI stack), but that is pending resolution of some device tree issues with the fan controller (I’ll explain in the other thread).