Trouble with SFP on upstream (Arch) kernel 6.2

I’m having some trouble getting an upstream kernel to run reliably. I’m running the Arm version of Arch Linux and using their upstream kernels. I have a working 5.17 kernel with the SFP ports in legacy mode, and a script that exports GPIO pin 373, sets its direction to out and writes 0 to it.

However, when trying a 6.2 kernel (linux-aarch64-6.2.2-1) I could not get any connectivity on the SFP port at all.

  • When keeping the ports in legacy mode, the GPIO script fails (looks like the export of pin 373 fails), so I get no connectivity
  • When switching the SFP ports to managed mode, the SFP module fails to initialise properly:
# cat /sys/kernel/debug/dpmac1_sfp/state 
Module state: waitdev
Module probe attempts: 0 0
Device state: detached
Main state: down
Fault recovery remaining retries: 0
PHY probe remaining retries: 0
moddef0: 1
rx_los: 0
tx_fault: 0
tx_disable: 1

# ethtool -m wan
netlink error: Operation not supported

I’ve downgraded the kernel back to 5.17 for now, but I’d like to be able to run newer kernels, ideally with the SFP ports in managed mode. Any hints for how I can get this working would be appreciated!

I also hit another bug on the 6.2 kernel, but that appears to be unrelated to the SFP issue…

A few things to check:

Do the insertion and removal of the SFP generate events in dmesg?

# Insert
sfp dpmac2_sfp: module FS               SFPP-PC01        rev R    sn F1930247305-1    dc 200917
# Remove
sfp dpmac2_sfp: module removed
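
If it helps, here is a rough sketch for counting those events instead of eyeballing the log. The function name `sfp_events` is made up for this example, and the patterns only assume the two message shapes shown above.

```shell
# Summarise SFP hotplug events from kernel log lines read on stdin.
# Matches "sfp <name>: module removed" and "sfp <name>: module <vendor> ...".
sfp_events() {
    awk '
        /sfp [^ ]+: module removed/ { removed++; next }
        /sfp [^ ]+: module /        { inserted++ }
        END { printf "inserted=%d removed=%d\n", inserted, removed }
    '
}

# Typical use on the router:
#   dmesg | sfp_events
```

A cage that reports `inserted=0` after a module was plugged in suggests the kernel never saw the module at all.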

Is your kernel detecting all I2C busses?

You should have three I2C busses from the system, plus a mux/switch on i2c-2 (PCA9540) which handles the two SFP cages.

# ls -la /sys/bus/i2c/devices/
total 0
drwxr-xr-x 2 root root 0 Mar 10 22:08 .
drwxr-xr-x 4 root root 0 Mar 10 22:08 ..
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0011 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0011
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0018 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0018
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-001a -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-001a
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0029 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0029
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-002f -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-002f
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-004c -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-004c
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0061 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0061
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0063 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0063
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0065 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0065
lrwxrwxrwx 1 root root 0 Mar 10 22:08 0-0076 -> ../../../devices/platform/soc/2000000.i2c/i2c-0/0-0076
lrwxrwxrwx 1 root root 0 Mar 10 22:08 1-0032 -> ../../../devices/platform/soc/2020000.i2c/i2c-1/1-0032
lrwxrwxrwx 1 root root 0 Mar 10 22:08 2-0070 -> ../../../devices/platform/soc/2030000.i2c/i2c-2/2-0070
lrwxrwxrwx 1 root root 0 Mar 10 22:08 i2c-0 -> ../../../devices/platform/soc/2000000.i2c/i2c-0
lrwxrwxrwx 1 root root 0 Mar 10 22:08 i2c-1 -> ../../../devices/platform/soc/2020000.i2c/i2c-1
lrwxrwxrwx 1 root root 0 Mar 10 22:08 i2c-2 -> ../../../devices/platform/soc/2030000.i2c/i2c-2
lrwxrwxrwx 1 root root 0 Mar 10 22:08 i2c-3 -> ../../../devices/platform/soc/2030000.i2c/i2c-2/i2c-3
lrwxrwxrwx 1 root root 0 Mar 10 22:08 i2c-4 -> ../../../devices/platform/soc/2030000.i2c/i2c-2/i2c-4

If you have i2cdetect, running it on i2c-3 (eth9 / upper) or i2c-4 (eth8 / lower) should show the module EEPROM present on address 0x50:

i2cdetect 4
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-4.
I will probe address range 0x08-0x77.
Continue? [Y/n] y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:                         -- -- -- -- -- -- -- --
...
50: 50 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: UU -- -- -- -- -- -- --

Some “active” SFPs (like 1/10GBase-T) will additionally expose their internal phy on another I2C address.
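
For scripting this check, a minimal sketch that parses the i2cdetect grid for a given address. `has_i2c_addr` is a hypothetical helper name. It relies on `i2cdetect -y`, which skips the interactive prompt, and expects the address as two lowercase hex digits without the `0x` prefix.

```shell
# Succeed if the i2cdetect grid on stdin lists the given address.
# The first field of each row is the row label ("50:"), so it is skipped.
# Cells showing "UU" (address claimed by a kernel driver) do not count.
has_i2c_addr() {
    awk -v a="$1" '
        { for (i = 2; i <= NF; i++) if (($i "") == (a "")) found = 1 }
        END { exit !found }
    '
}

# Typical use on the router (bus 4 is the lower cage in the listing above):
#   i2cdetect -y 4 | has_i2c_addr 50 && echo "module EEPROM answers at 0x50"
```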

If you then run i2cdump on the module EEPROM, it should read out successfully:

# For eth8 SFP: 
$ i2cdump 4 0x50
No size specified (using byte-data access)
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-4, address 0x50, mode byte
Continue? [Y/n] y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 03 04 21 00 00 00 00 00 04 00 00 00 67 00 00 00    ??!.....?...g...
10: 00 00 01 00 46 53 20 20 20 20 20 20 20 20 20 20    ..?.FS
20: 20 20 20 20 00 00 40 20 53 46 50 50 2d 50 43 30        ..@ SFPP-PC0
30: 31 20 20 20 20 20 20 20 52 20 20 20 01 00 00 3a    1       R   ?..:
40: 00 00 00 00 46 31 39 33 30 32 34 37 33 30 35 2d    ....F1930247305-
(truncated)
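
To sanity-check a dump by hand, the ASCII fields can be decoded from the hex bytes. In the dump above the vendor name starts at offset 0x14 and the part number at 0x28, which matches the usual SFP EEPROM layout (vendor name in bytes 20 to 35, part number in bytes 40 to 55). The helper name `hex_to_ascii` is invented for this sketch, and it is only meant for printable ASCII fields, not for arbitrary bytes.

```shell
# Turn a list of hex bytes into text and strip the space padding at the end.
hex_to_ascii() {
    fmt=""
    for h in "$@"; do
        # Convert each hex byte into an octal escape that printf understands.
        fmt="$fmt\\$(printf '%03o' "0x$h")"
    done
    printf "$fmt\n" | sed 's/ *$//'
}

# Vendor name bytes from the dump above (offset 0x14 onwards):
hex_to_ascii 46 53 20 20 20 20 20 20 20 20 20 20 20 20 20 20   # prints FS
```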

It looks like 6.2 changed the GPIO chip numbering/base as well. What is your output of ls -la /sys/class/gpio?

ls -la /sys/class/gpio/
total 0
drwxr-xr-x  2 root root    0 Mar 10 22:45 .
drwxr-xr-x 67 root root    0 Mar 10 22:45 ..
--w-------  1 root root 4096 Mar 10 22:50 export
lrwxrwxrwx  1 root root    0 Mar 10 22:45 gpiochip512 -> ../../devices/platform/soc/2300000.gpio/gpio/gpiochip512
lrwxrwxrwx  1 root root    0 Mar 10 22:45 gpiochip544 -> ../../devices/platform/soc/2310000.gpio/gpio/gpiochip544
lrwxrwxrwx  1 root root    0 Mar 10 22:45 gpiochip576 -> ../../devices/platform/soc/2320000.gpio/gpio/gpiochip576
lrwxrwxrwx  1 root root    0 Mar 10 22:45 gpiochip608 -> ../../devices/platform/soc/2330000.gpio/gpio/gpiochip608
lrwxrwxrwx  1 root root    0 Mar 10 22:45 gpiochip640 -> ../../devices/platform/soc/2000000.i2c/i2c-0/0-0076/gpio/gpiochip640

So anything that referenced GPIOs 368-383 should be re-based to 640+, keeping the same offset from the chip base (for example, 373 becomes 645).
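
The re-basing is just an offset calculation: keep the line's offset within the chip and add it to the new base. A small sketch, with `rebase_gpio` being a made-up helper name:

```shell
# Usage: rebase_gpio OLD_PIN OLD_BASE NEW_BASE
# Prints the pin number under the new chip base.
rebase_gpio() {
    old_pin=$1
    old_base=$2
    new_base=$3
    echo $(( new_base + old_pin - old_base ))
}

rebase_gpio 373 368 640   # prints 645
```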

Thank you for the pointers! Will need to get back to the machine to get physical access to try things out, which won’t be until next week.

One comment on this bit, though:

Isn’t that a major UAPI break? :confused:

I guess so!

Sounds like it was intentional:
https://lore.kernel.org/lkml/cover.1662116601.git.christophe.leroy@csgroup.eu/T/

Main changes in v2:
...

Moving the base of dynamic allocation from 256 to 512 because there
are drivers allocating gpios as high as 400.

I should add gpio-line-names to our device tree so they will be easier to find in the future…
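
Until line names exist, a setup script can avoid hardcoding the base by looking up the chip through its device path. This is only a sketch under some assumptions: the legacy sysfs GPIO interface is still enabled, the expander is the `0-0076` device seen in the gpiochip640 symlink above, and the line offset (5 for the old pin 373) is passed in by the caller. The function names and the `GPIO_SYSFS` variable are invented for the example.

```shell
GPIO_SYSFS=${GPIO_SYSFS:-/sys/class/gpio}

# Print the base number of the gpiochip that sits behind the given device,
# matched against the resolved sysfs path (e.g. "0-0076").
gpiochip_base_for() {
    for chip in "$GPIO_SYSFS"/gpiochip*; do
        [ -e "$chip/base" ] || continue
        case "$(readlink -f "$chip")" in
            */"$1"/*) cat "$chip/base"; return 0 ;;
        esac
    done
    return 1
}

# Usage: set_line_low DEVICE OFFSET
# Exports the line if needed, sets it to output and drives it low.
set_line_low() {
    base=$(gpiochip_base_for "$1") || return 1
    pin=$(( base + $2 ))
    [ -d "$GPIO_SYSFS/gpio$pin" ] || echo "$pin" > "$GPIO_SYSFS/export"
    echo out > "$GPIO_SYSFS/gpio$pin/direction"
    echo 0 > "$GPIO_SYSFS/gpio$pin/value"
}

# Typical use on the router (needs root):
#   set_line_low 0-0076 5
```

The same script then works on both the old and the new numbering, since only the chip base changed.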

FWIW today I tested Arch and linux-aarch64 6.2.6-1-aarch64-ARCH and SFPs seem to work for me.
If I had to guess, it’s likely to be an I2C issue (like reading the EEPROM), but I’m not sure how.

Okay, so I managed to get the 6.2 kernel to work with the SFP in legacy mode, after changing my setup script to fiddle with the right GPIO pin (645). Thanks for the pointer!

However, it still doesn’t work in managed mode. I get the events in dmesg:

[root@ten64 ~]# dmesg | grep -E 'sfp|wan'
[    4.807370] sfp dpmac2_sfp: Host maximum power 2.0W
[    4.814138] sfp dpmac1_sfp: Host maximum power 2.0W
[    5.136388] sfp dpmac1_sfp: module FiberStore       SFP-GE-BX        rev      sn F0607150373      dc 160717  
[    6.097454] fsl_dpaa2_eth dpni.1 sfp: renamed from eth8
[    6.132607] fsl_dpaa2_eth dpni.0 wan: renamed from eth9
[    7.233066] fsl_dpaa2_eth dpni.0 wan: Link Event: state up
[  125.558471] fsl_dpaa2_eth dpni.0 wan: Link Event: state down
[  128.101693] sfp dpmac1_sfp: module removed
[  134.239694] sfp dpmac1_sfp: module FiberStore       SFP-GE-BX        rev      sn F0607150373      dc 160717  
[  139.559273] fsl_dpaa2_eth dpni.0 wan: Link Event: state up

and I can dump the EEPROM via i2c:

[root@ten64 ~]# i2cdetect 3
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-3.
I will probe address range 0x08-0x77.
Continue? [Y/n] y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:                         -- -- -- -- -- -- -- -- 
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
50: 50 51 -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
70: UU -- -- -- -- -- -- --                         
[root@ten64 ~]# i2cdump 3 0x50
No size specified (using byte-data access)
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-3, address 0x50, mode byte
Continue? [Y/n] 
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 03 04 07 00 00 00 40 20 10 01 00 01 0d 00 14 c8    ???...@ ??.??.??
10: 00 00 00 00 46 69 62 65 72 53 74 6f 72 65 20 20    ....FiberStore  
20: 20 20 20 20 00 00 00 00 53 46 50 2d 47 45 2d 42        ....SFP-GE-B
30: 58 20 20 20 20 20 20 20 20 20 20 20 06 0e 00 fb    X           ??.?
40: 00 1a 00 00 46 30 36 30 37 31 35 30 33 37 33 20    .?..F0607150373 
50: 20 20 20 20 31 36 30 37 31 37 20 20 68 90 01 6f        160717  h??o
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................

But not via ethtool:

[root@ten64 ~]# ethtool -e wan
Cannot get EEPROM data: Operation not supported

[root@ten64 ~]# ethtool -m wan
netlink error: Operation not supported

The link status looks fine with ip:

[root@ten64 ~]# ip link show dev wan
11: wan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:fa:24:2a:af brd ff:ff:ff:ff:ff:ff

but I see only egress traffic with tcpdump, nothing comes back in :frowning:

Any other ideas? :slight_smile:

Are there any messages like:

fsl_dpaa2_eth dpni.1 eth9: switched to inband/1000base-x link mode
or
fsl_dpaa2_eth dpni.1 eth9: validation of inband/10gbase-r with support 0000000,00000800,00006440 failed: -22
or
fsl_dpaa2_eth dpni.1 eth9: unsupported SFP module: no common interface modes
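
A quick way to pull only these out of the log, as a sketch. The function name `sfp_link_messages` is made up, and the patterns are taken from the three example messages above, so other wordings would be missed.

```shell
# Keep only link-mode negotiation messages from kernel log lines on stdin.
sfp_link_messages() {
    grep -E 'switched to .* link mode|validation of .* failed|unsupported SFP module'
}

# Typical use on the router:
#   dmesg | sfp_link_messages
```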

I wonder if it’s having trouble decoding the SFP EEPROM.

Do you have any other SFPs you can try? (Even if it’s not the speed you want to use, just to see if ethtool and the kernel read it correctly)