MuVirt k3os-cluster-wizard

Hi Ten64 community,

I have just received my Ten64 NAS, and one of the reasons I bought it was the chance to manage a k3s cluster, Proxmox-style, from within OpenWrt.

Here are some caveats I found so far…

First, I needed to work with the latest k3os-cluster-wizard from GitLab's master branch.

Second, the muvirt-system-setup partitioning for the working area/cache and swap seems to have a small bug: afterwards, lsblk shows the swap partition with the size that belongs to the working area/cache, and vice versa (the swap size attached to the LVM partition).

Third, I found that muvirt's k3os script fails at downloading new images from scratch, as the path /mnt/scratch/imglibrary does not exist after applying muvirt-system-setup, so it has to be created in advance.
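It only takes one command before running the wizard, e.g.:

# create the image cache directory the k3os script expects
mkdir -p /mnt/scratch/imglibrary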

Fourth, when provisioning new k3os VMs, line 142:

newVMConfig["imagesum256"] = sha256sum

should be replaced by

newVMConfig["checksum"] = sha256sum

judging by the Lua call.
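If you want to patch the installed script in place, something like this should do it (a sketch assuming the wizard lives at /usr/sbin/k3os-cluster-wizard, as the stack traces later in this thread suggest; adjust the path if yours differs):

# swap the wrong config key for the one the Lua call expects
sed -i 's/"imagesum256"/"checksum"/' /usr/sbin/k3os-cluster-wizard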

Fifth, and this is where I am stuck, it fails at creating volumes for the VMs with the following:

user.notice muvirt-provision: [Could not create a volume for k3controller]

Is there anything I am missing? How can I debug this?

Last, I noticed that it can't create volumes on the SATA SSD, seemingly due to zlib on the filesystem. I installed muvirt from the appstore to the NVMe as the quickstart doc suggests.

I'd also like to know whether it's possible to attach external storage to all the new VMs as shared volumes in k3os (e.g. for a Longhorn use case), and how to make that work on SATA.

Which muvirt version did you download? The two bugs you mentioned should be fixed already, but maybe there is an old version being linked somewhere.

The k3os cluster wizard also requires a patched version of k3os to work; give me a bit of time and I'll get the details.

@mcbridematt muvirt 21.02-SNAPSHOT, r0+16209-104234e569
I installed it from the recovery appstore after upgrading the firmware.

Apologies, I was thinking of a different bug. The k3os-cluster-wizard needed a few updates.

You can either download this image and use the muvirt/OpenWrt sysupgrade:
https://archive.traverse.com.au/pub/traverse/software/muvirt/branches/master/350648520/image/muvirt-21.02-snapshot-r0-16209-104234e569-arm64-efi-generic-ext4-combined.img.gz

or reinstall a new version from the appliance store.
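If you go the sysupgrade route, the steps would look roughly like this (a sketch; check sysupgrade's flags first, e.g. -n discards the current settings):

cd /tmp
wget https://archive.traverse.com.au/pub/traverse/software/muvirt/branches/master/350648520/image/muvirt-21.02-snapshot-r0-16209-104234e569-arm64-efi-generic-ext4-combined.img.gz
sysupgrade muvirt-21.02-snapshot-r0-16209-104234e569-arm64-efi-generic-ext4-combined.img.gz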

You need to use this URL for the k3os image:

https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/k3os-arm64-muvirt.iso

This ISO contains the patch "install.sh: fix force efi on arm64 (#XXX)" (mcbridematt/k3os@f2bebac on GitHub), which I will open a pull request for soon.

Hey @mcbridematt, I have flashed the new firmware from muvirt's sysupgrade menu and set the k3os base image, and yet I get the same error:

Got 1 keys, continuing
Provisioning the master controller VM
ERROR: Provisioning master VM failed
muvirt-provision: [k3controller] downloading https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/k3os-arm64-muvirt.iso to /mnt/muvirtwork/imglibrary//k3os.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  496M  100  496M    0     0  12.1M      0  0:00:40  0:00:40 --:--:-- 12.4M
muvirt-provision: [Could not create a volume for k3controller] 
root@muvirt:~# 

Should I perform a clean muvirt system setup to make new LVM partitions?

I’m guessing there was already a “k3controller” volume from the previous setup attempt.

Try removing it first:

lvdisplay # see if there are any volumes apart from muvirtwork
lvremove /dev/mapper/k3controller
rm -rf /mnt/muvirtwork/k3controller # delete any temporary files just to be sure
uci delete virt.k3controller
uci commit virt

Yes @mcbridematt, that solved the volume issue, although the wizard now hits a UCI config error that prevents the worker nodes from being provisioned/created:

Provisioning the master controller VM
[  522.273765] /dev/loop1: Can't open blockdev
[  522.986686] br-lan: port 2(tap0) entered blocking state
[  522.991935] br-lan: port 2(tap0) entered disabled state
[  522.997325] device tap0 entered promiscuous mode
[  523.002302] br-lan: port 2(tap0) entered blocking state
[  523.007620] br-lan: port 2(tap0) entered forwarding state
[  549.207280] br-lan: port 2(tap0) entered disabled state
[  549.224923] device tap0 left promiscuous mode
[  549.229311] br-lan: port 2(tap0) entered disabled state
[  558.087552] br-lan: port 2(tap0) entered blocking state
[  558.092799] br-lan: port 2(tap0) entered disabled state
[  558.098278] device tap0 entered promiscuous mode
[  558.103314] br-lan: port 2(tap0) entered blocking state
[  558.108650] br-lan: port 2(tap0) entered forwarding state
[  584.448086] br-lan: port 2(tap0) entered disabled state
[  584.465781] device tap0 left promiscuous mode
[  584.470146] br-lan: port 2(tap0) entered disabled state
Master node provisioning done, waiting for the cluster token to be created
[  588.490334] br-lan: port 2(tap0) entered blocking state
[  588.495576] br-lan: port 2(tap0) entered disabled state
[  588.500981] device tap0 entered promiscuous mode
[  588.505916] br-lan: port 2(tap0) entered blocking state
[  588.511231] br-lan: port 2(tap0) entered forwarding state
Got cluster token: K107b945c5acd1f9551993b463d6f01c50f101b64a64a5dbd3bbc24c656aff1fe23::server:01afcbe8df59f045ca4c85285c8287b5
/usr/bin/lua: /usr/sbin/k3os-cluster-wizard:204: Failed to save K3OS master details in UCI
stack traceback:
        [C]: in function 'error'
        /usr/sbin/k3os-cluster-wizard:204: in function 'set_master_detail'
        /usr/sbin/k3os-cluster-wizard:294: in function 'mode_selfcontained'
        /usr/sbin/k3os-cluster-wizard:442: in main chunk

Are there any existing tokens saved in UCI?
e.g.:

$ uci show virt.k3
virt.k3=k3os
virt.k3.token='K10ecc3b5e81d3a436c8259ab5481404d608289bedf5471f8eb8e36409921be52db::server:db8e1265fdb45c4c659c726018ce5909'
virt.k3.server='https://192.168.1.10:6443'

That is my guess about why it is refusing to save. If there are, you could try updating the token and IP address and then going back into the wizard to create worker nodes.
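Updating them would look something like this (option names taken from the example above; the values are placeholders for your own token and controller IP):

uci set virt.k3.token='<token from the controller VM>'
uci set virt.k3.server='https://<controller IP>:6443'
uci commit virt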

Not really, I made a clean install from recovery…

I haven't been able to figure out why it failed in set_master_detail. I thought it could be unapplied changes in UCI, but that is unlikely if the controller VM was set up successfully.
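If you want to rule that out on your side, UCI can list anything staged but not yet committed:

uci changes   # empty output means nothing is pending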

I've made a few changes to k3os-cluster-wizard, including adding some checks to the set_master_detail function and printing out the error message UCI returns if it doesn't save successfully. Mind giving it a try?

I haven't built a new image since these are only script changes; you can download and install the updated muvirt-k3os package:

$ wget https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/muvirt-k3os_0.2.1-1_aarch64_generic.ipk
$ opkg install muvirt-k3os_0.2.1-1_aarch64_generic.ipk
Upgrading muvirt-k3os on root from 0.2-1 to 0.2.1-1...
Configuring muvirt-k3os.

I just noticed that baremetal-deploy was still downloading the old version of muvirt as well; the appstore list has been updated.

Yes, the appstore was fetching an old version; that's why I downloaded and installed the qcow2 image directly. I made a clean install from the appstore and it seems OK, but a new DHCP reservation error shows up now. Here's the full trace:

root@muvirt:/# wget https://archive.traverse.com.au/pub/traverse/software/muvirt
/temp/muvirt-k3os_0.2.1-1_aarch64_generic.ipk
Downloading 'https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/muvirt-k3os_0.2.1-1_aarch64_generic.ipk'
Connecting to 2605:6400:20:999::1:443
Writing to 'muvirt-k3os_0.2.1-1_aarch64_generic.ipk'
muvirt-k3os_0.2.1-1_ 100% |*******************************|  6216   0:00:00 ETA
Download completed (6216 bytes)
root@muvirt:/# opkg install muvirt-k3os_0.2.1-1_aarch64_generic.ipk
Upgrading muvirt-k3os on root from 0.2-1 to 0.2.1-1...
Configuring muvirt-k3os.
root@muvirt:/# k3os-cluster-wizard
K3OS self contained cluster wizard
Select mode:
        [1] - Build a self contained cluster
        [2] - Create nodes to join an existing cluster
1
Going into self contained mode

Enter the base name for this cluster, or [enter] for the default "k3"

For example: With a base name of k3, the nodes will be named:
        Controller      : k3controller
        Node 1  : k3node1
        Node 2  : k3node2
        Node X  : k3nodeX

Cluster base name:k3s-ten64
Using "k3s-ten64" as the cluster basename
Enter URL for k3os iso, or [enter] for the default https://github.com/rancher/k3os/releases/download/v0.11.1/k3os-arm64.iso
https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/k3os-arm64-muvirt.iso
Downloading k3os iso from https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/k3os-arm64-muvirt.iso
Downloaded image, with SHA256SUM f83703001782295b097ced9663117ca9271979c29181ca6af67172bbb17021cd
OpenWrt network to connect nodes to (must be a bridge): [lan]
Create DHCP reservation for Master node? [y/n]
y
/usr/bin/lua: /usr/sbin/k3os-cluster-wizard:96: Could not create DHCP reservation for VM "k3s-ten64controller": Invalid argument
stack traceback:
        [C]: in function 'error'
        /usr/sbin/k3os-cluster-wizard:96: in function 'create_dhcp_reservation'
        /usr/sbin/k3os-cluster-wizard:295: in function 'mode_selfcontained'
        /usr/sbin/k3os-cluster-wizard:486: in main chunk
        [C]: ?
root@muvirt:/# 

Btw, I've just realised that is an old k3os version:

ten64controller [~]$ k get nodes
NAME              STATUS   ROLES    AGE    VERSION
ten64node2        Ready    <none>   10m    v1.19.2+k3s1
ten64node3        Ready    <none>   9m1s   v1.19.2+k3s1
ten64controller   Ready    master   14m    v1.19.2+k3s1
ten64node1        Ready    <none>   11m    v1.19.2+k3s1

Would you mind upgrading to v0.21? It comes with k3s v1.21.1 and Traefik v2, which is compatible with the standard Ingress resource.

Doh! I’ve uploaded the 1.21.1 ISO:

https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/k3os-arm64-1-21.iso

Due to an issue with 1.21 not getting DHCP leases from OpenWrt/dnsmasq you also need an updated package:

https://archive.traverse.com.au/pub/traverse/software/muvirt/temp/muvirt-k3os_0.2.2-1_aarch64_generic.ipk

(The specific workaround is in muvirt-feed/568498fb)

Yes, the input handling in k3os-cluster-wizard is very rudimentary.

I actually would like to refactor all the setup code into a module so it could be used by a web-based wizard as well, though this is a low priority right now.


Hey @mcbridematt, I'm sorry to bother you again, but pods get stuck in Unknown status after rebooting the Ten64.

ten64controller [~]$ k get pods -A
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
ten64controller [~]$ k get pods -A
NAMESPACE     NAME                                        READY   STATUS      RESTARTS   AGE
kube-system   helm-install-traefik-crd-rk55m              0/1     Completed   0          42m
kube-system   helm-install-traefik-mdkwc                  0/1     Completed   1          42m
kube-system   svclb-traefik-fzp5w                         2/2     Running     0          8m29s
kube-system   svclb-traefik-x4zst                         2/2     Running     0          7m20s
kube-system   svclb-traefik-cpnnj                         2/2     Running     0          6m17s
kube-system   svclb-traefik-nsfbh                         0/2     Unknown     0          40m
kube-system   coredns-7448499f4d-j6xr8                    0/1     Unknown     0          42m
kube-system   traefik-97b44b794-j46tr                     0/1     Unknown     0          40m
k3os-system   system-upgrade-controller-8bf4f84c4-bvzbr   0/1     Unknown     0          42m
kube-system   local-path-provisioner-5ff76fc89d-bn6j9     0/1     Unknown     1          42m
kube-system   metrics-server-86cbb8457f-zwxxb             0/1     Unknown     0          42m

Is it a good idea to build a self-contained k3os cluster, or would it be better to join an existing one, in case I upgrade the firmware or something? Btw, I also noticed that /etc/profile gets wiped after a reboot. How can I create persistent bash aliases for k3os (e.g. alias k='kubectl') or add new SSH keys to it?

I think this is because the controller might need more resources (RAM and CPU); I have seen this problem with k3s before.
If one of the major pods crashes due to lack of resources, the entire k3s server is restarted.

Try increasing the memory and number of CPU cores for the controller VM:

# Shutdown the controller VM first, then do this

uci set virt.k3testcontroller.memory=2048
uci set virt.k3testcontroller.numprocs=2
uci commit virt
/etc/init.d/muvirt start k3testcontroller

In the next version of k3os-cluster-wizard we should increase the default RAM and CPU for the controller and/or ask for a value to use.

I see self-contained k3os as more of a 'teaching tool': use it to learn Kubernetes and to do testing and development.

For serious use I would set up a dedicated k3s instance hosted elsewhere (like the "cloud") as the controller; then you just use the k3os wizard to join worker nodes to it.

/etc/ in k3os is temporary; the contents are reconstructed on every reboot (see "File system structure" in the k3os README).

Adding aliases to /home/rancher/.bash_profile and SSH keys to /home/rancher/.ssh/authorized_keys works ok.
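For example (the alias and key below are placeholders, substitute your own):

# on the k3os VM, as the rancher user:
echo "alias k='kubectl'" >> /home/rancher/.bash_profile
echo "ssh-ed25519 AAAA... you@yourhost" >> /home/rancher/.ssh/authorized_keys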


Hey @mcbridematt, thanks for the tips :slight_smile: , I've just opened a PR on GitHub for EFI support on arm64.

Btw, have you tried k3s-upgrade with muvirt? Do you think it would work? If so, which image should be used for the k3s-upgrade plan?

k3os released support for k3s v1.21.5, and even though k3s v1.22.2 was released recently, it won't get into the stable channel until v1.22.3 due to some breaking changes.

Edit: I have compiled the ISOs and published them in the Alboroto / temp repository on GitLab.

@mcbridematt, a question wrt QEMU and muvirt: how many spare resources can I take? Is it okay to take 6 CPUs and ~27 GB RAM, e.g. 3 nodes with 9 GB RAM and 2 CPUs each?

Quoting my earlier question: "Btw, have you tried k3s-upgrade with muvirt? Do you think it would work? If so, which image should be used for the k3s-upgrade plan?"

To answer myself: apparently it works fine, at least with the latest k3os kernel.

Thanks for doing this! The latest k3os ISO now works in the wizard.

I’ve written an article on the muvirt wiki which goes through using the k3os wizard as well as setting up kubectl and Portainer.

I still have to add information about joining an existing Kubernetes cluster, but it isn't too different.

I’d say that is reasonable.

As a guide, 85% of system memory is reserved for HugeTLB by default. With 32 GB that results in 27.2 GB reserved; you could reserve 95% instead (30.4 GB).
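For reference, the underlying Linux mechanism looks like this (muvirt manages the reservation itself at boot, so its own setting may differ; this is just the generic knob):

# 95% of 32 GiB as 2 MiB hugepages: 0.95 * 32768 MiB / 2 MiB ≈ 15564 pages
echo 15564 > /proc/sys/vm/nr_hugepages
grep Huge /proc/meminfo   # verify the reservation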

@mcbridematt, yes, I ran into some problems trying to join an existing cluster; even though I predefined the cluster details, I couldn't join the Ten64 nodes. So I ended up building the cluster on the Ten64 and applying some cloud-init configs from the k3os repo to port-forward HTTP/HTTPS, disable Klipper in favour of MetalLB, set the DNS resolver, expand the virtual memory, etc.
I dare say everything works as expected after one month running; I can connect to my cluster externally using WireGuard :slight_smile:

The only problem I'm facing now is that I can reach my router's LAN devices from my nodes (on the Ten64 LAN). Since I aim to expose some services to the internet through my double LAN, I guess that's how it should be.