Kubeflow Machine Learning Tips and Tricks – February 2022

Without wasting any time, let’s dive right in!

Your Questions, Our Answers

1. Is there a way to auto-stop notebooks that are idle for a long time, such as overnight? We are looking to reduce resource usage.

Yes, in fact, it is a setting in Kubeflow, but it is not enabled by default.

It’s called “notebook culling”, and you can set it up so that it is enabled by default on a fresh install. But assuming you already have an instance that you want to apply it to, here are the instructions:

Kubeflow 1.5 will update how idleness is calculated, so if you upgrade from 1.4 to 1.5, expect a slight change in behavior.
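For a rough idea of what that involves, here is a minimal sketch rather than the official procedure. It assumes the notebook controller runs as the deployment notebook-controller-deployment in the kubeflow namespace and reads the ENABLE_CULLING, CULL_IDLE_TIME, and IDLENESS_CHECK_PERIOD environment variables (both values in minutes); check those names against your own install before relying on them. The helper culling_cmd is a made-up name, and it only prints the kubectl command so you can review it first.

```shell
# Assumed names: verify them against your own cluster.
NAMESPACE="kubeflow"
DEPLOYMENT="notebook-controller-deployment"

# culling_cmd is a hypothetical helper.
# $1 = idle minutes before a notebook is culled
# $2 = minutes between idleness checks
culling_cmd() {
  printf 'kubectl -n %s set env deployment/%s ENABLE_CULLING=true CULL_IDLE_TIME=%s IDLENESS_CHECK_PERIOD=%s\n' \
    "$NAMESPACE" "$DEPLOYMENT" "$1" "$2"
}

# Print the command for review; nothing is applied to the cluster here.
culling_cmd 60 5
```

If your install is managed with kustomize, a change made this way may be overwritten on the next apply, so setting the same keys in the manifests is probably the more durable route.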

Credit: Question: Keith Adler (Community Slack), Answer: Alexandre Brown and Benjamin Tan

2. What’s the best way to get metadata from an experiment?

This is the sort of question that would get blocked as ‘subjective’ on Stack Overflow and probably cost you some karma points. But this is a blog and I am OP and Moderator.

There was a lot of chat on this. The community was sort of drifting towards “MLflow’s metadata tracker is best,” and then my friend and co-author Boris Lublinsky chimed in:

“MLflow works great for demos, but I am afraid it’s not scalable enough for large-scale implementations. In general, in my opinion, metadata management was always the weakest link in Kubeflow.”

I trust Boris a lot and would agree (especially since I’ve not done much with metadata tracking personally). So MLflow, but don’t count on it at scale. Also, a user named Timos chimed in later that it is worth watching KServe, as they may be developing better tracking soon.

Credit: Question: Андрей Ильичёв, Answer: Compilation from Community and Boris

3. How do I enable GPUs on MiniKF local?

Without reservation, the answer is “You shouldn’t be using Vagrant.” Period. Full stop. You should use MiniKF on GCP or AWS. Personally, I prefer GCP. “But I have to use it locally because…”

OK, you have the one in ten million corner case. Also, this keeps coming up, so I want to post it here in case Steven stops logging in to community Slack one day and no one can ever figure this out again.

The issue isn’t MiniKF; it’s the default way Vagrant is set up. With the VirtualBox provider, the VM can’t access your GPUs. The solution is to swap VirtualBox for libvirt, which CAN see your GPUs. I’m just going to copy-paste Steven’s answer from community Slack since it is very thorough.

Steven says: here’s a more detailed explanation.

First of all, these are instructions for Linux. My current setup is a headless machine with 2 GPUs. With this setup, when MiniKF is running, the GPUs are “detached” from the main OS and only MiniKF can access them. This may not fit your current use case.

I followed some of the instructions I found here to enable IOMMU.

1. IOMMU Setup

  • Enable IOMMU and CPU virtualization in the machine BIOS.
  • Enable IOMMU in the boot kernel parameters. (In my case, the system uses grub2, so I edited /etc/default/grub to add amd_iommu=on iommu=pt, then regenerated the config with grub2-mkconfig -o /boot/grub2/grub.cfg.)
  • Reboot.
  • Verify that IOMMU is correctly enabled: dmesg | grep -i -e DMAR -e IOMMU
  • Find the IOMMU groups and hardware info of your GPUs. (This script outputs all your devices; search for VGA or NVIDIA to quickly find the values we’re looking for.)
#!/bin/bash
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done

  • In my case, both cards are already in isolated IOMMU groups, so no patching is needed. We will refer to these values multiple times throughout the process.
IOMMU Group 16 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 16 08:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
IOMMU Group 47 43:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 47 43:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
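
To double-check the “isolated” part on your own machine, a small filter over that output can help. This is only a sketch and not part of Steven’s answer: shared_gpu_groups is a hypothetical helper that assumes the exact “IOMMU Group N ...” line format printed by the script above, and it treats any line mentioning NVIDIA as part of a GPU.

```shell
# shared_gpu_groups is a hypothetical helper, not part of the original setup.
# It reads the "IOMMU Group N ..." lines on stdin and prints every group
# number that holds an NVIDIA device together with some other device.
shared_gpu_groups() {
  awk '
    { total[$3]++ }              # $3 is the group number
    /NVIDIA/ { nvidia[$3]++ }
    END {
      for (g in nvidia)
        if (total[g] != nvidia[g]) print g
    }
  ' | sort -n
}

# Sample input copied from the output above. Every device in groups 16 and 47
# is an NVIDIA function, so nothing is printed.
shared_gpu_groups <<'EOF'
IOMMU Group 16 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 16 08:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
IOMMU Group 47 43:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 47 43:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
EOF
```

On a real machine you would pipe the IOMMU group script into it. An empty result suggests the GPUs are not sharing a group with anything else; any group number it prints is worth a closer look.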

2. libvirt Setup

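  • Enable hook support for QEMU: download the hook script with wget -O /etc/libvirt/hooks/qemu <hook script URL>, then make it executable with chmod +x /etc/libvirt/hooks/qemu.
  • Create the config file /etc/libvirt/hooks/kvm.conf with the device addresses. The addresses are the ones displayed when executing the IOMMU group script, rewritten with underscores. For example, 08:00.1 becomes pci_0000_08_00_1.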
## Virsh devices
VIRSH_VIDEO_1=pci_0000_08_00_0
VIRSH_AUDIO_1=pci_0000_08_00_1

VIRSH_VIDEO_2=pci_0000_43_00_0
VIRSH_AUDIO_2=pci_0000_43_00_1
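
If you have more than a couple of devices, the address rewrite is easy to script. A minimal sketch, not part of Steven’s answer: pci_to_nodedev is a hypothetical helper, and it assumes PCI domain 0000, as in the config above.

```shell
# pci_to_nodedev is a hypothetical helper.
# $1 is a PCI address as printed by lspci, e.g. 08:00.1 (bus:slot.function).
# The 0000 prefix assumes PCI domain 0, matching the kvm.conf in the post.
pci_to_nodedev() {
  printf 'pci_0000_%s\n' "$(printf '%s' "$1" | tr ':.' '__')"
}

pci_to_nodedev 08:00.1   # prints pci_0000_08_00_1
pci_to_nodedev 43:00.0   # prints pci_0000_43_00_0
```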

  • Create bind hook /etc/libvirt/hooks/qemu.d/minikf_default/prepare/begin/bind_vfio.sh
#!/bin/bash

## Load the config file
source "/etc/libvirt/hooks/kvm.conf"

## Load vfio
modprobe vfio
modprobe vfio_iommu_type1
modprobe vfio_pci

## Unbind gpu from nvidia and bind to vfio
virsh nodedev-detach $VIRSH_VIDEO_1
virsh nodedev-detach $VIRSH_AUDIO_1
virsh nodedev-detach $VIRSH_VIDEO_2
virsh nodedev-detach $VIRSH_AUDIO_2

  • Create unbind hook /etc/libvirt/hooks/qemu.d/minikf_default/release/end/unbind_vfio.sh
#!/bin/bash

## Load the config file
source "/etc/libvirt/hooks/kvm.conf"

## Load vfio
modprobe vfio
modprobe vfio_iommu_type1
modprobe vfio_pci

## Unbind gpu from vfio and hand it back to the host driver
virsh nodedev-reattach $VIRSH_VIDEO_1
virsh nodedev-reattach $VIRSH_AUDIO_1
virsh nodedev-reattach $VIRSH_VIDEO_2
virsh nodedev-reattach $VIRSH_AUDIO_2

## Unload vfio
modprobe -r vfio_pci
modprobe -r vfio_iommu_type1
modprobe -r vfio

  • Execute chmod +x /etc/libvirt/hooks/qemu.d/minikf_default/prepare/begin/bind_vfio.sh /etc/libvirt/hooks/qemu.d/minikf_default/release/end/unbind_vfio.sh to make hooks executable.
  • Edit the vfio conf /etc/modprobe.d/vfio.conf. The IDs are the vendor:device pairs shown in square brackets (for example, [10de:2204]) for the NVIDIA devices in the IOMMU groups step. (I’m not sure if this step is required.)
options vfio-pci ids=10de:2204,10de:1aef
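
That line can also be built from the IOMMU group output instead of being typed by hand. A rough sketch, not part of Steven’s answer: vfio_ids is a hypothetical helper that assumes the lspci -nn line format shown earlier and keeps only lines mentioning NVIDIA.

```shell
# vfio_ids is a hypothetical helper. It reads lspci -nn style lines on stdin,
# keeps the NVIDIA ones, and joins their unique vendor:device IDs with commas.
vfio_ids() {
  grep -i 'nvidia' |
    grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}]' |
    tr -d '[]' |
    sort -u |
    paste -s -d ',' -
}

# Sample input copied from the IOMMU group output in the post.
sample='IOMMU Group 16 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 16 08:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)'

printf 'options vfio-pci ids=%s\n' "$(printf '%s\n' "$sample" | vfio_ids)"
# prints: options vfio-pci ids=10de:1aef,10de:2204
```

The IDs come out sorted, so their order differs from the hand-written line above; the order of the IDs should not matter.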

3. Vagrant Setup

  • Install vagrant-mutate: vagrant plugin install vagrant-mutate
  • Mutate the MiniKF box to libvirt: vagrant mutate arrikto/minikf libvirt
  • Vagrant mutate does not copy all of the box configuration, so we need to copy the included folder from the VirtualBox version (~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/virtualbox/) into the libvirt version (~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/libvirt/).
  • Edit local Vagrantfile to change the CPU/memory. In my case, I allocate 30 CPUs and 40GB of memory. Add this section to the file:
config.vm.provider :libvirt do |libvirt|
  libvirt.cpus = 30
  libvirt.memory = 40960
end

  • Run the box: vagrant up --provider=libvirt

4. Additional Box Config

Now that the VM is created, we need to attach the GPUs.

Note: These steps could probably be added into the Vagrantfile but I’m not familiar with it.

  • Stop the VM: vagrant halt.
  • Enable KVM hidden state: sudo virsh edit minikf_default. Add these lines in the <features> section.

<kvm>
    <hidden state="on"/>
</kvm>

  • Create config files for the GPUs. The address info comes from the output of the IOMMU group script. For example, 08:00.1 maps to bus 0x08, slot 0x00, function 0x1.

device_gpu1.xml

<hostdev mode="subsystem" type="pci" managed='yes'>
 <driver name="vfio"/>
 <source>
  <address domain='0x0000' bus="0x08" slot="0x00" function='0x0'/>
 </source>
</hostdev>

device_gpu1_audio.xml

<hostdev mode="subsystem" type="pci" managed='yes'>
 <driver name="vfio"/>
 <source>
  <address domain='0x0000' bus="0x08" slot="0x00" function='0x1'/>
 </source>
</hostdev>

Repeat for device_gpu2.xml and device_gpu2_audio.xml if necessary, using the second card’s address (bus 0x43 in my case).
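
Since the four files differ only in the address, they can be generated from the PCI addresses. A minimal sketch, not part of Steven’s answer: hostdev_xml is a hypothetical helper that assumes PCI domain 0x0000 and the same hostdev layout as the files above.

```shell
# hostdev_xml is a hypothetical helper.
# $1 is a PCI address such as 08:00.1 (bus:slot.function); domain 0 is assumed.
hostdev_xml() {
  bus=${1%%:*}       # 08
  rest=${1#*:}       # 00.1
  slot=${rest%%.*}   # 00
  func=${rest#*.}    # 1
  cat <<EOF
<hostdev mode="subsystem" type="pci" managed="yes">
 <driver name="vfio"/>
 <source>
  <address domain="0x0000" bus="0x${bus}" slot="0x${slot}" function="0x${func}"/>
 </source>
</hostdev>
EOF
}

# Print the definition for the second GPU's video function.
hostdev_xml 43:00.0
```

Redirect the output to a file to use it with the attach step, for example hostdev_xml 43:00.1 > device_gpu2_audio.xml.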

  • Attach the devices to the VM.

sudo virsh attach-device minikf_default --config device_gpu1.xml
sudo virsh attach-device minikf_default --config device_gpu1_audio.xml
sudo virsh attach-device minikf_default --config device_gpu2.xml
sudo virsh attach-device minikf_default --config device_gpu2_audio.xml

  • Restart the Vagrant box: vagrant up --provider=libvirt

The GPUs should now be available inside MiniKF! You can verify by going into the MiniKF box with sudo vagrant ssh and running nvidia-smi to list the devices.

~Steven Payre

Credit: Question: Ben Pashley, Answer: Steven Payre

4. How does one get involved in the Kubeflow Community?

Great question and glad you asked.

Every community is different, and the Kubeflow community has some quirks of its own. If you go to the “community” section of the Kubeflow website, you’ll see information on how to join Slack and various mailing lists.

In the Kubeflow community, the community Slack does appear to be the primary way users get support (which is also why so many questions in this series are pulled from there). But you’ll also see lots of mailing lists and Google Groups to join.

So joining Slack (and participating) is a good first (and second) step. But after that, what? My advice would be to give talks at meetups or even volunteer to be a co-organizer for a meetup.

Speaking at meetups is a low-stress/low-stakes way to start making a name for yourself in the community.

Credit: Me, Kubeflow community, and ASF member
