Implement Chaos Engineering in Kubernetes

Chaos Mesh is an open-source, cloud-native Chaos Engineering platform built on Kubernetes (K8s) custom resource definitions (CRDs). It can simulate various types of faults and has a strong ability to orchestrate fault scenarios. You can use Chaos Mesh to simulate anomalies that may occur in development, testing, and production environments, and to find potential problems in the system.

In this article, I’ll explore how to practice Chaos Engineering in Kubernetes clusters, explain important Chaos Mesh features by analyzing its source code, and describe how to develop a control plane for Chaos Mesh with code examples. If you are unfamiliar with Chaos Mesh, please review the Chaos Mesh documentation for a basic understanding of its architecture.

For the testing code in this article, check out this GitHub repository.

How Chaos Mesh creates chaos

Chaos Mesh is the Swiss Army Knife for performing Chaos Engineering on Kubernetes. This section introduces how it works.

Privileged mode

Chaos Mesh runs privileged containers in Kubernetes to generate failures. The Chaos Daemon Pods run as a DaemonSet, and additional capabilities are added to the Pods’ container runtime via the Pod security context.

apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    metadata: ...
    spec:
      containers:
        - name: chaos-daemon
          securityContext:
            {{- if .Values.chaosDaemon.privileged }}
            privileged: true
            capabilities:
              add:
                - SYS_PTRACE
            {{- else }}
            capabilities:
              add:
                - SYS_PTRACE
                - NET_ADMIN
                - MKNOD
                - SYS_CHROOT
                - SYS_ADMIN
                - KILL
                # CAP_IPC_LOCK is used to lock memory
                - IPC_LOCK
            {{- end }}

These Linux capabilities grant the container the privileges it needs to create the /dev/fuse device file for Filesystem in Userspace (FUSE). FUSE is the Linux userspace file system interface; it allows unprivileged users to create their own file systems without editing kernel code.

According to pull request #1109 on GitHub, the DaemonSet program uses cgo to invoke the Linux makedev function and create the FUSE device file.

// #include <sys/sysmacros.h>
// #include <sys/types.h>
// // makedev is a macro, so a wrapper is needed
// dev_t Makedev(unsigned int maj, unsigned int min) {
//   return makedev(maj, min);
// }
import "C"

// EnsureFuseDev ensures /dev/fuse exists. If not, it will create one
func EnsureFuseDev() {
    if _, err := os.Open("/dev/fuse"); os.IsNotExist(err) {
        // 10, 229 according to https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
        fuse := C.Makedev(10, 229)
        syscall.Mknod("/dev/fuse", 0o666|syscall.S_IFCHR, int(fuse))
    }
}

Pull request #1453 made it possible for the Chaos Daemon to run in privileged mode; that is, to set privileged: true in the container’s securityContext.

Killing Pods

PodKill, PodFailure, and ContainerKill all belong to the PodChaos category. PodKill kills a randomly selected Pod by sending a kill command through the Kubernetes API server.

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"

    "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
)
type Impl struct {
    client.Client
}
func (impl *Impl) Apply(ctx context.Context, index int, records []*v1alpha1.Record, obj v1alpha1.InnerObject) (v1alpha1.Phase, error) {
    ...
    err = impl.Get(ctx, namespacedName, &pod)
    if err != nil {
        // TODO: handle this error
        return v1alpha1.NotInjected, err
    }
    err = impl.Delete(ctx, &pod, &client.DeleteOptions{
        GracePeriodSeconds: &podchaos.Spec.GracePeriod, // PeriodSeconds has to be set specifically
    })
    ...
    return v1alpha1.Injected, nil
}

The GracePeriodSeconds parameter lets Kubernetes decide how long to wait before terminating the Pod. For example, if you need to delete a Pod immediately, use the kubectl delete pod --grace-period=0 --force command.
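For illustration, here is a minimal, hypothetical sketch of the programmatic equivalent using a controller-runtime client; the forceDeletePod helper and its package are my own and are not part of Chaos Mesh.

package chaosutil

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// forceDeletePod deletes a Pod with a zero grace period, roughly the
// programmatic counterpart of `kubectl delete pod --grace-period=0 --force`.
func forceDeletePod(ctx context.Context, cli client.Client, pod *v1.Pod) error {
    zero := int64(0)
    return cli.Delete(ctx, pod, &client.DeleteOptions{GracePeriodSeconds: &zero})
}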

PodFailure patches the Pod object to replace the Pod’s image with an incorrect one. Chaos Mesh only modifies the image fields of containers and initContainers, because most other fields of a Pod are immutable. For more details, see Updating and Replacing Pods.

func (impl *Impl) Apply(ctx context.Context, index int, records []*v1alpha1.Record, obj v1alpha1.InnerObject) (v1alpha1.Phase, error) {
    ...
    pod := origin.DeepCopy()
    for index := range pod.Spec.Containers {
        originImage := pod.Spec.Containers[index].Image
        name := pod.Spec.Containers[index].Name
        key := annotation.GenKeyForImage(podchaos, name, false)
        if pod.Annotations == nil {
            pod.Annotations = make(map[string]string)
        }
        // If the annotation is already existed, we could skip the reconcile for this container
        if _, ok := pod.Annotations[key]; ok {
            continue
        }
        pod.Annotations[key] = originImage
        pod.Spec.Containers[index].Image = config.ControllerCfg.PodFailurePauseImage
    }
    for index := range pod.Spec.InitContainers {
        originImage := pod.Spec.InitContainers[index].Image
        name := pod.Spec.InitContainers[index].Name
        key := annotation.GenKeyForImage(podchaos, name, true)
        if pod.Annotations == nil {
            pod.Annotations = make(map[string]string)
        }
        // If the annotation is already existed, we could skip the reconcile for this container
        if _, ok := pod.Annotations[key]; ok {
            continue
        }
        pod.Annotations[key] = originImage
        pod.Spec.InitContainers[index].Image = config.ControllerCfg.PodFailurePauseImage
    }
    err = impl.Patch(ctx, pod, client.MergeFrom(&origin))
    if err != nil {
        // TODO: handle this error
        return v1alpha1.NotInjected, err
    }
    return v1alpha1.Injected, nil
}

The default container image that causes the failure is gcr.io/google-containers/pause:latest.

PodKill and PodFailure control the Pod lifecycle through the Kubernetes API server, but ContainerKill is carried out by the Chaos Daemon running on the cluster node. For ContainerKill, the Chaos Controller Manager builds a client that initiates gRPC calls to the Chaos Daemon.

func (b *ChaosDaemonClientBuilder) Build(ctx context.Context, pod *v1.Pod) (chaosdaemonclient.ChaosDaemonClientInterface, error) {
    ...
    daemonIP, err := b.FindDaemonIP(ctx, pod)
    if err != nil {
        return nil, err
    }
    builder := grpcUtils.Builder(daemonIP, config.ControllerCfg.ChaosDaemonPort).WithDefaultTimeout()
    if config.ControllerCfg.TLSConfig.ChaosMeshCACert != "" {
        builder.TLSFromFile(config.ControllerCfg.TLSConfig.ChaosMeshCACert, config.ControllerCfg.TLSConfig.ChaosDaemonClientCert, config.ControllerCfg.TLSConfig.ChaosDaemonClientKey)
    } else {
        builder.Insecure()
    }
    cc, err := builder.Build()
    if err != nil {
        return nil, err
    }
    return chaosdaemonclient.New(cc), nil
}

When the Chaos Controller Manager sends commands to the Chaos Daemon, it creates a matching client based on the Pod’s information. For example, to control a Pod on a node, it builds the client using the Chaos Daemon’s IP on the node where the Pod is located. If a Transport Layer Security (TLS) certificate is configured, the Controller Manager adds the TLS certificate to the client.

When the Chaos Daemon starts, if it has a TLS certificate, it attaches the certificate to enable gRPCS (gRPC over TLS). The TLS configuration option RequireAndVerifyClientCert indicates that mutual TLS (mTLS) authentication is enabled.

func newGRPCServer(containerRuntime string, reg prometheus.Registerer, tlsConf tlsConfig) (*grpc.Server, error) {
    ...
    if tlsConf != (tlsConfig{}) {
        caCert, err := ioutil.ReadFile(tlsConf.CaCert)
        if err != nil {
            return nil, err
        }
        caCertPool := x509.NewCertPool()
        caCertPool.AppendCertsFromPEM(caCert)
        serverCert, err := tls.LoadX509KeyPair(tlsConf.Cert, tlsConf.Key)
        if err != nil {
            return nil, err
        }
        creds := credentials.NewTLS(&tls.Config{
            Certificates: []tls.Certificate{serverCert},
            ClientCAs:    caCertPool,
            ClientAuth:   tls.RequireAndVerifyClientCert,
        })
        grpcOpts = append(grpcOpts, grpc.Creds(creds))
    }
    s := grpc.NewServer(grpcOpts...)
    grpcMetrics.InitializeMetrics(s)
    pb.RegisterChaosDaemonServer(s, ds)
    reflection.Register(s)
    return s, nil
}

Chaos Daemon provides the following gRPC interfaces for communication:

// ChaosDaemonClient is the client API for ChaosDaemon service.
//
// For semantics around ctx use and closing/ending streaming RPCs, please refer to https://godoc.org/google.golang.org/grpc#ClientConn.NewStream.
type ChaosDaemonClient interface {
    SetTcs(ctx context.Context, in *TcsRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    FlushIPSets(ctx context.Context, in *IPSetsRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    SetIptablesChains(ctx context.Context, in *IptablesChainsRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    SetTimeOffset(ctx context.Context, in *TimeRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    RecoverTimeOffset(ctx context.Context, in *TimeRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    ContainerKill(ctx context.Context, in *ContainerRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    ContainerGetPid(ctx context.Context, in *ContainerRequest, opts ...grpc.CallOption) (*ContainerResponse, error)
    ExecStressors(ctx context.Context, in *ExecStressRequest, opts ...grpc.CallOption) (*ExecStressResponse, error)
    CancelStressors(ctx context.Context, in *CancelStressRequest, opts ...grpc.CallOption) (*empty.Empty, error)
    ApplyIOChaos(ctx context.Context, in *ApplyIOChaosRequest, opts ...grpc.CallOption) (*ApplyIOChaosResponse, error)
    ApplyHttpChaos(ctx context.Context, in *ApplyHttpChaosRequest, opts ...grpc.CallOption) (*ApplyHttpChaosResponse, error)
    SetDNSServer(ctx context.Context, in *SetDNSServerRequest, opts ...grpc.CallOption) (*empty.Empty, error)
}

Network failure injection

From pull request #41, we know that Chaos Mesh injects network failures in this way: it calls pbClient.SetNetem to encapsulate the parameters in a request and sends the request to the Chaos Daemon on the node for processing.

The network failure injection code is shown below as it appeared in 2019. As the project developed, the functions were distributed over several files.

func (r *Reconciler) applyPod(ctx context.Context, pod *v1.Pod, networkchaos *v1alpha1.NetworkChaos) error {
    ...
    pbClient := pb.NewChaosDaemonClient(c)
    containerId := pod.Status.ContainerStatuses[0].ContainerID
    netem, err := spec.ToNetem()
    if err != nil {
        return err
    }
    _, err = pbClient.SetNetem(ctx, &pb.NetemRequest{
        ContainerId: containerId,
        Netem:       netem,
    })
    return err
}

In the pkg/chaosdaemon package, we can see how Chaos Daemon handles requests.

func (s *Server) SetNetem(ctx context.Context, in *pb.NetemRequest) (*empty.Empty, error) {
    log.Info("Set netem", "Request", in)
    pid, err := s.crClient.GetPidFromContainerID(ctx, in.ContainerId)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "get pid from containerID error: %v", err)
    }
    if err := Apply(in.Netem, pid); err != nil {
        return nil, status.Errorf(codes.Internal, "netem apply error: %v", err)
    }
    return &empty.Empty{}, nil
}
// Apply applies a netem on eth0 in pid related namespace
func Apply(netem *pb.Netem, pid uint32) error {
    log.Info("Apply netem on PID", "pid", pid)
    ns, err := netns.GetFromPath(GenNetnsPath(pid))
    if err != nil {
        log.Error(err, "failed to find network namespace", "pid", pid)
        return errors.Trace(err)
    }
    defer ns.Close()
    handle, err := netlink.NewHandleAt(ns)
    if err != nil {
        log.Error(err, "failed to get handle at network namespace", "network namespace", ns)
        return err
    }
    link, err := handle.LinkByName("eth0") // TODO: check whether interface name is eth0
    if err != nil {
        log.Error(err, "failed to find eth0 interface")
        return errors.Trace(err)
    }
    netemQdisc := netlink.NewNetem(netlink.QdiscAttrs{
        LinkIndex: link.Attrs().Index,
        Handle:    netlink.MakeHandle(1, 0),
        Parent:    netlink.HANDLE_ROOT,
    }, ToNetlinkNetemAttrs(netem))
    if err = handle.QdiscAdd(netemQdisc); err != nil {
        if !strings.Contains(err.Error(), "file exists") {
            log.Error(err, "failed to add Qdisc")
            return errors.Trace(err)
        }
    }
    return nil
}

Finally, the vishvananda/netlink library operates on the Linux network interface to complete the task.

From here, you can see that NetworkChaos manipulates the Linux host network to create chaos, using tools such as iptables and ipset.

In the Chaos Daemon’s Dockerfile, you can see the Linux tool chain it depends on:

RUN apt-get update && \
    apt-get install -y tzdata iptables ipset stress-ng iproute2 fuse util-linux procps curl && \
    rm -rf /var/lib/apt/lists/*

Stress testing

StressChaos is also implemented by the Chaos Daemon. After the Controller Manager calculates the stress rules, it sends the task to the Chaos Daemon on the node of the selected Pod. The parameter assembly is shown below: the stressor options are combined into command-line arguments and appended to the stress-ng executable command.

// Normalize the stressors to comply with stress-ng
func (in *Stressors) Normalize() (string, error) {
    stressors := ""
    if in.MemoryStressor != nil && in.MemoryStressor.Workers != 0 {
        stressors += fmt.Sprintf(" --vm %d --vm-keep", in.MemoryStressor.Workers)
        if len(in.MemoryStressor.Size) != 0 {
            if in.MemoryStressor.Size[len(in.MemoryStressor.Size)-1] != '%' {
                size, err := units.FromHumanSize(string(in.MemoryStressor.Size))
                if err != nil {
                    return "", err
                }
                stressors += fmt.Sprintf(" --vm-bytes %d", size)
            } else {
                stressors += fmt.Sprintf(" --vm-bytes %s",
                    in.MemoryStressor.Size)
            }
        }
        if in.MemoryStressor.Options != nil {
            for _, v := range in.MemoryStressor.Options {
                stressors += fmt.Sprintf(" %v ", v)
            }
        }
    }
    if in.CPUStressor != nil && in.CPUStressor.Workers != 0 {
        stressors += fmt.Sprintf(" --cpu %d", in.CPUStressor.Workers)
        if in.CPUStressor.Load != nil {
            stressors += fmt.Sprintf(" --cpu-load %d",
                *in.CPUStressor.Load)
        }
        if in.CPUStressor.Options != nil {
            for _, v := range in.CPUStressor.Options {
                stressors += fmt.Sprintf(" %v ", v)
            }
        }
    }
    return stressors, nil
}

On the server side, the Chaos Daemon handles the command execution by invoking the standard Go os/exec package. For details, see the pkg/chaosdaemon/stress_server_linux.go file. There is also a file with the same name ending in darwin; the *_darwin files prevent build errors when developing on macOS.

The code uses the shirou/gopsutil package to get the state of the process by PID and to read its stdout and stderr. I’ve seen this processing pattern in hashicorp/go-plugin, and go-plugin does it even better.
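To make that flow concrete, here is a simplified, hypothetical sketch of launching stress-ng with a normalized flag string and capturing its combined output; the real Chaos Daemon manages the process in the background instead of waiting for it to finish.

package main

import (
    "log"
    "os/exec"
    "strings"
)

// runStressors starts stress-ng with the flag string produced by
// Stressors.Normalize and waits for it to exit, returning the combined
// stdout and stderr.
func runStressors(normalized string) (string, error) {
    cmd := exec.Command("stress-ng", strings.Fields(normalized)...)
    out, err := cmd.CombinedOutput()
    return string(out), err
}

func main() {
    // Two CPU workers at 50% load for ten seconds.
    out, err := runStressors("--cpu 2 --cpu-load 50 --timeout 10s")
    if err != nil {
        log.Fatal(err)
    }
    log.Println(out)
}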

I/O fault injection

Pull request #826 introduced a new implementation of IOChaos that does not use sidecar injection. The Chaos Daemon directly manipulates Linux namespaces using runc’s underlying container commands and runs chaos-mesh/toda, a FUSE program written in Rust, to inject I/O chaos into the container. The JSON-RPC 2.0 protocol is used for communication between toda and the control plane.
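As a quick reminder of what a JSON-RPC 2.0 exchange looks like, here is a generic sketch of the request envelope; the method name and parameters are purely illustrative and do not describe toda’s actual RPC interface.

package main

import (
    "encoding/json"
    "fmt"
)

// request is the standard JSON-RPC 2.0 request envelope.
type request struct {
    JSONRPC string      `json:"jsonrpc"`
    Method  string      `json:"method"`
    Params  interface{} `json:"params,omitempty"`
    ID      int         `json:"id"`
}

func main() {
    req := request{JSONRPC: "2.0", Method: "update", Params: map[string]string{"path": "/tmp"}, ID: 1}
    b, _ := json.Marshal(req)
    // Prints: {"jsonrpc":"2.0","method":"update","params":{"path":"/tmp"},"id":1}
    fmt.Println(string(b))
}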

The new IOChaos implementation does not modify the Pod resource. When an IOChaos experiment is defined, a PodIOChaos resource is created for each Pod matched by the selector field, with the Pod as the PodIOChaos owner reference. At the same time, a set of finalizers is added to PodIOChaos to release the PodIOChaos resources before they are deleted.

// Apply implements the reconciler.InnerReconciler.Apply
func (r *Reconciler) Apply(ctx context.Context, req ctrl.Request, chaos v1alpha1.InnerObject) error {
    iochaos, ok := chaos.(*v1alpha1.IoChaos)
    if !ok {
        err := errors.New("chaos is not IoChaos")
        r.Log.Error(err, "chaos is not IoChaos", "chaos", chaos)
        return err
    }
    source := iochaos.Namespace + "/" + iochaos.Name
    m := podiochaosmanager.New(source, r.Log, r.Client)
    pods, err := utils.SelectAndFilterPods(ctx, r.Client, r.Reader, &iochaos.Spec)
    if err != nil {
        r.Log.Error(err, "failed to select and filter pods")
        return err
    }
    r.Log.Info("applying iochaos", "iochaos", iochaos)
    for _, pod := range pods {
        t := m.WithInit(types.NamespacedName{
            Name:      pod.Name,
            Namespace: pod.Namespace,
        })
        // TODO: support chaos on multiple volume
        t.SetVolumePath(iochaos.Spec.VolumePath)
        t.Append(v1alpha1.IoChaosAction{
            Type: iochaos.Spec.Action,
            Filter: v1alpha1.Filter{
                Path:    iochaos.Spec.Path,
                Percent: iochaos.Spec.Percent,
                Methods: iochaos.Spec.Methods,
            },
            Faults: []v1alpha1.IoFault{
                {
                    Errno:  iochaos.Spec.Errno,
                    Weight: 1,
                },
            },
            Latency:          iochaos.Spec.Delay,
            AttrOverrideSpec: iochaos.Spec.Attr,
            Source:           m.Source,
        })
        key, err := cache.MetaNamespaceKeyFunc(&pod)
        if err != nil {
            return err
        }
        iochaos.Finalizers = utils.InsertFinalizer(iochaos.Finalizers, key)
    }
    r.Log.Info("commiting updates of podiochaos")
    err = m.Commit(ctx)
    if err != nil {
        r.Log.Error(err, "fail to commit")
        return err
    }
    r.Event(iochaos, v1.EventTypeNormal, utils.EventChaosInjected, "")
    return nil
}

In the PodIoChaos resource controller, the Controller Manager encapsulates the resource into request parameters and calls the Chaos Daemon interface to process them.

// Apply flushes io configuration on pod
func (h *Handler) Apply(ctx context.Context, chaos *v1alpha1.PodIoChaos) error {
    h.Log.Info("updating io chaos", "pod", chaos.Namespace+"https://dzone.com/"+chaos.Name, "spec", chaos.Spec)
    ...
    res, err := pbClient.ApplyIoChaos(ctx, &pb.ApplyIoChaosRequest{
        Actions:     input,
        Volume:      chaos.Spec.VolumeMountPath,
        ContainerId: containerID,
        Instance:  chaos.Spec.Pid,
        StartTime: chaos.Spec.StartTime,
    })
    if err != nil {
        return err
    }
    chaos.Spec.Pid = res.Instance
    chaos.Spec.StartTime = res.StartTime
    chaos.OwnerReferences = []metav1.OwnerReference{
        {
            APIVersion: pod.APIVersion,
            Kind:       pod.Kind,
            Name:       pod.Name,
            UID:        pod.UID,
        },
    }
    return nil
}

The pkg/chaosdaemon/iochaos_server.go file handles IOChaos operations. In this file, the FUSE program must be injected into the container. As discussed in #2305 on GitHub, a command such as /usr/local/bin/nsexec -l -p /proc/119186/ns/pid -m /proc/119186/ns/mnt -- /usr/local/bin/toda --path /tmp --verbose info runs the toda program in the same namespaces as the Pod.

func (s *DaemonServer) ApplyIOChaos(ctx context.Context, in *pb.ApplyIOChaosRequest) (*pb.ApplyIOChaosResponse, error) {
    ...
    pid, err := s.crClient.GetPidFromContainerID(ctx, in.ContainerId)
    if err != nil {
        log.Error(err, "error while getting PID")
        return nil, err
    }
    args := fmt.Sprintf("--path %s --verbose info", in.Volume)
    log.Info("executing", "cmd", todaBin+" "+args)
    processBuilder := bpm.DefaultProcessBuilder(todaBin, strings.Split(args, " ")...).
        EnableLocalMnt().
        SetIdentifier(in.ContainerId)
    if in.EnterNS {
        processBuilder = processBuilder.SetNS(pid, bpm.MountNS).SetNS(pid, bpm.PidNS)
    }
    ...
    // Calls JSON RPC
    client, err := jrpc.DialIO(ctx, receiver, caller)
    if err != nil {
        return nil, err
    }
    cmd := processBuilder.Build()
    procState, err := s.backgroundProcessManager.StartProcess(cmd)
    if err != nil {
        return nil, err
    }
    ...
}

The following code sample builds the command to run. It implements the same underlying namespace isolation that runc uses:

// GetNsPath returns corresponding namespace path
func GetNsPath(pid uint32, typ NsType) string {
    return fmt.Sprintf("%s/%d/ns/%s", DefaultProcPrefix, pid, string(typ))
}
// SetNS sets the namespace of the process
func (b *ProcessBuilder) SetNS(pid uint32, typ NsType) *ProcessBuilder {
    return b.SetNSOpt([]nsOption{{
        Typ:  typ,
        Path: GetNsPath(pid, typ),
    }})
}
// Build builds the process
func (b *ProcessBuilder) Build() *ManagedProcess {
    args := b.args
    cmd := b.cmd
    if len(b.nsOptions) > 0 {
        args = append([]string{"--", cmd}, args...)
        for _, option := range b.nsOptions {
            args = append([]string{"-" + nsArgMap[option.Typ], option.Path}, args...)
        }
        if b.localMnt {
            args = append([]string{"-l"}, args...)
        }
        cmd = nsexecPath
    }
    ...
}
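As a rough usage sketch (the import path, PID, and printed output are assumptions for illustration), chaining the builder calls shown above should reproduce the nsexec command line quoted earlier:

package main

import (
    "fmt"

    "github.com/chaos-mesh/chaos-mesh/pkg/bpm"
)

func main() {
    // Suppose toda must run inside the mount and PID namespaces of PID 119186.
    pid := uint32(119186)
    cmd := bpm.DefaultProcessBuilder("/usr/local/bin/toda", "--path", "/tmp", "--verbose", "info").
        EnableLocalMnt().
        SetNS(pid, bpm.MountNS).
        SetNS(pid, bpm.PidNS).
        Build()
    // The built process should wrap roughly:
    //   /usr/local/bin/nsexec -l -p /proc/119186/ns/pid -m /proc/119186/ns/mnt \
    //     -- /usr/local/bin/toda --path /tmp --verbose info
    fmt.Println(cmd)
}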

Developing a control plane

Chaos Mesh is an open-source Chaos Engineering system under the Apache 2.0 license. As discussed above, it has rich capabilities and a good ecosystem. Its maintenance team has also developed chaos-mesh/toda, the FUSE-based I/O chaos tool; chaos-mesh/k8s_dns_chaos, a CoreDNS chaos plugin; and chaos-mesh/bpfki, a Berkeley Packet Filter (BPF) based kernel fault injection tool.

Now, I will describe the server-side code required to build an end-user-oriented chaos engineering platform. This implementation is just an example, not necessarily the best one. If you want to see how a real-world platform is developed, refer to the Chaos Mesh Dashboard, which uses the uber-go/fx dependency injection framework and the controller-runtime manager mode.

Chaos Mesh’s main features

As shown in the Chaos Mesh workflow below, we need to implement a server that sends YAML to the Kubernetes API. The Chaos Controller Manager performs the complex rule checking and hands the rules over to the Chaos Daemon. If you want to connect Chaos Mesh to your own platform, you only need to hook into the CRD resource creation process.

Basic workflow for Chaos Mesh

Let’s look at the example on the Chaos Mesh website:

import (
    "context"
    "github.com/pingcap/chaos-mesh/api/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)
func main() {
    ...
    delay := &chaosv1alpha1.NetworkChaos{
        Spec: chaosv1alpha1.NetworkChaosSpec{...},
    }
    k8sClient, err := client.New(conf, client.Options{Scheme: scheme.Scheme})
    if err != nil {
        ...
    }
    k8sClient.Create(context.TODO(), delay)
    k8sClient.Delete(context.TODO(), delay)
}

Chaos Mesh provides APIs for each of its CRDs. We use the controller-runtime library, developed by the Kubernetes API Machinery SIG, to simplify interaction with the Kubernetes API.

Injecting chaos

Suppose we want to create a PodKill resource programmatically. After the resource is sent to the Kubernetes API server, it passes through the Chaos Controller Manager’s admission controller, which validates the data. If the admission controller rejects the input data when we create a chaos experiment, an error is returned to the client. For the specific parameters, you can read Create Experiments Using YAML Configuration Files.

NewClient creates a Kubernetes API client.
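The controlpanel.NewClient helper lives in the companion repository and is not reproduced in this article, so the following is only a rough sketch, under assumed package layout and error handling, of how such a constructor could be written with controller-runtime:

package controlpanel

import (
    "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
    "k8s.io/apimachinery/pkg/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// NewClient builds a controller-runtime client whose scheme knows both the
// built-in Kubernetes types and the Chaos Mesh CRD types.
func NewClient() (client.Client, error) {
    scheme := runtime.NewScheme()
    if err := clientgoscheme.AddToScheme(scheme); err != nil {
        return nil, err
    }
    if err := v1alpha1.AddToScheme(scheme); err != nil {
        return nil, err
    }
    // GetConfig reads the kubeconfig from the usual locations: flags,
    // environment variables, or the in-cluster service account.
    conf, err := ctrl.GetConfig()
    if err != nil {
        return nil, err
    }
    return client.New(conf, client.Options{Scheme: scheme})
}

With such a client in place, creating a PodKill experiment looks like this: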

package main
import (
    "context"
    "controlpanel"
    "log"
    "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
    "github.com/pkg/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func applyPodKill(name, namespace string, labels map[string]string) error {
    cli, err := controlpanel.NewClient()
    if err != nil {
        return errors.Wrap(err, "create client")
    }
    cr := &v1alpha1.PodChaos{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: name,
            Namespace:    namespace,
        },
        Spec: v1alpha1.PodChaosSpec{
            Action: v1alpha1.PodKillAction,
            ContainerSelector: v1alpha1.ContainerSelector{
                PodSelector: v1alpha1.PodSelector{
                    Mode: v1alpha1.OnePodMode,
                    Selector: v1alpha1.PodSelectorSpec{
                        Namespaces:     []string{namespace},
                        LabelSelectors: labels,
                    },
                },
            },
        },
    }
    if err := cli.Create(context.Background(), cr); err != nil {
        return errors.Wrap(err, "create podkill")
    }
    return nil
}

The program’s log output is:

I1021 00:51:55.225502   23781 request.go:665] Waited for 1.033116256s due to client-side throttling, not priority and fairness, request: GET:https://***
2021/10/21 00:51:56 apply podkill

Use kubectl to check the status of the PodKill resource:

$ k describe podchaos.chaos-mesh.org -n dev podkillvjn77
Name:         podkillvjn77
Namespace:    dev
Labels:       <none>
Annotations:  <none>
API Version:  chaos-mesh.org/v1alpha1
Kind:         PodChaos
Metadata:
  Creation Timestamp:  2021-10-20T16:51:56Z
  Finalizers:
    chaos-mesh/records
  Generate Name:     podkill
  Generation:        7
  Resource Version:  938921488
  Self Link:         /apis/chaos-mesh.org/v1alpha1/namespaces/dev/podchaos/podkillvjn77
  UID:               afbb40b3-ade8-48ba-89db-04918d89fd0b
Spec:
  Action:        pod-kill
  Grace Period:  0
  Mode:          one
  Selector:
    Label Selectors:
      app:  nginx
    Namespaces:
      dev
Status:
  Conditions:
    Reason:  
    Status:  False
    Type:    Paused
    Reason:  
    Status:  True
    Type:    Selected
    Reason:  
    Status:  True
    Type:    AllInjected
    Reason:  
    Status:  False
    Type:    AllRecovered
  Experiment:
    Container Records:
      Id:            dev/nginx
      Phase:         Injected
      Selector Key:  .
    Desired Phase:   Run
Events:
  Type    Reason           Age    From          Message
  ----    ------           ----   ----          -------
  Normal  FinalizerInited  6m35s  finalizer     Finalizer has been inited
  Normal  Updated          6m35s  finalizer     Successfully update finalizer of resource
  Normal  Updated          6m35s  records       Successfully update records of resource
  Normal  Updated          6m35s  desiredphase  Successfully update desiredPhase of resource
  Normal  Applied          6m35s  records       Successfully apply chaos for dev/nginx
  Normal  Updated          6m35s  records       Successfully update records of resource

The control plane also needs to query and list Chaos resources so that platform users can view and manage the execution status of all chaos experiments. To achieve this, we can call the REST API with Get or List requests. In practice, though, we need to pay attention to the details. In our company, we noticed that the Kubernetes API server became overloaded whenever the control plane requested the full set of resource data.

I recommend reading the controller-runtime tutorial (in Japanese). Even if you don’t understand Japanese, you can still learn a lot from the tutorial by reading its source code. It covers many details. For example, by default controller-runtime reads the kubeconfig from multiple locations: flags, environment variables, and the service account automatically mounted in the Pod. Pull request #21 for armosec/kubescape uses this feature. The tutorial also covers common operations, such as how to paginate, update, and overwrite objects. I haven’t seen an English tutorial that is this detailed.

Below are examples of Get and List requests:

package controlpanel
import (
    "context"
    "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
    "github.com/pkg/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)
func GetPodChaos(name, namespace string) (*v1alpha1.PodChaos, error) {
    cli := mgr.GetClient()
    item := new(v1alpha1.PodChaos)
    if err := cli.Get(context.Background(), client.ObjectKey{Name: name, Namespace: namespace}, item); err != nil {
        return nil, errors.Wrap(err, "get cr")
    }
    return item, nil
}
func ListPodChaos(namespace string, labels map[string]string) ([]v1alpha1.PodChaos, error) {
    cli := mgr.GetClient()
    list := new(v1alpha1.PodChaosList)
    if err := cli.List(context.Background(), list, client.InNamespace(namespace), client.MatchingLabels(labels)); err != nil {
        return nil, err
    }
    return list.Items, nil
}

This example uses the manager mode. The manager’s caching mechanism prevents the control plane from repeatedly fetching large amounts of data from the API server (a sketch of the manager setup follows this list):

  1. Get requests are served from the cache.
  2. The first List request fetches the full data.
  3. A watch refreshes the cache when the data changes.
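How the mgr variable is initialized is not shown above; a minimal, hypothetical sketch of such a manager setup, with an assumed package layout, might look like this:

package controlpanel

import (
    "context"
    "log"

    "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
    "k8s.io/apimachinery/pkg/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

var mgr manager.Manager

// InitManager builds the manager whose cached client is returned by
// mgr.GetClient() in the examples above, then starts its cache.
func InitManager(ctx context.Context) error {
    scheme := runtime.NewScheme()
    if err := clientgoscheme.AddToScheme(scheme); err != nil {
        return err
    }
    if err := v1alpha1.AddToScheme(scheme); err != nil {
        return err
    }
    m, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
    if err != nil {
        return err
    }
    mgr = m
    // Start runs the informer cache until ctx is cancelled; run it in a
    // separate goroutine so callers can keep using mgr.GetClient().
    go func() {
        if err := mgr.Start(ctx); err != nil {
            log.Println("manager exited:", err)
        }
    }()
    return nil
}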

Chaos orchestration

A Container Runtime Interface (CRI) container runtime provides powerful isolation capabilities that support the stable operation of containers, but more complex and customizable scenarios require container orchestration. Similarly, Chaos Mesh provides the Schedule and Workflow features for orchestrating chaos. Based on a configured Cron schedule, Schedule can trigger faults regularly and at set intervals. Workflow can schedule multiple fault experiments, much like Argo Workflows.

The Chaos Controller Manager does most of the work for us. The control plane primarily manages these YAML resources; you just need to think about the features you want to provide to end users.

Platform Features

The following figure shows the Chaos Mesh Dashboard. We need to consider which features the platform should provide to end users.

Chaos Mesh Dashboard

From the Dashboard, we know that the platform may have these features:

  • Chaos injection
  • Pod crash
  • Network failure
  • Stress testing
  • I/O faults
  • Event tracking
  • Associated alerts
  • Time-series telemetry

If you are interested in Chaos Mesh and want to improve it, join the Slack channel (#project-chaos-mesh) or submit pull requests or issues to their GitHub repository.
