Disrupting the Cloud with eBPF Chaos Engineering
Chaos engineering is a discipline for identifying, introducing, and remediating failure modes in systems by subjecting them to intentional, controlled experiments. With the growing popularity of containerization and microservices, chaos engineering practices have become more critical than ever. One persistent challenge is automating the experimentation process and extracting meaningful insights from the results. This is where eBPF comes in.
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that provides a safe way of dynamically instrumenting the kernel. eBPF programs can monitor kernel events and trace the behavior of userspace applications. In this article, we will discuss how eBPF can be used to implement chaos engineering in cloud environments, along with example experiments across different cloud service providers.
Implementing Chaos Engineering with eBPF:
To implement chaos engineering with eBPF, we need an eBPF-based monitoring system that can trace and observe kernel behavior. One example is BCC (BPF Compiler Collection), which provides a high-level interface for writing eBPF programs: programs written in a restricted subset of C are compiled into bytecode that executes in the kernel, typically driven from a Python or Lua front end.
Here’s a high-level view of how eBPF-based chaos engineering works.
Identify the system’s failure domains, i.e., which components can cause the system to fail.
Define a hypothesis, i.e., assumptions about how the system will behave in the event of these failures.
Write a script leveraging BCC that uses eBPF programs to introduce these failures into the system (a minimal sketch follows this list).
Monitor the system’s behavior, i.e., how the failure pattern affected system metrics.
Analyze the results and remediate any defects that the tests exposed.
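As an illustration of step 3, here is a minimal BCC-style sketch that uses the kernel’s error-injection facility to make a small fraction of slab allocations fail, simulating memory pressure. It assumes a kernel built with CONFIG_BPF_KPROBE_OVERRIDE=y; should_failslab is one of the functions on the kernel’s error-injection allowlist, and the program would typically be loaded through BCC’s Python front end (e.g., BPF(text=...)).

#include <uapi/linux/ptrace.h>

// Fail roughly 1% of kernel slab allocations with -ENOMEM.
// bpf_override_return only works on functions the kernel allows
// for error injection, which includes should_failslab.
int kprobe__should_failslab(struct pt_regs *ctx) {
    if (bpf_get_prandom_u32() % 100 == 0) {
        bpf_override_return(ctx, -12 /* -ENOMEM */);
    }
    return 0;
}

Running such a script while watching service-level metrics (steps 4 and 5) shows whether the application degrades gracefully when allocations start failing.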
Example Experiments with Cloud Service Providers:
Here are three experiments commonly used in chaos engineering that can be implemented using eBPF:
Network partition experiment: This experiment simulates a temporary network failure between two service components to see how the service responds to network outages. One implementation is to randomly drop network packets between two microservices at the transport layer, which helps identify whether the services recover gracefully from network failures. Using eBPF, this can be implemented with a tool such as chaos-net-emulator, which passes packets through an eBPF-based network emulator.
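As a minimal sketch of the packet-dropping idea (independent of any particular tool), an XDP program can randomly drop a fraction of incoming packets on an interface to emulate a flaky link. This assumes a libbpf-style build and attachment to the target interface, for example with ip link set dev eth0 xdpgeneric obj drop.o sec xdp:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Drop roughly one in five incoming packets to emulate an
// unreliable link between services; the rest pass through.
SEC("xdp")
int xdp_random_drop(struct xdp_md *ctx) {
    if (bpf_get_prandom_u32() % 5 == 0)
        return XDP_DROP;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";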
Resource starved experiment: This experiment simulates a situation where a service does not receive its expected resources, forcing it to scale down. One implementation is to randomly throttle the CPU allocated to a Kubernetes pod by modifying its cgroup settings, which helps identify whether the service can keep running, albeit in a degraded state, with minimal resources. Using eBPF, this can be implemented with cgroup-bpf, which attaches an eBPF program to a specific cgroup path and filters the relevant metrics.
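As a minimal sketch of the throttling side (plain cgroup v2 settings rather than an eBPF program), the following C snippet caps a cgroup at roughly 10% of one CPU core. The pod cgroup path shown is hypothetical; the real path depends on the node’s cgroup layout:

#include <stdio.h>

int main(void) {
    // Hypothetical pod cgroup path; find the real one under
    // /sys/fs/cgroup/kubepods.slice/ on the node.
    const char *path =
        "/sys/fs/cgroup/kubepods.slice/kubepods-pod_example.slice/cpu.max";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    // cpu.max format is "<quota> <period>" in microseconds:
    // 10000/100000 allows 10ms of CPU time per 100ms (~10% of a core).
    fprintf(f, "10000 100000\n");
    fclose(f);
    return 0;
}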
Disk IO experiment: This experiment simulates a situation where the disk is over- or under-utilized, which can cause the service to fail. One implementation is to randomly read from and write to a file to simulate disk IO spikes, which helps identify whether the service can cope with and recover from degraded disk IO. Using eBPF, the IO Visor project provides a set of eBPF programs that can analyze disk performance and highlight the bottlenecks.
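As a minimal sketch of the load-generation side (ordinary file IO, which you would then observe with eBPF tools such as BCC’s biolatency), the following C program produces bursts of synchronous writes; the scratch file name is an arbitrary choice:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    memset(buf, 0xAB, sizeof(buf));
    int fd = open("/tmp/io_chaos.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (int i = 0; i < 10000; i++) {
        if (write(fd, buf, sizeof(buf)) < 0) {
            perror("write");
            break;
        }
        fsync(fd);  // force each write to reach the disk
        if (i % 1000 == 0)
            usleep(100000);  // brief pause between bursts
    }
    close(fd);
    unlink("/tmp/io_chaos.dat");
    return 0;
}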
Cloud Service Provider Example: Cloud service providers publish service level agreements (SLAs) that guarantee certain levels of uptime and reliability, so implementing chaos engineering helps ensure the service is available whenever customers need it.
Measuring and analyzing eBPF performance in a Kubernetes cluster
The objective of this experiment is to measure the performance of eBPF technology in a Kubernetes cluster and analyze it for any impact on application performance.
Requirements:
A Kubernetes cluster with at least two nodes running version 1.18 or higher
Access to the Linux kernel on each node
Tools required for building eBPF programs, such as the BCC toolkit and the LLVM compiler
Prometheus and Grafana for monitoring and visualizing metrics
Steps:
Install the BCC toolkit and the LLVM compiler on each Kubernetes node. These tools are needed for building and running eBPF programs.
Write a simple eBPF program to measure performance metrics such as packet counts, latency, and CPU utilization. A sample program, written in C, could look like the following:
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

// Event pushed to user space for every received UDP datagram
struct event_t {
    u32 pid;
    u64 ts;
    u64 len;
};

BPF_PERF_OUTPUT(packet_count);

// kprobe on udp_recvmsg: fires each time the kernel receives a UDP datagram
int kprobe__udp_recvmsg(struct pt_regs *ctx, struct sock *sk,
                        struct msghdr *msg, size_t len) {
    struct event_t event = {};
    event.pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the PID
    event.ts = bpf_ktime_get_ns();                 // receive timestamp (ns)
    event.len = len;                               // datagram length in bytes
    packet_count.perf_submit(ctx, &event, sizeof(event));
    return 0;
}
This program attaches a kprobe to udp_recvmsg, captures the PID, timestamp, and datagram length each time a UDP packet is received, and submits the data to a perf buffer for monitoring from user space.
Compile the eBPF program and deploy it to the Kubernetes nodes. Cilium, a popular eBPF-based CNI plugin for Kubernetes, is one way to run eBPF programs across a cluster; other options include kube-bpf, Calico, or Weave Net.
Configure Prometheus and Grafana to scrape metrics and visualize packet count, latency, and CPU utilization. Prometheus cannot read a perf buffer directly, so a small user-space exporter reads events from the packet_count perf buffer defined in the eBPF program and exposes them as Prometheus metrics; Grafana then displays time series graphs based on the scraped data.
Run a workload in your Kubernetes cluster and measure the eBPF performance metrics to analyze the impact on application performance.
Finally, analyze the metrics and identify any areas where eBPF may be impacting application performance. If any issues are found, perform the appropriate remediation steps.
In conclusion, this experiment measures and analyzes the performance of eBPF in a Kubernetes cluster using a sample eBPF program, Prometheus, and Grafana. It is a useful exercise to ensure that adopting eBPF doesn’t degrade the overall performance of the applications running on the cluster.
Returning to the three experiments described earlier, let’s take an example of implementing them with Amazon Web Services.
Network partition experiment can be implemented by blocking traffic between microservices in Amazon Elastic Beanstalk, for example with VPC security group or network ACL rules, while Amazon CloudWatch metrics capture the impact of the partition.
Resource starved experiment can be implemented by throttling a pod’s CPU and using an Amazon CloudWatch alarm that triggers whenever the pod’s CPU utilization exceeds a threshold value.
Disk IO experiment can be observed by using Amazon CloudWatch Logs, which can archive the collected logs to Amazon Simple Storage Service (S3) for later analysis.
Experiments with Google Cloud Platform (GCP):
Network partition experiment: To simulate a temporary network failure between two service components, we can use GCP’s firewall rules to drop network packets. We can configure the firewall rules to either drop packets based on IP ranges or tags, and we can use this functionality to simulate a network partition. By doing this, we can test how the service behaves in the event of network outages to ensure that it is resilient.
Resource starved experiment: To simulate a situation where a service does not receive its expected resources, we can use GCP’s Compute Engine API to change machine types, which alters the resources allocated to a particular service. We can then use GCP’s monitoring service to confirm that the service is running with fewer resources than expected. The results should be analyzed to determine whether the service is running in a degraded state and, if so, remediation steps should be taken.
Disk IO experiment: To simulate a situation where the disk is over- or under-utilized, we can use GCP’s Compute Engine API to modify the number of IOPS or the disk size allocated to a particular service. We can then use GCP’s monitoring service to verify that the service is no longer achieving its expected level of disk IO. The results should be analyzed to determine whether the service is running in a degraded state and, if so, remediation steps should be taken.
In conclusion, eBPF, coupled with chaos engineering, provides a powerful way to identify, introduce, and remediate failure modes in systems. With eBPF, automating the experimentation process and extracting meaningful insights from the experiments becomes easier. The experiments and cloud service provider examples above can aid in designing and operating a reliable, resilient, and fault-tolerant system.