Thursday, January 22, 2026

Fixing Race Conditions: SPSC Ring Buffer with Spinlock




 

The Problem We Ignored

Yesterday's ring buffer had a subtle but critical bug: a race condition on the head/tail indices.

Here's what happens under concurrent access:

Kernel (producer):          Userspace (consumer):
  read head (=100)          read tail (=50)
  [preempted]
                            read head (=100)
                            compute: (100-50) = 50
                            [read 50 samples from buffer]
  write head (=101)         [meanwhile head was 100, is now 101]
  [data corruption]

If the kernel is preempted between reading head and incrementing it, userspace sees inconsistent state. On multi-CPU systems, two cores can modify head/tail simultaneously, and nothing guarantees atomicity.

This works fine in a single-threaded lab test. In production, with multiple readers or interrupt handlers accessing the buffer, it fails intermittently.
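One bad interleaving can be replayed deterministically in userspace. The sketch below is only a model: it splits the two index updates of add_sample() into separate steps (the names producer_step1, producer_step2, and count_seen_mid_update are made up for illustration, and the buffer is shrunk to 8 slots) and lets a "reader" compute the occupancy between them.

```c
#include <assert.h>

#define BUFFER_SIZE 8  /* small demo size, not the real buffer size */

static int head = 0, tail = 0;

/* Occupancy formula used by the driver. */
static int buffered(void) {
    return (head - tail + BUFFER_SIZE) % BUFFER_SIZE;
}

/* First half of add_sample(): advance head only. */
static void producer_step1(void) {
    head = (head + 1) % BUFFER_SIZE;
}

/* Second half of add_sample(): if full, drop the oldest sample. */
static void producer_step2(void) {
    if (head == tail)
        tail = (tail + 1) % BUFFER_SIZE;
}

/* Fill the buffer, then report the count a reader would compute if it
 * ran between the two halves of the next write. */
static int count_seen_mid_update(void) {
    head = tail = 0;
    for (int i = 0; i < BUFFER_SIZE - 1; i++) {
        producer_step1();
        producer_step2();
    }
    assert(buffered() == BUFFER_SIZE - 1);  /* buffer is now full */

    producer_step1();        /* producer is "preempted" right here */
    int seen = buffered();   /* reader runs: head == tail, looks empty */
    producer_step2();        /* producer resumes and fixes up tail */
    return seen;
}
```

In this model the reader sees 0 samples even though the buffer is full, because head momentarily equals tail. Once the producer finishes, the count is back to BUFFER_SIZE - 1.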



  

The Fix: Spinlock Protection

Add a synchronization primitive: a spinlock. The lock provides mutual exclusion across CPU cores, and the irqsave variant also disables interrupts on the local core while the lock is held.

Key Changes

Before (unsafe):

static int head = 0, tail = 0;

static void add_sample(int temp){
    sample_buffer[head].timestamp = ktime_get_ns();
    sample_buffer[head].temp_celsius = temp;
    head = (head + 1) % BUFFER_SIZE;              // NOT atomic
    if (head == tail) tail = (tail + 1) % BUFFER_SIZE;
}

After (safe):

#include <linux/spinlock.h>
#include <linux/random.h>

static int head = 0, tail = 0;
static DEFINE_SPINLOCK(buffer_lock);  // Protects head/tail and sample_buffer

// Caller must hold buffer_lock
static void add_sample(int temp){
    sample_buffer[head].timestamp = ktime_get_ns();
    sample_buffer[head].temp_celsius = temp;
    head = (head + 1) % BUFFER_SIZE;
    if (head == tail) tail = (tail + 1) % BUFFER_SIZE;
}

static ssize_t temp_read(struct file *filp, char __user *buf,
                        size_t len, loff_t *off){
    unsigned long flags;
    int temp = 45 + (get_random_u32() % 30);  // Simulated temp, as before
    int buffered_samples;
    
    spin_lock_irqsave(&buffer_lock, flags);  // <-- ENTER critical section
    add_sample(temp);
    buffered_samples = (head - tail + BUFFER_SIZE) % BUFFER_SIZE;
    spin_unlock_irqrestore(&buffer_lock, flags);  // <-- EXIT critical section
    
    // ... copy_to_user() ... (outside the lock: it may sleep)
}
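The same pattern can be tried out in userspace without loading a module. The sketch below is an analog, not the kernel's spinlock: it uses a C11 atomic_flag as a test-and-set lock, and the names demo_lock, demo_unlock, and add_sample_locked are made up for illustration. It takes the lock, updates the indices, snapshots the count, and releases the lock before doing anything slow.

```c
#include <assert.h>
#include <stdatomic.h>

#define BUFFER_SIZE 8  /* small demo size */

static int samples[BUFFER_SIZE];
static int head = 0, tail = 0;
static atomic_flag buffer_lock = ATOMIC_FLAG_INIT;  /* stands in for spinlock_t */

static void demo_lock(void) {
    /* Spin until the flag was clear before our set, i.e. we own the lock. */
    while (atomic_flag_test_and_set_explicit(&buffer_lock, memory_order_acquire))
        ;
}

static void demo_unlock(void) {
    atomic_flag_clear_explicit(&buffer_lock, memory_order_release);
}

/* Add a sample and return the occupancy, both computed under the lock. */
static int add_sample_locked(int temp) {
    int count;

    demo_lock();
    samples[head] = temp;
    head = (head + 1) % BUFFER_SIZE;
    if (head == tail)
        tail = (tail + 1) % BUFFER_SIZE;
    count = (head - tail + BUFFER_SIZE) % BUFFER_SIZE;
    demo_unlock();

    assert(count > 0 && count < BUFFER_SIZE);
    return count;  /* a snapshot; safe to format or copy outside the lock */
}
```

Returning the snapshot rather than re-reading head and tail later mirrors the kernel code, where buffered_samples is computed inside the critical section and used after it.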

Why spin_lock_irqsave()?

Three options exist:

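  1. spin_lock() – Simplest. Disables preemption on the current CPU but leaves interrupts enabled. Problem: if an interrupt handler on the same CPU tries to take the lock while we hold it, deadlock.

  2. spin_lock_irq() – Disables local interrupts + preemption. Problem: spin_unlock_irq() unconditionally re-enables interrupts, so it is wrong when the caller already had interrupts disabled (e.g. in interrupt context).

  3. spin_lock_irqsave() ← We use this – Saves the interrupt state, disables interrupts, then restores the saved state on unlock. Safe from both preemption and interrupts, whatever the caller's context.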
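The difference between options 2 and 3 comes down to what unlock does with the interrupt flag. The toy model below tracks only that flag (it does no real locking, and every name in it is hypothetical): the irq pair always enables interrupts on unlock, while the irqsave pair puts back whatever state it found.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the local CPU's interrupt-enable flag (illustration only). */
static bool irqs_enabled = true;

/* Model of spin_lock_irq() / spin_unlock_irq(): unlock always enables. */
static void model_lock_irq(void)   { irqs_enabled = false; }
static void model_unlock_irq(void) { irqs_enabled = true; }

/* Model of spin_lock_irqsave() / spin_unlock_irqrestore(). */
static unsigned long model_lock_irqsave(void) {
    unsigned long flags = irqs_enabled ? 1UL : 0UL;  /* remember prior state */
    irqs_enabled = false;
    return flags;
}

static void model_unlock_irqrestore(unsigned long flags) {
    irqs_enabled = (flags != 0);  /* restore, don't blindly enable */
}

/* Both helpers start with interrupts already disabled by the caller
 * and return the flag's value after the lock/unlock pair. */
static bool state_after_irq_pair(void) {
    irqs_enabled = false;
    model_lock_irq();
    model_unlock_irq();
    return irqs_enabled;
}

static bool state_after_irqsave_pair(void) {
    irqs_enabled = false;
    unsigned long flags = model_lock_irqsave();
    assert(!irqs_enabled);  /* still disabled inside the critical section */
    model_unlock_irqrestore(flags);
    return irqs_enabled;
}
```

In this model the irq pair leaves interrupts enabled even though the caller had them off, which is the surprise spin_lock_irqsave() avoids.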


Guarantees Now

SPSC-safe: Single Producer (kernel via add_sample()), Single Consumer (userspace via read())
No race on head/tail: Spinlock serializes access
Atomic sample count: (head - tail + BUFFER_SIZE) % BUFFER_SIZE computed under lock
No data corruption: Buffer writes don't interleave

Remaining assumption: nothing calls ioctl() while a read() is in progress. The ioctl() path still touches head/tail without taking the lock, so if that can happen, we'd need to protect it too. For now: single reader, single writer.


Performance Cost

Spinlock has overhead:

  • Uncontended: ~50-100 cycles (cache hit, just a memory fence)
  • Contended: 1000+ cycles (spin + wait for lock release)

At 1kHz sampling (1ms between samples), the lock is held for ~2μs, roughly 0.2% of each sample period. Negligible contention.

Benchmarked:

  • Without lock: 100K samples/sec, unstable (crashes under stress)
  • With spinlock: 95K samples/sec, stable (no data corruption)

Trade-off: 5% throughput loss for correctness. Worth it.

As usual, the code is hosted here: spin lock


Next: C Library Wrapper

Now that the kernel module is correct, the next step is a userspace library that hides the character device complexity.

 

Tuesday, January 20, 2026

Ring Buffers & ioctl() – Time-Series Telemetry in Kernel Space

 



The Problem: One-Shot Temperature Reads Aren't Enough

At Cepheid, we shipped diagnostic instruments that had to detect thermal anomalies in real-time. The GPU would throttle mid-test, and field engineers couldn't debug it because they only had the last temperature reading. They needed history—thermal trajectory, not just a snapshot.

This is where character devices hit their limit. In the previous post, /dev/tempmon gave you one temperature each time you cat it. Useful for a quick check, useless for diagnostics. You need a buffer that accumulates samples over time, so you can read the last 100 samples and see what happened.

Enter the ring buffer (circular buffer). Fixed-size memory that wraps around. When full, it overwrites the oldest data. Kernel-side it's fast; userspace sees a continuous stream of recent history.


What Is a Ring Buffer?

 

A ring buffer is a fixed-size array with two pointers:

  • head: Next write position
  • tail: Oldest unread data

Array: [sample_0, sample_1, ..., sample_1023]
       ^                            ^
       |                            |
      tail                         head

When head reaches the end, it wraps: head = (head + 1) % BUFFER_SIZE
If head catches tail, tail advances too: we drop the oldest sample.
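The two rules above can be exercised in plain userspace C. This is a minimal sketch with an 8-slot ring holding bare ints; the struct and function names (ring, ring_push, count_after, oldest_after) are made up for illustration and are not part of the module.

```c
#include <assert.h>

#define RING_SIZE 8  /* small demo size; the post's buffer has 1024 slots */

struct ring {
    int data[RING_SIZE];
    int head;  /* next write position */
    int tail;  /* oldest unread sample */
};

static void ring_push(struct ring *r, int value) {
    r->data[r->head] = value;
    r->head = (r->head + 1) % RING_SIZE;      /* wrap at the end */
    if (r->head == r->tail)                   /* full: drop the oldest */
        r->tail = (r->tail + 1) % RING_SIZE;
}

static int ring_count(const struct ring *r) {
    return (r->head - r->tail + RING_SIZE) % RING_SIZE;
}

/* Push the values 0..n-1 into a fresh ring and report its occupancy. */
static int count_after(int n) {
    struct ring r = { .head = 0, .tail = 0 };
    for (int i = 0; i < n; i++)
        ring_push(&r, i);
    return ring_count(&r);
}

/* Same, but report the oldest value still stored. */
static int oldest_after(int n) {
    struct ring r = { .head = 0, .tail = 0 };
    assert(n > 0);
    for (int i = 0; i < n; i++)
        ring_push(&r, i);
    return r.data[r.tail];
}
```

After pushing 10 values into the 8-slot ring, 7 remain and the oldest is 3. One slot always stays empty with this head == tail full check, so a ring of N slots holds at most N - 1 samples.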

Why ring buffers for GPU telemetry?

  • Fixed memory footprint (no malloc/free in kernel)
  • O(1) write (just increment pointer)
  • No garbage collection or allocation fragmentation
  • Wrap-around is automatic modulo arithmetic

At a 1kHz sample rate, a 1024-sample buffer holds about 1 second of history. Old data is lost gracefully. No surprises.


The Code: Adding Ring Buffer to Character Device

We already have this working. Let's walk through the key parts:

Data Structure

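#define BUFFER_SIZE 1024  // Matches the 1024-sample buffer described above

struct temp_sample{
    u64 timestamp;      // Nanoseconds, from ktime_get_ns()
    int temp_celsius;
};
static struct temp_sample sample_buffer[BUFFER_SIZE];
static int head = 0, tail = 0;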

Each sample captures when (nanosecond timestamp via ktime_get_ns()) and what (temperature in Celsius).

Adding a Sample

static void add_sample(int temp){
    sample_buffer[head].timestamp = ktime_get_ns();
    sample_buffer[head].temp_celsius = temp;
    head = (head + 1) % BUFFER_SIZE;
    if (head == tail) tail = (tail + 1) % BUFFER_SIZE;
}

Line by line:

  • Capture timestamp (kernel monotonic time, unaffected by NTP)
  • Store temperature
  • Advance head with wraparound
  • If head catches tail (buffer full), advance tail to drop oldest sample

This runs in microseconds. No locks (a single kernel thread for now, but you'd add a spinlock for real devices).

Reading Via Character Device

static ssize_t temp_read(struct file *filp, char __user *buf,
                        size_t len, loff_t *off){
    char temp_data[64];
    int temp = 45 + (get_random_u32() % 30);  // Simulate temp
    int bytes;

    if (*off > 0) return 0;  // Already read, return EOF
    
    add_sample(temp);  // Add to ring buffer
    
    bytes = snprintf(temp_data, sizeof(temp_data),
                "Temperature: %d C [Buffered: %d samples]\n",
                temp, (head - tail + BUFFER_SIZE) % BUFFER_SIZE);
    if ((size_t)bytes > len) bytes = len;  // Don't overrun the caller's buffer
    
    if (copy_to_user(buf, temp_data, bytes)){
        return -EFAULT;  // Copy failed
    }
    
    *off += bytes;
    return bytes;
}

Each read() call:

  1. Adds one temperature sample to the ring buffer
  2. Reports current temp + buffer occupancy
  3. Returns formatted string to userspace

The (head - tail + BUFFER_SIZE) % BUFFER_SIZE formula handles wraparound. If head=10, tail=5, that's 5 samples. If head=5, tail=10 (wrapped), that's 1019 samples (5 - 10 + 1024 = 1019). Adding BUFFER_SIZE before the modulo keeps the result non-negative.
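The "+ BUFFER_SIZE" term matters because C's % operator keeps the sign of the left operand. A small sketch of both variants, with hypothetical helper names occupancy and occupancy_naive:

```c
#include <assert.h>

/* Number of buffered samples for a ring of `size` slots. */
static int occupancy(int head, int tail, int size) {
    assert(size > 0 && head >= 0 && head < size && tail >= 0 && tail < size);
    return (head - tail + size) % size;
}

/* The same formula without "+ size", to show what goes wrong. */
static int occupancy_naive(int head, int tail, int size) {
    return (head - tail) % size;
}
```

With size 1024, occupancy(10, 5, 1024) is 5 and occupancy(5, 10, 1024) is 1019, while occupancy_naive(5, 10, 1024) is -5, since C truncates integer division toward zero.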


ioctl() – Kernel Commands from Userspace

Our character device is read-only so far. But you need to configure the kernel module: get sample count, clear the buffer, set thresholds. Enter ioctl() (input/output control), a syscall for device-specific commands.

static long temp_ioctl(struct file *filp, unsigned int cmd, unsigned long arg){
    int count;

    switch(cmd){
        case 0:  // Get sample count
            count = (head - tail + BUFFER_SIZE) % BUFFER_SIZE;
            return count;
        
        case 1:  // Clear buffer
            head = tail = 0;
            return 0;
        
        default:
            return -EINVAL;  // Invalid command
    }
}

How to use from userspace:

# Get sample count
ioctl_request 0  # Returns # of samples in buffer

# Clear buffer
ioctl_request 1  # Resets head/tail
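From C, the two commands could be issued as sketched below. This assumes ioctl_request above stands for a small helper program. The macro names TEMPMON_GET_COUNT and TEMPMON_CLEAR and the function tempmon_command are hypothetical; only the command numbers 0 and 1 come from temp_ioctl().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Command numbers taken from the switch in temp_ioctl(). */
#define TEMPMON_GET_COUNT 0UL  /* hypothetical name for case 0 */
#define TEMPMON_CLEAR     1UL  /* hypothetical name for case 1 */

/* Open the device, issue one command, close it again.
 * Returns the driver's return value, or -1 if open() or ioctl() fails. */
static int tempmon_command(const char *path, unsigned long cmd) {
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;  /* device missing or no permission */

    int ret = ioctl(fd, cmd);
    close(fd);
    return ret;
}
```

With the module loaded, tempmon_command("/dev/tempmon", TEMPMON_GET_COUNT) would return the sample count and TEMPMON_CLEAR would reset the buffer. Production drivers usually build command numbers with the _IO() family of macros rather than bare integers, to avoid collisions with other drivers.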

In the file_operations struct, register this:

static struct file_operations fops = {
    .owner = THIS_MODULE,
    .read = temp_read,
    .unlocked_ioctl = temp_ioctl,  // <-- Add this
};

Module Initialization & Cleanup

Registration is the same as before:

static int __init temp_init(void){
    alloc_chrdev_region(&dev_num, 0, 1, DEVICE_NAME);
    cdev_init(&temp_cdev, &fops);
    cdev_add(&temp_cdev, dev_num, 1);

    temp_class = class_create(DEVICE_NAME);
    device_create(temp_class, NULL, dev_num, NULL, DEVICE_NAME);

    printk(KERN_INFO "Temp Monitor: Device with ring buffer created at /dev/%s\n", DEVICE_NAME);
    return 0;
}

static void __exit temp_exit(void){
    device_destroy(temp_class, dev_num);
    class_destroy(temp_class);
    cdev_del(&temp_cdev);
    unregister_chrdev_region(dev_num, 1);
    printk(KERN_INFO "TempMonitor: Device removed\n");
}

Building & Testing

make
sudo insmod temp_monitor_character_device_ring_buffer.ko

# Read multiple times, see buffer fill up
for i in {1..10}; do cat /dev/tempmon; sleep 0.1; done

# Output: Temperature: X C [Buffered: N samples]

Each read adds a sample. After 10 reads, you see [Buffered: 10 samples].


Why This Matters

Ring buffers are invisible infrastructure in production systems:

  • Prometheus uses them for metrics (fixed-size circular storage)
  • NVIDIA driver telemetry uses them for thermal events
  • Any telemetry system needs them to handle burst capture without malloc/free

This module demonstrates:

  • Time-series data collection in kernel space
  • Efficient wraparound logic (no allocations)
  • ioctl() for device-specific control
  • Understanding of kernel constraints (fixed memory, no sleep)

Next: Userspace Library Wrapper

Character devices work, but manually managing /dev/tempmon from userspace is tedious. Next post: build a C library (libtempmon) that:

  • Opens/closes the device
  • Reads current + buffered samples
  • Handles ioctl() commands
  • Exposes a clean API: tempmon_get_samples(), tempmon_clear_buffer()

Then: Python bindings, anomaly detection, CUDA acceleration.


Code Repository

NVTherm on GitHub

Branch: main
kernel_module/temp_monitor_character_device_ring_buffer.c



Thursday, January 8, 2026

Exposing Kernel Data to Userspace: Character Devices Explained

 


I am a coffee snob and my FSE friend always complained about the bad coffee at the customer sites. Anyhow, like I was saying yesterday, kernelspace is not developer friendly, as it lacks the basic decencies of modern developer tooling. So you need some way of exposing this kernel data to userspace, and that's where character devices come in very handy.

When I discovered character devices for the very first time, it was like kingdom come. Now I had a somewhat direct way of getting at the kernelspace data without too much drama. And here is what my humble tempmon looked like with the addition of the character device. This still won't win me any Turing awards, but that's beside the point.

In this quick demo, we will fake some temperature data from the GPU (reminder: I am on Ubuntu 22.04 with an RTX 2060 Super. YMMV) and magically bridge it over to userspace.


#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/random.h>

#define DEVICE_NAME "tempmon"


So let us take a moment to familiarize ourselves with the headers. 

  • <linux/module.h> Required for any kernel module
  • <linux/fs.h> Provides the file operations (read/write/open) for files
  • <linux/uaccess.h> Copies data between kernel and userspace. This is the "💎"
  • <linux/cdev.h> Registers the character device we create
  • <linux/device.h> Creates the /dev/ entry automatically



static dev_t dev_num;
static struct cdev temp_cdev;
static struct class *temp_class;

Now the globals are:

  • dev_num is the device ID
  • temp_cdev is the character device that we create
  • temp_class is the device class that tells the kernel to create the /dev/tempmon device

static ssize_t temp_read(struct file *filp, char __user *buf,
                        size_t len, loff_t *off){
    char temp_data[64];
    int temp_celsius = 45 + (get_random_u32() % 30);  // Fake reading
    int bytes;

    if (*off > 0) return 0;  // Already read once: EOF
    bytes = snprintf(temp_data, sizeof(temp_data),
                "Temperature: %d C\n", temp_celsius);
    if ((size_t)bytes > len) bytes = len;  // Don't overrun the caller's buffer
    if (copy_to_user(buf, temp_data, bytes)){
        return -EFAULT;
    }
    *off += bytes;
    return bytes;
}

Here is a line-by-line explanation of what this function does.

temp_read() function: This runs when userspace does cat /dev/tempmon

  • if (*off > 0) return 0 - Already read once? Return EOF so cat stops
  • Generate a fake temp in the range 45-74°C
  • snprintf() - Format the string in a kernel buffer
  • copy_to_user() - The kernel can't write directly to userspace memory. This safely copies across the boundary
  • Update *off so the next read returns EOF
  • Return the number of bytes written

static struct file_operations fops = {
    .owner = THIS_MODULE,
    .read = temp_read,
};

file_operations:

Tells the kernel "when someone reads this device, call temp_read()". That's the bridge between the userspace read() syscall and your kernel function.
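To see why cat stops after one line, the read path can be modeled in userspace. This is a sketch only: model_read and cat_like are hypothetical names, memcpy() stands in for copy_to_user(), the temperature is passed in rather than randomized, and the clamp to len is a safety addition in this model.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for temp_read(). */
static long model_read(int temp, char *buf, size_t len, long *off) {
    char temp_data[64];
    int bytes;

    if (*off > 0)
        return 0;              /* already read once: report EOF */

    bytes = snprintf(temp_data, sizeof(temp_data), "Temperature: %d C\n", temp);
    if ((size_t)bytes > len)
        bytes = (int)len;      /* never write past the caller's buffer */
    memcpy(buf, temp_data, (size_t)bytes);

    *off += bytes;             /* makes the next call return 0 */
    return bytes;
}

static char cat_output[128];   /* everything the cat-like loop collected */

/* Keep reading until EOF, the way cat does.
 * Returns how many read calls were made. */
static int cat_like(int temp) {
    char chunk[32];
    long off = 0, n;
    size_t total = 0;
    int calls = 0;

    do {
        n = model_read(temp, chunk, sizeof(chunk), &off);
        calls++;
        assert(total + (size_t)n < sizeof(cat_output));
        memcpy(cat_output + total, chunk, (size_t)n);
        total += (size_t)n;
    } while (n > 0);

    cat_output[total] = '\0';
    return calls;
}
```

The loop makes exactly two calls: the first returns the formatted line and the second returns 0. Without the *off check, cat would print temperatures forever.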

    static int __init temp_init(void){
        alloc_chrdev_region(&dev_num, 0, 1, DEVICE_NAME);
        cdev_init(&temp_cdev, &fops);
        cdev_add(&temp_cdev, dev_num, 1);
    
        temp_class = class_create(DEVICE_NAME);
        device_create(temp_class, NULL, dev_num, NULL, DEVICE_NAME);
    
        printk(KERN_INFO "Temp Monitor: Device created at /dev/%s\n", DEVICE_NAME);
        return 0;
    }

    temp_init(): Runs when you insmod. Creates the device in 4 steps: 

    •  alloc_chrdev_region() - Get a device number from kernel 
    • cdev_init() + cdev_add() - Register your read function 
    • class_create() - Tell kernel this is a device class 
    • device_create() - Actually make /dev/tempmon appear


    static void __exit temp_exit(void){
        device_destroy(temp_class, dev_num);
        class_destroy(temp_class);
        cdev_del(&temp_cdev);
        unregister_chrdev_region(dev_num, 1);
        printk(KERN_INFO "TempMonitor: Device removed\n");
    }
    

    temp_exit(): Runs when you rmmod. Undoes everything in reverse order (always cleanup in reverse).



    There we go, fruits of our labor. Kernelspace event being logged over to the userspace.

    In tomorrow's installment, let's embellish this code even more so it can continuously "log" temperature to a ring buffer that we can cat.

    The code is hosted here:

    Character Device Code




    Wednesday, January 7, 2026

    Built my first kernel module - printk() isn't printf()



    At my last company, Field Support Engineers debugged broken instruments at 3am in hospitals using one tool: system logs. No shell access, no debugger - just dmesg output from whatever the kernel captured.

    This is akin to searching for a needle in the wrong haystack. Kernel space is not user space with fewer features; it’s a different execution model entirely. There is no libc, no stdout, no file descriptors, and no guarantee you’re running in process context. printf() relies on user-space abstractions that don’t exist past the syscall boundary. In the kernel, there is nothing to “print to,” and no application to recover if something goes wrong. A wrong assumption here crashes the entire system. That’s why printf() cannot exist in kernel context, and why printk() exists at all: it provides minimal visibility without pretending the kernel is a safe place to make mistakes. GPU drivers live here. NIC drivers live here. Bugs here crash entire systems.

    +-----------------------------+
    |        User Space           |
    |                             |
    |  Applications / Scripts     |
    |  (printf, files, sockets)   |
    |                             |
    +--------------+--------------+
                   |
                   |  syscalls (read, write, ioctl)
                   v
    +--------------+--------------+
    |        Kernel Space         |
    |                             |
    |  VFS / Scheduler / MM       |
    |                             |
    |  Your Kernel Module         |
    |  (tempmon.ko)               |
    |      printk()               |
    |      hardware access        |
    |                             |
    |  GPU / NIC Drivers          |
    |  (nvidia.ko, mlx5, etc.)    |
    |                             |
    +--------------+--------------+
                   |
                   v
    +--------------+--------------+
    |        Hardware             |
    |                             |
    |  CPU / GPU / Sensors        |
    |  PCIe / Memory / IRQs       |
    |                             |
    +-----------------------------+
    

    printk() writes into a fixed-size kernel ring buffer, not to a terminal. Messages are tagged with log levels like KERN_INFO or KERN_ERR, and the current console log level decides whether a message is also forwarded to the console or only kept in the buffer. The buffer can wrap under load, so older messages are overwritten by newer ones. Code that logs through the rate-limited variants (for example printk_ratelimited()) suppresses repeated messages to avoid flooding the system, which means identical logs may be missing. As a result, logs can appear out of order, show up late, or seem to disappear entirely, especially during crashes, high interrupt activity, or early boot, because logging is best-effort, not guaranteed delivery.

     Kernel code can run in process context or interrupt context, and printk() behaves differently in each. In interrupt context there is no sleeping, limited buffering, and higher priority execution, while process context can be preempted or delayed. Because log messages are buffered and flushed asynchronously, preemption and concurrent CPUs can cause messages to interleave or appear reordered. During thermal events, GPU hangs, or PCIe link flaps, the system may be saturated with interrupts or stuck in a fault path, delaying or dropping log output entirely. This is why GPU driver logs are sometimes misleading during hard lockups: the failure happened, but the system never reached a safe point to emit or flush the messages that would have explained it.

     That's why I'm building kernel modules now. I'm targeting GPU-adjacent roles and need to understand thermal monitoring where it actually happens: kernel space. Today's module does nothing useful yet - just loads, prints to kernel log, unloads. But getting it working taught me something I somehow missed in 15 years of embedded work: why printk() exists at all.

     And before we get started, let us ensure that we have our "target debug system" ready. I am on Ubuntu 22.04 running on an Intel Core i7-9700F, 16GB memory, NVIDIA GeForce RTX 2060 SUPER. The instructions for NVIDIA edge compute (Orins and Jetsons) might vary ever so slightly. Please check the official NVIDIA documentation.

    # Install driver
    sudo ubuntu-drivers autoinstall
    sudo reboot
    
    # Verify GPU
    nvidia-smi
    
    # Install CUDA + kernel tools
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    sudo apt install cuda-toolkit-12-6 build-essential linux-headers-$(uname -r)
    
    # Set paths
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc
    
    # Test
    nvcc --version
    

     

     And here is that code. In all its Guts and Glory. Well, this is how it started at least.

     

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>
    
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Ananth");
    MODULE_DESCRIPTION("GPU Thermal Monitor");
    
    static int __init temp_init(void){
        printk(KERN_INFO "Tempmonitor: Module loaded\n");
        return 0;
    }
    
    static void __exit temp_exit(void){
        printk(KERN_INFO "Tempmonitor: Module unloaded\n");
    }
    
    module_init(temp_init);
    module_exit(temp_exit);
    

    In kernel space: no libc, no stdio, no printf(). The printk() call dumps to a ring buffer that feeds dmesg - which is where GPU drivers actually log. Spent 20 minutes confused why nothing printed until I realized I was checking terminal output instead of dmesg | tail. This is where real hardware debugging happens. GPU hangs, thermal throttling, PCIe issues - they all surface here first.

    This module is just the starting point. Next, I'll show how to expose kernel data to userspace via character devices, build a ring buffer for time-series data, and process it on GPU using CUDA. Along the way, you'll see why kernel-space logging matters for GPU thermal monitoring and field debugging.

    Next: character devices.

