Hardcore Troubleshooting: How I Caught That "Gone 3 Milliseconds" in the Linux Kernel?

Foreword

In Real-Time Linux optimization, cyclictest serves as our "thermometer," indicating when the system is "sick" (high latency), but it never reveals the "source of the illness." When cyclictest reports a 5ms Max Latency, do you investigate the driver code? Question the scheduling strategy? Or suspect the underlying firmware (ATF) is causing trouble?

Often we are dealing with a black box. To understand the source of the latency, instead of fixing a bug I did the opposite: I wrote kernel modules that actively inject faults. By reproducing four classic scenarios (hard interrupt storms, priority starvation, kernel lock contention, and hardware SMI) and pairing them with Ftrace's microscopic analysis, I distilled a general two-phase troubleshooting methodology. Master it, and even the most complex system latency has nowhere to hide.

Before getting to the point, let's first understand how cyclictest measures latency:

cyclictest creates a specified number of real-time threads at a given priority. Each thread arms a timer to wake itself up periodically. The latency we measure is the time from the timer expiry (which raises an interrupt whose ISR calls wake_up_process() to wake the real-time thread) to the moment the thread actually starts running. After the thread resumes, it reads the current system time and subtracts the sleep start time plus the intended sleep duration to obtain the latency.

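To make this concrete, here is a minimal userland sketch of that measurement loop, assuming a 1000us period like cyclictest's default. It illustrates the idea only and is not the actual rt-tests source:

/* Minimal sketch of the cyclictest measurement loop (illustration only;
 * the real tool lives in rt-tests and does much more). */
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL
#define INTERVAL_NS  1000000LL          /* 1000us period, like the -i default */

static long long ts_to_ns(const struct timespec *ts)
{
        return (long long)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

int main(void)
{
        struct timespec next, now;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                /* Arm the wake-up: absolute deadline = previous deadline + period */
                next.tv_nsec += INTERVAL_NS;
                while (next.tv_nsec >= NSEC_PER_SEC) {
                        next.tv_nsec -= NSEC_PER_SEC;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

                /* latency = time we actually ran minus time we asked to run */
                clock_gettime(CLOCK_MONOTONIC, &now);
                printf("latency: %lld ns\n", ts_to_ns(&now) - ts_to_ns(&next));
        }
        return 0;
}

The real tool additionally sets SCHED_FIFO at the requested priority, which is what -p 80 does.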

So, what are the causes of high latency?

  1. Hard-interrupt interference or disabled interrupts. The CPU is executing an interrupt handler (which implicitly disables interrupts) at the moment cyclictest should be woken, or some code has explicitly disabled interrupts, resulting in high latency.
  2. Scheduling delays. cyclictest has been woken (state changed to R), but higher-priority threads sit in the runqueue, so cyclictest cannot get the CPU.
  3. Disabled preemption. The timer interrupt can fire when cyclictest should be woken, but the scheduler cannot switch to the cyclictest process because preemption is off.
  4. Hardware black holes. On ARM64, the kernel runs at EL1, virtualization at EL2, and secure firmware at EL3. When execution gets stuck in EL2 or EL3, the Linux kernel stops running completely and is not even aware of it.

In order to understand these scenarios more deeply, we reproduce them in practice.

Hard-interrupt interference / disabled interrupts

First clone the project:

> git clone https://github.com/hlleng/rt_test_noise
> cd irq_noise && make
> sudo insmod irq_noise.ko
> sudo ./trace_irq_noise.sh

If some trace events are missing, enable them in your kernel config. Then use cyclictest to measure the latency:

> for i in 0 1 2 3; do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
> sudo cyclictest -t 1 -a 0 -p 80 -D 5m -b 3000 --tracemark # if missing, install rt-tests
> echo 3000 | taskset -c 0 tee /proc/inject_irq_latency

Step-by-step explanation of the three commands:

First command: switch the cpufreq governor to performance to prevent frequency-scaling interference, adjusting the CPU count to your hardware. With the default ondemand governor, the frequency ramps up as CPU utilization rises, which adds latency.

Second command: use cyclictest to measure the latency:

  • -t 1 -a 0: To reduce the amount of information, start only one thread, bound to CPU0.
  • -p 80: set thread priority to 80, policy is SCHED_FIFO
  • -D 5m: test for 5 minutes
  • -b 3000 --tracemark: stop ftrace sampling and the cyclictest run when latency exceeds 3000us, and write "tracing_mark_write: hit latency threshold" into the trace log

If -i is not specified, the default is to wake up the thread every 1000us.

Third command: write to the irq_noise module to inject a 3000us delay. If this does not trigger the cyclictest break, write a few more times; the reason is explained below.
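How does irq_noise create this delay? The trace below shows an hrtimer callback named timer_callback [irq_noise] expiring in hard-IRQ context, so conceptually the module looks roughly like the following sketch. This is a hedged reconstruction for illustration; the real source is in the repository:

/* Hedged sketch of an irq_noise-like module; illustrative only, see the
 * repo for the real source. On a non-RT kernel the hrtimer callback
 * runs in hard-IRQ context, so a busy-wait inside it blocks everything
 * on that CPU, including the wake-up of cyclictest. */
#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/delay.h>

static struct hrtimer noise_timer;
static unsigned long busy_us = 3000;   /* the real module takes this from procfs */

static enum hrtimer_restart timer_callback(struct hrtimer *t)
{
        unsigned long i;

        /* Busy-wait in hard-IRQ context: interrupts on this CPU stay
         * masked, so the cyclictest wake-up is deferred until we return. */
        for (i = 0; i < busy_us; i++)
                udelay(1);
        return HRTIMER_NORESTART;
}

static int __init noise_init(void)
{
        hrtimer_init(&noise_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        noise_timer.function = timer_callback;
        /* Armed immediately here for simplicity; the real module arms it
         * when userland writes the delay to /proc/inject_irq_latency. */
        hrtimer_start(&noise_timer, ms_to_ktime(100), HRTIMER_MODE_REL);
        return 0;
}

static void __exit noise_exit(void)
{
        hrtimer_cancel(&noise_timer);
}

module_init(noise_init);
module_exit(noise_exit);
MODULE_LICENSE("GPL");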

If everything goes well, the following log will appear, indicating that the delay exceeded 3000us:

> sudo cyclictest -t 1 -a 0 -p 80 -D 5m -b 3000 --tracemark
# /dev/cpu_dma_latency set to 0us
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
policy: fifo: loadavg: 0.73 0.91 0.53 2/364 8160

T: 0 ( 8148) P:80 I:1000 C:   6003 Min:     30 Act:   49 Avg:   45 Max:    2838
# Thread Ids: 08148
# Break thread: 8148
# Break value: 3157

We export the trace to a txt file for easier reading (run from the tracing directory, e.g. /sys/kernel/debug/tracing):

> cat trace > /tmp/trace_irq.txt

Next, let's interpret a few of the trace lines.

             tee-8160    [000] dNh..   856.208920: sched_wakeup: comm=cyclictest pid=8148 prio=19 target_cpu=000
             tee-8160    [000] dNh..   856.208923: hrtimer_expire_exit: hrtimer=000000005c8f967e
             tee-8160    [000] dNh..   856.208926: irq_handler_exit: irq=23 ret=handled
             tee-8160    [000] d....   856.208954: sched_switch: prev_comm=tee prev_pid=8160 prev_prio=120 prev_state=R ==> next_comm=cyclictest next_pid=8148 next_prio=19
      cyclictest-8148    [000] d....   856.208990: sched_switch: prev_comm=cyclictest prev_pid=8148 prev_prio=19 prev_state=S ==> next_comm=tee next_pid=8160 next_prio=120
             tee-8160    [000] d.h..   856.209894: irq_handler_entry: irq=23 name=arch_timer
             tee-8160    [000] d.h..   856.209898: hrtimer_expire_entry: hrtimer=0000000086669db1 function=timer_callback [irq_noise] now=856211171021
             tee-8160    [000] d.h..   856.212940: hrtimer_expire_exit: hrtimer=0000000086669db1
             tee-8160    [000] d.h..   856.212944: hrtimer_expire_entry: hrtimer=000000005c8f967e function=hrtimer_wakeup now=856214218069
             tee-8160    [000] dNh..   856.212957: sched_wakeup: comm=cyclictest pid=8148 prio=19 target_cpu=000
             tee-8160    [000] dNh..   856.212959: hrtimer_expire_exit: hrtimer=000000005c8f967e
             tee-8160    [000] dNh..   856.212961: hrtimer_expire_entry: hrtimer=00000000c5b1b039 function=tick_sched_timer now=856214218069
             tee-8160    [000] dNh..   856.212991: hrtimer_expire_exit: hrtimer=00000000c5b1b039
             tee-8160    [000] dNh..   856.212994: hrtimer_expire_entry: hrtimer=000000009ea9d197 function=hrtimer_wakeup now=856214218069
             tee-8160    [000] dNh..   856.213008: sched_wakeup: comm=cyclictest pid=8147 prio=120 target_cpu=000
             tee-8160    [000] dNh..   856.213010: hrtimer_expire_exit: hrtimer=000000009ea9d197
             tee-8160    [000] dNh..   856.213012: irq_handler_exit: irq=23 ret=handled
             tee-8160    [000] d....   856.213036: sched_switch: prev_comm=tee prev_pid=8160 prev_prio=120 prev_state=R ==> next_comm=cyclictest next_pid=8148 next_prio=19
      cyclictest-8148    [000] .....   856.213140: tracing_mark_write: hit latency threshold (3157 > 3000)

sched_wakeup fires when a process (or thread) is woken from a blocking state and added to a runqueue.

sched_switch fires when the scheduler decides to switch the CPU from one process (prev) to another (next), and the next process starts executing on the CPU.

The flag field after the CPU number (e.g. dNh..) is also worth reading: d means interrupts are disabled, N means the need-resched flag is set, h means hard-IRQ context, and the fourth position holds the preempt depth.

As you can see from the last line of the log, the latency of thread cyclictest-8148 exceeded 3000us, so we search the log for the keyword cyclictest-8148.

856.212957: sched_wakeup added cyclictest-8148 to the runqueue.

856.213036: sched_switch actually starts executing the cyclictest-8148 thread.

The difference between the two is 79us, a reasonable value.

Now trace back to the previous cyclictest-8148 wakeup.

856.208920: sched_wakeup added cyclictest-8148 to the runqueue. Since the interval is 1ms (1000us), the next wakeup should theoretically be around 856.209920.

Note that at 856.209898, just as cyclictest is about to wake up (~22us away), the irq_noise module fires, and the CPU enters hard-interrupt context (d.h..) and starts executing the busy-wait loop.

Duration of irq_noise: 856.209898 (hrtimer_expire_entry) → 856.212940 (hrtimer_expire_exit).

The difference is 3042us, close to the 3000us we wrote from userland. Since the callback is a pure busy-wait, nothing else runs between entry and exit.

If the above text description is hard to read, take a look at the flowchart below:

[ Previous Sleep sched_switch ] (Timestamp: 856.208990)
         |
         | (cyclictest sleeps, expected wake-up in 1ms)
         v
         .
         . Time Passes ...
         .
         |
(Just before theoretical wake-up 856.209894) -> [ irq_handler_entry (Hardware Timer Interrupt Triggered) ]
         |
(856.209898) -> [ hrtimer_expire_entry (irq_noise callback starts) ]
         |
         v
+=========================================================================+
|  *** SEVERE BLOCKING REGION (Hard IRQ Context) ***                      |
|                                                                         |
|  [ Theoretical Wake-up Point T_expected (~856.2099xx) ] <-- Should have woken up here |
|            (Wake-up deferred due to high-priority busy-wait below)      |
|                                                                         |
|  .....................................................................  |
|  .                                                                   .  |
|  .           [ IRQ NOISE Hogging CPU (Busy-wait loop) ]              .  |
|  .                                                                   .  |
|  .                  Duration: 3042 us                                .  |
|  .                                                                   .  |
|  .....................................................................  |
|                                                                         |
+=========================================================================+
         |
(856.212940) -> [ hrtimer_expire_exit (irq_noise callback ends, CPU released) ]
         |
         v
(856.212957) -> [ Actual Wake-up sched_wakeup ] (cyclictest added to runqueue)
         |
         | (Scheduling Delay: 79us)
         |
(856.213036) -> [ Actual Execution sched_switch ] (cyclictest starts running, high latency detected)

Why doesn't the trigger fire 100% of the time? Because the injected delay is 3000us, the alarm threshold (-b) is also 3000us, and the cyclictest period is 1000us; this creates a phase-alignment problem.

Take a simple example:

  • cyclictest wakes at T=1000, runs until T=1001, then goes to sleep, planning its next wake-up at T=2001.
  • Shortly after T=1001, the injected 3000us interrupt starts, covering the T=2001 wake-up point.
  • cyclictest only wakes when the interrupt ends, at T=4002.
  • Computed latency: 4002 (actual) - 2001 (scheduled) = 2001us.
  • Result: only about 2ms of latency, well below the 3000us threshold. Miss!

If you want it to trigger 100% of the time, inject 4500us: the worst case is then 4.5ms - 1ms = 3.5ms, which always exceeds 3000us.
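The arithmetic can be swept over every possible phase with a few lines of C (an illustration of the reasoning above, nothing more):

/* Sketch: sweep the injected interrupt's start phase within one 1000us
 * period and compute the latency seen by the first blocked wake-up. */
#include <stdio.h>

int main(void)
{
        const int period = 1000, inject = 3000, threshold = 3000;

        for (int phase = 0; phase < period; phase += 100) {
                /* the interrupt occupies [phase, phase + inject) */
                int first_blocked = ((phase + period - 1) / period) * period;
                int latency = phase + inject - first_blocked;

                printf("phase=%4dus -> latency=%4dus %s\n", phase, latency,
                       latency > threshold ? "TRIGGER" : "miss");
        }
        return 0;
}

Every phase yields at most 3000us on paper; the 3157us break we actually saw came from interrupt-handling and scheduling overhead on top of the injected delay, which is why several writes may be needed before one crosses the threshold.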

Scheduling Latency / Preemption Latency

Next, let's look at scenario 2: scheduling delay. This scenario is easy to understand: when cyclictest's timer expires, the kernel calls sched_wakeup to add cyclictest to the runqueue, but higher-priority threads sit ahead of it, and they must run before cyclictest gets its turn.

[ Previous Sleep sched_switch ] (cyclictest yields CPU and sleeps)
         |
         | (cyclictest is sleeping...)
         | (Meanwhile, other tasks are running on the CPU)
         v
         .
         . Time Passes ...
         .
         |
(Theoretical Wake-up Point T_expected arrives) -> [ Timer Interrupt Fires ] (Kernel enters interrupt handler)
         |
         v
(In Interrupt Context) -> [ Core Event: sched_wakeup (cyclictest) ]
         |
         |  Note: cyclictest state changes from SLEEPING to RUNNABLE.
         |        It is added to the CPU's runqueue, **READY** to execute.
         |
         v
+=========================================================================+
|  *** Scheduling Blocked Region ***                                      |
|                                                                         |
|  [ The Conflict ]: The CPU does NOT immediately switch to cyclictest.   |
|                                                                         |
|  .....................................................................  |
|  .                                                                   .  |
|  .   [ Obstacle: A Higher Priority Task is currently RUNNING ]       .  |
|  .   (e.g., Another RT process, or a kernel thread like migration/0) .  |
|  .                                                                   .  |
|  .   -------------------------------------------------------------   .  |
|  .   >> cyclictest State: RUNNABLE                                   .  |
|  .   >> cyclictest Location: Waiting in the Runqueue for the CPU     .  |
|  .   -------------------------------------------------------------   .  |
|  .                                                                   .  |
|  .....................................................................  |
|                                                                         |
|  (This waiting period constitutes the main scheduling latency)          |
+=========================================================================+
         |
         | (The Higher Priority Task finally finishes or yields the CPU)
         v
[ Scheduler runs ] (Picks the next highest priority runnable task)
         |
         v
[ Core Event: sched_switch ] (Context Switch: Prev_High_Prio ==> Next_cyclictest)
         |
         v
(cyclictest finally obtains the CPU, starts execution, and reports latency)

We use stress-ng to simulate this:

# stress-ng uses the default SCHED_OTHER (normal process)
> sudo taskset -c 0 stress-ng --cpu 1 --timeout 600s &

# run cyclictest
> sudo cyclictest -t 1 -a 0 -p 80 -D 5m -b 3000
policy: fifo: loadavg: 0.17 0.17 0.10 1/301 1695

T: 0 ( 1498) P:80 I:1000 C: 299994 Min:     11 Act:   53 Avg:   31 Max:     698

This is a fairly ordinary figure, since the kernel is not patched with PREEMPT_RT.

Now let's see what happens when stress-ng runs at a higher priority than cyclictest.

# stress-ng uses SCHED_FIFO with a priority higher than cyclictest.
> sudo taskset -c 0 stress-ng --cpu 1 --sched fifo --sched-prio 90 --timeout 600s &

> sudo cyclictest -t 1 -a 0 -p 80 -D 5m -b 3000 --tracemark

As you can see, cyclictest can barely output any data because the CPU is monopolized by stress-ng; cyclictest cannot run until the stress-ng process times out.
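For reference, the same starvation can be reproduced without stress-ng. Below is a minimal SCHED_FIFO CPU hog (an illustrative sketch; run it as root and pin it with taskset -c 0):

/* Minimal SCHED_FIFO CPU hog, equivalent in spirit to the stress-ng
 * command above. Run as root: taskset -c 0 ./hog */
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 90 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
                perror("sched_setscheduler");
                return 1;
        }
        for (;;)
                ;       /* busy loop: starves every lower-priority task on this CPU */
        return 0;
}

Note that on a stock kernel, RT throttling (/proc/sys/kernel/sched_rt_runtime_us, 950000us out of every 1000000us by default) still reserves about 5% of CPU time for non-RT tasks, which is why the system stays alive at all under such a hog.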

Turning off preemption

Next, let's take a look at the third scenario. From userland, scenarios 2 and 3 look almost identical: both leave the CPU in an "unschedulable" state, and cyclictest cannot get the CPU even though its timer has expired and its priority is high.

However, from a kernel implementation and troubleshooting perspective, there is a subtle but crucial difference. We can call this "same death, different killers".

What is the difference? At the kernel level, we can understand it through a "privilege pyramid": the higher the level, the more it can interrupt the execution of the levels below.

             /  \
            /    \
           / Level 1\   <-- Hard IRQ / Interrupts Disabled
          /(Highest  \      (Nobody can interrupt, causes hard IRQ latency)
         / Privilege) \
        /______________\
       /                \
      /      Level 2     \  <-- Preempt Off / Spinlocks held
     /(Scheduler Disabled)\     (Interrupts can enter, but tasks cannot switch)
    /______________________\
   /                        \
  /         Level 3          \ <-- Normal Processes / RT Processes
 /    (Fully Preemptible)     \    (Subject to priority scheduling)
/______________________________\

If the rt_test_noise repository is already cloned, run:

> for i in 0 1 2 3; do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
> sudo cyclictest -t 1 -a 0 -p 80 -D 5m -b 3000 --tracemark
> sudo ./trace_preemptirqoff.sh
> echo 3500 | taskset -c 0 tee /proc/inject_preempt_latency

Check the maximum latency:

> cat tracing_max_latency
3507

As you can see, the maximum latency is 3507us, not far from the 3500 we wrote.
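Judging from the trace below (a kernel thread whose thread_fn calls __udelay with the preempt count raised), the injector presumably looks something like this sketch; again a hedged reconstruction, not the repository's actual code:

/* Hedged reconstruction of a preempt_noise-like module; illustrative
 * only. A kernel thread raises the preempt count and busy-waits:
 * interrupts still fire (the wake-up IS recorded), but no task switch
 * can happen until the count drops back to zero. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/preempt.h>

static unsigned long busy_us = 3500;   /* the real module takes this from procfs */

static int thread_fn(void *data)
{
        unsigned long i;

        preempt_disable();              /* preempt depth 0 -> 1 */
        for (i = 0; i < busy_us; i++)
                udelay(1);              /* busy-wait; IRQs remain enabled */
        preempt_enable();               /* depth back to 0: resched happens here */
        return 0;
}

static int __init noise_init(void)
{
        struct task_struct *t = kthread_create(thread_fn, NULL, "k_preempt_noise");

        if (IS_ERR(t))
                return PTR_ERR(t);
        kthread_bind(t, 0);             /* same CPU as cyclictest */
        wake_up_process(t);
        return 0;
}

static void __exit noise_exit(void)
{
        /* one-shot thread exits on its own */
}

module_init(noise_init);
module_exit(noise_exit);
MODULE_LICENSE("GPL");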

Next, look at the detailed logs. The full log is a bit long, so it is posted here: https://pastebin.com/CJfv4VT9

We divide the log into three phases: preemption off, an interlude, and preemption back on.

Phase 1: Preemption off

k_preemp-1645      0...1.    1us : preempt_count_add <-preempt_count_add
k_preemp-1645      0...1.    2us : __udelay <-thread_fn

Focus on the 0...1. flags field: the 1 is the preempt depth, which means the preemption counter is now 1 (preemption is forbidden). Preemption is possible only when the preempt depth is 0. Since we are bound to CPU0, the log shows only CPU0.

Phase 2: Interlude

At 362us, CPU0 triggers an interrupt.

k_preemp-1645      0d..1.  362us : irq_enter_rcu <-el1_interrupt
...
k_preemp-1645      0d.h1.  375us : arch_timer_handler_phys <-handle_percpu_devid_irq

We can see that a timer interrupt fired. What does it do? Look further down the log:

k_preemp-1645      0d.h1.  389us : hrtimer_wakeup <-__hrtimer_run_queues
k_preemp-1645      0d.h1.  391us : try_to_wake_up <-wake_up_process
...
k_preemp-1645      0d.h4.  411us : enqueue_task_rt <-enqueue_task

The timer interrupt realized that cyclictest should wake up, so it runs the wakeup path and puts cyclictest into the RT runqueue (enqueue_task_rt).

Key flag change: note the line at 433us:

k_preemp-1645      0dNh4.  433us : task_woken_rt <-ttwu_do_wakeup

The N (need-resched) flag is lit! N corresponds to TIF_NEED_RESCHED, which indicates that a task on the current CPU needs rescheduling. In plain English, the kernel clearly knows that a task (cyclictest) more important than the current process wants to run, and a reschedule is needed. The log contains many preempt_count_add and preempt_count_sub calls, but they do not matter by themselves; we only need to watch for the preempt depth reaching 0. As long as it is nonzero, no task switch is allowed.

Phase 3: Turning Preemption On

k_preemp-1645      0.N.1. 3506us : preempt_count_sub <-preempt_count_sub

Although many things happen while preemption is off, such as interrupts and spin_lock acquisitions, none of them can make the scheduler act until the final call to preempt_count_sub drops the preempt depth to zero.
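This is exactly the job of preempt_enable() on a CONFIG_PREEMPT kernel. Slightly simplified from the kernel's include/linux/preempt.h:

/* Slightly simplified from include/linux/preempt.h (CONFIG_PREEMPT): */
#define preempt_enable() \
do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule();   /* depth hit 0 with N pending: switch now */ \
} while (0)

preempt_count_dec_and_test() returns true only when the count drops to zero and TIF_NEED_RESCHED is set, so the switch to cyclictest happens at precisely this point.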

Based on the logs, we can draw a simple flow chart:

[k_preempt_noise]
       |
       |  <-- 1us: preempt_disable() [Preemption Disabled]
       |
       | (Busy waiting...)
       |
       +---- 362us: [Hardware Interrupt Occurs!] -----------------------+
       |                                                                |
       |    [IRQ Context]                                               |
       |    hrtimer_wakeup -> cyclictest                                |
       |    Sets Need_Resched flag (N)                                  |
       |    (Wants to schedule, but sees preemption disabled)           |
       |                                                                |
       +---- 1002us: [Interrupt Return] --------------------------------+
       |
       | (Continues busy waiting... 'N' flag remains set)
       |
       |  <-- 3505us: preempt_enable() [Preemption Enabled]
       v
[Context Switch finally occurs -> cyclictest]

Hardware black hole

Next is the last scenario: the "hardware black hole". On ARM64, the kernel runs at EL1, virtualization at EL2, and secure firmware at EL3. When the kernel executes an hvc call it traps into EL2, and when it executes an smc call it traps into EL3.
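To make the trap concrete, here is how kernel code drops into EL3 through the standard SMCCC helper (illustrative; 0x84000000 is the PSCI_VERSION function ID):

/* Illustrative: a kernel-side SMC call. While firmware (ATF) runs at
 * EL3, EL1 is frozen and ftrace records nothing until the ERET. */
#include <linux/arm-smccc.h>

static void call_into_el3(void)
{
        struct arm_smccc_res res;

        /* Control traps to EL3 and only returns to EL1 when the
         * firmware executes ERET. */
        arm_smccc_smc(0x84000000, 0, 0, 0, 0, 0, 0, 0, &res);
}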

The phenomenon: the Linux kernel stops running altogether. Clock interrupts go unanswered, scheduling stalls, and even Ftrace records nothing. Looking at the trace, there is simply nothing in the file, yet the timestamps undergo a weird time skip.

Since we can't modify Rockchip's underlying firmware (ATF) to create a real black hole, this chapter has no hands-on reproduction. Instead, a case study: while doing guest performance testing in a virtualized scenario, I found the latency was very high, and no matter how I used ftrace I could not locate its source. I suspected the guest latency was high because the CPU was trapped in long-running EL2 code.

[ Normal Linux Kernel Execution (Running at EL1) ]
(Trace Timestamp: T_point_A, e.g., 100.000000)
         |
         | (Kernel running normally, Ftrace recording active)
         v
[ Trigger Event ] (e.g., Need to call secure service or hypervisor function)
         |
(Kernel executes trap inst.) -> [ Executes SMC (Trap to EL3) or HVC (Trap to EL2) ]
         |
         v
   /---------------------------------------------------------------------\
   |  *** Entering the Hardware Black Hole ***                           |
   |  [ CPU execution control handed over to Firmware (ATF) or Hypervisor]|
   \---------------------------------------------------------------------/
         |
         v
+=========================================================================+
|  *** OS Invisible Region ***                                            |
|                                                                         |
|  [ Linux Kernel State (EL1) ]:                                          |
|  >> Completely FROZEN.                                                  |
|  >> Timer interrupts cannot be handled (blocked).                       |
|  >> Scheduler halted.                                                   |
|  >> Ftrace CANNOT record any data (Trace is SILENT).                    |
|                                                                         |
|  .....................................................................  |
|  .                                                                   .  |
|  .   [ Physical Hardware State (CPU at EL2/EL3) ]                    .  |
|  .   >> CPU is busy executing firmware or hypervisor code.           .  |
|  .   >> The physical clock keeps ticking.                            .  |
|  .   >> (This may take a long time, e.g., milliseconds or more)      .  |
|  .                                                                   .  |
|  .....................................................................  |
|                                                                         |
|  (This period appears as a "blank gap" in the trace file)               |
+=========================================================================+
         |
         v
   /---------------------------------------------------------------------\
   |  *** Exiting the Black Hole ***                                     |
   |  [ Firmware/Hypervisor finishes, executes ERET to return to Linux ] |
   \---------------------------------------------------------------------/
         |
         v
[ Linux Kernel Resumes Execution (Back to EL1) ]
(Trace Timestamp: T_point_B, e.g., 100.005000)
         |
         v
[ Phenomenon Confirmed: Weird Time Skip (Time Jump) ]
>> A massive time gap appears between two adjacent trace records.
>> Latency = T_point_B - T_point_A = 5ms (with ZERO records in between)

If you suspect a hardware black hole, Linux provides a specialized tool called hwlatdetect.

# Run Detection (Detect for 10 minutes, threshold 10us)
> sudo hwlatdetect --duration=10m --threshold=10

The principle of this tool is brute force: it spawns a kernel thread that polls the CPU hardware clock in a tight loop, and if the gap between two consecutive reads is much larger than expected, the CPU time has been "stolen". If hwlatdetect reports high latency and Ftrace's irqsoff/preemptoff tracers catch nothing, that is a genuine hardware/firmware black hole (ATF, TrustZone, or bus contention).
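The same idea fits in a few lines of userland C (illustration only; the real detector runs inside the kernel with preemption and interrupts disabled, so any gap it observes can only come from below the OS). Modern kernels also expose it as the hwlat tracer (echo hwlat > current_tracer):

/* Sketch of the hwlat idea: read the clock in a tight loop; a gap
 * between consecutive reads that exceeds the threshold means the CPU
 * was taken away from us (SMI / EL2 / EL3). In userland a gap can also
 * be ordinary preemption, which is why the real detector is in-kernel. */
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static long long now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        return (long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

int main(void)
{
        const long long threshold_ns = 10 * 1000;   /* 10us, like --threshold=10 */
        long long prev = now_ns();

        for (;;) {
                long long cur = now_ns();

                if (cur - prev > threshold_ns)
                        printf("gap: %lld ns (CPU stolen?)\n", cur - prev);
                prev = cur;
        }
        return 0;
}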

To summarize

Analyzing the four scenarios above shows that the root causes of real-time Linux latency problems center on hard-interrupt interference, scheduling delays, disabled preemption, and hardware black holes. In actual debugging, ftrace is our most powerful weapon: by combining the function_graph, irqsoff, and preemptoff tracers with analysis of the key flags, we can pinpoint the root cause of a delay.

We can capture the basic information with the following script:

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo nop > current_tracer
echo 1 > tracing_cpumask # Capture information only from CPU0.
echo 1 > events/sched/sched_switch/enable
echo 1 > events/sched/sched_wakeup/enable
echo 1 > events/irq/enable
echo 1 > events/timer/enable
echo 1 > tracing_on

Then follow this troubleshooting flow:

[Previous Wakeup] --- (After 1ms period) ---> [Theoretical Wakeup Point T_expected]
                                          |
                                          |  <--- Blind Zone: Stage 2 (Pre-Wakeup)
                                          |       "Interference" happens here
                                          |       (Hard IRQ / IRQs Disabled / SMI)
                                          |
                                          v
                                     [Actual Wakeup sched_wakeup]
                                          |
                                          |  <--- Visible Zone: Stage 1 (Post-Wakeup)
                                          |       "Congestion" happens here
                                          |       (High Priority Suppression / Preemption Disabled)
                                          |
                                          v
                                     [Actual Execution sched_switch]

There is no silver bullet for real-time optimization, but there is a scientific methodology. Going from blind guessing to pinpointing with Ftrace is not only a technical improvement but a change of mindset. Hopefully this ghost-hunting guide leaves you feeling less powerless in the face of millisecond latency.

