
[Bug] Kernel panic when resuming from a memory snapshot and using a memory balloon #5566

@maggie-lou

Describe the bug

When we resume from a memory snapshot, we occasionally see a guest kernel panic reporting "rcu_sched self-detected stall on CPU". It is accompanied by many "Failed to update balloon stats, missing descriptor." messages in the Firecracker logs, and in each case the balloon had failed to fully expand in the VM that originally saved the snapshot.

Sample logs below:

2025-11-14T19:57:22.890561666 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:23.890560783 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:24.890559859 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:25.890556822 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:26.890555789 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:27.890560534 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:28.890562374 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:29.890557855 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:30.890561168 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:31.890559363 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:32.890561514 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:33.890559609 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:34.890558585 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:35.890562199 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:36.890554795 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:37.890554513 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:38.890558974 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:39.890562452 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:40.890560853 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:41.890560426 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:42.890558045 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:43.890558900 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:44.890560365 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:45.890559767 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:46.890559430 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:47.890560194 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:48.890562080 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:49.890560181 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:50.890559703 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:51.890560297 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:52.890560471 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:53.890560294 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:54.890558845 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:55.890558898 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:56.890559522 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:57.890558979 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:58.890559948 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:57:59.890561468 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:00.890562537 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:01.890560101 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:02.890558276 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:03.890556981 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:04.890561987 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:05.890559871 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:06.890564916 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:07.890558814 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:08.890558902 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:09.890558920 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:10.890561031 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:11.890559185 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:12.890559754 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
2025-11-14T19:58:13.890559621 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.
[99996.001171] rcu: INFO: rcu_sched self-detected stall on CPU
[99996.002349] rcu: 	3-...!: (1 GPs behind) idle=36b/1/0x4000000000000000 softirq=27270/27272 fqs=560 
[99996.004245] 	(t=15803 jiffies g=30389 q=527)
[99996.005056] rcu: rcu_sched kthread starved for 13659 jiffies! g30389 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=5
[99996.006953] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[99996.008626] rcu: RCU grace-period kthread stack dump:
[99996.009557] task:rcu_sched       state:R  running task     stack:    0 pid:   12 ppid:     2 flags:0x00004000
[99996.011423] Call Trace:
[99996.011928]  __schedule+0x248/0x690
[99996.012610]  schedule+0x49/0xb0
[99996.013208]  schedule_timeout+0x7b/0xf0
[99996.013921]  ? lock_timer_base+0x90/0x90
[99996.014680]  rcu_gp_fqs_loop+0xe1/0x300
[99996.015425]  rcu_gp_kthread+0x8f/0x120
[99996.016140]  ? rcu_gp_init+0x4f0/0x4f0
[99996.016857]  kthread+0x125/0x150
[99996.017464]  ? set_kthread_struct+0x40/0x40
[99996.018247]  ret_from_fork+0x22/0x30
[99996.018963] rcu: Stack dump where RCU GP kthread last ran:
[99996.019985] Sending NMI from CPU 3 to CPUs 5:
[99996.020869] NMI backtrace for cpu 5
[99996.020878] CPU: 5 PID: 517 Comm: C2 CompilerThre Not tainted 5.15.0 #18
[99996.020884] RIP: 0010:__get_user_8+0x18/0x30
[99996.020900] Code: 31 c0 0f 01 ca c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 ba f9 ef ff ff ff 7f 00 00 48 39 d0 73 64 48 19 d2 48 21 d0 0f 01 cb <48> 8b 10 31 c0 0f 01 ca c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[99996.020901] RSP: 0018:ffffc900009fbe58 EFLAGS: 00040206
[99996.020908] RAX: 00007f1410970fe8 RBX: 00007f142bbf6117 RCX: 00007f1410970fe0
[99996.020909] RDX: ffffffffffffffff RSI: ffffc900009fbf58 RDI: 0000000000000000
[99996.020910] RBP: ffffc900009fbef0 R08: ffff888100886e88 R09: ffff8881180356e8
[99996.020910] R10: ffffc900009fbec0 R11: 0000000000000000 R12: 0000000000000000
[99996.020911] R13: ffff888102ae9580 R14: ffffc900009fbf58 R15: 0000000000000000
[99996.020914] FS:  00007f1410970640(0000) GS:ffff88820d340000(0000) knlGS:0000000000000000
[99996.020916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[99996.020916] CR2: 0000000000040005 CR3: 00000001183e8005 CR4: 0000000000370ea0
[99996.020917] Call Trace:
[99996.020919]  ? __rseq_handle_notify_resume+0x5b/0x380
[99996.020930]  ? __x64_sys_futex+0x73/0x1d0
[99996.020937]  exit_to_user_mode_prepare+0xe6/0x120
[99996.020939]  syscall_exit_to_user_mode+0x21/0x40
[99996.020942]  do_syscall_64+0x48/0x90
[99996.020946]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[99996.020948] RIP: 0033:0x7f142bbf6117
[99996.020957] Code: 18 e8 fd f8 ff ff 4c 8b 54 24 18 45 31 c0 44 89 ea 41 89 c4 8b 74 24 0c 48 8b 7c 24 10 41 b9 ff ff ff ff b8 ca 00 00 00 0f 05 <44> 89 e7 48 89 c3 e8 3e f9 ff ff e9 62 ff ff ff 66 0f 1f 84 00 00
[99996.020960] RSP: 002b:00007f141096f920 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[99996.020962] RAX: ffffffffffffff92 RBX: 0000000000000000 RCX: 00007f142bbf6117
[99996.020962] RDX: 0000000000000000 RSI: 0000000000000089 RDI: 00007f142bb2f950
[99996.020963] RBP: 00007f142bb2f928 R08: 0000000000000000 R09: 00000000ffffffff
[99996.020963] R10: 00007f141096fa40 R11: 0000000000000246 R12: 0000000000000000
[99996.020964] R13: 0000000000000000 R14: 00007f142bb2f950 R15: 00000000000075f4
[99996.021817] NMI backtrace for cpu 3
[99996.057249] CPU: 3 PID: 27 Comm: ksoftirqd/3 Not tainted 5.15.0 #18
[99996.058397] Call Trace:
[99996.058895]  <IRQ>
[99996.059299]  dump_stack_lvl+0x38/0x49
[99996.060006]  dump_stack+0x10/0x12
[99996.060628]  nmi_cpu_backtrace.cold+0x32/0x75
[99996.061432]  ? lapic_can_unplug_cpu+0x80/0x80
[99996.062259]  nmi_trigger_cpumask_backtrace+0xc1/0xd0
[99996.063228]  arch_trigger_cpumask_backtrace+0x14/0x20
[99996.064174]  rcu_dump_cpu_stacks+0xce/0x100
[99996.064978]  rcu_sched_clock_irq.cold+0x2a4/0x460
[99996.065850]  ? account_system_index_time+0x91/0xa0
[99996.066756]  update_process_times+0x8f/0xc0
[99996.067547]  tick_sched_handle+0x33/0x50
[99996.068291]  tick_sched_timer+0x83/0xb0
[99996.069022]  ? tick_nohz_handler+0xb0/0xb0
[99996.069797]  __hrtimer_run_queues+0x10c/0x1c0
[99996.070605]  hrtimer_interrupt+0xfc/0x210
[99996.071380]  __sysvec_apic_timer_interrupt+0x5a/0x70
[99996.072289]  sysvec_apic_timer_interrupt+0x6f/0x80
[99996.073194]  </IRQ>
[99996.073600]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[99996.074541] RIP: 0010:rcu_cblist_dequeue+0xd/0x20
[99996.075471] Code: 46 10 01 00 00 00 c3 48 c7 06 00 00 00 00 48 89 76 08 48 c7 46 10 00 00 00 00 c3 66 90 48 8b 07 48 85 c0 74 10 48 83 6f 10 01 <48> 8b 10 48 89 17 48 85 d2 74 01 c3 48 89 7f 08 c3 66 90 48 8b 47
[99996.078941] RSP: 0018:ffffc9000010fe00 EFLAGS: 00000206
[99996.079927] RAX: ffff888086edf2f0 RBX: 00000000000001dc RCX: 00000000801a0016
[99996.081249] RDX: 00000000801a0017 RSI: 00000000801a0016 RDI: ffffc9000010fe20
[99996.082544] RBP: ffffc9000010fe68 R08: 0000000000000001 R09: 0000000000000000
[99996.083913] R10: 0000000000000001 R11: 0000000000020200 R12: ffff88820d2de600
[99996.085230] R13: ffffc9000010fe20 R14: 00000000000001db R15: ffff88820d2de670
[99996.086528]  ? rcu_core+0x1e2/0x5e0
[99996.087224]  rcu_core_si+0x9/0x10
[99996.087853]  __do_softirq+0xb4/0x1de
[99996.088515]  run_ksoftirqd+0x19/0x30
[99996.089222]  smpboot_thread_fn+0xb5/0x150
[99996.089976]  ? __smpboot_create_thread.part.0+0x120/0x120
[99996.091004]  kthread+0x125/0x150
[99996.091629]  ? set_kthread_struct+0x40/0x40
[99996.092404]  ret_from_fork+0x22/0x30
2025-11-14T19:58:14.890563805 [21f75c8a-088d-4e8c-8bb6-e3c211061d5e:main] Failed to update balloon stats, missing descriptor.

To Reproduce

Unfortunately we don't have a clean reproduction yet, but our general workflow is:

  1. Run a workload in a VM.
  2. After execution has completed, expand the balloon to reclaim available memory using the following logic:

stats, err := c.machine.GetBalloonStats(ctx)
if err != nil {
	return err
}
// Inflate the balloon to 90% of the guest's available memory.
availableMemMB := stats.AvailableMemory / 1e6
balloonSizeMB := int64(float64(availableMemMB) * .9)
err = c.machine.UpdateBalloon(ctx, balloonSizeMB)

  3. Deflate the balloon back down to 0MB, then pause and snapshot the VM.
  4. Restore the VM from the snapshot the next time a workload comes in.
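
As a worked example of the sizing arithmetic in step 2, here is a minimal sketch. The function name `balloonTargetMB` and the 5377 MB input are hypothetical; the input was chosen only so that the result matches the 4839MB target mentioned later in this report, not taken from the actual stats.

```go
package main

import "fmt"

// balloonTargetMB mirrors the sizing logic from the reproduction steps:
// convert available bytes to decimal MB (integer division by 1e6),
// then take 90% and truncate toward zero.
// The function name is hypothetical and used only for illustration.
func balloonTargetMB(availableBytes int64) int64 {
	availableMB := availableBytes / 1e6
	return int64(float64(availableMB) * .9)
}

func main() {
	// Hypothetical reading: 5377 MB of available guest memory.
	fmt.Println(balloonTargetMB(5_377_000_000)) // prints 4839
}
```

Note that the division by 1e6 yields decimal MB, while Firecracker's balloon target is expressed in MiB (`amount_mib`), so the requested size may be somewhat more than 90% of available memory.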

Expected behaviour

In the example whose logs are included above, we ran a workload at 3:17PM.
After execution completed, we attempted to expand the balloon to 4839MB (using the logic in the reproduction steps).
The balloon expanded to 2688MB and then stopped expanding.
We then attempted to deflate the balloon to 0MB.
After a couple of seconds with no progress, we paused the VM and took a snapshot. At this point, the balloon was still at 2688MB.
The next time we tried to resume from the snapshot, at 4:10PM, the guest panicked with the stack trace above.

Environment

For context on our guest kernel version: I know that v5.10 and v6.1 are the officially supported versions.

We run some workloads on GCP, and have limited control over the host kernel. If I remember correctly, we upgraded the guest kernel from v5.10 to v5.15 because we noticed an increase in guest kernel panics with v5.10 that were resolved on v5.15.

We haven't upgraded to v6.1 because we noticed some network performance degradation.

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
  • Are you certain the bug being reported is a Firecracker issue?
