Commit 373ccbe

Michal Hocko authored and torvalds committed
mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
Tetsuo Handa has reported that the system might basically livelock in OOM condition without triggering the OOM killer.

The issue is caused by internal dependency of the direct reclaim on vmstat counter updates (via zone_reclaimable) which are performed from the workqueue context. If all the current workers get assigned to an allocation request, though, they will be looping inside the allocator trying to reclaim memory but zone_reclaimable can see stalled numbers so it will consider a zone reclaimable even though it has been scanned way too much.

WQ concurrency logic will not consider this situation as a congested workqueue because it relies that worker would have to sleep in such a situation. This also means that it doesn't try to spawn new workers or invoke the rescuer thread if the one is assigned to the queue.

In order to fix this issue we need to do two things. First we have to let wq concurrency code know that we are in trouble so we have to do a short sleep. In order to prevent from issues handled by 0e093d9 ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") we limit the sleep only to worker threads which are the ones of the interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to have a spare worker thread for it.

Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Cristopher Lameter <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Arkadiusz Miskiewicz <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 475a2f9 commit 373ccbe

2 files changed: +20 -5 lines changed

mm/backing-dev.c

Lines changed: 16 additions & 3 deletions
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
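
The branch added to wait_iff_congested() above can be read as a three-way decision. The following is a minimal userspace sketch of that decision only, not kernel code: `MODEL_PF_WQ_WORKER`, `enum wait_action` and `pick_wait_action()` are hypothetical names introduced for illustration, and the flag value is arbitrary rather than the kernel's.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's PF_WQ_WORKER bit in current->flags. */
#define MODEL_PF_WQ_WORKER 0x1u

/* What the caller ends up doing; the names are illustrative only. */
enum wait_action {
	ACTION_COND_RESCHED,	/* yield only if a reschedule is pending */
	ACTION_SHORT_SLEEP,	/* sleep briefly so the worker pool sees the worker block */
	ACTION_FULL_WAIT	/* congested: wait on the congestion queue */
};

/*
 * Mirrors the condition in the hunk above: when there is no congested BDI
 * or the zone is not marked congested, a workqueue worker takes a short
 * sleep and every other task merely offers to reschedule.
 */
static enum wait_action pick_wait_action(int nr_congested, bool zone_congested,
					 unsigned int task_flags)
{
	if (nr_congested == 0 || !zone_congested) {
		if (task_flags & MODEL_PF_WQ_WORKER)
			return ACTION_SHORT_SLEEP;
		return ACTION_COND_RESCHED;
	}
	return ACTION_FULL_WAIT;
}
```

For example, `pick_wait_action(0, true, MODEL_PF_WQ_WORKER)` selects the short sleep, while the same call with `task_flags` set to 0 selects the cond_resched() path, which matches the commit message's intent of limiting the sleep to worker threads.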

mm/vmstat.c

Lines changed: 4 additions & 2 deletions
@@ -1379,6 +1379,7 @@ static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1391,7 +1392,7 @@ static void vmstat_update(struct work_struct *w)
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1460,7 +1461,7 @@ static void vmstat_shepherd(struct work_struct *w)
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
 
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
@@ -1549,6 +1550,7 @@ static int __init setup_vmstat(void)
 
 	start_shepherd_timer();
 	cpu_notifier_register_done();
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
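
The commit message argues that both halves of the fix are needed together. The sketch below is a deliberately simplified userspace truth-table model of that argument, not a description of the real workqueue implementation: `MODEL_WQ_MEM_RECLAIM` and `vmstat_update_can_run()` are hypothetical names, and the model assumes that the worker pool only reacts (by spawning a worker or waking a rescuer) once the looping worker actually sleeps.

```c
#include <stdbool.h>

/* Hypothetical flag bit; the kernel's WQ_MEM_RECLAIM has its own value. */
#define MODEL_WQ_MEM_RECLAIM 0x1u

/*
 * Returns true when, in this simplified model, a queued vmstat update can
 * get an execution context while other workers are stuck in the allocator.
 */
static bool vmstat_update_can_run(bool looping_worker_sleeps,
				  bool can_spawn_worker,
				  unsigned int wq_flags)
{
	/* A worker that never sleeps looks busy, so the pool takes no action. */
	if (!looping_worker_sleeps)
		return false;

	/* Once the stall is visible, a new worker is enough if one can be created. */
	if (can_spawn_worker)
		return true;

	/* Under memory pressure only a queue with a rescuer still makes progress. */
	return (wq_flags & MODEL_WQ_MEM_RECLAIM) != 0;
}
```

In this model, `vmstat_update_can_run(true, false, MODEL_WQ_MEM_RECLAIM)` is true, whereas dropping either the sleep (first argument false) or the flag (third argument 0) makes it false when no new worker can be created. That corresponds to the short sleep in mm/backing-dev.c and the dedicated `vmstat_wq` in mm/vmstat.c respectively.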
