Skip to content

Conversation

@hemaprathaban
Copy link

HI Jacob
I have added some cleanup patches and cooling register

Jacob Pan and others added 11 commits August 19, 2013 15:30
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Hema Prathaban <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
ERROR: do not use C99 // comments.
C99 comments are removed in intel_pch_thermal.c

Signed-off-by: Hema prathaban <[email protected]>
This patch fixes the trailing white space error
ERROR: trailing whitespace

Signed-off-by: Hema prathaban <[email protected]>
This patch add the tabs where possible

ERROR: code indent should use tabs where possible

Signed-off-by: Hema prathaban <[email protected]>
This patch include include KERN_ facility level
WARNING: printk() should include KERN_ facility level

Signed-off-by: Hema prathaban <[email protected]>
This patch removes the space between function name
and open parenthesis
WARNING: space prohibited between function name and open parenthesis '('

Signed-off-by: Hema prathaban <[email protected]>
This patch removes the following warning
WARNING: sizeof *ptdev should be sizeof(*ptdev)

Signed-off-by: Hema prathaban <[email protected]>
This patch removes following  warning
WARNING: Prefer netdev_info(netdev, ... then dev_info(dev, ... then
pr_info(...  to printk(KERN_INFO ...

Signed-off-by: Hema prathaban <[email protected]>
This patch removes the following warning
WARNING: Use DEFINE_PCI_DEVICE_TABLE for struct pci_device_id

Signed-off-by: Hema prathaban <[email protected]>
This patch register the cooling device
for intel_pch_thermal.c

Signed-off-by: Hema prathaban <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
Pull networking fixes from David Miller:

 1) There was a simplification in the ipv6 ndisc packet sending
    attempted here, which avoided using memory accounting on the
    per-netns ndisc socket for sending NDISC packets.  It did fix some
    important issues, but it causes regressions so it gets reverted here
    too.  Specifically, the problem with this change is that the IPV6
    output path really depends upon there being a valid skb->sk
    attached.

    The reason we want to do this change in some form when we figure out
    how to do it right, is that if a device goes down the ndisc_sk
    socket send queue will fill up and block NDISC packets that we want
    to send to other devices too.  That's really bad behavior.

    Hopefully Thomas can come up with a better version of this change.

 2) Fix a severe TCP performance regression by reverting a change made
    to dev_pick_tx() quite some time ago.  From Eric Dumazet.

 3) TIPC returns wrongly signed error codes, fix from Erik Hugne.

 4) Fix OOPS when doing IPSEC over ipv4 tunnels due to orphaning the
    skb->sk too early.  Fix from Li Hongjun.

 5) RAW ipv4 sockets can use the wrong routing key during lookup, from
    Chris Clark.

 6) Similar to #1 revert an older change that tried to use plain
    alloc_skb() for SYN/ACK TCP packets, this broke the netfilter owner
    mark which needs to see the skb->sk for such frames.  From Phil
    Oester.

 7) BNX2x driver bug fixes from Ariel Elior and Yuval Mintz,
    specifically in the handling of virtual functions.

 8) IPSEC path error propagations to sockets is not done properly when
    we have v4 in v6, and v6 in v4 type rules.  Fix from Hannes Frederic
    Sowa.

 9) Fix missing channel context release in mac80211, from Johannes Berg.

10) Fix network namespace handing wrt.  SCM_RIGHTS, from Andy
    Lutomirski.

11) Fix usage of bogus NAPI weight in jme, netxen, and ps3_gelic
    drivers.  From Michal Schmidt.

12) Hopefully a complete and correct fix for the genetlink dump locking
    and module reference counting.  From Pravin B Shelar.

13) sk_busy_loop() must do a cpu_relax(), from Eliezer Tamir.

14) Fix handling of timestamp offset when restoring a snapshotted TCP
    socket.  From Andrew Vagin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  net: fec: fix time stamping logic after napi conversion
  net: bridge: convert MLDv2 Query MRC into msecs_to_jiffies for max_delay
  mISDN: return -EINVAL on error in dsp_control_req()
  net: revert 8728c54 ("net: dev_pick_tx() fix")
  Revert "ipv6: Don't depend on per socket memory for neighbour discovery messages"
  ipv4 tunnels: fix an oops when using ipip/sit with IPsec
  tipc: set sk_err correctly when connection fails
  tcp: tcp_make_synack() should use sock_wmalloc
  bridge: separate querier and query timer into IGMP/IPv4 and MLD/IPv6 ones
  ipv6: Don't depend on per socket memory for neighbour discovery messages
  ipv4: sendto/hdrincl: don't use destination address found in header
  tcp: don't apply tsoffset if rcv_tsecr is zero
  tcp: initialize rcv_tstamp for restored sockets
  net: xilinx: fix memleak
  net: usb: Add HP hs2434 device to ZLP exception table
  net: add cpu_relax to busy poll loop
  net: stmmac: fixed the pbl setting with DT
  genl: Hold reference on correct module while netlink-dump.
  genl: Fix genl dumpit() locking.
  xfrm: Fix potential null pointer dereference in xdst_queue_output
  ...
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
When picolcd is switched into bootloader mode (for FW flashing) make
sure not to try to dereference NULL-pointers of feature-devices during
unplug/unbind.

This fixes following BUG:
  BUG: unable to handle kernel NULL pointer dereference at 00000298
  IP: [<f811f56b>] picolcd_exit_framebuffer+0x1b/0x80 [hid_picolcd]
  *pde = 00000000
  Oops: 0000 [#1]
  Modules linked in: hid_picolcd syscopyarea sysfillrect sysimgblt fb_sys_fops
  CPU: 0 PID: 15 Comm: khubd Not tainted 3.11.0-rc7-00002-g50d62d4 #2
  EIP: 0060:[<f811f56b>] EFLAGS: 00010292 CPU: 0
  EIP is at picolcd_exit_framebuffer+0x1b/0x80 [hid_picolcd]
  Call Trace:
   [<f811d1ab>] picolcd_remove+0xcb/0x120 [hid_picolcd]
   [<c1469b09>] hid_device_remove+0x59/0xc0
   [<c13464ca>] __device_release_driver+0x5a/0xb0
   [<c134653f>] device_release_driver+0x1f/0x30
   [<c134603d>] bus_remove_device+0x9d/0xd0
   [<c13439a5>] device_del+0xd5/0x150
   [<c14696a4>] hid_destroy_device+0x24/0x60
   [<c1474cbb>] usbhid_disconnect+0x1b/0x40
   ...

Signed-off-by: Bruno Prémont <[email protected]>
Cc: [email protected]
Signed-off-by: Jiri Kosina <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
DEBUG_KOBJECT_RELEASE helps to find the issue attached below.

After some investigation, it seems the reason is:
The mod->mkobj.kobj(ffffffffa01600d0 below) is freed together with mod
itself in free_module(). However, its children still hold references to
it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the
child(holders below) tries to decrease the reference count to its parent
in kobject_del(), BUG happens as it tries to access already freed memory.

This patch tries to fix it by waiting for the mod->mkobj.kobj to be
really released in the module removing process (and some error code
paths).

[ 1844.175287] kobject: 'holders' (ffff88007c1f1600): kobject_release, parent ffffffffa01600d0 (delayed)
[ 1844.178991] kobject: 'notes' (ffff8800370b2a00): kobject_release, parent ffffffffa01600d0 (delayed)
[ 1845.180118] kobject: 'holders' (ffff88007c1f1600): kobject_cleanup, parent ffffffffa01600d0
[ 1845.182130] kobject: 'holders' (ffff88007c1f1600): auto cleanup kobject_del
[ 1845.184120] BUG: unable to handle kernel paging request at ffffffffa01601d0
[ 1845.185026] IP: [<ffffffff812cda81>] kobject_put+0x11/0x60
[ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0
[ 1845.185026] Oops: 0000 [#1] PREEMPT
[ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example]
[ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G           O 3.11.0-rc6-next-20130819+ #1
[ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 1845.185026] Workqueue: events kobject_delayed_cleanup
[ 1845.185026] task: ffff88007ca51f00 ti: ffff88007ca5c000 task.ti: ffff88007ca5c000
[ 1845.185026] RIP: 0010:[<ffffffff812cda81>]  [<ffffffff812cda81>] kobject_put+0x11/0x60
[ 1845.185026] RSP: 0018:ffff88007ca5dd08  EFLAGS: 00010282
[ 1845.185026] RAX: 0000000000002000 RBX: ffffffffa01600d0 RCX: ffffffff8177d638
[ 1845.185026] RDX: ffff88007ca5dc18 RSI: 0000000000000000 RDI: ffffffffa01600d0
[ 1845.185026] RBP: ffff88007ca5dd18 R08: ffffffff824e9810 R09: ffffffffffffffff
[ 1845.185026] R10: ffff8800ffffffff R11: dead4ead00000001 R12: ffffffff81a95040
[ 1845.185026] R13: ffff88007b27a960 R14: ffff88007c1f1600 R15: 0000000000000000
[ 1845.185026] FS:  0000000000000000(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000
[ 1845.185026] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1845.185026] CR2: ffffffffa01601d0 CR3: 0000000037207000 CR4: 00000000000006b0
[ 1845.185026] Stack:
[ 1845.185026]  ffff88007c1f1600 ffff88007c1f1600 ffff88007ca5dd38 ffffffff812cdb7e
[ 1845.185026]  0000000000000000 ffff88007c1f1640 ffff88007ca5dd68 ffffffff812cdbfe
[ 1845.185026]  ffff88007c974800 ffff88007c1f1640 ffff88007ff61a00 0000000000000000
[ 1845.185026] Call Trace:
[ 1845.185026]  [<ffffffff812cdb7e>] kobject_del+0x2e/0x40
[ 1845.185026]  [<ffffffff812cdbfe>] kobject_delayed_cleanup+0x6e/0x1d0
[ 1845.185026]  [<ffffffff81063a45>] process_one_work+0x1e5/0x670
[ 1845.185026]  [<ffffffff810639e3>] ? process_one_work+0x183/0x670
[ 1845.185026]  [<ffffffff810642b3>] worker_thread+0x113/0x370
[ 1845.185026]  [<ffffffff810641a0>] ? rescuer_thread+0x290/0x290
[ 1845.185026]  [<ffffffff8106bfba>] kthread+0xda/0xe0
[ 1845.185026]  [<ffffffff814ff0f0>] ? _raw_spin_unlock_irq+0x30/0x60
[ 1845.185026]  [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130
[ 1845.185026]  [<ffffffff8150751a>] ret_from_fork+0x7a/0xb0
[ 1845.185026]  [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130
[ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d <f6> 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84
[ 1845.185026] RIP  [<ffffffff812cda81>] kobject_put+0x11/0x60
[ 1845.185026]  RSP <ffff88007ca5dd08>
[ 1845.185026] CR2: ffffffffa01601d0
[ 1845.185026] ---[ end trace 49a70afd109f5653 ]---

Signed-off-by: Li Zhong <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
This patch corrects media entities unregistration order to make sure
the fimc.N.capture and fimc-lite video nodes are unregistered with
fimc->lock mutex held. This prevents races between video device open()
and defered probing and NULL pointer dereference in open() callback
as follows:
[   77.645000] Unable to handle kernel NULL pointer dereference at virtual address 00000290t
[   77.655000] pgd = ee7a8000
[   77.660000] [00000290] *pgd=6e13c831, *pte=00000000, *ppte=00000000
[   77.665000] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[   77.670000] Modules linked in: s5p_fimc ipv6 exynos_fimc_is exynos_fimc_lite
 s5p_csis v4l2_mem2mem videobuf2_dma_contig videobuf2_memops exynos4_is_common videobuf2_core [last unloaded: s5p_fimc]
[   77.685000] CPU: 0 PID : 2998 Comm: v4l_id Tainted: G        W   3.10.0-next-20130709-00039-g39f491b-dirty #1548
[   77.695000] task: ee084000 ti: ee46e000 task.ti: ee46e000
[   77.700000] PC is at __mutex_lock_slowpath+0x54/0x368
[   77.705000] LR is at __mutex_lock_slowpath+0x24/0x368
[   77.710000] pc : [<c038dc10>]    lr : [<c038dbe0>]    psr: 60000093
[   77.710000] sp : ee46fd70  ip : 000008c8  fp : c054e34c
[   77.725000] r10: ee084000  r9 : 00000000  r8 : ee439480
[   77.730000] r7 : ee46e000  r6 : 60000013  r5 : 00000290  r4 : 0000028
[   77.735000] r3 : 00000000  r2 : 00000000  r1 : 20000093  r0 : 00000001
[   77.740000] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM Segment user
[   77.750000] Control: 10c5387d  Table: 6e7a804a  DAC: 00000015
[   77.755000] Process v4l_id (pid: 2998, stack limit = 0xee46e238)
[   77.760000] Stack: (0xee46fd70 to 0xee470000)
    	       ...
[   77.935000] [<c038dc10>] (__mutex_lock_slowpath+0x54/0x368) from [<c038df30>] (mutex_lock+0xc/0x24)
[   77.945000] [<c038df30>] (mutex_lock+0xc/0x24) from [<bf03fa90>] (fimc_lite_open+0x12c/0x2bc [exynos_fimc_lite])
[   77.955000] [<bf03fa90>] (fimc_lite_open+0x12c/0x2bc [exynos_fimc_lite]) from [<c02ab11c>] (v4l2_open+0xa0/0xe0)
[   77.965000] [<c02ab11c>] (v4l2_open+0xa0/0xe0) from [<c00b1de4>] (chrdev_open+0x88/0x170)
[   77.975000] [<c00b1de4>] (chrdev_open+0x88/0x170) from [<c00ac710>] (do_dentry_open.isra.14+0x1d8/0x258)
[   77.985000] [<c00ac710>] (do_dentry_open.isra.14+0x1d8/0x258) from [<c00ac860>] (finish_open+0x20/0x38)
[   77.995000] [<c00ac860>] (finish_open+0x20/0x38) from [<c00ba658>] (do_last.isra.43+0x538/0xb1c)
[   78.000000] [<c00ba658>] (do_last.isra.43+0x538/0xb1c) from [<c00bacf0>] (path_openat+0xb4/0x5c4)
[   78.010000] [<c00bacf0>] (path_openat+0xb4/0x5c4) from [<c00bb4b4>] (do_filp_open+0x2c/0x80)
[   78.020000] [<c00bb4b4>] (do_filp_open+0x2c/0x80) from [<c00ad744>] (do_sys_open+0xf4/0x1a8)
[   78.025000] [<c00ad744>] (do_sys_open+0xf4/0x1a8) from [<c000e320>] (ret_fast_syscall+0x0/0x30)
[   78.035000] Code: 1a000093 e10f6000 f10c0080 e2845004 (e1953f9f)

Reported-by: Andrzej Hajda <[email protected]>
Signed-off-by: Sylwester Nawrocki <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Cc: [email protected]
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
Pull SCSI target fixes from Nicholas Bellinger:
 "The first patch is to address a long standing issue where INQUIRY
  vendor + model response data was not correctly padded with ASCII
  spaces, causing MSFT and Falconstor multipath stacks to not function
  with our LUNs.

  The second -> forth patches are additional iscsi-target regression
  fixes for the post >= v3.10 iser-target changes.  The second and third
  are failure cases that have appeared during further testing, and the
  forth is only reproducible with malformed NOP packets.

  The fifth patch is a v3.11 specific regression caused by a recent
  optimization that showed up during WRITE I/O failure testing.

  I'll be sending Patch #1 and Patch #5 to Greg-KH separately for
  stable"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  target: Fix se_cmd->state_list leak regression during WRITE failure
  iscsi-target: Fix potential NULL pointer in solicited NOPOUT reject
  iscsi-target: Fix iscsit_transport reference leak during NP thread reset
  iscsi-target: Fix ImmediateData=Yes failure regression in >= v3.10
  target: Fix trailing ASCII space usage in INQUIRY vendor+model
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
When my 'ifup eth' script was fired multiple times and ran concurrent on
my laptop, for some obscure /etc scripting reason, it was revealed
that the store_enabled() function in netconsole doesn't handle it nicely,
as recorded by the Oops below (a syslog paste, but not mangled too much
to prevent from discerning the traceback).

On Linux 3.10.4, this patch seeks to remedy the problem, and it has been
running stable on my laptop for a few days.

[52608.609325] BUG: unable to handle kernel NULL pointer dereference at 00000000000003e0
[52608.609331] IP: [<ffffffff81532a17>] __netpoll_cleanup+0x27/0xe0
[52608.609339] PGD 15e51a067 PUD 15433e067 PMD 0
[52608.609343] Oops: 0000 [#1] SMP re firewire_ohci firewire_core crc_itu_t [last unloaded: kvm_intel]
[52608.609347] Modules linked in: kvm_intel tun vfat fat ppdev parport_pc parport fuse ipt_MASQUERADE usb_storage nf_conntrack_netbios_ns nf_conn [..garbled..]
[52608.609433] RAX: 0000000000000000 RBX: ffff880210bbcc68 RCX: 0000000000000000
[52608.609435] RDX: 0000000000000000 RSI: ffff8801ba447da0 RDI: ffff880210bbcc68
[52608.609437] RBP: ffff8801ba447e18 R08: 0000000000000000 R09: 0000000000000001
[52608.609439] R10: 000000000000000a R11: f000000000000000 R12: ffff880210bbcc68
[52608.609441] R13: ffff88020bc41000 R14: 0000000000000002 R15: 000000000000000200000000000
[52608.609443] FS:  00007f38d7bff740(0000) GS:ffff88021dc40000(0000) knlGS:0000000000000000
[52608.609446] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003300000000001427e0
[52608.609448] CR2: 00000000000003e0 CR3: 0000000154103000 CR4: 00000000001427e0
[52608.609450] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[52608.609452] netpoll: netconsole: local port 6665ess 10.0.0.27
[52608.609454] netpoll: netconsole: local IPv4 address 10.0.0.27
[52608.609456] netpoll: netconsole: interface 'em1'
[52608.609457] netpoll: netconsole: remote port 514ress 10.0.0.15
[52608.609459] netpoll: netconsole: remote IPv4 address 10.0.0.15:65:a8:9a:c7
[52608.609461] netpoll: netconsole: remote ethernet address 1c:6f:65:a8:9a:c7
[52608.609463] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[52608.609464] Stack:801ba447e08 ffff880210bbcc68 ffffffffffffffea ffff88020bc41000
[52608.609466]  ffff8801ba447e08 ffff880210bbcc68 ffffffffffffffea ffff88020bc41000
[52608.609471]  0000000000000002 0000000000000002 ffff8801ba447e38 ffffffff81532af4
[52608.609475]  0000000000000000 ffff880210bbcc00 ffff8801ba447e78 ffffffff81420e7c
[52608.609479] Call Trace:
[52608.609484]  [<ffffffff81532af4>] netpoll_cleanup+0x24/0x50
[52608.609489]  [<ffffffff81420e7c>] store_enabled+0x5c/0xe0
[52608.609492]  [<ffffffff81420abe>] netconsole_target_attr_store+0x2e/0x40
[52608.609498]  [<ffffffff811ff2a2>] configfs_write_file+0xd2/0x130
[52608.609503]  [<ffffffff81188f95>] vfs_write+0xc5/0x1f0
[52608.609506]  [<ffffffff81189482>] SyS_write+0x52/0xa0/0x10
[52608.609511]  [<ffffffff81628c2e>] ? do_page_fault+0xe/0x10
[52608.609516]  [<ffffffff8162d402>] system_call_fastpath+0x16/0x1b
[52608.609517] Code: 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 30 4c 89 65 e0 48 89 5d d8 49 89 fc 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 48 8 [..garbled..]
[52608.609559] RIP  [<ffffffff81532a17>] __netpoll_cleanup+0x27/0xe0
[52608.609563]  RSP <ffff8801ba447de8>
[52608.609564] CR2: 00000000000003e0
[52608.609567] ---[ end trace d25ec343349b61d2 ]---

Signed-off-by: Dan Aloni <[email protected]>
Signed-off-by: Neil Horman <[email protected]>
CC: David S. Miller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
On vxlan device create if socket create fails vxlan device is not
added to hash table. Therefore we need to check if device
is in hashtable before we delete it from hlist.
Following patch avoid the crash. net-next already has this fix.

---8<---
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa05f9ca7>] vxlan_dellink+0x77/0xf0 [vxlan]
PGD 42b2d9067 PUD 42e04c067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: vxlan(-)
Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 1.4.8 10/25/2012
task: ffff88042ecf8760 ti: ffff88042f106000 task.ti: ffff88042f106000
RIP: 0010:[<ffffffffa05f9ca7>]  [<ffffffffa05f9ca7>]
vxlan_dellink+0x77/0xf0 [vxlan]
RSP: 0018:ffff88042f107e28  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88082af08000 RCX: ffff88083fd80000
RDX: 0000000000000000 RSI: ffff88042f107e58 RDI: ffff88042e12f810
RBP: ffff88042f107e48 R08: ffffffff8166eca0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88082af087c0
R13: ffff88042e12f000 R14: ffff88042f107e58 R15: ffff88042f107e58
FS:  00007f4ed2de7700(0000) GS:ffff88043fc80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000042e076000 CR4: 00000000000407e0
Stack:
 ffff88082af08000 ffffffff81654848 ffffffffa05fb4e0 ffffffff81654780
 ffff88042f107e98 ffffffff813b9c7a ffff88042f107e58 ffff88042f107e58
 ffff88042f107e88 ffffffffa05fb4e0 ffffffffa05fb780 ffff88042f107f18
Call Trace:
 [<ffffffff813b9c7a>] __rtnl_link_unregister+0xca/0xd0
 [<ffffffff813bb0e9>] rtnl_link_unregister+0x19/0x30
 [<ffffffffa05faa4c>] vxlan_cleanup_module+0x10/0x2f [vxlan]
 [<ffffffff81099fef>] SyS_delete_module+0x1cf/0x2c0
 [<ffffffff8146c069>] ? do_page_fault+0x9/0x10
 [<ffffffff8146f012>] system_call_fastpath+0x16/0x1b
Code: 4d 85 ed 0f 84 95 00 00 00 4c 8d a7 c0 07 00 00 49 8d bd 10 08 00
00 e8 28 e8 e6 e0 48 8b 83 c0 07 00 00 49 8b 54 24 08 48 85 c0 <48> 89
02 74 04 48 89 50 08 49 b8 00 02 20 00 00 00 ad de 4d 89
RIP  [<ffffffffa05f9ca7>] vxlan_dellink+0x77/0xf0 [vxlan]
 RSP <ffff88042f107e28>
CR2: 0000000000000000

Signed-off-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
Since the Guest is in ring 1, it can't read the debug registers: doing
so gives a number of nasty messages:

(gdb) run
Starting program: /bin/sleep
[   31.170230] general protection fault: 0000 [#1] SMP
[   31.170230] Modules linked in:
[   31.170230] CPU: 0 PID: 2678 Comm: sleep Not tainted 3.11.0+ torvalds#64
[   31.170230] task: cc5c09b0 ti: cc79c000 task.ti: cc79c000
[   31.170230] EIP: 0061:[<c01333d8>] EFLAGS: 00000097 CPU: 0
[   31.170230] EIP is at native_get_debugreg+0x58/0x70
[   31.170230] EAX: 00000006 EBX: cc79dfb4 ECX: b7fff918 EDX: 00000000
[   31.170230] ESI: cc5c09b0 EDI: 00000000 EBP: cc79df84 ESP: cc79df84
[   31.170230]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[   31.170230] CR0: 00000008 CR2: 081ba69a CR3: 0e2f2000 CR4: 00000000

Signed-off-by: Rusty Russell <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 10, 2013
Previously ceph_readpage_to_fscache did not call if page was marked as cached
before calling fscache_write_page resulting in a BUG inside of fscache.

FS-Cache: Assertion failed
------------[ cut here ]------------
kernel BUG at fs/fscache/page.c:874!
invalid opcode: 0000 [#1] SMP
Call Trace:
 [<ffffffffa02e6566>] __ceph_readpage_to_fscache+0x66/0x80 [ceph]
 [<ffffffffa02caf84>] readpage_nounlock+0x124/0x210 [ceph]
 [<ffffffffa02cb08d>] ceph_readpage+0x1d/0x40 [ceph]
 [<ffffffff81126db6>] generic_file_aio_read+0x1f6/0x700
 [<ffffffffa02c6fcc>] ceph_aio_read+0x5fc/0xab0 [ceph]

Signed-off-by: Milosz Tanski <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
An oops can be triggered in the scheduler when running qemu on arm64:

 Unable to handle kernel paging request at virtual address ffff000008effe40
 Internal error: Oops: 96000007 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
 pstate: 20000085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The reproduce the crash:

  1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
  sched_setaffinity.

  2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
  and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

  3) Oops appears if the invalid CPU is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 torvalds#42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
 create_page_buffers+0x4d/0x60
 __block_write_begin_int+0x8e/0x5a0
 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
 ? jbd2__journal_start+0xd7/0x1f0
 ext4_da_write_begin+0x112/0x3d0
 generic_perform_write+0xf1/0x1b0
 ? file_update_time+0x70/0x140
 __generic_file_write_iter+0x141/0x1a0
 ext4_file_write_iter+0xef/0x3b0
 __vfs_write+0x17e/0x1e0
 vfs_write+0xa5/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x55/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
behavior because nofail allocations are not allowed to fail.  Normal
charge path simply forces the charge even if that means to cross the
limit.  Kmem accounting should be doing the same.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Thomas Lindroth <[email protected]>
Debugged-by: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Thomas Lindroth <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
have DEBUG defined in my build.  The problem is that
print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
consider here

  1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
  2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
  3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out

Right now, that last case isn't detected and we still call
print_hex_dump() from the stub wrapper.

Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
it works properly in all cases.

Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
is a nop, changing behavior to what we want, i.e.  print nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
When removing a cbs instance when offloading is enabled, the crash
below can be observed.

The problem happens because that when offloading is enabled, the cbs
instance is not added to the list.

Also, the current code doesn't handle correctly the case when offload
is disabled without removing the qdisc: if the link speed changes the
credit calculations will be wrong. When we create the cbs instance
with offloading enabled, it's not added to the notification list, when
later we disable offloading, it's not in the list, so link speed
changes will not affect it.

The solution for both issues is the same, add the cbs instance being
created unconditionally to the global list, even if the link state
notification isn't useful "right now".

Crash log:

[518758.189866] BUG: kernel NULL pointer dereference, address: 0000000000000000
[518758.189870] #PF: supervisor read access in kernel mode
[518758.189871] #PF: error_code(0x0000) - not-present page
[518758.189872] PGD 0 P4D 0
[518758.189874] Oops: 0000 [#1] SMP PTI
[518758.189876] CPU: 3 PID: 4825 Comm: tc Not tainted 5.2.9 #1
[518758.189877] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
[518758.189881] RIP: 0010:__list_del_entry_valid+0x29/0xa0
[518758.189883] Code: 90 48 b8 00 01 00 00 00 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 48 b8 00 02 00 00 00 00 ad de 49 39 c0 74 2d <49> 8b 30 48 39 fe 75 3d 48 8b 52 08 48 39 f2 75 4c b8 01 00 00 00
[518758.189885] RSP: 0018:ffffa27e43903990 EFLAGS: 00010207
[518758.189887] RAX: dead000000000200 RBX: ffff8bce69f0f000 RCX: 0000000000000000
[518758.189888] RDX: 0000000000000000 RSI: ffff8bce69f0f064 RDI: ffff8bce69f0f1e0
[518758.189890] RBP: ffffa27e43903990 R08: 0000000000000000 R09: ffff8bce69e788c0
[518758.189891] R10: ffff8bce62acd400 R11: 00000000000003cb R12: ffff8bce69e78000
[518758.189892] R13: ffff8bce69f0f140 R14: 0000000000000000 R15: 0000000000000000
[518758.189894] FS:  00007fa1572c8f80(0000) GS:ffff8bce6e0c0000(0000) knlGS:0000000000000000
[518758.189895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[518758.189896] CR2: 0000000000000000 CR3: 000000040a398006 CR4: 00000000003606e0
[518758.189898] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[518758.189899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[518758.189900] Call Trace:
[518758.189904]  cbs_destroy+0x32/0xa0 [sch_cbs]
[518758.189906]  qdisc_destroy+0x45/0x120
[518758.189907]  qdisc_put+0x25/0x30
[518758.189908]  qdisc_graft+0x2c1/0x450
[518758.189910]  tc_get_qdisc+0x1c8/0x310
[518758.189912]  ? get_page_from_freelist+0x91a/0xcb0
[518758.189914]  rtnetlink_rcv_msg+0x293/0x360
[518758.189916]  ? kmem_cache_alloc_node_trace+0x178/0x260
[518758.189918]  ? __kmalloc_node_track_caller+0x38/0x50
[518758.189920]  ? rtnl_calcit.isra.0+0xf0/0xf0
[518758.189922]  netlink_rcv_skb+0x48/0x110
[518758.189923]  rtnetlink_rcv+0x10/0x20
[518758.189925]  netlink_unicast+0x15b/0x1d0
[518758.189926]  netlink_sendmsg+0x1ea/0x380
[518758.189929]  sock_sendmsg+0x2f/0x40
[518758.189930]  ___sys_sendmsg+0x295/0x2f0
[518758.189932]  ? ___sys_recvmsg+0x151/0x1e0
[518758.189933]  ? do_wp_page+0x7e/0x450
[518758.189935]  __sys_sendmsg+0x48/0x80
[518758.189937]  __x64_sys_sendmsg+0x1a/0x20
[518758.189939]  do_syscall_64+0x53/0x1f0
[518758.189941]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[518758.189942] RIP: 0033:0x7fa15755169a
[518758.189944] Code: 48 c7 c0 ff ff ff ff eb be 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 18 b8 2e 00 00 00 c5 fc 77 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 89 54 24 1c
[518758.189946] RSP: 002b:00007ffda58b60b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[518758.189948] RAX: ffffffffffffffda RBX: 000055e4b836d9a0 RCX: 00007fa15755169a
[518758.189949] RDX: 0000000000000000 RSI: 00007ffda58b6128 RDI: 0000000000000003
[518758.189951] RBP: 00007ffda58b6190 R08: 0000000000000001 R09: 000055e4b9d848a0
[518758.189952] R10: 0000000000000000 R11: 0000000000000246 R12: 000000005d654b49
[518758.189953] R13: 0000000000000000 R14: 00007ffda58b6230 R15: 00007ffda58b6210
[518758.189955] Modules linked in: sch_cbs sch_etf sch_mqprio netlink_diag unix_diag e1000e igb intel_pch_thermal thermal video backlight pcc_cpufreq
[518758.189960] CR2: 0000000000000000
[518758.189961] ---[ end trace 6a13f7aaf5376019 ]---
[518758.189963] RIP: 0010:__list_del_entry_valid+0x29/0xa0
[518758.189964] Code: 90 48 b8 00 01 00 00 00 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 48 b8 00 02 00 00 00 00 ad de 49 39 c0 74 2d <49> 8b 30 48 39 fe 75 3d 48 8b 52 08 48 39 f2 75 4c b8 01 00 00 00
[518758.189967] RSP: 0018:ffffa27e43903990 EFLAGS: 00010207
[518758.189968] RAX: dead000000000200 RBX: ffff8bce69f0f000 RCX: 0000000000000000
[518758.189969] RDX: 0000000000000000 RSI: ffff8bce69f0f064 RDI: ffff8bce69f0f1e0
[518758.189971] RBP: ffffa27e43903990 R08: 0000000000000000 R09: ffff8bce69e788c0
[518758.189972] R10: ffff8bce62acd400 R11: 00000000000003cb R12: ffff8bce69e78000
[518758.189973] R13: ffff8bce69f0f140 R14: 0000000000000000 R15: 0000000000000000
[518758.189975] FS:  00007fa1572c8f80(0000) GS:ffff8bce6e0c0000(0000) knlGS:0000000000000000
[518758.189976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[518758.189977] CR2: 0000000000000000 CR3: 000000040a398006 CR4: 00000000003606e0
[518758.189979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[518758.189980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: e0a7683 ("net/sched: cbs: fix port_rate miscalculation")
Signed-off-by: Vinicius Costa Gomes <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
After commit a2c11b0 ("kcm: use BPF_PROG_RUN")
syzbot easily triggers the warning in cant_sleep().

As explained in commit 6cab5e9 ("bpf: run bpf programs
with preemption disabled") we need to disable preemption before
running bpf programs.

BUG: assuming atomic context at net/kcm/kcmsock.c:382
in_atomic(): 0, irqs_disabled(): 0, pid: 7, name: kworker/u4:0
3 locks held by kworker/u4:0/7:
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: __write_once_size include/linux/compiler.h:226 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic64_set include/asm-generic/atomic-instrumented.h:855 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic_long_set include/asm-generic/atomic-long.h:40 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_data kernel/workqueue.c:620 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_pool_and_clear_pending kernel/workqueue.c:647 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: process_one_work+0x88b/0x1740 kernel/workqueue.c:2240
 #1: ffff8880a989fdc0 ((work_completion)(&strp->work)){+.+.}, at: process_one_work+0x8c1/0x1740 kernel/workqueue.c:2244
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: lock_sock include/net/sock.h:1522 [inline]
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: strp_sock_lock+0x2e/0x40 net/strparser/strparser.c:440
CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: kstrp strp_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 __cant_sleep kernel/sched/core.c:6826 [inline]
 __cant_sleep.cold+0xa4/0xbc kernel/sched/core.c:6803
 kcm_parse_func_strparser+0x54/0x200 net/kcm/kcmsock.c:382
 __strp_recv+0x5dc/0x1b20 net/strparser/strparser.c:221
 strp_recv+0xcf/0x10b net/strparser/strparser.c:343
 tcp_read_sock+0x285/0xa00 net/ipv4/tcp.c:1639
 strp_read_sock+0x14d/0x200 net/strparser/strparser.c:366
 do_strp_work net/strparser/strparser.c:414 [inline]
 strp_work+0xe3/0x130 net/strparser/strparser.c:423
 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269

Fixes: a2c11b0 ("kcm: use BPF_PROG_RUN")
Fixes: 6cab5e9 ("bpf: run bpf programs with preemption disabled")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
qdisc_root() use from netem_enqueue() triggers a lockdep warning.

__dev_queue_xmit() uses rcu_read_lock_bh() which is
not equivalent to rcu_read_lock() + local_bh_disable_bh as far
as lockdep is concerned.

WARNING: suspicious RCU usage
5.3.0-rc7+ #0 Not tainted
-----------------------------
include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor427/8855:
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
 #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838

stack backtrace:
CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
 qdisc_root include/net/sch_generic.h:492 [inline]
 netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
 __dev_xmit_skb net/core/dev.c:3527 [inline]
 __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
 ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
 ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
 udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
 udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.

Fixes: 5d1638a ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balasubramani Vivekanandan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:

[  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  108.827059] PGD 0 P4D 0
[  108.827313] Oops: 0000 [#1] SMP PTI
[  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ torvalds#431
[  108.829503] Workqueue: kblockd blk_mq_timeout_work
[  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[  108.838191] Call Trace:
[  108.838406]  bt_iter+0x74/0x80
[  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
[  108.839074]  ? __switch_to_asm+0x34/0x70
[  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
[  108.840732]  blk_mq_timeout_work+0x74/0x200
[  108.841151]  process_one_work+0x297/0x680
[  108.841550]  worker_thread+0x29c/0x6f0
[  108.841926]  ? rescuer_thread+0x580/0x580
[  108.842344]  kthread+0x16a/0x1a0
[  108.842666]  ? kthread_flush_work+0x170/0x170
[  108.843100]  ret_from_fork+0x35/0x40

The bug is caused by the race between timeout handle and completion for
flush request.

When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.

After commit 12f5b93 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.

However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.

We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().

Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: [email protected] # v4.18+
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Signed-off-by: Yufen Yu <[email protected]>

-------
v2:
 - move rq_status from struct request to struct blk_flush_queue
v3:
 - remove unnecessary '{}' pair.
v4:
 - let spinlock to protect 'fq->rq_status'
v5:
 - move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset includes two small fixes for the mlxsw driver and one
patch which clarifies recently introduced devlink-trap documentation.

Patch #1 clears the port's VLAN filters during port initialization. This
ensures that the drop reason reported to the user is consistent. The
problem is explained in detail in the commit message.

Patch #2 clarifies the description of one of the traps exposed via
devlink-trap.

Patch #3 from Danielle forbids the installation of a tc filter with
multiple mirror actions since this is not supported by the device. The
failure is communicated to the user via extack.
====================

Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Oct 17, 2019
syzbot reported a crash in cbq_normalize_quanta() caused
by an out of range cl->priority.

iproute2 enforces this check, but malicious users do not.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
Modules linked in:
CPU: 1 PID: 26447 Comm: syz-executor.1 Not tainted 5.3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:cbq_normalize_quanta.part.0+0x1fd/0x430 net/sched/sch_cbq.c:902
RSP: 0018:ffff8801a5c333b0 EFLAGS: 00010206
RAX: 0000000020000003 RBX: 00000000fffffff8 RCX: ffffc9000712f000
RDX: 00000000000043bf RSI: ffffffff83be8962 RDI: 0000000100000018
RBP: ffff8801a5c33420 R08: 000000000000003a R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000002ef
R13: ffff88018da95188 R14: dffffc0000000000 R15: 0000000000000015
FS:  00007f37d26b1700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004c7cec CR3: 00000001bcd0a006 CR4: 00000000001626f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff83be9d57>] cbq_normalize_quanta include/net/pkt_sched.h:27 [inline]
 [<ffffffff83be9d57>] cbq_addprio net/sched/sch_cbq.c:1097 [inline]
 [<ffffffff83be9d57>] cbq_set_wrr+0x2d7/0x450 net/sched/sch_cbq.c:1115
 [<ffffffff83bee8a7>] cbq_change_class+0x987/0x225b net/sched/sch_cbq.c:1537
 [<ffffffff83b96985>] tc_ctl_tclass+0x555/0xcd0 net/sched/sch_api.c:2329
 [<ffffffff83a84655>] rtnetlink_rcv_msg+0x485/0xc10 net/core/rtnetlink.c:5248
 [<ffffffff83cadf0a>] netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2510
 [<ffffffff83a7db6d>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5266
 [<ffffffff83cac2c6>] netlink_unicast_kernel net/netlink/af_netlink.c:1324 [inline]
 [<ffffffff83cac2c6>] netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1350
 [<ffffffff83cacd4a>] netlink_sendmsg+0x89a/0xd50 net/netlink/af_netlink.c:1939
 [<ffffffff8399d46e>] sock_sendmsg_nosec net/socket.c:673 [inline]
 [<ffffffff8399d46e>] sock_sendmsg+0x12e/0x170 net/socket.c:684
 [<ffffffff8399f1fd>] ___sys_sendmsg+0x81d/0x960 net/socket.c:2359
 [<ffffffff839a2d05>] __sys_sendmsg+0x105/0x1d0 net/socket.c:2397
 [<ffffffff839a2df9>] SYSC_sendmsg net/socket.c:2406 [inline]
 [<ffffffff839a2df9>] SyS_sendmsg+0x29/0x30 net/socket.c:2404
 [<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305
 [<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
In the existing NVMeOF Passthru core command handling on failure of
nvme_alloc_request() it errors out with rq value set to NULL. In the
error handling path it calls blk_put_request() without checking if
rq is set to NULL or not which produces following Oops:-

[ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1457.347838] #PF: supervisor read access in kernel mode
[ 1457.348464] #PF: error_code(0x0000) - not-present page
[ 1457.349085] PGD 0 P4D 0
[ 1457.349402] Oops: 0000 [#1] SMP NOPTI
[ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G           OE     5.8.0-rc4nvme-5.9+ torvalds#35
[ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
[ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.361322] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.362337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.363973] Call Trace:
[ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
[ 1457.364990]  process_one_work+0x24e/0x5a0
[ 1457.365493]  ? __schedule+0x353/0x840
[ 1457.365957]  worker_thread+0x3c/0x380
[ 1457.366426]  ? process_one_work+0x5a0/0x5a0
[ 1457.366948]  kthread+0x135/0x150
[ 1457.367362]  ? kthread_create_on_node+0x60/0x60
[ 1457.367934]  ret_from_fork+0x22/0x30
[ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
[ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
[ 1457.380849] CR2: 0000000000000000
[ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
[ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.389798] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.390796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.392525] Kernel panic - not syncing: Fatal exception
[ 1457.394138] Kernel Offset: disabled
[ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

We fix this Oops by adding a new goto label out_put_req and reordering
the blk_put_request call to avoid calling blk_put_request() with rq
value is set to NULL. Here we also update the rest of the code
accordingly.

Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
Recently nvme_dev.q_depth was changed from an int to u16 type.

This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the
result:

root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0000000000000010
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
[0000000000000010] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: nvme nvme_core
CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 torvalds#27
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
pc : __sg_alloc_table_from_pages+0xec/0x238
lr : __sg_alloc_table_from_pages+0xc8/0x238
sp : ffff800013ccbad0
x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
x27: 0000000000000000 x26: 0000000000002dc2
x25: 0000000000000dc0 x24: 0000000000000000
x23: 0000000000000000 x22: ffff800013ccbbe8
x21: 0000000000000010 x20: 0000000000000000
x19: 00000000fffff000 x18: ffffffffffffffff
x17: 00000000000000c0 x16: fffffe289eaf6380
x15: ffff800011b59948 x14: ffff002bc8fe98f8
x13: ff00000000000000 x12: ffff8000114ca000
x11: 0000000000000000 x10: ffffffffffffffff
x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0a27b5f9b680 x4 : 0000000000000000
x3 : ffff0a27b5f9b680 x2 : 0000000000000000
 x1 : 0000000000000001 x0 : 0000000000000000
 Call trace:
__sg_alloc_table_from_pages+0xec/0x238
sg_alloc_table_from_pages+0x18/0x28
iommu_dma_alloc+0x474/0x678
dma_alloc_attrs+0xd8/0xf0
nvme_alloc_queue+0x114/0x160 [nvme]
nvme_reset_work+0xb34/0x14b4 [nvme]
process_one_work+0x1e8/0x360
worker_thread+0x44/0x478
kthread+0x150/0x158
ret_from_fork+0x10/0x34
 Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
---[ end trace 89bb2b72d59bf925 ]---

Fix by making onto a u32.

Also use u32 for nvme_dev.q_depth, as we assign this value from
nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
avoids the same crash as above.

Fixes: 61f3b89 ("nvme-pci: use unsigned for io queue depth")
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
The recent commit 01eb018 ("powerpc/64s: Fix restore_math
unnecessarily changing MSR") changed some of the handling of floating
point/vector restore.

In particular it caused current->thread.fpexc_mode to be copied into
the current MSR (via msr_check_and_set()), rather than just into
regs->msr (which is moved into MSR on return to userspace).

This can lead to a crash in the kernel if we take a floating point
exception when restoring FPSCR:

  Oops: Exception in kernel mode, sig: 8 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 3 PID: 101213 Comm: ld64.so.2 Not tainted 5.9.0-rc1-00098-g18445bf405cb-dirty torvalds#9
  NIP:  c00000000000fbb4 LR: c00000000001a7ac CTR: c000000000183570
  REGS: c0000016b7cfb3b0 TRAP: 0700   Not tainted  (5.9.0-rc1-00098-g18445bf405cb-dirty)
  MSR:  900000000290b933 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002444  XER: 00000000
  CFAR: c00000000001a7a8 IRQMASK: 1
  GPR00: c00000000001ae40 c0000016b7cfb640 c0000000011b7f00 c000001542a0f740
  GPR04: c000001542a0f720 c000001542a0eb00 0000000000000900 c000001542a0eb00
  GPR08: 000000000000000a 0000000000002000 9000000000009033 0000000000000000
  GPR12: 0000000000004000 c0000017ffffd900 0000000000000001 c000000000df5a58
  GPR16: c000000000e19c18 c0000000010e1123 0000000000000001 c000000000e1a638
  GPR20: 0000000000000000 c0000000044b1d00 0000000000000000 c000001542a0f2a0
  GPR24: 00000016c7fe0000 c000001542a0f720 c000000001c93da0 c000000000fe5f28
  GPR28: c000001542a0f720 0000000000800000 c0000016b7cfbe90 0000000002802900
  NIP load_fp_state+0x4/0x214
  LR  restore_math+0x17c/0x1f0
  Call Trace:
    0xc0000016b7cfb680 (unreliable)
    __switch_to+0x330/0x460
    __schedule+0x318/0x920
    schedule+0x74/0x140
    schedule_timeout+0x318/0x3f0
    wait_for_completion+0xc8/0x210
    call_usermodehelper_exec+0x234/0x280
    do_coredump+0xedc/0x13c0
    get_signal+0x1d4/0xbe0
    do_notify_resume+0x1a0/0x490
    interrupt_exit_user_prepare+0x1c4/0x230
    interrupt_return+0x14/0x1c0
  Instruction dump:
  ebe10168 e88101a0 7c8ff120 382101e0 e8010010 7c0803a6 4e800020 790605c4
  782905c4 7c0008a8 7c0008a8 c8030200 <fffe058e> 48000088 c8030000 c8230010

Fix it by only loading the fpexc_mode value into regs->msr.

Also add a comment to explain that although VSX is subject to the
value of fpexc_mode, we don't have to handle that separately because
we only allow VSX to be enabled if FP is also enabled.

Fixes: 01eb018 ("powerpc/64s: Fix restore_math unnecessarily changing MSR")
Reported-by: Milton Miller <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
With the conversion of the tree locks to rwsem I got the following
lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00167-g0d7ba0c5b375-dirty torvalds#925 Not tainted
  ------------------------------------------------------
  btrfs-uuid/7955 is trying to acquire lock:
  ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  but task is already holding lock:
  ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-uuid-00){++++}-{3:3}:
	 down_read_nested+0x3e/0x140
	 __btrfs_tree_read_lock+0x39/0x180
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x4bd/0x990
	 btrfs_uuid_tree_add+0x89/0x2d0
	 btrfs_uuid_scan_kthread+0x330/0x390
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 down_read_nested+0x3e/0x140
	 __btrfs_tree_read_lock+0x39/0x180
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x4bd/0x990
	 btrfs_find_root+0x45/0x1b0
	 btrfs_read_tree_root+0x61/0x100
	 btrfs_get_root_ref.part.50+0x143/0x630
	 btrfs_uuid_tree_iterate+0x207/0x314
	 btrfs_uuid_rescan_kthread+0x12/0x50
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-uuid-00);
				 lock(btrfs-root-00);
				 lock(btrfs-uuid-00);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  1 lock held by btrfs-uuid/7955:
   #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  stack backtrace:
  CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty torvalds#925
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? __btrfs_tree_read_lock+0x39/0x180
   ? btrfs_root_node+0x1c/0x1d0
   down_read_nested+0x3e/0x140
   ? __btrfs_tree_read_lock+0x39/0x180
   __btrfs_tree_read_lock+0x39/0x180
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x4bd/0x990
   btrfs_find_root+0x45/0x1b0
   btrfs_read_tree_root+0x61/0x100
   btrfs_get_root_ref.part.50+0x143/0x630
   btrfs_uuid_tree_iterate+0x207/0x314
   ? btree_readpage+0x20/0x20
   btrfs_uuid_rescan_kthread+0x12/0x50
   kthread+0x133/0x150
   ? kthread_create_on_node+0x60/0x60
   ret_from_fork+0x1f/0x30

This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots.  The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.

It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop.  So simply move
the path release before we add our entry to the uuid tree.

This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.

CC: [email protected] # 4.4+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
With the conversion of the tree locks to rwsem I got the following
lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00165-g04ec4da5f45f-dirty torvalds#922 Not tainted
  ------------------------------------------------------
  compsize/11122 is trying to acquire lock:
  ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

  but task is already holding lock:
  ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-fs-00){++++}-{3:3}:
	 down_write_nested+0x3b/0x70
	 __btrfs_tree_lock+0x24/0x120
	 btrfs_search_slot+0x756/0x990
	 btrfs_lookup_inode+0x3a/0xb4
	 __btrfs_update_delayed_inode+0x93/0x270
	 btrfs_async_run_delayed_root+0x168/0x230
	 btrfs_work_helper+0xd4/0x570
	 process_one_work+0x2ad/0x5f0
	 worker_thread+0x3a/0x3d0
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_delayed_update_inode+0x50/0x440
	 btrfs_update_inode+0x8a/0xf0
	 btrfs_dirty_inode+0x5b/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 copy_to_sk.isra.32+0x121/0x300
	 search_ioctl+0x106/0x200
	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
	 btrfs_ioctl+0x106f/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-fs-00);
				 lock(&delayed_node->mutex);
				 lock(btrfs-fs-00);
    lock(&mm->mmap_lock#2);

   *** DEADLOCK ***

  1 lock held by compsize/11122:
   #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  stack backtrace:
  CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty torvalds#922
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? __might_fault+0x3e/0x90
   ? find_held_lock+0x72/0x90
   __might_fault+0x68/0x90
   ? __might_fault+0x3e/0x90
   _copy_to_user+0x1e/0x80
   copy_to_sk.isra.32+0x121/0x300
   ? btrfs_search_forward+0x2a6/0x360
   search_ioctl+0x106/0x200
   btrfs_ioctl_tree_search_v2+0x7b/0xf0
   btrfs_ioctl+0x106f/0x30a0
   ? __do_sys_newfstat+0x5a/0x70
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.

CC: [email protected] # 4.4+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
I got the following lockdep splat while testing:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00172-g021118712e59 torvalds#932 Not tainted
  ------------------------------------------------------
  btrfs/229626 is trying to acquire lock:
  ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

  but task is already holding lock:
  ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> torvalds#7 (&fs_info->scrub_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_scrub_dev+0x11c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> torvalds#6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_run_dev_stats+0x49/0x480
	 commit_cowonly_roots+0xb5/0x2a0
	 btrfs_commit_transaction+0x516/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_commit_transaction+0x4bb/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_record_root_in_trans+0x43/0x70
	 start_transaction+0xd1/0x5d0
	 btrfs_dirty_inode+0x42/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 perf_read+0x141/0x2c0
	 vfs_read+0xad/0x1b0
	 ksys_read+0x5f/0xe0
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x88/0x150
	 perf_event_init+0x1db/0x20b
	 start_kernel+0x3ae/0x53c
	 secondary_startup_64+0xa4/0xb0

  -> #1 (pmus_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x4f/0x150
	 cpuhp_invoke_callback+0xb1/0x900
	 _cpu_up.constprop.26+0x9f/0x130
	 cpu_up+0x7b/0xc0
	 bringup_nonboot_cpus+0x4f/0x60
	 smp_init+0x26/0x71
	 kernel_init_freeable+0x110/0x258
	 kernel_init+0xa/0x103
	 ret_from_fork+0x1f/0x30

  -> #0 (cpu_hotplug_lock){++++}-{0:0}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 cpus_read_lock+0x39/0xb0
	 alloc_workqueue+0x378/0x450
	 __btrfs_alloc_workqueue+0x15d/0x200
	 btrfs_alloc_workqueue+0x51/0x160
	 scrub_workers_get+0x5a/0x170
	 btrfs_scrub_dev+0x18c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->scrub_lock);
				 lock(&fs_devs->device_list_mutex);
				 lock(&fs_info->scrub_lock);
    lock(cpu_hotplug_lock);

   *** DEADLOCK ***

  2 locks held by btrfs/229626:
   #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
   #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  stack backtrace:
  CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 torvalds#932
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? alloc_workqueue+0x378/0x450
   cpus_read_lock+0x39/0xb0
   ? alloc_workqueue+0x378/0x450
   alloc_workqueue+0x378/0x450
   ? rcu_read_lock_sched_held+0x52/0x80
   __btrfs_alloc_workqueue+0x15d/0x200
   btrfs_alloc_workqueue+0x51/0x160
   scrub_workers_get+0x5a/0x170
   btrfs_scrub_dev+0x18c/0x630
   ? start_transaction+0xd1/0x5d0
   btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
   btrfs_ioctl+0x2799/0x30a0
   ? do_sigaction+0x102/0x250
   ? lockdep_hardirqs_on_prepare+0xca/0x160
   ? _raw_spin_unlock_irq+0x24/0x30
   ? trace_hardirqs_on+0x1c/0xe0
   ? _raw_spin_unlock_irq+0x24/0x30
   ? do_sigaction+0x102/0x250
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.

Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.

Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info.  We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.

CC: [email protected] # 5.4+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
This patch addresses an irq free warning and null pointer dereference
error problem when nvme devices got timeout error during initialization.
This problem happens when nvme_timeout() function is called while
nvme_reset_work() is still in execution. This patch fixed the problem by
setting flag of the problematic request to NVME_REQ_CANCELLED before
calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
an error code and let nvme_submit_sync_cmd() fail gracefully.
The following is console output.

[   62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
[   62.488796] nvme nvme0: could not set timestamp (881)
[   62.494888] ------------[ cut here ]------------
[   62.495142] Trying to free already-free IRQ 11
[   62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
[   62.495742] Modules linked in:
[   62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ torvalds#8
[   62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[   62.496772] Workqueue: nvme-reset-wq nvme_reset_work
[   62.497019] RIP: 0010:free_irq+0x1f7/0x370
[   62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
[   62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
[   62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
[   62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
[   62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
[   62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
[   62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
[   62.500140] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[   62.500534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[   62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   62.501864] Call Trace:
[   62.501993]  pci_free_irq+0x13/0x20
[   62.502167]  nvme_reset_work+0x5d0/0x12a0
[   62.502369]  ? update_load_avg+0x59/0x580
[   62.502569]  ? ttwu_queue_wakelist+0xa8/0xc0
[   62.502780]  ? try_to_wake_up+0x1a2/0x450
[   62.502979]  process_one_work+0x1d2/0x390
[   62.503179]  worker_thread+0x45/0x3b0
[   62.503361]  ? process_one_work+0x390/0x390
[   62.503568]  kthread+0xf9/0x130
[   62.503726]  ? kthread_park+0x80/0x80
[   62.503911]  ret_from_fork+0x22/0x30
[   62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
[  123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
[  123.914670] nvme nvme0: 1/0/0 default/read/poll queues
[  123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  123.917469] #PF: supervisor write access in kernel mode
[  123.917725] #PF: error_code(0x0002) - not-present page
[  123.917976] PGD 0 P4D 0
[  123.918109] Oops: 0002 [#1] SMP PTI
[  123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G        W         5.8.0+ torvalds#8
[  123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[  123.919219] Workqueue: nvme-reset-wq nvme_reset_work
[  123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.922641] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.923032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  123.924353] Call Trace:
[  123.924479]  blk_mq_alloc_tag_set+0x137/0x2a0
[  123.924694]  nvme_reset_work+0xed6/0x12a0
[  123.924898]  process_one_work+0x1d2/0x390
[  123.925099]  worker_thread+0x45/0x3b0
[  123.925280]  ? process_one_work+0x390/0x390
[  123.925486]  kthread+0xf9/0x130
[  123.925642]  ? kthread_park+0x80/0x80
[  123.925825]  ret_from_fork+0x22/0x30
[  123.926004] Modules linked in:
[  123.926158] CR2: 0000000000000000
[  123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
[  123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.929715] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.930106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Co-developed-by: Keith Busch <[email protected]>
Signed-off-by: Tong Zhang <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
…s metrics" test

Linux 5.9 introduced perf test case "Parse and process metrics" and
on s390 this test case always dumps core:

  [root@t35lp67 perf]# ./perf test -vvvv -F 67
  67: Parse and process metrics                             :
  --- start ---
  metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC
  parsing metric: inst_retired.any / cpu_clk_unhalted.thread
  Segmentation fault (core dumped)
  [root@t35lp67 perf]#

I debugged this core dump and gdb shows this call chain:

  (gdb) where
   #0  0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6
   #1  0x000003ffabc293de in strcasestr () from /lib64/libc.so.6
   #2  0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any",
            n=<optimized out>)
       at util/metricgroup.c:368
   #3  find_metric (map=<optimized out>, map=<optimized out>,
           metric=0x1e6ea20 "inst_retired.any")
      at util/metricgroup.c:765
   #4  __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0,
           metric_no_group=<optimized out>, m=<optimized out>)
      at util/metricgroup.c:844
   #5  resolve_metric (ids=0x0, map=0x0, metric_list=0x0,
          metric_no_group=<optimized out>)
      at util/metricgroup.c:881
   torvalds#6  metricgroup__add_metric (metric=<optimized out>,
        metric_no_group=metric_no_group@entry=false, events=<optimized out>,
        events@entry=0x3ffd84fb878, metric_list=0x0,
        metric_list@entry=0x3ffd84fb868, map=0x0)
      at util/metricgroup.c:943
   torvalds#7  0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>,
        metric_list=0x3ffd84fb868, events=0x3ffd84fb878,
        metric_no_group=<optimized out>, list=<optimized out>)
      at util/metricgroup.c:988
   torvalds#8  parse_groups (perf_evlist=perf_evlist@entry=0x1e70260,
          str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>,
          metric_no_merge=<optimized out>,
          fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>,
          metric_events=0x3ffd84fba58, map=0x1)
      at util/metricgroup.c:1040
   torvalds#9  0x0000000001103eb2 in metricgroup__parse_groups_test(
  	evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>,
  	str=str@entry=0x12f34b2 "IPC",
  	metric_no_group=metric_no_group@entry=false,
  	metric_no_merge=metric_no_merge@entry=false,
  	metric_events=0x3ffd84fba58)
      at util/metricgroup.c:1082
   torvalds#10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0,
          ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC",
  	vals=0x3ffd84fbad8, name=0x12f34b2 "IPC")
      at tests/parse-metric.c:159
   torvalds#11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8,
  	name=0x12f34b2 "IPC")
      at tests/parse-metric.c:189
   torvalds#12 test_ipc () at tests/parse-metric.c:208
.....
..... omitted many more lines

This test case was added with
commit 218ca91 ("perf tests: Add parse metric test for frontend metric").

When I compile with make DEBUG=y it works fine and I do not get a core dump.

It turned out that the above listed function call chain worked on a struct
pmu_event array which requires a trailing element with zeroes which was
missing. The marco map_for_each_event() loops over that array tests for members
metric_expr/metric_name/metric_group being non-NULL. Adding this element fixes
the issue.

Output after:

  [root@t35lp46 perf]# ./perf test 67
  67: Parse and process metrics                             : Ok
  [root@t35lp46 perf]#

Committer notes:

As Ian remarks, this is not s390 specific:

<quote Ian>
  This also shows up with address sanitizer on all architectures
  (perhaps change the patch title) and perhaps add a "Fixes: <commit>"
  tag.

  =================================================================
  ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address
  0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp
  0x7ffd24327c58
  READ of size 8 at 0x55c93b4d59e8 thread T0
      #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2
      #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9
      #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9
      #3 0x55c93a1528db in metricgroup__add_metric
  tools/perf/util/metricgroup.c:943:9
      #4 0x55c93a151996 in metricgroup__add_metric_list
  tools/perf/util/metricgroup.c:988:9
      #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8
      torvalds#6 0x55c93a1513e1 in metricgroup__parse_groups_test
  tools/perf/util/metricgroup.c:1082:9
      torvalds#7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8
      torvalds#8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9
      torvalds#9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2
      torvalds#10 0x55c93a00f1e8 in test__parse_metric
  tools/perf/tests/parse-metric.c:345:2
      torvalds#11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9
      torvalds#12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9
      torvalds#13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4
      torvalds#14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9
      torvalds#15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11
      torvalds#16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8
      torvalds#17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2
      torvalds#18 0x55c939e45f7e in main tools/perf/perf.c:539:3

  0x55c93b4d59e8 is located 0 bytes to the right of global variable
  'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25'
  (0x55c93b4d54a0) of size 1352
  SUMMARY: AddressSanitizer: global-buffer-overflow
  tools/perf/util/metricgroup.c:764:2 in find_metric
  Shadow bytes around the buggy address:
    0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9
    0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
    0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:	   fa
    Freed heap region:	   fd
    Stack left redzone:	   f1
    Stack mid redzone:	   f2
    Stack right redzone:     f3
    Stack after return:	   f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:	   f6
    Poisoned by user:        f7
    Container overflow:	   fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    Shadow gap:              cc
</quote>

I'm also adding the missing "Fixes" tag and setting just .name to NULL,
as doing it that way is more compact (the compiler will zero out
everything else) and the table iterators look for .name being NULL as
the sentinel marking the end of the table.

Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric")
Signed-off-by: Thomas Richter <[email protected]>
Reviewed-by: Sumanth Korikkar <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
One can not simply remove vmalloc faulting on x86-32. Upstream

	commit: 7f0a002 ("x86/mm: remove vmalloc faulting")

removed it on x86 alltogether because previously the
arch_sync_kernel_mappings() interface was introduced. This interface
added synchronization of vmalloc/ioremap page-table updates to all
page-tables in the system at creation time and was thought to make
vmalloc faulting obsolete.

But that assumption was incredibly naive.

It turned out that there is a race window between the time the vmalloc
or ioremap code establishes a mapping and the time it synchronizes
this change to other page-tables in the system.

During this race window another CPU or thread can establish a vmalloc
mapping which uses the same intermediate page-table entries (e.g. PMD
or PUD) and does no synchronization in the end, because it found all
necessary mappings already present in the kernel reference page-table.

But when these intermediate page-table entries are not yet
synchronized, the other CPU or thread will continue with a vmalloc
address that is not yet mapped in the page-table it currently uses,
causing an unhandled page fault and oops like below:

	BUG: unable to handle page fault for address: fe80c000
	#PF: supervisor write access in kernel mode
	#PF: error_code(0x0002) - not-present page
	*pde = 33183067 *pte = a8648163
	Oops: 0002 [#1] SMP
	CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G
	...
	Call Trace:
	 ldt_dup_context+0x66/0x80
	 dup_mm+0x2b3/0x480
	 copy_process+0x133b/0x15c0
	 _do_fork+0x94/0x3e0
	 __ia32_sys_clone+0x67/0x80
	 __do_fast_syscall_32+0x3f/0x70
	 do_fast_syscall_32+0x29/0x60
	 do_SYSENTER_32+0x15/0x20
	 entry_SYSENTER_32+0x9f/0xf2
	EIP: 0xb7eef549

So the arch_sync_kernel_mappings() interface is racy, but removing it
would mean to re-introduce the vmalloc_sync_all() interface, which is
even more awful. Keep arch_sync_kernel_mappings() in place and catch
the race condition in the page-fault handler instead.

Do a partial revert of above commit to get vmalloc faulting on x86-32
back in place.

Fixes: 7f0a002 ("x86/mm: remove vmalloc faulting")
Reported-by: Naresh Kamboju <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
When an ISM device is removed, all its linkgroups are terminated,
i.e. all the corresponding connections are killed.
Connection killing invokes smc_close_active_abort(), which decreases
the sock refcount for certain states to simulate passive closing.
And it cancels the close worker and has to give up the sock lock for
this timeframe. This opens the door for a passive close worker or a
socket close to run in between. In this case smc_close_active_abort() and
passive close worker resp. smc_release() might do a sock_put for passive
closing. This causes:

[ 1323.315943] refcount_t: underflow; use-after-free.
[ 1323.316055] WARNING: CPU: 3 PID: 54469 at lib/refcount.c:28 refcount_warn_saturate+0xe8/0x130
[ 1323.316069] Kernel panic - not syncing: panic_on_warn set ...
[ 1323.316084] CPU: 3 PID: 54469 Comm: uperf Not tainted 5.9.0-20200826.rc2.git0.46328853ed20.300.fc32.s390x+debug #1
[ 1323.316096] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[ 1323.316108] Call Trace:
[ 1323.316125]  [<00000000c0d4aae8>] show_stack+0x90/0xf8
[ 1323.316143]  [<00000000c15989b0>] dump_stack+0xa8/0xe8
[ 1323.316158]  [<00000000c0d8344e>] panic+0x11e/0x288
[ 1323.316173]  [<00000000c0d83144>] __warn+0xac/0x158
[ 1323.316187]  [<00000000c1597a7a>] report_bug+0xb2/0x130
[ 1323.316201]  [<00000000c0d36424>] monitor_event_exception+0x44/0xc0
[ 1323.316219]  [<00000000c195c716>] pgm_check_handler+0x1da/0x238
[ 1323.316234]  [<00000000c151844c>] refcount_warn_saturate+0xec/0x130
[ 1323.316280] ([<00000000c1518448>] refcount_warn_saturate+0xe8/0x130)
[ 1323.316310]  [<000003ff801f2e2a>] smc_release+0x192/0x1c8 [smc]
[ 1323.316323]  [<00000000c169f1fa>] __sock_release+0x5a/0xe0
[ 1323.316334]  [<00000000c169f2ac>] sock_close+0x2c/0x40
[ 1323.316350]  [<00000000c1086de0>] __fput+0xb8/0x278
[ 1323.316362]  [<00000000c0db1e0e>] task_work_run+0x76/0xb8
[ 1323.316393]  [<00000000c0d8ab84>] do_exit+0x26c/0x520
[ 1323.316408]  [<00000000c0d8af08>] do_group_exit+0x48/0xc0
[ 1323.316421]  [<00000000c0d8afa8>] __s390x_sys_exit_group+0x28/0x38
[ 1323.316433]  [<00000000c195c32c>] system_call+0xe0/0x2b4
[ 1323.316446] 1 lock held by uperf/54469:
[ 1323.316456]  #0: 0000000044125e60 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x44/0xe0

The patch rechecks sock state in smc_close_active_abort() after
smc_close_cancel_work() to avoid duplicate decrease of sock
refcount for the same purpose.

Fixes: 611b63a ("net/smc: cancel tx worker in case of socket aborts")
Reviewed-by: Karsten Graul <[email protected]>
Signed-off-by: Ursula Braun <[email protected]>
Signed-off-by: Karsten Graul <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
Commit:

  cc9aec0 ("x86/numa_emulation: Introduce uniform split capability")

uses "-1" as the starting node ID, which causes the strange kernel log as
follows, when "numa=fake=32G" is added to the kernel command line:

    Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB)
    Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB)
    Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB)
    Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB)
    Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB)

And finally the kernel crashes:

    BUG: Bad page state in process swapper  pfn:00011
    page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11
    failed to read mapping contents, not a valid kernel address?
    flags: 0x5(locked|uptodate)
    raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006
    raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510
    page dumped because: page still charged to cgroup
    page->mem_cgroup:000055cd7e44b510
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
    Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
    Call Trace:
     dump_stack+0x57/0x80
     bad_page.cold+0x63/0x94
     __free_pages_ok+0x33f/0x360
     memblock_free_all+0x127/0x195
     mem_init+0x23/0x1f5
     start_kernel+0x219/0x4f5
     secondary_startup_64+0xb6/0xc0

Fix this bug via using 0 as the starting node ID.  This restores the
original behavior before cc9aec0.

[ mingo: Massaged the changelog. ]

Fixes: cc9aec0 ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
The dev_iommu_priv_set() must be called after probe_device(). This fixes
a NULL pointer deference bug when booting a system with kernel cmdline
"intel_iommu=on,igfx_off", where the dev_iommu_priv_set() is abused.

The following stacktrace was produced:

 Command line: BOOT_IMAGE=/isolinux/bzImage console=tty1 intel_iommu=on,igfx_off
 ...
 DMAR: Host address width 39
 DMAR: DRHD base: 0x000000fed90000 flags: 0x0
 DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 19e2ff0505e
 DMAR: DRHD base: 0x000000fed91000 flags: 0x1
 DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
 DMAR: RMRR base: 0x0000009aa9f000 end: 0x0000009aabefff
 DMAR: RMRR base: 0x0000009d000000 end: 0x0000009f7fffff
 DMAR: No ATSR found
 BUG: kernel NULL pointer dereference, address: 0000000000000038
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: 0002 [#1] SMP PTI
 CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.9.0-devel+ #2
 Hardware name: LENOVO 20HGS0TW00/20HGS0TW00, BIOS N1WET46S (1.25s ) 03/30/2018
 RIP: 0010:intel_iommu_init+0xed0/0x1136
 Code: fe e9 61 02 00 00 bb f4 ff ff ff e9 57 02 00 00 48 63 d1 48 c1 e2 04 48
       03 50 20 48 8b 12 48 85 d2 74 0b 48 8b 92 d0 02 00 00 48 89 7a 38 ff c1
       e9 15 f5 ff ff 48 c7 c7 60 99 ac a7 49 c7 c7 a0
 RSP: 0000:ffff96d180073dd0 EFLAGS: 00010282
 RAX: ffff8c91037a7d20 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffff
 RBP: ffff96d180073e90 R08: 0000000000000001 R09: ffff8c91039fe3c0
 R10: 0000000000000226 R11: 0000000000000226 R12: 000000000000000b
 R13: ffff8c910367c650 R14: ffffffffa8426d60 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff8c9107480000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000038 CR3: 00000004b100a001 CR4: 00000000003706e0
 Call Trace:
  ? _raw_spin_unlock_irqrestore+0x1f/0x30
  ? call_rcu+0x10e/0x320
  ? trace_hardirqs_on+0x2c/0xd0
  ? rdinit_setup+0x2c/0x2c
  ? e820__memblock_setup+0x8b/0x8b
  pci_iommu_init+0x16/0x3f
  do_one_initcall+0x46/0x1e4
  kernel_init_freeable+0x169/0x1b2
  ? rest_init+0x9f/0x9f
  kernel_init+0xa/0x101
  ret_from_fork+0x22/0x30
 Modules linked in:
 CR2: 0000000000000038
 ---[ end trace 3653722a6f936f18 ]---

Fixes: 01b9d4e ("iommu/vt-d: Use dev_iommu_priv_get/set()")
Reported-by: Torsten Hilbrich <[email protected]>
Reported-by: Wendy Wang <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
Tested-by: Torsten Hilbrich <[email protected]>
Link: https://lore.kernel.org/linux-iommu/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 11, 2020
__apply_to_page_range() is also used to change and/or allocate
page-table pages in the vmalloc area of the address space.  Make sure
these changes get synchronized to other page-tables in the system by
calling arch_sync_kernel_mappings() when necessary.

The impact appears limited to x86-32, where apply_to_page_range may miss
updating the PMD.  That leads to explosions in drivers like

  BUG: unable to handle page fault for address: fe036000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pde = 00000000
  Oops: 0002 [#1] SMP
  CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ torvalds#16
  Hardware name:  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
  EIP: __execlists_context_alloc+0x132/0x2d0 [i915]
  Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81
  EAX: 5a5a5a5a EBX: f60ca00 ECX: fe036000 EDX: 00000001
  ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
  CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0
  DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  DR6: fffe0ff0 DR7: 00000400
  Call Trace:
    execlists_context_alloc+0x10/0x20 [i915]
    intel_context_alloc_state+0x3f/0x70 [i915]
    __intel_context_do_pin+0x117/0x170 [i915]
    i915_gem_do_execbuffer+0xcc7/0x2500 [i915]
    i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915]
    drm_ioctl_kernel+0x8f/0xd0
    drm_ioctl+0x223/0x3d0
    __ia32_sys_ioctl+0x1ab/0x760
    __do_fast_syscall_32+0x3f/0x70
    do_fast_syscall_32+0x29/0x60
    do_SYSENTER_32+0x15/0x20
    entry_SYSENTER_32+0x9f/0xf2
  EIP: 0xb7f28559
  Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
  EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c
  ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8
  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
  Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan
  CR2: 00000000fe036000

It looks like kasan, xen and i915 are vulnerable.

Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking
after 30-or-so minutes, and machine is unusable"

[[email protected]: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h]
  Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: changelog addition]
[[email protected]: changelog addition]

Fixes: 2ba3e69 ("mm/vmalloc: track which page-table levels were modified")
Fixes: 86cf69f ("x86/mm/32: implement arch_sync_kernel_mappings()")
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Chris Wilson <[email protected]>	[x86-32]
Tested-by: Pavel Machek <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: <[email protected]>	[5.8+]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
Move another allocation out of regulator_list_mutex-protected region, as
reclaim might want to take the same lock.

WARNING: possible circular locking dependency detected
5.7.13+ torvalds#534 Not tainted
------------------------------------------------------
kswapd0/383 is trying to acquire lock:
c0e5d920 (regulator_list_mutex){+.+.}-{3:3}, at: regulator_lock_dependent+0x54/0x2c0

but task is already holding lock:
c0e38518 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x50

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire.part.11+0x40/0x50
       fs_reclaim_acquire+0x24/0x28
       kmem_cache_alloc_trace+0x40/0x1e8
       regulator_register+0x384/0x1630
       devm_regulator_register+0x50/0x84
       reg_fixed_voltage_probe+0x248/0x35c
[...]
other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(regulator_list_mutex);
                               lock(fs_reclaim);
  lock(regulator_list_mutex);

 *** DEADLOCK ***
[...]
2 locks held by kswapd0/383:
 #0: c0e38518 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x50
 #1: cb70e5e0 (hctx->srcu){....}-{0:0}, at: hctx_lock+0x60/0xb8
[...]

Fixes: 541d052 ("regulator: core: Only support passing enable GPIO descriptors")
[this commit only changes context]
Fixes: f8702f9 ("regulator: core: Use ww_mutex for regulators locking")
[this is when the regulator_list_mutex was introduced in reclaim locking path]

Signed-off-by: Michał Mirosław <[email protected]>
Link: https://lore.kernel.org/r/41fe6a9670335721b48e8f5195038c3d67a3bf92.1597195321.git.mirq-linux@rere.qmqm.pl
Signed-off-by: Mark Brown <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
On Linux 5.9-rc1 I get the following warning with apq8016-sbc:

WARNING: CPU: 2 PID: 69 at sound/core/init.c:207 snd_card_new+0x36c/0x3b0 [snd]
CPU: 2 PID: 69 Comm: kworker/2:1 Not tainted 5.9.0-rc1 #1
Workqueue: events deferred_probe_work_func
pc : snd_card_new+0x36c/0x3b0 [snd]
lr : snd_card_new+0xf4/0x3b0 [snd]
Call trace:
 snd_card_new+0x36c/0x3b0 [snd]
 snd_soc_bind_card+0x340/0x9a0 [snd_soc_core]
 snd_soc_register_card+0xf4/0x110 [snd_soc_core]
 devm_snd_soc_register_card+0x44/0xa0 [snd_soc_core]
 apq8016_sbc_platform_probe+0x11c/0x140 [snd_soc_apq8016_sbc]

This warning was introduced in
commit 81033c6 ("ALSA: core: Warn on empty module").
It looks like we are supposed to set card->owner to THIS_MODULE.

Fix this for all the qcom ASoC drivers.

Cc: Srinivas Kandagatla <[email protected]>
Fixes: 79119c7 ("ASoC: qcom: Add Storm machine driver")
Fixes: bdb052e ("ASoC: qcom: add apq8016 sound card support")
Fixes: a6f933f ("ASoC: qcom: apq8096: Add db820c machine driver")
Fixes: 6b1687b ("ASoC: qcom: add sdm845 sound card support")
Signed-off-by: Stephan Gerhold <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
To avoid the following kernel panic when calling kmem_cache_create() with
a NULL pointer from pool_cache(), Block the rxe_param_set_add() from
running if the rdma_rxe module is not initialized.

 BUG: unable to handle kernel NULL pointer dereference at 000000000000000b
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP NOPTI
 CPU: 4 PID: 8512 Comm: modprobe Kdump: loaded Not tainted 4.18.0-231.el8.x86_64 #1
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
 RIP: 0010:kmem_cache_alloc+0xd1/0x1b0
 Code: 8b 57 18 45 8b 77 1c 48 8b 5c 24 30 0f 1f 44 00 00 5b 48 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 81 e3 00 00 10 00 75 0e 4d 89 fe <41> f6 47 0b 04 0f 84 6c ff ff ff 4c 89 ff e8 cc da 01 00 49 89 c6
 RSP: 0018:ffffa2b8c773f9d0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000005
 RDX: 0000000000000004 RSI: 00000000006080c0 RDI: 0000000000000000
 RBP: ffff8ea0a8634fd0 R08: ffffa2b8c773f988 R09: 00000000006000c0
 R10: 0000000000000000 R11: 0000000000000230 R12: 00000000006080c0
 R13: ffffffffc0a97fc8 R14: 0000000000000000 R15: 0000000000000000
 FS:  00007f9138ed9740(0000) GS:ffff8ea4ae800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000000b CR3: 000000046d59a000 CR4: 00000000003406e0
 Call Trace:
  rxe_alloc+0xc8/0x160 [rdma_rxe]
  rxe_get_dma_mr+0x25/0xb0 [rdma_rxe]
  __ib_alloc_pd+0xcb/0x160 [ib_core]
  ib_mad_init_device+0x296/0x8b0 [ib_core]
  add_client_context+0x11a/0x160 [ib_core]
  enable_device_and_get+0xdc/0x1d0 [ib_core]
  ib_register_device+0x572/0x6b0 [ib_core]
  ? crypto_create_tfm+0x32/0xe0
  ? crypto_create_tfm+0x7a/0xe0
  ? crypto_alloc_tfm+0x58/0xf0
  rxe_register_device+0x19d/0x1c0 [rdma_rxe]
  rxe_net_add+0x3d/0x70 [rdma_rxe]
  ? dev_get_by_name_rcu+0x73/0x90
  rxe_param_set_add+0xaf/0xc0 [rdma_rxe]
  parse_args+0x179/0x370
  ? ref_module+0x1b0/0x1b0
  load_module+0x135e/0x17e0
  ? ref_module+0x1b0/0x1b0
  ? __do_sys_init_module+0x13b/0x180
  __do_sys_init_module+0x13b/0x180
  do_syscall_64+0x5b/0x1a0
  entry_SYSCALL_64_after_hwframe+0x65/0xca
 RIP: 0033:0x7f9137ed296e

This can be triggered if a user tries to use the 'module option' which is
not actually a real module option but some idiotic (and thankfully no
obsolete) sysfs interface.

Fixes: 8700e3e ("Soft RoCE driver")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kamal Heib <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc2+ #1 Tainted: G        W
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc_trace+0x3a/0x1a0
	 btrfs_alloc_device+0x43/0x210
	 add_missing_dev+0x20/0x90
	 read_one_chunk+0x301/0x430
	 btrfs_read_sys_array+0x17b/0x1b0
	 open_ctree+0xa62/0x1896
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x379
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_lookup_inode+0x2a/0x8f
	 __btrfs_update_delayed_inode+0x80/0x240
	 btrfs_commit_inode_delayed_inode+0x119/0x120
	 btrfs_evict_inode+0x357/0x500
	 evict+0xcf/0x1f0
	 vfs_rmdir.part.0+0x149/0x160
	 do_rmdir+0x136/0x1a0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1184/0x1fa0
	 lock_acquire+0xa4/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(&fs_info->chunk_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 1 PID: 100 Comm: kswapd0 Tainted: G        W         5.9.0-rc2+ #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x92/0xc8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x1184/0x1fa0
   lock_acquire+0xa4/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa4/0x3d0
   ? btrfs_evict_inode+0x11e/0x500
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x46/0x60
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation.  We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters.  So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.

Reported-by: Nikolay Borisov <[email protected]>
CC: [email protected] # 4.4+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
When trying to get a new fs root for a snapshot during the transaction
at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root()
fails we leave "pending->snap" pointing to an error pointer, and then
later at ioctl.c:create_snapshot() we dereference that pointer, resulting
in a crash:

  [12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4
  [12264.615650] #PF: supervisor write access in kernel mode
  [12264.616487] #PF: error_code(0x0002) - not-present page
  [12264.617436] PGD 0 P4D 0
  [12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
  [12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs]
  [12264.622528] Code: bc ef ff ff (...)
  [12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282
  [12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000
  [12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4
  [12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000
  [12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
  [12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8
  [12264.627582] FS:  00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000
  [12264.628176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0
  [12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [12264.630594] Call Trace:
  [12264.631227]  btrfs_mksnapshot+0x7b/0xb0 [btrfs]
  [12264.631840]  __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
  [12264.632458]  btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
  [12264.633078]  btrfs_ioctl+0x1864/0x3130 [btrfs]
  [12264.633689]  ? do_sys_openat2+0x1a7/0x2d0
  [12264.634295]  ? kmem_cache_free+0x147/0x3a0
  [12264.634899]  ? __x64_sys_ioctl+0x83/0xb0
  [12264.635488]  __x64_sys_ioctl+0x83/0xb0
  [12264.636058]  do_syscall_64+0x33/0x80
  [12264.636616]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

  (gdb) list *(btrfs_mksubvol+0x438)
  0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858).
  853		ret = 0;
  854		pending_snapshot->anon_dev = 0;
  855	fail:
  856		/* Prevent double freeing of anon_dev */
  857		if (ret && pending_snapshot->snap)
  858			pending_snapshot->snap->anon_dev = 0;
  859		btrfs_put_root(pending_snapshot->snap);
  860		btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
  861	free_pending:
  862		if (pending_snapshot->anon_dev)

So fix this by setting "pending->snap" to NULL if we get an error from the
call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot().

Fixes: 2dfb1e4 ("btrfs: preallocate anon block device at first phase of snapshot creation")
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for Linux 5.9, take #1

- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
  (dirty logging, for example)
- Fix tracing output of 64bit values
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
When compiling with DEBUG=1 on Fedora 32 I'm getting crash for 'perf
test signal':

  Program received signal SIGSEGV, Segmentation fault.
  0x0000000000c68548 in __test_function ()
  (gdb) bt
  #0  0x0000000000c68548 in __test_function ()
  #1  0x00000000004d62e9 in test_function () at tests/bp_signal.c:61
  #2  0x00000000004d689a in test__bp_signal (test=0xa8e280 <generic_ ...
  #3  0x00000000004b7d49 in run_test (test=0xa8e280 <generic_tests+1 ...
  #4  0x00000000004b7e7f in test_and_print (t=0xa8e280 <generic_test ...
  #5  0x00000000004b8927 in __cmd_test (argc=1, argv=0x7fffffffdce0, ...
  ...

It's caused by the symbol __test_function being in the ".bss" section:

  $ readelf -a ./perf | less
    [Nr] Name              Type             Address           Offset
         Size              EntSize          Flags  Link  Info  Align
    ...
    [28] .bss              NOBITS           0000000000c356a0  008346a0
         00000000000511f8  0000000000000000  WA       0     0     32

  $ nm perf | grep __test_function
  0000000000c68548 B __test_function

I guess most of the time we're just lucky the inline asm ended up in the
".text" section, so making it specific explicit with push and pop
section clauses.

  $ readelf -a ./perf | less
    [Nr] Name              Type             Address           Offset
         Size              EntSize          Flags  Link  Info  Align
    ...
    [13] .text             PROGBITS         0000000000431240  00031240
         0000000000306faa  0000000000000000  AX       0     0     16

  $ nm perf | grep __test_function
  00000000004d62c8 T __test_function

Committer testing:

  $ readelf -wi ~/bin/perf | grep producer -m1
    <c>   DW_AT_producer    : (indirect string, offset: 0x254a): GNU C99 10.2.1 20200723 (Red Hat 10.2.1-1) -mtune=generic -march=x86-64 -ggdb3 -std=gnu99 -fno-omit-frame-pointer -funwind-tables -fstack-protector-all
                                                                                                                                         ^^^^^
                                                                                                                                         ^^^^^
                                                                                                                                         ^^^^^
  $

Before:

  $ perf test signal
  20: Breakpoint overflow signal handler                    : FAILED!
  $

After:

  $ perf test signal
  20: Breakpoint overflow signal handler                    : Ok
  $

Fixes: 8fd34e1 ("perf test: Improve bp_signal")
Signed-off-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wang Nan <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
Commit

  b972fdb ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")

didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:

  d = &hw->dimms[hw->num_dimms];

which is, of course, NULL:

  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ torvalds#7
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:enumerate_dimms.cold+0x7b/0x375

Reset the whole ghes_hw on driver unregister so that no stale values are
used on a second system scan.

Fixes: b972fdb ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The aliases were never released causing the following leaks:

  Indirect leak of 1224 byte(s) in 9 object(s) allocated from:
    #0 0x7feefb830628 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
    #1 0x56332c8f1b62 in __perf_pmu__new_alias util/pmu.c:322
    #2 0x56332c8f401f in pmu_add_cpu_aliases_map util/pmu.c:778
    #3 0x56332c792ce9 in __test__pmu_event_aliases tests/pmu-events.c:295
    #4 0x56332c792ce9 in test_aliases tests/pmu-events.c:367
    #5 0x56332c76a09b in run_test tests/builtin-test.c:410
    torvalds#6 0x56332c76a09b in test_and_print tests/builtin-test.c:440
    torvalds#7 0x56332c76ce69 in __cmd_test tests/builtin-test.c:695
    torvalds#8 0x56332c76ce69 in cmd_test tests/builtin-test.c:807
    torvalds#9 0x56332c7d2214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    torvalds#10 0x56332c6701a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    torvalds#11 0x56332c6701a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    torvalds#12 0x56332c6701a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    torvalds#13 0x7feefb359cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 956a783 ("perf test: Test pmu-events aliases")
Signed-off-by: Namhyung Kim <[email protected]>
Reviewed-by: John Garry <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The evsel->unit borrows a pointer of pmu event or alias instead of
owns a string.  But tool event (duration_time) passes a result of
strdup() caused a leak.

It was found by ASAN during metric test:

  Direct leak of 210 byte(s) in 70 object(s) allocated from:
    #0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414
    #2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414
    #3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
    #4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
    #5 0x559fbbcc95da in __parse_events util/parse-events.c:2141
    torvalds#6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406
    torvalds#7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393
    torvalds#8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415
    torvalds#9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498
    torvalds#10 0x559fbbc0109b in run_test tests/builtin-test.c:410
    torvalds#11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
    torvalds#12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
    torvalds#13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
    torvalds#14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    torvalds#15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    torvalds#16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    torvalds#17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    torvalds#18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The test_generic_metric() missed to release entries in the pctx.  Asan
reported following leak (and more):

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
    #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
    #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
    #4 0x55f7e7341667 in expr__add_ref util/expr.c:120
    #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
    torvalds#6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
    torvalds#7 0x55f7e712390b in compute_single tests/parse-metric.c:128
    torvalds#8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
    torvalds#9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
    torvalds#10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
    torvalds#11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
    torvalds#12 0x55f7e70be09b in run_test tests/builtin-test.c:410
    torvalds#13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    torvalds#14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    torvalds#15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    torvalds#16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    torvalds#17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    torvalds#18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    torvalds#19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    torvalds#20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The metricgroup__add_metric() can find multiple match for a metric group
and it's possible to fail.  Also it can fail in the middle like in
resolve_metric() even for single metric.

In those cases, the intermediate list and ids will be leaked like:

  Direct leak of 3 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683
    #2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906
    #3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940
    #4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993
    #5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045
    torvalds#6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087
    torvalds#7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164
    torvalds#8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196
    torvalds#9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318
    torvalds#10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356
    torvalds#11 0x55f7e70be09b in run_test tests/builtin-test.c:410
    torvalds#12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    torvalds#13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    torvalds#14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    torvalds#15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    torvalds#16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    torvalds#17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    torvalds#18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    torvalds#19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The following leaks were detected by ASAN:

  Indirect leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
    #2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
    #3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
    #4 0x560578e07045 in test__pmu tests/pmu.c:155
    #5 0x560578de109b in run_test tests/builtin-test.c:410
    torvalds#6 0x560578de109b in test_and_print tests/builtin-test.c:440
    torvalds#7 0x560578de401a in __cmd_test tests/builtin-test.c:661
    torvalds#8 0x560578de401a in cmd_test tests/builtin-test.c:807
    torvalds#9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    torvalds#10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    torvalds#11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    torvalds#12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    torvalds#13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
The crash log as the below:

[Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI
[Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G           OE     5.4.0-42-generic torvalds#46~18.04.1-Ubuntu
[Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
[Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
[Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
[Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213
[Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038
[Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0
[Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004
[Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28
[Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000
[Thu Aug 20 23:18:14 2020] FS:  0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000
[Thu Aug 20 23:18:14 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0
[Thu Aug 20 23:18:14 2020] Call Trace:
[Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
[Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
[Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410

When GPU hang, user process will fail to create a compute queue whose
struct object will be freed later, but driver wrongly add this queue to
queue list of the proccess. And then kfd_process_evict_queues will
access a freed memory, which cause a system crash.

v2:
The failure to execute_queues should probably not be reported to
the caller of create_queue, because the queue was already created.
Therefore change to ignore the return value from execute_queues.

Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
Running the eBPF test_verifier leads to random errors looking like this:

[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: G        W         5.9.0-rc1+ torvalds#47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #1 Jun  6 2020
[ 6525.804812] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : ffff8000130cbb80
[ 6525.824141] x29: ffff8000130cbbb0 x28: 0000000000000000
[ 6525.829451] x27: 000005ef6fcbf39b x26: 0000000000000000
[ 6525.834759] x25: ffff8000130cbb80 x24: ffff800011dc7038
[ 6525.840067] x23: ffff8000130cbd00 x22: ffff0008f624d080
[ 6525.845375] x21: 0000000000000001 x20: ffff800011dc7000
[ 6525.850682] x19: 0000000000000000 x18: 0000000000000000
[ 6525.855990] x17: 0000000000000000 x16: 0000000000000000
[ 6525.861298] x15: 0000000000000000 x14: 0000000000000000
[ 6525.866606] x13: 0000000000000000 x12: 0000000000000000
[ 6525.871913] x11: 0000000000000001 x10: ffff8000000a660c
[ 6525.877220] x9 : ffff800010951810 x8 : ffff8000130cbc38
[ 6525.882528] x7 : 0000000000000000 x6 : 0000009864cfa881
[ 6525.887836] x5 : 00ffffffffffffff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0000000000000018 x2 : ffff8000000a4374
[ 6525.898452] x1 : 000000000000000a x0 : 0000000000000009
[ 6525.903760] Call trace:
[ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398]  bpf_test_run+0x70/0x1b0
[ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
[ 6525.932072]  __arm64_sys_bpf+0x24/0x30
[ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607]  do_el0_svc+0x28/0x88
[ 6525.943920]  el0_sync_handler+0x88/0x190
[ 6525.947838]  el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---

The reason is the offset[] creation and later usage, while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of prologue which
is the start of the 1st insn while, offset[n] holds the
offset of the end of n-th insn.

When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.

commit 7c2e988 ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an indentical bug on x86 when eBPF bounded loops were introduced.

So let's fix it by creating the ctx->offset[] differently. Track the
beginning of instruction and account for the extra instruction while
calculating the arm instruction offsets.

Fixes: 2589726 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju <[email protected]>
Reported-by: Jiri Olsa <[email protected]>
Co-developed-by: Jean-Philippe Brucker <[email protected]>
Co-developed-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Ilias Apalodimas <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
jacobpan pushed a commit that referenced this pull request Sep 24, 2020
Steal time initialization requires mapping a memory region which
invokes a memory allocation. Doing this at CPU starting time results
in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled:

BUG: sleeping function called from invalid context at mm/slab.h:498
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1
Call trace:
 dump_backtrace+0x0/0x208
 show_stack+0x1c/0x28
 dump_stack+0xc4/0x11c
 ___might_sleep+0xf8/0x130
 __might_sleep+0x58/0x90
 slab_pre_alloc_hook.constprop.101+0xd0/0x118
 kmem_cache_alloc_node_trace+0x84/0x270
 __get_vm_area_node+0x88/0x210
 get_vm_area_caller+0x38/0x40
 __ioremap_caller+0x70/0xf8
 ioremap_cache+0x78/0xb0
 memremap+0x9c/0x1a8
 init_stolen_time_cpu+0x54/0xf0
 cpuhp_invoke_callback+0xa8/0x720
 notify_cpu_starting+0xc8/0xd8
 secondary_start_kernel+0x114/0x180
CPU1: Booted secondary processor 0x0000000001 [0x431f0a11]

However we don't need to initialize steal time at CPU starting time.
We can simply wait until CPU online time, just sacrificing a bit of
accuracy by returning zero for steal time until we know better.

While at it, add __init to the functions that are only called by
pv_time_init() which is __init.

Signed-off-by: Andrew Jones <[email protected]>
Fixes: e0685fa ("arm64: Retrieve stolen time as paravirtualized guest")
Cc: [email protected]
Reviewed-by: Steven Price <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
jacobpan added a commit that referenced this pull request Nov 10, 2025
[   28.720626] NOIOMMU: no_iommu: unmapping IOVA 0x1000000 size 0x8000
[   28.720878] NOIOMMU: no_iommu: unmapping IOVA 0x1008000 size 0x8000
[   28.721367] Unable to handle kernel paging request at virtual address fffffc1fbff00008
[   28.721419] Mem abort info:
[   28.721433]   ESR = 0x0000000096000007
[   28.721484]   EC = 0x25: DABT (current EL), IL = 32 bits
[   28.721540]   SET = 0, FnV = 0
[   28.721575]
** replaying previous printk message **
[   28.721575]   EA = 0, S1PTW = 0
[   28.721602]   FSC = 0x07: level 3 translation fault
[   28.721629] Data abort info:
[   28.721642]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[   28.721671]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   28.721698]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   28.721743] swapper pgtable: 64k pages, 52-bit VAs, pgdp=000000012dea0000
[   28.721761] [fffffc1fbff00008] pgd=0000000130050403, p4d=0000000130050403, pud=0000000130050403, pmd=1000000100f20403, pte=0000000000000000
[   28.722288] Internal error: Oops: 0000000096000007 [#1]  SMP
[   28.724579] Modules linked in:
[   28.725116] CPU: 1 UID: 0 PID: 152 Comm: vfio-noiommu-pc Tainted: G        W           6.17.0-rc2-00036-gb13c949bd2d3-dirty torvalds#313 PREEMPT
[   28.725743] Tainted: [W]=WARN
[   28.725860] Hardware name: QEMU QEMU Virtual Machine, BIOS 2024.02-2 03/11/2024
[   28.726251] pstate: 60402009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   28.726605] pc : unpin_user_page_range_dirty_lock+0x40/0x130
[   28.727284] lr : batch_unpin+0xbc/0xf0
[   28.727447] sp : ffff800085baf830
[   28.727604] x29: ffff800085baf830 x28: 0000000000000000 x27: 0000000000000000
[   28.727941] x26: 0000000000000001 x25: 0000000000000000 x24: ffff800081958440
[   28.728259] x23: 0000000000000000 x22: fffffc1fbff00000 x21: 0000000000000001
[   28.728532] x20: 0000000000000001 x19: 0000000000000001 x18: 00000000ffffffff
[   28.728770] x17: 00000000c9f3662f x16: 00000000d722bb96 x15: ffff800085baf390
[   28.729023] x14: ffff800083e943cf x13: ffff80008393f9a8 x12: 0000000000000885
[   28.729339] x11: 00000000000002d7 x10: ffff80008399bdd0 x9 : ffff80008393f9a8
[   28.729693] x8 : 00000001000012d7 x7 : ffff8000839979a8 x6 : 00000000000022d7
[   28.729959] x5 : ffff800085baf9b0 x4 : ffff800085baf9a8 x3 : 0000000000000000
[   28.730301] x2 : fffffc1fbff00000 x1 : 0000000000000001 x0 : fffffc1fbff00000
[   28.730643] Call trace:
[   28.730873]  unpin_user_page_range_dirty_lock+0x40/0x130 (P)
[   28.731215]  batch_unpin+0xbc/0xf0
[   28.731388]  __iopt_area_unfill_domain+0x230/0x330
[   28.731591]  iopt_area_unfill_domain+0x14/0x20
[   28.731776]  iopt_table_remove_domain+0x1f0/0x264
[   28.731983]  iommufd_hwpt_paging_destroy+0x74/0x130
[   28.732176]  iommufd_object_remove+0x1b8/0x21c
[   28.732353]  iommufd_device_detach+0x44/0xe4
[   28.732503]  vfio_iommufd_physical_detach_ioas+0x38/0x54
[   28.732693]  vfio_df_ioctl_detach_pt+0x158/0x1b0
[   28.732874]  vfio_device_fops_unl_ioctl+0x188/0xa68
[   28.733086]  __arm64_sys_ioctl+0xb4c/0xbdc
[   28.733260]  invoke_syscall.constprop.0+0x74/0xcc
[   28.733472]  do_el0_svc+0x4c/0xbc
[   28.733633]  el0_svc+0xdc/0x12c
[   28.733779]  el0t_64_sync_handler+0x40/0xe4
[   28.733965]  el0t_64_sync+0x198/0x19c
[   28.734324] Code: a9046bf9 d280001 d503201f 8b191ac2 (f9400440)
[   28.734748] ---[ end trace 0000000000000000 ]---

Signed-off-by: Jacob Pan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant