Discussion:
RCU bug with v3.17-rc3 ?
Felipe Balbi
2014-09-04 18:40:21 UTC
Hi,

I keep triggering the following Oops with -rc3 when writing to the mass
storage gadget driver:

| # modprobe g_mass_storage stall=0 removable=1 file=/dev/sda
| [ 44.883554] Number of LUNs=8
| [ 44.886709] Mass Storage Function, version: 2009/09/11
| [ 44.892303] LUN: removable file: (no medium)
| [ 44.896916] Number of LUNs=1
| [ 44.901198] LUN: removable file: /dev/sda
| [ 44.905410] Number of LUNs=1
| [ 44.917706] g_mass_storage gadget: Mass Storage Gadget, version: 2009/09/11
| [ 44.925018] g_mass_storage gadget: userspace failed to provide iSerialNumber
| [ 44.932489] g_mass_storage gadget: g_mass_storage ready
| [ 52.583773] g_mass_storage gadget: high-speed config #1: Linux File-Backed Storage
| # [ 98.270585] Unable to handle kernel paging request at virtual address ffffffff
| [ 98.278198] pgd = c0004000
| [ 98.281027] [ffffffff] *pgd=ae7f6821, *pte=00000000, *ppte=00000000
| [ 98.287648] Internal error: Oops: 17 [#1] SMP ARM
| [ 98.292559] Modules linked in: g_mass_storage usb_f_mass_storage libcomposite configfs usb_storage xhci_hcd dwc3 udc_core matrix_keypad lis3lv02d_i2c dwc3_omap lis3lv02d input_polldev
| [ 98.309721] CPU: 0 PID: 1820 Comm: file-storage Not tainted 3.17.0-rc3-00013-gc6b1a7d #806
| [ 98.318346] task: ec356040 ti: ec378000 task.ti: ec378000
| [ 98.324000] PC is at find_get_entry+0x7c/0x128
| [ 98.328640] LR is at 0xfffffffa
| [ 98.331912] pc : [<c011394c>] lr : [<fffffffa>] psr: a0000013
| [ 98.331912] sp : ec379b50 ip : 00000000 fp : ec379b84
| [ 98.343888] r10: c0c81243 r9 : 00000001 r8 : ea123d28
| [ 98.349352] r7 : ec378010 r6 : 00000001 r5 : 00000000 r4 : 0000000f
| [ 98.356181] r3 : ec379b3c r2 : 00000000 r1 : 00000001 r0 : ffffffff
| [ 98.363006] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
| [ 98.370646] Control: 10c5387d Table: ac2b0059 DAC: 00000015
| [ 98.376641] Process file-storage (pid: 1820, stack limit = 0xec378248)
| [ 98.383454] Stack: (0xec379b50 to 0xec37a000)
| [ 98.388003] 9b40: 00000000 00000000 c01138d0 c002aa3c
| [ 98.396560] 9b60: 0000000f 00000000 ea123d24 000200d0 00000001 000000d0 ec379bbc ec379b88
| [ 98.405100] 9b80: c0114360 c01138dc c1486a00 60000013 ec379bc4 00001400 00000000 ea123d24
| [ 98.413635] 9ba0: 00000c00 00000400 ec378010 c06dea0c ec379bdc ec379bc0 c011478c c0114330
| [ 98.422183] 9bc0: 000000d0 c00904f8 c1486a00 00001400 ec379c04 ec379be0 c019cd68 c0114760
| [ 98.430732] 9be0: c0090808 c0090590 ec379c34 00000001 00000c00 ea123d24 ec379c2c ec379c08
| [ 98.439300] 9c00: c019ecbc c019cd44 00000c00 00000001 ec379c58 c019eb9c 00000c00 ec379d54
| [ 98.447860] 9c20: ec379c8c ec379c30 c0113f14 c019ec8c 00000c00 00000001 ec379c58 ec379c5c
| [ 98.456414] 9c40: ec378030 00000001 ec250cc0 00000000 00001400 00000000 c018195c c00acd08
| [ 98.464974] 9c60: 5408b05a 00001000 ec250cc0 00000000 ec379d68 ea123d24 ec378010 00000000
| [ 98.473533] 9c80: ec379cf4 ec379c90 c0115ed4 c0113e6c 00000001 00000000 c019f2b0 c0090590
| [ 98.482071] 9ca0: ec379cc4 ec378010 c06c3df4 00001000 ea123c64 c019f2b0 ec379d54 ec379cc8
| [ 98.490607] 9cc0: 00001400 00000000 00000001 ec379d68 ec379d54 ec379e30 ec250cc0 ec356040
| [ 98.499178] 9ce0: ed7ab800 ec30d800 ec379d3c ec379cf8 c019f2b0 c0115c8c c06be3b8 c006dcec
| [ 98.507741] 9d00: ec1b0010 ec30d800 ec379d08 ec379d08 ec379d10 ec379d10 ec379d18 ec379d18
| [ 98.516288] 9d20: 00001400 00000000 ec379e30 ec250cc0 ec379dc4 ec379d40 c016618c c019f284
| [ 98.524833] 9d40: 00001000 c0317b78 ec379d7c ec394000 00001000 00000003 00000000 00001000
| [ 98.533385] 9d60: ec379d4c 00000001 ec250cc0 00000000 00000000 00000000 ec356040 00000000
| [ 98.541946] 9d80: 00000000 00000000 00001400 00000000 00001000 00000000 00000000 00000000
| [ 98.550482] 9da0: ec394000 ec250cc0 ec394000 ec379e30 00001000 00001000 ec379df4 ec379dc8
| [ 98.559023] 9dc0: c0166a3c c01660f4 00000002 ec0ace20 00001000 0000000e ec0ace00 00000000
| [ 98.567567] 9de0: 00001000 ed7ab800 ec379e64 ec379df8 bf0bc3b4 c0166994 0000006f 00001000
| [ 98.576112] 9e00: bf0bc7a4 60000013 e8156000 0000000e 3930343d 00000000 bf0bc7a4 ec0ace00
| [ 98.584660] 9e20: 00002400 00000000 00001400 00000000 00001400 00000000 ec379e64 00000000
| [ 98.593193] 9e40: ed36ddc0 ec378018 ec30d894 ec0ace00 ec30d800 ec30d840 ec379ed4 ec379e68
| [ 98.601754] 9e60: bf0bd1c8 bf0bc08c bf0bf6ec ec378010 c06c3df4 ec356040 00000001 00000000
| [ 98.610305] 9e80: ec379eac ec379e90 c00906b0 c00904f8 ec30d894 ed36ddc0 ec378018 ec30d894
| [ 98.618857] 9ea0: ec379ebc ec379eb0 c0090808 ec30d800 ed36ddc0 ec378018 ec30d894 00000000
| [ 98.627405] 9ec0: 00000200 ec0ace00 ec379f14 ec379ed8 bf0bdbe8 bf0bc74c c06c3d94 ec0acc80
| [ 98.635942] 9ee0: ec394000 ec30d800 bf0bd8cc ec0acc80 00000000 ec30d800 bf0bd8cc 00000000
| [ 98.644465] 9f00: 00000000 00000000 ec379fac ec379f18 c0066ac4 bf0bd8d8 ed1d1040 00000000
| [ 98.652990] 9f20: ec379f3c ec30d800 00000000 00000000 dead4ead ffffffff ffffffff c0c86138
| [ 98.661526] 9f40: 00000000 00000000 c08998e0 00000000 c006dd7c ec379f54 ec379f54 00000000
| [ 98.670077] 9f60: 00000000 dead4ead ffffffff ffffffff c0c86138 00000000 00000000 c08998e0
| [ 98.678612] 9f80: 00000000 ec379f90 ec379f88 ec379f88 ec0acc80 c00669e0 00000000 00000000
| [ 98.687148] 9fa0: 00000000 ec379fb0 c000eea8 c00669ec 00000000 00000000 00000000 00000000
| [ 98.695699] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
| [ 98.704249] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
| [ 98.712805] [<c011394c>] (find_get_entry) from [<c0114360>] (pagecache_get_page+0x3c/0x1f0)
| [ 98.721529] [<c0114360>] (pagecache_get_page) from [<c011478c>] (grab_cache_page_write_begin+0x38/0x50)
| [ 98.731345] [<c011478c>] (grab_cache_page_write_begin) from [<c019cd68>] (block_write_begin+0x30/0x90)
| [ 98.741067] [<c019cd68>] (block_write_begin) from [<c019ecbc>] (blkdev_write_begin+0x3c/0x48)
| [ 98.749974] [<c019ecbc>] (blkdev_write_begin) from [<c0113f14>] (generic_perform_write+0xb4/0x1e4)
| [ 98.759335] [<c0113f14>] (generic_perform_write) from [<c0115ed4>] (__generic_file_write_iter+0x254/0x51c)
| [ 98.769424] [<c0115ed4>] (__generic_file_write_iter) from [<c019f2b0>] (blkdev_write_iter+0x38/0xc0)
| [ 98.778978] [<c019f2b0>] (blkdev_write_iter) from [<c016618c>] (new_sync_write+0xa4/0xcc)
| [ 98.787526] [<c016618c>] (new_sync_write) from [<c0166a3c>] (vfs_write+0xb4/0x1c0)
| [ 98.795462] [<c0166a3c>] (vfs_write) from [<bf0bc3b4>] (do_write+0x334/0x53c [usb_f_mass_storage])
| [ 98.804858] [<bf0bc3b4>] (do_write [usb_f_mass_storage]) from [<bf0bd1c8>] (do_scsi_command+0xa88/0x118c [usb_f_mass_storage])
| [ 98.816782] [<bf0bd1c8>] (do_scsi_command [usb_f_mass_storage]) from [<bf0bdbe8>] (fsg_main_thread+0x31c/0x72c [usb_f_mass_storage])
| [ 98.829249] [<bf0bdbe8>] (fsg_main_thread [usb_f_mass_storage]) from [<c0066ac4>] (kthread+0xe4/0x100)
| [ 98.838993] [<c0066ac4>] (kthread) from [<c000eea8>] (ret_from_fork+0x14/0x20)
| [ 98.846554] Code: e1a01009 eb0905d4 e3500000 0a00001f (e5904000)
| [ 98.853110] ---[ end trace 8bdf31522b942652 ]---
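
As a reading aid, the backtrace above can be reduced to an ordered list of frames, which makes the path from the gadget's kernel thread down to the page-cache lookup easier to see. The sketch below is only an illustration: `call_chain`, `FRAME_RE` and `SAMPLE` are hypothetical names, not existing kernel tooling, and the regular expression assumes the ARM-style `(callee) from [<addr>] (caller+off/size)` layout seen in this log.

```python
import re

# Matches ARM backtrace lines such as:
#   [<c011394c>] (find_get_entry) from [<c0114360>] (pagecache_get_page+0x3c/0x1f0)
# An optional " [module]" suffix inside the parentheses is tolerated.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]{8}>\] \((?P<callee>\w+)(?: \[\w+\])?\) from "
    r"\[<[0-9a-f]{8}>\] "
    r"\((?P<caller>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[\w+\])?\)"
)


def call_chain(log_text):
    """Return function names ordered from innermost to outermost frame."""
    chain = []
    for match in FRAME_RE.finditer(log_text):
        if not chain:
            # The first line also names the innermost (faulting) function.
            chain.append(match.group("callee"))
        chain.append(match.group("caller"))
    return chain


# Three lines copied from the trace above (timestamps dropped).
SAMPLE = """
[<c011394c>] (find_get_entry) from [<c0114360>] (pagecache_get_page+0x3c/0x1f0)
[<c0166a3c>] (vfs_write) from [<bf0bc3b4>] (do_write+0x334/0x53c [usb_f_mass_storage])
[<bf0bc3b4>] (do_write [usb_f_mass_storage]) from [<bf0bd1c8>] (do_scsi_command+0xa88/0x118c [usb_f_mass_storage])
"""

if __name__ == "__main__":
    print(" <- ".join(call_chain(SAMPLE)))
```

Fed the full trace, this lists every frame from `find_get_entry` out to `ret_from_fork`.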


The setup is a bit "odd": I have a USB stick attached to the host port
on my platform, and the peripheral port uses that stick as its backing
file. The peripheral port is connected to a laptop, which I'm using to
read/write to that backing file. The problem doesn't seem to trigger if
I run the exact same test straight to the USB stick attached to the host
port.

My test application is rather basic [1], and I run it with a script [2]
to pass sensible arguments. I haven't found another way of reproducing
this yet, so it could very well be that g_mass_storage is at fault here,
as I also managed to trigger this when using a tmpfs as backing file.

Anyway, looking at PC:

| (gdb) list *(find_get_entry+0x7c)
| 0xc011394c is in find_get_entry (include/linux/radix-tree.h:196).
| 191 * radix_tree_deref_retry must be used to confirm validity of the pointer if
| 192 * only the read lock is held.
| 193 */
| 194 static inline void *radix_tree_deref_slot(void **pslot)
| 195 {
| 196 return rcu_dereference(*pslot);
| 197 }
| 198
| 199 /**
| 200 * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
| (gdb)

And looking at the arguments for that function, we're passing r0 as
0xffffffff and r1 as 1, which is clearly bogus, but I don't know, at
least not yet, where those came from. I'll see if I can reproduce
the same problem with dummy_hcd to rule out a bug in my dwc3 driver :-)
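
One way to cross-check which register was actually dereferenced is to decode the faulting instruction, which the kernel prints in parentheses on the `Code:` line (`e5904000` here). The sketch below is a minimal decoder for the A32 LDR/STR immediate-offset encoding only. The helper names are hypothetical, and the register values are copied by hand from the dump above.

```python
def decode_ldr_str_imm(word):
    """Decode an A32 LDR/STR word that uses a 12-bit immediate offset."""
    if ((word >> 25) & 0b111) != 0b010:
        raise ValueError("not an LDR/STR (immediate) encoding")
    offset = word & 0xFFF
    if not ((word >> 23) & 1):  # U bit clear: the offset is subtracted
        offset = -offset
    return {
        "load": bool((word >> 20) & 1),         # L bit: 1 = LDR, 0 = STR
        "byte": bool((word >> 22) & 1),         # B bit: 1 = byte access
        "pre_indexed": bool((word >> 24) & 1),  # P bit
        "rn": (word >> 16) & 0xF,               # base register number
        "rt": (word >> 12) & 0xF,               # transfer register number
        "offset": offset,
    }


def effective_address(insn, regs):
    """Return the address the instruction accesses, given register values."""
    base = regs[insn["rn"]]
    if not insn["pre_indexed"]:
        # Post-indexed forms access the unmodified base address.
        return base
    return (base + insn["offset"]) & 0xFFFFFFFF


FAULTING_WORD = 0xE5904000             # "(e5904000)" from the Code: line
REGS = {0: 0xFFFFFFFF, 1: 0x00000001}  # r0 and r1 from the register dump

insn = decode_ldr_str_imm(FAULTING_WORD)
addr = effective_address(insn, REGS)

if __name__ == "__main__":
    kind = "ldr" if insn["load"] else "str"
    print(f"{kind} r{insn['rt']}, [r{insn['rn']}, #{insn['offset']}] "
          f"accesses {addr:#010x}")
```

Under these assumptions the word decodes to a load through r0 with a zero offset, so the accessed address equals r0, which agrees with the `virtual address ffffffff` in the Oops. That would make r0 the pointer being dereferenced at that instruction, not necessarily an argument still live from function entry.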

cheers
--
balbi
Paul E. McKenney
2014-09-04 19:16:42 UTC
Post by Felipe Balbi
Hi,
I keep triggering the following Oops with -rc3 when writing to the mass
v3.17-rc3, correct?

I take it that the test passes on some earlier version?

Thanx, Paul
Felipe Balbi
2014-09-04 19:25:35 UTC
Post by Paul E. McKenney
Post by Felipe Balbi
Hi,
I keep triggering the following Oops with -rc3 when writing to the mass
v3.17-rc3, correct?
yup, as in subject ;-)
Post by Paul E. McKenney
I take it that the test passes on some earlier version?
about to test v3.14.17.
--
balbi
Felipe Balbi
2014-09-04 20:04:03 UTC
Hi,
Post by Felipe Balbi
Post by Paul E. McKenney
Post by Felipe Balbi
Hi,
I keep triggering the following Oops with -rc3 when writing to the mass
v3.17-rc3, correct?
yup, as in subject ;-)
Post by Paul E. McKenney
I take it that the test passes on some earlier version?
about to test v3.14.17.
couldn't get v3.14 working on this board, but at least v3.16 is also
affected, except that now it happened during boot; I didn't even need
to run my test:

[ 17.438195] Unable to handle kernel paging request at virtual address ffffffff
[ 17.446109] pgd = ec360000
[ 17.448947] [ffffffff] *pgd=ae7f6821, *pte=00000000, *ppte=00000000
[ 17.455639] Internal error: Oops: 17 [#1] SMP ARM
[ 17.460578] Modules linked in: dwc3(+) udc_core lis3lv02d_i2c lis3lv02d input_polldev dwc3_omap matrix_keypad
[ 17.471060] CPU: 0 PID: 1381 Comm: accounts-daemon Tainted: G W 3.16.0-00005-g8a6cdb4 #811
[ 17.480735] task: ed716040 ti: ec026000 task.ti: ec026000
[ 17.486405] PC is at find_get_entry+0x7c/0x128
[ 17.491070] LR is at 0xfffffffa
[ 17.494364] pc : [<c0110b4c>] lr : [<fffffffa>] psr: a0000013
[ 17.494364] sp : ec027dc8 ip : 00000000 fp : ec027dfc
[ 17.506384] r10: c0c6f6bc r9 : 00000005 r8 : ecdf22f8
[ 17.511860] r7 : ec026008 r6 : 00000001 r5 : 00000000 r4 : 00000000
[ 17.518705] r3 : ec027db4 r2 : 00000000 r1 : 00000005 r0 : ffffffff
[ 17.525526] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 17.533007] Control: 10c5387d Table: ac360059 DAC: 00000015
[ 17.539020] Process accounts-daemon (pid: 1381, stack limit = 0xec026248)
[ 17.546151] Stack: (0xec027dc8 to 0xec028000)
[ 17.550710] 7dc0: 00000000 00000000 c0110ad0 ecdf0b80 00000000 ecdf22f4
[ 17.559259] 7de0: ecdf22f4 00000000 00000005 00000000 ec027e34 ec027e00 c0111874 c0110adc
[ 17.567824] 7e00: ecdf0b80 c03565b4 ed7165f8 ec3dddf0 ecdf22f4 00000005 ec3ddd00 00000001
[ 17.576385] 7e20: ecdf21a0 00000000 ec027ebc ec027e38 c0112978 c0111844 00000000 c06af938
[ 17.584950] 7e40: ecdf0b70 ecdf0b70 ec027e6c ec027e58 00000005 00000006 00000b80 ecdf0b70
[ 17.593514] 7e60: 00000000 c0163264 ec3dddf0 ec027ee8 ec027ed4 00000b80 ec027eac ec027e88
[ 17.602087] 7e80: c0178d98 c0356590 00000000 00000000 00020000 00005b80 00000000 ec027f78
[ 17.610653] 7ea0: ec3ddd00 ed716040 b6cab018 00000000 ec027f44 ec027ec0 c0163264 c0112780
[ 17.619202] 7ec0: 00000180 00000180 ec027efc b6cab018 00000180 00000000 00000000 00000180
[ 17.627772] 7ee0: ec027ecc 00000001 ec3ddd00 00000000 00000000 00000000 ed716040 00000000
[ 17.636371] 7f00: 00000000 00000000 00005b80 00000000 00000180 00000000 00000000 00000000
[ 17.644946] 7f20: b6cab018 ec3ddd00 b6cab018 ec027f78 ec3ddd00 00000180 ec027f74 ec027f48
[ 17.653524] 7f40: c0163a6c c01631cc b6cab018 00000000 00005b80 00000000 ec3ddd03 ec3ddd00
[ 17.662085] 7f60: 00000180 b6cab018 ec027fa4 ec027f78 c0164198 c01639e0 00005b80 00000000
[ 17.670658] 7f80: be91badc be91ba50 00044a00 00000003 c000f044 ec026000 00000000 ec027fa8
[ 17.679222] 7fa0: c000edc0 c0164158 be91badc be91ba50 00000008 b6cab018 00000180 be91ba38
[ 17.687794] 7fc0: be91badc be91ba50 00044a00 00000003 be91bbac b6cab008 00000000 00000000
[ 17.696370] 7fe0: 00000020 be91ba40 b6c78e8c b6c78ea8 60000010 00000008 ae7f6821 ae7f6c21
[ 17.704956] [<c0110b4c>] (find_get_entry) from [<c0111874>] (pagecache_get_page+0x3c/0x1f4)
[ 17.713687] [<c0111874>] (pagecache_get_page) from [<c0112978>] (generic_file_read_iter+0x204/0x794)
[ 17.723259] [<c0112978>] (generic_file_read_iter) from [<c0163264>] (new_sync_read+0xa4/0xcc)
[ 17.732185] [<c0163264>] (new_sync_read) from [<c0163a6c>] (vfs_read+0x98/0x158)
[ 17.739945] [<c0163a6c>] (vfs_read) from [<c0164198>] (SyS_read+0x4c/0xa0)
[ 17.747149] [<c0164198>] (SyS_read) from [<c000edc0>] (ret_fast_syscall+0x0/0x48)
[ 17.754994] Code: e1a01009 eb08ffa9 e3500000 0a00001f (e5904000)
[ 17.761476] ---[ end trace 49c4ed35a1c01157 ]---
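
The v3.16 trace can be compared with the v3.17-rc3 one by pulling out a few identifying fields from each. The following is a rough sketch under stated assumptions: `fault_signature` and the regular expressions are hypothetical helpers, and the two sample strings contain only lines copied from the Oopses in this thread, with timestamps dropped.

```python
import re

PC_RE = re.compile(r"PC is at (\w+)\+0x([0-9a-f]+)/0x[0-9a-f]+")
FAULT_RE = re.compile(r"paging request at virtual address ([0-9a-f]{8})")
R0_RE = re.compile(r"\br0\s*:\s*([0-9a-f]{8})")
CODE_RE = re.compile(r"Code:.*\(([0-9a-f]{8})\)")


def fault_signature(oops_text):
    """Extract the fields that identify where and how an Oops faulted."""
    found = {
        "pc": PC_RE.search(oops_text),
        "fault": FAULT_RE.search(oops_text),
        "r0": R0_RE.search(oops_text),
        "code": CODE_RE.search(oops_text),
    }
    missing = [name for name, match in found.items() if match is None]
    if missing:
        raise ValueError("fields not found: " + ", ".join(missing))
    return {
        "function": found["pc"].group(1),
        "offset": int(found["pc"].group(2), 16),
        "fault_address": int(found["fault"].group(1), 16),
        "r0": int(found["r0"].group(1), 16),
        "faulting_insn": int(found["code"].group(1), 16),
    }


OOPS_V3_17_RC3 = """
Unable to handle kernel paging request at virtual address ffffffff
PC is at find_get_entry+0x7c/0x128
r3 : ec379b3c r2 : 00000000 r1 : 00000001 r0 : ffffffff
Code: e1a01009 eb0905d4 e3500000 0a00001f (e5904000)
"""

OOPS_V3_16 = """
Unable to handle kernel paging request at virtual address ffffffff
PC is at find_get_entry+0x7c/0x128
r3 : ec027db4 r2 : 00000000 r1 : 00000005 r0 : ffffffff
Code: e1a01009 eb08ffa9 e3500000 0a00001f (e5904000)
"""

if __name__ == "__main__":
    same = fault_signature(OOPS_V3_17_RC3) == fault_signature(OOPS_V3_16)
    print("same fault signature:", same)
```

Matching signatures only show that both kernels fault at the same instruction with the same bad pointer; they do not say which caller or subsystem produced that pointer.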

It seems to be a difficult-to-reproduce race though. On a second boot it
didn't die during boot, but died with my USB test case. Unfortunately,
the platform I'm using is pretty new and only goes as far back as v3.16
(to which I had to backport 11 patches to get it to boot well enough for
this test).

I wonder if a corrupt file system could cause such problems... I keep
seeing EXT4 errors every now and again; considering that this dies in a
path through VFS, I wonder...

cheers
--
balbi
Paul E. McKenney
2014-09-05 21:32:16 UTC
Post by Felipe Balbi
I wonder if a corrupt file system could cause such problems... I keep
seeing EXT4 errors every now and again; considering that this dies in a
path through VFS, I wonder...
I recall hearing of similar things in the past, but must defer to the
FS/VFS experts on this one.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Felipe Balbi
2014-10-08 17:13:22 UTC
Hi,
Post by Paul E. McKenney
I recall hearing of similar things in the past, but must defer to the
FS/VFS experts on this one.
resurrecting this thread. I'm facing the same issues with a brand new
filesystem mounted through NFS. The way to reproduce is the same though:
using g_mass_storage with either tmpfs or mmc as backing store.

However it seems to die much more frequently than before. I can
reproduce all the time. It's definitely not a problem with my board as I
have two boards with different SoCs (ARM Cortex A8 and ARM Cortex A9)
with two different USB peripheral controllers (MUSB and DWC3), using the
same rootfs and they die the exact same way no matter if I use tmpfs or
MMC as backing store.

Adding a few more folks here.
--
balbi
Felipe Balbi
2014-10-08 17:57:07 UTC
Hi,
Post by Felipe Balbi
Post by Paul E. McKenney
Post by Felipe Balbi
Hi,
Post by Felipe Balbi
Post by Paul E. McKenney
Post by Felipe Balbi
Hi,
I keep triggering the following Oops with -rc3 when writing to the mass
v3.17-rc3, correct?
yup, as in subject ;-)
Post by Paul E. McKenney
I take it that the test passes on some earlier version?
about to test v3.14.17.
couldn't get v3.14 working on this board but at least v3.16 is also
affected, except that now it happened during boot, I didn't even need
[ 17.438195] Unable to handle kernel paging request at virtual address ffffffff
[ 17.446109] pgd = ec360000
[ 17.448947] [ffffffff] *pgd=ae7f6821, *pte=00000000, *ppte=00000000
[ 17.455639] Internal error: Oops: 17 [#1] SMP ARM
[ 17.460578] Modules linked in: dwc3(+) udc_core lis3lv02d_i2c lis3lv02d input_polldev dwc3_omap matrix_keypad
[ 17.471060] CPU: 0 PID: 1381 Comm: accounts-daemon Tainted: G W 3.16.0-00005-g8a6cdb4 #811
[ 17.480735] task: ed716040 ti: ec026000 task.ti: ec026000
[ 17.486405] PC is at find_get_entry+0x7c/0x128
[ 17.491070] LR is at 0xfffffffa
[ 17.494364] pc : [<c0110b4c>] lr : [<fffffffa>] psr: a0000013
[ 17.494364] sp : ec027dc8 ip : 00000000 fp : ec027dfc
[ 17.506384] r10: c0c6f6bc r9 : 00000005 r8 : ecdf22f8
[ 17.511860] r7 : ec026008 r6 : 00000001 r5 : 00000000 r4 : 00000000
[ 17.518705] r3 : ec027db4 r2 : 00000000 r1 : 00000005 r0 : ffffffff
[ 17.525526] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 17.533007] Control: 10c5387d Table: ac360059 DAC: 00000015
[ 17.539020] Process accounts-daemon (pid: 1381, stack limit = 0xec026248)
[ 17.546151] Stack: (0xec027dc8 to 0xec028000)
[ 17.550710] 7dc0: 00000000 00000000 c0110ad0 ecdf0b80 00000000 ecdf22f4
[ 17.559259] 7de0: ecdf22f4 00000000 00000005 00000000 ec027e34 ec027e00 c0111874 c0110adc
[ 17.567824] 7e00: ecdf0b80 c03565b4 ed7165f8 ec3dddf0 ecdf22f4 00000005 ec3ddd00 00000001
[ 17.576385] 7e20: ecdf21a0 00000000 ec027ebc ec027e38 c0112978 c0111844 00000000 c06af938
[ 17.584950] 7e40: ecdf0b70 ecdf0b70 ec027e6c ec027e58 00000005 00000006 00000b80 ecdf0b70
[ 17.593514] 7e60: 00000000 c0163264 ec3dddf0 ec027ee8 ec027ed4 00000b80 ec027eac ec027e88
[ 17.602087] 7e80: c0178d98 c0356590 00000000 00000000 00020000 00005b80 00000000 ec027f78
[ 17.610653] 7ea0: ec3ddd00 ed716040 b6cab018 00000000 ec027f44 ec027ec0 c0163264 c0112780
[ 17.619202] 7ec0: 00000180 00000180 ec027efc b6cab018 00000180 00000000 00000000 00000180
[ 17.627772] 7ee0: ec027ecc 00000001 ec3ddd00 00000000 00000000 00000000 ed716040 00000000
[ 17.636371] 7f00: 00000000 00000000 00005b80 00000000 00000180 00000000 00000000 00000000
[ 17.644946] 7f20: b6cab018 ec3ddd00 b6cab018 ec027f78 ec3ddd00 00000180 ec027f74 ec027f48
[ 17.653524] 7f40: c0163a6c c01631cc b6cab018 00000000 00005b80 00000000 ec3ddd03 ec3ddd00
[ 17.662085] 7f60: 00000180 b6cab018 ec027fa4 ec027f78 c0164198 c01639e0 00005b80 00000000
[ 17.670658] 7f80: be91badc be91ba50 00044a00 00000003 c000f044 ec026000 00000000 ec027fa8
[ 17.679222] 7fa0: c000edc0 c0164158 be91badc be91ba50 00000008 b6cab018 00000180 be91ba38
[ 17.687794] 7fc0: be91badc be91ba50 00044a00 00000003 be91bbac b6cab008 00000000 00000000
[ 17.696370] 7fe0: 00000020 be91ba40 b6c78e8c b6c78ea8 60000010 00000008 ae7f6821 ae7f6c21
[ 17.704956] [<c0110b4c>] (find_get_entry) from [<c0111874>] (pagecache_get_page+0x3c/0x1f4)
[ 17.713687] [<c0111874>] (pagecache_get_page) from [<c0112978>] (generic_file_read_iter+0x204/0x794)
[ 17.723259] [<c0112978>] (generic_file_read_iter) from [<c0163264>] (new_sync_read+0xa4/0xcc)
[ 17.732185] [<c0163264>] (new_sync_read) from [<c0163a6c>] (vfs_read+0x98/0x158)
[ 17.739945] [<c0163a6c>] (vfs_read) from [<c0164198>] (SyS_read+0x4c/0xa0)
[ 17.747149] [<c0164198>] (SyS_read) from [<c000edc0>] (ret_fast_syscall+0x0/0x48)
[ 17.754994] Code: e1a01009 eb08ffa9 e3500000 0a00001f (e5904000)
[ 17.761476] ---[ end trace 49c4ed35a1c01157 ]---
It seems to be a difficult-to-reproduce race though. On a second boot it
didn't die during boot, but died with my USB test case. Unfortunately,
the platform I'm using is pretty new and only goes as far back as v3.16
(which I had to backport 11 patches to get it to boot good enough for
this test).
I wonder if a corrupt file system could cause such problems... I keep
seeing EXT4 errors every now and again; considering that this dies in a
path through VFS, I wonder...
I recall hearing of similar things in the past, but must defer to the
FS/VFS experts on this one.
resurrecting this thread. I'm facing the same issues with a brand new
filesystem mounted through NFS. The way to reproduce is the same though:
using g_mass_storage with either tmpfs or mmc as backing store.
However it seems to die much more frequently than before. I can
reproduce all the time. It's definitely not a problem with my board as I
have two boards with different SoCs (ARM Cortex A8 and ARM Cortex A9)
with two different USB peripheral controllers (MUSB and DWC3), using the
same rootfs and they die the exact same way no matter if I use tmpfs or
MMC as backing store.
Adding a few more folks here.
Alright, the newest kernel that is stable on Cortex A8 is v3.14. Every
version from v3.15 up to today's Linus tree dies. I'll start bisecting
now.
--
balbi
Felipe Balbi
2014-10-08 21:29:38 UTC
Permalink
Hi,

On Wed, Oct 08, 2014 at 12:57:07PM -0500, Felipe Balbi wrote:

[ snip ]
Post by Felipe Balbi
Post by Felipe Balbi
Post by Paul E. McKenney
Post by Felipe Balbi
It seems to be a difficult-to-reproduce race though. On a second boot it
didn't die during boot, but died with my USB test case. Unfortunately,
the platform I'm using is pretty new and only goes as far back as v3.16
(which I had to backport 11 patches to get it to boot good enough for
this test).
I wonder if a corrupt file system could cause such problems... I keep
seeing EXT4 errors every now and again; considering that this dies in a
path through VFS, I wonder...
I recall hearing of similar things in the past, but must defer to the
FS/VFS experts on this one.
resurrecting this thread. I'm facing the same issues with a brand new
filesystem mounted through NFS. The way to reproduce is the same though:
using g_mass_storage with either tmpfs or mmc as backing store.
However it seems to die much more frequently than before. I can
reproduce all the time. It's definitely not a problem with my board as I
have two boards with different SoCs (ARM Cortex A8 and ARM Cortex A9)
with two different USB peripheral controllers (MUSB and DWC3), using the
same rootfs and they die the exact same way no matter if I use tmpfs or
MMC as backing store.
Adding a few more folks here.
alright, first stable kernel with Cortex A8 was v3.14. All other kernel
versions die starting with v3.15 to today's Linus. I'll start bisecting
now.
Finally bisected it down to commit 139e561660fe11e0fc35e142a800df3dd7d03e9d
(lib: radix_tree: tree node interface). Here's the full bisect log:

git bisect start
# good: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect good 455c6fdbd219161bd09b1165f11699d6d73de11c
# bad: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect bad 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [74a475acea49459721ae4b062d3da68c74259009] SubmittingPatches: add style recommendation to use imperative descriptions
git bisect bad 74a475acea49459721ae4b062d3da68c74259009
# good: [c12e69c6aaf785fd307d05cb6f36ca0e7577ead7] Merge tag 'staging-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good c12e69c6aaf785fd307d05cb6f36ca0e7577ead7
# good: [0fc31966035d7a540c011b6c967ce8eae1db121b] Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
git bisect good 0fc31966035d7a540c011b6c967ce8eae1db121b
# good: [bdfc7cbdeef8cadba0e5793079ac0130b8e2220c] Merge branch 'mips-for-linux-next' of git://git.linux-mips.org/pub/scm/ralf/upstream-sfr
git bisect good bdfc7cbdeef8cadba0e5793079ac0130b8e2220c
# good: [0f1b1e6d73cb989ce2c071edc57deade3b084dfe] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect good 0f1b1e6d73cb989ce2c071edc57deade3b084dfe
# good: [181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da] ixgbe: remove redundant if clause from PTP work
git bisect good 181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da
# good: [59ecc26004e77e100c700b1d0da7502b0fdadb46] Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 59ecc26004e77e100c700b1d0da7502b0fdadb46
# good: [2b665e276c15ba7d9fc8cdd16931883a51ed13e4] fs/direct-io.c: remove redundant comparison
git bisect good 2b665e276c15ba7d9fc8cdd16931883a51ed13e4
# bad: [f412c97abef71026d8192ca8efca231f1e3906b3] mm, hugetlb: mark some bootstrap functions as __init
git bisect bad f412c97abef71026d8192ca8efca231f1e3906b3
# good: [4e35f483850ba46b838adfd312b3052416e15204] mm, hugetlb: use vma_resv_map() map types
git bisect good 4e35f483850ba46b838adfd312b3052416e15204
# good: [6dbaf22ce1f1dfba33313198eb5bd989ae76dd87] mm: shmem: save one radix tree lookup when truncating swapped pages
git bisect good 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87
# good: [91b0abe36a7b2b3b02d7500925a5f8455334f0e5] mm + fs: store shadow entries in page cache
git bisect good 91b0abe36a7b2b3b02d7500925a5f8455334f0e5
# bad: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface
git bisect bad 139e561660fe11e0fc35e142a800df3dd7d03e9d
# good: [a528910e12ec7ee203095eb1711468a66b9b60b0] mm: thrash detection-based file cache sizing
git bisect good a528910e12ec7ee203095eb1711468a66b9b60b0
# first bad commit: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface

I tried reverting that commit on v3.15, but the revert is non-trivial; I'll
leave that for tomorrow. Meanwhile, I'm adding the folks involved with that
commit to the Cc list, along with another backtrace for reference:

[ 113.696647] Unable to handle kernel paging request at virtual address ffffffff
[ 113.704370] pgd = c0004000
[ 113.707276] [ffffffff] *pgd=9fef6821, *pte=00000000, *ppte=00000000
[ 113.713998] Internal error: Oops: 17 [#1] SMP ARM
[ 113.718912] Modules linked in: g_mass_storage usb_f_mass_storage libcomposite configfs musb_dsps musb_hdrc musb_am335x
[ 113.730144] CPU: 0 PID: 1368 Comm: file-storage Not tainted 3.17.0-02899-g748eb79 #239
[ 113.738410] task: de606e00 ti: dd0ba000 task.ti: dd0ba000
[ 113.744060] PC is at find_get_entry+0x64/0x100
[ 113.748700] LR is at 0xfffffffa
[ 113.751978] pc : [<c01065b4>] lr : [<fffffffa>] psr: a00f0013
[ 113.751978] sp : dd0bbba0 ip : 00000000 fp : dd0bbbd4
[ 113.763962] r10: c0665100 r9 : 00001000 r8 : 0000001a
[ 113.769415] r7 : dd0ee9b8 r6 : 00000001 r5 : 00000000 r4 : dd0ee880
[ 113.776228] r3 : dd0bbb8c r2 : 00000000 r1 : 0000001a r0 : ffffffff
[ 113.783044] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 113.790674] Control: 10c5387d Table: 9e210019 DAC: 00000015
[ 113.796672] Process file-storage (pid: 1368, stack limit = 0xdd0ba248)
[ 113.803486] Stack: (0xdd0bbba0 to 0xdd0bc000)
[ 113.808038] bba0: 00000000 00000000 c0106550 00017508 00000002 dd0ee880 dd0ee9b4 0000001a
[ 113.816578] bbc0: 00001000 00000000 dd0bbbf4 dd0bbbd8 c010716c c010655c 00013ef0 dd0ee880
[ 113.825118] bbe0: dd0bbda4 00000003 dd0bbc6c dd0bbbf8 c011df94 c0107150 dd0bbc2c c0106b9c
[ 113.833657] bc00: c0089a3c c0089328 00000001 c0107080 00000002 dd0bbcc0 000000d0 00000000
[ 113.842197] bc20: 0001a000 00000000 00000000 dd0ee9b4 0000001a c011e74c dd0bbc94 dd0bbc48
[ 113.850736] bc40: c011beec 00001000 dd0bbda4 dd0ee9b4 00001000 00000000 00001000 c0665100
[ 113.859276] bc60: dd0bbc94 dd0bbc70 c011e74c c011df08 000200da 00000000 00001000 dd0bbda4
[ 113.867816] bc80: dd0ee9b4 00001000 dd0bbcf4 dd0bbc98 c0106b10 c011e700 00001000 00000001
[ 113.876356] bca0: dd0bbcc0 dd0bbcc4 dd0ba000 00000001 de60ee40 00002000 0001a000 00000000
[ 113.884896] bcc0: dfe71ac0 c00a3b60 54355ca1 00004000 de60ee40 00000000 dd0bbdb8 dd0ee9b4
[ 113.893436] bce0: dd0ee880 ffffffff dd0bbd5c dd0bbcf8 c0108c6c c0106a68 dd0bbd5c dd0bbd08
[ 113.901975] bd00: c064b790 c0089c48 00000001 dd0ba038 c0108f70 c0089328 00000001 c0108f7c
[ 113.910515] bd20: dd0bbda4 de606e00 00018000 00000000 dd0bbd5c dd0bbdb8 dd0ee920 dd0bbda4
[ 113.919055] bd40: de60ee40 de606e00 dd0e5000 de664a00 dd0bbd8c dd0bbd60 c0108f7c c0108a24
[ 113.927595] bd60: c008c410 c0089fd0 00000001 00000000 00018000 00000000 dd0bbe80 de60ee40
[ 113.936134] bd80: dd0bbe14 dd0bbd90 c014c920 c0108f40 00004000 00000001 00000001 de274000
[ 113.944674] bda0: 00004000 00000003 00002000 00002000 dd0bbd9c 00000001 de60ee40 00000000
[ 113.953214] bdc0: 00000000 00000000 de606e00 00000000 00000000 00000000 00018000 00000000
[ 113.961753] bde0: 00004000 00000000 00000000 00000000 de274000 de60ee40 de274000 dd0bbe80
[ 113.970293] be00: 00004000 de6ce9c0 dd0bbe44 dd0bbe18 c014d1c8 c014c888 00000002 de6ce9c0
[ 113.978833] be20: 00004000 00000000 00000000 00008000 de6ce9c0 dd0e5000 dd0bbeb4 dd0bbe48
[ 113.987373] be40: bf059cc4 c014d120 00000000 dd0bbe9c dd0bbe68 bf05a04c 19000000 00000000
[ 113.995912] be60: dd0ba000 00000000 00000000 6f48202c 00018000 00000000 00020000 00000000
[ 114.004452] be80: 00018000 00000000 00000000 de664a00 de6ce9c0 00000000 de664a38 de664a00
[ 114.012992] bea0: dd0ba038 de664a7c dd0bbf24 dd0bbeb8 bf05a938 bf059980 00000001 c00899dc
[ 114.021531] bec0: a00f0013 de2e3bd4 00000000 00052000 00000000 dd0bbee0 c0089c50 c0089a70
[ 114.030071] bee0: dd0bbf04 dd0bbef0 c064f3a4 de6ce840 00000000 de664a00 bf05a244 de6ce840
[ 114.038611] bf00: 00000000 de664a00 bf05a244 00000000 00000000 00000000 dd0bbfac dd0bbf28
[ 114.047151] bf20: c0065bdc bf05a250 c0089c50 00000000 dd0bbf54 de664a00 00000000 00000000
[ 114.055690] bf40: dead4ead ffffffff ffffffff c0a8a238 00000000 00000000 c08070f8 dd0bbf5c
[ 114.064230] bf60: dd0bbf5c 00000000 00000000 dead4ead ffffffff ffffffff c0a8a238 00000000
[ 114.072770] bf80: 00000000 c08070f8 dd0bbf88 dd0bbf88 de6ce840 c0065af8 00000000 00000000
[ 114.081310] bfa0: 00000000 dd0bbfb0 c000eea8 c0065b04 00000000 00000000 00000000 00000000
[ 114.089850] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 114.098389] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 0001086e 00001a02
[ 114.106944] [<c01065b4>] (find_get_entry) from [<c010716c>] (find_lock_entry+0x28/0x7c)
[ 114.115316] [<c010716c>] (find_lock_entry) from [<c011df94>] (shmem_getpage_gfp+0x98/0x7f8)
[ 114.124042] [<c011df94>] (shmem_getpage_gfp) from [<c011e74c>] (shmem_write_begin+0x58/0x94)
[ 114.132856] [<c011e74c>] (shmem_write_begin) from [<c0106b10>] (generic_perform_write+0xb4/0x1c8)
[ 114.142124] [<c0106b10>] (generic_perform_write) from [<c0108c6c>] (__generic_file_write_iter+0x254/0x51c)
[ 114.152208] [<c0108c6c>] (__generic_file_write_iter) from [<c0108f7c>] (generic_file_write_iter+0x48/0xdc)
[ 114.162298] [<c0108f7c>] (generic_file_write_iter) from [<c014c920>] (new_sync_write+0xa4/0xcc)
[ 114.171386] [<c014c920>] (new_sync_write) from [<c014d1c8>] (vfs_write+0xb4/0x1c0)
[ 114.179334] [<c014d1c8>] (vfs_write) from [<bf059cc4>] (do_write+0x350/0x4b8 [usb_f_mass_storage])
[ 114.188719] [<bf059cc4>] (do_write [usb_f_mass_storage]) from [<bf05a938>] (fsg_main_thread+0x6f4/0x13f8 [usb_f_mass_storage])
[ 114.200636] [<bf05a938>] (fsg_main_thread [usb_f_mass_storage]) from [<c0065bdc>] (kthread+0xe4/0x100)
[ 114.210368] [<c0065bdc>] (kthread) from [<c000eea8>] (ret_from_fork+0x14/0x20)
[ 114.217914] Code: e1a01008 eb08abbe e3500000 0a00001b (e5904000)
[ 114.224529] ---[ end trace afb7e71d4b71be98 ]---
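As a reading aid for the trace above: the parenthesised word on the "Code:" line is the instruction at the faulting PC. The following is only a sketch, assuming the standard ARM (A32) encoding for LDR/STR with a 12-bit immediate offset; the struct and function names are invented for illustration and do not come from any kernel tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an A32 LDR/STR instruction with a 12-bit immediate offset.
 * Hypothetical helper type, not part of any kernel or binutils API. */
struct arm_ldst_imm {
    bool is_load;     /* L bit (20): true for LDR, false for STR */
    bool is_byte;     /* B bit (22): byte access instead of word */
    bool add_offset;  /* U bit (23): offset is added to the base */
    bool pre_index;   /* P bit (24) */
    bool writeback;   /* W bit (21) */
    unsigned rn;      /* base register */
    unsigned rd;      /* transferred register */
    unsigned imm12;   /* immediate offset */
};

/* Decode 'word' if it is an LDR/STR immediate form (bits 27:25 == 010).
 * Returns false for anything else, including the cond == 0xF space. */
bool arm_decode_ldst_imm(uint32_t word, struct arm_ldst_imm *out)
{
    assert(out != NULL);

    if ((word >> 28) == 0xFu)
        return false;
    if (((word >> 25) & 0x7u) != 0x2u)
        return false;

    out->pre_index  = (word >> 24) & 1u;
    out->add_offset = (word >> 23) & 1u;
    out->is_byte    = (word >> 22) & 1u;
    out->writeback  = (word >> 21) & 1u;
    out->is_load    = (word >> 20) & 1u;
    out->rn         = (word >> 16) & 0xFu;
    out->rd         = (word >> 12) & 0xFu;
    out->imm12      = word & 0xFFFu;
    return true;
}

/* Start address of a function, given the PC and the "+0x.." offset
 * printed next to the symbol name in the oops. */
uint32_t symbol_base(uint32_t pc, uint32_t offset)
{
    return pc - offset;
}
```

Decoding 0xe5904000 with this helper gives a word load into r4 from base register r0 with offset 0. The register dump shows r0 = ffffffff, which matches the reported fault address. Likewise, symbol_base(0xc01065b4, 0x64) yields 0xc0106550, a value that also appears in the stack dump.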

For those who are coming in late: the problem happens when I use
g_mass_storage on either Cortex A8 or Cortex A9, with two different USB
peripheral controllers, using either tmpfs or MMC as the backing store.
--
balbi
Johannes Weiner
2014-10-09 16:01:38 UTC
Permalink
Hi Felipe,
Post by Felipe Balbi
Finally bisected it down to commit 139e561660fe11e0fc35e142a800df3dd7d03e9d
git bisect start
# good: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect good 455c6fdbd219161bd09b1165f11699d6d73de11c
# bad: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect bad 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [74a475acea49459721ae4b062d3da68c74259009] SubmittingPatches: add style recommendation to use imperative descriptions
git bisect bad 74a475acea49459721ae4b062d3da68c74259009
# good: [c12e69c6aaf785fd307d05cb6f36ca0e7577ead7] Merge tag 'staging-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good c12e69c6aaf785fd307d05cb6f36ca0e7577ead7
# good: [0fc31966035d7a540c011b6c967ce8eae1db121b] Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
git bisect good 0fc31966035d7a540c011b6c967ce8eae1db121b
# good: [bdfc7cbdeef8cadba0e5793079ac0130b8e2220c] Merge branch 'mips-for-linux-next' of git://git.linux-mips.org/pub/scm/ralf/upstream-sfr
git bisect good bdfc7cbdeef8cadba0e5793079ac0130b8e2220c
# good: [0f1b1e6d73cb989ce2c071edc57deade3b084dfe] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect good 0f1b1e6d73cb989ce2c071edc57deade3b084dfe
# good: [181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da] ixgbe: remove redundant if clause from PTP work
git bisect good 181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da
# good: [59ecc26004e77e100c700b1d0da7502b0fdadb46] Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 59ecc26004e77e100c700b1d0da7502b0fdadb46
# good: [2b665e276c15ba7d9fc8cdd16931883a51ed13e4] fs/direct-io.c: remove redundant comparison
git bisect good 2b665e276c15ba7d9fc8cdd16931883a51ed13e4
# bad: [f412c97abef71026d8192ca8efca231f1e3906b3] mm, hugetlb: mark some bootstrap functions as __init
git bisect bad f412c97abef71026d8192ca8efca231f1e3906b3
# good: [4e35f483850ba46b838adfd312b3052416e15204] mm, hugetlb: use vma_resv_map() map types
git bisect good 4e35f483850ba46b838adfd312b3052416e15204
# good: [6dbaf22ce1f1dfba33313198eb5bd989ae76dd87] mm: shmem: save one radix tree lookup when truncating swapped pages
git bisect good 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87
# good: [91b0abe36a7b2b3b02d7500925a5f8455334f0e5] mm + fs: store shadow entries in page cache
git bisect good 91b0abe36a7b2b3b02d7500925a5f8455334f0e5
# bad: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface
git bisect bad 139e561660fe11e0fc35e142a800df3dd7d03e9d
# good: [a528910e12ec7ee203095eb1711468a66b9b60b0] mm: thrash detection-based file cache sizing
git bisect good a528910e12ec7ee203095eb1711468a66b9b60b0
# first bad commit: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface
I tried reverting that commit on v3.15 but it's non-trivial; I'll leave
that for tomorrow. Meanwhile, adding folks involved with that commit to
[ 113.696647] Unable to handle kernel paging request at virtual address ffffffff
[ 113.704370] pgd = c0004000
[ 113.707276] [ffffffff] *pgd=9fef6821, *pte=00000000, *ppte=00000000
[ 113.713998] Internal error: Oops: 17 [#1] SMP ARM
[ 113.718912] Modules linked in: g_mass_storage usb_f_mass_storage libcomposite configfs musb_dsps musb_hdrc musb_am335x
[ 113.730144] CPU: 0 PID: 1368 Comm: file-storage Not tainted 3.17.0-02899-g748eb79 #239
[ 113.738410] task: de606e00 ti: dd0ba000 task.ti: dd0ba000
[ 113.744060] PC is at find_get_entry+0x64/0x100
Could you please provide the disassembly of that function?

I'm thinking it's not the slot pointer itself that's bad, because
__radix_tree_lookup() dereferences that to test if it's populated
before returning it, and slot life-time is guaranteed by RCU.

That would only leave garbage in the slot itself, crashing during
page_cache_get_speculative().
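
The distinction drawn above is between two separate dereferences: one of the slot pointer returned by the lookup, and one of the page pointer stored in the slot. The following user-space sketch models that sequence. It is not the kernel source; every toy_* name is hypothetical, and where the kernel repeats the lookup this model simply gives up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct page: only the reference count matters here. */
struct toy_page {
    atomic_int refcount;
};

/* Take a reference unless the count is already zero, i.e. unless the
 * page is on its way to being freed. Returns 1 on success, 0 otherwise. */
static int toy_get_speculative(struct toy_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return 1;
    }
    return 0;
}

/* 'slot' plays the role of the pointer returned by the tree lookup;
 * NULL means there is no slot for the requested index. */
struct toy_page *toy_find_get_entry(_Atomic(struct toy_page *) *slot)
{
    struct toy_page *page;

    if (slot == NULL)
        return NULL;

    /* First dereference: reads through the slot pointer itself. */
    page = atomic_load(slot);
    if (page == NULL)
        return NULL;

    /* Entries with low bits set are not plain page pointers; this
     * model returns them untouched instead of handling them. */
    if ((uintptr_t)page & 3u)
        return page;

    /* Second dereference: reads through the slot's contents. */
    if (!toy_get_speculative(page))
        return NULL;            /* the kernel would retry the lookup */

    /* Re-check that the slot still points at the same page. */
    if (atomic_load(slot) != page) {
        atomic_fetch_sub(&page->refcount, 1);
        return NULL;            /* the kernel would retry the lookup */
    }
    return page;
}
```

In this model a bad slot pointer would fault at the first atomic_load(), while garbage stored in a valid slot would fault inside toy_get_speculative().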

I'll keep staring at this change, but nothing stands out to me yet.

Thanks,
Johannes
Felipe Balbi
2014-10-09 16:26:56 UTC
Permalink
Hi Johannes,
Post by Johannes Weiner
Post by Felipe Balbi
Finally bisected it down to commit 139e561660fe11e0fc35e142a800df3dd7d03e9d
git bisect start
# good: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect good 455c6fdbd219161bd09b1165f11699d6d73de11c
# bad: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect bad 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [74a475acea49459721ae4b062d3da68c74259009] SubmittingPatches: add style recommendation to use imperative descriptions
git bisect bad 74a475acea49459721ae4b062d3da68c74259009
# good: [c12e69c6aaf785fd307d05cb6f36ca0e7577ead7] Merge tag 'staging-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good c12e69c6aaf785fd307d05cb6f36ca0e7577ead7
# good: [0fc31966035d7a540c011b6c967ce8eae1db121b] Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
git bisect good 0fc31966035d7a540c011b6c967ce8eae1db121b
# good: [bdfc7cbdeef8cadba0e5793079ac0130b8e2220c] Merge branch 'mips-for-linux-next' of git://git.linux-mips.org/pub/scm/ralf/upstream-sfr
git bisect good bdfc7cbdeef8cadba0e5793079ac0130b8e2220c
# good: [0f1b1e6d73cb989ce2c071edc57deade3b084dfe] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect good 0f1b1e6d73cb989ce2c071edc57deade3b084dfe
# good: [181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da] ixgbe: remove redundant if clause from PTP work
git bisect good 181e7d5d7bd7747e882e3ca89ecbf0fc3e72d0da
# good: [59ecc26004e77e100c700b1d0da7502b0fdadb46] Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 59ecc26004e77e100c700b1d0da7502b0fdadb46
# good: [2b665e276c15ba7d9fc8cdd16931883a51ed13e4] fs/direct-io.c: remove redundant comparison
git bisect good 2b665e276c15ba7d9fc8cdd16931883a51ed13e4
# bad: [f412c97abef71026d8192ca8efca231f1e3906b3] mm, hugetlb: mark some bootstrap functions as __init
git bisect bad f412c97abef71026d8192ca8efca231f1e3906b3
# good: [4e35f483850ba46b838adfd312b3052416e15204] mm, hugetlb: use vma_resv_map() map types
git bisect good 4e35f483850ba46b838adfd312b3052416e15204
# good: [6dbaf22ce1f1dfba33313198eb5bd989ae76dd87] mm: shmem: save one radix tree lookup when truncating swapped pages
git bisect good 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87
# good: [91b0abe36a7b2b3b02d7500925a5f8455334f0e5] mm + fs: store shadow entries in page cache
git bisect good 91b0abe36a7b2b3b02d7500925a5f8455334f0e5
# bad: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface
git bisect bad 139e561660fe11e0fc35e142a800df3dd7d03e9d
# good: [a528910e12ec7ee203095eb1711468a66b9b60b0] mm: thrash detection-based file cache sizing
git bisect good a528910e12ec7ee203095eb1711468a66b9b60b0
# first bad commit: [139e561660fe11e0fc35e142a800df3dd7d03e9d] lib: radix_tree: tree node interface
I tried reverting that commit on v3.15 but it's non-trivial; I'll leave
that for tomorrow. Meanwhile, adding folks involved with that commit to
[ 113.696647] Unable to handle kernel paging request at virtual address ffffffff
[ 113.704370] pgd = c0004000
[ 113.707276] [ffffffff] *pgd=9fef6821, *pte=00000000, *ppte=00000000
[ 113.713998] Internal error: Oops: 17 [#1] SMP ARM
[ 113.718912] Modules linked in: g_mass_storage usb_f_mass_storage libcomposite configfs musb_dsps musb_hdrc musb_am335x
[ 113.730144] CPU: 0 PID: 1368 Comm: file-storage Not tainted 3.17.0-02899-g748eb79 #239
[ 113.738410] task: de606e00 ti: dd0ba000 task.ti: dd0ba000
[ 113.744060] PC is at find_get_entry+0x64/0x100
Could you please provide the disassembly of that function?
Here you go. Note that it's ARM assembly:

Dump of assembler code for function find_get_entry:
0xc011da48 <+0>: mov r12, sp
0xc011da4c <+4>: push {r4, r5, r6, r7, r8, r9, r11, r12, lr, pc}
0xc011da50 <+8>: sub r11, r12, #4
0xc011da54 <+12>: sub sp, sp, #16
0xc011da58 <+16>: push {lr} ; (str lr, [sp, #-4]!)
0xc011da5c <+20>: bl 0xc000ef00 <__gnu_mcount_nc>
0xc011da60 <+24>: mov r6, r0
0xc011da64 <+28>: mov r7, r1
0xc011da68 <+32>: ldr r2, [pc, #520] ; 0xc011dc78 <find_get_entry+560>
0xc011da6c <+36>: mov r3, #0
0xc011da70 <+40>: mov r1, r3
0xc011da74 <+44>: str r2, [sp, #8]
0xc011da78 <+48>: str r3, [sp]
0xc011da7c <+52>: mov r2, r3
0xc011da80 <+56>: str r3, [sp, #4]
0xc011da84 <+60>: ldr r0, [pc, #496] ; 0xc011dc7c <find_get_entry+564>
0xc011da88 <+64>: mov r3, #2
0xc011da8c <+68>: bl 0xc0095f88 <lock_acquire>
0xc011da90 <+72>: bl 0xc00a7b50 <debug_lockdep_rcu_enabled>
0xc011da94 <+76>: cmp r0, #0
0xc011da98 <+80>: beq 0xc011daac <find_get_entry+100>
0xc011da9c <+84>: ldr r4, [pc, #476] ; 0xc011dc80 <find_get_entry+568>
0xc011daa0 <+88>: ldrb r3, [r4, #1]
0xc011daa4 <+92>: cmp r3, #0
0xc011daa8 <+96>: beq 0xc011dbfc <find_get_entry+436>
0xc011daac <+100>: ldr r8, [pc, #460] ; 0xc011dc80 <find_get_entry+568>
0xc011dab0 <+104>: add r6, r6, #4
0xc011dab4 <+108>: mov r5, #1
0xc011dab8 <+112>: mov r0, r6
0xc011dabc <+116>: mov r1, r7
0xc011dac0 <+120>: bl 0xc0364660 <radix_tree_lookup_slot>
0xc011dac4 <+124>: subs r9, r0, #0
0xc011dac8 <+128>: beq 0xc011dc24 <find_get_entry+476>
0xc011dacc <+132>: ldr r4, [r9]
0xc011dad0 <+136>: bl 0xc00a7b50 <debug_lockdep_rcu_enabled>
0xc011dad4 <+140>: cmp r0, #0
0xc011dad8 <+144>: beq 0xc011dae8 <find_get_entry+160>
0xc011dadc <+148>: ldrb r3, [r8, #2]
0xc011dae0 <+152>: cmp r3, #0
0xc011dae4 <+156>: beq 0xc011dbcc <find_get_entry+388>
0xc011dae8 <+160>: cmp r4, #0
0xc011daec <+164>: beq 0xc011dc24 <find_get_entry+476>
0xc011daf0 <+168>: tst r4, #3
0xc011daf4 <+172>: bne 0xc011dc4c <find_get_entry+516>
0xc011daf8 <+176>: mov r2, sp
0xc011dafc <+180>: bic r3, r2, #8128 ; 0x1fc0
0xc011db00 <+184>: bic r3, r3, #63 ; 0x3f
0xc011db04 <+188>: ldr r2, [pc, #376] ; 0xc011dc84 <find_get_entry+572>
0xc011db08 <+192>: ldr r3, [r3, #4]
0xc011db0c <+196>: and r2, r2, r3
0xc011db10 <+200>: cmp r2, #0
0xc011db14 <+204>: bne 0xc011dc68 <find_get_entry+544>
0xc011db18 <+208>: add r3, r4, #16
0xc011db1c <+212>: mcr 15, 0, r2, cr7, cr10, {5}
0xc011db20 <+216>: mov r2, #0
0xc011db24 <+220>: pld [r3]
0xc011db28 <+224>: ldrex r1, [r3]
0xc011db2c <+228>: teq r1, r2
0xc011db30 <+232>: beq 0xc011db44 <find_get_entry+252>
0xc011db34 <+236>: add r0, r1, r5
0xc011db38 <+240>: strex r12, r0, [r3]
0xc011db3c <+244>: teq r12, #0
0xc011db40 <+248>: bne 0xc011db28 <find_get_entry+224>
0xc011db44 <+252>: cmp r1, #0
0xc011db48 <+256>: beq 0xc011dab8 <find_get_entry+112>
0xc011db4c <+260>: mov r3, #0
0xc011db50 <+264>: mcr 15, 0, r3, cr7, cr10, {5}
0xc011db54 <+268>: ldr r3, [r4]
0xc011db58 <+272>: tst r3, #32768 ; 0x8000
0xc011db5c <+276>: bne 0xc011dc58 <find_get_entry+528>
0xc011db60 <+280>: ldr r3, [r9]
0xc011db64 <+284>: cmp r3, r4
0xc011db68 <+288>: bne 0xc011dc6c <find_get_entry+548>
0xc011db6c <+292>: bl 0xc00a7b50 <debug_lockdep_rcu_enabled>
0xc011db70 <+296>: cmp r0, #0
0xc011db74 <+300>: beq 0xc011db88 <find_get_entry+320>
0xc011db78 <+304>: ldr r5, [pc, #256] ; 0xc011dc80 <find_get_entry+568>
0xc011db7c <+308>: ldrb r3, [r5, #3]
0xc011db80 <+312>: cmp r3, #0
0xc011db84 <+316>: beq 0xc011dba4 <find_get_entry+348>
0xc011db88 <+320>: ldr r0, [pc, #236] ; 0xc011dc7c <find_get_entry+564>
0xc011db8c <+324>: mov r1, #1
0xc011db90 <+328>: ldr r2, [pc, #240] ; 0xc011dc88 <find_get_entry+576>
0xc011db94 <+332>: bl 0xc0096380 <lock_release>
0xc011db98 <+336>: sub sp, r11, #36 ; 0x24
0xc011db9c <+340>: mov r0, r4
0xc011dba0 <+344>: ldm sp, {r4, r5, r6, r7, r8, r9, r11, sp, pc}
0xc011dba4 <+348>: bl 0xc00aadc4 <rcu_is_watching>
0xc011dba8 <+352>: cmp r0, #0
0xc011dbac <+356>: bne 0xc011db88 <find_get_entry+320>
0xc011dbb0 <+360>: mov r3, #1
0xc011dbb4 <+364>: ldr r0, [pc, #208] ; 0xc011dc8c <find_get_entry+580>
0xc011dbb8 <+368>: ldr r1, [pc, #208] ; 0xc011dc90 <find_get_entry+584>
0xc011dbbc <+372>: ldr r2, [pc, #208] ; 0xc011dc94 <find_get_entry+588>
0xc011dbc0 <+376>: strb r3, [r5, #3]
0xc011dbc4 <+380>: bl 0xc00920cc <lockdep_rcu_suspicious>
0xc011dbc8 <+384>: b 0xc011db88 <find_get_entry+320>
0xc011dbcc <+388>: bl 0xc00a7b50 <debug_lockdep_rcu_enabled>
0xc011dbd0 <+392>: cmp r0, #0
0xc011dbd4 <+396>: beq 0xc011dae8 <find_get_entry+160>
0xc011dbd8 <+400>: bl 0xc00aadc4 <rcu_is_watching>
0xc011dbdc <+404>: cmp r0, #0
0xc011dbe0 <+408>: bne 0xc011dc2c <find_get_entry+484>
0xc011dbe4 <+412>: ldr r0, [pc, #172] ; 0xc011dc98 <find_get_entry+592>
0xc011dbe8 <+416>: mov r1, #196 ; 0xc4
0xc011dbec <+420>: ldr r2, [pc, #168] ; 0xc011dc9c <find_get_entry+596>
0xc011dbf0 <+424>: strb r5, [r8, #2]
0xc011dbf4 <+428>: bl 0xc00920cc <lockdep_rcu_suspicious>
0xc011dbf8 <+432>: b 0xc011dae8 <find_get_entry+160>
0xc011dbfc <+436>: bl 0xc00aadc4 <rcu_is_watching>
0xc011dc00 <+440>: cmp r0, #0
0xc011dc04 <+444>: bne 0xc011daac <find_get_entry+100>
0xc011dc08 <+448>: mov r3, #1
0xc011dc0c <+452>: ldr r0, [pc, #120] ; 0xc011dc8c <find_get_entry+580>
0xc011dc10 <+456>: mov r1, #844 ; 0x34c
0xc011dc14 <+460>: ldr r2, [pc, #132] ; 0xc011dca0 <find_get_entry+600>
0xc011dc18 <+464>: strb r3, [r4, #1]
0xc011dc1c <+468>: bl 0xc00920cc <lockdep_rcu_suspicious>
0xc011dc20 <+472>: b 0xc011daac <find_get_entry+100>
0xc011dc24 <+476>: mov r4, #0
0xc011dc28 <+480>: b 0xc011db6c <find_get_entry+292>
0xc011dc2c <+484>: bl 0xc00ac38c <rcu_lockdep_current_cpu_online>
0xc011dc30 <+488>: cmp r0, #0
0xc011dc34 <+492>: beq 0xc011dbe4 <find_get_entry+412>
0xc011dc38 <+496>: ldr r0, [pc, #60] ; 0xc011dc7c <find_get_entry+564>
0xc011dc3c <+500>: bl 0xc0091264 <lock_is_held>
0xc011dc40 <+504>: cmp r0, #0
0xc011dc44 <+508>: beq 0xc011dbe4 <find_get_entry+412>
0xc011dc48 <+512>: b 0xc011dae8 <find_get_entry+160>
0xc011dc4c <+516>: tst r4, #1
0xc011dc50 <+520>: beq 0xc011db6c <find_get_entry+292>
0xc011dc54 <+524>: b 0xc011dab8 <find_get_entry+112>
0xc011dc58 <+528>: mov r0, r4
0xc011dc5c <+532>: ldr r1, [pc, #64] ; 0xc011dca4 <find_get_entry+604>
0xc011dc60 <+536>: bl 0xc01254d4 <dump_page>
0xc011dc64 <+540>: ; <UNDEFINED> instruction: 0xe7f001f2
0xc011dc68 <+544>: ; <UNDEFINED> instruction: 0xe7f001f2
0xc011dc6c <+548>: mov r0, r4
0xc011dc70 <+552>: bl 0xc012db6c <put_page>
0xc011dc74 <+556>: b 0xc011dab8 <find_get_entry+112>
0xc011dc78 <+560>: andsgt sp, r1, r8, asr #20
0xc011dc7c <+564>: adcgt r2, r11, r8, lsl r2
0xc011dc80 <+568>: ldrhtgt r0, [r0], r1
0xc011dc84 <+572>: andseq pc, pc, r0, lsl #30
0xc011dc88 <+576>: andsgt sp, r1, r8, lsl #23
0xc011dc8c <+580>: addgt sp, r5, r8, lsl #5
0xc011dc90 <+584>: andeq r0, r0, sp, ror r3
0xc011dc94 <+588>: ldrdgt sp, [r5], r0
0xc011dc98 <+592>: addgt sp, r7, r8, asr #7
0xc011dc9c <+596>: addgt lr, r6, r8, lsl #17
0xc011dca0 <+600>: addgt sp, r5, r4, lsr #5
0xc011dca4 <+604>: addgt sp, r7, r4, ror #7
End of assembler dump.
Post by Johannes Weiner
I'm thinking it's not the slot pointer itself that's bad, because
__radix_tree_lookup() dereferences that to test if it's populated
before returning it, and slot life-time is guaranteed by RCU.
That would only leave garbage in the slot itself, crashing during
page_cache_get_speculative().
I'll keep staring at this change, but nothing stands out to me yet.
alright, it's pretty deterministic however. Always on the same test, no
matter which USB controller, no matter if backing store is RAM or MMC.

Those two undefined instructions on the disassembly caught my attention,
perhaps I'm facing a GCC bug ?
--
balbi
Felipe Balbi
2014-10-09 20:35:05 UTC
Hi,
Post by Felipe Balbi
Those two undefined instructions on the disassembly caught my attention,
perhaps I'm facing a GCC bug ?
no, probably not a GCC bug. Looking at your commit, however: man, it
does quite a lot of things at once. It moves code around, adds new functions
by refactoring (and changing) code, renames things, and changes int offsets
into unsigned ints. It would not be too difficult to miss a bug in there.

I'll continue digging here.
--
balbi
Rabin Vincent
2014-10-09 20:41:01 UTC
Post by Felipe Balbi
alright, it's pretty deterministic however. Always on the same test, no
matter which USB controller, no matter if backing store is RAM or MMC.
Those two undefined instructions on the disassembly caught my attention,
perhaps I'm facing a GCC bug ?
The undefined instructions are just ARM's BUG() implementation.

But did you see the question I asked you yesterday in your other thread?
http://www.spinics.net/lists/arm-kernel/msg368634.html

Here it is again:

What GCC version are you using?

4.8.1 and 4.8.2 are known to miscompile the ARM kernel and these
find_get_entry() crashes with 0xffffffff involved smell a lot like the
earlier reports from kernels built with those compilers:

https://lkml.org/lkml/2014/6/25/456
https://lkml.org/lkml/2014/6/30/375
https://lkml.org/lkml/2014/6/30/660
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
https://lkml.org/lkml/2014/5/9/330

Also, I didn't see any public email making a definitive link between GCC
PR 58854 that Nathan pointed out in https://lkml.org/lkml/2014/6/30/660
and the earlier find_get_entry() crashes, but I just built GCC 4.8.1 and
an ARM kernel with that, and the GCC bug is clearly seen in
radix_tree_lookup_slot() which returns the pointer which
find_get_entry() is dereferencing:

<radix_tree_lookup_slot>:
e1a0c00d mov ip, sp
e92dd800 push {fp, ip, lr, pc}
e24cb004 sub fp, ip, #4
e24dd008 sub sp, sp, #8
e3a02000 mov r2, #0
e24b3010 sub r3, fp, #16
ebffffc5 bl c0176ab8 <__radix_tree_lookup>
e24bd00c sub sp, fp, #12 <--- sp moved up
e3500000 cmp r0, #0
151b0010 ldrne r0, [fp, #-16] <--- load from under sp
e89da800 ldm sp, {fp, sp, pc}

Please check your compiler to make sure it's not the same problem.
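
To see why the two marked instructions are a problem, here is a toy C model of the sequence, offered as a sketch rather than as kernel code. It assumes a full-descending stack and an interrupt handler that saves its context on the interrupted stack. The names toy_cpu, toy_interrupt and toy_epilogue, the stored value 0xc0de0000 and the 0xffffffff filler are all made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy full-descending stack: sp is the index of the lowest live word. */
struct toy_cpu {
    uint32_t stack[STACK_WORDS];
    size_t sp;
};

/* Hypothetical interrupt: saves n words of context below sp, then returns.
 * Whatever was stored below sp is overwritten. */
static void toy_interrupt(struct toy_cpu *cpu, size_t n)
{
    size_t saved = cpu->sp;

    for (size_t i = 0; i < n && cpu->sp > 0; i++)
        cpu->stack[--cpu->sp] = 0xffffffffu;
    cpu->sp = saved;
}

/* Returns the value loaded from the local slot.
 * buggy != 0: release the frame, then load (the miscompiled order).
 * buggy == 0: load, then release the frame (the safe order).
 * irq   != 0: an interrupt arrives between the two steps. */
uint32_t toy_epilogue(int buggy, int irq)
{
    struct toy_cpu cpu = { .sp = STACK_WORDS };
    uint32_t result;
    size_t slot;

    cpu.sp -= 2;                      /* like "sub sp, sp, #8": two local words */
    slot = cpu.sp + 1;                /* plays the role of [fp, #-16] */
    cpu.stack[slot] = 0xc0de0000u;    /* callee stored a valid pointer here */

    if (buggy) {
        cpu.sp += 2;                  /* like "sub sp, fp, #12": locals released */
        if (irq)
            toy_interrupt(&cpu, 4);
        result = cpu.stack[slot];     /* like "ldrne r0, [fp, #-16]": below sp */
    } else {
        result = cpu.stack[slot];     /* load while the slot is still live */
        if (irq)
            toy_interrupt(&cpu, 4);
        cpu.sp += 2;
    }
    return result;
}
```

In this model toy_epilogue(1, 1) returns the filler value while the other combinations return the stored pointer, which is consistent with a crash that only shows up when an interrupt lands in a two-instruction window.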
Felipe Balbi
2014-10-09 20:46:37 UTC
Hi,
Post by Rabin Vincent
The undefined instructions are just ARM's BUG() implementation.
But did you see the question I asked you yesterday in your other thread?
http://www.spinics.net/lists/arm-kernel/msg368634.html
hmm, completely missed that, sorry. I'm using 4.8.2, will try something
else.
--
balbi
Felipe Balbi
2014-10-09 21:07:15 UTC
Hi,
Post by Felipe Balbi
hmm, completely missed that, sorry. I'm using 4.8.2, will try something
else.
seems to be working fine now, thanks. I'll leave test running overnight
just in case.

thanks again, and sorry for the noise.

PS: I wonder if we should add a warning message to the build system if
we're building with known broken versions of GCC.
--
balbi
Felipe Balbi
2014-10-10 13:57:43 UTC
Post by Felipe Balbi
seems to be working fine now, thanks. I'll leave test running overnight
just in case.
yup, ran overnight without any problems.
--
balbi
Russell King - ARM Linux
2014-10-10 16:25:31 UTC
Post by Felipe Balbi
yup, ran overnight without any problems.
Right, so GCC 4.8.{1,2} are totally unsuitable for kernel building (and
it seems that this has been known about for some time.)

We can blacklist these GCC versions quite easily. We already have GCC
3.3 blacklisted, and it's trivial to add others. I would want to include
some proper details about the bug, just like the other existing entries
we already have in asm-offsets.c, where we name the functions that the
compiler is known to break where appropriate.

However, I'm rather annoyed that there are people here who have known
for some time that GCC 4.8.1 and GCC 4.8.2 _can_ lead to filesystem
corruption, and have sat on their backsides doing nothing about getting
it blacklisted for something like a year.

When people talk about the ARM community being dysfunctional... well,
this kind of irresponsible behaviour just gives them more fodder to
throw at us.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nathan Lynch
2014-10-11 01:44:33 UTC
Post by Russell King - ARM Linux
Right, so GCC 4.8.{1,2} are totally unsuitable for kernel building (and
it seems that this has been known about for some time.)
Looking at http://gcc.gnu.org/PR58854 it seems that all 4.8.x for x < 3
are affected, as well as 4.9.0.
Post by Russell King - ARM Linux
We can blacklist these GCC versions quite easily. We already have GCC
3.3 blacklisted, and it's trivial to add others. I would want to include
some proper details about the bug, just like the other existing entries
we already have in asm-offsets.c, where we name the functions that the
compiler is known to break where appropriate.
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.

Hopefully nobody's still using gcc 4.8 from the Linaro 2013.11 toolchain
release -- since it's a 4.8.3 prerelease from before the fix was
committed you'll get GCC_VERSION == 40803 but still generate bad code.
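
The trade-off described here can be made concrete with a small C sketch. It assumes the kernel's GCC_VERSION encoding (major * 10000 + minor * 100 + patchlevel, so 4.8.3 is 40803) and treats 4.8.0 to 4.8.2 as the affected range. GCC_VERSION_CODE, version_looks_affected, struct toolchain and misclassified are hypothetical names used only for illustration, not kernel interfaces.

```c
#include <stdbool.h>

/* Same encoding as GCC_VERSION: 4.8.2 becomes 40802. */
#define GCC_VERSION_CODE(major, minor, patch) \
    ((major) * 10000 + (minor) * 100 + (patch))

/* Version-only test: true for 4.8.0 up to, but not including, 4.8.3. */
bool version_looks_affected(int version)
{
    return version >= GCC_VERSION_CODE(4, 8, 0) &&
           version <  GCC_VERSION_CODE(4, 8, 3);
}

struct toolchain {
    const char *name;
    int version;     /* what the compiler reports */
    bool has_fix;    /* whether the PR58854 fix is actually present */
};

/* The version test is wrong when it flags a fixed compiler (false
 * positive) or passes an unfixed one (false negative). */
bool misclassified(const struct toolchain *tc)
{
    return version_looks_affected(tc->version) == tc->has_fix;
}
```

Under these assumptions a patched 4.8.2 (reports 40802, has the fix) and the Linaro 2013.11 prerelease (reports 40803, lacks the fix) both come out as misclassified, while a stock 4.8.2 is classified correctly.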
Post by Russell King - ARM Linux
However, I'm rather annoyed that there are people here who have known
for some time that GCC 4.8.1 and GCC 4.8.2 _can_ lead to filesystem
corruption, and have sat on their backsides doing nothing about getting
it blacklisted for something like a year.
Mea culpa, although I hadn't drawn the connection to FS corruption
reports until now. I have known about the issue for some time, but
figured the prevalence of the fix in downstream projects largely
mitigated the issue.

Peter Hurley
2014-10-11 02:40:43 UTC
Post by Nathan Lynch
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.
Providing a manual switch to override blacklisting is way more sane
than a build warning that no one's looking at.

Peter Chen
2014-10-11 03:54:32 UTC
Post by Nathan Lynch
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.
Yocto has a patch for the PR58854 problem:

http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/gcc/gcc-4.8/0048-PR58854_fix_arm_apcs_epilogue.patch?h=daisy
--
Best Regards,
Peter Chen
Russell King - ARM Linux
2014-10-11 14:16:38 UTC
Post by Peter Chen
Yocto has a patch for the PR58854 problem:
http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/gcc/gcc-4.8/0048-PR58854_fix_arm_apcs_epilogue.patch?h=daisy
Right, and we can provide links to these in the comments above the #error
so people have the right places to do a bit of research into whether their
compiler is safe.

It is unfortunate that they are indistinguishable from the broken versions,
but that's really a distro problem for causing that issue themselves -
especially given how serious this bug is.
Otavio Salvador
2014-10-11 14:51:23 UTC
Hello Russell,

On Sat, Oct 11, 2014 at 11:16 AM, Russell King - ARM Linux
Post by Russell King - ARM Linux
Right, and we can provide links to these in the comments above the #error
so people have the right places to do a bit of research into whether their
compiler is safe.
It is unfortunate that they are indistinguishable from the broken versions,
but that's really a distro problem for causing that issue themselves -
especially given how serious this bug is.
What about checking if GCC_PR58854_FIXED is not defined for error? So
build systems and people could easily define it if they know their GCC
has the fix applied.
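
One way this suggestion might look, as a sketch only. GCC_PR58854_FIXED is the symbol proposed above; the __arm__ guard, the 4.8.0 to 4.8.2 range, the message text and the helper build_rejected (a runtime mirror of the preprocessor condition, added so the logic can be exercised) are assumptions for illustration.

```c
#include <stdbool.h>

/* Same formula the kernel uses, repeated so the fragment stands alone. */
#ifndef GCC_VERSION
#define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
#endif

/* Refuse to build with a suspect compiler unless it is marked as fixed. */
#if defined(__arm__) && !defined(GCC_PR58854_FIXED) && \
    GCC_VERSION >= 40800 && GCC_VERSION < 40803
#error "GCC 4.8.0 to 4.8.2 may miscompile ARM epilogues (PR58854); define GCC_PR58854_FIXED only if your compiler carries the fix"
#endif

/* Runtime mirror of the condition above (hypothetical helper). */
bool build_rejected(int version, bool marked_fixed)
{
    return version >= 40800 && version < 40803 && !marked_fixed;
}
```

A build system that knows its compiler is patched would pass -DGCC_PR58854_FIXED. Note that this override does nothing for an unfixed compiler that reports 4.8.3, since the version test never fires for it.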
--
Otavio Salvador O.S. Systems
http://www.ossystems.com.br http://code.ossystems.com.br
Mobile: +55 (53) 9981-7854 Mobile: +1 (347) 903-9750
Peter Hurley
2014-10-11 18:15:37 UTC
Post by Otavio Salvador
What about checking if GCC_PR58854_FIXED is not defined for error? So
build systems and people could easily define it if they know their GCC
has the fix applied.
If the distro/build system/individual is capable of patching gcc, then it
seems reasonable that the same distro/build system/individual is capable
of carrying a patch on top of mainline kernel for building with their
"special" compiler.
Russell King - ARM Linux
2014-10-11 14:14:10 UTC
Post by Nathan Lynch
Post by Russell King - ARM Linux
We can blacklist these GCC versions quite easily. We already have GCC
3.3 blacklisted, and it's trivial to add others. I would want to include
some proper details about the bug, just like the other existing entries
we already have in asm-offsets.c, where we name the functions that the
compiler is known to break where appropriate.
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.
I wish that it was possible to just do the warning thing, but unfortunately
the evidence is that many people ignore compiler warnings, because they see
them appearing from the kernel so often that they have become de-sensitised
to them.

This is pretty obvious from the various nightly build systems which produce
the same warnings for months without any progress on them - some of them
can be quite serious (oops-able) where printf format strings are concerned.
Post by Nathan Lynch
Post by Russell King - ARM Linux
However, I'm rather annoyed that there are people here who have known
for some time that GCC 4.8.1 and GCC 4.8.2 _can_ lead to filesystem
corruption, and have sat on their backsides doing nothing about getting
it blacklisted for something like a year.
Mea culpa, although I hadn't drawn the connection to FS corruption
reports until now. I have known about the issue for some time, but
figured the prevalence of the fix in downstream projects largely
mitigated the issue.
It's the FS corruption which swings it in favour of a #error - even if
we have a bunch of compilers around with that version which have the
problem fixed, it's /far/ better to #error out. Those people who know
definitely that they have a fixed compiler can comment out the test
after checking that they do indeed have a fixed version, or are willing
to take the risk.

What we can't do is have kernels built by people who then run into FS
corruption because of this known issue.
Nathan Lynch
2014-10-11 19:27:36 UTC
Post by Nathan Lynch
Post by Russell King - ARM Linux
Right, so GCC 4.8.{1,2} are totally unsuitable for kernel building (and
it seems that this has been known about for some time.)
Looking at http://gcc.gnu.org/PR58854 it seems that all 4.8.x for x < 3
are affected, as well as 4.9.0.
Correction -- 4.9.0 has this fixed, even though the GCC PR shows it as a
"known to fail" version.

David Laight
2014-10-13 09:11:34 UTC
From: Nathan Lynch
Post by Nathan Lynch
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.
Is it possible to compile a small code fragment and check the generated
code for the bug?
Possibly predicated on the broken version number to avoid false positives.

David



Russell King - ARM Linux
2014-10-13 11:43:07 UTC
Post by David Laight
Is it possible to compile a small code fragment and check the generated
code for the bug?
Possibly predicated on the broken version number to avoid false positives.
I don't see how - it looks like it requires an interrupt to occur at an
opportune moment to provoke the function to fail. The alternative would
be to parse the assembly generated by the compiler to determine how it
is dealing with the stack.
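
For a sense of what the assembly-parsing alternative would involve, here is a deliberately narrow C sketch. It recognises only the textual pattern from the radix_tree_lookup_slot listing earlier in the thread: a "sub sp, fp, #K" followed later by a load from [fp, #-N] with N greater than K. The function name loads_below_sp and the assumed one-instruction-per-string input format are hypothetical; real compiler output varies far more than this handles.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if, after an instruction of the form "sub sp, fp, #K", a later
 * ldr-family instruction addresses [fp, #-N] with N > K, i.e. memory that
 * now lies below the stack pointer. Returns 0 otherwise. */
int loads_below_sp(const char *const *insns, size_t count)
{
    int sp_reset = 0;
    long k = 0;

    for (size_t i = 0; i < count; i++) {
        const char *mem;
        long n;

        /* Epilogue-style stack pointer reset relative to fp. */
        if (sscanf(insns[i], "sub sp, fp, #%ld", &k) == 1) {
            sp_reset = 1;
            continue;
        }
        /* Only loads issued after the reset are of interest. */
        if (!sp_reset || strncmp(insns[i], "ldr", 3) != 0)
            continue;

        mem = strstr(insns[i], "[fp, #-");
        if (mem && sscanf(mem, "[fp, #-%ld", &n) == 1 && n > k)
            return 1;
    }
    return 0;
}
```

Fed the five instructions that follow the call in that listing, this reports a load below sp; with the load moved ahead of the "sub sp, fp, #12" it does not. Anything that reaches the same slot through a different register or addressing form would slip past it, which illustrates how fragile text matching of this kind is.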

I think the only viable solution here is that:

1. We blacklist the bad compiler versions outright in the kernel.
2. We /consider/ testing a preprocessor symbol which, when present,
indicates that these versions are fixed and should not be blacklisted.

The argument for (2) is that /if/ distros want to patch their compilers
to fix the problem, they /also/ have the ability to patch their compilers
to make them identifiable, and that is a far more reliable solution than
trying to parse the assembly output from multiple different GCC versions.

Remember, it's the distro's choice to fix these buggy compilers, so the
onus is on _them_ to deal with the mess they've created by doing so.
Greg KH
2014-10-14 02:06:40 UTC
Post by Russell King - ARM Linux
1. We blacklist the bad compiler versions outright in the kernel.
Yes, please do this, it's what we have done for other buggy compiler
versions, no need to do something different here.
Post by Russell King - ARM Linux
Remember, it's the distro's choice to fix these buggy compilers, so the
onus is on _them_ to deal with the mess they've created by doing so.
I totally agree.

Is someone going to send this patch, or do I have to write it myself?

thanks,

greg k-h
Peter Hurley
2014-10-14 10:27:52 UTC
Permalink
Post by Greg KH
Post by Russell King - ARM Linux
Post by David Laight
From: Nathan Lynch
Post by Nathan Lynch
Post by Russell King - ARM Linux
Right, so GCC 4.8.{1,2} are totally unsuitable for kernel building (and
it seems that this has been known about for some time.)
Looking at http://gcc.gnu.org/PR58854 it seems that all 4.8.x for x < 3
are affected, as well as 4.9.0.
Post by Russell King - ARM Linux
We can blacklist these GCC versions quite easily. We already have GCC
3.3 blacklisted, and it's trivial to add others. I would want to include
some proper details about the bug, just like the other existing entries
we already have in asm-offsets.c, where we name the functions that the
compiler is known to break where appropriate.
Before blacklisting anything, it's worth considering that simple version
checks would break existing pre-4.8.3 compilers that have been patched
for PR58854. It looks like Yocto and Buildroot issued releases with
patched 4.8.2 compilers well before the (fixed) 4.8.3 release. I think
the most we can reasonably do without breaking some correctly-behaving
toolchains is to emit a warning.
Is it possible to compile a small code fragment and check the generated
code for the bug?
Possibly predicated on the broken version number to avoid false positives.
I don't see how - it looks like it requires an interrupt to occur at an
opportune moment to provoke the function to fail. The alternative would
be to parse the assembly generated by the compiler to determine how it
is dealing with the stack.
1. We blacklist the bad compiler versions outright in the kernel.
Yes, please do this, it's what we have done for other buggy compiler
versions, no need to do something different here.
Post by Russell King - ARM Linux
Remember, it's the distro's choice to fix these buggy compilers, so the
onus is on _them_ to deal with the mess they've created by doing so.
I totally agree.
Is someone going to send this patch, or do I have to write it myself?
I did on Friday (arm: Blacklist gcc 4.8.[012] ...) but Russell said he
was doing it himself.

Regards,
Peter Hurley
Russell King - ARM Linux
2014-10-15 21:23:10 UTC
Permalink
Post by Greg KH
Post by Russell King - ARM Linux
1. We blacklist the bad compiler versions outright in the kernel.
Yes, please do this, it's what we have done for other buggy compiler
versions, no need to do something different here.
Post by Russell King - ARM Linux
Remember, it's the distro's choice to fix these buggy compilers, so the
onus is on _them_ to deal with the mess they've created by doing so.
I totally agree.
Is someone going to send this patch, or do I have to write it myself?
As I said, I have a patch in progress, but it seems that there needed
to be some discussion about exactly which compiler versions are affected.
It seems that it's not as trivial as looking at the GCC bug entry.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
Russell King - ARM Linux
2014-10-15 21:25:13 UTC
Permalink
Post by Russell King - ARM Linux
As I said, I have a patch in progress, but it seems that there needed
to be some discussion about exactly which compiler versions are affected.
It seems that it's not as trivial as looking at the GCC bug entry.
... and in any case, it has been a known bug for well over a year now,
and it seems that it doesn't affect _that_ many people. So taking some
extra time to get it properly correct is the _right_ thing to do.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
Russell King - ARM Linux
2014-10-19 09:54:16 UTC
Permalink
Post by Russell King - ARM Linux
Post by Russell King - ARM Linux
As I said, I have a patch in progress, but it seems that there needed
to be some discussion about exactly which compiler versions are affected.
It seems that it's not as trivial as looking at the GCC bug entry.
... and in any case, it has been a known bug for well over a year now,
and it seems that it doesn't affect _that_ many people. So taking some
extra time to get it properly correct is the _right_ thing to do.
Well, this is just great. Pushing out the change which blacklists these
compilers takes out Olof's kernel build system...

Things are not as trivial as they seem.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
Felipe Balbi
2014-10-19 15:28:01 UTC
Permalink
Hi,
Post by Russell King - ARM Linux
Post by Russell King - ARM Linux
Post by Russell King - ARM Linux
As I said, I have a patch in progress, but it seems that there needed
to be some discussion about exactly which compiler versions are affected.
It seems that it's not as trivial as looking at the GCC bug entry.
... and in any case, it has been a known bug for well over a year now,
and it seems that it doesn't affect _that_ many people. So taking some
extra time to get it properly correct is the _right_ thing to do.
Well, this is just great. Pushing out the change which blacklists these
compilers takes out Olof's kernel build system...
Things are not as trivial as they seem.
Maybe Olof just needs to update his compiler. Olof ?
--
balbi
Olof Johansson
2014-10-19 20:48:12 UTC
Permalink
Post by Felipe Balbi
Hi,
Post by Russell King - ARM Linux
Post by Russell King - ARM Linux
Post by Russell King - ARM Linux
As I said, I have a patch in progress, but it seems that there needed
to be some discussion about exactly which compiler versions are affected.
It seems that it's not as trivial as looking at the GCC bug entry.
... and in any case, it has been a known bug for well over a year now,
and it seems that it doesn't affect _that_ many people. So taking some
extra time to get it properly correct is the _right_ thing to do.
Well, this is just great. Pushing out the change which blacklists these
compilers takes out Olof's kernel build system...
Things are not as trivial as they seem.
Maybe Olof just needs to update his compiler. Olof ?
Yep, doing a run with 4.9.1 to see how it looks. In the past, 4.9 has
been really noisy with warnings, maybe most of them have been fixed by
now.


-Olof
Aaro Koskinen
2014-10-09 21:47:06 UTC
Permalink
Hi,
Post by Rabin Vincent
What GCC version are you using?
4.8.1 and 4.8.2 are known to miscompile the ARM kernel and these
find_get_entry() crashes with 0xffffffff involved smell a lot like the
https://lkml.org/lkml/2014/6/25/456
https://lkml.org/lkml/2014/6/30/375
https://lkml.org/lkml/2014/6/30/660
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
https://lkml.org/lkml/2014/5/9/330
Is it possible to blacklist those GCC versions on ARM somehow as it
seems people are still using them?

This bug also ruined a file system on one of my boxes last year
(see e.g. http://marc.info/?l=linux-arm-kernel&m=139033442527244&w=2).

A.
Russell King - ARM Linux
2014-10-10 16:18:35 UTC
Permalink
Post by Aaro Koskinen
Hi,
Post by Rabin Vincent
What GCC version are you using?
4.8.1 and 4.8.2 are known to miscompile the ARM kernel and these
find_get_entry() crashes with 0xffffffff involved smell a lot like the
https://lkml.org/lkml/2014/6/25/456
https://lkml.org/lkml/2014/6/30/375
https://lkml.org/lkml/2014/6/30/660
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
https://lkml.org/lkml/2014/5/9/330
Is it possible to blacklist those GCC versions on ARM somehow as it
seems people are still using them?
This bug also ruined a file system on one of my boxes last year
(see e.g. http://marc.info/?l=linux-arm-kernel&m=139033442527244&w=2).
Given that, why the fsck (pun intended) did you not shout a little louder
about getting it blacklisted? Looking at your marc.info URL, there's
very little information there which hints at filesystem corruption, and
it's a thread of only *one* message according to marc.info.

Even _if_ I did read the message you point to above, that on its own did
not hint at filesystem corruption.

So, would you please mind passing on further details about this,
specifically which function in the ext4 code is affected, so it can
be properly written up.

Thanks.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
Aaro Koskinen
2014-10-10 20:52:34 UTC
Permalink
Post by Russell King - ARM Linux
Post by Aaro Koskinen
Post by Rabin Vincent
What GCC version are you using?
4.8.1 and 4.8.2 are known to miscompile the ARM kernel and these
find_get_entry() crashes with 0xffffffff involved smell a lot like the
https://lkml.org/lkml/2014/6/25/456
https://lkml.org/lkml/2014/6/30/375
https://lkml.org/lkml/2014/6/30/660
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
https://lkml.org/lkml/2014/5/9/330
Is it possible to blacklist those GCC versions on ARM somehow as it
seems people are still using them?
This bug also ruined a file system on one of my boxes last year
(see e.g. http://marc.info/?l=linux-arm-kernel&m=139033442527244&w=2).
Given that, why the fsck (pun intended) did you not shout a little louder
about getting it blacklisted? Looking at your marc.info URL, there's
very little information there which hints at filesystem corruption, and
it's a thread of only *one* message according to marc.info.
Even _if_ I did read the message you point to above, that on its own did
not hint at filesystem corruption.
So, would you please mind passing on further details about this,
specifically which function in the ext4 code is affected, so it can
be properly written up.
I have not done any proper deeper analysis. After I first mailed about
the issue I just downgraded GCC and pretty much forgot about it until
an engineer from some commercial Linux vendor replied privately months
later and kindly pointed me to the needed GCC fix (which I then shared
in the reply). Then I just moved on using a newer GCC with no issues.
Obviously this was not a widespread problem since no one else
reported the same.

Today I again booted a kernel compiled with GCC 4.8.2 and was still able
to reproduce the issue, and I think the log below shows that at least ext3
can easily end up in an inconsistent state using these compiler versions:

0) Run the bad kernel:

~ # dmesg|grep GCC
[ 0.000000] Linux version 3.17.0-mvebu-los_9755+ (***@cooljazz) (gcc version 4.8.2 (GCC) ) #1 Fri Oct 10 21:05:20 EEST 2014

1) Start with a small ext3 (writeback) fs containing the gcc tarball:

/mnt/test # ls -l
total 84092
-rw-r--r-- 1 root root 85999682 Apr 24 21:52 gcc-4.8.2.tar.bz2
drwx------ 2 root root 16384 Oct 10 10:33 lost+found
/mnt/test # df -h .
Filesystem Size Used Available Use% Mounted on
/dev/sdc1 3.8G 90.2M 3.5G 2% /mnt/test

2) Extract, delete & crash:

/mnt/test # tar xjf gcc-4.8.2.tar.bz2
/mnt/test # rm -rf gcc-4.8.2
rm: can't remove 'gcc-4.8.2/libgfortran/generated': Directory not empty
rm: can't remove 'gcc-4.8.2/libgfortran': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg/compat/struct-by-value-18a_y.c': No such file or directory
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg/compat': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gfortran.dg/result_default_init_1.f90': No such file or directory
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gfortran.dg': Directory not empty
[ 960.864433] Unable to handle kernel paging request at virtual address ffffffff
[ 960.930597] pgd = df6e0000
[ 960.990849] [ffffffff] *pgd=1fffd831, *pte=00000000, *ppte=00000000
[ 961.056512] Internal error: Oops: 1 [#1] ARM
[ 961.120063] Modules linked in:
[ 961.180974] CPU: 0 PID: 684 Comm: rm Not tainted 3.17.0-mvebu-los_9755+ #1
[ 961.247146] task: df447b00 ti: df4de000 task.ti: df4de000
[ 961.311524] PC is at find_get_entry+0x28/0x84
[ 961.375037] LR is at radix_tree_lookup_slot+0x1c/0x2c
[ 961.439061] pc : [<c006e418>] lr : [<c018392c>] psr: a0000013
[ 961.439061] sp : df4dfc68 ip : 00000000 fp : df4dfc7c
[ 961.570018] r10: 00000001 r9 : c04e3253 r8 : df020b60
[ 961.634596] r7 : 0009001a r6 : 00000000 r5 : 0009001a r4 : df020c90
[ 961.700070] r3 : ffffffff r2 : 00000000 r1 : 0009001a r0 : ffffffff
[ 961.764437] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 961.830518] Control: 0005317f Table: 1f6e0000 DAC: 00000015
[ 961.895866] Process rm (pid: 684, stack limit = 0xdf4de1c0)
[ 961.960597] Stack: (0xdf4dfc68 to 0xdf4e0000)
[ 962.022968] fc60: 00000001 df020c8c df4dfcb4 df4dfc80 c006ef68 c006e400
[ 962.091214] fc80: c00d4e80 c00d4764 00001000 0009001a 00000000 00000000 df020b60 df020b60
[ 962.159490] fca0: df020bd8 df04e4d8 df4dfd04 df4dfcb8 c00d34c0 c006ef44 00000000 df4dfcc8
[ 962.226940] fcc0: c00d4e80 c00d4764 00001000 00000001 df4dfd84 dd1c73f0 00090306 00000000
[ 962.295558] fce0: 00090068 00000000 00000000 df020b60 df04e4d8 00000181 df4dfd4c df4dfd08
[ 962.364710] fd00: c00d4828 c00d347c 00000000 00000001 df4dfdc4 dd1c73f0 00000000 00000000
[ 962.433394] fd20: 00000000 00000000 df4dfd84 00090002 00001000 dbaa2200 df020b60 df04e4d8
[ 962.501810] fd40: df4dfdbc df4dfd50 c00d4e80 c00d4764 00001000 df4dfd60 c0141284 c0148708
[ 962.569685] fd60: 0009001a 00000000 c0ebc7c0 df041180 00000002 00000000 df4dfd9c df4dfd88
[ 962.639143] fd80: c003813c c0038084 df041180 df0b7320 df4dfdac 00090002 00000000 dbaa2200
[ 962.708562] fda0: df4dfe4c df04e4d8 00000181 df04e4d8 df4dfe24 df4dfdc0 c01087c0 c00d4e6c
[ 962.778108] fdc0: 00001000 c038caf8 0000128f 00000000 00000000 00011000 00000001 c9c59740
[ 962.846670] fde0: 0009001a 00000000 00000a26 c824f240 00000010 00000000 df4dfe1c df04e4d8
[ 962.913956] fe00: df04e4d8 df4dfe4c de53cf40 de53cf40 00000000 df04e4d8 df4dfe44 df4dfe28
[ 962.980679] fe20: c010c5a8 c01086c4 df04e4d8 dee12000 dbaa2200 df04e4b4 df4dfe84 df4dfe48
[ 963.046696] fe40: c0115dc4 c010c584 dd1c73f0 00000000 00000100 00000012 00000000 c0fbfe00
[ 963.112648] fe60: df04e4d8 dd1c73f0 de53cf40 00000000 df4dff04 df04e4d8 df4dfecc df4dfe88
[ 963.178402] fe80: c0116b24 c0115ce0 00000000 c00b3b24 df4dfeac c067b174 5437d0a4 22921900
[ 963.244947] fea0: df4dfecc df4dfeb0 c00b7a50 c19ca440 df04e4d8 df04e534 dd1c73f0 000b6650
[ 963.311517] fec0: df4dfefc df4dfed0 c00b7e4c c01168d8 df4dfefc df4dfee0 c19ca440 00000000
[ 963.377319] fee0: df4e6000 00000000 000b6650 ffffff9c df4dff94 df4dff00 c00b80b0 c00b7d94
[ 963.443083] ff00: 5437d035 00000000 dba4a8d0 d899f6e8 78ae7ba4 0000000d df4e603c 0000000c
[ 963.509416] ff20: 00000000 c0009624 dd1c73f0 00000000 00000004 00000038 00000000 00000000
[ 963.575556] ff40: 00024182 00000000 00800021 c04c81b4 00000001 000003e8 000003e8 00000000
[ 963.641281] ff60: 0000024d 00000000 4bfad53f 000b6650 00000008 0000000c 0000000a c0009624
[ 963.707194] ff80: df4de000 00000000 df4dffa4 df4dff98 c00b8e20 c00b7ed0 00000000 df4dffa8
[ 963.773584] ffa0: c00094c0 c00b8e18 000b6650 00000008 000b6650 bed03990 bed03990 00008000
[ 963.841022] ffc0: 000b6650 00000008 0000000c 0000000a 000b6650 00000000 b6fcc000 00000000
[ 963.907530] ffe0: 00093224 bed0398c 00071284 b6efa39c 60000010 000b6650 0000ffff 0000ffff
[ 963.973653] Backtrace:
[ 964.032680] [<c006e3f0>] (find_get_entry) from [<c006ef68>] (pagecache_get_page+0x34/0x1fc)
[ 964.100751] r5:df020c8c r4:00000001
[ 964.162591] [<c006ef34>] (pagecache_get_page) from [<c00d34c0>] (__find_get_block_slow+0x54/0x16c)
[ 964.291505] r10:df04e4d8 r9:df020bd8 r8:df020b60 r7:df020b60 r6:00000000 r5:00000000
[ 964.361857] r4:0009001a
[ 964.425342] [<c00d346c>] (__find_get_block_slow) from [<c00d4828>] (__find_get_block+0xd4/0x1e4)
[ 964.498345] r9:00000181 r8:df04e4d8 r7:df020b60 r6:00000000 r5:00000000 r4:00090068
[ 964.570979] [<c00d4754>] (__find_get_block) from [<c00d4e80>] (__getblk+0x24/0x358)
[ 964.643833] r8:df04e4d8 r7:df020b60 r6:dbaa2200 r5:00001000 r4:00090002
[ 964.716031] [<c00d4e5c>] (__getblk) from [<c01087c0>] (__ext4_get_inode_loc+0x10c/0x454)
[ 964.790734] r10:df04e4d8 r9:00000181 r8:df04e4d8 r7:df4dfe4c r6:dbaa2200 r5:00000000
[ 964.865945] r4:00090002
[ 964.934187] [<c01086b4>] (__ext4_get_inode_loc) from [<c010c5a8>] (ext4_reserve_inode_write+0x34/0x9c)
[ 965.080216] r10:df04e4d8 r9:00000000 r8:de53cf40 r7:de53cf40 r6:df4dfe4c r5:df04e4d8
[ 965.159656] r4:df04e4d8
[ 965.232230] [<c010c574>] (ext4_reserve_inode_write) from [<c0115dc4>] (ext4_orphan_add+0xf4/0x218)
[ 965.385687] r7:df04e4b4 r6:dbaa2200 r5:dee12000 r4:df04e4d8
[ 965.464523] [<c0115cd0>] (ext4_orphan_add) from [<c0116b24>] (ext4_unlink+0x25c/0x26c)
[ 965.547430] r10:df04e4d8 r9:df4dff04 r8:00000000 r7:de53cf40 r6:dd1c73f0 r5:df04e4d8
[ 965.631429] r4:c0fbfe00
[ 965.708445] [<c01168c8>] (ext4_unlink) from [<c00b7e4c>] (vfs_unlink+0xc8/0x13c)
[ 965.792677] r8:000b6650 r7:dd1c73f0 r6:df04e534 r5:df04e4d8 r4:c19ca440
[ 965.877297] [<c00b7d84>] (vfs_unlink) from [<c00b80b0>] (do_unlinkat+0x1f0/0x210)
[ 965.963851] r9:ffffff9c r8:000b6650 r7:00000000 r6:df4e6000 r5:00000000 r4:c19ca440
[ 966.051666] [<c00b7ec0>] (do_unlinkat) from [<c00b8e20>] (SyS_unlink+0x18/0x1c)
[ 966.139262] r10:00000000 r9:df4de000 r8:c0009624 r7:0000000a r6:0000000c r5:00000008
[ 966.228970] r4:000b6650
[ 966.311776] [<c00b8e08>] (SyS_unlink) from [<c00094c0>] (ret_fast_syscall+0x0/0x2c)
[ 966.401452] Code: e1a01005 eb04553f e2503000 0a00000f (e5930000)
[ 966.608250] ---[ end trace a1b54af48fda09ed ]---
[ 966.693854] Kernel panic - not syncing: Fatal exception
[ 966.781707] ---[ end Kernel panic - not syncing: Fatal exception

3) Boot a good kernel:

~ # dmesg | grep GCC
[ 0.000000] Linux version 3.17.0-mvebu-los_1b42 (***@cooljazz) (gcc version 4.9.1 (GCC) ) #1 Thu Oct 9 06:46:07 EEST 2014

4) Use the aforementioned file system and try to clean up the mess:

/mnt/test # df -h .
Filesystem Size Used Available Use% Mounted on
/dev/sdc1 3.8G 796.2M 2.8G 22% /mnt/test
/mnt/test # rm -rf gcc-4.8.2
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gfortran.dg': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc': Directory not empty
rm: can't remove 'gcc-4.8.2': Directory not empty
/mnt/test # rm -rf gcc-4.8.2
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gcc.dg': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite/gfortran.dg': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc/testsuite': Directory not empty
rm: can't remove 'gcc-4.8.2/gcc': Directory not empty
rm: can't remove 'gcc-4.8.2': Directory not empty
/mnt/test # df -h .
Filesystem Size Used Available Use% Mounted on
/dev/sdc1 3.8G 90.5M 3.5G 2% /mnt/test
/mnt/test # find gcc-4.8.2
gcc-4.8.2
gcc-4.8.2/gcc
gcc-4.8.2/gcc/testsuite
gcc-4.8.2/gcc/testsuite/gcc.dg
gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa
find: gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa/forwprop-8.c: No such file or directory
gcc-4.8.2/gcc/testsuite/gfortran.dg
find: gcc-4.8.2/gcc/testsuite/gfortran.dg/result_default_init_1.f90: No such file or directory

5) fsck to the rescue:

/mnt/test # cd /
~ # umount /mnt/test
~ # fsck /dev/sdc1
fsck 1.42.9 (28-Dec-2013)
e2fsck 1.42.9 (28-Dec-2013)
/dev/sdc1: clean, 21/262144 files, 72408/1048576 blocks
~ # fsck -f /dev/sdc1
fsck 1.42.9 (28-Dec-2013)
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Problem in HTREE directory inode 118267: block #4 has bad min hash
Problem in HTREE directory inode 118267: block #26 has bad max hash
Invalid HTREE directory inode 118267 (/gcc-4.8.2/gcc/testsuite/gfortran.dg). Clear HTree index<y>? yes
Problem in HTREE directory inode 174218: block #8 has bad min hash
Invalid HTREE directory inode 174218 (/gcc-4.8.2/gcc/testsuite/gcc.dg/tree-ssa). Clear HTree index<y>? yes
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdc1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdc1: 21/262144 files (19.0% non-contiguous), 72368/1048576 blocks
~ # mount /dev/sdc1 /mnt/
~ # rm -rf /mnt/gcc-4.8.2
~ #

So in this case fsck was able to fix it.

A.