Discussion:
kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
Paweł Sikora
2011-10-12 18:12:33 UTC
Permalink
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.


in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.

on my dual-opteron machines i have non-standard settings:
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
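As a rough aside on what those two sysctls imply on a 3.0-era kernel (an illustration, not part of the original report): with vm.overcommit_memory=2 the commit limit is SwapTotal plus RAM scaled by vm.overcommit_ratio, so no swap and a ratio of 100 cap committed memory at roughly the size of RAM. A minimal userspace sketch of that arithmetic, assuming the documented /proc path:

/*
 * Sketch only: estimate the strict-overcommit CommitLimit implied by
 * vm.overcommit_memory=2, i.e. SwapTotal + RAM * overcommit_ratio / 100.
 * With no swap and a ratio of 100 this is roughly all of RAM.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        long phys_pages = sysconf(_SC_PHYS_PAGES);
        long ratio = 100;       /* fallback matching the setting quoted above */
        FILE *f = fopen("/proc/sys/vm/overcommit_ratio", "r");

        if (f) {
                if (fscanf(f, "%ld", &ratio) != 1)
                        ratio = 100;
                fclose(f);
        }

        /* CommitLimit in kB, assuming SwapTotal == 0 as in this report */
        long long limit_kb = (long long)phys_pages * page_size / 1024 * ratio / 100;

        printf("approx CommitLimit with no swap: %lld kB\n", limit_kb);
        return 0;
}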

after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)

Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
Oct 9 08:06:43 hal kernel: [408578.629143] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] Stack:
Oct 9 08:06:43 hal kernel: [408578.629143] 00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] Call Trace:
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.283367] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] Stack:
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807] 0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
Oct 9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30

BR,
Paweł.
Hugh Dickins
2011-10-13 23:16:01 UTC
Permalink
[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
Post by Paweł Sikora
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.
Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.
Post by Paweł Sikora
in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).
Post by Paweł Sikora
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
Post by Paweł Sikora
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] 00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807] 0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
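For what it's worth, here is a rough sketch of the kind of testcase being discussed (illustrative only, not Hugh's; it needs root for /proc/sys/vm/compact_memory, which exists when CONFIG_COMPACTION=y, and the sizes and timing are arbitrary): hammer mremap() on an anonymous region while forced compaction migrates its pages underneath; a hit would surface as the swapops.h:105 BUG on a later fault.

/*
 * Illustrative sketch only: hammer mremap() on an anonymous region while
 * compaction migrates pages underneath it.  Needs root (to poke
 * /proc/sys/vm/compact_memory, present when CONFIG_COMPACTION=y); sizes,
 * names and timing are arbitrary.  Build with: gcc -O2 -pthread test.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (256UL << 20)        /* 256MB working set */

static void *compactor(void *unused)
{
        for (;;) {
                FILE *f = fopen("/proc/sys/vm/compact_memory", "w");
                if (f) {
                        fputs("1", f);  /* force compaction, i.e. page migration */
                        fclose(f);
                }
                usleep(1000);
        }
        return NULL;
}

int main(void)
{
        /* reserve two slots so MREMAP_FIXED always has a destination */
        char *slots = mmap(NULL, 2 * SZ, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (slots == MAP_FAILED) {
                perror("mmap slots");
                return 1;
        }
        char *a = slots, *b = slots + SZ;
        char *buf = mmap(a, SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap buf");
                return 1;
        }
#ifdef MADV_HUGEPAGE
        madvise(buf, SZ, MADV_HUGEPAGE);        /* THP was in use in the report */
#endif
        memset(buf, 1, SZ);

        pthread_t t;
        pthread_create(&t, NULL, compactor, NULL);

        for (;;) {
                char *dst = (buf == a) ? b : a;
                /* move the whole range back and forth between the two slots */
                buf = mremap(buf, SZ, SZ, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
                if (buf == MAP_FAILED) {
                        perror("mremap");
                        return 1;
                }
                memset(buf, 1, SZ);     /* fault the pages at the new address */
        }
}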


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <***@agmk.net>
Cc: ***@kernel.org
Signed-off-by: Hugh Dickins <***@google.com>
---

mm/mremap.c | 5 +++++
1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 	spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Christoph Hellwig
2011-10-16 16:11:08 UTC
Permalink
Btw,

Anders Ossowicki reported a very similar soft lockup on 2.6.38 recently,
although without a BUG_ON beforehand.

Here is the pointer: https://lkml.org/lkml/2011/10/11/87

Andrea Arcangeli
2011-10-16 23:54:42 UTC
Permalink
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.

This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.

copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.

Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.

There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.

Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vm_pgoff, the pgoff must be different, so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.

Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.
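To make that ordering concrete, here is a hypothetical outline of move_vma()'s sequence as described above (paraphrased, not the real mm/mremap.c; the wrapper name and exact argument handling are illustrative, only copy_vma, move_page_tables and do_munmap are the real entry points being referred to):

/*
 * Hypothetical outline (not the real mm/mremap.c) of the ordering Andrea
 * describes for mremap's move path in 3.0/3.1; error handling, accounting
 * and hugepage details are omitted.
 */
static unsigned long move_vma_outline(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long old_len,
		unsigned long new_len, unsigned long new_addr)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long new_pgoff = vma->vm_pgoff +
			((old_addr - vma->vm_start) >> PAGE_SHIFT);
	struct vm_area_struct *new_vma;

	/*
	 * 1. copy_vma() creates the dst vma with the pgoff matching the
	 *    pages' page->index, and vma_link() inserts it into the anon_vma
	 *    chain under the anon_vma lock -- before any pte is touched, so
	 *    an rmap walk can already reach the new range.
	 */
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff);

	/*
	 * 2. Only then are ptes transferred, each one under both the src
	 *    and dst page table locks in move_ptes().
	 */
	move_page_tables(vma, old_addr, new_vma, new_addr, old_len);

	/*
	 * 3. The src vma is torn down last, like an ordinary munmap.
	 */
	do_munmap(mm, old_addr, old_len);
	return new_addr;
}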

Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).

The only way to fix this is to prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.

The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with as both approaches fixes the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).

I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.

Hugh Dickins
2011-10-17 18:51:00 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
Thanks a lot for thinking it over. I _almost_ agree with you, except
there's one aspect that I forgot to highlight in the patch comment:
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.

And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.

We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.

Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
Post by Andrea Arcangeli
Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vma_pgoff the pgoff must be different so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.
Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
The only way to fix this is to prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.
Thanks for explaining, yes, you're right.
Post by Andrea Arcangeli
The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with me, as both approaches fix the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).
I think your two-vmas approach is more aesthetically pleasing (and
matches mremap), but can see that Mel's vmaflag hack^Htechnique ends up
more economical. It is a bit sad that we lose that all-pages-swappable
condition for unlimited args, for a brief moment, but I think no memory
allocations are made in that interval, so I guess it's fine.

Hugh
Post by Andrea Arcangeli
I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.
--- 3.1-rc9/mm/migrate.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/migrate.c	2011-10-17 11:21:48.923826334 -0700
@@ -119,12 +119,6 @@ static int remove_migration_pte(struct p
 			goto out;
 
 		ptep = pte_offset_map(pmd, addr);
-
-		if (!is_swap_pte(*ptep)) {
-			pte_unmap(ptep);
-			goto out;
-		}
-
 		ptl = pte_lockptr(mm, pmd);
 	}


Andrea Arcangeli
2011-10-17 22:05:34 UTC
Permalink
Post by Hugh Dickins
Thanks a lot for thinking it over. I _almost_ agree with you, except
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.
And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.
I see what you mean, I didn't realize you were fixing that race.
During mremap, for a few CPU cycles (which may stretch if interrupted by
an irq), the migration entry will only live in the kernel stack of the
process doing mremap. So the rmap_walk may just loop quickly and
locklessly, not see it, and return while mremap holds both PT locks
(src and dst pte).
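Roughly, the window in question is the pte handoff inside move_ptes(); a simplified sketch of the 3.0-era loop (not an exact quote of the source), with the window marked:

	/* sketch of move_ptes()'s inner loop: both PT locks are already held */
	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		pte = ptep_clear_flush(vma, old_addr, old_pte);
		/*
		 * From here until set_pte_at() below, a migration entry that
		 * was sitting in *old_pte exists only in the local 'pte':
		 * a lockless remove_migration_pte() peek sees neither the
		 * old nor the new location populated and gives up.
		 */
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		set_pte_at(mm, new_addr, new_pte, pte);
	}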

Now, getting an irq at exactly that point of a racing migrate cycle doesn't
sound too easy to hit, but we still must fix this race.

Maybe whoever needs 100% reliability should not go lockless, looping all
over the vmas without taking the PT lock: that prevents serialization
against the pte "moving" functions, which normally do, in order,
ptep_clear_flush(src_ptep); set_pte_at(dst_ptep).

For example I never thought of optimizing __split_huge_page_splitting,
that must be reliable so I never felt like it could be safe to go
lockless there.

So I think it's better to fix migrate, as there may be other places
like mremap. Whoever can't afford failure should do the PT locking.

But maybe it's possible to find good reasons to fix the race in the
other way too.
Post by Hugh Dickins
We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.
Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
Adding more locking in migrate rather than in the mremap fast path should
be better performance-wise. Java GC uses mremap. migrate is somewhat less
performance critical, but I guess there may be other workloads where
migrate runs more often than mremap. But it also depends on the false
positive ratio of rmap_walk: if that's normally low, the patch to
migrate may actually result in an optimization, while the mremap patch
can't possibly speed anything up.
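For context on the fast path being weighed here, this is the userspace pattern a copying collector or allocator leans on (an illustrative snippet, not taken from any JVM): grow a heap with mremap() and let the kernel relocate it only when it must, where every relocation funnels through move_ptes().

/* Illustrative only: grow an anonymous "heap", moving it only if needed. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t old_len = 64UL << 20, new_len = 256UL << 20;
	char *heap = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return 1;
	memset(heap, 0, old_len);

	/* grow in place when possible; a move goes through move_ptes() */
	char *bigger = mremap(heap, old_len, new_len, MREMAP_MAYMOVE);
	if (bigger == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("heap %s: %p -> %p\n",
	       bigger == heap ? "grown in place" : "moved",
	       (void *)heap, (void *)bigger);
	return 0;
}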

In short I'm slightly more inclined on preferring the fix to migrate
and enforce all rmap-walkers who can't fail should not go lockless
speculative on the ptes but take the lock before checking if the pte
they're searching is there.

Mel Gorman
2011-10-19 07:43:36 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
Thanks a lot for thinking it over. I _almost_ agree with you, except
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.
And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.
I should have read the rest of the thread before responding :/ .

This makes more sense and is a relief in a sense. There is nothing known
wrong with the VMA locking or ordering. The correct PTE is found but it is
in the wrong state.
Post by Hugh Dickins
We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
I still think the anon_vma lock serialises mremap and migration. If that
is correct, it could cause things like huge page collapsing to stall mremap
operations. That might cause slowdowns in JVMs during GC, which is undesirable.
Post by Hugh Dickins
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.
Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.

Thanks Hugh.
--
Mel Gorman
SUSE Labs

Linus Torvalds
2011-10-19 13:39:55 UTC
Permalink
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Paweł, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?

Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Paweł's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.

Linus

Hugh Dickins
2011-10-19 19:42:15 UTC
Permalink
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.

Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.

I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.

I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.

Ah, I'd better send the patch separately as
"[PATCH] mm: fix race between mremap and removing migration entry":
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.

Hugh

Paweł Sikora
2011-10-20 06:30:21 UTC
Permalink
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.

on the other hand, neither patch helps with the 3.0.4+vserver host soft-lockup,
which dies within a few hours of stressing. iirc this lockup started with 2.6.38.
is there any major change in the memory management area in 2.6.38 that i can bisect
and test with vserver?

BR,
Paweł.

Linus Torvalds
2011-10-20 06:51:11 UTC
Permalink
Post by Paweł Sikora
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
Ok, thanks, applied and pushed out.
Post by Paweł Sikora
from the other side, both patches don't help for 3.0.4+vserver host soft-lock
which dies in few hours of stressing. iirc this lock has started with 2.6.38.
is there any major change in memory managment area in 2.6.38 that i can bisect
and test with vserver?
I suspect you'd be best off simply just doing a full bisect. Yes, if
2.6.37 is the last known working kernel for you, and 38 breaks, that's
a lot of commits (about 10k, to be exact), and it will take an
annoying number of reboots and tests, but assuming you don't hit any
problems, it should still be "only" about 14 bisection points or so.

You could *try* to minimize the bisect by only looking at commits that
change mm/, but quite frankly, partial tree bisects tend to not be all
that reliable. But if you want to try, you could do basically

git bisect start mm/
git bisect good v2.6.37
git bisect bad v2.6.38

and go from there. That will try to do a more specific bisect, and you
should have fewer test points, but the end result really is much less
reliable. But it might help narrow things down a bit.

Linus

Nai Xia
2011-10-21 06:54:29 UTC
Permalink
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
from the other side, both patches don't help for 3.0.4+vserver host soft-lock
Hi Paweł,

Did your "both" mean that you applied each patch and run the tests separately,
or you applied the both patches and run them together?

Maybe there were more than one bugs dancing but having a same effect,
not fixing all of them wouldn't help at all.

Thanks,

Nai Xia
Post by Paweł Sikora
which dies in few hours of stressing. iirc this lock has started with 2.6.38.
is there any major change in memory managment area in 2.6.38 that i can bisect
and test with vserver?
BR,
Paweł.
Pawel Sikora
2011-10-21 07:35:46 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
from the other side, both patches don't help for 3.0.4+vserver host soft-lock
Hi Paweł,
Did your "both" mean that you applied each patch and run the tests separately,
yes, i've tested Hugh's patches separately.
Post by Nai Xia
Maybe there were more than one bugs dancing but having a same effect,
not fixing all of them wouldn't help at all.
i suppose that the vserver patch only exposes some tricky bug introduced in 2.6.38.

Nai Xia
2011-10-20 12:51:33 UTC
Permalink
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,

Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?


Thanks,

Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Hugh Dickins
2011-10-20 18:36:06 UTC
Permalink
I'm travelling at the moment, my brain is not in gear, the source is not in
front of me, and I'm not used to typing on my phone much! Excuses, excuses

I flip between thinking you are right, and I'm a fool, and thinking you are
wrong, and I'm still a fool.

Please work it out with Linus, Andrea and Mel: I may not be able to reply
for a couple of days - thanks.

Hugh
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,
Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?
Thanks,
Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Nai Xia
2011-10-21 06:22:37 UTC
Permalink
Post by Hugh Dickins
I'm travelling at the moment, my brain is not in gear, the source is not in
front of me, and I'm not used to typing on my phone much!  Excuses, excuses
I flip between thinking you are right, and I'm a fool, and thinking you are
wrong, and I'm still a fool.
Ha, well, human brains are all weak at thoroughly searching race state spaces,
while automated model checking is still far from applicable to complex
real-world code like the kernel source. Maybe some day someone will come up
with a human-guided, computer-aided tool to help us search the combinations
of all involved code paths to validate a specific high-level logic assertion.
Post by Hugh Dickins
Please work it out with Linus, Andrea and Mel: I may not be able to reply
for a couple of days - thanks.
OK.

And as a side note: since I notice that Pawel's workload may include OOM,
I'd like to give an imaginary series of events that may trigger such a bug.

1. do_brk() wants to expand a vma, but vma_merge() fails because of a
transient ENOMEM, and instead a new vma is created at the boundary:

      vma_a              vma_b
|----------------|---------------------|

2. A page fault in vma_b gives it an anon_vma; then a page fault in vma_a
reuses the anon_vma of vma_b.


3. vma_a is mremap()ed to somewhere irrelevant; a new vma_c is created
and linked by anon_vma_clone(). In the anon_vma chain of vma_b,
vma_c is linked after vma_b:

      vma_a              vma_b                  vma_c
|----------------|---------------------|  |==============|

        vma_b                  vma_c
|---------------------|  |==============|


4. vma_c is mremap()ed back to the original place where vma_a was.
vma_merge() in copy_vma() decides that this request can be merged
into vma_b, and it returns vma_b.

5. move_page_tables() moves the ptes from vma_c to vma_b, and races with
rmap_walk. The reverse ordering of vma_b and vma_c in the anon_vma chain
makes rmap_walk miss an entry in the way I explained.

Well, it seems a very tricky construction, but it also seems possible to me.

Will Linus, Andrea, Mel, or anyone else please look into my construction
and judge whether it's valid?

Thanks

Nai Xia
Post by Hugh Dickins
Hugh
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,
Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?
Thanks,
Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Pawel Sikora
2011-10-21 08:07:05 UTC
Permalink
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on a patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of physical memory
and the so-called "cache" on the htop bar doesn't reach 100%.

the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (a new thing in 2.6.38)
died at night, so now i'm going to also disable CONFIG_COMPACTION/MIGRATION in the next
steps and stress this machine again...

Nai Xia
2011-10-21 09:07:56 UTC
Permalink
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usualy don't use more than half of physical memory
and so called "cache" on htop bar doesn't reach the 100%.
OK, did you log any OOM killing when there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to an imagined
root cause of the "adjacent VMAs which should have been merged but in
fact were not merged" case. Maybe there are other cases that can lead
to this, or maybe it's totally another bug....

But still I think that if my reasoning is good, similar bad things will
happen again some time in the future, even if it was not your case here...
Post by Pawel Sikora
the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
steps and stress this machine again...
OK, it's smart to narrow down the range first....
Paweł Sikora
2011-10-21 21:36:46 UTC
Permalink
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usualy don't use more than half of physical memory
and so called "cache" on htop bar doesn't reach the 100%.
OK,did you logged any OOM killing if there was some memory usage burst?
But, well my above OOM reasoning is a direct short cut to imagined
root cause of "adjacent VMAs which
should have been merged but in fact not merged" case.
Maybe there are other cases that can lead to this or maybe it's
totally another bug....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
Post by Nai Xia
But still I think if my reasoning is good, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
steps and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration has kept
the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram, caches reach 100% on the htop bar,
average load ~16. i wonder if it will survive the weekend...

Nai Xia
2011-10-22 06:21:23 UTC
Permalink
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usualy don't use more than half of physical memory
and so called "cache" on htop bar doesn't reach the 100%.
OK,did you logged any OOM killing if there was some memory usage burst?
But, well my above OOM reasoning is a direct short cut to imagined
root cause of "adjacent VMAs which
should have been merged but in fact not merged" case.
Maybe there are other cases that can lead to this or maybe it's
totally another bug....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.
Post by Paweł Sikora
Post by Nai Xia
But still I think if my reasoning is good, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
steps and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
average load ~16. i wonder if it survive weekend...
Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)

Paweł Sikora
2011-10-22 16:42:26 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usualy don't use more than half of physical memory
and so called "cache" on htop bar doesn't reach the 100%.
OK,did you logged any OOM killing if there was some memory usage burst?
But, well my above OOM reasoning is a direct short cut to imagined
root cause of "adjacent VMAs which
should have been merged but in fact not merged" case.
Maybe there are other cases that can lead to this or maybe it's
totally another bug....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to goto
this bug.
Post by Paweł Sikora
Post by Nai Xia
But still I think if my reasoning is good, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
steps and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
average load ~16. i wonder if it survive weekend...
Maybe you should give another shot of Andrea's latest anon_vma_order_tail() patch. :)
all my attempts at disabling thp/compaction/migration failed (machine locked).
now i'm testing 3.0.7+vserver+Hugh's+Andrea's patches with a few kernel debug options enabled.

so far it has logged only something unrelated to the memory management subsystem:

[ 258.397014] =======================================================
[ 258.397209] [ INFO: possible circular locking dependency detected ]
[ 258.397311] 3.0.7-vs2.3.1-dirty #1
[ 258.397402] -------------------------------------------------------
[ 258.397503] slave_odra_g_00/19432 is trying to acquire lock:
[ 258.397603] (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.397912]
[ 258.397912] but task is already holding lock:
[ 258.398090] (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[ 258.398387]
[ 258.398388] which lock already depends on the new lock.
[ 258.398389]
[ 258.398652]
[ 258.398653] the existing dependency chain (in reverse order) is:
[ 258.398836]
[ 258.398837] -> #2 (&rq->lock){-.-.-.}:
[ 258.399178] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.399336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.399495] [<ffffffff81040bd7>] wake_up_new_task+0x97/0x1c0
[ 258.399652] [<ffffffff81047db6>] do_fork+0x176/0x460
[ 258.399807] [<ffffffff8100999c>] kernel_thread+0x6c/0x70
[ 258.399964] [<ffffffff8144715d>] rest_init+0x21/0xc4
[ 258.400120] [<ffffffff818adbd2>] start_kernel+0x3d6/0x3e1
[ 258.400280] [<ffffffff818ad322>] x86_64_start_reservations+0x132/0x136
[ 258.400336] [<ffffffff818ad416>] x86_64_start_kernel+0xf0/0xf7
[ 258.400336]
[ 258.400336] -> #1 (&p->pi_lock){-.-.-.}:
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff81466f5c>] _raw_spin_lock_irqsave+0x3c/0x60
[ 258.400336] [<ffffffff8106f328>] thread_group_cputimer+0x38/0x100
[ 258.400336] [<ffffffff8106f41d>] cpu_timer_sample_group+0x2d/0xa0
[ 258.400336] [<ffffffff8107080a>] set_process_cpu_timer+0x3a/0x110
[ 258.400336] [<ffffffff8107091a>] update_rlimit_cpu+0x3a/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[ 258.400336]
[ 258.400336] -> #0 (&(&sig->cputimer.lock)->rlock){-.....}:
[ 258.400336] [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.400336] [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[ 258.400336] [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[ 258.400336] [<ffffffff8105a439>] update_process_times+0x69/0x80
[ 258.400336] [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[ 258.400336] [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[ 258.400336] [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[ 258.400336] [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[ 258.400336] [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[ 258.400336] [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[ 258.400336]
[ 258.400336] other info that might help us debug this:
[ 258.400336]
[ 258.400336] Chain exists of:
[ 258.400336] &(&sig->cputimer.lock)->rlock --> &p->pi_lock --> &rq->lock
[ 258.400336]
[ 258.400336] Possible unsafe locking scenario:
[ 258.400336]
[ 258.400336] CPU0 CPU1
[ 258.400336] ---- ----
[ 258.400336] lock(&rq->lock);
[ 258.400336] lock(&p->pi_lock);
[ 258.400336] lock(&rq->lock);
[ 258.400336] lock(&(&sig->cputimer.lock)->rlock);
[ 258.400336]
[ 258.400336] *** DEADLOCK ***
[ 258.400336]
[ 258.400336] 2 locks held by slave_odra_g_00/19432:
[ 258.400336] #0: (tasklist_lock){.+.+..}, at: [<ffffffff81062acd>] do_prlimit+0x5d/0x240
[ 258.400336] #1: (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[ 258.400336]
[ 258.400336] stack backtrace:
[ 258.400336] Pid: 19432, comm: slave_odra_g_00 Not tainted 3.0.7-vs2.3.1-dirty #1
[ 258.400336] Call Trace:
[ 258.400336] <IRQ> [<ffffffff8145e204>] print_circular_bug+0x23d/0x24e
[ 258.400336] [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[ 258.400336] [<ffffffff8109264d>] ? mark_lock+0x2dd/0x330
[ 258.400336] [<ffffffff81093bfd>] ? __lock_acquire+0x4bd/0x1cc0
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[ 258.400336] [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[ 258.400336] [<ffffffff8105a439>] update_process_times+0x69/0x80
[ 258.400336] [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[ 258.400336] [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[ 258.400336] [<ffffffff8108e070>] ? tick_nohz_handler+0x100/0x100
[ 258.400336] [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[ 258.400336] [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[ 258.400336] [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[ 258.400336] <EOI> [<ffffffff814674e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 258.400336] [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b

Pawel Sikora
2011-10-25 07:33:50 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
Post by Pawel Sikora
my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usualy don't use more than half of physical memory
and so called "cache" on htop bar doesn't reach the 100%.
OK, did you logged any OOM killing if there was some memory usage burst?
But, well my above OOM reasoning is a direct short cut to imagined
root cause of "adjacent VMAs which
should have been merged but in fact not merged" case.
Maybe there are other cases that can lead to this or maybe it's
totally another bug....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to goto
this bug.
Post by Paweł Sikora
But still I think if my reasoning is good, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
steps and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
average load ~16. i wonder if it survive weekend...
Maybe you should give another shot of Andrea's latest anon_vma_order_tail() patch. :)
all my attempts to disabling thp/compaction/migration failed (machine locked).
now, i'm testing 3.0.7+vserver+Hugh's+Andrea's patches+enabled few kernel debug options.
Have you got the result of this patch combination by now?
yes, this combination is working *stable* for ~2 days so far (with heavy stressing).

moreover, i've isolated/reported a faulty code in the vserver patch that causes cryptic
deadlocks for 2.6.38+ kernels: http://list.linux-vserver.org/archive?msp:5420:mdaibmimlbgoligkjdma
I have no clues about the locking below, indeed, it seems like another bug......

this might be fixed by 3.0.8 https://lkml.org/lkml/2011/10/23/26, i'll test it soon...
Post by Paweł Sikora
so far it has logged only something unrelated to memory managment subsystem:
[  258.397014] =======================================================
[  258.397209] [ INFO: possible circular locking dependency detected ]
[  258.397311] 3.0.7-vs2.3.1-dirty #1
[  258.397402] -------------------------------------------------------
[  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
Nai Xia
2011-10-20 09:11:28 UTC
Permalink
Hi Andrea,
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading this code last week.

And I do think this order matters; the reason is quite similar to why we
need i_mmap_lock in move_ptes():
If rmap_walk goes dst--->src, then when it first looks into dst, ok, the
pte is not there, and it happily skips it and releases the PTL.
Then just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() looks
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it!
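
To make the interleaving concrete, here is a toy userspace model of it
(purely illustrative: the names are made up and it only mimics the locking
order, not the real kernel code; build with gcc -pthread). A "walker"
inspects slot dst and then slot src, taking each slot's lock only while
looking at it, while a "mover" shifts the entry from src to dst with both
locks held; if the mover runs entirely between the walker's two lookups,
the walker sees the entry in neither slot, even though it existed the
whole time:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t src_lock = PTHREAD_MUTEX_INITIALIZER; /* "PT lock" of src */
static pthread_mutex_t dst_lock = PTHREAD_MUTEX_INITIALIZER; /* "PT lock" of dst */
static bool src_has_entry = true;   /* the entry starts out at src */
static bool dst_has_entry;

/* mimics move_ptes(): both locks held across the move */
static void *mover(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&dst_lock);
	pthread_mutex_lock(&src_lock);
	dst_has_entry = src_has_entry;
	src_has_entry = false;
	pthread_mutex_unlock(&src_lock);
	pthread_mutex_unlock(&dst_lock);
	return NULL;
}

/* mimics rmap_walk() visiting dst before src, locking one slot at a time */
static void *walker(void *arg)
{
	bool found = false;

	(void)arg;
	pthread_mutex_lock(&dst_lock);
	found |= dst_has_entry;
	pthread_mutex_unlock(&dst_lock);

	/* the whole mover() can run inside this window */

	pthread_mutex_lock(&src_lock);
	found |= src_has_entry;
	pthread_mutex_unlock(&src_lock);

	if (!found)
		puts("walker missed the entry although it never went away");
	return NULL;
}

int main(void)
{
	pthread_t w, m;

	pthread_create(&w, NULL, walker, NULL);
	pthread_create(&m, NULL, mover, NULL);
	pthread_join(w, NULL);
	pthread_join(m, NULL);
	return 0;
}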

IMO, this can really happen in the case of vma_merge() succeeding.
Imagine that the src vma is faulted late and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (call it evil_vma) through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases
if two VMAs pass reusable_anon_vma() they were already merged.

What do you think?

And if my reasoning is sound and this bug is really triggered by it,
Hugh's first patch should be the right fix :)


Regards,

Nai Xia
Post by Andrea Arcangeli
Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vma_pgoff the pgoff must be different so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.
Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
The only way is to fix this is prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.
The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with as both approaches fixes the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).
I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.
Mel Gorman
2011-10-21 15:56:32 UTC
Permalink
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.

If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think ?
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.

What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.

All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
--
Mel Gorman
SUSE Labs

Nai Xia
2011-10-21 17:21:25 UTC
Permalink
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src,  it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma().  This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_  evil_vma/new_vma/dst_vma.
In this way, the ordering protocol  of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think  ?
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
Hi Mel,

Thanks for the input. I agree with _almost_ all of your reasoning above.

But there is a tricky series of events I mentioned in
https://lkml.org/lkml/2011/10/21/14
which, I think, can really lead to anon_vma1 == anon_vma2 in this case.
These events are triggered by do_brk() failing on vma_merge() due to
ENOMEM, rare as that may be. And I am still not sure whether there exist
other corner cases where "should be merged" VMAs just sit there side by
side for some reason -- normally that does not trigger BUGs, so it may be
hard to detect in a real workload.

Please refer to my link; I think the construction is clear, unless I have
missed something subtle.

Thanks,

Nai Xia
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
Andrea Arcangeli
2011-10-21 17:41:20 UTC
Permalink
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
Yes I also think ordering matters as I mentioned in the previous email
that Nai answered to.
Post by Mel Gorman
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
Correct. As I mentioned in that previous email that Nai answered, the
only case where that would not be ok is if vma_merge succeeds, and I
haven't changed my mind about that...

copy_vma is only called by mremap, so supposedly that path can
trigger. It looks like I was wrong about vma_merge not being able to
succeed in copy_vma, and if it does succeed I still think it's a
problem, as we then have no ordering guarantee.

The only other place that depends on the anon_vma_chain order is fork,
and there, no vma_merge can happen, so that is safe.
Post by Mel Gorman
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
The problem is that rmap_walk will search dst before src, so it will do
nothing on dst. Then mremap moves the pte from src to dst. When the
rmap walk then checks "src", it finds nothing again.
Post by Mel Gorman
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Normally it is, the only problem is vma_merge succeeding I think.
Post by Mel Gorman
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think ?
I tried to understand the above scenario yesterday, but with 12 hours
of travel on me I just couldn't.

Yesterday however I thought of another simpler case:

Part of a vma is moved elsewhere with mremap, and then moved back to
its original place. vma_merge will then succeed, and the "src" of the
second mremap is now queued last in the anon_vma chain: the wrong ordering.
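
From userspace, that sequence is roughly the following (a hypothetical
reproducer sketch, not something that was posted in this thread; it assumes
an anonymous private mapping and a 3.0-era mremap()):

/*
 * Hypothetical sketch of the "move away and back" sequence: the second
 * mremap() puts the range back where it came from, so copy_vma()'s
 * vma_merge() can succeed and the "src" vma of that second move ends
 * up last on the anon_vma chain.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16 * sysconf(_SC_PAGESIZE);
	char *orig, *moved;

	orig = mmap(NULL, 3 * len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (orig == MAP_FAILED)
		return 1;
	memset(orig, 0x55, 3 * len);	/* fault it in so an anon_vma exists */

	/* move the middle chunk somewhere else (this splits the vma)... */
	moved = mremap(orig + len, len, len, MREMAP_MAYMOVE);
	if (moved == MAP_FAILED)
		return 1;

	/* ...then move it back to its original place, where vma_merge can
	 * succeed in copy_vma() */
	if (mremap(moved, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
		   orig + len) == MAP_FAILED)
		return 1;
	return 0;
}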

Today I read an email from Nai who showed apparently the same scenario
I was thinking of, without evil vmas or anything like that.

I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But it can
still happen in that case, I think (so I'm not sure how the above
scenario with an evil_vma could ever happen if it has a different
anon_vma and it's not part of a vma that is going back to its original
place, as in the second scenario Nai also posted about).

That Nai and I arrived at the same scenario hypothesis independently
(Nai's second hypothesis, not the first one quoted above), plus the
fact that copy_vma does vma_merge and is only called by mremap, makes
it sound like it can really happen.
Post by Mel Gorman
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
I was thinking along these lines; the only pitfall should be when
something is moved and then put back into its original place. When it
is moved, a new vma is created and queued last. When it's put back to
its original location, vma_merge will succeed, and "src" is now the
previous "dst", so it is queued last and that breaks.
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
I think Nai's correct, only second hypothesis though.

We have two options:

1) we remove the vma_merge call from copy_vma and do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and we
rely on the ordering). No locks, but we do one more allocation for one
additional temporary vma that is removed after mremap completes.

2) Hugh's original fix.

The first option is probably faster and preferable; the vma_merge there
should only trigger when putting things back to their origin, I
suspect, and never with random mremaps, though I'm not sure how common
putting things back to the origin is. If we're in a hurry we can merge
Hugh's patch and optimize it later. We can still retain the migrate fix
if we intend to take option 1 later. I didn't much like migrate doing
speculative access on ptes that it cannot afford to miss, or it will
crash anyway.

That said, the fix merged upstream is 99% certain to fix things in
practice already, so I doubt we're in a hurry. And if things go wrong,
these issues don't go unnoticed and they shouldn't corrupt memory even
if they trigger. I'm 100% certain it can't do damage (other than a
BUG_ON) for split_huge_page, as I count the pmds encountered in the
rmap_walk when I set the splitting bit and compare that count with
page_mapcount, and BUG_ON if they don't match; later I repeat the same
comparison in the second rmap_walk that establishes the ptes and
downgrades the hugepmd to a pmd, and BUG_ON again if it doesn't match
the count from the previous rmap_walk. It may be possible to trigger
the BUG_ON with some malicious activity, but it won't be too easy
either because it's not an instant thing: a race still has to trigger
and it's hard to reproduce.
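
In rough pseudo-form, that split_huge_page consistency check does something
like this (a simplified sketch; mark_pmd_splitting() and remap_as_ptes()
are made-up helper names standing in for the real walk bodies):

/*
 * Sketch of the check described above: the first rmap walk marks every
 * pmd mapping the huge page as splitting and counts how many it found;
 * the second walk replaces each huge pmd with regular ptes and counts
 * again.  If either count disagrees, the walk missed a mapping: BUG().
 */
static void split_huge_page_check_sketch(struct page *page,
					 struct anon_vma *anon_vma)
{
	struct anon_vma_chain *avc;
	int mapcount = 0, mapcount2 = 0;

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma)
		mapcount += mark_pmd_splitting(page, avc->vma);	/* 1st walk */
	BUG_ON(mapcount != page_mapcount(page));

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma)
		mapcount2 += remap_as_ptes(page, avc->vma);	/* 2nd walk */
	BUG_ON(mapcount != mapcount2);
}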

The anon_vma lock is quite a wide lock, as it's shared by all parent
anon_vma chains too; slab allocation from the local cpu may actually be
faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying Hugh's
fix even for the long run. I wouldn't git revert the migration change,
but if we go with Hugh's fix it would probably be safe to.

Andrea Arcangeli
2011-10-21 22:50:08 UTC
Permalink
Post by Andrea Arcangeli
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeed (so then we're as safe as fork is and we
relay on the ordering). No locks but we'll just do 1 more allocation
for one addition temporary vma that will be removed after mremap
completed.
2) Hugh's original fix.
3) put the src vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same

I tried to implement this, but I'm still wondering about its safety
with concurrent processes all calling mremap at the same time on the
same anon_vma's same_anon_vma list; the reasoning why I think it may be
safe is in the comment. I ran a few mremaps with my benchmark, where
the THP-aware mremap in -mm gets a x10 boost and moves 5G, and it
didn't crash, but that's about it and not conclusive. If you review,
please comment...

I have to pack my luggage and prepare to fly to KS tomorrow, so I may
not be responsive in the next few days.

===
From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <***@redhat.com>
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
vma_merge succeeding in copy_vma

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy, practically for
the whole duration of mremap.
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends on the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other processes
+ * while mremap runs because mremap doesn't hold the anon_vma mutex to
+ * prevent modifications to the list while it runs. All we need to
+ * enforce is that the relative order of this process vmas isn't
+ * changing (we don't care about other vmas order). Each vma
+ * corresponds to an anon_vma_chain structure so there's no risk that
+ * other processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() will screw with the relative
+ * order of this process vmas in the list, because we won't alter the
+ * order of any vma that isn't belonging to this process. And there
+ * can't be another anon_vma_order_tail running concurrently with
+ * mremap() coming from this process because we hold the mmap_sem for
+ * the whole mremap(). fork() ordering dependency also shouldn't be
+ * affected because we only care that the parent vmas are placed in
+ * the list before the child vmas and anon_vma_order_tail won't reorder
+ * vmas from either the fork parent or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

Nai Xia
2011-10-22 05:52:22 UTC
Permalink
Post by Andrea Arcangeli
Post by Andrea Arcangeli
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeed (so then we're as safe as fork is and we
relay on the ordering). No locks but we'll just do 1 more allocation
for one addition temporary vma that will be removed after mremap
completed.
2) Hugh's original fix.
3) put the src vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same
I tried to implement this but I'm still wondering about the safety of
this with concurrent processes all calling mremap at the same time on
the same anon_vma same_anon_vma list, the reasoning I think it may be
safe is in the comment. I run a few mremap with my benchmark where the
THP aware mremap in -mm gets a x10 boost and moves 5G and it didn't
BTW, I am curious what benchmark you ran, and whether the "x10 boost"
means compared to Hugh's anon_vma locking fix?
Post by Andrea Arcangeli
crash but that's about it and not conclusive, if you review please
comment...
My comment is at the bottom of this post.
Post by Andrea Arcangeli
I've to pack luggage and prepare to fly to KS tomorrow so I may not be
responsive in the next few days.
===
From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
vma_merge succeeding in copy_vma
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still changed by other processes
+ * while mremap runs because mremap doesn't hold the anon_vma mutex to
+ * prevent modifications to the list while it runs. All we need to
+ * enforce is that the relative order of this process vmas isn't
+ * changing (we don't care about other vmas order). Each vma
+ * corresponds to an anon_vma_chain structure so there's no risk that
+ * other processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() will screw with the relative
+ * order of this process vmas in the list, because we won't alter the
+ * order of any vma that isn't belonging to this process. And there
+ * can't be another anon_vma_order_tail running concurrently with
+ * mremap() coming from this process because we hold the mmap_sem for
+ * the whole mremap(). fork() ordering dependency also shouldn't be
+ * affected because we only care that the parent vmas are placed in
+ * the list before the child vmas and anon_vma_order_tail won't reorder
+ * vmas from either the fork parent or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
This patch, together with the reasoning, looks good to me.
But I wonder whether this patch makes the anon_vma chain ordering game
more complex and harder to play in the future.
However, if it does bring a real performance benefit, I vote for this
patch because it balances all three requirements here: bug free,
performance, and no two VMAs left unmerged for no good reason.

Our situation again gives me the strong feeling that we are in bad need
of a computer-aided way to traverse all possible states.
There are some people around me who do research on automatic software
testing, but I am afraid our problem is too "real world" for them... sigh...



Andrea Arcangeli
2011-10-31 17:14:41 UTC
Permalink
Post by Nai Xia
BTW, I am curious about what benchmark did you run and " x10 boost"
meaning compared to Hugh's anon_vma_locking fix?
I was referring to the mremap optimizations I pushed in -mm.
Post by Nai Xia
This patch and together with the reasoning looks good to me.
But I wondering this patch can make the anon_vma chain ordering game more
complex and harder to play in the future.
Well, we don't know yet what the future will bring... at least this adds
some documentation of the fact that the order matters for
fork/mremap/migrate/split_huge_page. As far as I can tell they're the
4 pieces of the VM where the rmap_walk order matters. And
split_huge_page and migrate are the only two where if the rmap_walk
fails we can't safely continue and have to BUG_ON.
Post by Nai Xia
However, if it does bring much perfomance benefit, I vote for this patch
because it balances all three requirements here: bug free, performance &
no two VMAs stay not merged for no good reason.
I suppose it should bring an SMP performance benefit as the critical
section is reduced but we'll have to do some more list_del/add_tail
than if we take the global lock...
Post by Nai Xia
Our situation again makes me have the strong feeling that we are really
in bad need of a computer aided way to travel all possible state space.
There are some guys around me who do automatic software testing research.
But I am afraid our problem is too much "real world" for them... sigh...
Also the code changes too fast for that...

I'll send the patch again with signoff.

Andrea Arcangeli
2011-10-31 17:27:20 UTC
Permalink
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy, practically for
the whole duration of mremap.

Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends on the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

Mel Gorman
2011-11-01 12:07:26 UTC
Permalink
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
For future reference, why? How about this as an explanation?

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That leads to a race
between migration and mremap whereby a migration PTE is left behind.

mremap                                  migration
create dst VMA
                                        rmap_walk
                                        finds dst, no ptes, releases PTL
move_ptes
copies src PTEs to dst
                                        finds src, ptes empty, releases PTL

The migration PTE is now left behind because the order of VMAs matters.
Post by Andrea Arcangeli
This patch adds a anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
Document the alternative just in case?

"One fix would be to have mremap take the anon_vma lock which would
serialise migration and mremap but this would hurt scalability. Instead,
this patch adds....."

I would also prefer something like anon_vma_moveto_tail() but maybe
it's just me that sees "order" and thinks "high-order allocation".
Post by Andrea Arcangeli
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
This is following the same rules as anon_vma_clone() and I didn't see a
flaw in your explanation as to why it's safe.

Acked-by: Mel Gorman <***@suse.de>
--
Mel Gorman
SUSE Labs

Nai Xia
2011-11-01 14:35:22 UTC
Permalink
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);   /* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
                */
               if (vma_start >= new_vma->vm_start &&
                   vma_start < new_vma->vm_end)
+                       /*
+                        * No need to call anon_vma_order_tail() in
+                        * this case because the same PT lock will
+                        * serialize the rmap_walk against both src
+                        * and dst vmas.
+                        */
                       *vmap = new_vma;
+               else
+                       anon_vma_order_tail(new_vma);
       } else {
               new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
               if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+       struct anon_vma_chain *pavc;
+       struct anon_vma *root = NULL;
+
+       list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+               struct anon_vma *anon_vma = pavc->anon_vma;
+               VM_BUG_ON(pavc->vma != dst);
+               root = lock_anon_vma_root(root, anon_vma);
+               list_del(&pavc->same_anon_vma);
+               list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+       }
+       unlock_anon_vma_root(root);
+}
I think Pawel might want to add a "Tested-by"; he may have been running
this patch safely for quite a few days now. :)

Reviewed-by: Nai Xia <***@gmail.com>

Hugh Dickins
2011-11-04 07:31:04 UTC
Permalink
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Post by Andrea Arcangeli
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
I agree with Mel that anon_vma_moveto_tail() would be a better name;
or even anon_vma_move_to_tail().
Post by Andrea Arcangeli
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
But I'm sorry to say that I'm actually not persuaded by the patch,
on three counts.
Post by Andrea Arcangeli
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)
Post by Andrea Arcangeli
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
Really? Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Post by Andrea Arcangeli
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
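
For context, that error path looks roughly like this (paraphrased from
memory, not a verbatim quote of mm/mremap.c):

	/*
	 * Rough paraphrase of the move_vma() error path: if the forward
	 * move fails part way, the already-moved ptes are moved back from
	 * dst to src -- with dst now acting as the rmap "src", i.e. the
	 * ordering the patch did not account for.
	 */
	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
	if (moved_len < old_len) {
		/* put the ptes back where they came from */
		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
		vma = new_vma;
		old_len = new_len;
		old_addr = new_addr;
		new_addr = -ENOMEM;
	}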

It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.

Certainly I'd agree that it's a very rare case. But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.

Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
Post by Andrea Arcangeli
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
I thought this was correct, but now I'm not so sure. You rightly
consider the question of interference between concurrent mremaps in
different mms in your comment above, but I'm still not convinced it
is safe. Oh, probably just my persistent failure to picture these
avcs properly.

If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).
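
For reference, the straightforward locking meant here would look something
like this (a simplified sketch, assuming the 3.0-era anon_vma_lock() and
anon_vma_unlock() helpers; the pte-moving work itself is elided):

	/*
	 * Simplified sketch: hold the anon_vma lock across the pte move so
	 * a concurrent rmap walk (migrate, split_huge_page) cannot slip in
	 * between looking at the dst and the src vma.
	 */
	if (vma->anon_vma)
		anon_vma_lock(vma->anon_vma);
	/* ... move_ptes()/move_page_tables() work happens here ... */
	if (vma->anon_vma)
		anon_vma_unlock(vma->anon_vma);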

Sorry to be so negative! I may just be wrong on all counts.

Hugh

Nai Xia
2011-11-04 14:34:54 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
Post by Hugh Dickins
Post by Andrea Arcangeli
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
I agree with Mel that anon_vma_moveto_tail() would be a better name;
or even anon_vma_move_to_tail().
Post by Andrea Arcangeli
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
But I'm sorry to say that I'm actually not persuaded by the patch,
on three counts.
Post by Andrea Arcangeli
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 0 deletions(-)
B
Post by Andrea Arcangeli
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At first glance, rmap_walk does travel this merged VMA once...
But now, wait... I am actually really puzzled that this case can happen
at all: vma_merge() does not break the validity of page->index with
respect to its VMA. So if this could really happen, a page->index would
have to be valid in both areas of the same VMA.
It's strange to imagine a PTE being copied inside the _same_ VMA with
page->index valid at both the old and the new place.

IMO, the only case in which the src VMA can be merged with the new one
is when the src VMA hasn't been faulted yet and its pgoff is
recalculated. And if my reasoning holds, this place does not need to be
worried about.

What do you think?
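
The invariant relied on above is visible in how a linear rmap walk computes
the address of a page within a vma, roughly as follows (a simplified sketch
of the usual mm/rmap.c calculation, not a verbatim quote):

/*
 * Sketch of the linear rmap address calculation: page->index together
 * with vma->vm_pgoff pins down a single virtual address inside a given
 * vma, so one anonymous page cannot be mapped at both the old and the
 * new location of the same (merged) vma.
 */
static unsigned long vma_address_sketch(struct page *page,
					struct vm_area_struct *vma)
{
	unsigned long address;

	address = vma->vm_start + ((page->index - vma->vm_pgoff) << PAGE_SHIFT);
	if (address < vma->vm_start || address >= vma->vm_end)
		return -EFAULT;	/* page is not mapped in this vma */
	return address;
}
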
Post by Hugh Dickins
Post by Andrea Arcangeli
                      *vmap = new_vma;
+             else
+                     anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Oh my, at least six eyeballs missed another apparent case...
Now you know why I said "Human brains are all weak in...." :P
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
Post by Andrea Arcangeli
      } else {
              new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
              if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+     struct anon_vma_chain *pavc;
+     struct anon_vma *root = NULL;
+
+     list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+             struct anon_vma *anon_vma = pavc->anon_vma;
+             VM_BUG_ON(pavc->vma != dst);
+             root = lock_anon_vma_root(root, anon_vma);
+             list_del(&pavc->same_anon_vma);
+             list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+     }
+     unlock_anon_vma_root(root);
+}
I thought this was correct, but now I'm not so sure.  You rightly
consider the question of interference between concurrent mremaps in
different mms in your comment above, but I'm still not convinced it
is safe.  Oh, probably just my persistent failure to picture these
avcs properly.
If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).
Sorry to be so negative!  I may just be wrong on all counts.
Hugh
Pawel Sikora
2011-11-04 15:59:26 UTC
Permalink
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
hi all,

i'm still testing the anon_vma_order_tail() patch. 10 days of heavy processing
and the machine is still stable, but i've recorded something interesting:

$ uname -a
Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
$ uptime
16:47:44 up 10 days, 4:21, 5 users, load average: 19.55, 19.15, 18.76
$ ps aux|grep migration
root 6 0.0 0.0 0 0 ? S Oct25 0:00 [migration/0]
root 8 68.0 0.0 0 0 ? S Oct25 9974:01 [migration/1]
root 13 35.4 0.0 0 0 ? S Oct25 5202:15 [migration/2]
root 17 71.4 0.0 0 0 ? S Oct25 10479:10 [migration/3]
root 21 70.7 0.0 0 0 ? S Oct25 10370:14 [migration/4]
root 25 66.1 0.0 0 0 ? S Oct25 9698:11 [migration/5]
root 29 70.1 0.0 0 0 ? S Oct25 10283:22 [migration/6]
root 33 62.6 0.0 0 0 ? S Oct25 9190:28 [migration/7]
root 37 0.0 0.0 0 0 ? S Oct25 0:00 [migration/8]
root 41 97.7 0.0 0 0 ? S Oct25 14338:30 [migration/9]
root 45 29.2 0.0 0 0 ? S Oct25 4290:00 [migration/10]
root 49 68.7 0.0 0 0 ? S Oct25 10081:38 [migration/11]
root 53 98.7 0.0 0 0 ? S Oct25 14477:25 [migration/12]
root 57 70.0 0.0 0 0 ? S Oct25 10272:57 [migration/13]
root 61 69.7 0.0 0 0 ? S Oct25 10232:29 [migration/14]
root 65 70.9 0.0 0 0 ? S Oct25 10403:09 [migration/15]

wow, 71..241 hours in migration processes after 10 days of uptime?
the machine has 2 opteron nodes with 32GB ram paired with each processor.
i suppose it spends a lot of time on migration (processes + memory pages).

BR,
Paweł.

Nai Xia
2011-11-05 02:21:15 UTC
Permalink
Post by Pawel Sikora
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
hi all,
i'm still testing anon_vma_order_tail() patch. 10 days of heavy processing
$ uname -a
Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
$ uptime
 16:47:44 up 10 days,  4:21,  5 users,  load average: 19.55, 19.15, 18.76
$ ps aux|grep migration
root         6  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/0]
root         8 68.0  0.0      0     0 ?        S    Oct25 9974:01 [migration/1]
root        13 35.4  0.0      0     0 ?        S    Oct25 5202:15 [migration/2]
root        17 71.4  0.0      0     0 ?        S    Oct25 10479:10 [migration/3]
root        21 70.7  0.0      0     0 ?        S    Oct25 10370:14 [migration/4]
root        25 66.1  0.0      0     0 ?        S    Oct25 9698:11 [migration/5]
root        29 70.1  0.0      0     0 ?        S    Oct25 10283:22 [migration/6]
root        33 62.6  0.0      0     0 ?        S    Oct25 9190:28 [migration/7]
root        37  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/8]
root        41 97.7  0.0      0     0 ?        S    Oct25 14338:30 [migration/9]
root        45 29.2  0.0      0     0 ?        S    Oct25 4290:00 [migration/10]
root        49 68.7  0.0      0     0 ?        S    Oct25 10081:38 [migration/11]
root        53 98.7  0.0      0     0 ?        S    Oct25 14477:25 [migration/12]
root        57 70.0  0.0      0     0 ?        S    Oct25 10272:57 [migration/13]
root        61 69.7  0.0      0     0 ?        S    Oct25 10232:29 [migration/14]
root        65 70.9  0.0      0     0 ?        S    Oct25 10403:09 [migration/15]
wow, 71..241 hours in migration processes after 10 days of uptime?
machine has 2 opteron nodes with 32GB ram paired with each processor.
i suppose that it spends a lot of time on migration (processes + memory pages).
Hi Paweł, this looks to me like a load balancing issue, but it might not
be directly related to this bug, or even related to abnormal page
migration at all.
Can this be a scheduler & interrupts issue?

But oh well, actually I have never touched a 16-core machine doing heavy
processing, so I cannot tell if this result is normal or not.

Maybe you should ask a broader range of people?

BR,
Nai
Post by Pawel Sikora
BR,
Paweł.
Hugh Dickins
2011-11-04 19:16:03 UTC
Permalink
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.

That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code. With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)

So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
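
To make that linear rule concrete, here is a minimal userspace sketch; the
struct and helper below are toy stand-ins rather than the kernel's own
vm_area_struct, and only the arithmetic mirrors what vma_address() does in
mm/rmap.c:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Toy stand-in for a linearly mapped anonymous vma. */
struct toy_vma {
        unsigned long vm_start, vm_end, vm_pgoff;
};

/* With linear vm_pgoff, a page index selects at most one address in the vma. */
static long toy_vma_address(unsigned long index, const struct toy_vma *vma)
{
        unsigned long addr = vma->vm_start +
                             ((index - vma->vm_pgoff) << PAGE_SHIFT);

        if (addr < vma->vm_start || addr >= vma->vm_end)
                return -1;      /* this vma cannot map that page at all */
        return (long)addr;
}

int main(void)
{
        struct toy_vma vma = {
                .vm_start = 0x600000000000UL,
                .vm_end   = 0x600000200000UL,
                .vm_pgoff = 0x600000000000UL >> PAGE_SHIFT,
        };
        unsigned long index = vma.vm_pgoff + 3;

        /* One index, one address: a range cannot fit in two places. */
        printf("index %#lx -> address %#lx\n", index,
               (unsigned long)toy_vma_address(index, &vma));
        return 0;
}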
Post by Nai Xia
IMO, the only case that src VMA can be merged by the new
is that src VMA hasn't been faulted yet and the pgoff
is recalculated. And if my reasoning is true, this place
does not need to be worried about.
I don't see a place where "the pgoff is recalculated" (except in
the consistent way when expanding or splitting or merging vma), nor
where vma merge would allow for variable pgoff. I agree that we
could avoid finalizing vm_pgoff for an anonymous area until its
anon_vma is assigned: were you imagining doing that in future,
or am I overlooking something already there?

Hugh
Andrea Arcangeli
2011-11-04 20:54:40 UTC
Permalink
Post by Hugh Dickins
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code. With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.

I doubt it can be removed without crashing in the above case.

I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so when
rmap_walk starts again from scratch (after anon_vma_merge has run in
vma_adjust) it will find all ptes even if vma_merge succeeded before.

In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realize now).

What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change once rmap_walk has started.

What I thought before was rmap_walk checking vma1, then vma_merge
succeeding (where the src vma is vma2 and the dst vma is vma1, but vma1
is not a new vma queued at the end of same_anon_vma), move_page_tables
moving the pte from vma2 to vma1, and then rmap_walk checking vma2. But
again, vma_merge won't be allowed to complete in the middle of
rmap_walk, so this cannot trigger and we can safely drop the patch. It
wasn't immediate to think of the locks taken within vma_adjust, sorry.
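
For readers following the locking argument, the "both PT locks" referred to
here is the pattern used by move_ptes() in mm/mremap.c; an abridged sketch
(the pte copy loop and the file-mapping details are elided, so this is not
the verbatim source):

        spinlock_t *old_ptl, *new_ptl;
        pte_t *old_pte, *new_pte;

        /* Take the src pte lock, then the dst pte lock if it differs. */
        old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
        new_pte = pte_offset_map(new_pmd, new_addr);
        new_ptl = pte_lockptr(mm, new_pmd);
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

        /* ... transfer each pte from *old_pte to *new_pte here ... */

        if (new_ptl != old_ptl)
                spin_unlock(new_ptl);
        pte_unmap(new_pte);
        pte_unmap_unlock(old_pte, old_ptl);

Holding both locks means no other CPU can observe a half-moved pte; as the
rest of the thread shows, it does not by itself guarantee that a walker
visiting the vmas in the wrong order sees the pte at all.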

Nai Xia
2011-11-05 00:09:02 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Nai Xia
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
              */
             if (vma_start >= new_vma->vm_start &&
                 vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code.  With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.
I doubt it can be removed without crashing in the above case.
Yes, this line itself should not be removed. As I explained,
pgoff adjustment at the top of the copy_vma() for non-faulted
vma will lead to this case. But we do not need to worry
about the move_page_tables() that happens after this.
And so no lines need to be added here. But maybe the
documentation should be changed in your original patch
to clarify this. Reasoning with PTL locks for this case might
be somewhat misleading.

Furthermore, the move_page_tables() call following this case
might better be totally avoided for code readability and it's
simple to judge with (vma == new_vma).

Do you agree? :)
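
To illustrate (purely a sketch of the idea, not a tested patch), the guard
in move_vma() could look like:

        /*
         * Illustrative only: when vma_merge() merged src into itself,
         * the vma has never been faulted, so there are no ptes to move.
         */
        if (vma != new_vma)
                moved_len = move_page_tables(vma, old_addr, new_vma,
                                             new_addr, old_len);
        else
                moved_len = old_len;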
Post by Andrea Arcangeli
I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so when
rmap_walk starts again from scratch (after anon_vma_merge has run in
vma_adjust) it will find all ptes even if vma_merge succeeded before.
In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realize now).
What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change once rmap_walk has started.
What I thought before was rmap_walk checking vma1, then vma_merge
succeeding (where the src vma is vma2 and the dst vma is vma1, but vma1
is not a new vma queued at the end of same_anon_vma), move_page_tables
moving the pte from vma2 to vma1, and then rmap_walk checking vma2. But
again, vma_merge won't be allowed to complete in the middle of
rmap_walk, so this cannot trigger and we can safely drop the patch. It
wasn't immediate to think of the locks taken within vma_adjust, sorry.
Oh, no, sorry. I think I was trying to clarify in my first reply on
that thread that we all agree the anon_vma chain is 100% stable while
rmap_walk() is running.
What is important, I think, is the relative order of these three events:
1. The time rmap_walk() scans the src
2. The time rmap_walk() scans the dst
3. The time move_page_tables() moves PTEs from the src vma to the dst.

rmap_walk() scans dst (taking dst PTL) ---> move_page_tables() with
both PTLs ---> rmap_walk() scans src (taking src PTL)

will trigger this bug. The race is there even if rmap_walk() scans
src ---> dst, but that ordering does no harm. I think Mel explained why
the good ordering is safe in his first reply to my post.

vma_merge() is only guilty of giving a wrong order of VMAs before
move_page_tables() and rmap_walk() begin to race; it does not itself
race with rmap_walk().

You see, it seems this game might be really puzzling. Indeed, maybe it's
time to fall back on locks instead of playing with races. Just like in
the good old times, our classic OS textbook told us that shared
variables deserve locks. :-)
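
To make the harmful interleaving concrete, here is a deliberately trivial,
single-threaded userspace simulation of the three events above; the names
are illustrative only and nothing here is kernel code:

#include <stdbool.h>
#include <stdio.h>

struct slot { bool pte_present; };

int main(void)
{
        struct slot src = { true };     /* the pte starts in the src vma */
        struct slot dst = { false };
        bool walker_found = false;

        /* 1. rmap_walk() scans the dst vma first (the bad list order). */
        walker_found |= dst.pte_present;        /* nothing there yet */

        /*
         * 2. move_page_tables() moves the pte; each slot update here is
         *    "atomic", just as each pte move is done under both PTLs.
         */
        src.pte_present = false;
        dst.pte_present = true;

        /* 3. rmap_walk() scans the src vma, which is already empty. */
        walker_found |= src.pte_present;

        printf("walker %s the pte\n", walker_found ? "found" : "missed");
        return 0;
}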


Thanks,

Nai
Hugh Dickins
2011-11-05 02:21:28 UTC
Permalink
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code.  With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.
I doubt it can be removed without crashing in the above case.
Yes, this line itself should not be removed. As I explained,
pgoff adjustment at the top of the copy_vma() for non-faulted
vma will lead to this case.
Ah, thank you, that's what I was asking you to point me to, the place
I was missing that recalculates pgoff: at the head of copy_vma() itself.

Yes, if that adjustment remains (no reason why not), then we cannot
remove the *vmap = new_vma; but that is the only case that nowadays
can need the *vmap = new_vma (agreed?), which does deserve a comment.
Post by Nai Xia
But we do not need to worry
about the move_page_tables() that happens after this.
And so no lines need to be added here. But maybe the
documentation should be changed in your original patch
to clarify this. Reasoning with PTL locks for this case might
be somewhat misleading.
Right, there are no ptes there yet, so we cannot miss any.
Post by Nai Xia
Furthermore, the move_page_tables() call following this case
might better be totally avoided for code readability and it's
simple to judge with (vma == new_vma)
Do you agree? :)
Well, it's true that looking at pagetables in this case is just
a waste of time; but personally I'd prefer to add more comment
than special case handling for this.
Post by Nai Xia
Post by Andrea Arcangeli
I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so when
rmap_walk starts again from scratch (after anon_vma_merge has run in
vma_adjust) it will find all ptes even if vma_merge succeeded before.
In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realize now).
What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change once rmap_walk has started.
What I thought before was rmap_walk checking vma1, then vma_merge
succeeding (where the src vma is vma2 and the dst vma is vma1, but vma1
is not a new vma queued at the end of same_anon_vma), move_page_tables
moving the pte from vma2 to vma1, and then rmap_walk checking vma2. But
again, vma_merge won't be allowed to complete in the middle of
rmap_walk, so this cannot trigger and we can safely drop the patch. It
wasn't immediate to think of the locks taken within vma_adjust, sorry.
I found Andrea's "anon_vma_merge" reply very hard to understand; but
it looks like he now accepts that it was mistaken, or on the wrong
track at least...
Post by Nai Xia
Oh, no, sorry. I think I was trying to clarify in the first reply on
that thread that
we all agree that anon_vma chain is 100% stable when doing rmap_walk().
1. The time rmap_walk() scans the src
2. The time rmap_walk() scans the dst
3. The time move_page_tables() move PTE from src vma to dst.
... after you set us straight again with this.
Post by Nai Xia
rmap_walk() scans dst( taking dst PTL) ---> move_page_tables() with
both PTLs ---> rmap_walk() scans src(taking src PTL)
will trigger this bug. The racing is there even if rmap_walk() scans src--->dst
but that racing does not harm. I think Mel explained why it's safe for good
ordering in his first reply to my post.
vma_merge() is only guilty for giving a wrong order of VMAs before
move_page_tables() and rmap_walk() begin to race, itself does not race
with rmap_walk().
You see, it seems this game might be really puzzling. Indeed, maybe it's time
to fall back on locks instead of playing with racing. Just like the
good old time,
our classic OS text book told us that shared variables deserve locks. :-)
That's my preference, yes: this mail thread seems to cry out for that!

Hugh
Andrea Arcangeli
2011-11-05 03:07:18 UTC
Permalink
Post by Hugh Dickins
I found Andrea's "anon_vma_merge" reply very hard to understand; but
it looks like he now accepts that it was mistaken, or on the wrong
track at least...
No matter how we get the order right, we still need to reverse the
order in case of error without taking the lock. So even allocating a
new vma every time wouldn't be enough to get out of the ordering
games (it would be enough in the non-error path of course...).

So there are a few ways:

1) Keep my patch (adjust comment) and add a second ordering call in
the error path. Cleanup the *vmap case.

2) Always allocate a new vma, merge later, and still keep my patch for
reversing the order in the error path only (not a huge improvement
if we still have to reverse the order). So this now looks like the
worst option in light of the error path, which would give trouble
by going the opposite way... again.

3) Return to your fix that takes the anon_vma lock during the pte
moves

Fixing my patch requires just a one-liner for the error path; it's
not that the patch was wrong, in fact it reduced the window even more,
it just missed a one-liner in the error path.

But it's still doing reordering. Which I think is safe, and not
fundamentally different in ordering terms from the old anon_vma logic
before _chain (which is why this bug could have triggered before
too). But certainly more complex than taking the anon_vma lock around
every pagetable move, that's for sure. fork will still rely on the
ordering, but fork has a super easy life compared to mremap, which goes
both ways and has vma_merge in it too, which makes the vma order
non-deterministic.

Andrea Arcangeli
2011-11-05 17:06:22 UTC
Permalink
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE		/* for mremap() and MREMAP_* */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

#ifndef SIZE
#define SIZE (128UL*1024*1024)	/* assumed value; SIZE was left undefined in the original posting */
#endif

int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);

memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");

return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 22 ++++++++++++++++++++--
mm/mremap.c | 1 +
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index 3c0061f..948513d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (!vma->vm_file && !vma->anon_vma) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2338,8 +2341,23 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
* Source vma may have been merged into new_vma
*/
if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ vma_start < new_vma->vm_end) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..d845537 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
+ anon_vma_moveto_tail(vma);
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
vma = new_vma;
old_len = new_len;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6541cf7..9832f03 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe. They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

David Rientjes
2011-12-08 03:24:59 UTC
Permalink
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
Is this still needed? It's missing in linux-next.

Andrea Arcangeli
2011-12-08 12:42:17 UTC
Permalink
Post by David Rientjes
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
Is this still needed? It's missing in linux-next.
Yes it's needed, either this or the anon_vma lock around
move_page_tables. Then we also need the i_mmap_mutex around fork or a
triple loop in vmtruncate (then we could remove i_mmap_mutex in
mremap).

Andrew Morton
2011-12-09 00:08:56 UTC
Permalink
On Sat, 5 Nov 2011 18:06:22 +0100
Post by Andrea Arcangeli
This patch adds a anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
It's not obvious to me that the patch which I merged is the one which
we want to merge, given the amount of subsequent discussion. Please
check this.

I'm thinking we merge this into 3.3-rc1, tagged for backporting into
3.2.x. To give us additional time to think about it and test it.

Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
or earlier?



From: Andrea Arcangeli <***@redhat.com>
Subject: mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()

migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);

memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");

return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Signed-off-by: Andrea Arcangeli <***@redhat.com>
Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Cc: Hugh Dickins <***@google.com>
Cc: Pawel Sikora <***@agmk.net>
Cc: <***@vger.kernel.org>
Signed-off-by: Andrew Morton <***@linux-foundation.org>
---

include/linux/rmap.h | 1
mm/mmap.c | 22 ++++++++++++++++++--
mm/mremap.c | 1
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 2 deletions(-)

diff -puN include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma include/linux/rmap.h
--- a/include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff -puN mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mmap.c
--- a/mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mmap.c
@@ -2349,13 +2349,16 @@ struct vm_area_struct *copy_vma(struct v
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (!vma->vm_file && !vma->anon_vma) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2365,8 +2368,23 @@ struct vm_area_struct *copy_vma(struct v
* Source vma may have been merged into new_vma
*/
if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ vma_start < new_vma->vm_end) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff -puN mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mremap.c
--- a/mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
+ anon_vma_moveto_tail(vma);
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
vma = new_vma;
old_len = new_len;
diff -puN mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/rmap.c
--- a/mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct
}

/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe. They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.
_

Andrea Arcangeli
2011-12-09 01:55:06 UTC
Permalink
Post by Andrew Morton
It's not obvious to me that the patch which I merged is the one which
we want to merge, given the amount of subsequent discussion. Please
check this.
That's not the last version.
Post by Andrew Morton
I'm thinking we merge this into 3.3-rc1, tagged for backporting into
3.2.x. To give us additional time to think about it and test it.
Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
or earlier?
Probably not serious enough, I'm not aware of anybody reproducing it.

Then we also have to think about what to do with the i_mmap_mutex:
whether to remove it from mremap too, or add it to fork too.

The problem with the i_mmap_mutex is that the prio tree, being a tree,
gives us no way to ensure that the order of the range "walk" is related
to the order of "insertion". So a solution like the one below can't
work for the prio tree (it only works for the anon_vma_chain _list_).

Either we loop twice in the rmap_walk (adding a third loop to
vmtruncate) or we add the i_mmap_mutex to fork (where it looks to be
missing, and the page_mapped check in __delete_from_page_cache can
probably fire if such a race triggers; otherwise it looks like a fairly
innocent race, but clearly the implications aren't obvious or there
would be no BUG_ON in __delete_from_page_cache).

For file mappings the only rmap walk that has to be exact and must not
miss any pte is the vmtruncate path. That's why only vmtruncate would
need a third loop (third because we need a first loop before the
pagecache truncation, and two more loops to catch all ptes, or a
temporary, but only temporary, pte can still be mapped and fire the
BUG_ON in __delete_from_page_cache).
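
Sketched very roughly, the "loop twice" idea for an exact file-backed walk
would look like the fragment below (illustrative pseudo-structure only, not
a real patch; vma_prio_tree_foreach is the 3.x-era iterator over i_mmap):

        struct prio_tree_iter iter;
        struct vm_area_struct *vma;
        int pass;

        /*
         * Two passes: since mremap moves a given pte only once, at most
         * one of the two passes can be fooled by the move; the other
         * pass sees the pte either at the source or at the destination,
         * whatever the tree walk order is.
         */
        for (pass = 0; pass < 2; pass++) {
                vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap,
                                      pgoff, pgoff) {
                        /* unmap/examine the page's ptes in this vma */
                }
        }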

For anon pages it's only split_huge_page and remove_migration_ptes
that shouldn't miss ptes/hugepmds.

===
From: Andrea Arcangeli <***@redhat.com>
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma

migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);

memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");

return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 24 +++++++++++++++++++++---
mm/mremap.c | 9 +++++++++
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..adea3b8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (unlikely(!vma->vm_file && !vma->anon_vma)) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2337,9 +2340,24 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
/*
* Source vma may have been merged into new_vma
*/
- if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ if (unlikely(vma_start >= new_vma->vm_start &&
+ vma_start < new_vma->vm_end)) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..87bb839 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -221,6 +221,15 @@ static unsigned long move_vma(struct vm_area_struct *vma,
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
if (moved_len < old_len) {
/*
+ * Before moving the page tables from the new vma to
+ * the old vma, we need to be sure the old vma is
+ * queued after new vma in the same_anon_vma list to
+ * prevent SMP races with rmap_walk (that could lead
+ * rmap_walk to miss some page table).
+ */
+ anon_vma_moveto_tail(vma);
+
+ /*
* On error, move entries back from new area to old,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
diff --git a/mm/rmap.c b/mm/rmap.c
index a4fd368..a2e5ce1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe. They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

Andrea Arcangeli
2011-11-04 23:56:03 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
Really? Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Right, the vma being the same for src/dst doesn't mean the PT lock is
the same: it might be, if the source and destination pte entries fit in
the same page table, but maybe not if the vma is >2M (the maximum a
single page table can map).
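
A back-of-the-envelope way to see when two addresses can share a pte lock,
assuming x86_64 with 4K pages and split pte locks (the helper only models
the address arithmetic, it is not a kernel interface):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21   /* one page table maps 2M of address space here */

/*
 * With split pte locks the lock lives in the page-table page, and one
 * page-table page covers one pmd entry's worth of address space, so two
 * addresses of the same mm share a pte lock only if they fall under the
 * same pmd entry.
 */
static bool same_pte_page(unsigned long a, unsigned long b)
{
        return (a >> PMD_SHIFT) == (b >> PMD_SHIFT);
}

int main(void)
{
        unsigned long src = 0x700000000000UL;

        printf("src vs src+4K: %d\n",
               same_pte_page(src, src + (1UL << PAGE_SHIFT)));
        printf("src vs src+4M: %d\n",
               same_pte_page(src, src + (4UL << 20)));
        return 0;
}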
Post by Hugh Dickins
Post by Andrea Arcangeli
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Failure paths. Good point, they'd need to be reversed again in that
case.
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Hmm no it wouldn't be ok, or I wouldn't want to risk that.
Post by Hugh Dickins
Certainly I'd agree that it's a very rare case. But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Agreed.
Post by Hugh Dickins
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
I think what Nai didn't consider in explaining this theoretical race,
and what I only noticed now, is the anon_vma root lock taken by
vma_adjust.

If the merge succeeds, vma_adjust will take the lock and flush away
from all other CPUs any sign of rmap_walk before the move_page_tables
can start.

So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting for the rmap_walk to
complete before proceeding to move any pte, thanks to the anon_vma
lock already taken by vma_adjust.

So the real fix for the real bug is the one already merged in kernel
v3.1 and we don't need to make any more changes because there is no
race left.

The only bug was the lack of the PT lock before checking the pte:
migrate could read the pte while move_ptes had transferred it from
src_ptep to the kernel stack but not yet written it to dst_ptep. That
is closed by taking the PT lock in migrate before checking whether the
pte could be a migration pte (so flushing move_ptes away from all other
CPUs while migrate checks whether a migration pte is mapped there).
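
For reference, that 3.1 fix means migration_entry_wait() now only examines
the pte with the page-table lock held; a compressed sketch (not the verbatim
source) of the resulting logic:

        spinlock_t *ptl = pte_lockptr(mm, pmd);
        pte_t *ptep = pte_offset_map(pmd, address);
        pte_t pte;
        swp_entry_t entry;
        struct page *page;

        spin_lock(ptl);                 /* serializes against move_ptes() */
        pte = *ptep;
        if (!is_swap_pte(pte))
                goto out;
        entry = pte_to_swp_entry(pte);
        if (!is_migration_entry(entry))
                goto out;

        page = migration_entry_to_page(entry);
        if (!get_page_unless_zero(page))
                goto out;
        pte_unmap_unlock(ptep, ptl);
        wait_on_page_locked(page);
        put_page(page);
        return;
out:
        pte_unmap_unlock(ptep, ptl);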

I don't think the ordering matters anymore, Nai theory sounded good
there was just one small detail he missed in the vma_merge internal
locking that prevents the race to trigger.
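
(The shape of the fix described above, reconstructed from the description
rather than quoted from the commit, is roughly:)

        void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
                                  unsigned long address)
        {
                spinlock_t *ptl;
                pte_t *ptep, pte;
                swp_entry_t entry;

                /* take the PT lock _before_ looking at the pte */
                ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
                pte = *ptep;
                if (!is_swap_pte(pte))
                        goto out;
                entry = pte_to_swp_entry(pte);
                if (!is_migration_entry(entry))
                        goto out;
                /* ... wait for the migration to finish, then refault ... */
        out:
                pte_unmap_unlock(ptep, ptl);
        }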
Post by Hugh Dickins
If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).
Actually I thought about this one before I noticed the vma_merge()
internal locking that prevents Nai's remaining race from triggering. My
conclusion is that the anon_vma_chains don't actually change anything with
regard to ordering. It becomes a bit multidimensional to think about,
which complicates things considerably, but the ordering issue could have
happened before too, and the fix would have worked for both.

The old anon_vma is like three-dimensional (vma, anon_vma, page). Now it's
(vma, chain, anon_vma, page). But if you consider just a single process
that execve'd without any child, it reduces to three dimensions again. The
moment you add children, you can imagine the old "three dimension"
anon_vma logic as the parent's. And if the parent is safe with all the
children's vmas on the same_anon_vma list, then the children are surely
safe to reorder that way too. But hey, it's not needed, so we're faster,
we don't have to do those list searches during mremap, and it's simpler
too :).
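
(For readers less familiar with the "chain" dimension, the linkage being
discussed is roughly the 3.0-era struct, simplified and commented from
memory:)

        struct anon_vma_chain {
                struct vm_area_struct *vma;     /* the vma this chain belongs to */
                struct anon_vma *anon_vma;      /* the anon_vma it links to */
                struct list_head same_vma;      /* all chains of one vma */
                struct list_head same_anon_vma; /* the per-anon_vma list that
                                                   rmap_walk iterates; its
                                                   ordering is what's debated */
        };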

Nai Xia
2011-11-05 00:21:03 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Andrea Arcangeli
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
             */
            if (vma_start >= new_vma->vm_start &&
                vma_start < new_vma->vm_end)
+                   /*
+                    * No need to call anon_vma_order_tail() in
+                    * this case because the same PT lock will
+                    * serialize the rmap_walk against both src
+                    * and dst vmas.
+                    */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Right, vma being the same for src/dst doesn't mean the PT lock is the
same, it might be if source pte entry fit in the same pagetable but
maybe not if the vma is >2M (the max a single pagetable can point to).
Post by Hugh Dickins
Post by Andrea Arcangeli
                    *vmap = new_vma;
+           else
+                   anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Failure paths. Good point, they'd need to be reversed again in that
case.
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Hmm no it wouldn't be ok, or I wouldn't want to risk that.
Post by Hugh Dickins
Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Agreed.
Post by Hugh Dickins
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
I think what Nai didn't consider in explaining this theoretical race
that I noticed now is the anon_vma root lock taken by adjust_vma.
If the merge succeeds adjust_vma will take the lock and flush away
from all others CPUs any sign of rmap_walk before the move_page_tables
can start.
So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting the rmap_walk to
complete. Before proceeding moving any pte. Thanks to the anon_vma
lock already taken by adjust_vma.
Still, I think it's not rmap_walk() ---> mremap() ---> rmap_walk() that
triggers the bug; this sequence of events would:

copy_vma() ---> rmap_walk() scans dst VMA ---> move_page_tables() moves src to dst
---> rmap_walk() scans src VMA. :D
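
(Spelling the interleaving out, purely as an illustration:)

        /*
         * rmap_walk (migration)              mremap
         * ---------------------              ------
         *                                    copy_vma() sets up dst vma
         * take anon_vma lock
         * scan dst vma: pte not moved yet
         *                                    move_page_tables(): src -> dst
         * scan src vma: pte already gone
         * => the migration pte is never found, never removed
         */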

I might be wrong. But thank you all for the time and patience in playing
this racing game with me. It's really an honor to exhaust my mind on a
daunting problem with you. :)


Best Regards,

Nai

Nai Xia
2011-11-05 00:59:02 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Andrea Arcangeli
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
             */
            if (vma_start >= new_vma->vm_start &&
                vma_start < new_vma->vm_end)
+                   /*
+                    * No need to call anon_vma_order_tail() in
+                    * this case because the same PT lock will
+                    * serialize the rmap_walk against both src
+                    * and dst vmas.
+                    */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Right, vma being the same for src/dst doesn't mean the PT lock is the
same, it might be if source pte entry fit in the same pagetable but
maybe not if the vma is >2M (the max a single pagetable can point to).
Post by Hugh Dickins
Post by Andrea Arcangeli
                    *vmap = new_vma;
+           else
+                   anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Failure paths. Good point, they'd need to be reversed again in that
case.
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Hmm no it wouldn't be ok, or I wouldn't want to risk that.
Post by Hugh Dickins
Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Agreed.
Post by Hugh Dickins
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
I think what Nai didn't consider in explaining this theoretical race
that I noticed now is the anon_vma root lock taken by adjust_vma.
If the merge succeeds adjust_vma will take the lock and flush away
from all others CPUs any sign of rmap_walk before the move_page_tables
can start.
So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting the rmap_walk to
complete. Before proceeding moving any pte. Thanks to the anon_vma
lock already taken by adjust_vma.
Still,  I think it's not rmap_walk() ---> mremap() --> rmap_walk() that trigger
copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
--->  rmap_walk() scan src VMA.  :D
OK, I think I need to be more concise: your last reasoning only ensures
that mremap as a whole entity cannot interleave with rmap_walk(). But I
think nothing prevents move_page_tables() from doing this. As long as
copy_vma() produces a wrong ordering, the race between rmap_walk() and
move_page_tables() afterwards may trigger the bug.

Do you agree?
I might be wrong. But thank you all for the time and patience for
playing this racing game
with me. It's really an honor to exhaust my mind on a daunting thing
with you. :)
Best Regards,
Nai
Andrea Arcangeli
2011-11-05 01:33:17 UTC
Permalink
Post by Nai Xia
copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
---> rmap_walk() scan src VMA. :D
Hmm, yes. I think I got on the wrong track because I focused too much on
that line you started talking about, the *vmap = new_vma one; you said I
had to reorder stuff there too, and that didn't make sense.

The reason it doesn't make sense is that it can't be OK to reorder stuff
when *vmap = new_vma (i.e. new_vma = old_vma). So, since I didn't need to
reorder in that case, I thought I could extrapolate that it was always OK.

But the opposite is true: that case can't be solved at all.

Can it really happen that vma_merge() packs (prev_vma, new_range,
old_vma) together into a single vma (i.e. prev_vma extended to
old_vma->vm_end)?

Even if there's no prev_vma in the picture (that's the extreme case) it
cannot be safe: i.e. a (new_range, old_vma) or an (old_vma, new_range)
merge.

One single "vma" for the src and dst virtual ranges means one single
vma->vm_pgoff. But we have two virtual addresses and two ptes, so the same
page->index can't work for both if the vma->vm_pgoff is the same.

So regardless of the ordering, here we're dealing with something more
fundamental.

If rmap_walk runs immediately after vma_merge() completes and releases the
anon_vma_lock, it won't find any pte in the vma anymore, no matter the
order.
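
(The linearity in question is the relation rmap uses to turn a page back
into a virtual address; roughly what mm/rmap.c's vma_address() computes,
simplified and assuming PAGE_CACHE_SIZE == PAGE_SIZE:)

        static unsigned long vma_address(struct page *page,
                                         struct vm_area_struct *vma)
        {
                pgoff_t pgoff = page->index;

                return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
        }

With a single merged vma (one vm_pgoff) this resolves to exactly one
address per page, so it cannot describe the pte at the old address and the
one at the new address at the same time.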

I thought about this before and didn't mention it, but in light of the
above issue I'm starting to think this is the only possible correct
solution to the problem: we should just never call vma_merge() before
move_page_tables(), and do the merge by hand later, after mremap is
complete.

The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new vma
we'll always have it queued at the end, so it'll be safe for the same
reasons fork is safe.

Always allocate a new vma, and then, after the whole vma copy is complete,
check whether we can merge and free some vma. After the fact, which means
we can't use vma_merge() anymore; vma_merge() assumes the new_range is
"virtual" and no vma is mapped there, I think. Anyway, that's an
implementation issue. In some unlikely case we'll allocate one more vma
than before, and we'll free it once mremap is finished, but that's a small
problem compared to solving this once and for all.

And that will fix it without ordering games, and it'll fix the
*vmap = new_vma case too. That case really tripped me up, as I was
assuming *that* one was correct.

Nai Xia
2011-11-05 02:00:52 UTC
Permalink
Post by Andrea Arcangeli
Post by Nai Xia
copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
--->  rmap_walk() scan src VMA.  :D
Hmm yes. I think I got in the wrong track because I focused too much
on that line you started talking about, the *vmap = new_vma, you said
I had to reorder stuff there too, and that didn't make sense.
Oh, I think you misunderstood me on that. I was just saying:

If *vmap = new_vma, then _NO_ PTEs need to be moved afterwards, because
the vma has not been faulted in at all. Otherwise it breaks the
page->index semantics in the way I explained in my reply to Hugh.

So nothing needs to be added there, but the reason is the reasoning above,
not the same PTL being taken...

And for this case alone, I think the proper place to handle it is outside
move_vma(), inside do_mremap(), with only vma_adjust()- and vma_merge()-
like operations, because it really does not involve move_page_tables().
Post by Andrea Arcangeli
The reason it doesn't make sense is that it can't be ok to reorder
stuff when *vmap = new_vma (i.e. new_vma = old_vma). So if I didn't
need to reorder in that case I thought I could extrapolate it was
always ok.
But the opposite is true: that case can't be solved.
Can it really happen that vma_merge will pack (prev_vma, new_range,
old_vma) together in a single vma? (i.e. prev_vma extended to
old_vma->vm_end)
Even if there's no prev_vma in the picture (but that's the extreme
case) it cannot be safe: i.e. a (new_range, old_vma) or (old_vma,
new_range).
1 single "vma" for src and dst virtual ranges, means 1 single
vma->vm_pgoff. But we've two virtual addresses and two ptes. So the
same page->index can't work for both if the vma->vm_pgoff is the
same.
So regardless of the ordering here we're dealing with something more
fundamental.
If rmap_walk runs immediately after vma_merge completes and releases
the anon_vma_lock, it won't find any pte in the vma anymore. No matter
the order.
I thought at this before and I didn't mention it but at the light of
the above issue I start to think this is the only possible correct
solution to the problem. We should just never call vma_merge before
move_page_tables. And do the merge by hand later after mremap is
complete.
The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.
Always allocate a new vma, and then after the whole vma copy is
complete, look if we can merge and free some vma. After the fact, so
it means we can't use vma_merge anymore. vma_merge assumes the
new_range is "virtual" and no vma is mapped there I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate 1
more vma than before, and we'll free it once mremap is finished, but
that's small problem compared to solving this once and for all.
And that will fix it without ordering games and it'll fix the *vmap=
new_vma case too. That case really tripped on me as I was assuming
*that* was correct.
Yes. "Allocating a new vma, copy first and merge later " seems
another solution without the tricky reordering. But you know,
I now share some of Hugh's feeling that maybe we are too
desperate using racing in places where locks are simpler
and guaranteed to be safe.

But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?


Thanks

Nai

Mel Gorman
2011-11-07 13:14:13 UTC
Permalink
Post by Nai Xia
<SNIP>
The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.
Always allocate a new vma, and then after the whole vma copy is
complete, look if we can merge and free some vma. After the fact, so
it means we can't use vma_merge anymore. vma_merge assumes the
new_range is "virtual" and no vma is mapped there I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate 1
more vma than before, and we'll free it once mremap is finished, but
that's small problem compared to solving this once and for all.
And that will fix it without ordering games and it'll fix the *vmap=
new_vma case too. That case really tripped on me as I was assuming
*that* was correct.
Yes. "Allocating a new vma, copy first and merge later " seems
another solution without the tricky reordering. But you know,
I now share some of Hugh's feeling that maybe we are too
desperate using racing in places where locks are simpler
and guaranteed to be safe.
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
--
Mel Gorman
SUSE Labs

Andrea Arcangeli
2011-11-07 15:42:35 UTC
Permalink
Post by Mel Gorman
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
Adding a one-liner in the error path and a bugcheck in the *vmap case
doesn't seem the end of the world compared to my previous fix that you
acked. I suspect last Friday I was probably confused for a little while
because I was recovering from some flu I picked up with the cold weather,
and the confusion around the *vmap case, which I assumed was safe (not
only when no page had been faulted in yet), also didn't help.

BTW, with regard to those comments about the human brain being weak, well,
I doubt a monkey brain would work better, so in the absence of some alien
brain that may work better than ours, we should concentrate and handle it
:). The ordering constraint isn't going away no matter what we do in
mremap; fork has the exact same issue, except it doesn't require
reordering, but my patch documents that.

NOTE: if we could remove _all_ the ordering dependencies between the vmas
pointed to by the anon_vma_chains queued on the same_anon_vma list and all
the rmap_walks, then I would be more inclined to agree on keeping the
simpler way, because then we would stop playing the ordering games
altogether. But regardless of mremap we'll still be playing ordering games
with fork vs rmap_walk, so we can exploit that to run a bit faster in
mremap too and play the same ordering game (though I admit it's more
complex to play the ordering games in mremap, as it requires two more
function calls for the vma_merge case), but it's not fundamentally
different.

Mel Gorman
2011-11-07 16:28:08 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
Adding one liner in the error path and a bugcheck in the *vmap case,
doesn't seem the end of the world compared to my previous fix that you
acked.
Note that I didn't suddenly turn that ack into a nack, although:

1) A small comment on why we need to call anon_vma_moveto_tail() in the
   error path would be nice

2) It is unfortunate that we need faulted_in_anon_vma just for a VM_BUG_ON
   check that only exists for CONFIG_DEBUG_VM, but that's not earth
   shattering

What I said was that taking the anon_vma lock may be slower but is
generally easier to understand. I'm happy with the new patch too,
particularly as it keeps the "ordering game" consistent for fork and
mremap, but I previously missed move_page_tables() in the error path and
so was worried there was something else I had managed to miss,
particularly in light of the "allocate a new vma, copy first and merge
later" direction.

I'm also perfectly happy with my human meat brain and do not expect to
replace it with an alien's.
--
Mel Gorman
SUSE Labs

Andrea Arcangeli
2011-11-09 01:25:42 UTC
Permalink
Post by Mel Gorman
Note that I didn't suddenly turn that ack into a nack although
:)
Post by Mel Gorman
1) A small comment on why we need to call anon_vma_moveto_tail in the
error path would be nice
I can add that.
Post by Mel Gorman
2) It is unfortunate that we need the faulted_in_anon_vma just
for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
but not earth shatting
It should be optimized away at build time. I thought it was better not to
leave that path without a VM_BUG_ON. It should be a slow path in the first
place (we should probably even mark it unlikely), and it's obscure enough
that I think a check will clarify things. In the common case (i.e. some
pte already faulted in), that vma_merge on self, if it succeeded, couldn't
possibly be safe, because the vma->vm_pgoff vs page->index linearity can't
hold for the same vma and the same page at two different virtual
addresses. So checking for it seems sane to me. Especially given that at
some point it was mentioned we could optimize away the check altogether,
it's a bit of an obscure path that the VM_BUG_ON will, I think, help
document (and verify).
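
(In code terms, the path being discussed in copy_vma() would look roughly
like this; a sketch of the idea rather than the literal patch, with
faulted_in_anon_vma assumed to have been computed earlier as "some anon
page was already faulted into this vma":)

        if (unlikely(vma_start >= new_vma->vm_start &&
                     vma_start < new_vma->vm_end)) {
                /*
                 * vma_merge() merged the range back into the source vma
                 * itself. That can only be safe when nothing was faulted
                 * in yet, otherwise the vma->vm_pgoff vs page->index
                 * linearity could not hold at two different virtual
                 * addresses for the same vma and page.
                 */
                VM_BUG_ON(faulted_in_anon_vma);
                *vmap = new_vma;
        } else
                anon_vma_moveto_tail(new_vma);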
Post by Mel Gorman
What I said was taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too
particularly as it keeps the "ordering game" consistent for fork
and mremap but I previously missed move_page_tables in the error
path so was worried if there was something else I managed to miss
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.
I liked that direction a lot. I thought with that we could stick to the
exact same behavior as fork and not need to reorder stuff. But the error
path is still in the way, and we'd have to undo the move in place without
tearing down the vmas. Plus it would have required writing more code, and
the allocation path wouldn't necessarily have been faster than a
reordering if the list is not huge.
Post by Mel Gorman
I'm also prefectly happy with my human meat brain and do not expect
to replace it with an aliens.
8-)

On a totally different but related topic: unmap_mapping_range_tree()
walks the prio tree the same way try_to_unmap_file() walks it, and if
truncate can truncate "dst" before "src" then supposedly
try_to_unmap_file() could also miss a migration entry copied into the
"child" ptep while fork runs. But I think there is no risk there, because
we don't establish migration ptes there and we just unmap the pagecache,
so worst case we abort migration if the race triggers and we retry later.
But I wonder what happens if truncate runs against fork: if truncate can
drop ptes from dst before src (like the mremap comment says), we could
still end up with some pte mapped to the file in the ptes of the child,
even if the pte was correctly truncated in the parent...

Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate aren't
fundamentally different in this relation. If we rely on ordering for anon
pages in fork, it's not adding too much mess to also rely on ordering for
mremap. If we take the i_mmap_mutex in mremap because we can't enforce an
order in the prio tree, then we need the i_mmap_mutex in fork too (and
that's missing). But nothing prevents us from using a lock in mremap and
ordering in fork. I think the decision should be based more on performance
expectations.

So we could either add the ordering to mremap (patch posted) and add the
i_mmap_mutex to fork, or add the anon_vma lock to both mremap and fork,
plus the i_mmap_lock to fork.

Also note: if we find a way to enforce ordering in the prio tree (not sure
if it's possible; apparently it's already using list_add_tail, so...),
then we could also remove the i_mmap_lock from mremap and fork.

Keeping the anon and file cases separated is better though. I think the
patch I posted should close the race and be OK, pending your requested
changes. If you think taking the lock is faster, that's fine with me, but
I think taking the anon_vma lock once per VMA (plus the anon_vma_chain
list walk) and reducing the per-pagetable locking overhead is better.
Ideally the anon_vma_chain lists won't be long anyway, and if they are
long and lots of processes do mremap at the same time it should still work
better. The anon_vma root lock is not so small a lock to take, and it's
better not to take it repeatedly. I also recall Andi's patches trying to
avoid doing lock/unlock in a tight loop; if we take it and do some work
while it's held, that's likely better than bouncing it at high frequency
across CPUs for each pmd.

Nai Xia
2011-11-11 09:14:51 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
Note that I didn't suddenly turn that ack into a nack although
:)
Post by Mel Gorman
  1) A small comment on why we need to call anon_vma_moveto_tail in the
     error path would be nice
I can add that.
Post by Mel Gorman
  2) It is unfortunate that we need the faulted_in_anon_vma just
     for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
     but not earth shatting
It should be optimized away at build time. It thought it was better
not to leave that path without a VM_BUG_ON. It should be a slow path
in the first place (probably we should even mark it unlikely). And
it's obscure enough that I think a check will clarify things. In the
common case (i.e. some pte faulted in) that vma_merge on self if it
succeeds, it couldn't possibly be safe because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page on two different virtual addresses. So checking for it I think is
sane. Especially given at some point it was mentioned we could
optimize away the check all together, so it's a bit of an obscure path
that the VM_BUG_ON I think will help document (and verify).
Post by Mel Gorman
What I said was taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too
particularly as it keeps the "ordering game" consistent for fork
and mremap but I previously missed move_page_tables in the error
path so was worried if there was something else I managed to miss
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.
I liked that direction a lot. I thought with that we could stick to
the exact same behavior of fork and not need to reorder stuff. But the
error path is still in the way, and we've to undo the move in place
without tearing down the vmas. Plus it would have required to write
mode code, and the allocation path wouldn't have necessarily been
faster than a reordering if the list is not huge.
Post by Mel Gorman
I'm also prefectly happy with my human meat brain and do not expect
to replace it with an aliens.
8-)
On a totally different but related topic, unmap_mapping_range_tree
walks the prio tree the same way try_to_unmap_file walks it and if
truncate can truncate "dst" before "src" then supposedly the
try_to_unmap_file could miss a migration entry copied into the "child"
ptep while fork runs too... But I think there is no risk there because
we don't establish migration ptes there, and we just unmap the
pagecache, so worst case we'll abort migration if the race trigger and
we'll retry later. But I wonder what happens if truncate runs against
fork, if truncate can drop ptes from dst before src (like mremap
comment says), we could still end up with some pte mapped to the file
in the ptes of the child, even if the pte was correctly truncated in
the parent...
Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
aren't fundamentally different in relation. If we relay on ordering
for anon pages in fork it's not adding too much mess to also relay on
ordering for mremap. If we take the i_mmap_mutex in mremap because we
can't enforce a order in the prio tree, then we need the i_mmap_mutex
in fork too (and that's missing). But nothing prevents us to use a
lock in mreamp and ordering in fork. I think the decision should be
based more on performance expectations.
So we could add the ordering to mremap (patch posted), and add the
i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
fork, and the i_mmap_lock to fork.
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
Oh, well, I had thought that for a partial remap the src and dst VMAs are
inserted as different prio tree nodes, instead of being linked with
list_add_tail, which means they cannot be reordered back and forth at
all...

Andrea Arcangeli
2011-11-16 14:00:42 UTC
Permalink
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree, it's
walked in range order.

I thought of another solution that would avoid having to reorder the list
in mremap and avoid adding the i_mmap_mutex to fork (and then we could
remove it from mremap too). The solution is to rmap_walk twice: I mean two
loops over the same_anon_vma list for those rmap walks that must be
reliable (that includes two calls of unmap_mapping_range()). For both the
same_anon_vma list and the prio tree.

Reading truncate_pagecache() I see two loops already, and a comment saying
it's for fork(), to avoid leaking ptes in the child. So fork is probably
OK already without having to take the i_mmap_mutex, but then I wonder why
that also doesn't fix mremap if we do two loops there, and why the
i_mmap_mutex is really needed in mremap considering those two calls
already present in truncate_pagecache(). I wonder if that was a
"theoretical" fix that missed the fact that truncate already walks the
prio tree twice, so it doesn't matter if the rmap_walk goes in the
opposite direction of move_page_tables(). That i_mmap_lock in mremap (now
i_mmap_mutex) has been there since the start of the git history. The
double loop was introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So
it's very possible that the i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a), that the fix for fork was
already taking care of mremap too, and that the i_mmap_mutex can now be
removed.
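
(The double loop in question is the pair of unmap_mapping_range() calls
around truncate_inode_pages(); schematically, with argument names only
illustrative:)

        /* schematic shape of truncate_pagecache() around 3.0 */
        unmap_mapping_range(mapping, holebegin, 0, 1);  /* first pass */
        truncate_inode_pages(mapping, newsize);
        unmap_mapping_range(mapping, holebegin, 0, 1);  /* second pass: catch
                                                           ptes that appeared
                                                           during the truncate */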

Hugh Dickins
2011-11-17 00:16:57 UTC
Permalink
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.

That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.

Probably not an everyday occurrence :)

But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.
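
(One way that could look, with a purely hypothetical return value added to
the first pass; nothing below exists in the kernel as such:)

        /* hypothetical: first pass reports whether it saw private mappings */
        bool saw_private = unmap_mapping_range_check_private(mapping,
                                                             holebegin, 0, 1);
        truncate_inode_pages(mapping, newsize);
        if (saw_private)
                unmap_mapping_range(mapping, holebegin, 0, 1);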

But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.

By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
comment (probably mine) from truncate_pagecache():

/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/

The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).

Hugh

Nai Xia
2011-11-17 02:49:24 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
Probably not an everyday occurrence :)
But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
I think you guys are talking about two different COWs:

Andrea's question is about a new VMA being created by fork() between the
two loops, with PTEs being copied into it.

And you are referring to new PTEs getting COWed by __do_fault() in the
same VMA before the cache pages are really dropped.

From my point of view, the two loops there are really fork()-irrelevant;
as you said, they are only for COWed ptes missed in the same VMA before a
cache page becomes invisible to find_get_page().

As for Andrea's reasoning, I see this racing story as below:

1. fork() is safe without the tree lock/mutex after the second loop, for
the same reason it's safe for try_to_unmap_file(): the new VMA is linked
at the list tail in the *same* tree node as the old VMA in the vma
prio_tree. The old and new are traversed by vma_prio_tree_foreach() in the
proper order, and fork() does not have an error path requiring a backward
page table copy, which would need the reverse order.

2. A partial mremap is not safe for this without the tree lock/mutex,
because the src and dst VMAs are different prio_tree nodes, and their
relative order is not guaranteed.
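
(The fork-side linking referred to in point 1 happens in dup_mmap(); from
memory it looks roughly like the following, so treat it as a sketch rather
than a quote:)

        if (file) {
                /* ... */
                mutex_lock(&mapping->i_mmap_mutex);
                flush_dcache_mmap_lock(mapping);
                /* insert tmp into the share list, just after mpnt */
                vma_prio_tree_add(tmp, mpnt);
                flush_dcache_mmap_unlock(mapping);
                mutex_unlock(&mapping->i_mmap_mutex);
        }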



Nai
Post by Hugh Dickins
/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
Hugh
Nai Xia
2011-11-17 06:21:56 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I think this cannot happen either with proper ordering or with the tree
lock, and Andrea was asking whether the two-loop setup can avoid taking
the tree lock in mremap().

So a simple answer would be: no, the two-loop setup does not aim at
solving the PTE copy race in fork() (it happens to help there, though), so
it cannot solve the mremap problem either.
Post by Hugh Dickins
Probably not an everyday occurrence :)
But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.
Don't you think this is only a partial solution? Given that
truncate_inode_page() does not shoot down COWed ptes, the zapping of the
ptes and of the cache pages is not atomic anyway, so the second pass seems
unavoidable in the general case....

Of course, if you gave truncate_inode_page() an option to unmap COWed
ptes, the second pass might not be needed, but then you'd worry about the
performance.... a real dilemma, isn't it? :)
Post by Hugh Dickins
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
       /*
        * unmap_mapping_range is called twice, first simply for
        * efficiency so that truncate_inode_pages does fewer
        * single-page unmaps.  However after this first call, and
        * before truncate_inode_pages finishes, it is possible for
        * private pages to be COWed, which remain after
        * truncate_inode_pages finishes, hence the second
        * unmap_mapping_range call must be made for correctness.
        */
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
Hmm, yes, do_wp_page() does not take the page lock when doing the COW
(only the PTE lock), but I think another critical reason for the second
pass is that nothing prevents a just-zapped pte from taking a write fault
again and getting COWed in __do_fault(), just *before*
truncate_inode_pages() can take its page lock... so even if we bring
do_wp_page() under control of the page lock, the second pass is still
needed, right?

Nai
Post by Hugh Dickins
Hugh
Andrea Arcangeli
2011-11-17 18:42:52 UTC
Permalink
Hi Hugh,
Post by Hugh Dickins
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I tend to think it should still work fine. The second loop is needed to
take care of the "reverse" order: if the first move_page_tables() is not
in order, the second one will be, so it should be caught. If the first
move_page_tables() is in order, the double loop will catch any skip during
the second move_page_tables().

Well, if I'm missing something, worst case we'd need a dummy
mutex_lock/unlock of the i_mmap_mutex before running the rolling-back
move_page_tables(); no big deal, still out of the fast path.
Post by Hugh Dickins
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
The point is that if there's no way to fix truncate by removing the double
loop, for the other reasons, then we could take advantage of the double
loop in mremap too (adding a proper comment to truncate.c of course).
Post by Hugh Dickins
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
I see. Truncate locks down the page while it shoots down the pte, so no
new mapping can be established, while the COWs still can because they
don't take the lock on the old page. But do_wp_page() takes the lock for
anon pages and MAP_SHARED; it's a little weird that it doesn't take it for
MAP_PRIVATE (i.e. VM_SHARED not set). The MAP_SHARED path already checks
for page->mapping being NULL after the lock is obtained.

The double loop happens to make fork safe too, or the inverse ordering
between truncate and fork would lead to the same issue, and that would
also leave pagecache mapped (not just anon COWs). I don't see a lock_page
in fork; it just copies the pte and doesn't touch the page lock.

Note however that for a tiny window, with the current truncate code that
does unmap+truncate+unmap, there can still be a pte in the fork child that
points to orphaned pagecache (before the second call of
unmap_mapping_range() starts). It'd be a transient pte; it'll be dropped
as soon as the second unmap_mapping_range() runs. Not sure how bad that
is. To avoid it we'd need to run unmap+unmap+truncate; that way no pte in
the fork child could map orphaned pagecache anymore. But then the second
unmap wouldn't take down the COWs generated by do_wp_page() in MAP_PRIVATE
areas anymore.
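
(The transient window described above, spelled out:)

        /*
         * unmap_mapping_range()    first pass; a racing fork() can still
         *                          copy a not-yet-unmapped pte into the child
         * truncate_inode_pages()   pagecache page removed; the child pte now
         *                          maps an orphaned page
         * unmap_mapping_range()    second pass finally drops that transient
         *                          pte
         */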

So it boils down to whether we are OK with a transient pte mapping
orphaned pagecache for a little while. The only problem I can see is that
writes would then be discarded without triggering SIGBUS beyond the end of
i_size on MAP_SHARED. But if the write from the other process (or thread)
had happened a millisecond earlier it would have been discarded anyway. So
I guess it's not a problem, and it's mostly an implementation question
whether any code would dislike a pte pointing to orphaned pagecache for a
little while. I'm optimistic it can be made safe and we can just drop the
i_mmap_mutex completely from mremap, after checking that those transient
ptes mapping orphaned pagecache won't trigger asserts.

As for the anon_vma side, my ordering patch (last version I posted)
already fixes it. The other way is to add double loops. Or the
anon_vma->lock, of course!

If we go with double loops for anon_vma, with split_huge_page I could
unlink any anon_vma_chain where the address range matches but the pte/pmd
is not found, and re-check in the second loop _only_ those anon_vma_chains
where we failed to find a mapping. I've only thought about it, not
actually attempted to implement it. Even rmap_walk could do that, but it
requires changes to the caller (i.e. migrate.c), while for split_huge_page
it'd be a simpler local change. Then I would relink the re-checked
anon_vma_chains with list_splice. The whole list is protected by the root
anon_vma lock, which is held for the whole duration of split_huge_page, so
I guess it should be doable.

The rmap walks of file-backed mappings won't need any double loop (only
migrate and split_huge_page will need it), because neither
remove_migration_ptes nor split_huge_page runs on file-backed mappings:
migration ptes and hugepage splits only happen for anon memory. And
nothing would prevent adding double loops there too if we extend
split_huge_page to pagecache (we already double-loop in truncate).

Nai, if the prio tree could guarantee ordering, then 1) there would be no
i_mmap_lock, I guess, or there would be a comment that it's only for the
vma_merge case and the error path that goes in reverse order; and 2) if
you were right that list_add_tail in the prio tree, with both src and dst
vmas in the same node, guarantees ordering, it would imply the prio tree
works in O(N), and that can't be, or we'd use a list instead of a prio
tree. The whole idea of any structure smarter than a list is to insert
things in some "order" that depends on the index (the index being the
vm_start,vm_end range in the prio tree case) and to do some "work" on
insert so the walk can be faster, but that practically guarantees the walk
won't be in the same order as the insertions.

If the prio tree could guarantee ordering, then I could also reorder the
prio tree, extending my patch that already fixes the anon_vma case, and
still avoid the i_mmap_mutex without requiring double loops.

So, in short:

1) for anon, I'm not sure whether it's better to go with my current patch,
which fixes the anon case just fine, or with double loops in
split_huge_page/migrate, or with adding the anon_vma lock around
move_page_tables().

2) for file-backed, if we can deal with the transient pte on orphaned
pagecache, we can just add a comment to truncate.c and drop the
i_mmap_mutex.

Nai Xia
2011-11-18 01:42:05 UTC
Permalink
Post by Andrea Arcangeli
Hi Hugh,
Post by Hugh Dickins
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I tend to think it should still work fine. The second loop is needed
to take care of the "reverse" order. If the first move_page_tables is
not in order the second move_page_tables will be in order. So it
should catch it. If the first move_page_tables is in order, the double
loop will catch any skip in the second move_page_tables.
First of all, I believe that at the POSIX level, it's ok for
truncate_inode_page()
not scanning COWed pages, since basically we does not provide any guarantee
for privately mapped file pages for this behavior. But missing a file
mapped pte after its
cache page is already removed from the the page cache is a
fundermental malfuntion for
a shared mapping when some threads see the file cache page is gone
while some thread
is still r/w from/to it! No matter how short the gap between
truncate_inode_page() and
the second loop, this is wrong.

Second, even if we don't care about the POSIX flaw this may introduce, a
pte can still be missed by the second loop. mremap can happen several
times during these non-atomic firstpass-truncate-secondpass operations;
the right sequence of events can happily produce the wrong order for
every scan and miss them all -- that's just what Hugh had in mind in the
post you just replied to. Without a lock and proper ordering (which
partial mremap cannot provide), this *will* happen.

You may disagree with me and have that locking removed; I already have
that one-line patch prepared, waiting for the bug to pop up again -- what
a cheap patch submission!

:P


Thanks,

Nai

Andrea Arcangeli
2011-11-18 02:17:14 UTC
Permalink
Post by Nai Xia
First of all, I believe that at the POSIX level it's ok for
truncate_inode_page() not to scan COWed pages, since we basically do not
provide any guarantee for privately mapped file pages in this respect.
But missing a file-mapped pte after its cache page has already been
removed from the page cache is a
I also rule out that there is a case that would break, but it's safer to
keep things as is, in case somebody depends on segfault trapping.
Post by Nai Xia
fundamental malfunction for a shared mapping, when some threads see that
the file cache page is gone while some thread is still reading/writing
from/to it! No matter how short the gap between truncate_inode_page() and
the second loop, this is wrong.
Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond i_size concurrently with
truncate, the result is undefined. The write may well reach the page,
but then the page is discarded. Or you may get SIGBUS before the write.
Post by Nai Xia
Second, even if we don't care about the POSIX flaw this may introduce, a
pte can still be missed by the second loop. mremap can happen several
times during these non-atomic firstpass-truncate-secondpass operations;
the right sequence of events can happily produce the wrong order for
every scan and miss them all -- that's just what Hugh had in mind in the
post you just replied to. Without a lock and proper ordering (which
partial mremap cannot provide), this *will* happen.
There won't be more than one mremap running concurrently from the same
process (we must enforce it by making sure the anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma; they're already
both taken in fork, and they should already be taken in all common cases
in copy_vma, so in all cases it's going to be an L1-exclusive cacheline
already). I don't exclude that there may be some case that won't take the
locks in vma_adjust though; we should check it if we decide to rely on
the double loop, but it'd be a simple addition if needed.

I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
order is needed for the COWs).
Post by Nai Xia
You may disagree with me and have that locking removed; I already have
that one-line patch prepared, waiting for the bug to pop up again -- what
a cheap patch submission!
Well, I'm not yet sure it's a good idea to remove the i_mmap_mutex, or
whether we should just add the anon_vma lock in mremap and add the
i_mmap_lock in fork (to avoid the orphaned pagecache left mapped in the
child, which may already happen unless some i_mmap_lock belonging to the
same inode is taken after copy_page_range returns and before we return to
userland and the child can run), and I don't think we can rely on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can rely on the order of the same_anon_vma list.

I think clearing up whether this orphaned pagecache is dangerous would be
a good start. If that's too complex, we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think we
can deal with the orphaned pagecache, we can add a dummy lock/unlock of
i_mmap_mutex in copy_vma's vma_merge-succeeding case (a short critical
section and not a common case) and remove the i_mmap_mutex around
move_page_tables (the common case), overall speeding up mremap and not
degrading fork.
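
The "dummy lock/unlock" idea could look roughly like this in copy_vma()'s
vma_merge-succeeded path (an illustrative sketch of the proposal only, not
an actual patch; the variable name new_vma follows copy_vma's conventions):

	if (new_vma && new_vma->vm_file) {
		struct address_space *mapping = new_vma->vm_file->f_mapping;

		/*
		 * Serialize against a truncation whose unmap passes may be
		 * walking the i_mmap prio tree right now, before any ptes
		 * are moved into the merged vma.
		 */
		mutex_lock(&mapping->i_mmap_mutex);
		mutex_unlock(&mapping->i_mmap_mutex);
	}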

Nai Xia
2011-11-19 09:15:10 UTC
Permalink
Post by Andrea Arcangeli
Post by Nai Xia
First of all, I believe that at the POSIX level it's ok for
truncate_inode_page() not to scan COWed pages, since we basically do not
provide any guarantee for privately mapped file pages in this respect.
But missing a file-mapped pte after its cache page has already been
removed from the page cache is a
I also rule out that there is a case that would break, but it's safer to
keep things as is, in case somebody depends on segfault trapping.
Post by Nai Xia
fundamental malfunction for a shared mapping, when some threads see that
the file cache page is gone while some thread is still reading/writing
from/to it! No matter how short the gap between truncate_inode_page() and
the second loop, this is wrong.
Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond i_size concurrently with
truncate, the result is undefined. The write may well reach the page,
but then the page is discarded. Or you may get SIGBUS before the write.
Post by Nai Xia
Second, even if we don't care about the POSIX flaw this may introduce, a
pte can still be missed by the second loop. mremap can happen several
times during these non-atomic firstpass-truncate-secondpass operations;
the right sequence of events can happily produce the wrong order for
every scan and miss them all -- that's just what Hugh had in mind in the
post you just replied to. Without a lock and proper ordering (which
partial mremap cannot provide), this *will* happen.
There won't be more than one mremap running concurrently from the same
process (we must enforce it by making sure the anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma; they're already
both taken in fork, and they should already be taken in all common cases
in copy_vma, so in all cases it's going to be an L1-exclusive cacheline
already). I don't exclude that there may be some case that won't take the
locks in vma_adjust though; we should check it if we decide to rely on
the double loop, but it'd be a simple addition if needed.
I mean it's not about concurrent mremaps; mremap() can be done several
times between these 3-stage scans. Since we don't take the mmap_sem of
the scanned VMAs, they are free to do so. And without proper ordering and
locks/mutexes it's possible for these 3-stage scans to race with those
mremap()s, and a ghost PTE can just jump back and forth and miss all of
the scans.
Post by Andrea Arcangeli
I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
order is needed for the COWs).
Post by Nai Xia
You may disagree with me and have that locking removed; I already have
that one-line patch prepared, waiting for the bug to pop up again -- what
a cheap patch submission!
Well, I'm not yet sure it's a good idea to remove the i_mmap_mutex, or
whether we should just add the anon_vma lock in mremap and add the
i_mmap_lock in fork (to avoid the orphaned pagecache left mapped in the
child, which may already happen unless some i_mmap_lock belonging to the
same inode is taken after copy_page_range returns and before we return to
userland and the child can run), and I don't think we can rely on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can rely on the order of the same_anon_vma list.
I think clearing up whether this orphaned pagecache is dangerous would be
a good start. If that's too complex, we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think we
can deal with the orphaned pagecache, we can add a dummy lock/unlock of
i_mmap_mutex in copy_vma's vma_merge-succeeding case (a short critical
section and not a common case) and remove the i_mmap_mutex around
move_page_tables (the common case), overall speeding up mremap and not
degrading fork.
I actually feel comfortable with either direction you take :)

But I do think orphaned pagecache is not a good idea; don't you see there
is a "BUG_ON(page_mapped(page))" in __delete_from_page_cache()? Do you
really plan to remove that line?

Nai

Nai Xia
2011-10-22 05:07:11 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times, regardless of
whether it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chain structures (vma_link does that). That happens
before any pte can be moved.
Because we keep two vmas mapped over both the src and dst ranges, with
different vma->vm_pgoff values that are valid for the page (the page
doesn't change its page->index), the page should always find _all_ its
ptes at any given time.
There may be other variables at play, like whether the order of insertion
in the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore the anon_vma_chain
insertion is likely favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading this code last week.
And I do think this order matters; the reason is quite similar to why we
If rmap_walk goes dst--->src, then when it first looks into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
Yes I also think ordering matters as I mentioned in the previous email
that Nai answered to.
Post by Mel Gorman
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
Correct. Like I mentioned in that previous email that Nai answered,
that wouldn't be ok only if vma_merge succeeds and I didn't change my mind
about that...
copy_vma is only called by mremap so supposedly that path can
trigger. Looks like I was wrong about vma_merge being able to succeed
in copy_vma, and if it does I still think it's a problem as we have no
ordering guarantee.
The only other place that depends on the anon_vma_chain order is fork,
and there, no vma_merge can happen, so that is safe.
Post by Mel Gorman
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
The problem is rmap_walk will search dst before src. So it will do
nothing on dst. Then mremap moves the pte from src to dst. When rmap
walk then checks "src" it finds nothing again.
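
Spelled out as an interleaving (an illustrative timeline only, assuming
migration has already installed a migration entry for the page and is now
doing the remove_migration_ptes rmap walk):

	rmap_walk / remove_migration_ptes      mremap -> move_ptes
	---------------------------------      -------------------
	visits the dst vma first, takes
	the dst PT lock: no pte there,
	drops the lock
	                                       takes the src and dst PT locks,
	                                       moves the migration entry from
	                                       src to dst, drops the locks
	visits the src vma, takes the src
	PT lock: nothing there either
	walk completes with the migration
	entry still installed at dst; a
	later fault on it hits the BUG at
	include/linux/swapops.h:105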
Post by Mel Gorman
Post by Nai Xia
pte is not there, and it happily skips it and releases the PTL.
Then just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() looks
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it!
I believe the ordering is correct though and protects us in this case.
Normally it is, the only problem is vma_merge succeeding I think.
Post by Mel Gorman
Post by Nai Xia
IMO, this can really happen in the case of vma_merge() succeeding.
Imagine that the src vma is faulted late, and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (call it evil_vma) through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases
if two VMAs pass reusable_anon_vma() they were already merged.
What do you think?
I tried to understand the above scenario yesterday, but with 12 hours
of travel on me I just couldn't.
Oh yes, the first hypothesis was actually a vague feeling that things
might go wrong in that direction. The details in it were somewhat
misleading. But following that direction, I found the second, clear
hypothesis that leads to this bug step by step.
Post by Andrea Arcangeli
part of a vma is moved with mremap elsewhere. Then it is moved back to
its original place. So then vma_merge will succeed, and the "src" of
mremap is now queued last in anon_vma_chain, wrong ordering.
Oh yes, partial mremapping will do the trick. I was too fixated on
finding a case where two VMAs missed a normal merge chance but would
merge later on. The only such case I can find so far is the ENOMEM path
in vma_adjust().

Partial mremapping is a simpler case and definitely more likely to happen.
Post by Andrea Arcangeli
Today I read an email from Nai who showed apparently the same scenario
I was thinking of, without evil vmas or anything.
I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But it can
still happen in that case, I think (so I'm not sure how the above scenario
with an evil_vma could ever happen if it has a different anon_vma and
it's not part of a vma that is going back to its original place, like
in the second scenario Nai also posted about).
That Nai and I had the same scenario hypothesis independently (Nai's
second hypothesis, not the first quoted above), plus copy_vma doing
vma_merge and being only called by mremap, sounds like it can really
happen.
Post by Mel Gorman
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
I was thinking along these lines; the only pitfall should be when
something is moved and put back into its original place. When it is
moved, a new vma is created and queued last. When it's put back to its
original location, vma_merge will succeed, and "src" is now the
previous "dst", so it's queued last and that breaks the ordering.
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
I think Nai's correct, only second hypothesis though.
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and we
rely on the ordering). No locks, but we'll just do one more allocation
for one additional temporary vma that will be removed after mremap
completes.
2) Hugh's original fix.
The first option is probably faster and preferable; the vma_merge there
should only trigger when putting things back to their origin, I suspect,
and never with random mremaps, though I'm not sure how common it is to
put things back to their origin. If we're in a hurry we can merge Hugh's
patch and optimize it later. We can still retain the migrate fix if we
intend to take option 1 later. I didn't much like migrate doing
speculative access on ptes that it can't afford to miss or it'll crash
anyway.
Me too; I think it's error-prone, or at least we must be very careful
that it's not doing something evil. If the speculative access does not
save too much time, we need not bother wasting our mind power over it.
Post by Andrea Arcangeli
That said, the fix merged upstream is 99% certain to fix things in
practice already, so I doubt we're in a hurry. And if things go wrong,
these issues don't go unnoticed and they shouldn't corrupt memory even
if they trigger. I'm 100% certain it can't do damage (other than a
BUG_ON) for split_huge_page, as I count the pmds encountered in the
rmap_walk when I set the splitting bit, compare that count with
page_mapcount and BUG_ON if they don't match, and later repeat the
same comparison in the second rmap_walk that establishes the ptes and
downgrades the huge pmd to a regular pmd, and BUG_ON again if they don't
match the previous rmap_walk count. It may be possible to trigger the
BUG_ON with some malicious activity, but it won't be too easy either
because it's not an instant thing: a race still has to trigger and it's
hard to reproduce.
The anon_vma lock is quite a wide lock, as it's shared by all parent
anon_vma_chains too; slab allocation from the local cpu may actually be
faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying Hugh's
fix even for the long run. I wouldn't git revert the migration change,
but if we go with Hugh's fix it'd probably be safe.
Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually doing
some very nasty hacking on anon_vma, and one of the side effects is
breaking the root lock into pieces. But this area is pretty convoluted,
with many race conditions. I hope some day I will finally make
my patch work and have your precious review of it. :-)

Andrea Arcangeli
2011-10-31 16:34:22 UTC
Permalink
Hi Nai,
Post by Nai Xia
Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually doing
some very nasty hacking on anon_vma, and one of the side effects is
breaking the root lock into pieces. But this area is pretty convoluted,
with many race conditions. I hope some day I will finally make
my patch work and have your precious review of it. :-)
:) It's not going to be trivial; initially it was not a shared lock, but
it wasn't safe that way (especially with migrate requiring a reliable
rmap_walk), and using a shared lock across all same_anon_vma/same_vma
lists was the only way to be safe and solve the races.

Linus Torvalds
2011-10-16 22:37:46 UTC
Permalink
What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?

Linus
Post by Hugh Dickins
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                      migration_entry_wait+0x156/0x160
 [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
 [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
 [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
 [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
 [<ffffffff81106097>] ? vma_adjust+0x537/0x570
 [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
 [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
 [<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
---
 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)
--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
               unsigned long new_addr)
 {
       struct address_space *mapping = NULL;
+       struct anon_vma *anon_vma = vma->anon_vma;
       struct mm_struct *mm = vma->vm_mm;
       pte_t *old_pte, *new_pte, pte;
       spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
               mapping = vma->vm_file->f_mapping;
               mutex_lock(&mapping->i_mmap_mutex);
       }
+       if (anon_vma)
+               anon_vma_lock(anon_vma);
       /*
        * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
               spin_unlock(new_ptl);
       pte_unmap(new_pte - 1);
       pte_unmap_unlock(old_pte - 1, old_ptl);
+       if (anon_vma)
+               anon_vma_unlock(anon_vma);
       if (mapping)
               mutex_unlock(&mapping->i_mmap_mutex);
       mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Hugh Dickins
2011-10-17 03:02:16 UTC
Permalink
I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.

Hugh

On Sun, Oct 16, 2011 at 3:37 PM, Linus Torvalds
What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?
          Linus
Linus Torvalds
2011-10-17 03:09:12 UTC
Permalink
Post by Hugh Dickins
I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.
Ok, thanks. Just wanted to check.

Linus

Paweł Sikora
2011-10-18 19:17:27 UTC
Permalink
Post by Hugh Dickins
[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
Post by Paweł Sikora
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.
Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.
Post by Paweł Sikora
in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).
Post by Paweł Sikora
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
Post by Paweł Sikora
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] 00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807] 0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
---
mm/mremap.c | 5 +++++
1 file changed, 5 insertions(+)
--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c 2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
unsigned long new_addr)
{
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = vma->anon_vma;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
mapping = vma->vm_file->f_mapping;
mutex_lock(&mapping->i_mmap_mutex);
}
+ if (anon_vma)
+ anon_vma_lock(anon_vma);
/*
* We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+ if (anon_vma)
+ anon_vma_unlock(anon_vma);
if (mapping)
mutex_unlock(&mapping->i_mmap_mutex);
mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Hi,

1).
with this patch applied to the vanilla 3.0.6 kernel my Opterons have been working stably for ~4 days so far.
nice :)

2).
with this patch I can't reproduce the soft lockup described at https://lkml.org/lkml/2011/8/30/112
nice :)

3).
now I've started more tests with this patch + 3.0.4 + vserver 2.3.1 to check the possibly related locks
described on the vserver mailing list http://list.linux-vserver.org/archive?mss:5264:201108:odomikkjgoemcaomgidl
and in the lkml archive https://lkml.org/lkml/2011/5/23/398

1h uptime and still going...
Mel Gorman
2011-10-19 07:30:36 UTC
Permalink
<SNIP>
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
Considering that mremap workloads have been tested fairly heavily and
this hasn't triggered before (or at least not reported), I would not be
confident it can be easily reproduced. Maybe reproducing is easier if
interrupts are also high.
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
kernel BUG at include/linux/swapops.h:105!
This check is triggered if migration PTEs are left behind. In the few
cases I saw this during compaction development, it was because a VMA was
unreachable during remove_migration_pte. With the anon_vma changes, the
locking during VMA insertion is meant to protect against that, and the
order in which VMAs are linked is important so that the right anon_vma
lock is found.

I don't think it is an unreachable VMA problem because if it was, the
problem would trigger much more frequently and not be exclusive to
mremap.
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called. There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs and right now, I'm
not seeing how the VMA can get lost :(

I think a consequence of this patch is that migration and mremap are now
serialised by the anon_vma lock. As a result, it might still fix the
problem, if there is some race between mremap and migration, simply by
stopping them from playing with each other.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
The problem was that there was only one VMA for two page table
ranges. The neater fix was to create a second VMA but that required a
kmalloc and additional VMA work during exec which was considered too
heavy. VM_STACK_INCOMPLETE_SETUP is less clean but it is faster.
Mel Gorman
2011-10-21 12:44:15 UTC
Permalink
Post by Mel Gorman
Post by Paweł Sikora
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called.
I reread this today and realised I was sloppy with my writing. migration
holds the anon_vma lock while it unmaps the pages. It also holds the
anon_vma lock during remove_migration_ptes. For the migration operation,
a reference count is held on anon_vma but not the lock itself.
Post by Mel Gorman
There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs and right now,
And what is preventing them from disappearing is not the lock but the
reference count.
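
Roughly, the distinction being drawn here looks like this (an illustrative
sketch, not a quote of mm/migrate.c; helper names and flags are
simplified, and 'page'/'newpage' stand for the page being migrated and
its replacement):

	struct anon_vma *anon_vma = NULL;

	/* pin the anon_vma for the whole operation with a reference count */
	if (PageAnon(page)) {
		anon_vma = page_anon_vma(page);
		get_anon_vma(anon_vma);
	}

	/* the anon_vma *lock* is only taken inside the two rmap walks: */
	try_to_unmap(page, TTU_MIGRATION);	/* installs migration entries */

	/* ... copy contents to the new page, no anon_vma lock held ... */

	remove_migration_ptes(page, newpage);	/* points ptes at the new page */

	if (anon_vma)
		put_anon_vma(anon_vma);		/* drop the pin */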