Discussion:
What is wrong with dtrace's stack()?
Lev Serebryakov
2018-10-21 17:10:21 UTC
Permalink
Hello Freebsd-hackers,

I'm trying to profile strange if_gif and if_gre performance on hardware
without pmc. So, I'm sampling kernel stacks with simple dtrace script.

And I have a lot of stacks which show something like this:

kernel`ipsec_hdrsiz_inpcb+0xa1
kernel`soo_write+0x33
kernel`dofilewrite+0x79
kernel`sys_write+0xc3
kernel`amd64_syscall+0x332
kernel`0xffffffff8086c87d

Functions after soo_write could be different, but address is always the
same: soo_write+0x33.

But soo_write doesn't call all these functions, in first place! soo_write
looks like:

Dump of assembler code for function soo_write:
0xffffffff8060f930 <+0>: push %rbp
0xffffffff8060f931 <+1>: mov %rsp,%rbp
0xffffffff8060f934 <+4>: push %r15
0xffffffff8060f936 <+6>: push %r14
0xffffffff8060f938 <+8>: push %r12
0xffffffff8060f93a <+10>: push %rbx
0xffffffff8060f93b <+11>: sub $0x10,%rsp
0xffffffff8060f93f <+15>: mov %rsi,%r12
0xffffffff8060f942 <+18>: mov (%rdi),%rbx
0xffffffff8060f945 <+21>: mov 0x28(%r12),%rax
0xffffffff8060f94a <+26>: mov %rax,(%rsp)
0xffffffff8060f94e <+30>: xor %esi,%esi
0xffffffff8060f950 <+32>: xor %ecx,%ecx
0xffffffff8060f952 <+34>: xor %r8d,%r8d
....

Now I can not trust all these collected stacks. What do I do wrong?!

I have in my kernel config:

makeoptions DEBUG=-g
makeoptions WITH_CTF=1 # Run ctfconvert(1) for DTrace support
options KDTRACE_FRAME # Ensure frames are compiled in
options KDTRACE_HOOKS # Kernel DTrace hooks
--
Best regards,
Lev mailto:***@FreeBSD.org
Conrad Meyer
2018-10-21 17:21:43 UTC
Permalink
Your assembler dump offsets are in decimal. Look for offset 0x33 =
+51, not +33.

Conrad
Post by Lev Serebryakov
Hello Freebsd-hackers,
I'm trying to profile strange if_gif and if_gre performance on hardware
without pmc. So, I'm sampling kernel stacks with simple dtrace script.
kernel`ipsec_hdrsiz_inpcb+0xa1
kernel`soo_write+0x33
kernel`dofilewrite+0x79
kernel`sys_write+0xc3
kernel`amd64_syscall+0x332
kernel`0xffffffff8086c87d
Functions after soo_write could be different, but address is always the
same: soo_write+0x33.
But soo_write doesn't call all these functions, in first place! soo_write
0xffffffff8060f930 <+0>: push %rbp
0xffffffff8060f931 <+1>: mov %rsp,%rbp
0xffffffff8060f934 <+4>: push %r15
0xffffffff8060f936 <+6>: push %r14
0xffffffff8060f938 <+8>: push %r12
0xffffffff8060f93a <+10>: push %rbx
0xffffffff8060f93b <+11>: sub $0x10,%rsp
0xffffffff8060f93f <+15>: mov %rsi,%r12
0xffffffff8060f942 <+18>: mov (%rdi),%rbx
0xffffffff8060f945 <+21>: mov 0x28(%r12),%rax
0xffffffff8060f94a <+26>: mov %rax,(%rsp)
0xffffffff8060f94e <+30>: xor %esi,%esi
0xffffffff8060f950 <+32>: xor %ecx,%ecx
0xffffffff8060f952 <+34>: xor %r8d,%r8d
....
Now I can not trust all these collected stacks. What do I do wrong?!
makeoptions DEBUG=-g
makeoptions WITH_CTF=1 # Run ctfconvert(1) for DTrace support
options KDTRACE_FRAME # Ensure frames are compiled in
options KDTRACE_HOOKS # Kernel DTrace hooks
--
Best regards,
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
Lev Serebryakov
2018-10-21 17:26:52 UTC
Permalink
Hello Conrad,
Post by Conrad Meyer
Your assembler dump offsets are in decimal.
Ooops!
Post by Conrad Meyer
Look for offset 0x33 = +51, not +33.
It is call to sosend(), which could call many other things, but why it is
missed at stack output!?
--
Best regards,
Lev mailto:***@FreeBSD.org
Alan Somers
2018-10-21 21:22:25 UTC
Permalink
Post by Lev Serebryakov
Hello Conrad,
Post by Conrad Meyer
Your assembler dump offsets are in decimal.
Ooops!
Also, rather than inspecting the assembly, you can fire up kgdb and type
(kgdb) list *(soo_write+0x33)
And it will tell you the corresponding source line.
Post by Lev Serebryakov
Post by Conrad Meyer
Look for offset 0x33 = +51, not +33.
It is call to sosend(), which could call many other things, but why it is
missed at stack output!?
dtrace doesn't have visibility into functions that get automatically
inlined. So effectively it leaves out some stack frames. Also, dtrace
doesn't indicate when a function got called via a function pointer as
opposed to a direct call. ipsec_hdrsiz_input, for example, gets called
only via function pointers, referenced by the IPSEC_HDRSIZE macro. That,
in turn, is called by tcp_output. Finally, if the compiler uses a tail
call optimization, then dtrace won't see the tail caller's stack frame. In
your case, it looks like there are several layers that are invisible to
dtrace. The true stack probably looks something like this:

soo_write->sosend->sosend_generic (via pru_sosend function
pointer)->tcp_usr_send(via pru_send function pointer)->tcp_output(via
tfb_tcp_output function pointer)->IPSEC_HDRSIZE->ipsec_hdrsiz_inpcb (via
hdrsiz function pointer)

Hope that helps.
-Alan
Post by Lev Serebryakov
--
Best regards,
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
Lev Serebryakov
2018-10-21 21:37:34 UTC
Permalink
Hello Conrad,
Post by Conrad Meyer
Your assembler dump offsets are in decimal. Look for offset 0x33 =
+51, not +33.
Problem is, sosend() is not very interesting by itself, and looks like
several layers of stack are always lost.

I see a lot of stacks like this:

kernel`lock_delay+0x42
kernel`soo_write+0x33
kernel`dofilewrite+0x79
kernel`sys_write+0xc3
kernel`amd64_syscall+0x332
kernel`0xffffffff8086c87d

But event sosend() doesn't call lock_delay(), so it is impossible to
understand why do lock_delay() seen 41932 times in 60 seconds at top of the
stack. Where are all call stack?! All these functions could not be inlined,
as sosend() is located in other translation unit and it calls function by
pointer, this call could not be inlined too.
--
Best regards,
Lev mailto:***@FreeBSD.org
Alan Somers
2018-10-21 21:47:10 UTC
Permalink
Post by Lev Serebryakov
Hello Conrad,
Post by Conrad Meyer
Your assembler dump offsets are in decimal. Look for offset 0x33 =
+51, not +33.
Problem is, sosend() is not very interesting by itself, and looks like
several layers of stack are always lost.
kernel`lock_delay+0x42
kernel`soo_write+0x33
kernel`dofilewrite+0x79
kernel`sys_write+0xc3
kernel`amd64_syscall+0x332
kernel`0xffffffff8086c87d
But event sosend() doesn't call lock_delay(), so it is impossible to
understand why do lock_delay() seen 41932 times in 60 seconds at top of the
stack. Where are all call stack?! All these functions could not be inlined,
as sosend() is located in other translation unit and it calls function by
pointer, this call could not be inlined too.
If you're sure that the function isn't inlined, then it might be using the
tail-call optimization instead. That would also explain the missing stack
frames, too. If you can manually narrow the options down to a few possible
callers, then you could try adding a few SDT probes. That's what I usually
do in cases like this. Or, for static functions that are inlined, you can
remove the static keyword to (usually) prevent the inlining.
-Alan
Post by Lev Serebryakov
--
Best regards,
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
Lev Serebryakov
2018-10-22 11:45:19 UTC
Permalink
 kernel`lock_delay+0x42
 kernel`soo_write+0x33
 kernel`dofilewrite+0x79
 kernel`sys_write+0xc3
 kernel`amd64_syscall+0x332
 kernel`0xffffffff8086c87d
 But event sosend() doesn't call lock_delay(), so it is impossible to
understand why do lock_delay() seen 41932 times in 60 seconds at top of the
stack. Where are all call stack?! All these functions could not be inlined,
as sosend() is located in other translation unit and it calls function by
pointer, this call could not be inlined too.
If you're sure that the function isn't inlined, then it might be using
the tail-call optimization instead.  That would also explain the missing
stack frames, too.
I know about inlining at TCO, but it is not the case for sure. Problem
is, they can not be inlined it TCO'ed.

soo_wrtite() calls sosend():

error = sosend(so, 0, uio, 0, 0, 0, uio->uio_td);
if (error == EPIPE && (so->so_options & SO_NOSIGPIPE) == 0) {
PROC_LOCK(uio->uio_td->td_proc);
tdsignal(uio->uio_td, SIGPIPE);
PROC_UNLOCK(uio->uio_td->td_proc);
}

It could not be TCO'ed and sosend() is in another translation unit, so
it could not be inlined.

sosend() calls protocol-specific handler via function pointer, so it
should be true call:

CURVNET_SET(so->so_vnet);
if (!SOLISTENING(so))
error = so->so_proto->pr_usrreqs->pru_sosend(so, addr, uio,
top, control, flags, td);
else {
m_freem(top);
m_freem(control);
error = ENOTCONN;
}
CURVNET_RESTORE();
return (error);

These frames MUST be here...
If you can manually narrow the options down to a few
possible callers, then you could try adding a few SDT probes.  That's
what I usually do in cases like this.  Or, for static functions that are
inlined, you can remove the static keyword to (usually) prevent the
inlining.
Narrowing all paths, which leads to very generic lock_delay()... Uh-oh.
--
// Lev Serebryakov
Slawa Olhovchenkov
2018-10-22 13:29:55 UTC
Permalink
Post by Lev Serebryakov
sosend() calls protocol-specific handler via function pointer, so it
CURVNET_SET(so->so_vnet);
if (!SOLISTENING(so))
error = so->so_proto->pr_usrreqs->pru_sosend(so, addr, uio,
top, control, flags, td);
else {
m_freem(top);
m_freem(control);
error = ENOTCONN;
}
CURVNET_RESTORE();
return (error);
These frames MUST be here...
(kgdb) x/80i sosend
0xffffffff8054d6b0 <sosend>: push %rbp
0xffffffff8054d6b1 <sosend+1>: mov %rsp,%rbp
0xffffffff8054d6b4 <sosend+4>: mov 0x20(%rdi),%rax
0xffffffff8054d6b8 <sosend+8>: mov 0x58(%rax),%rax
0xffffffff8054d6bc <sosend+12>: mov 0x98(%rax),%rax
0xffffffff8054d6c3 <sosend+19>: pop %rbp
0xffffffff8054d6c4 <sosend+20>: jmpq *%rax
Lev Serebryakov
2018-10-22 13:42:14 UTC
Permalink
Post by Lev Serebryakov
CURVNET_SET(so->so_vnet);
if (!SOLISTENING(so))
error = so->so_proto->pr_usrreqs->pru_sosend(so, addr, uio,
top, control, flags, td);
else {
m_freem(top);
m_freem(control);
error = ENOTCONN;
}
CURVNET_RESTORE();
Oh, I'm idiot, it is eligible for TCO, if VNET is not compiled-in.
--
// Lev Serebryakov
Ryan Stone
2018-10-22 20:50:29 UTC
Permalink
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
Post by Lev Serebryakov
Post by Lev Serebryakov
CURVNET_SET(so->so_vnet);
if (!SOLISTENING(so))
error = so->so_proto->pr_usrreqs->pru_sosend(so, addr, uio,
top, control, flags, td);
else {
m_freem(top);
m_freem(control);
error = ENOTCONN;
}
CURVNET_RESTORE();
Oh, I'm idiot, it is eligible for TCO, if VNET is not compiled-in.
--
// Lev Serebryakov
Lev Serebryakov
2018-10-23 13:26:41 UTC
Permalink
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
How could I add it for kernel properly?! Adding
"CFLAGS=-fno-optimize-sibling-calls" to make.conf affects modules build,
but not kernel itself (I'm surprised, to be honest).
--
// Lev Serebryakov
Yuri Pankov
2018-10-23 13:48:15 UTC
Permalink
Post by Lev Serebryakov
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
How could I add it for kernel properly?! Adding
"CFLAGS=-fno-optimize-sibling-calls" to make.conf affects modules build,
but not kernel itself (I'm surprised, to be honest).
Try using:

COPTFLAGS+= -fno-optimize-sibling-calls
Lev Serebryakov
2018-10-23 13:50:11 UTC
Permalink
Post by Yuri Pankov
Post by Lev Serebryakov
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
How could I add it for kernel properly?! Adding
"CFLAGS=-fno-optimize-sibling-calls" to make.conf affects modules build,
but not kernel itself (I'm surprised, to be honest).
COPTFLAGS+= -fno-optimize-sibling-calls
Yep, thnx

I should better read documentation.
--
// Lev Serebryakov
Lev Serebryakov
2018-10-23 21:34:12 UTC
Permalink
Hello Ryan,
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
Stacks are better but still strange. For example I have such stack (which I
like better than previous):

kernel`lock_delay+0x72
kernel`sosend_generic+0x112c
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cebf6d

According to addr2line `sosend_generic+0x112c' is

https://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?revision=339419&view=markup#l1579

Which is call to protocol-specific send function. Where is this function
(it should be tcp for sure)?!
--
Best regards,
Lev mailto:***@FreeBSD.org
Ryan Stone
2018-10-24 15:48:40 UTC
Permalink
sosend_generic+0x112c is the return address, so it's one instruction
after the call instruction that called lock_delay. What's the line
number of sosend_generic+0x112b?
Post by Lev Serebryakov
Hello Ryan,
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
Stacks are better but still strange. For example I have such stack (which I
kernel`lock_delay+0x72
kernel`sosend_generic+0x112c
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cebf6d
According to addr2line `sosend_generic+0x112c' is
https://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?revision=339419&view=markup#l1579
Which is call to protocol-specific send function. Where is this function
(it should be tcp for sure)?!
--
Best regards,
Lev Serebryakov
2018-10-24 16:10:41 UTC
Permalink
Post by Ryan Stone
sosend_generic+0x112c is the return address, so it's one instruction
after the call instruction that called lock_delay.
Yes, but this instruction (+0x112b) calls real protocol send function
via function pointer.
Post by Ryan Stone
What's the line number of sosend_generic+0x112b?
It is here, uipc_socket.c:1589:

error = (*so->so_proto->pr_usrreqs->pru_send)(so,
(flags & MSG_OOB) ? PRUS_OOB :
/*
* If the user set MSG_EOF, the protocol understands
* this flag and nothing left to send then use
* PRU_SEND_EOF instead of PRU_SEND.
*/
((flags & MSG_EOF) &&
(so->so_proto->pr_flags & PR_IMPLOPCL) &&
(resid <= 0)) ?
PRUS_EOF :
/* If there is more to send set PRUS_MORETOCOME. */
(flags & MSG_MORETOCOME) ||
(resid > 0 && space > 0) ? PRUS_MORETOCOME : 0,
top, addr, control, td);

I have these three DIFFERENT stack, which should be one for sure, as it
is confirmed by custom SDT probes now (I've added them and checked).
Addresses are different from previous examples due to some code was
shifted by SDT probes. 26462 times stack is Ok, but 4068+3993 times some
frames are lost. Same address could not call both tcp_usr_send() and
ia32_pause() and lock_delay(). Really, it should be one stack trace!

kernel`lock_delay+0x72
kernel`sosend_generic+0xf61
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cedfbd
3993

kernel`ia32_pause+0x7
kernel`sosend_generic+0xf61
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cedfbd
4068

kernel`ia32_pause+0x6
kernel`tcp_usr_send+0x131
kernel`sosend_generic+0xf61
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cedfbd
26462
Post by Ryan Stone
Post by Lev Serebryakov
Hello Ryan,
Post by Ryan Stone
Adding -fno-optimize-sibling-calls to the compiler flags will eliminate the TCO.
Stacks are better but still strange. For example I have such stack (which I
kernel`lock_delay+0x72
kernel`sosend_generic+0x112c
kernel`sosend+0x79
kernel`soo_write+0x6b
kernel`fo_write+0x4a
kernel`dofilewrite+0xcd
kernel`kern_writev+0x79
kernel`sys_write+0x8f
kernel`syscallenter+0x774
kernel`amd64_syscall+0x1b
kernel`0xffffffff80cebf6d
According to addr2line `sosend_generic+0x112c' is
https://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?revision=339419&view=markup#l1579
Which is call to protocol-specific send function. Where is this function
(it should be tcp for sure)?!
--
Best regards,
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
--
// Lev Serebryakov
Ryan Stone
2018-10-24 17:14:10 UTC
Permalink
ia32_pause() is an inline function. How does dtrace map instruction
pointers to symbol names? Is it getting that mapping from some CTF
data, and is that CTF data aware of inline functions?

If so, I'd argue that behaviour is counter-intuitive and unhelpful, as
this example here shows.
Mark Johnston
2018-10-24 17:23:02 UTC
Permalink
Post by Ryan Stone
ia32_pause() is an inline function. How does dtrace map instruction
pointers to symbol names?
It uses a kernel linker API whose implementation searches the symbol
tables of the kernel and loaded KLDs.
Post by Ryan Stone
Is it getting that mapping from some CTF
data, and is that CTF data aware of inline functions?
No, and no. CTF does include a function table, but that's just for
encoding function parameter and return types. Functions that get
inlined into every caller (as I'd expect with ia32_pause()) shouldn't be
showing up in stack traces. There is no ia32_pause() symbol in my
kernel; presumably the observed behaviour is related to some non-default
compiler flags being used, but I haven't carefully read through the
whole thread to see.
Post by Ryan Stone
If so, I'd argue that behaviour is counter-intuitive and unhelpful, as
this example here shows.
Lev Serebryakov
2018-10-24 17:26:21 UTC
Permalink
Post by Ryan Stone
ia32_pause() is an inline function. How does dtrace map instruction
pointers to symbol names? Is it getting that mapping from some CTF
data, and is that CTF data aware of inline functions?
It looks like plausable explanation, but why sometimes it pick-up
"middle layer" like this:

kernel`ia32_pause+0x6
kernel`tcp_usr_send+0x131
kernel`sosend_generic+0xf61

And sometimes doesn't:

kernel`ia32_pause+0x7
kernel`sosend_generic+0xf61

As I said, these stacks are equivalent, I've proved it (ok, it is not
mathematical proof, of course) with custom SDT probes around this
"`tcp_usr_send+0x131" place.
--
// Lev Serebryakov
Loading...