Discussion:
Dealing with the NMI mess
(too old to reply)
Andy Lutomirski
2015-07-23 20:30:02 UTC
Permalink
[moved to a new thread, cc list trimmed]

Hi all-

We've considered two approaches to dealing with NMIs:

1. Allow nesting. We know quite well how messy that is.

2. Forbid IRET inside NMIs. Doable but maybe not that pretty.

We haven't considered:

3. Forbid faults (other than MCE) inside NMI.

Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).

What if we went all out and forbade page faults in NMI as well. There
are two reasons that I can think of that we might page fault inside an
NMI:

a) vmalloc fault. I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.

b) User memory access faults.

The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space. What if we did it differently? Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode. That would let us capture user state
*correctly*, which we currently can't really do. There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works. In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state. (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)

To make this work, we'd have to teach NMI-from-userspace to call the
callback itself. It would look like:

prepare_exit_to_usermode() {
...
while (blah blah blah) {
if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
perf_capture_user_state();
...
}
...
}

and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunsitic sysret on exit
from NMI. (Why not? Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)

There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set. In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.

Peter, can this be done without breaking the perf ABI? If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.


If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best. Maybe we should go one step
farther and ban all DR7 breakpoints period. Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 20:40:02 UTC
Permalink
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.

#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.

The point of RF is to make forward progress in the face of debug
register faults, but I don't see what was wrong with the whole
"disable any debug events that happen with interrupts disabled".

And no, I do *not* believe that we should disable debug faults ahead
of time. We should take them, disable them, and return with 'ret'. No
complex "you can't put breakpoints in this region" crap, no magic
rules, no subtle issues.

I really think your "disallow #DB" is pointless. I think your "prevent
instruction breakpoints in NMI" is wrong. Let them happen. Take them
and disable them. Return with RT clear. Go on with your life.

And the "take them and disable them" is really simple. No "am I in an
NMI contect" thing (because that leads to the whole question about
"what is NMI context"). That's not the real rule anyway.

No, make it very simple and straightforward. Make the test be "uhhuh,
I got a #DB in kernel mode, and interrupts were disabled - I know I'm
going to return with "ret", so I'm just going to have to disable this
breakpoint".

Nothing clever. Nothing subtle. Nothing that needs "this range of
instructions is magical". No. Just a very simple rule: if the context
we return to is kernel mode and interrupts are disabled, we're using
'ret', so we cannot suppress debug faults.

Did I miss something? There were a lot of emails flying around, but I
*thought* I saw them all..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 20:50:02 UTC
Permalink
On Thu, Jul 23, 2015 at 1:38 PM, Linus Torvalds
Post by Linus Torvalds
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.
#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.
The point of RF is to make forward progress in the face of debug
register faults, but I don't see what was wrong with the whole
"disable any debug events that happen with interrupts disabled".
And no, I do *not* believe that we should disable debug faults ahead
of time. We should take them, disable them, and return with 'ret'. No
complex "you can't put breakpoints in this region" crap, no magic
rules, no subtle issues.
I really think your "disallow #DB" is pointless. I think your "prevent
instruction breakpoints in NMI" is wrong. Let them happen. Take them
and disable them. Return with RT clear. Go on with your life.
And the "take them and disable them" is really simple. No "am I in an
NMI contect" thing (because that leads to the whole question about
"what is NMI context"). That's not the real rule anyway.
No, make it very simple and straightforward. Make the test be "uhhuh,
I got a #DB in kernel mode, and interrupts were disabled - I know I'm
going to return with "ret", so I'm just going to have to disable this
breakpoint".
Nothing clever. Nothing subtle. Nothing that needs "this range of
instructions is magical". No. Just a very simple rule: if the context
we return to is kernel mode and interrupts are disabled, we're using
'ret', so we cannot suppress debug faults.
There are some subtleties in here.

Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.

Issue B: single-step exception after SYSENTER. The patches I just
sent fix that, though.

Issue C: #DB with invalid stack pointer (can happen due to watchpoints
during SYSCALL entry or SYSRET exit). I guess we need to ban such
watchpoints.

Issue D: debug exception inside EFI (especially mixed-mode EFI). We
can't return using RET, so we need to catch that case.

These issues mostly go away if we preemptively disarm DR7 early in NMI
processing and rearm it at the end.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 21:10:02 UTC
Permalink
Post by Andy Lutomirski
Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.
We don't re-arm it.

We can entertain the notion *eventually* to do something clever, but
for now, just say: stability and simplicity is more important.

People can use tracepoints in interrupts-off code (they get rewritten
with 'int3', that's fine), but not instruction breakpoints.
Post by Andy Lutomirski
Issue C: #DB with invalid stack pointer (can happen due to watchpoints
during SYSCALL entry or SYSRET exit). I guess we need to ban such
watchpoints.
.. but this is unrelated, to NMI, just "syscall is a nasty interface".
Don't we already ban them?
Post by Andy Lutomirski
Issue D: debug exception inside EFI (especially mixed-mode EFI). We
can't return using RET, so we need to catch that case.
If NMI code calls EFI code, then it's broken.
Post by Andy Lutomirski
These issues mostly go away if we preemptively disarm DR7 early in NMI
processing and rearm it at the end.
I'm not *violently* opposed to that, but it's just a band-aid. It
doesn't *fix* anything. You aren't protecting against random DB
exceptions just because somebody put a data breakpoint on the NMI
stack, for example. You still get page faults. Etc etc.

So I thinkt he whole "use ret instead" is a pretty simple approach.
Make that "just work".

Then, if you want to play with dr7 inside NMI to make it more likely
that you can have breakpoints live in irq-off situation, I think
that's a magic special case. It shouldn't be part of the design.
Things should work without it.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-23 21:40:03 UTC
Permalink
On Thu, 23 Jul 2015 14:08:59 -0700
Post by Linus Torvalds
Post by Andy Lutomirski
Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.
We don't re-arm it.
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?

Nothing should be adding hw breakpoints to NMI code anyway. Sounds
perfectly reasonable to me. Of course, how we tell we are in NMI
brings back all the races as we had in the nesting code. We can check
the per-cpu variable that is set with nmi_enter() and cleared at
nmi_exit() but what happens if the breakpoint is outside those calls.
We can check the stack pointer, but then we are back to userspace
fooling us. Maybe add the DF trick again?

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 21:50:02 UTC
Permalink
Post by Steven Rostedt
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?
No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".

Because that's the test we'd use for "use ret instead of iret".

And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.

Make it simple and completely unambiguous. The rule really would be:

- if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"

Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.

- but because we use "ret" rather than "iret" we can't get RF
semantics, it means that #DB is special. RF is supposed to make us
make forward progress

So for that reason, #DB just says "if the breakpoint happened
during that interrupts-ff reghion, I will clear %dr7 to guarantee
forward progress"

So those would be the two main rules. Very simple, and avoiding all nasty cases.

Now, I'd be willing to then hide the "oops, we clear dr7 very
agrressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.

So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 22:00:02 UTC
Permalink
On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds
Post by Linus Torvalds
Post by Steven Rostedt
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?
No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".
Because that's the test we'd use for "use ret instead of iret".
And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.
- if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"
Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.
- but because we use "ret" rather than "iret" we can't get RF
semantics, it means that #DB is special. RF is supposed to make us
make forward progress
So for that reason, #DB just says "if the breakpoint happened
during that interrupts-ff reghion, I will clear %dr7 to guarantee
forward progress"
What if we relax it slightly: "if the breakpoint happened during that
interrupts-off region, I will clear all *kernel breakpoints* in %dr7
to guarantee forward progress"?

Watchpoints don't need RF to make forward progress, and, by leaving
watchpoints alone, we avoid breaking gdb.
Post by Linus Torvalds
So those would be the two main rules. Very simple, and avoiding all nasty cases.
Now, I'd be willing to then hide the "oops, we clear dr7 very
agrressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.
So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.
Once we limit it to instruction breakpoints, I don't think re-enabling
before returning to userspace matters.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 22:10:02 UTC
Permalink
Post by Andy Lutomirski
What if we relax it slightly: "if the breakpoint happened during that
interrupts-off region, I will clear all *kernel breakpoints* in %dr7
to guarantee forward progress"?
Watchpoints don't need RF to make forward progress, and, by leaving
watchpoints alone, we avoid breaking gdb.
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.

You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.

And that makes it really easy.

So yes, I agree. We only need to clear all kernel breakpoints.

So we don't even need that _TIF_USER_WORK_MASK thing, because user
space isn't setting kernel code breakpoints, it's just kgdb.

Sounds good to me.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 08:20:01 UTC
Permalink
Post by Linus Torvalds
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
But but but, we can access userspace with !IF, imagine someone doing:

local_irq_disable();
copy_from_user_inatomic();

and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 09:10:01 UTC
Permalink
Post by Peter Zijlstra
Post by Linus Torvalds
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
local_irq_disable();
copy_from_user_inatomic();
and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.
Then shouldn't we use !IF && RSP matches NMI's stack ?
User-space cannot control the two at once.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 12:00:04 UTC
Permalink
On Fri, 24 Jul 2015 10:13:26 +0200
Post by Peter Zijlstra
Post by Linus Torvalds
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
local_irq_disable();
copy_from_user_inatomic();
and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.
Where does the kernel do that to user text? I would think that user
data would only have watchpoints, and Andy and Linus said that those
would not be disabled (I'm guessing because they don't have the RF flag
set, and forward progress can proceed). If the kernel does the above to
user code and there's a breakpoint there, would it even trigger?

I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.

Or am I totally off base here?

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 12:50:01 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 10:13:26 +0200
Post by Peter Zijlstra
Post by Linus Torvalds
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
local_irq_disable();
copy_from_user_inatomic();
and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.
Where does the kernel do that to user text? I would think that user
data would only have watchpoints, and Andy and Linus said that those
would not be disabled (I'm guessing because they don't have the RF flag
set, and forward progress can proceed). If the kernel does the above to
user code and there's a breakpoint there, would it even trigger?
I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.
These things can be: RW, W, X.

Sure, hitting a user X watchpoint is going to be 'interesting', but its
fairly easy to hit a RW one.

Just watch an on-stack variable and get perf to copy a huge chunk of
stack (like it does for the dwarf stuff).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 13:10:02 UTC
Permalink
On Fri, 24 Jul 2015 14:43:04 +0200
Post by Peter Zijlstra
Post by Steven Rostedt
I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.
These things can be: RW, W, X.
Sure, hitting a user X watchpoint is going to be 'interesting', but its
fairly easy to hit a RW one.
But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?

-- Steve
Post by Peter Zijlstra
Just watch an on-stack variable and get perf to copy a huge chunk of
stack (like it does for the dwarf stuff).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 13:30:03 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 14:43:04 +0200
Post by Peter Zijlstra
Post by Steven Rostedt
I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.
These things can be: RW, W, X.
Sure, hitting a user X watchpoint is going to be 'interesting', but its
fairly easy to hit a RW one.
But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?
My understanding is that by using RET we can't set the RF flag and #DB
will immediately strike again when the operation is attempted again. Thus
we have to completely disable the breakpoints on leaving after the first
one strikes, resulting in some userland breakpoints being missed. Maybe
it can be accepted as a limitation when perf is running. I don't know if
the output of perf is that relevant when a debugger is present BTW.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 13:40:02 UTC
Permalink
Post by Willy Tarreau
Post by Steven Rostedt
On Fri, 24 Jul 2015 14:43:04 +0200
Post by Peter Zijlstra
Post by Steven Rostedt
I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.
These things can be: RW, W, X.
Sure, hitting a user X watchpoint is going to be 'interesting', but its
fairly easy to hit a RW one.
But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?
Look at do_debug(), it has lovely bits like:

preempt_conditional_sti();

in it, we do _NOT_ want to be re-enabling interrupts if we're called
from an !IF context, that'd be _bad_.
Post by Willy Tarreau
My understanding is that by using RET we can't set the RF flag and #DB
will immediately strike again when the operation is attempted again. Thus
we have to completely disable the breakpoints on leaving after the first
one strikes, resulting in some userland breakpoints being missed. Maybe
it can be accepted as a limitation when perf is running. I don't know if
the output of perf is that relevant when a debugger is present BTW.
The patch I posted will re-enable the breakpoints before returning to
userspace. So userspace will only 'miss' events generated by the kernel.

Missing reads from the kernel is not a problem -- and maybe even
expected, but certainly unavoidable.

Missing updates from the kernel might be a problem, you'd get a variable
change content even though you have a W watchpoint on it, that'd be
surprising.

Then again, I suppose we can argue the variable changed through another
mapping and watchpoints work on the virtual address, so tough cookies or
somesuch -- the kernel could in fact do this on highmem kernel anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 13:40:02 UTC
Permalink
Post by Peter Zijlstra
Post by Steven Rostedt
But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?
preempt_conditional_sti();
in it, we do _NOT_ want to be re-enabling interrupts if we're called
from an !IF context, that'd be _bad_.
Ah, I forgot the conditional thing was the STI depending on regs->flags
& IF..

In any case, better safe than sorry and simply not do #DB ever if !IF.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 14:40:02 UTC
Permalink
On Fri, 24 Jul 2015 15:21:28 +0200
Post by Willy Tarreau
My understanding is that by using RET we can't set the RF flag and #DB
But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.

-- Steve
Post by Willy Tarreau
will immediately strike again when the operation is attempted again. Thus
we have to completely disable the breakpoints on leaving after the first
one strikes, resulting in some userland breakpoints being missed. Maybe
it can be accepted as a limitation when perf is running. I don't know if
the output of perf is that relevant when a debugger is present BTW.
Willy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 15:00:01 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 15:21:28 +0200
Post by Willy Tarreau
My understanding is that by using RET we can't set the RF flag and #DB
But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.
True but these also are the most complicated to deal with. The data
accesses can always be emulated (not what I'm suggesting here) while
instructions are much harder to emulate.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 15:20:02 UTC
Permalink
On Fri, 24 Jul 2015 16:59:01 +0200
Post by Willy Tarreau
Post by Steven Rostedt
On Fri, 24 Jul 2015 15:21:28 +0200
Post by Willy Tarreau
My understanding is that by using RET we can't set the RF flag and #DB
But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.
True but these also are the most complicated to deal with. The data
accesses can always be emulated (not what I'm suggesting here) while
instructions are much harder to emulate.
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET. What
emulation is needed?

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 15:30:02 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 16:59:01 +0200
Post by Willy Tarreau
Post by Steven Rostedt
On Fri, 24 Jul 2015 15:21:28 +0200
Post by Willy Tarreau
My understanding is that by using RET we can't set the RF flag and #DB
But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.
True but these also are the most complicated to deal with. The data
accesses can always be emulated (not what I'm suggesting here) while
instructions are much harder to emulate.
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
Post by Steven Rostedt
What emulation is needed?
I was speaking about redoing the operation with BP disabled before
re-enabling it. But that's not the point here anyway.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 15:40:02 UTC
Permalink
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
http://marc.info/?l=linux-kernel&m=143773601130974
We re-enable before going back to userspace.
Ah OK thanks Peter. I'm sorry if I'm adding more noise than
anything here, it's hard to follow and it becomes a bit complex.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 15:40:04 UTC
Permalink
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
http://marc.info/?l=linux-kernel&m=143773601130974

We re-enable before going back to userspace.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-24 18:30:02 UTC
Permalink
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
http://marc.info/?l=linux-kernel&m=143773601130974
We re-enable before going back to userspace.
Actually, Andy had a good argument that we don't even need this.

We just don't ever need to disable data breakpoints. Even if we end up doing

cli();
copy_from_user_inatomic();

that actually works fine. If there are data breakpoints, we will have

(a) things will run slow as hell anyway. Intel CPU's slow down to a
relative crawl.

(b) let's say we have a data breakpoint on the data we're reading above

(c) we take a #DB fault after the instruction has completed, so we
make forward progress even if we return with RF clear

(d) even if the data breakpoint is unaligned and triggers multiple
times, it's going to be a "small number" of multiple times. And see
(a). This never happens in practice, and the much bigger slowdown is
how data breakpoints tend to slow things down in general.

(e) yes, the string instructions may hit the data breakpoint multilpe
times for the "same" instruction, but the forward progress part is
still true even for the string instructions. In fact, it's actually
likely <i>more</i> true for string instructions, because they are
documented to be less exact, and may trigger the data watchpoint only
after a "group of iterations".

so I think we just leave data breakpoint alone. The only debug
conditions that are *faults* rather than traps are the instruction
breakpoints, and we can detect and disable those by just saying "oh,
we're in kernel mode".

So in the #DB handler, we would basically only clear instruction
breakpoints, and only when they trigger. If we have a data breakpoint
that triggers (even in kernel mode, and with interrupts disabled), let
it trigger and return with "ret" anyway. No biggie.

(Ok, so the "General detect fault" is also a fault rather than a trap,
but that's the "write to debug registers when it's disabled" thing,
very different)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-24 18:50:02 UTC
Permalink
On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
Post by Linus Torvalds
So in the #DB handler, we would basically only clear instruction
breakpoints, and only when they trigger. If we have a data breakpoint
that triggers (even in kernel mode, and with interrupts disabled), let
it trigger and return with "ret" anyway. No biggie.
So we'd not only look at "which breakpoint triggered", we'd also look
at the actual debug register and check that "R/Wn == 0", and only
disable it for that case.

So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.

Something like

unsigned long mask = 0;
unsigned int dr6 = debug_read(6);
unsigned int dr7 = debug_read(7)
int i;

for (i = 0; i < 4; i++) {
if ((dr6 >> i) & 1) {
if (!((dr7 >> (4*i+16)) & 3))
mask |= 3 << (i*2);
}
}

if (mask)
debug_write(dr7 & ~mask, 7);

(yeah, I could easily have screwed that up)

But the above should only clear bits in dr7 that are actually
associated with the instruction breakpoint that triggered, and since
it's a _kernel_ instruction breakpoint, not a user one, we can clear
it and forget it. No need to re-enable at all.

Hmm?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 19:10:03 UTC
Permalink
On Fri, 24 Jul 2015 11:41:55 -0700
Post by Linus Torvalds
On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
Post by Linus Torvalds
So in the #DB handler, we would basically only clear instruction
breakpoints, and only when they trigger. If we have a data breakpoint
that triggers (even in kernel mode, and with interrupts disabled), let
it trigger and return with "ret" anyway. No biggie.
So we'd not only look at "which breakpoint triggered", we'd also look
at the actual debug register and check that "R/Wn == 0", and only
disable it for that case.
So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.
Something like
unsigned long mask = 0;
unsigned int dr6 = debug_read(6);
unsigned int dr7 = debug_read(7)
int i;
for (i = 0; i < 4; i++) {
if ((dr6 >> i) & 1) {
if (!((dr7 >> (4*i+16)) & 3))
mask |= 3 << (i*2);
}
}
if (mask)
debug_write(dr7 & ~mask, 7);
Macros would be nice for readability.

for (i = 0; i < 4; i++) {
if ((dr6 >> i) & 1) {
int shift = DR_CONTROL_SIZE * i + DR_CONTROL_SHIFT;
if (!((dr7 >> shift) & DR_RW_READ))
mask |= (DR_LOCAL_ENABLE|DR_GLOBAL_ENABLE) << (i * DR_ENABLE_SIZE);
}
}

-- Steve
Post by Linus Torvalds
(yeah, I could easily have screwed that up)
But the above should only clear bits in dr7 that are actually
associated with the instruction breakpoint that triggered, and since
it's a _kernel_ instruction breakpoint, not a user one, we can clear
it and forget it. No need to re-enable at all.
Hmm?
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 20:00:03 UTC
Permalink
Post by Linus Torvalds
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
http://marc.info/?l=linux-kernel&m=143773601130974
We re-enable before going back to userspace.
Actually, Andy had a good argument that we don't even need this.
We just don't ever need to disable data breakpoints. Even if we end up doing
cli();
copy_from_user_inatomic();
that actually works fine. If there are data breakpoints, we will have
I worry that we'll end up running the do_debug() handlers from effective
NMI context.

The NMI might have preempted locks which these handlers require etc..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-24 20:30:02 UTC
Permalink
Post by Peter Zijlstra
I worry that we'll end up running the do_debug() handlers from effective
NMI context.
The NMI might have preempted locks which these handlers require etc..
If #DB takes any locks like that, then #DB is broken.

Pretty much by definition, a data breakpoint can happen on pretty much
absolutely any code. This is in no way NMI-specific as far as I can
tell.

Do we really take locks in the #DB handler?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 21:00:02 UTC
Permalink
Post by Linus Torvalds
Post by Peter Zijlstra
I worry that we'll end up running the do_debug() handlers from effective
NMI context.
The NMI might have preempted locks which these handlers require etc..
If #DB takes any locks like that, then #DB is broken.
Pretty much by definition, a data breakpoint can happen on pretty much
absolutely any code. This is in no way NMI-specific as far as I can
tell.
Do we really take locks in the #DB handler?
do_debug()
send_sigtrap()
force_sig_info()
spin_lock_irqsave()

Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-24 21:10:01 UTC
Permalink
Post by Peter Zijlstra
Post by Linus Torvalds
Post by Peter Zijlstra
I worry that we'll end up running the do_debug() handlers from effective
NMI context.
The NMI might have preempted locks which these handlers require etc..
If #DB takes any locks like that, then #DB is broken.
Pretty much by definition, a data breakpoint can happen on pretty much
absolutely any code. This is in no way NMI-specific as far as I can
tell.
Do we really take locks in the #DB handler?
do_debug()
send_sigtrap()
force_sig_info()
spin_lock_irqsave()
Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.
Let's try to decode it.

user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.

user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.

In any event, user_icebp is only set if user_mode(regs), so it's safe
locking-wise. But please let's delete it.

Otherwise, we do send_sigtrap if we got a single-step exception from
user mode (because we suppress single-step exceptions from kernel mode
a couple lines above, but we should really BUG on those except for the
single case of SYSENTER with TF set) or if we get a breakpoint
exception that wasn't eaten by perf.

For *#&!'s sake. we should rewrite this pile of crap.

// before kprobes and notify_die
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
if (!user_mode(regs) && regs->ip == sysenter_target) {
fix it up and return;
}

notify_die, etc.

preempt_conditional_sti(regs);
do_trap(X86_TRAP_DB, SIGTRAP, "debug", regs, error_code, NULL);
preempt_conditional_cli(regs);

except we should do something to disallow fixup_exception here. Or we
could open-code if(user_mode) send_sigtrap() else die() here.

I really don't think that we should be sending signals to userspace
due to user address watchpoints that hit in kernel mode. Or, if we do
think we should send signals for those, then, as Steven said, we
should make that explicit and use IRQ work for that.

As it stands, this is probably an exploitable DoS -- just point a
watchpoint down the stack a little bit from yourself and call raise().

--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-07-30 15:50:01 UTC
Permalink
Post by Andy Lutomirski
user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.
user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.
You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.

The historical name is ICEBP because in-circuit emulators used it for
software breakpoints, just like your usual debugger used 0xcc aka int3.
And just like 0xcc it's unprivileged, so you can actually get a SIGTRAP
with asm(".byte 0xf1").

So...
Post by Andy Lutomirski
In any event, user_icebp is only set if user_mode(regs), so it's safe
locking-wise. But please let's delete it.
... it's safe, but it has some use (!).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-30 21:30:04 UTC
Permalink
Post by Paolo Bonzini
Post by Andy Lutomirski
user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.
user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.
You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.

Thanks, CPU vendors.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Brian Gerst
2015-07-30 22:00:04 UTC
Permalink
Post by Andy Lutomirski
Post by Paolo Bonzini
Post by Andy Lutomirski
user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.
user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.
You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.
Thanks, CPU vendors.
--Andy
Some Windows programs (running in Wine) use this opcode for
anti-debugging code. See commit
a1e80fafc9f0742a1776a0490258cb64912411b0.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Thomas Gleixner
2015-07-30 23:10:03 UTC
Permalink
Post by Andy Lutomirski
Post by Paolo Bonzini
Post by Andy Lutomirski
user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.
user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.
You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.
The only SDM reference I found is:

"The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64
and IA-32 architectures. These opcodes, even though undefined, do
not generate an invalid opcode exception."

D6 is actually something useful:

if (carry flag set)
AL = FF
else
AL = 0

It's been there since i386. It has been conveniant for return code
magic from ASM to C. I haven't thought of it for at least a decade :)

So all we need to worry about is F1, but thats bad enough :(

Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Borislav Petkov
2015-07-31 04:30:01 UTC
Permalink
Post by Andy Lutomirski
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.
Thanks, CPU vendors.
Here's something better:

http://www.rcollins.org/secrets/opcodes/ICEBP.html
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-31 05:20:02 UTC
Permalink
Post by Borislav Petkov
Post by Andy Lutomirski
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.
Thanks, CPU vendors.
http://www.rcollins.org/secrets/opcodes/ICEBP.html
This instruction is awesome. Binutils can disassemble it (it's called
"icebp") but it can't assemble it. KVM has special handling for it on
VMX and actually reports it to QEMU on SVM (complete with a defined
ABI). We have an asm macro so we can assemble it for 32-bit but not
64-bit, despite the fact that it works on 64-bit.

The kernel instruction decoder can't decode it.

Fortunately, it looks like the vm86 case is correct (or as correct as
any of the vm86 junk can be), although I haven't tested it. I bet
that icebp is like int3 in that it punches through vm86 mode instead
of sending #GP.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-07-31 08:00:01 UTC
Permalink
Post by Andy Lutomirski
This instruction is awesome. Binutils can disassemble it (it's called
"icebp") but it can't assemble it. KVM has special handling for it on
VMX and actually reports it to QEMU on SVM (complete with a defined
ABI).
FWIW it's not reported to QEMU, it's only reported to a nested
hypervisor. So the ABI is simply the SVM spec.

It's not surprising that VMX support was provided by the Wine guys...

Paolo
Post by Andy Lutomirski
We have an asm macro so we can assemble it for 32-bit but not
64-bit, despite the fact that it works on 64-bit.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Borislav Petkov
2015-07-31 08:10:01 UTC
Permalink
Post by Andy Lutomirski
This instruction is awesome. Binutils can disassemble it (it's called
"icebp") but it can't assemble it. KVM has special handling for it on
VMX and actually reports it to QEMU on SVM (complete with a defined
ABI).
Fun.
Post by Andy Lutomirski
We have an asm macro so we can assemble it for 32-bit but not
64-bit, despite the fact that it works on 64-bit.
The kernel instruction decoder can't decode it.
Yeah, the kernel insn decoder needs to be fixed. Even my decoder can
decode it:

$ echo "0xf1" | ./x86d -
0: f1 icebp

Big deal. :-)

Let's do some fun and games:

$ cat icebp.c
int main()
{
asm volatile(".byte 0xf1");

return 0;
}

$ gcc -Wall -o icebp{,.c}
$ objdump -d icebp

...

00000000004004ac <main>:
4004ac: 55 push %rbp
4004ad: 48 89 e5 mov %rsp,%rbp
4004b0: f1 icebp
4004b1: b8 00 00 00 00 mov $0x0,%eax
4004b6: 5d pop %rbp
4004b7: c3 retq
4004b8: 90 nop
...

$ ./icebp
Trace/breakpoint trap

^ this in qemu.

On baremetal it gets a SIGTRAP with TRAP_BRKPT. Looks like signal
handling knows about it...

$ strace /tmp/icebp
execve("/tmp/icebp", ["/tmp/icebp"], [/* 27 vars */]) = 0
brk(0) = 0x1680000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e243d000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=127070, ...}) = 0
mmap(NULL, 127070, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f71e241d000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1729984, ...}) = 0
mmap(NULL, 3836448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f71e1e76000
mprotect(0x7f71e2015000, 2097152, PROT_NONE) = 0
mmap(0x7f71e2215000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19f000) = 0x7f71e2215000
mmap(0x7f71e221b000, 14880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f71e221b000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241a000
arch_prctl(ARCH_SET_FS, 0x7f71e241b700) = 0
mprotect(0x7f71e2215000, 16384, PROT_READ) = 0
mprotect(0x7f71e243f000, 4096, PROT_READ) = 0
munmap(0x7f71e241d000, 127070) = 0
--- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=4195505, si_uid=0} ---
+++ killed by SIGTRAP +++
Trace/breakpoint trap
Post by Andy Lutomirski
Fortunately, it looks like the vm86 case is correct (or as correct as
any of the vm86 junk can be), although I haven't tested it. I bet
that icebp is like int3 in that it punches through vm86 mode instead
of sending #GP.
Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.

But why do you say it is special - it simply raises #DB, i.e. vector 1.
Web page seems to say so when interrupt redirection is disabled. It
sounds like a nice and quick way to generate a breakpoint. You can do
that with INT 01, i.e., the CD opcode, too.

If I'd had to guess, it isn't documented because of the proprietary ICE
aspect. And no one uses ICEs anymore so it is going to be forgotten with
people popping off and on and asking about the undocumented opcode.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-07-31 09:30:02 UTC
Permalink
Post by Borislav Petkov
$ ./icebp
Trace/breakpoint trap
^ this in qemu.
Is the strace different between KVM and baremetal? QEMU dynamic
translation is broken I think, but KVM should be the same as baremetal.
Post by Borislav Petkov
Post by Andy Lutomirski
Fortunately, it looks like the vm86 case is correct (or as correct as
any of the vm86 junk can be), although I haven't tested it. I bet
that icebp is like int3 in that it punches through vm86 mode instead
of sending #GP.
Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.
No, it sends #GP.
Post by Borislav Petkov
But why do you say it is special - it simply raises #DB, i.e. vector 1.
Web page seems to say so when interrupt redirection is disabled. It
sounds like a nice and quick way to generate a breakpoint. You can do
that with INT 01, i.e., the CD opcode, too.
If I'd had to guess, it isn't documented because of the proprietary ICE
aspect. And no one uses ICEs anymore so it is going to be forgotten with
people popping off and on and asking about the undocumented opcode.
The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC which Thomas mentioned. SALC all has been there since the 8086,
and has been undocumented for thirty-odd years.

The AAM/AAD variants with immediates other than 10 also have been
undocumented for fifteen years or so (an instruction doing a division by
10 where the second byte of the opcode is 10? oh, certainly no one is
going to try changing the second byte...)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Borislav Petkov
2015-07-31 10:30:01 UTC
Permalink
Post by Paolo Bonzini
Is the strace different between KVM and baremetal?
Yes, the signal part is missing from kvm:

$ strace ./icebp
execve("./icebp", ["./icebp"], [/* 20 vars */]) = 0
brk(0) = 0x601000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7ff6000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95207, ...}) = 0
mmap(NULL, 95207, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffff7fde000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0
mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7a53000
mprotect(0x7ffff7bd3000, 2097152, PROT_NONE) = 0
mmap(0x7ffff7dd3000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7ffff7dd3000
mmap(0x7ffff7dd8000, 18520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dd8000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdb000
arch_prctl(ARCH_SET_FS, 0x7ffff7fdc700) = 0
mprotect(0x7ffff7dd3000, 16384, PROT_READ) = 0
mprotect(0x7ffff7ffc000, 4096, PROT_READ) = 0
munmap(0x7ffff7fde000, 95207) = 0
exit_group(0) = ?
Post by Paolo Bonzini
No, it sends #GP.
True story:

[ 697.707990] traps: icebp[3537] general protection ip:4004b0 sp:7fffffffe610 error:a in icebp[400000+1000]

but why? I guess our IDT entry at 1 is funny... Too lazy to check.
Post by Paolo Bonzini
The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC which Thomas mentioned. SALC all has been there since the 8086,
and has been undocumented for thirty-odd years.
That one is invalid (on an IVB):

[ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]

AMD APM documents it as invalid too.
Post by Paolo Bonzini
The AAM/AAD variants with immediates other than 10 also have been
undocumented for fifteen years or so (an instruction doing a division
by 10 where the second byte of the opcode is 10? oh, certainly no one
is going to try changing the second byte...)
There's this in the AMD APM:

"In most modern assemblers, the AAM instruction adjusts to base-10
values. However, by coding the instruction directly in binary, it can
adjust to any base specified by the immediate byte value (ib) suffixed
onto the D4h opcode. For example, code D408h for octal, D40Ah for
decimal, and D40Ch for duodecimal (base 12)."
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-07-31 10:30:02 UTC
Permalink
Post by Borislav Petkov
Post by Paolo Bonzini
The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC which Thomas mentioned. SALC all has been there since the 8086,
and has been undocumented for thirty-odd years.
[ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
AMD APM documents it as invalid too.
It's valid in 32-bit.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Borislav Petkov
2015-07-31 10:40:01 UTC
Permalink
Post by Paolo Bonzini
Post by Borislav Petkov
Post by Paolo Bonzini
The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC which Thomas mentioned. SALC all has been there since the 8086,
and has been undocumented for thirty-odd years.
[ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
AMD APM documents it as invalid too.
It's valid in 32-bit.
Yap, no invalid opcode there. I guess there's another bug in the APM's
opcode table then.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Maciej W. Rozycki
2015-09-07 05:40:02 UTC
Permalink
Post by Borislav Petkov
Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.
But why do you say it is special - it simply raises #DB, i.e. vector 1.
Web page seems to say so when interrupt redirection is disabled. It
sounds like a nice and quick way to generate a breakpoint. You can do
that with INT 01, i.e., the CD opcode, too.
If I'd had to guess, it isn't documented because of the proprietary ICE
aspect. And no one uses ICEs anymore so it is going to be forgotten with
people popping off and on and asking about the undocumented opcode.
FYI, it's actually still in use with modern hardware, as a software
breakpoint (and hence it has to be a single byte INT1 instruction rather
than a multiple-byte regular INT 1 encoding) with JTAG probe hardware used
for bare-metal debugging. E.g. Intel Atom supports it and boards have
been available with a JTAG connector, which Intel calls XDP aka Extended
Debug Port, e.g. the D945GCLF board (aka Crown Beach IIRC) had one.

By fiddling with some bits in the CPU, which are only accessible through
JTAG, probe firmware takes control over #DB making it trap into the debug
mode rather than into the kernel. As noted above INT1 is used rather than
INT3 (which still traps into the kernel with #BP as usually) for software
breakpoints, but all the other DR0-7 resources are also available to the
probe and the General Detect fault is used to prevent the kernel from
fiddling with them. Similarly single-stepping traps into probe firmware.
Debug mode transitions are completely transparent to any kernel-mode
software run.

I did some work on this a few years ago, including emulating DR0-7
accesses in software down the JTAG handler upon a General Detect fault to
keep the kernel both happy and away from real debug registers. ;) Yes,
you can debug any software with this stuff, including the Linux kernel:
set instruction and data breakpoints, single-step it, poke at all hardware
registers, including descriptor registers not otherwise accessible (you
can set funny modes for segments, also in the 64-bit mode), etc. One
complication though is you operate on physical addresses when poking at
memory, you can't ask the CPU's MMU to remap them for you (you can walk
page tables manually of course, just as the MMU would).

I hope this clears things a bit around this stuff. :) You might be able
to find some more by issuing a query for "Extended Debug Port" with your
favourite Internet search engine.

It's been a while since this discussion, but I thought I'd chime in as
you might find it interesting. I'm actually a bit surprised the knowledge
about this is so poor among x86 experts.

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Ingo Molnar
2015-09-07 07:50:02 UTC
Permalink
I did some work on this a few years ago, including emulating DR0-7 accesses in
software down the JTAG handler upon a General Detect fault to keep the kernel
both happy and away from real debug registers. ;) Yes, you can debug any
software with this stuff, including the Linux kernel: set instruction and data
breakpoints, single-step it, poke at all hardware registers, including
descriptor registers not otherwise accessible (you can set funny modes for
segments, also in the 64-bit mode), etc. One complication though is you operate
on physical addresses when poking at memory, you can't ask the CPU's MMU to
remap them for you (you can walk page tables manually of course, just as the MMU
would).
Essentially the ICE breakpoint instruction enters SMM mode?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Maciej W. Rozycki
2015-09-07 08:20:02 UTC
Permalink
Post by Ingo Molnar
I did some work on this a few years ago, including emulating DR0-7 accesses in
software down the JTAG handler upon a General Detect fault to keep the kernel
both happy and away from real debug registers. ;) Yes, you can debug any
software with this stuff, including the Linux kernel: set instruction and data
breakpoints, single-step it, poke at all hardware registers, including
descriptor registers not otherwise accessible (you can set funny modes for
segments, also in the 64-bit mode), etc. One complication though is you operate
on physical addresses when poking at memory, you can't ask the CPU's MMU to
remap them for you (you can walk page tables manually of course, just as the MMU
would).
Essentially the ICE breakpoint instruction enters SMM mode?
I didn't do stuff at the probe firmware level so I can't say for sure,
but my gut feeling is the debug mode is indeed very close if not the same
as SMM. I think duplicating the logic would be an unnecessary waste of
silicon.

And obviously it's any cause of #DB that enters this mode. The probe can
also request it right at the exit from the reset state, so that you can
debug software (e.g BIOS startup) right from the reset vector. You don't
need working RAM for that.

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-09-07 10:30:02 UTC
Permalink
Post by Maciej W. Rozycki
Post by Ingo Molnar
Essentially the ICE breakpoint instruction enters SMM mode?
I didn't do stuff at the probe firmware level so I can't say for sure,
but my gut feeling is the debug mode is indeed very close if not the same
as SMM. I think duplicating the logic would be an unnecessary waste of
silicon.
I researched SMM a bit recently in order to implement it in KVM, and the
best source of folklore seems to be http://www.rcollins.org/ddj (which I
also have on paper :)).

The author there says that SMM design was roughly based on the 386's
probe/ICE mode design, but it's actually separate. Most notably, on the
386 the state save areas almost mirror each other, but when I say
mirror... I do mean mirror: directions are reversed, and what is on top
for probe mode is on bottom for SMM. :)

In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel
who actually won the lawsuit. I couldn't find more information about
the lawsuit.

It's probably diverged more and more over time, for example because SMM
is now considered security-sensitive while probe mode isn't. In
addition, the same DDJ article says that Pentium JTAG probe mode
"doesn't resemble SMM at all, doesn't use a state save map, or even
execute any code of its own", whatever that means.

Paolo
Post by Maciej W. Rozycki
And obviously it's any cause of #DB that enters this mode. The probe can
also request it right at the exit from the reset state, so that you can
debug software (e.g BIOS startup) right from the reset vector. You don't
need working RAM for that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Maciej W. Rozycki
2015-09-07 17:10:02 UTC
Permalink
Post by Paolo Bonzini
Post by Maciej W. Rozycki
I didn't do stuff at the probe firmware level so I can't say for sure,
but my gut feeling is the debug mode is indeed very close if not the same
as SMM. I think duplicating the logic would be an unnecessary waste of
silicon.
I researched SMM a bit recently in order to implement it in KVM, and the
best source of folklore seems to be http://www.rcollins.org/ddj (which I
also have on paper :)).
Robert did an excellent job figuring it all, but his stuff is a bit
dated, things may have changed since, especially as JTAG debugging has
since become ubiquitous in the embedded world and consequently better
developed.
Post by Paolo Bonzini
The author there says that SMM design was roughly based on the 386's
probe/ICE mode design, but it's actually separate. Most notably, on the
386 the state save areas almost mirror each other, but when I say
mirror... I do mean mirror: directions are reversed, and what is on top
for probe mode is on bottom for SMM. :)
That might be a minor implementation detail, needed for whatever reason.
Post by Paolo Bonzini
In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel
who actually won the lawsuit. I couldn't find more information about
the lawsuit.
It's probably diverged more and more over time, for example because SMM
is now considered security-sensitive while probe mode isn't. In
addition, the same DDJ article says that Pentium JTAG probe mode
"doesn't resemble SMM at all, doesn't use a state save map, or even
execute any code of its own", whatever that means.
At least I am fairly sure the RSM instruction is used to quit the debug
mode just like with SMM, so I'd be surprised if they bothered implementing
separate state save area structures for the two modes. Some control bits
in the CPU may well be set differently between the two modes though,
addressing issues like security sensitivity you mentioned.

A state save/restore approach is definitely used (unlike with some other
processors that expose internal registers through JTAG directly) as you
cannot switch between operation modes (e.g. real vs protected) on the fly
while in the debug mode. You actually need to return to the regular mode
(e.g. ask to single-step a NOP) for a mode change to take effect. Ditto
about other registers -- any read-only bits are only masked out in the
register state once a regular-mode instruction has executed.

The use of RSM also prompts a question whether you can nest debug mode in
SMM (to debug SMM code) -- this is actually similar to the NMI vs IRET
issue considered in this thread -- or nest debug mode in debug mode, e.g.
by taking a #DB exception from an INT1 instruction while in either mode.
I don't know. Some other processors (MIPS) that implement a JTAG debug
mode allow such nesting and care has to be taken in probe firmware to
handle it correctly and ensure the context to return to is not clobbered
if such a situation is to be arranged. And also -- as you may have
expected -- the debug mode return instruction has to be avoided in the
nested handler.

These are all implementation-specific details, including the INT1
instruction, which is why I am not at all surprised that they are omitted
from architecture manuals. The JTAG debug mode itself is no rocket
science though, everybody seems to have it these days. Though for cost
and power consumption saving reasons the RTL block implementing the debug
module may obviously be omitted from production silicon known, perhaps by
definition, to never require one.

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-09-07 17:30:02 UTC
Permalink
Post by Maciej W. Rozycki
These are all implementation-specific details, including the INT1
instruction, which is why I am not at all surprised that they are omitted
from architecture manuals.
That bit is BS, though. The INT1 instruction, executed in user mode
(CPL3) with no hardware debugger attached, will enter the kernel
through a gate at vector 1, *even if that gate has DPL == 0*.

If there's an instruction that bypasses hardware protection
mechanisms, then Intel should document it rather than relying on OS
writers to know enough folklore to get it right.

Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
interrupt or trap gate only if an exception or interrupt is generated
with an INT n, INT 3, or INTO instruction." It does not say "the
processor does not check the DPL of the interrupt or trap gate if the
exception or interrupt is generated with the undocumented ICEBP
instruction."

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Maciej W. Rozycki
2015-09-07 19:40:01 UTC
Permalink
Post by Andy Lutomirski
Post by Maciej W. Rozycki
These are all implementation-specific details, including the INT1
instruction, which is why I am not at all surprised that they are omitted
from architecture manuals.
That bit is BS, though. The INT1 instruction, executed in user mode
(CPL3) with no hardware debugger attached, will enter the kernel
through a gate at vector 1, *even if that gate has DPL == 0*.
If there's an instruction that bypasses hardware protection
mechanisms, then Intel should document it rather than relying on OS
writers to know enough folklore to get it right.
Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
interrupt or trap gate only if an exception or interrupt is generated
with an INT n, INT 3, or INTO instruction." It does not say "the
processor does not check the DPL of the interrupt or trap gate if the
exception or interrupt is generated with the undocumented ICEBP
instruction."
It does not have to be mentioned, because it's implied by how the #DB
exception is propagated: regardless of its origin it never checks the DPL.
And user-mode software may well use POPF at any time to set the TF bit in
the flags register to the same effect, so the OS needs to be prepared for
a #DB exception it hasn't scheduled itself anyway.

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-09-07 22:00:02 UTC
Permalink
Post by Maciej W. Rozycki
Post by Andy Lutomirski
Post by Maciej W. Rozycki
These are all implementation-specific details, including the INT1
instruction, which is why I am not at all surprised that they are omitted
from architecture manuals.
That bit is BS, though. The INT1 instruction, executed in user mode
(CPL3) with no hardware debugger attached, will enter the kernel
through a gate at vector 1, *even if that gate has DPL == 0*.
If there's an instruction that bypasses hardware protection
mechanisms, then Intel should document it rather than relying on OS
writers to know enough folklore to get it right.
Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
interrupt or trap gate only if an exception or interrupt is generated
with an INT n, INT 3, or INTO instruction." It does not say "the
processor does not check the DPL of the interrupt or trap gate if the
exception or interrupt is generated with the undocumented ICEBP
instruction."
It does not have to be mentioned, because it's implied by how the #DB
exception is propagated: regardless of its origin it never checks the DPL.
And user-mode software may well use POPF at any time to set the TF bit in
the flags register to the same effect, so the OS needs to be prepared for
a #DB exception it hasn't scheduled itself anyway.
Not really.

int $1 checks DPL. Setting TF results in saved TF set and the
corresponding bit in DR6 set as well. Triggering a #DB using the
debug registers requires active OS help.

So operating systems need to handle a #DB without no indicated cause
without spewing warnings or crashing, and there is no indication
whatsoever in the SDM or APM that this is the case.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Maciej W. Rozycki
2015-09-08 16:30:02 UTC
Permalink
Post by Andy Lutomirski
Post by Maciej W. Rozycki
It does not have to be mentioned, because it's implied by how the #DB
exception is propagated: regardless of its origin it never checks the DPL.
And user-mode software may well use POPF at any time to set the TF bit in
the flags register to the same effect, so the OS needs to be prepared for
a #DB exception it hasn't scheduled itself anyway.
Not really.
int $1 checks DPL. Setting TF results in saved TF set and the
corresponding bit in DR6 set as well. Triggering a #DB using the
debug registers requires active OS help.
INT $1 is a software interrupt instruction, it does not trigger a #DB.
Similarly INT $13 checks DPL while #GP does not. Or maybe INT $6 vs UD2
is a better analogy; the latter is as much INT6 as the 0xf1 encoding is
INT1.

Yes, you'll get a DR6 status with no new bits set. So what? You can
ignore it and IRET with no adverse effects. You can print diagnostics if
you're pedantic. You can kill the offending user program, but that's no
harm, because it already did the undefined. None of these is an issue,
and certainly not one for security.

Panicking OTOH would be, but that would IMHO be a silly choice and a bad
OS design. You never need to crash due to a user-mode exception, even an
unknown one. What if you run on a new CPU which has a new user-mode
exception unknown at the time the OS binary was compiled? That's an
analogous situation for an architecture like x86 where strict backwards
compatibility is maintained.

A reasonable #DB handler will do something like:

{
int dr6 = read_dr6();

write_dr6(0);
if (dr6 & DR6_MASK_X)
handle_dr6_x();
if (dr6 & DR6_MASK_Y)
handle_dr6_y();
/* Etc... */

return;
}

and will work just fine where invoked with no bits set in DR6.
Post by Andy Lutomirski
So operating systems need to handle a #DB without no indicated cause
without spewing warnings or crashing, and there is no indication
whatsoever in the SDM or APM that this is the case.
Strictly speaking the SDM does not state that at least one status bit
shall be set in DR6 either.

FAOD I'm not saying of course that documenting INT1 as a model-specific
instruction encoding reserved for #DB generation or stating something to
the effect that the OS is required to handle (e.g. discard) a #DB
exception seen with no status bits set in DR6 would be bad. No, it would
certainly be nice. But I maintain that I don't see it as strictly
necessary.

Pester Intel if you disagree, I'm not the right person to complain about
it anyway. ;)

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Steven Rostedt
2015-07-24 21:10:02 UTC
Permalink
On Fri, 24 Jul 2015 22:51:19 +0200
Post by Peter Zijlstra
Post by Linus Torvalds
Do we really take locks in the #DB handler?
do_debug()
send_sigtrap()
force_sig_info()
spin_lock_irqsave()
Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.
Or that force_sig_info() in send_sigtrap() looks like it can easily be
change to use an irq work queue.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-25 00:00:02 UTC
Permalink
Post by Peter Zijlstra
do_debug()
send_sigtrap()
force_sig_info()
spin_lock_irqsave()
Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.
Ugh. As Andy said, I think that's ok, because it's actually the
single-step case, and won't trigger for kernel mode. So we should be
ok. Although the code I agree is not good.

I'd personally be more worried about the usual crazy "notify_die()"
crap. I absoluely detest those notifier chain things. They are hooks
for random crap that shouldn't be hooked into, but whatever. It's not
a problem in practice, it's just a sign of a certain kind of diseased
mind.

On the whole I think we're ok. I'd love to get rid of things, and yes,
I think we should probably explicitly handle the in-kernel case first
and just return without doing anything, just to make the code more
obviously safe. But it doesn't look like a fundamental problem spot.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 15:40:04 UTC
Permalink
On Fri, 24 Jul 2015 17:26:37 +0200
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
Do we care? If it was an instruction breakpoint with !IF set, then it
had to have happened in the kernel. And kgdb or whatever added it there
needs to deal with that.

There should be no instances in the kernel where we execute userspace
code with interrupts disabled.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 16:00:02 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 17:26:37 +0200
Post by Willy Tarreau
Post by Steven Rostedt
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
Do we care? If it was an instruction breakpoint with !IF set, then it
had to have happened in the kernel. And kgdb or whatever added it there
needs to deal with that.
I was concerned that an RW BP would remain disabled when returning to
user space but Peter cleared that out by pointing me to the discussion
where it was explained that they are re-enabled when returning to user
space.

So no problem here for me.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-24 15:50:02 UTC
Permalink
Post by Peter Zijlstra
Post by Linus Torvalds
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
local_irq_disable();
copy_from_user_inatomic();
and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.
The Intel SDM says:

17.3.1.2 Data Memory and I/O Breakpoint Exception Conditions

Data memory and I/O breakpoints are reported when the processor
attempts to access a memory or I/O address
specified in a breakpoint-address register (DR0 through DR3) that has
been set up to detect data or I/O accesses
(R/W flag is set to 1, 2, or 3). The processor generates the exception
after it executes the instruction that made the
access, so these breakpoint condition causes a trap-class exception to
be generated.

So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
we hit a watchpoint. So this might be as simple as:

if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
for (i = 0; i < 4; i++)
if (dr6 & (DR_TRAP0<<i)) {
/* hit a kernel breakpoint with IF clear */
dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
}

I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.

If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 16:10:02 UTC
Permalink
On Fri, 24 Jul 2015 08:48:57 -0700
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
Um, isn't 0xf * DR_TRAP0 same as a constant "true"?

-- Steve
Post by Andy Lutomirski
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
for (i = 0; i < 4; i++)
if (dr6 & (DR_TRAP0<<i)) {
/* hit a kernel breakpoint with IF clear */
dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
}
I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.
If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.
--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 16:10:02 UTC
Permalink
Post by Steven Rostedt
On Fri, 24 Jul 2015 08:48:57 -0700
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
For me it's a typo, it should have been :

if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |

(the purpose was to read all 4 lower bits at once).

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 16:40:02 UTC
Permalink
On Fri, 24 Jul 2015 18:08:06 +0200
Post by Willy Tarreau
Post by Steven Rostedt
Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
(the purpose was to read all 4 lower bits at once).
I figured that after I sent it, but the 0xf * DR_TRAP0 is also
confusing to me. Actually, why not use proper naming:

dr6 & DR_TRAP_BITS

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-24 16:10:03 UTC
Permalink
On Fri, 24 Jul 2015 08:48:57 -0700
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
Also, wouldn't !(regs->X86_EFLAGS_IF) && !user_mode(regs) be a bug?
When do we allow coming from userspace with interrupts disabled?

-- Steve
Post by Andy Lutomirski
for (i = 0; i < 4; i++)
if (dr6 & (DR_TRAP0<<i)) {
/* hit a kernel breakpoint with IF clear */
dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
}
I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.
If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.
--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 16:30:02 UTC
Permalink
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
we hit a watchpoint.
Apparently after reading 17.3.1.1, it seems like RF can still be set
if a data breakpoint triggers in a repeated string instruction before
the last iteration. However I don't think we care because as long as
we return to the string instruction, since the data location was already
visited it won't trigger again so the loss of the flag should be safe.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-24 17:30:02 UTC
Permalink
Post by Willy Tarreau
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
we hit a watchpoint.
Apparently after reading 17.3.1.1, it seems like RF can still be set
if a data breakpoint triggers in a repeated string instruction before
the last iteration. However I don't think we care because as long as
we return to the string instruction, since the data location was already
visited it won't trigger again so the loss of the flag should be safe.
Oh, right. So my proposal is wrong: it'll clear a watchpoint
incorrectly if we hit it in the middle of a string operation.

So we should either parse dr0..dr3 (whichever one triggered) or do
Peter's think and clear dr7 entirely. I still prefer just clearing
the breakpoint that triggered.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 17:20:02 UTC
Permalink
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
for (i = 0; i < 4; i++)
if (dr6 & (DR_TRAP0<<i)) {
/* hit a kernel breakpoint with IF clear */
dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
}
I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.
If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.
Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment but nowhere it says that the target is
validated for effectively being userland, and further it suggests that it
doesn't validate anything :

"It is the responsibility of the OS to ensure the descriptors in
the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
(consistent with the base, limit, and attribute values forced by
the instructions)."

The OS has to set the RSP by itself before doing SYSRET, which opens a
race between "mov rsp" and "sysret", but if we only take that path once
we figure we come from NMI (using just IF+RSP), we know that IRQs and
NMIs are still disabled and cannot strike at this instant. Maybe MCEs
can, but they would execute within the NMI's stack just as if they were
triggered inside the NMI as well so I don't see a problem here.

I tried to imagine a case where kernel page faults, then NMI comes in,
then debug strikes and we have to return from debug to NMI then to fault
handler and I don't think we break the chain. Of course there are many
subtleties I can't grab because I don't understand all the details.

Do you think that could simplify things or that it's another stupid idea ?

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-24 17:30:02 UTC
Permalink
Post by Willy Tarreau
Post by Andy Lutomirski
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
for (i = 0; i < 4; i++)
if (dr6 & (DR_TRAP0<<i)) {
/* hit a kernel breakpoint with IF clear */
dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
}
I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.
If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.
Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment but nowhere it says that the target is
validated for effectively being userland, and further it suggests that it
"It is the responsibility of the OS to ensure the descriptors in
the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
(consistent with the base, limit, and attribute values forced by
the instructions)."
You are an evil bastard. I seriously doubt that this will work.
SYSRET goes to CPL3 no matter what. Also, I don't think you want to
start poking at MSRs to return.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Paolo Bonzini
2015-07-30 16:00:02 UTC
Permalink
Post by Andy Lutomirski
Post by Willy Tarreau
Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment but nowhere it says that the target is
validated for effectively being userland, and further it suggests that it
"It is the responsibility of the OS to ensure the descriptors in
the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
(consistent with the base, limit, and attribute values forced by
the instructions)."
You are an evil bastard. I seriously doubt that this will work.
SYSRET goes to CPL3 no matter what. Also, I don't think you want to
start poking at MSRs to return.
On Intel the bottom two bits of the selector are forced to 11. The
pseudocode of SYSRET in the SDM has an explicit

CS.Selector ← (IA32_STAR[63:48]+ either 0 or 16) OR 3;
...
SS.Selector ← (IA32_STAR[63:48]+8) OR 3;

On AMD it's even worse, because you get a weird state with
CS.DPL=CS.RPL=SS.DPL=SS.RPL=0 but still the CPL is 3. This is seriously
messed up because the CPL is always SS.DPL except in this case. AMD
even had to add a separate field for the CPL to their VM control block,
just to account for this case. Intel more sanely uses SS.DPL.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-24 17:30:02 UTC
Permalink
Post by Willy Tarreau
The OS has to set the RSP by itself before doing SYSRET, which opens a
race between "mov rsp" and "sysret", but if we only take that path once
we figure we come from NMI (using just IF+RSP), we know that IRQs and
NMIs are still disabled and cannot strike at this instant. Maybe MCEs
can, but they would execute within the NMI's stack just as if they were
triggered inside the NMI as well so I don't see a problem here.
OK too bad I just found the response here in the code :-(

* SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
* restoring TF results in a trap from userspace immediately after
* SYSRET. This would cause an infinite loop whenever #DB happens
* with register state that satisfies the opportunistic SYSRET
* conditions.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 21:50:02 UTC
Permalink
Post by Steven Rostedt
On Thu, 23 Jul 2015 14:08:59 -0700
Post by Linus Torvalds
Post by Andy Lutomirski
Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.
We don't re-arm it.
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?
Nothing should be adding hw breakpoints to NMI code anyway. Sounds
perfectly reasonable to me. Of course, how we tell we are in NMI
brings back all the races as we had in the nesting code. We can check
the per-cpu variable that is set with nmi_enter() and cleared at
nmi_exit() but what happens if the breakpoint is outside those calls.
We can check the stack pointer, but then we are back to userspace
fooling us. Maybe add the DF trick again?
Can't the back link of the TSS tell us where we come from ? At least
it should not be manipulable from user-space.
Not on 64-bit -- there are no tasks :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-23 22:00:03 UTC
Permalink
Post by Andy Lutomirski
Can't the back link of the TSS tell us where we come from ? At least
it should not be manipulable from user-space.
Not on 64-bit -- there are no tasks :)
Ah crap, sorry for the noise then!

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-23 21:50:02 UTC
Permalink
Post by Steven Rostedt
On Thu, 23 Jul 2015 14:08:59 -0700
Post by Linus Torvalds
Post by Andy Lutomirski
Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.
We don't re-arm it.
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?
Nothing should be adding hw breakpoints to NMI code anyway. Sounds
perfectly reasonable to me. Of course, how we tell we are in NMI
brings back all the races as we had in the nesting code. We can check
the per-cpu variable that is set with nmi_enter() and cleared at
nmi_exit() but what happens if the breakpoint is outside those calls.
We can check the stack pointer, but then we are back to userspace
fooling us. Maybe add the DF trick again?
Can't the back link of the TSS tell us where we come from ? At least
it should not be manipulable from user-space.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 21:00:02 UTC
Permalink
Post by Linus Torvalds
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.
#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.
What's the worst case that can happen with RF cleared when returing
to user space ? My understanding is that it's just that we risk to
break again on an instruction that had a break point set and which
already triggered the breakpoint, right ?
I assume Linus meant returning to kernel space with RF clear. Returns
to userspace have their own fancy logic here, and it's survived for a
couple of releases, including through an explicit test of RF handling
:)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-23 21:10:03 UTC
Permalink
Post by Andy Lutomirski
Post by Linus Torvalds
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.
#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.
What's the worst case that can happen with RF cleared when returing
to user space ? My understanding is that it's just that we risk to
break again on an instruction that had a break point set and which
already triggered the breakpoint, right ?
I assume Linus meant returning to kernel space with RF clear. Returns
to userspace have their own fancy logic here, and it's survived for a
couple of releases, including through an explicit test of RF handling
:)
Ah you must be right, got it. Yes you want to break into the NMI handler
and you either disable all breakpoints/single-step until the NMI's iret
by clearing DR7, or you loop over and over on the same instruction if
you try to restart the stopped instruction with RF clear. That makes
sense.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-23 21:00:03 UTC
Permalink
Post by Linus Torvalds
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.
#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.
What's the worst case that can happen with RF cleared when returing
to user space ? My understanding is that it's just that we risk to
break again on an instruction that had a break point set and which
already triggered the breakpoint, right ?

If so the problem probably is whether there's a risk of looping again
without ever getting a chance to execute this instruction normally.
But if the NMIs don't bomb as fast as we can process them, at some
point the instruction should get a chance to be executed, so the
problem doesn't seem dramatic.

That makes me think that I have no idea what happens if we try to
step-trace "int 2", I don't even know if we pass through the NMI
handler.

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Willy Tarreau
2015-07-23 21:20:02 UTC
Permalink
What's the worst case that can happen with RF cleared when returing
to user space ?
Not a good idea. We are fine breaking breakpoints on the kernel ("use
the tracing infrastructure instead"). Breaking it in user space is not
really an option.
But that wouldn't disable the breakpoint, just make it strike again,
so the user would not be hurt.
And we really don't need to. We'd only use 'ret' when returning to
kernel code. And not even for the usual case, only for the "interrupts
are off" case. If somebody tries to put a breakpoint on something
that is used in an irq-off situation, they are doing something very
specialized, and we cna tell them: "sorry, we had to break your use
case because it's crazy any other way".
Those kind of people are by definition not "users". They are mucking
with kernel internals. Breaking them is not a regression.
Btw, we should still ask Intel for that "fast iret that doesn't
re-enable NMI". So for possible future CPU's we might let people do
crazy things again.
I'm just thinking that there should be an option for this : task switching.
You can store the EFLAGS in the TSS, so by preparing a dummy task with
everything needed to emulate iret, we might be able to do it without the
iret instruction. Or is this a stupid idea ? At least now I've well
understood that ugliness is not an excuse for not proposing something :-)

Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 21:20:02 UTC
Permalink
What's the worst case that can happen with RF cleared when returing
to user space ?
Not a good idea. We are fine breaking breakpoints on the kernel ("use
the tracing infrastructure instead"). Breaking it in user space is not
really an option.

And we really don't need to. We'd only use 'ret' when returning to
kernel code. And not even for the usual case, only for the "interrupts
are off" case. If somebody tries to put a breakpoint on something
that is used in an irq-off situation, they are doing something very
specialized, and we cna tell them: "sorry, we had to break your use
case because it's crazy any other way".

Those kind of people are by definition not "users". They are mucking
with kernel internals. Breaking them is not a regression.

Btw, we should still ask Intel for that "fast iret that doesn't
re-enable NMI". So for possible future CPU's we might let people do
crazy things again.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-23 21:30:02 UTC
Permalink
Post by Linus Torvalds
And the "take them and disable them" is really simple. No "am I in an
NMI contect" thing (because that leads to the whole question about
"what is NMI context"). That's not the real rule anyway.
No, make it very simple and straightforward. Make the test be "uhhuh,
I got a #DB in kernel mode, and interrupts were disabled - I know I'm
going to return with "ret", so I'm just going to have to disable this
breakpoint".
Nothing clever. Nothing subtle. Nothing that needs "this range of
instructions is magical". No. Just a very simple rule: if the context
we return to is kernel mode and interrupts are disabled, we're using
'ret', so we cannot suppress debug faults.
Did I miss something? There were a lot of emails flying around, but I
*thought* I saw them all..
So the NMI could trigger userspace debug register faults, and simply
disabling them would make the whole debug register thing entirely
unreliable.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 21:40:03 UTC
Permalink
Post by Peter Zijlstra
So the NMI could trigger userspace debug register faults, and simply
disabling them would make the whole debug register thing entirely
unreliable.
We could easily set something to re-enable them for when we actually
return to user space. I'd be ok with just setting the
_TIF_USER_WORK_MASK.

But even that should not be a requirement for the basic stability and
core integrity of the kernel. Not like the current horrid mess with
NMI nesting and ESP fixing etc.

And realistically, nobody will ever even notice. So the whole "ok, we
can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
that is more like cleaning up things, not a core issue.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 21:50:02 UTC
Permalink
On Thu, Jul 23, 2015 at 2:35 PM, Linus Torvalds
Post by Linus Torvalds
Post by Peter Zijlstra
So the NMI could trigger userspace debug register faults, and simply
disabling them would make the whole debug register thing entirely
unreliable.
We could easily set something to re-enable them for when we actually
return to user space. I'd be ok with just setting the
_TIF_USER_WORK_MASK.
But even that should not be a requirement for the basic stability and
core integrity of the kernel. Not like the current horrid mess with
NMI nesting and ESP fixing etc.
And realistically, nobody will ever even notice. So the whole "ok, we
can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
that is more like cleaning up things, not a core issue.
Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler). I don't want to break regular
userspace gdb when perf is running.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 22:00:03 UTC
Permalink
Post by Andy Lutomirski
Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler). I don't want to break regular
userspace gdb when perf is running.
I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it is exactly due to "where is the boundary" issues.

That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.

I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.

I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.

Linus

(*) And yes, sysenter and not having a stack at all is very special,
and I think we will *always* have to have that magical special case of
the first few instructions there. But that's a separate hardware
limitation we can't get around.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 22:10:02 UTC
Permalink
On Thu, Jul 23, 2015 at 2:54 PM, Linus Torvalds
Post by Linus Torvalds
Post by Andy Lutomirski
Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler). I don't want to break regular
userspace gdb when perf is running.
I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it is exactly due to "where is the boundary" issues.
That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.
I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.
I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.
OK, new proposal:

In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.

Why? It only looks at hardware state (dr6 and dr7), and it can't
break gdb, because gdb can't set a breakpoint that will cause this
problem.

All the other variants of this either need cached state or break gdb
watchpoints on stack variables with perf running.

--Andy
Post by Linus Torvalds
Linus
(*) And yes, sysenter and not having a stack at all is very special,
and I think we will *always* have to have that magical special case of
the first few instructions there. But that's a separate hardware
limitation we can't get around.
--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2015-07-23 22:10:02 UTC
Permalink
Post by Andy Lutomirski
In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.
Ack. The more targeted we can make this while still guaranteeing
forward progress, the better. So that sounds really good.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 10:30:03 UTC
Permalink
Post by Andy Lutomirski
In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.
Doesn't !IF already imply that it must be kernel space? AFAIK user space
cannot clear IF.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-24 11:10:01 UTC
Permalink
Post by Linus Torvalds
Post by Andy Lutomirski
Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler). I don't want to break regular
userspace gdb when perf is running.
I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it is exactly due to "where is the boundary" issues.
That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.
I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.
I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.
A wee bit something like so?

We need the intermediate self-IPI because NMI/MCE etc do not deal with
TIF flags.

I further cleared all of DR7 in an attempt at reducing the amount of
state tracked. And it doesn't distinguish between kernel/user
watchpoints because the kernel can touch both from !IF.

---
arch/x86/kernel/traps.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e65d8a9b8db..e8308e9c2b1e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,6 +570,33 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
NOKPROBE_SYMBOL(fixup_bad_iret);
#endif

+struct do_debug_state {
+ unsigned long dr7;
+ struct irq_work irq_work;
+ struct callback_head task_work;
+};
+
+static void __debug_irq_trampoline(struct irq_work *work)
+{
+ struct do_debug_state *dds =
+ container_of(work, struct do_debug_state, irq_work);
+
+ task_work_add(current, &dds->task_work, true);
+}
+
+static void __debug_restore_dr7(struct callback_head *work)
+{
+ struct do_debug_state *dds =
+ container_of(work, struct do_debug_state, task_work);
+
+ set_debugreg(dds->dr7, 7);
+}
+
+static DEFINE_PER_CPU(struct do_debug_state, do_debug_state) = {
+ .irq_work = { .func = __debug_irq_trampoline, },
+ .task_work = { .func = __debug_restore_dr7, },
+};
+
/*
* Our handling of the processor debug registers is non-trivial.
* We do not clear them on entry and exit from the kernel. Therefore
@@ -603,6 +630,16 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)

ist_enter(regs);

+ if (arch_irqs_disabled_flags(regs->flags)) {
+ struct do_debug_state *dds = this_cpu_ptr(&do_debug_state);
+
+ get_debugreg(dds->dr7, 7);
+ set_debugreg(0, 7);
+ irq_work_queue(&dds->irq_work);
+
+ goto exit;
+ }
+
get_debugreg(dr6, 6);

/* Filter out all the reserved bits which are preset to 1 */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2015-07-23 21:20:02 UTC
Permalink
Post by Andy Lutomirski
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What if we went all out and forbade page faults in NMI as well. There
are two reasons that I can think of that we might page fault inside an
b) User memory access faults.
The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space. What if we did it differently? Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode.
Peter, can this be done without breaking the perf ABI? If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.
Not really; but also almost :/

So the thing is that we currently attach the user backtrace to all
events -- and there can be many before we return to userspace again.

So none of those events would have a userspace stack, I'm sure that's
going to confuse the tooling.

OTOH, userspace stacks are a best effort thing, we bail at the first
sign of trouble (eg. the stack page is not there).

Now realistically this 'never' happens, and it would result in
consistently truncated user traces, where your proposal would result in
a whole bunch of events with no user traces and then an 'extra' event
with a one.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Steven Rostedt
2015-07-23 21:30:02 UTC
Permalink
On Thu, 23 Jul 2015 13:21:16 -0700
Post by Andy Lutomirski
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What about int3? Which is needed to make ftrace work. This was a
requirement to get rid of stomp-machine when updating ftrace functions,
as well as the rational for doing the whole NMI nesting work in the
first place.
Post by Andy Lutomirski
What if we went all out and forbade page faults in NMI as well. There
are two reasons that I can think of that we might page fault inside an
a) vmalloc fault. I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.
b) User memory access faults.
c) stack tracing faults

I would have NMIs debug deadlocks with printing stack traces. The stack
tracer can page fault, and before the NMI nesting code, while debugging
machines, these stack dumps would randomly reboot the box. While
writing the NMI nesting code I realized why those reboots happened, and
that was due to the stack trace faulting, and the printk from NMI was
slow enough to have another NMI go off and stomp over the outer NMIs
stack. Which lead to triple faults and such.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andy Lutomirski
2015-07-23 21:50:02 UTC
Permalink
Post by Steven Rostedt
On Thu, 23 Jul 2015 13:21:16 -0700
Post by Andy Lutomirski
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What about int3? Which is needed to make ftrace work. This was a
requirement to get rid of stomp-machine when updating ftrace functions,
as well as the rational for doing the whole NMI nesting work in the
first place.
OK, I'm convinced.

So I'll keep working on fixing up int3 to be less magical. Patches
coming eventually.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Raymond Jennings
2015-07-24 16:40:02 UTC
Permalink
Post by Andy Lutomirski
[moved to a new thread, cc list trimmed]
Hi all-
1. Allow nesting. We know quite well how messy that is.
This might be a stupid question, but

1. What exactly does the NMI handler handle
2. Is it possible for the NMI handler to just increment a counter and
return if it nests, and let the outer handler notice and rerun itself.
Post by Andy Lutomirski
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What if we went all out and forbade page faults in NMI as well. There
are two reasons that I can think of that we might page fault inside an
a) vmalloc fault. I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.
b) User memory access faults.
The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space. What if we did it differently? Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode. That would let us capture user state
*correctly*, which we currently can't really do. There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works. In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state. (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)
To make this work, we'd have to teach NMI-from-userspace to call the
prepare_exit_to_usermode() {
...
while (blah blah blah) {
if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
perf_capture_user_state();
...
}
...
}
and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunsitic sysret on exit
from NMI. (Why not? Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)
There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set. In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.
Peter, can this be done without breaking the perf ABI? If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.
If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best. Maybe we should go one step
farther and ban all DR7 breakpoints period. Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.
--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Loading...