Subject: [PATCH 00/91] pending uml patches

Discussion:

Al Viro

2011-08-18 18:58:59 UTC

My apologies for mailbomb from hell. *All* this stuff is available in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master,
but since uml folks had been stuck with mail and patch for a long time...

Anyway, most of the stuff in this pile is merging, cleaning and mutilating
subarchitecture-related code in arch/um. By the end of it we have x86
bits largely merged between 32bit and 64bit variants and taken to arch/x86/um;
headers seriously cleaned up and mostly free of x86-isms now (not completely -
we still have page size dependencies in there).

Beginning of the series are pure build and driver fixes; those should go to
Linus before 3.1-final, IMO.

As far as I know, there's no regressions introduced by that pile; testing
and comments would be, of course, welcome.

Richard Weinberger

2011-08-18 19:12:47 UTC

Permalink

Al,

Am 18.08.2011 20:58, schrieb Al Viro:
>
> My apologies for mailbomb from hell. *All* this stuff is available in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master,
> but since uml folks had been stuck with mail and patch for a long time...

Have you touched your patches since yesterday?
I've already pulled and uploaded them to my shiny new git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable

Due to the current mirroring problems with git.kernel.org I have not
sent made it public.
Sorry for that, I screwed it. :-(

> Anyway, most of the stuff in this pile is merging, cleaning and mutilating
> subarchitecture-related code in arch/um. By the end of it we have x86
> bits largely merged between 32bit and 64bit variants and taken to arch/x86/um;
> headers seriously cleaned up and mostly free of x86-isms now (not completely -
> we still have page size dependencies in there).
>
> Beginning of the series are pure build and driver fixes; those should go to
> Linus before 3.1-final, IMO.

Okay.

> As far as I know, there's no regressions introduced by that pile; testing
> and comments would be, of course, welcome.
>

There was a small build regression, I've already fixed it!

Thanks,
//richard

Al Viro

2011-08-18 19:19:46 UTC

Permalink

On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
> Have you touched your patches since yesterday?
> I've already pulled and uploaded them to my shiny new git repo at:
> git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable

Reordered and added missing S-o-b on a couple, split one commit.

Al Viro

2011-08-19 04:31:20 UTC

Permalink

On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
> > Have you touched your patches since yesterday?
> > I've already pulled and uploaded them to my shiny new git repo at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
>
> Reordered and added missing S-o-b on a couple, split one commit.

Umm... One comment after looking at your tree: you probably want to rebase
for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
linux-next).

And for pity sake, do *not* merge from Linus every day; that's one sure
way to get yourself flamed into crisp. Just trying to figure out
what's in your tree is a _hard_ exercise. git cherry between Linus'
tree and e.g. #fixes in yours gives a long list of commits, most of them
_probably_ duplicates of the stuff in mainline. What are bnx2 patches
doing in there, for example?

I've tried to figure out what's going on in there; AFAICS, your #fixes
is mainline plus
Al Viro (6):
um: fix oopsable race in line_close()
um: winch_interrupt() can happen inside of free_winch()
um: fix free_winch() mess
um: PTRACE_[GS]ETFPXREGS had been wired on the wrong subarch
um: fix strrchr problems
um: clean arch_ptrace() up a bit

Ingo van Lil (1):
um: Save FPU registers between task switches

Jonathan Neusch<C3><A4>fer (3):
UserModeLinux-HOWTO.txt: fix a typo
um: drivers/xterm.c: fix a file descriptor leak
UserModeLinux-HOWTO.txt: remove ^H characters

Thadeu Lima de Souza Cascardo (1):
um: disable CMPXCHG_DOUBLE as it breaks UML build

I've cherry-picked those on top of the same branchpoint; see
#cleaned-fixes in um-headers.git. AFAICS, that's the same contents as
your #fixes, with clean history. Diff against your #fixes consists of
- .irq_set_type = pmic_irq_type, <<<<<<< HEAD
- .irq_bus_lock = pmic_irq_buslock,
+ .irq_set_type = pmic_irq_type,
+ .irq_bus_lock = pmic_bus_lock,
in drivers/platform/x86/intel_pmic_gpio.c, which is an obvious mismerge
(AFAICS, on May 29).

IME the sane policy is to keep for-linus, pulling into it when Linus
pulls from you. At that point it's a fast-forward and all previous
history is not cluttering the things up anymore. for-next I rebase and
reorder at will, TBH, but generally I start it at the current tip of
for-linus.

Beyond what you've got in #for-3.2 I have a couple of commits, but that
can wait until the history is sorted out. As it is, I 100% guarantee
that pull request on your #fixes as it is will result in pyrotechnical
effects from hell (OK, from Linus, actually, but in this case there won't
be any real difference).

Richard Weinberger

2011-08-19 08:51:51 UTC

Permalink

Am 19.08.2011 06:31, schrieb Al Viro:
> On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
>> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
>>> Have you touched your patches since yesterday?
>>> I've already pulled and uploaded them to my shiny new git repo at:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
>>
>> Reordered and added missing S-o-b on a couple, split one commit.
>
> Umm... One comment after looking at your tree: you probably want to rebase
> for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
> linux-next).

Please slow down a bit. :-)
All these branches are just for testing purposes.
That's why I have not announced them nor sent a pull request to Linus.

Anyway, thanks for the hints!

Thanks,
//richard

Al Viro

2011-08-20 01:18:45 UTC

Permalink

On Fri, Aug 19, 2011 at 10:51:51AM +0200, Richard Weinberger wrote:

> Please slow down a bit. :-)
> All these branches are just for testing purposes.
> That's why I have not announced them nor sent a pull request to Linus.
>
> Anyway, thanks for the hints!

np... FWIW, there's a really ugly bug present in mainline as well as
in mainline + these patches and I'd welcome any help in figuring out
what's going on.

1) USER_OBJS do not see CONFIG_..., so os-Linux/main.c doesn't see
CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA. As the result, uml/i386 doesn't
notice that host vdso is there. That one is easy to fix:
-obj-$(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA) += elf_aux.o
+ifeq ($(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA),y)
+obj-y += elf_aux.o
+CFLAGS_main.o += -DCONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA
+endif
in arch/um/os-Linux/Makefile takes care of that. Unfortunately, it also
exposes a bug in fixrange_init():

2) fixrange_init() gets called with start (and end) not multiple of
PMD_SIZE; moreover, end is very close to the ~0UL - closer than by PMD_SIZE.
Bad things start happening to the loops in there. Again, easy to fix:

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 8137ccc..39ee674 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -119,19 +119,22 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
int i, j;
unsigned long vaddr;

- vaddr = start;
+ vaddr = start & PMD_MASK;
i = pgd_index(vaddr);
j = pmd_index(vaddr);
pgd = pgd_base + i;
+ start >>= PMD_SHIFT;
+ end = (end - 1) >> PMD_SHIFT;

- for ( ; (i < PTRS_PER_PGD) && (vaddr < end); pgd++, i++) {
+ for ( ; (i < PTRS_PER_PGD) && start <= end; pgd++, i++) {
pud = pud_offset(pgd, vaddr);
if (pud_none(*pud))
one_md_table_init(pud);
pmd = pmd_offset(pud, vaddr);
- for (; (j < PTRS_PER_PMD) && (vaddr < end); pmd++, j++) {
+ for (; (j < PTRS_PER_PMD) && start <= end; pmd++, j++) {
one_page_table_init(pmd);
vaddr += PMD_SIZE;
+ start++;
}
j = 0;
}

That populates the page tables in the right places and fixrange_user_init()
manages to call it, avoid death-by-oom from runaway allocations and then
install references to all pages it wants. Alas, at that point the things
become really interesting.

3) with the previous two issues dealt with, we get the following magical
mistery shite when running 32bit uml kernel + userland on 64bit host:
* the system boots all the way to getty/login and sshd (i.e. gets
through the debian /etc/init.d (squeeze/i386))
* one can log into it, both on terminals and over ssh. shell and
a bunch of other stuff works. Mostly.
* /bin/bash -c "echo *" reliably segfaults. Always. So does tab
completion in bash, for that matter.
* said segfault is reproducible both from shell and under gdb.
For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
What happens there apparently boils down to __kernel_vsyscall() getting
called (and yes, sys_brk() is called, succeeds and results in expected
value in %eax) and corrupting the living hell out of %ecx. Namely, on
return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
to (original value of) %ebp. All registers except %eax and %ecx (including
%esp and %ebp) remain unchanged.
Again, that happens only on the same call of brk(3) - all previous
calls succeed as expected. I don't believe that it's a race. I also
very much doubt that we are calling the wrong location - it's hard to tell
with the call being call *%gs:0x10 (is there any way to find what that
is equal to in gdb, BTW? Short of hot-patching movl *%gs:0x10,%eax in place
of that call and single-stepping it, that is...) but it *does* end up
making the system call that ought to have been made, so I suspect that it
does hit __kernel_vsyscall(), after all...

The text of __kernel_vsyscall() is
0xffffe420 <__kernel_vsyscall+0>: push %ebp
0xffffe421 <__kernel_vsyscall+1>: mov %ecx,%ebp
0xffffe423 <__kernel_vsyscall+3>: syscall
0xffffe425 <__kernel_vsyscall+5>: mov $0x2b,%ecx
0xffffe42a <__kernel_vsyscall+10>: mov %ecx,%ss
0xffffe42c <__kernel_vsyscall+12>: mov %ebp,%ecx
0xffffe42e <__kernel_vsyscall+14>: pop %ebp
0xffffe42f <__kernel_vsyscall+15>: ret
so %ecx on the way out becoming equal to original %ebp is bloody curious -
it would smell like entering that sucker 3 bytes too late and skipping
mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
on the way out - wrong return address, wrong value in %ebp, changed %esp.
None of that happens. And we are executing that code in userland - i.e.
to get corrupt it would have to get corrupt in *HOST* 32bit VDSO. Which
would have much more visible effects, starting with the next attempt to
run the testcase blowing up immediately instead of waiting (as it actually
does) for the same 10th call of brk()...

I'm at loss, to be honest. The sucker is nicely reproducible, but bisecting
doesn't help at all - it seems to be present all the way back at least to
2.6.33. I hadn't tried to go back further and I hadn't tried to go for
older host kernels, but I wouldn't put too much faith into that... The
reason it hadn't been noticed much earlier is that it works fine on i386
host - aforementioned shit happens only when the entire thing (identical
binary, identical fs image, identical options) is run on amd64. However,
on i386 I have a different __kernel_vsyscall, which might easily be the
reason it doesn't happen there. It's a K7 box with sysenter-based
variant ending up as __kernel_vsyscall(). Hell knows what's going on...
Behaviour is really weird and I'd appreciate any pointers re debugging
that crap. Suggestions?

Richard Weinberger

2011-08-20 15:22:23 UTC

Permalink

Am 20.08.2011 03:18, schrieb Al Viro:
> 3) with the previous two issues dealt with, we get the following magical
> mistery shite when running 32bit uml kernel + userland on 64bit host:
> * the system boots all the way to getty/login and sshd (i.e. gets
> through the debian /etc/init.d (squeeze/i386))
> * one can log into it, both on terminals and over ssh. shell and
> a bunch of other stuff works. Mostly.
> * /bin/bash -c "echo *" reliably segfaults. Always. So does tab
> completion in bash, for that matter.
> * said segfault is reproducible both from shell and under gdb.
> For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
> What happens there apparently boils down to __kernel_vsyscall() getting
> called (and yes, sys_brk() is called, succeeds and results in expected
> value in %eax) and corrupting the living hell out of %ecx. Namely, on
> return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
> to (original value of) %ebp. All registers except %eax and %ecx (including
> %esp and %ebp) remain unchanged.
> Again, that happens only on the same call of brk(3) - all previous
> calls succeed as expected. I don't believe that it's a race. I also
> very much doubt that we are calling the wrong location - it's hard to tell
> with the call being call *%gs:0x10 (is there any way to find what that
> is equal to in gdb, BTW? Short of hot-patching movl *%gs:0x10,%eax in place
> of that call and single-stepping it, that is...) but it *does* end up
> making the system call that ought to have been made, so I suspect that it
> does hit __kernel_vsyscall(), after all...
>
> The text of __kernel_vsyscall() is
> 0xffffe420<__kernel_vsyscall+0>: push %ebp
> 0xffffe421<__kernel_vsyscall+1>: mov %ecx,%ebp
> 0xffffe423<__kernel_vsyscall+3>: syscall
> 0xffffe425<__kernel_vsyscall+5>: mov $0x2b,%ecx
> 0xffffe42a<__kernel_vsyscall+10>: mov %ecx,%ss
> 0xffffe42c<__kernel_vsyscall+12>: mov %ebp,%ecx
> 0xffffe42e<__kernel_vsyscall+14>: pop %ebp
> 0xffffe42f<__kernel_vsyscall+15>: ret
> so %ecx on the way out becoming equal to original %ebp is bloody curious -
> it would smell like entering that sucker 3 bytes too late and skipping
> mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
> on the way out - wrong return address, wrong value in %ebp, changed %esp.
> None of that happens. And we are executing that code in userland - i.e.
> to get corrupt it would have to get corrupt in *HOST* 32bit VDSO. Which
> would have much more visible effects, starting with the next attempt to
> run the testcase blowing up immediately instead of waiting (as it actually
> does) for the same 10th call of brk()...
>
> I'm at loss, to be honest. The sucker is nicely reproducible, but bisecting
> doesn't help at all - it seems to be present all the way back at least to
> 2.6.33. I hadn't tried to go back further and I hadn't tried to go for
> older host kernels, but I wouldn't put too much faith into that... The
> reason it hadn't been noticed much earlier is that it works fine on i386
> host - aforementioned shit happens only when the entire thing (identical
> binary, identical fs image, identical options) is run on amd64. However,
> on i386 I have a different __kernel_vsyscall, which might easily be the
> reason it doesn't happen there. It's a K7 box with sysenter-based
> variant ending up as __kernel_vsyscall(). Hell knows what's going on...
> Behaviour is really weird and I'd appreciate any pointers re debugging
> that crap. Suggestions?

Hmmm, very strange.
Sadly I cannot reproduce the issue. :(
Everything works fine within UML.
(Of course I've applied your vDSO/i386 patches)

My test setup:
Host kernel: 2.6.37 and 3.0.1
Distro: openSUSE 11.4/x86_64

UML kernel: 3.1-rc2
Distro: openSUSE 11.1/i386

Does the problem also occur with another host kernel or a different
guest image?

Thanks,
//richard

Al Viro

2011-08-20 20:14:06 UTC

Permalink

On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:

> Hmmm, very strange.
> Sadly I cannot reproduce the issue. :(
> Everything works fine within UML.
> (Of course I've applied your vDSO/i386 patches)
>
> My test setup:
> Host kernel: 2.6.37 and 3.0.1
> Distro: openSUSE 11.4/x86_64
>
> UML kernel: 3.1-rc2
> Distro: openSUSE 11.1/i386
>
> Does the problem also occur with another host kernel or a different
> guest image?

Could you check what you get in __kernel_vsyscall()? On iAMD64 box
where that sucker contains sysenter-based variant the bug is not
present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80
differences.

I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/)
if you want to see if that gets reproduced on your box. I'll drop it on
anonftp if you are interested. FWIW, the same kernel binary/same image
result in
* K7 box - no breakage, SYSENTER-based vdso
* K8 box - breakage as described, SYSCALL-based vdso32
* P4 box - no breakage, SYSENTER-based vdso32
Hell knows... In theory that would seem to point towards ia32_cstar_target(),
so I'm going to RTFS carefully through that animal.

The thing is, whatever happens happens when victim gets resumed inside
vdso page. I'll try to dump PTRACE_SETREGS and see the values host
kernel asked to set and work from there, but the interesting part is
bloody hard to singlestep through - the victim is back to user mode and
it is already traced by the guest kernel, so it's not as if we could
attach host gdb to it and walk through that crap. And guest gdb is not
going to be able to set breakpoints in there - vdso page is r/o...

Richard Weinberger

2011-08-20 20:55:45 UTC

Permalink

Am 20.08.2011 22:14, schrieb Al Viro:
> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>
>> Hmmm, very strange.
>> Sadly I cannot reproduce the issue. :(
>> Everything works fine within UML.
>> (Of course I've applied your vDSO/i386 patches)
>>
>> My test setup:
>> Host kernel: 2.6.37 and 3.0.1
>> Distro: openSUSE 11.4/x86_64
>>
>> UML kernel: 3.1-rc2
>> Distro: openSUSE 11.1/i386
>>
>> Does the problem also occur with another host kernel or a different
>> guest image?
>
> Could you check what you get in __kernel_vsyscall()? On iAMD64 box
> where that sucker contains sysenter-based variant the bug is not
> present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80
> differences.

OK, this explains why I cannot reproduce it.
My Intel Core2 box is sysenter-based.

(gdb) disass __kernel_vsyscall
0xffffe420 <__kernel_vsyscall+0>: push %ecx
0xffffe421 <__kernel_vsyscall+1>: push %edx
0xffffe422 <__kernel_vsyscall+2>: push %ebp
0xffffe423 <__kernel_vsyscall+3>: mov %esp,%ebp
0xffffe425 <__kernel_vsyscall+5>: sysenter
0xffffe427 <__kernel_vsyscall+7>: nop
0xffffe428 <__kernel_vsyscall+8>: nop
0xffffe429 <__kernel_vsyscall+9>: nop
0xffffe42a <__kernel_vsyscall+10>: nop
0xffffe42b <__kernel_vsyscall+11>: nop
0xffffe42c <__kernel_vsyscall+12>: nop
0xffffe42d <__kernel_vsyscall+13>: nop
0xffffe42e <__kernel_vsyscall+14>: jmp
0xffffe423<__kernel_vsyscall+3>
0xffffe430 <__kernel_vsyscall+16>: pop %ebp
0xffffe431 <__kernel_vsyscall+17>: pop %edx
0xffffe432 <__kernel_vsyscall+18>: pop %ecx
0xffffe433 <__kernel_vsyscall+19>: ret

> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/)
> if you want to see if that gets reproduced on your box. I'll drop it on
> anonftp if you are interested. FWIW, the same kernel binary/same image
> result in
> * K7 box - no breakage, SYSENTER-based vdso
> * K8 box - breakage as described, SYSCALL-based vdso32
> * P4 box - no breakage, SYSENTER-based vdso32
> Hell knows... In theory that would seem to point towards ia32_cstar_target(),
> so I'm going to RTFS carefully through that animal.

Now I'm testing with a Debian fs from:
http://fs.devloop.org.uk/filesystems/Debian-Squeeze/

> The thing is, whatever happens happens when victim gets resumed inside
> vdso page. I'll try to dump PTRACE_SETREGS and see the values host
> kernel asked to set and work from there, but the interesting part is
> bloody hard to singlestep through - the victim is back to user mode and
> it is already traced by the guest kernel, so it's not as if we could
> attach host gdb to it and walk through that crap. And guest gdb is not
> going to be able to set breakpoints in there - vdso page is r/o...

[ CC'ing ***@mit.edu ]
Andy, do you have an idea?
You can find Al's original report here:
http://marc.info/?l=linux-kernel&m=131380315624244&w=2

Thanks,
//richard

Andrew Lutomirski

2011-08-20 21:26:00 UTC

Permalink

On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <***@nod.at> wr=
ote:
> Am 20.08.2011 22:14, schrieb Al Viro:
>>
>> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>>
>>> Hmmm, very strange.
>>> Sadly I cannot reproduce the issue. :(
>>> Everything works fine within UML.
>>> (Of course I've applied your vDSO/i386 patches)
>>>
>>> My test setup:
>>> Host kernel: 2.6.37 and 3.0.1
>>> Distro: openSUSE 11.4/x86_64
>>>
>>> UML kernel: 3.1-rc2
>>> Distro: openSUSE 11.1/i386
>>>
>>> Does the problem also occur with another host kernel or a different
>>> guest image?
>>
>> Could you check what you get in __kernel_vsyscall()? =A0On iAMD64 bo=
x
>> where that sucker contains sysenter-based variant the bug is not
>> present. =A0IOW, it's sensitive to syscall vs. systenter vs. int 0x8=
0
>> differences.
>
> OK, this explains why I cannot reproduce it.
> My Intel Core2 box is sysenter-based.
>
> (gdb) disass __kernel_vsyscall
> 0xffffe420 <__kernel_vsyscall+0>: =A0 =A0 =A0 push =A0 %ecx
> 0xffffe421 <__kernel_vsyscall+1>: =A0 =A0 =A0 push =A0 %edx
> 0xffffe422 <__kernel_vsyscall+2>: =A0 =A0 =A0 push =A0 %ebp
> 0xffffe423 <__kernel_vsyscall+3>: =A0 =A0 =A0 mov =A0 =A0%esp,%ebp
> 0xffffe425 <__kernel_vsyscall+5>: =A0 =A0 =A0 sysenter
> 0xffffe427 <__kernel_vsyscall+7>: =A0 =A0 =A0 nop
> 0xffffe428 <__kernel_vsyscall+8>: =A0 =A0 =A0 nop
> 0xffffe429 <__kernel_vsyscall+9>: =A0 =A0 =A0 nop
> 0xffffe42a <__kernel_vsyscall+10>: =A0 =A0 =A0nop
> 0xffffe42b <__kernel_vsyscall+11>: =A0 =A0 =A0nop
> 0xffffe42c <__kernel_vsyscall+12>: =A0 =A0 =A0nop
> 0xffffe42d <__kernel_vsyscall+13>: =A0 =A0 =A0nop
> 0xffffe42e <__kernel_vsyscall+14>: =A0 =A0 =A0jmp 0xffffe423<__kernel=
_vsyscall+3>
> 0xffffe430 <__kernel_vsyscall+16>: =A0 =A0 =A0pop =A0 =A0%ebp
> 0xffffe431 <__kernel_vsyscall+17>: =A0 =A0 =A0pop =A0 =A0%edx
> 0xffffe432 <__kernel_vsyscall+18>: =A0 =A0 =A0pop =A0 =A0%ecx
> 0xffffe433 <__kernel_vsyscall+19>: =A0 =A0 =A0ret
>
>> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped=
ext2
>> ;-/)
>> if you want to see if that gets reproduced on your box. =A0I'll drop=
it on
>> anonftp if you are interested. =A0FWIW, the same kernel binary/same =
image
>> result in
>> =A0 =A0 =A0 =A0* K7 box - no breakage, SYSENTER-based vdso
>> =A0 =A0 =A0 =A0* K8 box - breakage as described, SYSCALL-based vdso3=
2
>> =A0 =A0 =A0 =A0* P4 box - no breakage, SYSENTER-based vdso32
>> Hell knows... =A0In theory that would seem to point towards
>> ia32_cstar_target(),
>> so I'm going to RTFS carefully through that animal.
>
> Now I'm testing with a Debian fs from:
> http://fs.devloop.org.uk/filesystems/Debian-Squeeze/
>
>> The thing is, whatever happens happens when victim gets resumed insi=
de
>> vdso page. =A0I'll try to dump PTRACE_SETREGS and see the values hos=
t
>> kernel asked to set and work from there, but the interesting part is
>> bloody hard to singlestep through - the victim is back to user mode =
and
>> it is already traced by the guest kernel, so it's not as if we could
>> attach host gdb to it and walk through that crap. =A0And guest gdb i=
s not
>> going to be able to set breakpoints in there - vdso page is r/o...
>
> [ CC'ing ***@mit.edu ]
> Andy, do you have an idea?
> You can find Al's original report here:
> http://marc.info/?l=3Dlinux-kernel&m=3D131380315624244&w=3D2

I'm missing a bit of the background. Is the user-on-UML app calling
into a vdso entry provided by UML or into a vdso entry provided by the
host?

Why does anything care whether ecx is saved? Doesn't the default
calling convention allow the callee to clobber ecx?

But my guess is that the 64-bit host sysret code might be buggy (or
the value in gs:whatever is wrong). Can you get gdb to breakpoint at
the beginning of __kernel_vsyscall before the crash?

--Andy

Richard Weinberger

2011-08-20 21:38:41 UTC

Permalink

Am 20.08.2011 23:26, schrieb Andrew Lutomirski:
> On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger<***@nod.at> wrote:
>> Am 20.08.2011 22:14, schrieb Al Viro:
>>>
>>> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>>>
>>>> Hmmm, very strange.
>>>> Sadly I cannot reproduce the issue. :(
>>>> Everything works fine within UML.
>>>> (Of course I've applied your vDSO/i386 patches)
>>>>
>>>> My test setup:
>>>> Host kernel: 2.6.37 and 3.0.1
>>>> Distro: openSUSE 11.4/x86_64
>>>>
>>>> UML kernel: 3.1-rc2
>>>> Distro: openSUSE 11.1/i386
>>>>
>>>> Does the problem also occur with another host kernel or a different
>>>> guest image?
>>>
>>> Could you check what you get in __kernel_vsyscall()? On iAMD64 box
>>> where that sucker contains sysenter-based variant the bug is not
>>> present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80
>>> differences.
>>
>> OK, this explains why I cannot reproduce it.
>> My Intel Core2 box is sysenter-based.
>>
>> (gdb) disass __kernel_vsyscall
>> 0xffffe420<__kernel_vsyscall+0>: push %ecx
>> 0xffffe421<__kernel_vsyscall+1>: push %edx
>> 0xffffe422<__kernel_vsyscall+2>: push %ebp
>> 0xffffe423<__kernel_vsyscall+3>: mov %esp,%ebp
>> 0xffffe425<__kernel_vsyscall+5>: sysenter
>> 0xffffe427<__kernel_vsyscall+7>: nop
>> 0xffffe428<__kernel_vsyscall+8>: nop
>> 0xffffe429<__kernel_vsyscall+9>: nop
>> 0xffffe42a<__kernel_vsyscall+10>: nop
>> 0xffffe42b<__kernel_vsyscall+11>: nop
>> 0xffffe42c<__kernel_vsyscall+12>: nop
>> 0xffffe42d<__kernel_vsyscall+13>: nop
>> 0xffffe42e<__kernel_vsyscall+14>: jmp 0xffffe423<__kernel_vsyscall+3>
>> 0xffffe430<__kernel_vsyscall+16>: pop %ebp
>> 0xffffe431<__kernel_vsyscall+17>: pop %edx
>> 0xffffe432<__kernel_vsyscall+18>: pop %ecx
>> 0xffffe433<__kernel_vsyscall+19>: ret
>>
>>> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2
>>> ;-/)
>>> if you want to see if that gets reproduced on your box. I'll drop it on
>>> anonftp if you are interested. FWIW, the same kernel binary/same image
>>> result in
>>> * K7 box - no breakage, SYSENTER-based vdso
>>> * K8 box - breakage as described, SYSCALL-based vdso32
>>> * P4 box - no breakage, SYSENTER-based vdso32
>>> Hell knows... In theory that would seem to point towards
>>> ia32_cstar_target(),
>>> so I'm going to RTFS carefully through that animal.
>>
>> Now I'm testing with a Debian fs from:
>> http://fs.devloop.org.uk/filesystems/Debian-Squeeze/
>>
>>> The thing is, whatever happens happens when victim gets resumed inside
>>> vdso page. I'll try to dump PTRACE_SETREGS and see the values host
>>> kernel asked to set and work from there, but the interesting part is
>>> bloody hard to singlestep through - the victim is back to user mode and
>>> it is already traced by the guest kernel, so it's not as if we could
>>> attach host gdb to it and walk through that crap. And guest gdb is not
>>> going to be able to set breakpoints in there - vdso page is r/o...
>>
>> [ CC'ing ***@mit.edu ]
>> Andy, do you have an idea?
>> You can find Al's original report here:
>> http://marc.info/?l=linux-kernel&m=131380315624244&w=2
>
> I'm missing a bit of the background. Is the user-on-UML app calling
> into a vdso entry provided by UML or into a vdso entry provided by the
> host?

UML/i386 reuses the host's vDSO page.
IOW it does not have it's own vDSO like UML/x86_64.

Thanks,
//richard

Andrew Lutomirski

2011-08-20 21:40:03 UTC

Permalink

On Sat, Aug 20, 2011 at 5:26 PM, Andrew Lutomirski <***@mit.edu> wrote=
:
> On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <***@nod.at> =
wrote:

> I'm missing a bit of the background. =A0Is the user-on-UML app callin=
g
> into a vdso entry provided by UML or into a vdso entry provided by th=
e
> host?
>
> Why does anything care whether ecx is saved? =A0Doesn't the default
> calling convention allow the callee to clobber ecx?
>
> But my guess is that the 64-bit host sysret code might be buggy (or
> the value in gs:whatever is wrong). Can you get gdb to breakpoint at
> the beginning of __kernel_vsyscall before the crash?
>

This is suspicious:

ENTRY(ia32_cstar_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq PER_CPU_VAR(kernel_stack),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,0,0
movl %eax,%eax /* zero extension */
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
movq %rcx,RIP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rip,RIP-ARGOFFSET
movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrac=
e */

The entry code looks something like:

The text of __kernel_vsyscall() is
0xffffe420 <__kernel_vsyscall+0>: push %ebp
0xffffe421 <__kernel_vsyscall+1>: mov %ecx,%ebp
0xffffe423 <__kernel_vsyscall+3>: syscall
0xffffe425 <__kernel_vsyscall+5>: mov $0x2b,%ecx
0xffffe42a <__kernel_vsyscall+10>: mov %ecx,%ss
0xffffe42c <__kernel_vsyscall+12>: mov %ebp,%ecx
0xffffe42e <__kernel_vsyscall+14>: pop %ebp
0xffffe42f <__kernel_vsyscall+15>: ret

so the line:

movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

will cause iret (if iret happens) to restore the original rbp in rcx
(why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
if something else does the syscall). I don't see what saves rbp to
the stack frame.

This is also suspicious:

movq %r11,EFLAGS-ARGOFFSET(%rsp)

that's inconsistent with my reading of the AMD manual.

How well is the compat syscall entry tested through both the fast and
slow paths? UML is unusual in that it uses ptrace to trap all system
calls, right? That means that syscalls will enter through the cstar
target but return through the iret path.

--Andy

Al Viro

2011-08-21 06:34:43 UTC

Permalink

On Sat, Aug 20, 2011 at 05:40:03PM -0400, Andrew Lutomirski wrote:

> will cause iret (if iret happens) to restore the original rbp in rcx
> (why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
> if something else does the syscall). I don't see what saves rbp to
> the stack frame.

Far more interesting question is how the hell does that thing manage to
work in face of syscall restarts? As the matter of fact, how does it
(and sysenter-based variant) play with ptrace() *and* restarts?

Suppose we have a traced process. foo6() is called and the thing it
stopped before the sys_foo6() is reached kernel-side. The sixth argument
is on stack, ebp is set to user esp. SYSENTER happens, we read the
6th argument from userland stack and put it along with the rest into
pt_regs. tracer examines the arguments, modifies them (including the last
one) and lets the tracee run free - e.g. detaches from the tracee.

What should happen if we happen to get a signal that would restart that
sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
do anything of that kind. However, I wouldn't bet a dime on other 6-argument
syscalls not stepping on that. sendto() and recvfrom(), in particular...

OK, we return to userland. The sixth argument is placed into %ebp. Linus'
"pig and proud of that" trick works and we end up slapping userland
%esp into %ebp and hitting SYSENTER again. Only one problem, though -
the sixth argument on user stack is completely unaffected by what tracer
had done. Unlike the rest of arguments, that *are* changed.

We could deal with that in case of SYSENTER if we e.g. replaced that
jmp .Lenter_kernel
with
jmp .Lrestart
and added
.Lrestart:
movl %ebp, (%esp)
jmp .Lenter_kernel
but in case of SYSCALL it seems to be even messier... Comments?

... and there I thought that last year session of asm glue sniffing couldn't
be topped by anything more unpleasant ;-/

Al Viro

2011-08-21 08:42:30 UTC

Permalink

On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
> Suppose we have a traced process. foo6() is called and the thing it
> stopped before the sys_foo6() is reached kernel-side. The sixth argument
> is on stack, ebp is set to user esp. SYSENTER happens, we read the
> 6th argument from userland stack and put it along with the rest into
> pt_regs. tracer examines the arguments, modifies them (including the last
> one) and lets the tracee run free - e.g. detaches from the tracee.
>
> What should happen if we happen to get a signal that would restart that
> sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
> do anything of that kind. However, I wouldn't bet a dime on other 6-argument
> syscalls not stepping on that. sendto() and recvfrom(), in particular...
>
> OK, we return to userland. The sixth argument is placed into %ebp. Linus'
> "pig and proud of that" trick works and we end up slapping userland
> %esp into %ebp and hitting SYSENTER again. Only one problem, though -
> the sixth argument on user stack is completely unaffected by what tracer
> had done. Unlike the rest of arguments, that *are* changed.
>
> We could deal with that in case of SYSENTER if we e.g. replaced that
> jmp .Lenter_kernel
> with
> jmp .Lrestart
> and added
> .Lrestart:
> movl %ebp, (%esp)
> jmp .Lenter_kernel
> but in case of SYSCALL it seems to be even messier... Comments?

Oh, hell... Compat SYSCALL one is really buggered on syscall restarts,
ptrace or no ptrace. Look: calling conventions for SYSCALL are
arg1..5: ebx, ebp, edx, edi, esi. arg6: stack
and after syscall restart we end up with
arg1..5: ebx, ecx, edx, edi, esi. arg6: ebp
so restart will instantly clobber arg2, in effect replacing it with arg6.

And yes, adding ptrace to the mix makes things even uglier. For one thing,
changes to ECX via ptrace are completely lost on the fast exit. Not pretty,
and might make life painful for uml, but not for the majority of programs.
What's worse, combination of ptrace with restart will lose changes to arg6
(again, value on stack left as it was, changes to arg6 by tracer lost) *and*
it will lose changes to arg2 (along with arg2 itself - see above).

Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
the address of next insn and that's where we end up going. IOW, SYSCALL not
inside vdso32 currently works (for small values of "works", due to restart
issues). Playing with return elsewhere might break some userland code...

Guys, that's *way* out of the area I'm comfortable with.

Andrew Lutomirski

2011-08-21 11:24:35 UTC

Permalink

On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <***@zeniv.linux.org.uk> wrot=
e:
> On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
>> Suppose we have a traced process. =A0foo6() is called and the thing =
it
>> stopped before the sys_foo6() is reached kernel-side. =A0The sixth a=
rgument
>> is on stack, ebp is set to user esp. =A0SYSENTER happens, we read th=
e
>> 6th argument from userland stack and put it along with the rest into
>> pt_regs. =A0tracer examines the arguments, modifies them (including =
the last
>> one) and lets the tracee run free - e.g. detaches from the tracee.
>>
>> What should happen if we happen to get a signal that would restart t=
hat
>> sucker? =A0Granted, it's not going to happen with mmap() - it doesn'=
t, AFAICS,
>> do anything of that kind. =A0However, I wouldn't bet a dime on other=
6-argument
>> syscalls not stepping on that. =A0sendto() and recvfrom(), in partic=
ular...
>>
>> OK, we return to userland. =A0The sixth argument is placed into %ebp=
=2E =A0Linus'
>> "pig and proud of that" trick works and we end up slapping userland
>> %esp into %ebp and hitting SYSENTER again. =A0Only one problem, thou=
gh -
>> the sixth argument on user stack is completely unaffected by what tr=
acer
>> had done. =A0Unlike the rest of arguments, that *are* changed.
>>
>> We could deal with that in case of SYSENTER if we e.g. replaced that
>> =A0 =A0 =A0 =A0 jmp .Lenter_kernel
>> with
>> =A0 =A0 =A0 =A0 jmp .Lrestart
>> and added
>> .Lrestart:
>> =A0 =A0 =A0 movl %ebp, (%esp)
>> =A0 =A0 =A0 jmp .Lenter_kernel
>> but in case of SYSCALL it seems to be even messier... =A0Comments?
>
> Oh, hell... =A0Compat SYSCALL one is really buggered on syscall resta=
rts,
> ptrace or no ptrace. =A0Look: calling conventions for SYSCALL are
> =A0 =A0 =A0 =A0arg1..5: ebx, ebp, edx, edi, esi. =A0arg6: stack
> and after syscall restart we end up with
> =A0 =A0 =A0 =A0arg1..5: ebx, ecx, edx, edi, esi. =A0arg6: ebp
> so restart will instantly clobber arg2, in effect replacing it with a=
rg6.
>
> And yes, adding ptrace to the mix makes things even uglier. =A0For on=
e thing,
> changes to ECX via ptrace are completely lost on the fast exit. =A0No=
t pretty,
> and might make life painful for uml, but not for the majority of prog=
rams.
> What's worse, combination of ptrace with restart will lose changes to=
arg6
> (again, value on stack left as it was, changes to arg6 by tracer lost=
) *and*
> it will lose changes to arg2 (along with arg2 itself - see above).
>
> Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do*=
retain
> the address of next insn and that's where we end up going. =A0IOW, SY=
SCALL not
> inside vdso32 currently works (for small values of "works", due to re=
start
> issues). =A0Playing with return elsewhere might break some userland c=
ode...
>
> Guys, that's *way* out of the area I'm comfortable with.
>

I don't see the point of all this hackery at all. sysenter/sysexit
indeed screws up some registers, but we can return on the iret path in
the case of restart.

So why do we lie to ptrace (and iret!) at all? Why not just fill in
pt_regs with the registers as they were (at least the
non-clobbered-by-sysenter ones), set the actual C parameters correctly
to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
return back to userspace without any funny business? Is there some
ABI reason that, once we've started lying to tracers, we have to keep
doing so?

--Andy

Andrew Lutomirski

2011-08-21 13:37:18 UTC

Permalink

On Sun, Aug 21, 2011 at 7:24 AM, Andrew Lutomirski <***@mit.edu> wrote=
:
> On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <***@zeniv.linux.org.uk> wr=
ote:
>> On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
>>> Suppose we have a traced process. =A0foo6() is called and the thing=
it
>>> stopped before the sys_foo6() is reached kernel-side. =A0The sixth =
argument
>>> is on stack, ebp is set to user esp. =A0SYSENTER happens, we read t=
he
>>> 6th argument from userland stack and put it along with the rest int=
o
>>> pt_regs. =A0tracer examines the arguments, modifies them (including=
the last
>>> one) and lets the tracee run free - e.g. detaches from the tracee.
>>>
>>> What should happen if we happen to get a signal that would restart =
that
>>> sucker? =A0Granted, it's not going to happen with mmap() - it doesn=
't, AFAICS,
>>> do anything of that kind. =A0However, I wouldn't bet a dime on othe=
r 6-argument
>>> syscalls not stepping on that. =A0sendto() and recvfrom(), in parti=
cular...
>>>
>>> OK, we return to userland. =A0The sixth argument is placed into %eb=
p. =A0Linus'
>>> "pig and proud of that" trick works and we end up slapping userland
>>> %esp into %ebp and hitting SYSENTER again. =A0Only one problem, tho=
ugh -
>>> the sixth argument on user stack is completely unaffected by what t=
racer
>>> had done. =A0Unlike the rest of arguments, that *are* changed.
>>>
>>> We could deal with that in case of SYSENTER if we e.g. replaced tha=
t
>>> =A0 =A0 =A0 =A0 jmp .Lenter_kernel
>>> with
>>> =A0 =A0 =A0 =A0 jmp .Lrestart
>>> and added
>>> .Lrestart:
>>> =A0 =A0 =A0 movl %ebp, (%esp)
>>> =A0 =A0 =A0 jmp .Lenter_kernel
>>> but in case of SYSCALL it seems to be even messier... =A0Comments?
>>
>> Oh, hell... =A0Compat SYSCALL one is really buggered on syscall rest=
arts,
>> ptrace or no ptrace. =A0Look: calling conventions for SYSCALL are
>> =A0 =A0 =A0 =A0arg1..5: ebx, ebp, edx, edi, esi. =A0arg6: stack
>> and after syscall restart we end up with
>> =A0 =A0 =A0 =A0arg1..5: ebx, ecx, edx, edi, esi. =A0arg6: ebp
>> so restart will instantly clobber arg2, in effect replacing it with =
arg6.
>>
>> And yes, adding ptrace to the mix makes things even uglier. =A0For o=
ne thing,
>> changes to ECX via ptrace are completely lost on the fast exit. =A0N=
ot pretty,
>> and might make life painful for uml, but not for the majority of pro=
grams.
>> What's worse, combination of ptrace with restart will lose changes t=
o arg6
>> (again, value on stack left as it was, changes to arg6 by tracer los=
t) *and*
>> it will lose changes to arg2 (along with arg2 itself - see above).
>>
>> Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do=
* retain
>> the address of next insn and that's where we end up going. =A0IOW, S=
YSCALL not
>> inside vdso32 currently works (for small values of "works", due to r=
estart
>> issues). =A0Playing with return elsewhere might break some userland =
code...
>>
>> Guys, that's *way* out of the area I'm comfortable with.
>>
>
> I don't see the point of all this hackery at all. =A0sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path i=
n
> the case of restart.
>
> So why do we lie to ptrace (and iret!) at all? =A0Why not just fill i=
n
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctl=
y
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business? =A0Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

Gack. Is this a holdover from the 32-bit code that shares the
argument save area with the parameters passed on the C stack? If so,
we could just set up the argument save area honestly and pass the real
parameters in registers like 64-bit C code expects.

If the tracing and restart cases use iret to return to userspace, this
should all just work. ptrace users shouldn't notice the overhead, and
syscall restart is presumably slow enough anyway that it wouldn't
matter. The userspace entry code would be as simple as:

sysenter
ret

or

sysexit
ret

--Andy

Al Viro

2011-08-21 14:51:26 UTC

Permalink

On Sun, Aug 21, 2011 at 09:37:18AM -0400, Andrew Lutomirski wrote:

> Gack. Is this a holdover from the 32-bit code that shares the
> argument save area with the parameters passed on the C stack? If so,
> we could just set up the argument save area honestly and pass the real
> parameters in registers like 64-bit C code expects.
>
> If the tracing and restart cases use iret to return to userspace, this
> should all just work. ptrace users shouldn't notice the overhead, and
> syscall restart is presumably slow enough anyway that it wouldn't
> matter. The userspace entry code would be as simple as:
>
> sysenter
> ret
>
> or
>
> sysexit
> ret

You are making no sense at all...

Al Viro

2011-08-21 14:43:52 UTC

Permalink

On Sun, Aug 21, 2011 at 07:24:35AM -0400, Andrew Lutomirski wrote:

> I don't see the point of all this hackery at all. sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path in
> the case of restart.

We *do* return on iret path in case of restart, TYVM.

> So why do we lie to ptrace (and iret!) at all? Why not just fill in
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctly
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business? Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

We do not lie to ptrace and iret. At all. We do just what you have
described. And fuck up when restart returns us to the SYSCALL / SYSENTER
instruction again, which expects the different calling conventions,
so the values arranged in registers in the way int 0x80 would expect
do us no good.

Al Viro

2011-08-21 16:41:24 UTC

Permalink

On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote:

> We do not lie to ptrace and iret. At all. We do just what you have
> described. And fuck up when restart returns us to the SYSCALL / SYSENTER
> instruction again, which expects the different calling conventions,
> so the values arranged in registers in the way int 0x80 would expect
> do us no good.

FWIW, what really happens (for 32bit task on amd64) is this:
* both SYSCALL and SYSENTER variants of __kernel_vsyscall are
entered with the same calling conventions; eax contains syscall number,
ebx/ecx/edx/esi/edi/ebp contain arg1..6 resp. Same as what int 0x80
would expect.
* they arrange slightly different calling conventions for
actual SYSCALL/SYSENTER instructions. SYSENTER one: ecx and edx saved
on user stack to undo the effect of SYSEXIT clobbering them, arg6 (from
ebp) pushed to stack as well (for kernel side of SYSENTER to pick it
from there) and userland esp copied to ebp (SYSENTER clobbers esp).
SYSCALL one: arg6 (from ebp) pushed to stack (again, for kernel to pick
it from there), arg2 (from ecx) copied to ebp (SYSCALL clobbers ecx).
Then we hit the kernel.
* Both codepaths start with arranging the same thing on the kernel
stack frame; one 64bit int 0x80 would create. For the good and simple
reason: they all have to be able to leave via IRET. Stack layout is the
same, but we need to fill it accordingly to calling conventions we are
stuck with. I.e. ->cx should be initialized with arg2 and ->bp with
arg6, wherever those currently are on given codepath. _That_ is what
"lying to ptrace" is about - we store there registers according to how
they were when we entered __kernel_vsyscall(), not as they are at the
moment of actual SYSCALL insn. Which is precisely the right thing to do,
since if we *are* ptraced, the tracer expects to find the syscall argument
in the same places, whichever variant of syscall tracee happens to be using.
* In both variants it means picking arg6 from userland stack; if
that pagefaults, we act as if we returned -EFAULT in normal way. Again,
the value is stored in the expected place - ->bp, same as it would on int 0x80
path.
* If we are traced, we grow the things on stack to full pt_regs,
including the callee-saved registers. And call syscall_trace_enter(&regs).
If tracer decides to change registers, it can do so. After that call we
restore the registers from pt_regs on stack and rejoin the corresponding
common codepath.
* In both cases we reshuffle registers to match amd64 C calling
conventions; the only subtle part is that SYSCALL path has arg6 in r9d (and
ebp same as we had on entry, i.e. the original arg2, unaffected by whatever
ptrace might have done to regs->cx, BTW) while SYSENTER path has it in ebp,
same as int 0x80 one. After reshuffling arg6 ends up r9 in all cases and
in all cases ptrace changes to regs->bp (aka where ptrace expects to see
arg6) do affect what's in r9.
* The actual sys_whatever() is called in all cases. If there's
any work to do after it (signals, still being traced, need to be rescheduled,
etc.), we go for the good old IRET path (after having cleaned r8--r12 in
pt_regs - IRET path is shared with 64bit and we don't want random kernel values
leaking to userland).
* If there's no non-trivial work to do, int 0x80 *still* cleans
r8--r12 in pt_regs and goes for IRET path. End of story for it.
* In the same case, SYSENTER path will restore the contents of si and
di from pt_regs (bx is unaffected by sys_whatever(), ax holds return value
and cx/dx are going to be clobbered anyway; bp is not restored to the
conditions it had when hitting SYSENTER, but it's redundant - it was equal
to userland sp and *that* we do restore, of course). r8--r11 are cleared
in actual CPU registers and off we bugger, back to vdso32. Where we pop
ebp/ecx/edx and return to caller. Note that syscall restart couldn't have
happened on that path - it would qualify as "work to do after syscall"
(specifically, signal handling) as we'd be off to IRET path.
* In the same case, SYSCALL path will restore the contents of
si, di and dx from pt_regs (bx is unaffected by sys_whatever(), ax contains
the return value and bp is actually the same as it was on entry, after all
dances). r8-r11 are cleaned in registers, cx is clobbered by SYSRET and
we are off to __kernel_vsyscall(), again. This time back in there we
restore cx to what it used to be on entry to __kernel_vsyscall() [*NOTE*:
unaffected by ptrace manipulations; we probably don't care about that] and
restore bp (from stack). We also restore %ss along the way, but that's
a separate story. Again, no syscall restarts on that path.
* If there *was* a syscall restart to be done, we are guaranteed to
have left via IRET path. In all cases the syscall arguments end up in
registers, in the same way int 0x80 expected them. What happens afterwards
depends on how we entered, though.
+ int 0x80: all registers are restored (with ptrace
manipulations, if any, having left their effect) as they'd been the last
time around. In we go and that's it.
+ SYSENTER: return address had been set *not* to the
insn right next after SYSENTER when we'd been setting the stack frame
up. That's the dirty trick Linus had come up with - return ip is set
to insn in vfso32 (SYSENTER loses the original ip for good, unlike SYSCALL
that would store it in cx, so it has to be at fixed location anyway).
Normally we just pop ecx/edx/ebp from stack and we are done. However,
two bytes prior to that insn (i.e. where syscall restart would land us)
we have jump to just a bit before SYSENTER. Namely, to the point where
we had copied esp to ebp. That, combined with what IRET path has done,
will get us the layout SYSENTER expects once we get to SYSENTER again.
Except that ptrace modifications to arg6 will be lost - *ebp is where
SYSENTER picks it from and it's not updated. Modified value is in ebp
on return from kernel and it's overwritten (with esp) and lost. That's
ptrace vs. restarts bug I've mentioned in SYSENTER case.
+ SYSCALL: buggered. On restart we end up repeating
the call, with arg2 replaced with whatever had been in ebp when we
entered __kernel_vsyscall(). Simply because nobody cared to move it
from ecx (where IRET path has put it) to ebp (where SYSCALL expects
to find it). ebp gets what used to be in arg6 (again, IRET path doing).
Oh, and ptrace modifications, if any, are lost as well - both in arg2
and in arg6.

I *think* the above is an accurate description of what happens,
but I could certainly be wrong - that's just from RTFS of unfamiliar
and seriously convoluted code, so I'd very much appreciate ACK/NAK on
the analysis above from the people actually familiar with that thing...

Andrew Lutomirski

2011-08-22 00:44:12 UTC

Permalink

On Sun, Aug 21, 2011 at 12:41 PM, Al Viro <***@zeniv.linux.org.uk> wro=
te:
> On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote:
>
>> We do not lie to ptrace and iret. =A0At all. =A0We do just what you =
have
>> described. =A0And fuck up when restart returns us to the SYSCALL / S=
YSENTER
>> instruction again, which expects the different calling conventions,
>> so the values arranged in registers in the way int 0x80 would expect
>> do us no good.
>
> FWIW, what really happens (for 32bit task on amd64) is this:

I think I believe your analysis...

> =A0 =A0 =A0 =A0* Both codepaths start with arranging the same thing o=
n the kernel
> stack frame; one 64bit int 0x80 would create. =A0For the good and sim=
ple
> reason: they all have to be able to leave via IRET. =A0Stack layout i=
s the
> same, but we need to fill it accordingly to calling conventions we ar=
e
> stuck with. =A0I.e. ->cx should be initialized with arg2 and ->bp wit=
h
> arg6, wherever those currently are on given codepath. =A0_That_ is wh=
at
> "lying to ptrace" is about - we store there registers according to ho=
w
> they were when we entered __kernel_vsyscall(), not as they are at the
> moment of actual SYSCALL insn. =A0Which is precisely the right thing =
to do,
> since if we *are* ptraced, the tracer expects to find the syscall arg=
ument
> in the same places, whichever variant of syscall tracee happens to be=
using.

This is, IMO, gross -- if the values in pt_regs matched what they were
when sysenter / syscall was issued, then we'd be fine -- we could
restart the syscall and everything would work. Apparently ptrace
users have a problem with that, so we're stuck with the "lie" (i.e.
reporting values as of __kernel_vsyscall, not as of the actual kernel
entry).

> =A0 =A0 =A0 =A0* If there *was* a syscall restart to be done, we are =
guaranteed to
> have left via IRET path. =A0In all cases the syscall arguments end up=
in
> registers, in the same way int 0x80 expected them. =A0What happens af=
terwards
> depends on how we entered, though.
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0+ int 0x80: all registers are restored=
(with ptrace
> manipulations, if any, having left their effect) as they'd been the l=
ast
> time around. =A0In we go and that's it.

Which suggests an easy-ish fix: if sysenter is used or if syscall is
entered from the EIP is is supposed to be entered from, then just
change ip in the argument save to point to the int 0x80 instruction.
This might also require tweaking the userspace stack. That way,
restart would hit int 0x80 instead of syscall/sysenter and the
registers are exactly as expected.

Getting this right in the case where ptrace attaches during the
syscall might be tricky, though.

--Andy

Linus Torvalds

2011-08-22 01:09:00 UTC

Permalink

On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <***@mit.edu> wrote:
>
> Which suggests an easy-ish fix: if sysenter is used or if syscall is
> entered from the EIP is is supposed to be entered from, then just
> change ip in the argument save to point to the int 0x80 instruction.

Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
you'd be done.

In fact, just replace the

jmp .Lenter_kernel

with

int 0x80

and you'd be pretty much all done, no?

(Ok, that's probably a huge over-simplification, but perhaps "close
enough" to true that it would be workable)

Linus

Al Viro

2011-08-22 01:19:39 UTC

Permalink

On Sun, Aug 21, 2011 at 06:09:00PM -0700, Linus Torvalds wrote:
> On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <***@mit.edu> wrote:
> >
> > Which suggests an easy-ish fix: if sysenter is used or if syscall is
> > entered from the EIP is is supposed to be entered from, then just
> > change ip in the argument save to point to the int 0x80 instruction.
>
> Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
> you'd be done.
>
> In fact, just replace the
>
> jmp .Lenter_kernel
>
> with
>
> int 0x80
>
> and you'd be pretty much all done, no?

In case of sysenter - almost, in case of syscall - nope...

H. Peter Anvin

2011-08-22 01:19:41 UTC

Permalink

On 08/21/2011 06:09 PM, Linus Torvalds wrote:
>
> Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
> you'd be done.
>
> In fact, just replace the
>
> jmp .Lenter_kernel
>
> with
>
> int 0x80
>
> and you'd be pretty much all done, no?
>
> (Ok, that's probably a huge over-simplification, but perhaps "close
> enough" to true that it would be workable)
>

Hm... I think a jump to something which adjusts %esp and invokes int
$0x80 might just work, but only for SYSENTER.

SYSCALL is different, especially since SYSCALL is legal to execute from
anywhere in userspace (and no, as we have learned already doing EIP
checking is *NOT* acceptable.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

tip-bot for H. Peter Anvin

2011-08-22 21:25:14 UTC

Permalink

Commit-ID: 6fe83695299270f5b813c037dc90492d13786b70
Gitweb: http://git.kernel.org/tip/6fe83695299270f5b813c037dc90492d13786b70
Author: H. Peter Anvin <***@linux.intel.com>
AuthorDate: Mon, 22 Aug 2011 13:27:06 -0700
Committer: H. Peter Anvin <***@linux.intel.com>
CommitDate: Mon, 22 Aug 2011 13:27:06 -0700

x86-32, vdso: On system call restart after SYSENTER, use int $0x80

When we enter a 32-bit system call via SYSENTER or SYSCALL, we shuffle
the arguments to match the int $0x80 calling convention. This was
probably a design mistake, but it's what it is now. This causes
errors if the system call as to be restarted.

For SYSENTER, we have to invoke the instruction from the vdso as the
return address is hardcoded. Accordingly, we can simply replace the
jump in the vdso with an int $0x80 instruction and use the slower
entry point for a post-restart.

Suggested-by: Linus Torvalds <***@linux-foundation.org>
Signed-off-by: H. Peter Anvin <***@linux.intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFztZ=r5wa0x26KJQxvZOaQq8s2v3u50wCyJcA-***@mail.gmail.com
---
arch/x86/vdso/vdso32/sysenter.S | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/vdso/vdso32/sysenter.S b/arch/x86/vdso/vdso32/sysenter.S
index e2800af..e354bce 100644
--- a/arch/x86/vdso/vdso32/sysenter.S
+++ b/arch/x86/vdso/vdso32/sysenter.S
@@ -43,7 +43,7 @@ __kernel_vsyscall:
.space 7,0x90

/* 14: System call restart point is here! (SYSENTER_RETURN-2) */
- jmp .Lenter_kernel
+ int $0x80
/* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
pop %ebp

tip-bot for H. Peter Anvin

2011-08-23 23:40:23 UTC

Permalink

Commit-ID: 7ca0758cdb7c241cb4e0490a8d95f0eb5b861daf
Gitweb: http://git.kernel.org/tip/7ca0758cdb7c241cb4e0490a8d95f0eb5b861daf
Author: H. Peter Anvin <***@linux.intel.com>
AuthorDate: Mon, 22 Aug 2011 13:27:06 -0700
Committer: H. Peter Anvin <***@linux.intel.com>
CommitDate: Tue, 23 Aug 2011 16:20:10 -0700

x86-32, vdso: On system call restart after SYSENTER, use int $0x80

When we enter a 32-bit system call via SYSENTER or SYSCALL, we shuffle
the arguments to match the int $0x80 calling convention. This was
probably a design mistake, but it's what it is now. This causes
errors if the system call as to be restarted.

For SYSENTER, we have to invoke the instruction from the vdso as the
return address is hardcoded. Accordingly, we can simply replace the
jump in the vdso with an int $0x80 instruction and use the slower
entry point for a post-restart.

Suggested-by: Linus Torvalds <***@linux-foundation.org>
Signed-off-by: H. Peter Anvin <***@linux.intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFztZ=r5wa0x26KJQxvZOaQq8s2v3u50wCyJcA-***@mail.gmail.com
Cc: <***@kernel.org>
---
arch/x86/vdso/vdso32/sysenter.S | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/vdso/vdso32/sysenter.S b/arch/x86/vdso/vdso32/sysenter.S
index e2800af..e354bce 100644
--- a/arch/x86/vdso/vdso32/sysenter.S
+++ b/arch/x86/vdso/vdso32/sysenter.S
@@ -43,7 +43,7 @@ __kernel_vsyscall:
.space 7,0x90

/* 14: System call restart point is here! (SYSENTER_RETURN-2) */
- jmp .Lenter_kernel
+ int $0x80
/* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
pop %ebp

Al Viro

2011-08-22 01:16:45 UTC

Permalink

On Sun, Aug 21, 2011 at 08:44:12PM -0400, Andrew Lutomirski wrote:

> This is, IMO, gross -- if the values in pt_regs matched what they were
> when sysenter / syscall was issued, then we'd be fine -- we could
> restart the syscall and everything would work. Apparently ptrace
> users have a problem with that, so we're stuck with the "lie" (i.e.
> reporting values as of __kernel_vsyscall, not as of the actual kernel
> entry).

Um, _no_. If nothing else, pt_regs is seen by sys_.... And they don't
bloody know or care how the syscall had been entered.

> Which suggests an easy-ish fix: if sysenter is used or if syscall is
> entered from the EIP is is supposed to be entered from, then just
> change ip in the argument save to point to the int 0x80 instruction.
> This might also require tweaking the userspace stack. That way,
> restart would hit int 0x80 instead of syscall/sysenter and the
> registers are exactly as expected.

Huh? Actions after SYSENTER differ from those after int 0x80. If nothing
else, you would need to tweak saved userland stack pointer as well. It is
possible, but I seriously doubt that it's a better way to deal with that
mess. And in any case, SYSEXIT buggers CX/DX, so we'd need two separate
post-syscall sequences in vdso. Yucky... I really don't like it.

The really ugly part for the SYSCALL variant is that right now we *can*
do things like this:
read_it:
pushl %ebp
movl $__NR_read, %eax
movl $0, %ebx
movl $array, %ebp
movl $100, %edx
syscall
movl $__USER32_DS, %ecx
movl %ecx, %ss
popl %ebp
ret
anywhere in your userland and have it act as an equivalent of
int read_it(void)
{
return read(0, array, 100);
}

Is that ability a part of userland ABI or are we declaring that hopelessly
wrong and require to go through the function in vdso32? Linus?

As it is, I don't see any cheap ways to deal with restarts if that thing
has to be preserved. For sysenter it's flatly prohibited and that allows
us to play such games with adjusted return address. Here, OTOH...

Linus Torvalds

2011-08-22 01:41:16 UTC

Permalink

On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <***@zeniv.linux.org.uk> wrot=
e:
>
> Is that ability a part of userland ABI or are we declaring that hopel=
essly
> wrong and require to go through the function in vdso32? =A0Linus?

If people are using syscall directly, we're pretty much stuck. No
amount of "that's hopelessly wrong" will ever matter. We don't break
existing binaries.

That said, I'd *hope* that everybody uses the vdso32, simply because
user programs are not supposed to know which CPU they are running on
and if that CPU even *supports* the syscall instruction. In which case
it may be possible that we can play games with the vdso thing. But
that really would be conditional on "nobody ever reports a failure".

But if that's possible, maybe we can increment the RIP by 2 for
'syscall', and slip an "'int 0x80" after the syscall instruction in
the vdso there? Resulting in the same pseudo-solution I suggested for
sysenter...

Linus

H. Peter Anvin

2011-08-22 01:48:31 UTC

Permalink

On 08/21/2011 06:41 PM, Linus Torvalds wrote:
> If people are using syscall directly, we're pretty much stuck. No
> amount of "that's hopelessly wrong" will ever matter. We don't break
> existing binaries.
>
> That said, I'd *hope* that everybody uses the vdso32, simply because
> user programs are not supposed to know which CPU they are running on
> and if that CPU even *supports* the syscall instruction. In which case
> it may be possible that we can play games with the vdso thing. But
> that really would be conditional on "nobody ever reports a failure".

I think we found that out with the vsyscall emulation issue last cycle.
It works, so it will have been used, somewhere...

> But if that's possible, maybe we can increment the RIP by 2 for
> 'syscall', and slip an "'int 0x80" after the syscall instruction in
> the vdso there? Resulting in the same pseudo-solution I suggested for
> sysenter...

I think we have the above problem.

The problem here is that the syscall state is actually more complex than
we retain: the entire state is given by (entry point, register state);
with that amount of state we have all the information needed to *either*
extract the syscall arguments *or* the register contents. Without
those, we can only represent one of the two possible metalevels (right
now we represent the higher-level metalevel, the argument vector), but
we need both for different usages.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Andrew Lutomirski

2011-08-22 02:01:40 UTC

Permalink

On Sun, Aug 21, 2011 at 9:48 PM, H. Peter Anvin <***@zytor.com> wrote:
> On 08/21/2011 06:41 PM, Linus Torvalds wrote:
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>>
>> That said, I'd *hope* that everybody uses the vdso32, simply because
>> user programs are not supposed to know which CPU they are running on
>> and if that CPU even *supports* the syscall instruction. In which ca=
se
>> it may be possible that we can play games with the vdso thing. But
>> that really would be conditional on "nobody ever reports a failure".
>
> I think we found that out with the vsyscall emulation issue last cycl=
e.
> =A0It works, so it will have been used, somewhere...
>
>> But if that's possible, maybe we can increment the RIP by 2 for
>> 'syscall', and slip an "'int 0x80" after the syscall instruction in
>> the vdso there? Resulting in the same pseudo-solution I suggested fo=
r
>> sysenter...
>
> I think we have the above problem.
>
> The problem here is that the syscall state is actually more complex t=
han
> we retain: the entire state is given by (entry point, register state)=
;
> with that amount of state we have all the information needed to *eith=
er*
> extract the syscall arguments *or* the register contents. =A0Without
> those, we can only represent one of the two possible metalevels (righ=
t
> now we represent the higher-level metalevel, the argument vector), bu=
t
> we need both for different usages.

My understanding of the problem is the following:

1. The SYSCALL 32-bit calling convention puts arg2 in ebp and arg6 on
the stack.

2. The int 0x80 convention is different: arg2 is in ecx.

3. We're worried that pt_regs-using compat syscalls might want the
regs to appear to match the actual arguments (why?)

4. ptrace expects the "registers" when SYSCALL happens to match the
int 0x80 convention. (This is, IMO, sick.)

5. Syscall restart with the SYSCALL instruction must switch to
userspace and back to the kernel for reasons I don't understand that
presumably involve signal delivery.

6. Existing ABI requires that the kernel not clobber syscall
arguments (except, of course, when ptrace or syscall restart
explicitly change those arguments).

So we're sort of screwed. arg2 must be in ecx to keep ptrace happy
but SYSCALL clobbers ecx, so arg2 cannot be preserved.

So here are three strawman ideas:

a) Change #4. Maybe it's too late to do this, though.

b) When SYSCALL happens, change RIP to point two bytes past an int
0x80 instruction in the vdso. Make the next instruction there be a
"ret" that returns to the instruction after the original syscall.
Patch the stack in the kernel.

c) Disable syscall restart when SYSCALL happens from somewhere outside =
the vdso.

--Andy

Al Viro

2011-08-22 02:07:38 UTC

Permalink

On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote:

> 3. We're worried that pt_regs-using compat syscalls might want the
> regs to appear to match the actual arguments (why?)

run strace and you'll see why.

> 4. ptrace expects the "registers" when SYSCALL happens to match the
> int 0x80 convention. (This is, IMO, sick.)

That's what ptrace is *for*. It's there to let debuggers play with
the program being debugged, including taking a look at the syscall
arguments and modifying them. In a predictable way, please.

Andrew Lutomirski

2011-08-22 02:26:00 UTC

Permalink

On Sun, Aug 21, 2011 at 10:07 PM, Al Viro <***@zeniv.linux.org.uk> wro=
te:
> On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote:
>
>> =A03. We're worried that pt_regs-using compat syscalls might want th=
e
>> regs to appear to match the actual arguments (why?)
>
> run strace and you'll see why.
>

I'm talking about the implementations of stub32_rt_sigreturn,
sys32_rt_sigreturn, stub32_sigreturn, stub32_sigaltstack,
stub32_execve, stub32_fork, stub32_clone, stub32_vfork, and stub32_iopl=
=2E
I don't know what this has to do with strace or user ABI at all.

>> =A04. ptrace expects the "registers" when SYSCALL happens to match t=
he
>> int 0x80 convention. =A0(This is, IMO, sick.)
>
> That's what ptrace is *for*. =A0It's there to let debuggers play with
> the program being debugged, including taking a look at the syscall
> arguments and modifying them. =A0In a predictable way, please.
>

It may be necessary, but I still think it's sick. Especially in the
case of inlined SYSCALL, where the registers reported to ptrace do not
match any register values that ever actually existed in CPU registers.
Too late to fix it, though.

Which still leaves the question of how to fix it. Restarting via an
int 0x80-based helper might be the only option that leaves everything
fully functional.

--Andy

H. Peter Anvin

2011-08-22 02:34:14 UTC

Permalink

On 08/21/2011 07:26 PM, Andrew Lutomirski wrote:
>
> Which still leaves the question of how to fix it. Restarting via an
> int 0x80-based helper might be the only option that leaves everything
> fully functional.
>

The issue is that we don't represent the entire state ... we represent
only one metalevel of it, currently the "cooked" one. The problem is
that we need the "raw" one as well, and in order to have *both* we need
to know the entry mechanism.

We need that IN EITHER CASE.

This is reasonably straightforward... we can carry the entry mechanism
forward inside the kernel, and fix it up in the IRET path.

The *really* big issue is what we drop as the sigcontext since this is
an ABI carried out to userspace. We could just say "it's currently
totally broken for SYSCALL" and just change it to drop the raw state,
but which has the potential for breaking unknown programs, *or* we could
add a bit of state (presumably by reclaiming one of the padding fields
around cs and ss) ... which *also* has the potential for breaking programs.

Right now, SYSCALL -> signal -> restart *is broken*, however, so there
is also the option of just doing nothing in this case, I guess.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

H. Peter Anvin

2011-08-22 04:05:43 UTC

Permalink

Borislav,

We're tracking down an issue with the way system call arguments are
handled on 32 bits. We have a solution for SYSENTER but not SYSCALL;
fixing SYSCALL "properly" appears to be very difficult at best.

So the question is: how much overhead would it be to simply fall back to
int $0x80 or some other legacy-style domain crossing instruction for
32-bit system calls on AMD64 processors? We don't ever use SYSCALL in
legacy mode, so native i386 kernels are unaffected.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Ingo Molnar

2011-08-22 09:53:37 UTC

Permalink

* H. Peter Anvin <***@zytor.com> wrote:

> Borislav,
>
> We're tracking down an issue with the way system call arguments are
> handled on 32 bits. We have a solution for SYSENTER but not
> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
> best.
>
> So the question is: how much overhead would it be to simply fall
> back to int $0x80 or some other legacy-style domain crossing
> instruction for 32-bit system calls on AMD64 processors? We don't
> ever use SYSCALL in legacy mode, so native i386 kernels are
> unaffected.

Last i measured INT80 and SYSCALL costs they were pretty close to
each other on AMD CPUs - closer than on Intel.

Also, most installations are either pure 32-bit or dominantly 64-bit,
the significantly mixed-mode case is dwindling.

Unifying some more in this area would definitely simplify things ...

Thanks,

Ingo

Andrew Lutomirski

2011-08-22 13:34:27 UTC

Permalink

On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <***@kernel.org> wrote:
>
> * H. Peter Anvin <***@zytor.com> wrote:
>
>> Borislav,
>>
>> We're tracking down an issue with the way system call arguments are
>> handled on 32 bits. =A0We have a solution for SYSENTER but not
>> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
>> best.
>>
>> So the question is: how much overhead would it be to simply fall
>> back to int $0x80 or some other legacy-style domain crossing
>> instruction for 32-bit system calls on AMD64 processors? =A0We don't
>> ever use SYSCALL in legacy mode, so native i386 kernels are
>> unaffected.
>
> Last i measured INT80 and SYSCALL costs they were pretty close to
> each other on AMD CPUs - closer than on Intel.

=46rom memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster
than int 0xcc, which is presumably about the same speed as int 0x80.
That's because SYSCALL is blazingly fast (<30 ns for a whole system
call) and int is slower.

--Andy

>
> Also, most installations are either pure 32-bit or dominantly 64-bit,
> the significantly mixed-mode case is dwindling.
>
> Unifying some more in this area would definitely simplify things ...
>
> Thanks,
>
> =A0 =A0 =A0 =A0Ingo
>

Borislav Petkov

2011-08-22 14:40:51 UTC

Permalink

On Mon, Aug 22, 2011 at 09:34:27AM -0400, Andrew Lutomirski wrote:
> On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <***@kernel.org> wrote=
:
> >
> > * H. Peter Anvin <***@zytor.com> wrote:
> >
> >> Borislav,
> >>
> >> We're tracking down an issue with the way system call arguments ar=
e
> >> handled on 32 bits. =A0We have a solution for SYSENTER but not
> >> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
> >> best.
> >>
> >> So the question is: how much overhead would it be to simply fall
> >> back to int $0x80 or some other legacy-style domain crossing
> >> instruction for 32-bit system calls on AMD64 processors? =A0We don=
't
> >> ever use SYSCALL in legacy mode, so native i386 kernels are
> >> unaffected.
> >
> > Last i measured INT80 and SYSCALL costs they were pretty close to
> > each other on AMD CPUs - closer than on Intel.
>=20
> From memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster
> than int 0xcc, which is presumably about the same speed as int 0x80.
> That's because SYSCALL is blazingly fast (<30 ns for a whole system
> call) and int is slower.

Just to make sure I'm grokking this correctly - we want to use int $0x8=
0
only for the SYSCALL variant in __kernel_vsyscall, right? Not for all
32-bit syscalls on a 64-bit kernel.

Thanks.

--=20
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Al Viro

2011-08-22 15:13:05 UTC

Permalink

On Mon, Aug 22, 2011 at 04:40:51PM +0200, Borislav Petkov wrote:

> Just to make sure I'm grokking this correctly - we want to use int $0x80
> only for the SYSCALL variant in __kernel_vsyscall, right? Not for all
> 32-bit syscalls on a 64-bit kernel.

Um... The problem is, syscall restart with SYSCALL insn is badly broken;
we end up with arg2 (expected by SYSCALL in %ebp) overwritten with arg6
by IRET path. With obvious nasty results.

In __kernel_vsyscall() the problem is possible to deal with; there we control
the code around that sucker. It's SYSCALL in 32bit binary outside of
vdso32 that causes real PITA...

Linus Torvalds

2011-08-22 20:05:34 UTC

Permalink

On Mon, Aug 22, 2011 at 8:13 AM, Al Viro <***@zeniv.linux.org.uk> wrot=
e:
>
> In __kernel_vsyscall() the problem is possible to deal with; there we=
control
> the code around that sucker. =A0It's SYSCALL in 32bit binary outside =
of
> vdso32 that causes real PITA...

I just checked. 'syscall' (at least on x86-64) is definitely called
outside of __kernel_vsyscall in all the normal cases. It's part of the
fundamental ABI, after all. We don't use "int 0x80" there.

But on x86-32, I think we might be better off. There, we only have
'sysenter', and can perhaps use my suggested "just use int 0x80
instead of the jump back to the sysenter instruction" trick. Plus
people *will* be using __kernel_vsyscall, since on x86-32 you aren't
guaranteed to have a CPU that supports sysenter to begin with.

Or am I missing something else?

Linus

H. Peter Anvin

2011-08-22 20:11:36 UTC

Permalink

On 08/22/2011 01:05 PM, Linus Torvalds wrote:
> On Mon, Aug 22, 2011 at 8:13 AM, Al Viro <***@zeniv.linux.org.uk> wrote:
>>
>> In __kernel_vsyscall() the problem is possible to deal with; there we control
>> the code around that sucker. It's SYSCALL in 32bit binary outside of
>> vdso32 that causes real PITA...
>
> I just checked. 'syscall' (at least on x86-64) is definitely called
> outside of __kernel_vsyscall in all the normal cases. It's part of the
> fundamental ABI, after all. We don't use "int 0x80" there.
>
> But on x86-32, I think we might be better off. There, we only have
> 'sysenter', and can perhaps use my suggested "just use int 0x80
> instead of the jump back to the sysenter instruction" trick. Plus
> people *will* be using __kernel_vsyscall, since on x86-32 you aren't
> guaranteed to have a CPU that supports sysenter to begin with.
>
> Or am I missing something else?
>

SYSCALL in 64-bit mode is not a problem.

SYSCALL in compatibility mode (32-on-64) *is* a problem, because ECX is
clobbered. Unfortunately AMD processors only support SYSENTER in legacy
mode (32-on-32) -- unlike Intel and VIA.

Your trick solves SYSENTER, which takes care of legacy mode and Intel
and VIA processors in compatibility mode.

Borislav is checking into if we can just use INT 80h on AMD processors
in compatibility mode. So far the indication seems to be that it is
probably okay.

-hpa

Andrew Lutomirski

2011-08-22 21:52:07 UTC

Permalink

On Mon, Aug 22, 2011 at 4:11 PM, H. Peter Anvin <***@zytor.com> wrote:
> On 08/22/2011 01:05 PM, Linus Torvalds wrote:
>> On Mon, Aug 22, 2011 at 8:13 AM, Al Viro <***@zeniv.linux.org.uk> w=
rote:
>>>
>>> In __kernel_vsyscall() the problem is possible to deal with; there =
we control
>>> the code around that sucker. =A0It's SYSCALL in 32bit binary outsid=
e of
>>> vdso32 that causes real PITA...
>>
>> I just checked. 'syscall' (at least on x86-64) is definitely called
>> outside of __kernel_vsyscall in all the normal cases. It's part of t=
he
>> fundamental ABI, after all. We don't use "int 0x80" there.
>>
>> But on x86-32, I think we might be better off. There, we only have
>> 'sysenter', and can perhaps use my suggested "just use int 0x80
>> instead of the jump back to the sysenter instruction" trick. Plus
>> people *will* be using __kernel_vsyscall, since on x86-32 you aren't
>> guaranteed to have a CPU that supports sysenter to begin with.
>>
>> Or am I missing something else?
>>
>
> SYSCALL in 64-bit mode is not a problem.
>
> SYSCALL in compatibility mode (32-on-64) *is* a problem, because ECX =
is
> clobbered. =A0Unfortunately AMD processors only support SYSENTER in l=
egacy
> mode (32-on-32) -- unlike Intel and VIA.
>
> Your trick solves SYSENTER, which takes care of legacy mode and Intel
> and VIA processors in compatibility mode.
>
> Borislav is checking into if we can just use INT 80h on AMD processor=
s
> in compatibility mode. =A0So far the indication seems to be that it i=
s
> probably okay.

Even if it's ok, we still have to do *something* in the cstar entry
point -- I don't think there's any way to turn SYSCALL in
compatibility mode but leave it enabled in long mode. So if we're
planning on killing off SYSCALL from outside the vdso, we could
probably get away with leaving it enabled in the vdso.

Unless, of course, there are 32-bit JITs that do things that they
shouldn't. We could still make it work by rewriting the arguments
(including arg6 on the stack) from the syscall restart path, but that
may be more trouble than it's worth.

--Andy

H. Peter Anvin

2011-08-22 22:04:48 UTC

Permalink

On 08/22/2011 02:52 PM, Andrew Lutomirski wrote:
>
> Even if it's ok, we still have to do *something* in the cstar entry
> point -- I don't think there's any way to turn SYSCALL in
> compatibility mode but leave it enabled in long mode. So if we're
> planning on killing off SYSCALL from outside the vdso, we could
> probably get away with leaving it enabled in the vdso.
>
> Unless, of course, there are 32-bit JITs that do things that they
> shouldn't. We could still make it work by rewriting the arguments
> (including arg6 on the stack) from the syscall restart path, but that
> may be more trouble than it's worth.
>

Hm, looks like you might be right; I thought leaving CSTAR at zero would
take care of that, but it looks like that might not be true.

However, we could just issue a SIGILL or SIGSEGV at this point; the same
way we would if we got an #UD or #GP fault; SIGILL/#UD would be
consistent with Intel CPUs here.

Yes, we could do an EIP check and issue SIGILL only if it isn't in the
vdso, but after the crap from earlier this month I would be happier if
we could avoid this potential problem entirely. Borislav has a query in
with his hardware folks, let's give them the chance to answer.

-hpa

Linus Torvalds

2011-08-22 23:27:51 UTC

Permalink

On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <***@zytor.com> wrote:
>
> However, we could just issue a SIGILL or SIGSEGV at this point; the same
> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
> consistent with Intel CPUs here.

Considering that this is not a remotely new issue, and that it has
been around for years without anybody even noticing, I'd really prefer
to just fix things going forwards rather than add any code to actively
break any possible unlucky legacy users.

So I think the "let's fix the vdso case for sysenter" + "let's remove
the 32-bit syscall vdso" is the right solution. If somebody has
hardcoded syscall instructions, or generates them dynamically with
some JIT, that's their problem. We'll continue to support it as well
as we ever have (read: "almost nobody will ever notice").

One thing we *could* do is to just say "we never restart a x86-32
'syscall' instruction at all", and just make such a case return EINTR.
IOW, do something along the lines of the appended pseudo-patch.

Because returning -EINTR is always "almost correct".

Hmm?

Linus

---
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb221c1..bc1a0f8b2707 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -678,6 +678,16 @@ setup_rt_frame(int sig, struct k_sigaction *ka,
siginfo_t *info,
return ret;
}

+static void restart_syscall(struct pt_regs *regs, int orig)
+{
+ if (regs->syscall_using_syscall_insn) {
+ regs->ax = -EINTR;
+ return;
+ }
+ regs->ax = orig;
+ regs->ip -= 2;
+}
+
static int
handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
struct pt_regs *regs)
@@ -701,8 +711,7 @@ handle_signal(unsigned long sig, siginfo_t
*info, struct k_sigaction *ka,
}
/* fallthrough */
case -ERESTARTNOINTR:
- regs->ax = regs->orig_ax;
- regs->ip -= 2;
+ restart_syscall(regs, regs->orig_ax);
break;
}
}
@@ -786,13 +795,11 @@ static void do_signal(struct pt_regs *regs)
case -ERESTARTNOHAND:
case -ERESTARTSYS:
case -ERESTARTNOINTR:
- regs->ax = regs->orig_ax;
- regs->ip -= 2;
+ restart_syscall(regs, regs->orig_ax);
break;

case -ERESTART_RESTARTBLOCK:
- regs->ax = NR_restart_syscall;
- regs->ip -= 2;
+ restart_syscall(regs, NR_restart_syscall);
break;
}
}

Al Viro

2011-08-23 00:03:14 UTC

Permalink

On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:

> So I think the "let's fix the vdso case for sysenter" + "let's remove
> the 32-bit syscall vdso" is the right solution. If somebody has
> hardcoded syscall instructions, or generates them dynamically with
> some JIT, that's their problem. We'll continue to support it as well
> as we ever have (read: "almost nobody will ever notice").

Umm... Maybe, but I really wonder if it would be better to do this:
* check if %ecx is the right one for vdso32 entry. If it isn't,
act as we act now (and possibly warn). If it is, increment it by 4.
* slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
i.e. nop/nop/int 0x80. Followed by what we currently do there.

Those who do syscall insn in 32bit mode outside of vdso will get
what they currently get. __kernel_vsyscall() will keep working and do
that without restart problems. And the price is comparison + branch not
taken + addition for that path...

Al Viro

2011-08-23 00:07:05 UTC

Permalink

On Tue, Aug 23, 2011 at 01:03:14AM +0100, Al Viro wrote:
> On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:
>
> > So I think the "let's fix the vdso case for sysenter" + "let's remove
> > the 32-bit syscall vdso" is the right solution. If somebody has
> > hardcoded syscall instructions, or generates them dynamically with
> > some JIT, that's their problem. We'll continue to support it as well
> > as we ever have (read: "almost nobody will ever notice").
>
> Umm... Maybe, but I really wonder if it would be better to do this:
> * check if %ecx is the right one for vdso32 entry. If it isn't,
> act as we act now (and possibly warn). If it is, increment it by 4.
> * slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
> i.e. nop/nop/int 0x80. Followed by what we currently do there.
>
> Those who do syscall insn in 32bit mode outside of vdso will get
> what they currently get. __kernel_vsyscall() will keep working and do
> that without restart problems. And the price is comparison + branch not
> taken + addition for that path...

s/whine/set your "don't restart that one" flag/, even...

H. Peter Anvin

2011-08-23 00:07:06 UTC

Permalink

On 08/22/2011 05:03 PM, Al Viro wrote:
> On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:
>
>> So I think the "let's fix the vdso case for sysenter" + "let's remove
>> the 32-bit syscall vdso" is the right solution. If somebody has
>> hardcoded syscall instructions, or generates them dynamically with
>> some JIT, that's their problem. We'll continue to support it as well
>> as we ever have (read: "almost nobody will ever notice").
>
> Umm... Maybe, but I really wonder if it would be better to do this:
> * check if %ecx is the right one for vdso32 entry. If it isn't,
> act as we act now (and possibly warn). If it is, increment it by 4.
> * slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
> i.e. nop/nop/int 0x80. Followed by what we currently do there.
>
> Those who do syscall insn in 32bit mode outside of vdso will get
> what they currently get. __kernel_vsyscall() will keep working and do
> that without restart problems. And the price is comparison + branch not
> taken + addition for that path...

Checking for SYSCALL in the vdso is cheap... the only problem is if
someone thinks that they can copy it out of the vdso and run it, having
special-cased SYSENTER already, but not SYSCALL. The recent experience
shows that that is apparently a very real possibility.

Without a SYSCALL in the vdso, anything that examines the vdso will end
up using int $0x80 and we don't have an issue.

Again, waiting for Borislav & co to tell us how much the SYSCALL
instruction actually buys us.

-hpa

Linus Torvalds

2011-08-23 00:22:07 UTC

Permalink

On Mon, Aug 22, 2011 at 5:07 PM, H. Peter Anvin <***@zytor.com> wrote:
>
> Checking for SYSCALL in the vdso is cheap... the only problem is if
> someone thinks that they can copy it out of the vdso and run it, havi=
ng
> special-cased SYSENTER already, but not SYSCALL. =A0The recent experi=
ence
> shows that that is apparently a very real possibility.

I don't understand why people keep harping on this "copy the vdso" thin=
g.

Guys, IF THEY COPY THE VDSO, THEN THEY WOULD GET OUR IMPROVEMENTS!

So stop being silly.

As to comparing the IP address, that's just moronic. Why the hell
would anybody ever do something that stupid? It's expensive and not
worth it. You'd be much better off comparing the contents of the
memory at 'rip', and see if that memory contains the 'int 0x80'
instruction we would add. That automatically also gets the 'somebody
copied the vdso text' case right.

You guys seem to positively _want_ to make this a bigger issue than it
is. As far as I can tell, nobody has ever even noticed this problem
before, and we already have a trivial fix ("don't do it then") for the
case Al actually did notice.

The *only* case left as far as I can tell is the case where people do
*not* copy the vdso, but simply generate (statically and/or
dynamically with a JIT) the syscall instruction directly. And we
cannot fix that, so stop even bothering to try. The absolute best we
can do for that case is to just continue to do what we do now. WE
SIMPLY HAVE NO CHOICE.

I really don't see this as being worth this much discussion.

Linus

Al Viro

2011-08-23 01:01:46 UTC

Permalink

On Mon, Aug 22, 2011 at 05:22:07PM -0700, Linus Torvalds wrote:

> You guys seem to positively _want_ to make this a bigger issue than it
> is. As far as I can tell, nobody has ever even noticed this problem
> before, and we already have a trivial fix ("don't do it then") for the
> case Al actually did notice.

I'm not sure. What I've noticed was complete weirdness in uml, apparently
dependent on SYSCALL vs. SYSENTER or int 0x80. With following round of
RTFS and finding what looks like an ugly issue with restarts.

What I don't understand at all is how the hell do we manage to avoid much
more obvious breakage all the time we ptrace a 32bit task. Look:
__kernel_vsyscall() is
push %ebp
movl %ecx, %ebp
syscall
movl $__USER32_DS, %ecx
movl %ecx, %ss
movl %ebp, %ecx
popl %ebp

now, what is going to happen to %ebp if we go through IRET path, for any
reason? From my reading it appears that right after that IRET we'll have
ebp containing arg6. I.e. what we'd pushed on stack. Now, popl %ebp
will bring the same value back. Not a problem. But what about
movl %ebp, %ecx? Again, I'm talking about the case when we have no
restart at all - just an strace(1) tracing a process.

AFAICS, in that case we ought to have %ecx == %ebp after return from
__kernel_vsyscall(). Which would blow the things up _very_ fast.

So what the hell am I missing here?

Al Viro

2011-08-23 01:13:12 UTC

Permalink

On Tue, Aug 23, 2011 at 02:01:46AM +0100, Al Viro wrote:

> now, what is going to happen to %ebp if we go through IRET path, for any
> reason? From my reading it appears that right after that IRET we'll have
> ebp containing arg6. I.e. what we'd pushed on stack. Now, popl %ebp
> will bring the same value back. Not a problem. But what about
> movl %ebp, %ecx? Again, I'm talking about the case when we have no
> restart at all - just an strace(1) tracing a process.
>
> AFAICS, in that case we ought to have %ecx == %ebp after return from
> __kernel_vsyscall(). Which would blow the things up _very_ fast.
>
> So what the hell am I missing here?

*UGH*. OK,
1) I'm an idiot; int_ret_from_sys_call does *not* usually step on
rbp (it's callee-saved). So normally ebp is left as is on the way out,
which is why we don't see stuff getting buggered left, right and center.
2) Sometimes it apparently does somehow happen. I don't see where
it happens yet, but uml breakage that started all of that looks *exactly*
like that. %ebp getting arg6 in it when we return into __kernel_vsyscall()
from the kernel fits the observed pattern precisely.
3) modulo that the situation is nowhere near as bad as I thought.
Brown paperbag time for me - for missing that if my analysis had been correct
we'd've seen breakage _much_ earlier. Mea culpa.
4) we still have a problem, apparently, but it's more narrow now -
the question is when would %rbp be shat into?

Al, off to apply a serious self-LART...

Linus Torvalds

2011-08-23 01:59:48 UTC

Permalink

On Mon, Aug 22, 2011 at 6:13 PM, Al Viro <***@zeniv.linux.org.uk> wrot=
e:
>
> *UGH*. =A0OK,
> =A0 =A0 =A0 =A01) I'm an idiot; int_ret_from_sys_call does *not* usua=
lly step on
> rbp (it's callee-saved). =A0So normally ebp is left as is on the way =
out,
> which is why we don't see stuff getting buggered left, right and cent=
er.

Check.

And the system call restart should actually work fine too, because at
syscall entry we save %ebp *both* in the slot for ebp and ecx when we
enter the first time. So the second time, we'll re-load the third
argument from ebp again, but that's fine - it's still going to be the
right value. Yes? No?

However, I note that the cstar entrypont has a comment about not saving=
ebp:

* %ebp Arg2 [note: not saved in the stack frame, should not be touc=
hed]

which sounds odd. Why don't we save it? If we take a signal handler
there, don't we want %ebp on the kernel stack in pt_regs, in order to
do everything right?

Now I'm *really* confused.

Linus

Al Viro

2011-08-23 02:59:44 UTC

Permalink

On Mon, Aug 22, 2011 at 06:59:48PM -0700, Linus Torvalds wrote:

> And the system call restart should actually work fine too, because at
> syscall entry we save %ebp *both* in the slot for ebp and ecx when we
> enter the first time. So the second time, we'll re-load the third
> argument from ebp again, but that's fine - it's still going to be the
> right value. Yes? No?
>
> However, I note that the cstar entrypont has a comment about not saving ebp:
>
> * %ebp Arg2 [note: not saved in the stack frame, should not be touched]
>
> which sounds odd. Why don't we save it? If we take a signal handler
> there, don't we want %ebp on the kernel stack in pt_regs, in order to
> do everything right?

That's exactly because it's callee-saved. amd64 doesn't build full
pt_regs on stack; there's a part built always (5 words needed for iret
to work + syscall number + rdi + rsi + rdx + rcx + rax + r8--r11) and
the rest of registers is not saved in regular cases. Reason: as long
as what we are calling follows amd64 ABI, we are guaranteed that
values of rsp/rbp/rbx/r12--r15 will not change. So we don't waste
cycles and stack space unless we need to. Which is to say,
* in fork/clone/vfork - there we want full pt_regs to copy it into
child's pt_regs.
* in {rt_,}sigreturn - we don't care about the current contents of
those registers, but we want to set them. Thus the full pt_regs on stack,
filled by sys_{rt_,}sigreturn() and these extra registers filled with
values from pt_regs.
* execve() - we want all registers reset to know state after
sys_execve(), so it fills the full pt_regs and we get the extra regs filled
out of it.
* sigaltstack() - there full pt_regs is an overkill, but we do want
userland sp.
* signal delivery - we want these registers preserved across the
duration of handler and we can't depend on handler following ABI. So we
fill the entire pt_regs, and copy it into sigcontext, to be eventually
picked up by sigreturn and reconstruct the entire state.
* ptrace - we want to be able to read/modify *all* these guys.
So we fill the entire pt_regs, let ptrace play with it and read extra regs
back. NOTE: ia32_cstar_tracesys() takes pains to prevent buggering ebp
there - we read the arg6 into r9, then swap it with ebp for duration of
that stuff. So ptrace will see arg6 in regs.bp, but when it's time
to go into syscall the (possibly modified) value will end in r9. Which
is how it's passed to C functions, so we are fine, but it'll be really
lost before we reach the userland. However, on the way *OUT* we are not
that nice, and SETREGS/POKEUSER hitting us there will end up modifying
ebp. Which will play hell on __kernel_vsyscall()...

Hell, you have done something very similar on alpha yourself... As for
ebp, it doesn't make any sense to save it on stack - ia32_cstar_entry()
itself takes care of not stomping on it just fine and IRET path
(int_ret_from_sys_call) modifies rbp only if explicitly asked to do so...
Which is most likely where it hits the fan for uml. Normally it wouldn't
hurt to ask PTRACE_PUTREGS to put into ebp the value we just got from
PTRACE_GETREGS. However, it *does* hurt when it happens on the second
stop per syscall - i.e. when we are on the way out. I'm not 100% sure
that this is what's going on (it's using PTRACE_SYSEMU, which is supposed
to avoid the second stop completely), but it looks like what I'm seeing...

Al Viro

2011-08-23 02:17:18 UTC

Permalink

On Tue, Aug 23, 2011 at 02:13:12AM +0100, Al Viro wrote:
> *UGH*. OK,
> 1) I'm an idiot; int_ret_from_sys_call does *not* usually step on
> rbp (it's callee-saved). So normally ebp is left as is on the way out,
> which is why we don't see stuff getting buggered left, right and center.
> 2) Sometimes it apparently does somehow happen. I don't see where
> it happens yet, but uml breakage that started all of that looks *exactly*
> like that. %ebp getting arg6 in it when we return into __kernel_vsyscall()
> from the kernel fits the observed pattern precisely.
> 3) modulo that the situation is nowhere near as bad as I thought.
> Brown paperbag time for me - for missing that if my analysis had been correct
> we'd've seen breakage _much_ earlier. Mea culpa.
> 4) we still have a problem, apparently, but it's more narrow now -
> the question is when would %rbp be shat into?
>
> Al, off to apply a serious self-LART...

So it smells like a nasty effect of PTRACE_SETREGS/PTRACE_POKEUSER on the
way *out* of syscall (the same on the way in wouldn't have such effect -
modified EBP value would be simply lost after the syscall; passed to it
as arg6, but that's it).

All right, now I have a nice shiny reproducer for uml folks:
main()
{
char *s = sbrk(8192);
*s = 0;
brk(s);
}
will do it on the affected boxen. It gets fucked in the second call of
brk(). What happens is this:
brk(3) in libc:
about to call brk(2), will have to stomp on %ebx
save ebx into ecx for the duration of call
the new brk level into ebx, syscall number into eax
hit __kernel_vsyscall()
push ebp
mov ecx, ebp
ecx is going to get stomped on, save it (i.e. original ebx) into ebp
syscall
and at that point ebp has changed - it became equal to arg6, aka what we'd
put on stack, aka the value of ebp prior to all of that, aka the frame
pointer of caller.
use ecx to set ss back to sanity
mov ebp, ecx
and now ecx is buggered.
pop ebp
the value of ebp has actually not been changed by that.
ret
put the value of ecx back into ebx
only it's not the value we used to have there. Not anymore. Now we are about
to store the return value of brk(2) into static variable. And that's where
it really hits the fan, since we are in a PIC code and ebx is not what it
used to be. So instead of that variable we access hell knows what address
and promptly segfault.

I have a very strong suspicion that I know what will turn out to be involved
into that - the page eviction done by sys_brk(). Note that dirtying this
sucker is really necessary - without *s = 0 it won't segfault at all. With
it we get a segfault described above.

And page eviction on uml is nasty and convoluted as hell. It has to do
munmap() on process' VM. Which is done in a rather sick way - we have a
stub present in address space of all processes, with a function that
does a given series of mmap/munmap/mprotect and traps itself. Guest
kernel puts arguments for that sucker into a shared data page and continues
the process into that function. Once it's done, we get the damn thing
stopped again, nice and ready for us to continue dealing with it.

Something in that shitstorm of ptrace() calls ends up doing SETREGS
when victim sits on the way out of (host) syscall. Boom...

Al Viro

2011-08-23 06:15:31 UTC

Permalink

On Tue, Aug 23, 2011 at 03:17:18AM +0100, Al Viro wrote:

> I have a very strong suspicion that I know what will turn out to be involved
> into that - the page eviction done by sys_brk(). Note that dirtying this
> sucker is really necessary - without *s = 0 it won't segfault at all. With
> it we get a segfault described above.
>
> And page eviction on uml is nasty and convoluted as hell. It has to do
> munmap() on process' VM. Which is done in a rather sick way - we have a
> stub present in address space of all processes, with a function that
> does a given series of mmap/munmap/mprotect and traps itself. Guest
> kernel puts arguments for that sucker into a shared data page and continues
> the process into that function. Once it's done, we get the damn thing
> stopped again, nice and ready for us to continue dealing with it.
>
> Something in that shitstorm of ptrace() calls ends up doing SETREGS
> when victim sits on the way out of (host) syscall. Boom...

Almost, but not quite. What happens is:
* process hits syscall insn
* it's stopped and tracer (guest kernel) does GETREGS
+ looks at the registers (mapped to the normal layout)
+ decides to call sys_brk()
+ notices pages to kick out
+ queues munmap request for stub
* tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
* tracer does CONT, letting the child run
* child finishes with syscall insn, carefully preserving ebp. It returns to
userland, in the beginning of the stub.
* child does munmap() and hits int 3 in the end of stub.
* the damn thing is stopped again. The tracer had been waiting for it.
* tracer finishes with sys_brk() and returns success.
* it does SETREGS, setting eax to return value, eip to original return
address of syscall insn... and ebp to what it had in regs.bp. I.e. the
damn arg6 value.

And we are fucked. It doesn't happen in syscall handler. It's int3().
Having no idea that this request to set ebp should be interpreted in
a really different way - "put the value I asked to put into ecx here,
please, and ignore this one".

Sigh... The really ugly part is that ebp can be changed by the stuff
done in stub - it's not just munmap, it can do mmap as well. We can,
in principle, save ebp on its stack and restore it before trapping.
Then uml kernel could, in theory, replace that SETREGS with a bunch of
POKEUSER, leaving ebp alone. Ho-hum... In principle, that might even
be not too horrible - we need eax/eip/esp, of course, but the rest
could be dealt with by the same trick - have it pushed/popped in the
stub and to hell with wasting syscalls on setting them...

Anyway, bedtime for me. I'll look into that again in the morning...

Borislav Petkov

2011-08-23 14:26:08 UTC

Permalink

On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote:
> Almost, but not quite. What happens is:
> * process hits syscall insn
> * it's stopped and tracer (guest kernel) does GETREGS
> + looks at the registers (mapped to the normal layout)
> + decides to call sys_brk()
> + notices pages to kick out
> + queues munmap request for stub
> * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
> * tracer does CONT, letting the child run
> * child finishes with syscall insn, carefully preserving ebp. It returns to
> userland, in the beginning of the stub.
> * child does munmap() and hits int 3 in the end of stub.
> * the damn thing is stopped again. The tracer had been waiting for it.
> * tracer finishes with sys_brk() and returns success.
> * it does SETREGS, setting eax to return value, eip to original return
> address of syscall insn... and ebp to what it had in regs.bp. I.e. the
> damn arg6 value.

Ok, stupid question: can a convoluted ptracing case like this be created
in "normal" userspace, i.e. irrespective of UML and only by using gdb,
for example?

I.e., from what I understand from above, you need to stop the tracee at
syscall and "redirect" it to the stub after it finishes the syscall so
that in another syscall it gets a debug exception... sounds complicated.

> And we are fucked. It doesn't happen in syscall handler. It's int3().
> Having no idea that this request to set ebp should be interpreted in
> a really different way - "put the value I asked to put into ecx here,
> please, and ignore this one".
>
> Sigh... The really ugly part is that ebp can be changed by the stuff
> done in stub - it's not just munmap, it can do mmap as well. We can,
> in principle, save ebp on its stack and restore it before trapping.
> Then uml kernel could, in theory, replace that SETREGS with a bunch of
> POKEUSER, leaving ebp alone. Ho-hum... In principle, that might even
> be not too horrible - we need eax/eip/esp, of course, but the rest
> could be dealt with by the same trick - have it pushed/popped in the
> stub and to hell with wasting syscalls on setting them...

which could mean that we could get away by not replacing SYSCALL32?

Hmm.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Al Viro

2011-08-23 16:30:10 UTC

Permalink

On Tue, Aug 23, 2011 at 04:26:08PM +0200, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote:
> > Almost, but not quite. What happens is:
> > * process hits syscall insn
> > * it's stopped and tracer (guest kernel) does GETREGS
> > + looks at the registers (mapped to the normal layout)
> > + decides to call sys_brk()
> > + notices pages to kick out
> > + queues munmap request for stub
> > * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
> > * tracer does CONT, letting the child run
> > * child finishes with syscall insn, carefully preserving ebp. It returns to
> > userland, in the beginning of the stub.
> > * child does munmap() and hits int 3 in the end of stub.
> > * the damn thing is stopped again. The tracer had been waiting for it.
> > * tracer finishes with sys_brk() and returns success.
> > * it does SETREGS, setting eax to return value, eip to original return
> > address of syscall insn... and ebp to what it had in regs.bp. I.e. the
> > damn arg6 value.
>
> Ok, stupid question: can a convoluted ptracing case like this be created
> in "normal" userspace, i.e. irrespective of UML and only by using gdb,
> for example?

I don't know...

> I.e., from what I understand from above, you need to stop the tracee at
> syscall and "redirect" it to the stub after it finishes the syscall so
> that in another syscall it gets a debug exception... sounds complicated.

Basically, we need to do things that tracer can't do via ptrace() - i.e.
play with mappings in the child. I.e. we need to do several syscalls
in child, then return it to traced state. And all of that - before we
return to execution of instructions past the syscall.

BTW, booting 32bit uml with nosysemu on such boxen blows up instantly, since
there we have *all* SETREGS done on the way out of syscall (with sysemu we
use PTRACE_SYSEMU, which will stop on syscall entry, let you play with
registers and suppress both the sys_...() call itself and the stop on
the way out; without sysemu it'll use PTRACE_SYSCALL, replace syscall
number with something harmless (getpid(2)), let it execute, stop on
the way out and update the registers there). Same issue, only here it
really happens from within the syscall handler itself.

Hell knows... I have no idea what kind of weirdness ptrace users exhibit.
FWIW, I suspect that there's another mess around signals in uml - signal
frame is built by tracer and it *has* to contain ebp. And have eip pointing
to insn immediately past the syscall. What should sigreturn do? It got
to restore ebp - can't rely on signal handler not having buggered the
register. And we are again in for it - ebp set to arg6 as we return to
insn right after syscall one.

OTOH, making GETREGS/PEEKUSER return registers without arg2 -> ecx, arg6 -> ebp
would instantly break both uml and far less exotic things. strace(1), for
one. Anything that wants to examine the arguments of stopped syscall will
be broken.

Linus Torvalds

2011-08-23 16:03:04 UTC

Permalink

On Mon, Aug 22, 2011 at 11:15 PM, Al Viro <***@zeniv.linux.org.uk> wro=
te:
>
> * it does SETREGS, setting eax to return value, eip to original retur=
n
> address of syscall insn... and ebp to what it had in regs.bp. =A0I.e.=
the
> damn arg6 value.

Ok, I think that exhaustively explains that

(a) our system call restart has always worked correctly, and we're goo=
d.

(b) it's simply just UML that is buggy, and doesn't understand the
subtleties about doing GETREGS at a system call.

and I think that the correct and simple solution is to just teach UML
to understand the proper logic of pt_regs during a system call (and a
'syscall' instruction in particular).

The thing is, UML emulates 'syscall' the way the *CPU* does it, not
the way *we* do it. That may make sense, but it's simply not correct.

So I would vote very strongly against actually changing anything in
arch/x86. This is very much an UML issue.

Suggested fixes:

- instead of blindly doing SETREGS, just write the result registers
individually like you suggested

OR (and perhaps preferably):

- teach UML that when you do 'GETREGS' after a system call trapped,
we have switched things around to match the "official system call
order", and UML should just undo our swizzling, and do a "regs.ebp =3D
regs.ecx" to make it be what the actual original registers were (or
whatever the actual correct swizzle is - I didn't think that through
very much).

IOW, I think the core kernel does the right thing. Our argument
register swizzling is odd, yes, but it's an implementation detail that
something like uml should just have to take into account. No?

Hmm?

Linus

Andrew Lutomirski

2011-08-23 16:11:43 UTC

Permalink

On Tue, Aug 23, 2011 at 12:03 PM, Linus Torvalds
<***@linux-foundation.org> wrote:
> On Mon, Aug 22, 2011 at 11:15 PM, Al Viro <***@zeniv.linux.org.uk> w=
rote:
>>
>> * it does SETREGS, setting eax to return value, eip to original retu=
rn
>> address of syscall insn... and ebp to what it had in regs.bp. =A0I.e=
=2E the
>> damn arg6 value.
>
> Ok, I think that exhaustively explains that
>
> =A0(a) our system call restart has always worked correctly, and we're=
good.
>
> =A0(b) it's simply just UML that is buggy, and doesn't understand the
> subtleties about doing GETREGS at a system call.
>
> and I think that the correct and simple solution is to just teach UML
> to understand the proper logic of pt_regs during a system call (and a
> 'syscall' instruction in particular).
>
> The thing is, UML emulates 'syscall' the way the *CPU* does it, not
> the way *we* do it. That may make sense, but it's simply not correct.
>
> So I would vote very strongly against actually changing anything in
> arch/x86. This is very much an UML issue.
>
> Suggested fixes:
>
> =A0- instead of blindly doing SETREGS, just write the result register=
s
> individually like you suggested
>
> OR (and perhaps preferably):
>
> =A0- teach UML that when you do 'GETREGS' after a system call trapped=
,
> we have switched things around to match the "official system call
> order", and UML should just undo our swizzling, and do a "regs.ebp =3D
> regs.ecx" to make it be what the actual original registers were (or
> whatever the actual correct swizzle is - I didn't think that through
> very much).
>
> IOW, I think the core kernel does the right thing. Our argument
> register swizzling is odd, yes, but it's an implementation detail tha=
t
> something like uml should just have to take into account. No?
>
> Hmm?

Egads. Does this mean that doing GETREGS and then doing SETREGS later
on with the *exact same values* is considered incorrect? IMO, this
way lies madness.

In any case, this seems insanely overcomplicated. I'd be less afraid
of something like my approach (which, I think, makes all of the
SYSCALL weirdness pretty much transparent to ptrace users) or of just
removing SYSCALL entirely from 32-bit code.

--Andy

Linus Torvalds

2011-08-23 16:20:12 UTC

Permalink

On Tue, Aug 23, 2011 at 9:11 AM, Andrew Lutomirski <***@mit.edu> wrote=
:
>
> Egads. =A0Does this mean that doing GETREGS and then doing SETREGS la=
ter
> on with the *exact same values* is considered incorrect?

No. If it does it with exactly the same values, it's perfectly ok. But
it isn't. It's changing some of those values.

> =A0IMO, this way lies madness.

No, the madness is in UML.

It's EMULATING A SYSTEM CALL. That original "getregs" value is not
some "user space state". It's the *system call* state that you got
after the system call trapped. Setting it back is an insane operation,
but it would happen to work - if you make no changes.

But UML *does* make changes. It takes that system call state, and then
EMULATES THE SYSTEM CALL INCORRECTLY.

If you see it that way (which is the correct way), then it's clearly
an UML problem, and it's not at all "madness" that your
getregs/setregs pairing doesn't work.

See? Buggy system call emulation. It's really that simple. Of course,
"simple" in this case is "really really subtle differences in how the
kernel treats syscall/sysenter/int80", so the *details* are certainly
not simple, but the concept is.

Linus

Al Viro

2011-08-23 17:33:17 UTC

Permalink

On Tue, Aug 23, 2011 at 09:20:12AM -0700, Linus Torvalds wrote:

> It's EMULATING A SYSTEM CALL. That original "getregs" value is not
> some "user space state". It's the *system call* state that you got
> after the system call trapped. Setting it back is an insane operation,
> but it would happen to work - if you make no changes.
>
> But UML *does* make changes. It takes that system call state, and then
> EMULATES THE SYSTEM CALL INCORRECTLY.
>
> If you see it that way (which is the correct way), then it's clearly
> an UML problem, and it's not at all "madness" that your
> getregs/setregs pairing doesn't work.
>
> See? Buggy system call emulation. It's really that simple. Of course,
> "simple" in this case is "really really subtle differences in how the
> kernel treats syscall/sysenter/int80", so the *details* are certainly
> not simple, but the concept is.

It's a bit more than that (ptrace changes to syscall arguments *are*
lost on syscall restart), but... as far as I'm concerned, the situation
is simple now:
* SYSCALL is not terminally broken wrt restarts. My apologies for
misreading what was going on.
* SYSENTER with Linus' patch does work just fine wrt restarts + ptrace
* SYSCALL is losing ptrace-made changes to arguments when it restarts.
Might or might not be a problem for somebody.
* UML should not touch SYSCALL for 32bit. Not without serious changes
in UML and I'm not convinced that it won't be worse than what we probably
ought to do there: check if __kernel_vsyscall() does SYSCALL (recognizable
by interaction with POKEUSER) and don't tell about vdso to guest processes.
Anything well-behaving won't step on SYSCALL and the things that do deserve
the subtle bugs they get.
* asm glue is subtle, evil and doesn't have anywhere near enough
documentation ;-/

Al Viro

2011-08-23 18:04:18 UTC

Permalink

On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote:
> * SYSCALL is not terminally broken wrt restarts. My apologies for
> misreading what was going on.
> * SYSENTER with Linus' patch does work just fine wrt restarts + ptrace
> * SYSCALL is losing ptrace-made changes to arguments when it restarts.
> Might or might not be a problem for somebody.

BTW, that one (irrelevant to UML even if we do end up coping with SYSCALL
there) might be worth spelling it out:

tracer:
ptrace(tracee, PTRACE_SYSCALL);
tracee:
recvfrom(..., &addrlen);
tracer:
ptrace(tracee, PTRACE_POKEUSER, EBP, &len2);
ptrace(tracee, PTRACE_DETACH, 0, 0);
tracee:
completes recvfrom(), using &len2 instead of the &addrlen

That works just fine, regardless of the way syscall is entered; yes, including
SYSCALL - there we take care to handle ptrace on the way in. However, if it's
SYSCALL and (ex-)tracee takes a restart, the second time around we'll have the
original value of 6th argument used. Changes made by POKEUSER are lost.

It's not a problem with int 0x80 or SYSENTER (now, with int 0x80 instead of
jmp). It's probably not going to be a real issue for anyone, but I pity the
poor bastard stuck with debugging that if it *does* become someone's problem.

Borislav Petkov

2011-08-24 12:44:19 UTC

Permalink

On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote:
> * asm glue is subtle, evil and doesn't have anywhere near enough
> documentation ;-/

I took the liberty to document some of your asm glue analysis in an
attempt to make the code a bit more understandable. How about the
following:

--
From: Borislav Petkov <***@amd.com>
Date: Wed, 24 Aug 2011 14:30:43 +0200
Subject: [PATCH] x86, asm: Document some of the syscall asm glue

Document some of the asm glue around compat SYSCALL32 and do a
whitespace cleanup while at it. See linked thread below for further
reference.

Link: http://lkml.kernel.org/r/***@ZenIV.linux.org.uk
Signed-off-by: Borislav Petkov <***@amd.com>
---
arch/x86/ia32/ia32entry.S | 138 ++++++++++++++++++++++++++-----------------
arch/x86/kernel/entry_64.S | 19 +++++-
2 files changed, 98 insertions(+), 59 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..8254432 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -1,16 +1,16 @@
/*
- * Compatibility mode system call entry point for x86-64.
- *
+ * Compatibility mode system call entry point for x86-64.
+ *
* Copyright 2000-2002 Andi Kleen, SuSE Labs.
- */
+ */

#include <asm/dwarf2.h>
#include <asm/calling.h>
#include <asm/asm-offsets.h>
#include <asm/current.h>
#include <asm/errno.h>
-#include <asm/ia32_unistd.h>
-#include <asm/thread_info.h>
+#include <asm/ia32_unistd.h>
+#include <asm/thread_info.h>
#include <asm/segment.h>
#include <asm/irqflags.h>
#include <linux/linkage.h>
@@ -38,11 +38,11 @@
xchg %ecx,%esi
movl %ebx,%edi
movl %edx,%edx /* zero extension */
- .endm
+ .endm

- /* clobbers %eax */
+ /* clobbers %eax */
.macro CLEAR_RREGS offset=0, _r9=rax
- xorl %eax,%eax
+ xorl %eax,%eax
movq %rax,\offset+R11(%rsp)
movq %rax,\offset+R10(%rsp)
movq %\_r9,\offset+R9(%rsp)
@@ -69,7 +69,7 @@
movl \offset+64(%rsp),%edi
movl %eax,%eax /* zero extension */
.endm
-
+
.macro CFI_STARTPROC32 simple
CFI_STARTPROC \simple
CFI_UNDEFINED r8
@@ -106,14 +106,14 @@ ENDPROC(native_irq_enable_sysexit)
* %esi Arg4
* %edi Arg5
* %ebp user stack
- * 0(%ebp) Arg6
- *
+ * 0(%ebp) Arg6
+ *
* Interrupts off.
- *
+ *
* This is purely a fast path. For anything complicated we use the int 0x80
* path below. Set up a complete hardware stack frame to share code
* with the int 0x80 path.
- */
+ */
ENTRY(ia32_sysenter_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
@@ -127,7 +127,7 @@ ENTRY(ia32_sysenter_target)
* disabled irqs, here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
- movl %ebp,%ebp /* zero extension */
+ movl %ebp,%ebp /* zero extension */
pushq_cfi $__USER32_DS
/*CFI_REL_OFFSET ss,0*/
pushq_cfi %rbp
@@ -144,12 +144,12 @@ ENTRY(ia32_sysenter_target)
pushq_cfi %rax
cld
SAVE_ARGS 0,1,0
- /* no need to do an access_ok check here because rbp has been
- 32bit zero extended */
+ /* no need to do an access_ok check here because rbp has been
+ 32bit zero extended */
1: movl (%rbp),%ebp
- .section __ex_table,"a"
- .quad 1b,ia32_badarg
- .previous
+ .section __ex_table,"a"
+ .quad 1b,ia32_badarg
+ .previous
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
@@ -170,7 +170,7 @@ sysenter_dispatch:
sysexit_from_sys_call:
andl $~TS_COMPAT,TI_status(%r10)
/* clear IF, that popfq doesn't enable interrupts early */
- andl $~0x200,EFLAGS-R11(%rsp)
+ andl $~0x200,EFLAGS-R11(%rsp)
movl RIP-R11(%rsp),%edx /* User %eip */
CFI_REGISTER rip,rdx
RESTORE_ARGS 0,24,0,0,0,0
@@ -260,20 +260,21 @@ ENDPROC(ia32_sysenter_target)
* Arguments:
* %eax System call number.
* %ebx Arg1
- * %ecx return EIP
+ * %ecx return EIP
* %edx Arg3
* %esi Arg4
* %edi Arg5
- * %ebp Arg2 [note: not saved in the stack frame, should not be touched]
- * %esp user stack
+ * %ebp Arg2 [note: not saved in the stack frame, should not be touched
+ * because it is callee-saved in 64-bit calling convention]
+ * %esp user stack
* 0(%esp) Arg6
- *
+ *
* Interrupts off.
- *
+ *
* This is purely a fast path. For anything complicated we use the int 0x80
* path below. Set up a complete hardware stack frame to share code
- * with the int 0x80 path.
- */
+ * with the int 0x80 path.
+ */
ENTRY(ia32_cstar_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
@@ -281,34 +282,57 @@ ENTRY(ia32_cstar_target)
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
+
+ /* stash away usermode stack ptr */
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq PER_CPU_VAR(kernel_stack),%rsp
+
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,0,0
- movl %eax,%eax /* zero extension */
+ movl %eax,%eax /* zero extension */
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
+
+ /* return-RIP is in %ecx when executing SYSCALL */
movq %rcx,RIP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rip,RIP-ARGOFFSET
- movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
+
+ /*
+ * Put Arg2 into %rcx pt_regs slot to match kernel syscall
+ * calling conventions, i.e. what INT80 would expect;
+ * this lies slightly to ptrace
+ */
+ movq %rbp,RCX-ARGOFFSET(%rsp)
movl %ebp,%ecx
movq $__USER32_CS,CS-ARGOFFSET(%rsp)
movq $__USER32_DS,SS-ARGOFFSET(%rsp)
+
+ /* rFLAGS is in %r11 when executing SYSCALL */
movq %r11,EFLAGS-ARGOFFSET(%rsp)
/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
- movq %r8,RSP-ARGOFFSET(%rsp)
+
+ /* save usermode stack ptr into pt_regs */
+ movq %r8,RSP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rsp,RSP-ARGOFFSET
- /* no need to do an access_ok check here because r8 has been
- 32bit zero extended */
- /* hardware stack frame is complete now */
+
+ /*
+ * Get Arg6 which is on the usermode stack; no need to do an
+ * access_ok check here because %r8 has been 32bit zero extended.
+ * hardware stack frame is complete now.
+ */
1: movl (%r8),%r9d
+
+ /*
+ * handle pagefaulting when accessing usermode stack by returning
+ * -EFAULT
+ */
.section __ex_table,"a"
.quad 1b,ia32_badarg
- .previous
+ .previous
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
@@ -331,7 +355,7 @@ sysretl_from_sys_call:
RESTORE_ARGS 0,-ARG_SKIP,0,0,0
movl RIP-ARGOFFSET(%rsp),%ecx
CFI_REGISTER rip,rcx
- movl EFLAGS-ARGOFFSET(%rsp),%r11d
+ movl EFLAGS-ARGOFFSET(%rsp),%r11d
/*CFI_REGISTER rflags,r11*/
xorq %r10,%r10
xorq %r9,%r9
@@ -340,7 +364,7 @@ sysretl_from_sys_call:
movl RSP-ARGOFFSET(%rsp),%esp
CFI_RESTORE rsp
USERGS_SYSRET32
-
+
#ifdef CONFIG_AUDITSYSCALL
cstar_auditsys:
CFI_RESTORE_STATE
@@ -358,6 +382,8 @@ cstar_tracesys:
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10)
jz cstar_auditsys
#endif
+
+ /* put Arg6 into %ebp where ptrace expects it */
xchgl %r9d,%ebp
SAVE_REST
CLEAR_RREGS 0, r9
@@ -366,21 +392,23 @@ cstar_tracesys:
call syscall_trace_enter
LOAD_ARGS32 ARGOFFSET, 1 /* reload args from stack in case ptrace changed it */
RESTORE_REST
+
+ /* sync back Arg6's possibly changed value where it is expected by C */
xchgl %ebp,%r9d
cmpq $(IA32_NR_syscalls-1),%rax
ja int_ret_from_sys_call /* cstar_tracesys has set RAX(%rsp) */
jmp cstar_do_call
END(ia32_cstar_target)
-
+
ia32_badarg:
movq $-EFAULT,%rax
jmp ia32_sysret
CFI_ENDPROC

-/*
- * Emulated IA32 system calls via int 0x80.
+/*
+ * Emulated IA32 system calls via int 0x80.
*
- * Arguments:
+ * Arguments:
* %eax System call number.
* %ebx Arg1
* %ecx Arg2
@@ -390,13 +418,13 @@ ia32_badarg:
* %ebp Arg6 [note: not saved in the stack frame, should not be touched]
*
* Notes:
- * Uses the same stack frame as the x86-64 version.
+ * Uses the same stack frame as the x86-64 version.
* All registers except %eax must be saved (but ptrace may violate that)
* Arguments are zero extended. For system calls that want sign extension and
* take long arguments a wrapper is needed. Most calls can just be called
* directly.
- * Assumes it is only called from user space and entered with interrupts off.
- */
+ * Assumes it is only called from user space and entered with interrupts off.
+ */

ENTRY(ia32_syscall)
CFI_STARTPROC32 simple
@@ -433,9 +461,9 @@ ia32_sysret:
movq %rax,RAX-ARGOFFSET(%rsp)
ia32_ret_from_sys_call:
CLEAR_RREGS -ARGOFFSET
- jmp int_ret_from_sys_call
+ jmp int_ret_from_sys_call

-ia32_tracesys:
+ia32_tracesys:
SAVE_REST
CLEAR_RREGS
movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
@@ -457,13 +485,13 @@ quiet_ni_syscall:
movq $-ENOSYS,%rax
ret
CFI_ENDPROC
-
+
.macro PTREGSCALL label, func, arg
.globl \label
\label:
leaq \func(%rip),%rax
leaq -ARGOFFSET+8(%rsp),\arg /* 8 for return address */
- jmp ia32_ptregs_common
+ jmp ia32_ptregs_common
.endm

CFI_STARTPROC32
@@ -537,7 +565,7 @@ ia32_sys_call_table:
.quad quiet_ni_syscall /* old stty syscall holder */
.quad quiet_ni_syscall /* old gtty syscall holder */
.quad sys_access
- .quad sys_nice
+ .quad sys_nice
.quad quiet_ni_syscall /* 35 */ /* old ftime syscall holder */
.quad sys_sync
.quad sys32_kill
@@ -616,7 +644,7 @@ ia32_sys_call_table:
.quad stub32_iopl /* 110 */
.quad sys_vhangup
.quad quiet_ni_syscall /* old "idle" system call */
- .quad sys32_vm86_warning /* vm86old */
+ .quad sys32_vm86_warning /* vm86old */
.quad compat_sys_wait4
.quad sys_swapoff /* 115 */
.quad compat_sys_sysinfo
@@ -669,7 +697,7 @@ ia32_sys_call_table:
.quad sys_mremap
.quad sys_setresuid16
.quad sys_getresuid16 /* 165 */
- .quad sys32_vm86_warning /* vm86 */
+ .quad sys32_vm86_warning /* vm86 */
.quad quiet_ni_syscall /* query_module */
.quad sys_poll
.quad compat_sys_nfsservctl
@@ -724,10 +752,10 @@ ia32_sys_call_table:
.quad sys_mincore
.quad sys_madvise
.quad compat_sys_getdents64 /* 220 getdents64 */
- .quad compat_sys_fcntl64
+ .quad compat_sys_fcntl64
.quad quiet_ni_syscall /* tux */
- .quad quiet_ni_syscall /* security */
- .quad sys_gettid
+ .quad quiet_ni_syscall /* security */
+ .quad sys_gettid
.quad sys32_readahead /* 225 */
.quad sys_setxattr
.quad sys_lsetxattr
@@ -742,7 +770,7 @@ ia32_sys_call_table:
.quad sys_lremovexattr
.quad sys_fremovexattr
.quad sys_tkill
- .quad sys_sendfile64
+ .quad sys_sendfile64
.quad compat_sys_futex /* 240 */
.quad compat_sys_sched_setaffinity
.quad compat_sys_sched_getaffinity
@@ -754,7 +782,7 @@ ia32_sys_call_table:
.quad compat_sys_io_submit
.quad sys_io_cancel
.quad sys32_fadvise64 /* 250 */
- .quad quiet_ni_syscall /* free_huge_pages */
+ .quad quiet_ni_syscall /* free_huge_pages */
.quad sys_exit_group
.quad sys32_lookup_dcookie
.quad sys_epoll_create
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 6419bb0..9569f11 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -607,10 +607,16 @@ tracesys:
GLOBAL(int_ret_from_sys_call)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+
+ /*
+ * check the Requestor Privilege Level of the CS selector
+ * previously pushed on the stack. If 0, we're returning
+ * to kernel space.
+ */
testl $3,CS-ARGOFFSET(%rsp)
je retint_restore_args
- movl $_TIF_ALLWORK_MASK,%edi
/* edi: mask to check */
+ movl $_TIF_ALLWORK_MASK,%edi
GLOBAL(int_with_check)
LOCKDEP_SYS_EXIT_IRQ
GET_THREAD_INFO(%rcx)
@@ -618,11 +624,16 @@ GLOBAL(int_with_check)
andl %edi,%edx
jnz int_careful
andl $~TS_COMPAT,TI_status(%rcx)
+
+ /* no work pending, return to userspace */
jmp retint_swapgs

- /* Either reschedule or signal or syscall exit tracking needed. */
- /* First do a reschedule test. */
- /* edx: work, edi: workmask */
+ /*
+ * Either reschedule or signal or syscall exit tracking
+ * needed. First do a reschedule test.
+ *
+ * edx: work, edi: workmask
+ */
int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
--
1.7.4

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Borislav Petkov

2011-08-23 16:22:51 UTC

Permalink

On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote:
> In any case, this seems insanely overcomplicated. I'd be less afraid
> of something like my approach (which, I think, makes all of the
> SYSCALL weirdness pretty much transparent to ptrace users) or of just
> removing SYSCALL entirely from 32-bit code.

I don't think that removing SYSCALL from 32-bit code just so that UML
trapped syscalls work is something we'd like since SYSCALL is much
cheaper than INT $0x80:

"As a result, SYSCALL and SYSRET can take fewer than one-fourth the
number of internal clock cycles to complete than the legacy CALL and RET
instructions."

http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152.

I know, it is 32-bit syscall on 64-bit kernel which should be pretty
rare but still...

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Linus Torvalds

2011-08-23 16:29:29 UTC

Permalink

On Tue, Aug 23, 2011 at 9:22 AM, Borislav Petkov <***@amd64.org> wrote:
>
> I don't think that removing SYSCALL from 32-bit code just so that UML
> trapped syscalls work is something we'd like since SYSCALL is much
> cheaper than INT $0x80:

Oh yes.

System call performance is *important*. And x86 is *important*.

UML? In comparison, not that important.

So quite frankly, if this is purely an UML issue (and unless we're
missing something else, that's what it looks like, despite all the
confusion we've had so far), then if we have a choice between "remove
syscall instruction" and "remove UML", then ...

Linus

Al Viro

2011-08-23 16:53:20 UTC

Permalink

On Tue, Aug 23, 2011 at 09:29:29AM -0700, Linus Torvalds wrote:

> Oh yes.
>
> System call performance is *important*. And x86 is *important*.
>
> UML? In comparison, not that important.
>
> So quite frankly, if this is purely an UML issue (and unless we're
> missing something else, that's what it looks like, despite all the
> confusion we've had so far), then if we have a choice between "remove
> syscall instruction" and "remove UML", then ...

Agreed. Note, BTW, that UML has perfectly usable workaround for 99% of
that - don't tell the binaries it has *any* vdso in such cases. And
"remove UML" turns into "remove support under UML for 32bit binaries
that go out of their way to do SYSCALL directly, which wouldn't work
on native 32bit", which is really a no-brainer.

Richard Weinberger

2011-08-23 16:58:18 UTC

Permalink

Am 23.08.2011 18:53, schrieb Al Viro:
> On Tue, Aug 23, 2011 at 09:29:29AM -0700, Linus Torvalds wrote:
>
>> Oh yes.
>>
>> System call performance is *important*. And x86 is *important*.
>>
>> UML? In comparison, not that important.
>>
>> So quite frankly, if this is purely an UML issue (and unless we're
>> missing something else, that's what it looks like, despite all the
>> confusion we've had so far), then if we have a choice between "remove
>> syscall instruction" and "remove UML", then ...
>
> Agreed. Note, BTW, that UML has perfectly usable workaround for 99% of
> that - don't tell the binaries it has *any* vdso in such cases. And
> "remove UML" turns into "remove support under UML for 32bit binaries
> that go out of their way to do SYSCALL directly, which wouldn't work
> on native 32bit", which is really a no-brainer.

What about this hack/solution?
While booting UML can check whether the host's vDSO contains
a SYSCALL instruction.
If so, UML will not make the host's vDSO available to it's
processes...

Thanks,
//richard

Al Viro

2011-08-23 17:07:40 UTC

Permalink

On Tue, Aug 23, 2011 at 06:58:18PM +0200, Richard Weinberger wrote:

> What about this hack/solution?
> While booting UML can check whether the host's vDSO contains
> a SYSCALL instruction.
> If so, UML will not make the host's vDSO available to it's
> processes...

Note that this is *only* for 32bit side of things. 64bit one works fine...

I wouldn't search for SYSCALL in vdso, BTW - not when we have a good way
to trigger that crap and recognize it.

At boot time, fork a child. Have it traced with PTRACE_SYSCALL. Let it
put recognizable values in registers and call __kernel_vsyscall(). Then
let the parent do one more PTRACE_SYSCALL, then PTRACE_POKEUSER and set ebp
to 0x69696969. PTRACE_CONT the sucker and let it report what it sees in ecx.
If it's what we'd put there - fine, it looks safe. If it's 0x69696969 -
we have a problem, no vdso for us.

Richard Weinberger

2011-08-23 17:29:51 UTC

Permalink

Am 23.08.2011 19:07, schrieb Al Viro:
> On Tue, Aug 23, 2011 at 06:58:18PM +0200, Richard Weinberger wrote:
>
>> What about this hack/solution?
>> While booting UML can check whether the host's vDSO contains
>> a SYSCALL instruction.
>> If so, UML will not make the host's vDSO available to it's
>> processes...
>
> Note that this is *only* for 32bit side of things. 64bit one works fine...

I know. :)

> I wouldn't search for SYSCALL in vdso, BTW - not when we have a good way
> to trigger that crap and recognize it.
>
> At boot time, fork a child. Have it traced with PTRACE_SYSCALL. Let it
> put recognizable values in registers and call __kernel_vsyscall(). Then
> let the parent do one more PTRACE_SYSCALL, then PTRACE_POKEUSER and set ebp
> to 0x69696969. PTRACE_CONT the sucker and let it report what it sees in ecx.
> If it's what we'd put there - fine, it looks safe. If it's 0x69696969 -
> we have a problem, no vdso for us.

Okay, this is a much cleaner approach.
But first I've to find a machine where I can test the issue.
At home none on of my x86_64 machines is SYSCALL-based.
Tomorrow I'll search at the university for one...

Thanks,
//richard

Richard Weinberger

2011-08-25 00:05:39 UTC

Permalink

Am 23.08.2011 19:07, schrieb Al Viro:
> On Tue, Aug 23, 2011 at 06:58:18PM +0200, Richard Weinberger wrote:
>
>> What about this hack/solution?
>> While booting UML can check whether the host's vDSO contains
>> a SYSCALL instruction.
>> If so, UML will not make the host's vDSO available to it's
>> processes...
>
> Note that this is *only* for 32bit side of things. 64bit one works fine...
>
> I wouldn't search for SYSCALL in vdso, BTW - not when we have a good way
> to trigger that crap and recognize it.
>
> At boot time, fork a child. Have it traced with PTRACE_SYSCALL. Let it
> put recognizable values in registers and call __kernel_vsyscall(). Then
> let the parent do one more PTRACE_SYSCALL, then PTRACE_POKEUSER and set ebp
> to 0x69696969. PTRACE_CONT the sucker and let it report what it sees in ecx.
> If it's what we'd put there - fine, it looks safe. If it's 0x69696969 -
> we have a problem, no vdso for us.

BTW: IMHO we can completely disable the vDSO for 32bit guests.
I did some benchmarks, there is no performance gain at all
within UML.

The attached program runs some syscalls for 10 seconds and prints
the number of iterations.

Some results (5 runs on my Intel Core2):
vdso: 360099 362057 365982 367132 368907
none: 344195 355759 358974 366630 420027

Thanks,
//richard

H. Peter Anvin

2011-08-23 19:15:58 UTC

Permalink

On 08/23/2011 09:22 AM, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote:
>> In any case, this seems insanely overcomplicated. I'd be less afraid
>> of something like my approach (which, I think, makes all of the
>> SYSCALL weirdness pretty much transparent to ptrace users) or of just
>> removing SYSCALL entirely from 32-bit code.
>
> I don't think that removing SYSCALL from 32-bit code just so that UML
> trapped syscalls work is something we'd like since SYSCALL is much
> cheaper than INT $0x80:
>
> "As a result, SYSCALL and SYSRET can take fewer than one-fourth the
> number of internal clock cycles to complete than the legacy CALL and RET
> instructions."
>
> http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152.
>
> I know, it is 32-bit syscall on 64-bit kernel which should be pretty
> rare but still...
>

Right, but if you had said the difference had disappeared on current AMD
silicon it would be much less of an issue. That's why I wanted to find
that bit out from you.

-hpa

Borislav Petkov

2011-08-23 20:56:16 UTC

Permalink

On Tue, Aug 23, 2011 at 03:15:58PM -0400, H. Peter Anvin wrote:
> Right, but if you had said the difference had disappeared on current AMD
> silicon it would be much less of an issue. That's why I wanted to find
> that bit out from you.

Yeah, I dug this out in the APM, btw.

But no, I don't think the difference has disappeared - to the contrary,
AFAICT, the intention is for SYSCALL to be the fastest way to do
syscalls on x86 due to diminished number of segment checks etc. INT80
is legacy, slower, etc. I believe Andy measured a similar situation on
Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
and INT80 being much slower. Ingo also measured a similar situation
where the latency gap between the two on Intel is even bigger.

So, the only problem left now is what we're going
to do with cases similar to what Al conjured up:
http://marc.info/?l=linux-kernel&m=131412271112461&w=2

I don't know, maybe the most cowardly approach is to issue a warning and
shrug with the shoulders, or do some asm magic... Also, do we care at
all, how relevant is a case like that?

Hmmm.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

H. Peter Anvin

2011-08-23 21:06:03 UTC

Permalink

On 08/23/2011 01:56 PM, Borislav Petkov wrote:
>
> But no, I don't think the difference has disappeared - to the contrary,
> AFAICT, the intention is for SYSCALL to be the fastest way to do
> syscalls on x86 due to diminished number of segment checks etc. INT80
> is legacy, slower, etc. I believe Andy measured a similar situation on
> Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
> and INT80 being much slower. Ingo also measured a similar situation
> where the latency gap between the two on Intel is even bigger.
>

Sandy Bridge doesn't have SYSCALL32 at all. It has SYSENTER and SYSCALL64.

-hpa

Borislav Petkov

2011-08-23 21:10:47 UTC

Permalink

On Tue, Aug 23, 2011 at 05:06:03PM -0400, H. Peter Anvin wrote:
> On 08/23/2011 01:56 PM, Borislav Petkov wrote:
> >
> > But no, I don't think the difference has disappeared - to the contrary,
> > AFAICT, the intention is for SYSCALL to be the fastest way to do
> > syscalls on x86 due to diminished number of segment checks etc. INT80
> > is legacy, slower, etc. I believe Andy measured a similar situation on
> > Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
> > and INT80 being much slower. Ingo also measured a similar situation
> > where the latency gap between the two on Intel is even bigger.
> >
>
> Sandy Bridge doesn't have SYSCALL32 at all. It has SYSENTER and SYSCALL64.

Yeah, I was talking about SYSCALL in general.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

H. Peter Anvin

2011-08-23 23:04:45 UTC

Permalink

On 08/23/2011 02:10 PM, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 05:06:03PM -0400, H. Peter Anvin wrote:
>> On 08/23/2011 01:56 PM, Borislav Petkov wrote:
>>>
>>> But no, I don't think the difference has disappeared - to the contrary,
>>> AFAICT, the intention is for SYSCALL to be the fastest way to do
>>> syscalls on x86 due to diminished number of segment checks etc. INT80
>>> is legacy, slower, etc. I believe Andy measured a similar situation on
>>> Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
>>> and INT80 being much slower. Ingo also measured a similar situation
>>> where the latency gap between the two on Intel is even bigger.
>>>
>>
>> Sandy Bridge doesn't have SYSCALL32 at all. It has SYSENTER and SYSCALL64.
>
> Yeah, I was talking about SYSCALL in general.
>

Right, but there could be an arbitrary chasm between SYSCALL32 and
SYSCALL64.

-hpa

H. Peter Anvin

2011-08-24 21:10:04 UTC

Permalink

On 08/23/2011 02:10 PM, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 05:06:03PM -0400, H. Peter Anvin wrote:
>> On 08/23/2011 01:56 PM, Borislav Petkov wrote:
>>>
>>> But no, I don't think the difference has disappeared - to the contrary,
>>> AFAICT, the intention is for SYSCALL to be the fastest way to do
>>> syscalls on x86 due to diminished number of segment checks etc. INT80
>>> is legacy, slower, etc. I believe Andy measured a similar situation on
>>> Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
>>> and INT80 being much slower. Ingo also measured a similar situation
>>> where the latency gap between the two on Intel is even bigger.
>>>
>>
>> Sandy Bridge doesn't have SYSCALL32 at all. It has SYSENTER and SYSCALL64.
>
> Yeah, I was talking about SYSCALL in general.
>

By the way, Borislav;

any way you could nudge your hardware people into

a) supporting SYSENTER in compatibility mode, and
b) giving us a way to turn SYSCALL *off* in compat mode?

... for future chips?

-hpa

Al Viro

2011-08-23 16:48:49 UTC

Permalink

On Tue, Aug 23, 2011 at 09:03:04AM -0700, Linus Torvalds wrote:

> Suggested fixes:
>
> - instead of blindly doing SETREGS, just write the result registers
> individually like you suggested

Not enough. There is also a PITA with signal handlers. There we can't
avoid modifying ebp on the way out of handler (i.e. by emulated sigreturn).
And it lands us straight after syscall insn, with ebp "restored" to the
wrong value.

> OR (and perhaps preferably):
>
> - teach UML that when you do 'GETREGS' after a system call trapped,
> we have switched things around to match the "official system call
> order", and UML should just undo our swizzling, and do a "regs.ebp =
> regs.ecx" to make it be what the actual original registers were (or
> whatever the actual correct swizzle is - I didn't think that through
> very much).

Um... How would it know which syscall variant had that been, to start
with? For int 0x80 it would need to use registers as-is. For SYSENTER
it also could use them as-is - ebp will differ from what we put there
when entering the sucker, but not critically so; on the way out of
syscall we'll overwrite it anyway immediately (either by pop or mov).
For SYSCALL... we don't really care about ecx contents prior to entering
the kernel (and it'll be blown out anyway), and ebp one could be found in
regs.ecx. So yes, we can do it that way, but... how to tell what variant
had been triggered? Examining two bytes prior to user eip? Sounds bloody
brittle...

Linus Torvalds

2011-08-23 17:33:09 UTC

Permalink

On Tue, Aug 23, 2011 at 9:48 AM, Al Viro <***@zeniv.linux.org.uk> wrot=
e:
>
> Um... =A0How would it know which syscall variant had that been, to st=
art
> with?

Just read the instruction, for chissake.

UML *already* does that, to see if it's "int80" or "sysenter" ('is_sysc=
all()').

Now, I do agree that if we had designed the ptrace interface with
these kinds of issues in mind, then we would have added a "state"
field to the thing that could have this kind of information as part of
the GETREGS interface. There is no question that that would have been
a good idea - but we have what we have.

I mean, technically, we could also have always just given "raw user
space register state" to ptrace, and then just said that "anybody who
traces system calls needs to know the exact calling conventions for
*that* kind of system call". But instead of that, we give the "cooked"
pt_regs values on read-out, to make it simpler for strace and friends.

And it's actualyl simpler for UML too. If we *didn't* give that cooked
register set information, then UML would *still* have to look at the
actual instruction in order to emulate the system call correctly
("it's sysenter, so now I need to take some of the system call
arguments from the stack"). So the fact that we do that register state
swizzling actually helps not just strace, but UML too.

It would be *nice* if we did the swizzling automatically at setregs()
time too, but we simply don't have enough information in the kernel to
do that. Again, exactly because pt_regs doesn't have a "state"
variable, when user-space does the SETREGS call, we simply don't know
whether we are in "normal" code or in some system call entry or exit
state. So the kernel does the swizzling at GETREGS time (by virtue of
always having the registers in a "canonical" state for system call
entry), but we fundamentally *cannot* to do the unswizzle, because we
don't know what the SETREGS caller actually did.

So I think the current state is actually the best we could possibly
do, with the caveat that *if* we had known about the "different system
calls have different register layouts" originally and had thought of
it, we could have added a 'state' word that the kernel could set at
GETREGS time, and use at SETREGS time to decide whether swizzling is
needed or not.

But not only would that have required time travel (ptrace existed
before the multiple system calls did), even then it's not 100% clear
that the current simpler model (with the admittedly subtle case of
implicit state and its effect on register state) isn't actually the
better solution. *Somebody* has to do the register swizzling, and the
current "kernel canonicalizes registers at read time, you need to
swizzle them if you change state" may simply be the RightThing(tm).

Linus

H. Peter Anvin

2011-08-23 21:08:27 UTC

Permalink

On 08/23/2011 10:33 AM, Linus Torvalds wrote:
>
> It would be *nice* if we did the swizzling automatically at setregs()
> time too, but we simply don't have enough information in the kernel to
> do that. Again, exactly because pt_regs doesn't have a "state"
> variable, when user-space does the SETREGS call, we simply don't know
> whether we are in "normal" code or in some system call entry or exit
> state. So the kernel does the swizzling at GETREGS time (by virtue of
> always having the registers in a "canonical" state for system call
> entry), but we fundamentally *cannot* to do the unswizzle, because we
> don't know what the SETREGS caller actually did.
>

Again, can we steal one of the padding fields to use for that state
variable? We have two 16-bit padding fields; one for cs and one for ss.

For UML, I agree, let's just not expose the vdso assuming that is
possible, but for other -- possibly future -- users.

-hpa

Linus Torvalds

2011-08-23 21:20:20 UTC

Permalink

On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <***@zytor.com> wrote:
>
> Again, can we steal one of the padding fields to use for that state
> variable? =A0We have two 16-bit padding fields; one for cs and one fo=
r ss.

We can steal them for passing the information to the user, but no, I
don't think we can use them to then take the information *from* the
user.

Somebody may well be setting up a 'pt_regs' structure on his own, and
simply not fill in the padding, resulting in random data in those
fields.

Linus

H. Peter Anvin

2011-08-23 23:04:17 UTC

Permalink

On 08/23/2011 02:20 PM, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <***@zytor.com> wrote:
>>
>> Again, can we steal one of the padding fields to use for that state
>> variable? We have two 16-bit padding fields; one for cs and one for ss.
>
> We can steal them for passing the information to the user, but no, I
> don't think we can use them to then take the information *from* the
> user.
>
> Somebody may well be setting up a 'pt_regs' structure on his own, and
> simply not fill in the padding, resulting in random data in those
> fields.
>

That would be fine, I'd think... just gives the user space application
enough information to know how it would have to reshuffle the registers
if it needs to.

-hpa

H. Peter Anvin

2011-08-23 19:18:15 UTC

Permalink

On 08/23/2011 09:48 AM, Al Viro wrote:
>
> Um... How would it know which syscall variant had that been, to start
> with? For int 0x80 it would need to use registers as-is. For SYSENTER
> it also could use them as-is - ebp will differ from what we put there
> when entering the sucker, but not critically so; on the way out of
> syscall we'll overwrite it anyway immediately (either by pop or mov).
> For SYSCALL... we don't really care about ecx contents prior to entering
> the kernel (and it'll be blown out anyway), and ebp one could be found in
> regs.ecx. So yes, we can do it that way, but... how to tell what variant
> had been triggered? Examining two bytes prior to user eip? Sounds bloody
> brittle...

We could drop that information in a metaregister. It's not backward
compatible, but at least it will be obvious when that information is
available and not.

-hpa

Linus Torvalds

2011-08-23 19:24:22 UTC

Permalink

On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <***@zytor.com> wrote:
>
> We could drop that information in a metaregister. =A0It's not backwar=
d
> compatible, but at least it will be obvious when that information is
> available and not.

Well, seriously, UML already looks at the word at "ip-2" for other
reasons. So it isn't like there is any point in adding more support to
just give you that information in another form.

Linus

H. Peter Anvin

2011-08-23 19:26:24 UTC

Permalink

On 08/23/2011 12:24 PM, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <***@zytor.com> wrote:
>>
>> We could drop that information in a metaregister. It's not backward
>> compatible, but at least it will be obvious when that information is
>> available and not.
>
> Well, seriously, UML already looks at the word at "ip-2" for other
> reasons. So it isn't like there is any point in adding more support to
> just give you that information in another form.
>

Not for UML, but it might be useful for other things. Certainly lowers
the priority, though.

-hpa

Al Viro

2011-08-23 19:41:03 UTC

Permalink

On Tue, Aug 23, 2011 at 12:24:22PM -0700, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <***@zytor.com> wrote:
> >
> > We could drop that information in a metaregister. ?It's not backward
> > compatible, but at least it will be obvious when that information is
> > available and not.
>
> Well, seriously, UML already looks at the word at "ip-2" for other
> reasons. So it isn't like there is any point in adding more support to
> just give you that information in another form.

That is done only for task singlestepped in the guest:
/*
* This closes a way to execute a system call on the host. If
* you set a breakpoint on a system call instruction and singlestep
* from it, the tracing thread used to PTRACE_SINGLESTEP the process
* rather than PTRACE_SYSCALL it, allowing the system call to execute
* on the host. The tracing thread will check this flag and
* PTRACE_SYSCALL if necessary.
*/
if (current->ptrace & PT_DTRACE)
current->thread.singlestep_syscall =
is_syscall(PT_REGS_IP(&current->thread.regs));
with PT_DTRACE set by uml user_enable_single_step()

And it's not cheap - doing that on each syscall will be unpleasant...
Frankly, I'd rather stopped telling the uml userland about vdso in such
setups. And anything that plays with SYSCALL outside of vdso... <shrug>
we already have a "don't run it native on 32bit", adding "don't run
it on 32bit uml on amd64 host" is not too serious. At least for now...

Linus Torvalds

2011-08-23 19:43:28 UTC

Permalink

On Tue, Aug 23, 2011 at 12:41 PM, Al Viro <***@zeniv.linux.org.uk> wro=
te:
>
> And it's not cheap - doing that on each syscall will be unpleasant...
> Frankly, I'd rather stopped telling the uml userland about vdso in su=
ch
> setups. =A0And anything that plays with SYSCALL outside of vdso... <s=
hrug>
> we already have a "don't run it native on 32bit", adding "don't run
> it on 32bit uml on amd64 host" is not too serious. =A0At least for no=
w...

I do agree that the solution might well be to just stop using the
non-int80 vdsos for UML. That may just solve everything in practice.

Linus

Al Viro

2011-08-23 21:17:44 UTC

Permalink

On Tue, Aug 23, 2011 at 12:43:28PM -0700, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 12:41 PM, Al Viro <***@zeniv.linux.org.uk> wrote:
> >
> > And it's not cheap - doing that on each syscall will be unpleasant...
> > Frankly, I'd rather stopped telling the uml userland about vdso in such
> > setups. ?And anything that plays with SYSCALL outside of vdso... <shrug>
> > we already have a "don't run it native on 32bit", adding "don't run
> > it on 32bit uml on amd64 host" is not too serious. ?At least for now...
>
> I do agree that the solution might well be to just stop using the
> non-int80 vdsos for UML. That may just solve everything in practice.

SYSENTER works fine, actually... And we can easily check if we have an
affected SYSCALL, simply by forking a child, tracing it into a syscall
and doing POKEUSER to ebp on the second stop (i.e. on the way out).
If the value ends up in ecx after __kernel_vsyscall(), we have SYSCALL-based
variant on amd64 host, if it's lost completely - it's SYSENTER, if it shows
up in ebp - int 0x80.

Andrew Lutomirski

2011-08-23 01:16:50 UTC

Permalink

On Aug 22, 2011 9:01 PM, "Al Viro" <***@zeniv.linux.org.uk> wrote:
>
> On Mon, Aug 22, 2011 at 05:22:07PM -0700, Linus Torvalds wrote:
>
> > You guys seem to positively _want_ to make this a bigger issue than it
> > is. As far as I can tell, nobody has ever even noticed this problem
> > before, and we already have a trivial fix ("don't do it then") for the
> > case Al actually did notice.
>
> I'm not sure. What I've noticed was complete weirdness in uml,
apparently
> dependent on SYSCALL vs. SYSENTER or int 0x80. With following round of
> RTFS and finding what looks like an ugly issue with restarts.
>
> What I don't understand at all is how the hell do we manage to avoid much
> more obvious breakage all the time we ptrace a 32bit task. Look:
> __kernel_vsyscall() is
> push %ebp
> movl %ecx, %ebp
> syscall
> movl $__USER32_DS, %ecx
> movl %ecx, %ss
> movl %ebp, %ecx
> popl %ebp
>
> now, what is going to happen to %ebp if we go through IRET path, for any
> reason? From my reading it appears that right after that IRET we'll have
> ebp containing arg6. I.e. what we'd pushed on stack. Now, popl %ebp
> will bring the same value back. Not a problem. But what about
> movl %ebp, %ecx? Again, I'm talking about the case when we have no
> restart at all - just an strace(1) tracing a process.

I suspect that very few things care whether syscall arguments get
clobbered. The only way it would matter is if gcc reuses the argument in
the ecx slot after an inlined syscall later in the same function. Any code
that does that is already wrong if the syscall restarts with changed ecx or
if something like UML changes the syscall argument.

--Andy

>
> AFAICS, in that case we ought to have %ecx == %ebp after return from
> __kernel_vsyscall(). Which would blow the things up _very_ fast.
>
> So what the hell am I missing here?

H. Peter Anvin

2011-08-23 01:18:15 UTC

Permalink

On 08/22/2011 06:16 PM, Andrew Lutomirski wrote:
>
> I suspect that very few things care whether syscall arguments get
> clobbered. The only way it would matter is if gcc reuses the argument
> in the ecx slot after an inlined syscall later in the same function.
> Any code that does that is already wrong if the syscall restarts with
> changed ecx or if something like UML changes the syscall argument.
>

No, the glibc wrapper for the system call Al was looking at used %ecx to
hold a copy of the PIC pointer (normally %ebx)!

-hpa

H. Peter Anvin

2011-08-22 23:46:24 UTC

Permalink

On 08/22/2011 04:27 PM, Linus Torvalds wrote:
> On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <***@zytor.com> wrote:
>>
>> However, we could just issue a SIGILL or SIGSEGV at this point; the same
>> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
>> consistent with Intel CPUs here.
>
> Considering that this is not a remotely new issue, and that it has
> been around for years without anybody even noticing, I'd really prefer
> to just fix things going forwards rather than add any code to actively
> break any possible unlucky legacy users.
>
> So I think the "let's fix the vdso case for sysenter" + "let's remove
> the 32-bit syscall vdso" is the right solution. If somebody has
> hardcoded syscall instructions, or generates them dynamically with
> some JIT, that's their problem. We'll continue to support it as well
> as we ever have (read: "almost nobody will ever notice").
>
> One thing we *could* do is to just say "we never restart a x86-32
> 'syscall' instruction at all", and just make such a case return EINTR.
> IOW, do something along the lines of the appended pseudo-patch.
>
> Because returning -EINTR is always "almost correct".
>

I have to say it worries me from a potential security hole point of
view, especially since it clearly isn't very well trod ground to begin
with. An almost-never-used path with access to the full system call
suite is scarier than hell in that sense.

Keep in mind support for SYSCALL32 is already (vendor-)conditional.

(The obvious solution of just putting the proper register frame back in
its place would be okay except for totally breaking anything
trace-on-exit as already hashed to death...)

-hpa

Al Viro

2011-08-22 04:07:59 UTC

Permalink

On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <***@zeniv.linux.org.uk> wrote:
> >
> > Is that ability a part of userland ABI or are we declaring that hopelessly
> > wrong and require to go through the function in vdso32? ?Linus?
>
> If people are using syscall directly, we're pretty much stuck. No
> amount of "that's hopelessly wrong" will ever matter. We don't break
> existing binaries.

There's a funny part, though - such binary won't work on 32bit kernel.
AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
doesn't provide a SYSCALL-based variant).

So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
we do something with entry_32.S as well? I don't think it's worth doing,
TBH...

Again, I very much hope that binaries with such stray SYSCALL simply do
not exist. In theory it's possible to write one, but...

IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
and (again, IIRC) it had some differences in SYSCALL semantics compared to
K7 (which supports SYSENTER as well). Bugger if I remember what those
differences might've been... Some flag not cleared?

H. Peter Anvin

2011-08-22 04:11:54 UTC

Permalink

On 08/21/2011 09:07 PM, Al Viro wrote:
> On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
>> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <***@zeniv.linux.org.uk> wrote:
>>>
>>> Is that ability a part of userland ABI or are we declaring that hopelessly
>>> wrong and require to go through the function in vdso32? ?Linus?
>>
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>
> There's a funny part, though - such binary won't work on 32bit kernel.
> AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
> doesn't provide a SYSCALL-based variant).
>
> So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
> we do something with entry_32.S as well? I don't think it's worth doing,
> TBH...
>
> Again, I very much hope that binaries with such stray SYSCALL simply do
> not exist. In theory it's possible to write one, but...
>
> IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
> and (again, IIRC) it had some differences in SYSCALL semantics compared to
> K7 (which supports SYSENTER as well). Bugger if I remember what those
> differences might've been... Some flag not cleared?

The most likely reason for a binary to execute a stray SYSCALL is
because they read it out of the vdso. Totally daft, but we certainly
see a lot of stupid things as evidenced by the JIT thread earlier this
month.

In that sense, a "safe" thing would be to drop use of SYSCALL for 32-bit
processes... I just sent Borislav a query about the cost.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Al Viro

2011-08-22 04:26:05 UTC

Permalink

On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
> > lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
> > and (again, IIRC) it had some differences in SYSCALL semantics compared to
> > K7 (which supports SYSENTER as well). Bugger if I remember what those
> > differences might've been... Some flag not cleared?
>
> The most likely reason for a binary to execute a stray SYSCALL is
> because they read it out of the vdso. Totally daft, but we certainly
> see a lot of stupid things as evidenced by the JIT thread earlier this
> month.

Um... What, blindly, no matter what surrounds it in there? What will
happen to the same eager JIT when it steps on SYSENTER?

H. Peter Anvin

2011-08-22 05:03:33 UTC

Permalink

On 08/21/2011 09:26 PM, Al Viro wrote:
> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>> K7 (which supports SYSENTER as well). Bugger if I remember what those
>>> differences might've been... Some flag not cleared?
>>
>> The most likely reason for a binary to execute a stray SYSCALL is
>> because they read it out of the vdso. Totally daft, but we certainly
>> see a lot of stupid things as evidenced by the JIT thread earlier this
>> month.
>
> Um... What, blindly, no matter what surrounds it in there? What will
> happen to the same eager JIT when it steps on SYSENTER?

The JIT will have had to manage SYSENTER already. It's not a change,
whereas SYSCALL would be. We could just try it, and see if anything
breaks, of course.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Andrew Lutomirski

2011-08-23 05:10:41 UTC

Permalink

On 08/22/2011 01:03 AM, H. Peter Anvin wrote:
> On 08/21/2011 09:26 PM, Al Viro wrote:
>> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>>> K7 (which supports SYSENTER as well). Bugger if I remember what those
>>>> differences might've been... Some flag not cleared?
>>>
>>> The most likely reason for a binary to execute a stray SYSCALL is
>>> because they read it out of the vdso. Totally daft, but we certainly
>>> see a lot of stupid things as evidenced by the JIT thread earlier this
>>> month.
>>
>> Um... What, blindly, no matter what surrounds it in there? What will
>> happen to the same eager JIT when it steps on SYSENTER?
>
> The JIT will have had to manage SYSENTER already. It's not a change,
> whereas SYSCALL would be. We could just try it, and see if anything
> breaks, of course.

Here's a possible solution that works for standalone SYSCALL and vdso
SYSCALL. The idea is to preserve the exact same SYSCALL invocation
sequence. Logically, the SYSCALL instruction does:

push %ebp
mov %ebp,%ecx
mov 4(%esp),%ebp
call __fake_int80

and __fake_int80 is:
int 0x80
mov 4(%esp),%ebp
ret $4

The entire system call sequence is then (effectively):

push %ebp
movl %ecx,%ebp

; "SYSCALL" starts here
push %ebp
mov %ebp,%ecx
mov 4(%esp),%ebp
call __fake_int80
; "SYSCALL ends here

movl %ebp,%ecx
popl %ebp
ret

So we rearrange ebp and ecx and then immediately rearrange them back.
The landing point tweaks them again so that we preserve the old
semantics of SYSCALL. But now the pt_regs values exactly match what
would have happened if we entered via the int 0x80 path, so there
shouldn't be any corner cases with ptrace or restart -- as far as either
one is concerned, we actually entered via int 0x80. If we deliver a
signal, the signal handler returns to the int 0x80 instruction.

Am I missing something? Extremely buggy, incomplete code that
implements this is:

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..6cda8ce 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -291,24 +291,59 @@ ENTRY(ia32_cstar_target)
ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,0,0
movl %eax,%eax /* zero extension */
- movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
- movq %rcx,RIP-ARGOFFSET(%rsp)
- CFI_REL_OFFSET rip,RIP-ARGOFFSET
- movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
- movl %ebp,%ecx
+
+ /*
+ * This does (from the user's point of view):
+ * push %ebp
+ * mov %ebp, %ecx
+ * mov 4(%esp), %ebp
+ * call <function that does int 0x80; mov 4(%esp),%ebp; ret 4>
+ *
+ * User address access does not need access_ok check as r8
+ * has been zero-extended, so even with the offsets it cannot
+ * exceed 2**32 + 8.
+ */
+
+ /* XXX: need to check that vdso actually exists. */
+ /* XXX: ia32_badarg may do bad things to the user state. */
+
+ /* move ebp into place on the user stack */
+ 1: movl %ebp,-4(%r8)
+ .section __ex_table,"a"
+ .quad 1b,ia32_badarg
+ .previous
+
+ /* move eip into place on the user stack */
+ 1: movl %ecx,-8(%r8) /* user eip is in ecx */
+ .section __ex_table,"a"
+ .quad 1b,ia32_badarg
+ .previous
+
+ /* move ebp to ecx in CPU registers and argument save area */
+ mov %ebp,%ecx
+ movq %ecx,RCX-ARGOFFSET(%rsp)
+
+ /*
+ * move arg6 to ebp in CPU registers and argument save area
+ * minor optimization: the actual value of ebp is irrelevent,
+ * so stick it straight into r9d -- see the definition of
+ * IA32_ARG_FIXUP.
+ */
+1: movl (%r8),%r9d
+ .section __ex_table,"a"
+ .quad 1b,ia32_badarg
+ .previous
+
+ /* Do the fake call */
+ movl [insert address of int 0x80; ret helper + 2 here],RIP-ARGOFFSET(%rsp)
+ subl $8,%r8 /* we pushed twice */
+
movq $__USER32_CS,CS-ARGOFFSET(%rsp)
movq $__USER32_DS,SS-ARGOFFSET(%rsp)
movq %r11,EFLAGS-ARGOFFSET(%rsp)
/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
movq %r8,RSP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rsp,RSP-ARGOFFSET
- /* no need to do an access_ok check here because r8 has been
- 32bit zero extended */
- /* hardware stack frame is complete now */
-1: movl (%r8),%r9d
- .section __ex_table,"a"
- .quad 1b,ia32_badarg
- .previous
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
index 5415b56..a3e48b0 100644
--- a/arch/x86/vdso/vdso32/syscall.S
+++ b/arch/x86/vdso/vdso32/syscall.S
@@ -19,8 +19,8 @@ __kernel_vsyscall:
.Lpush_ebp:
movl %ecx, %ebp
syscall
- movl $__USER32_DS, %ecx
- movl %ecx, %ss
+ /* The ret in the fake int80 entry lands here */
+ /* ss is already correct AFAICS */
movl %ebp, %ecx
popl %ebp
.Lpop_ebp:
@@ -28,6 +28,11 @@ __kernel_vsyscall:
.LEND_vsyscall:
.size __kernel_vsyscall,.-.LSTART_vsyscall

+__kernel_vsyscall_fake_int80:
+ int 0x80
+ mov 4(%esp),%ebp
+ ret $4
+
.section .eh_frame,"a",@progbits
.LSTARTFRAME:
.long .LENDCIE-.LSTARTCIE

This could be further simplified by checking if any work flags are set and bailing immediately to the right place in the int 0x80 entry.

--Andy