7-STABLE broke drscheme in week between 4 and 11 Jan

Discussion:

(too old to reply)

Andrew Reilly

2008-01-20 02:49:15 UTC

Hi there,

I'm still working on debugging this myself, but thought that a
few more experienced eyes might be able to help me.

I'm tracking 7-STABLE on my amd64 system, but something
happened a couple of weeks ago that broke lang/drscheme.
I've been doing a bit of regressing and testing, and have
found that a mred executable built against a 7-STABLE
checkout of 2008.01.04.00.00.00 works fine, but the exact
same binary crashes with a SEGV on a kernel check-out of
2008.01.11.00.00.00. It's a bit hard to debug, because the
code in question is a twisty maze of macros, #ifs and inlines,
because it's in the garbage collector, and that's quite
platform-sensitive.

Anyway, the last part of the ktrace of the broken version (the
earlier parts are just loading up shared libraries) looks like:
(I sedded the ^pid out, so that I could get a better look at it
with diff (meld, actually: it's nice)).

mredid mred RET mmap 5496832/0x80053e000
mredid mred CALL mmap(0,0x200000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
mredid mred RET mmap 57888768/0x803735000
mredid mred CALL munmap(0x803735000,0xcb000)
mredid mred RET munmap 0
mredid mred CALL munmap(0x803900000,0x35000)
mredid mred RET munmap 0
mredid mred CALL thr_self(0x803801120)
mredid mred RET thr_self 0
mredid mred CALL mmap(0x7fffffbff000,0x1000,PROT_NONE,MAP_ANON,0xffffffff,0)
mredid mred RET mmap -4198400/0x7fffffbff000
mredid mred CALL thr_set_name(0x1875d,0x802de3800)
mredid mred RET thr_set_name 0
mredid mred CALL rtprio_thread(0,0x1875d,0x7fffffffe110)
mredid mred RET rtprio_thread 0
mredid mred CALL sysarch(0x81,0x7fffffffe130)
mredid mred RET sysarch 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x7fffffffe150,0x7fffffffe140)
mredid mred RET sigprocmask 0
mredid mred CALL sigaction(SIG 32,0x7fffffffe110,0)
mredid mred RET sigaction 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x7fffffffe140,0)
mredid mred RET sigprocmask 0
mredid mred CALL sigprocmask(SIG_BLOCK,0x800633cc0,0x7fffffffe190)
mredid mred RET sigprocmask 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x800633cd0,0)
mredid mred RET sigprocmask 0
mredid mred CALL getrlimit(RLIMIT_DATA,0x7fffffffe3a0)
mredid mred RET getrlimit 0
mredid mred CALL mmap(0,0x104000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
mredid mred RET mmap 59768832/0x803900000
mredid mred PSIG SIGSEGV SIG_DFL
mredid mred NAMI "mred.core"

The equivalent section of the version from 4 Jan (that works
fine) looks like:

mredid mred RET mmap 5496832/0x80053e000
mredid mred CALL mmap(0,0x200000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
mredid mred RET mmap 57835520/0x803728000
mredid mred CALL munmap(0x803728000,0xd8000)
mredid mred RET munmap 0
mredid mred CALL munmap(0x803900000,0x28000)
mredid mred RET munmap 0
mredid mred CALL thr_self(0x803801120)
mredid mred RET thr_self 0
mredid mred CALL mmap(0x7fffffbff000,0x1000,PROT_NONE,MAP_ANON,0xffffffff,0)
mredid mred RET mmap -4198400/0x7fffffbff000
mredid mred CALL thr_set_name(0x187ba,0x802dd6800)
mredid mred RET thr_set_name 0
mredid mred CALL rtprio_thread(0,0x187ba,0x7fffffffe0f0)
mredid mred RET rtprio_thread 0
mredid mred CALL sysarch(0x81,0x7fffffffe110)
mredid mred RET sysarch 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x7fffffffe130,0x7fffffffe120)
mredid mred RET sigprocmask 0
mredid mred CALL sigaction(SIG 32,0x7fffffffe0f0,0)
mredid mred RET sigaction 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x7fffffffe120,0)
mredid mred RET sigprocmask 0
mredid mred CALL sigprocmask(SIG_BLOCK,0x800633cc0,0x7fffffffe170)
mredid mred RET sigprocmask 0
mredid mred CALL sigprocmask(SIG_SETMASK,0x800633cd0,0)
mredid mred RET sigprocmask 0
mredid mred CALL getrlimit(RLIMIT_DATA,0x7fffffffe380)
mredid mred RET getrlimit 0
mredid mred CALL mmap(0,0x104000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
mredid mred RET mmap 59768832/0x803900000
mredid mred CALL mmap(0,0x300000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
mredid mred RET mmap 60833792/0x803a04000
mredid mred CALL munmap(0x803a04000,0xfc000)
mredid mred RET munmap 0
mredid mred CALL munmap(0x803d00000,0x4000)
mredid mred RET munmap 0
mredid mred CALL sigaction(SIGSEGV,0x7fffffffe360,0x7fffffffe340)
mredid mred RET sigaction 0
mredid mred CALL __sysctl(0x7fffffffd8f0,0x2,0x7fffffffd950,0x7fffffffd8e8,0,0)
mredid mred RET __sysctl 0
mredid mred CALL socket(PF_LOCAL,SOCK_STREAM,0)
mredid mred RET socket 3
mredid mred CALL __sysctl(0x7fffffffd910,0x2,0x7fffffffd970,0x7fffffffd908,0,0)
mredid mred RET __sysctl 0
mredid mred CALL __sysctl(0x7fffffffd8b0,0x2,0x7fffffffd8f0,0x7fffffffd8a8,0,0)
mredid mred RET __sysctl 0
mredid mred CALL connect(0x3,0x7fffffffd9f0,0x13)
mredid mred NAMI "/tmp/.X11-unix/X0"
mredid mred RET connect 0
[...and so on...]

So it looks as though it's in a section just before it
establishes it's SIGSEGV handler, and so presumably isn't
expecting to get segv'd just yet. If it hadn't been segv'd,
the next thing to happen was that
mredid mred CALL mmap(0,0x300000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)

the faulting instruction is:
0x0000000800bdecc6 <GC_init_type_tags+598>: mov %r13,(%rdx,%rax,8)

r13 is 0x803900000
rdx is -1
rax is 0xe40

Any thoughts on why that would be faulting? (Looks like a write
to a very low address (code?) to me, but I'm not up on the VM
map yet.)

The only thing that looks appropriate that changed in that week
was sys/vm/vm_map.c, which had some new code added to help with
shm mappings. I looked at it, but it didn't look as though it
would affect this.

If anyone has suggestions about what I could
investigate/change/poke next, I'd appreciate it.

Cheers,

--
Andrew

John Baldwin

2008-01-23 14:54:27 UTC

Permalink

Post by Andrew Reilly
Hi there,
I'm still working on debugging this myself, but thought that a
few more experienced eyes might be able to help me.
I'm tracking 7-STABLE on my amd64 system, but something
happened a couple of weeks ago that broke lang/drscheme.
I've been doing a bit of regressing and testing, and have
found that a mred executable built against a 7-STABLE
checkout of 2008.01.04.00.00.00 works fine, but the exact
same binary crashes with a SEGV on a kernel check-out of
2008.01.11.00.00.00. It's a bit hard to debug, because the
code in question is a twisty maze of macros, #ifs and inlines,
because it's in the garbage collector, and that's quite
platform-sensitive.
Anyway, the last part of the ktrace of the broken version (the
(I sedded the ^pid out, so that I could get a better look at it
with diff (meld, actually: it's nice)).

There were changes to make binaries get SIGSEGV instead of SIGBUS (or vice
versa) for certain VM-related traps. That might be related in which case
the source code for the app will need to catch a different signal. You can
test this by fiddling with the machdep.prot_fault_translation sysctl.

Post by Andrew Reilly
So it looks as though it's in a section just before it
establishes it's SIGSEGV handler, and so presumably isn't
expecting to get segv'd just yet. If it hadn't been segv'd,
the next thing to happen was that
mredid mred CALL mmap(0,0x300000,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
0x0000000800bdecc6 <GC_init_type_tags+598>: mov %r13,(%rdx,%rax,8)
r13 is 0x803900000
rdx is -1
rax is 0xe40
Any thoughts on why that would be faulting? (Looks like a write
to a very low address (code?) to me, but I'm not up on the VM
map yet.)

rdx should be a pointer to an array or some such, but it is a bogus
pointer and that is why you are faulting.

Post by Andrew Reilly
The only thing that looks appropriate that changed in that week
was sys/vm/vm_map.c, which had some new code added to help with
shm mappings. I looked at it, but it didn't look as though it
would affect this.

Those changes were only in HEAD, so if you are seeing them in your kernel
you are running HEAD and not RELENG_7.

--
John Baldwin

Andrew Reilly

2008-01-23 20:34:23 UTC

Permalink

Hi John,

Post by John Baldwin

Post by Andrew Reilly
Anyway, the last part of the ktrace of the broken version (the
(I sedded the ^pid out, so that I could get a better look at it
with diff (meld, actually: it's nice)).

There were changes to make binaries get SIGSEGV instead of SIGBUS
(or vice
versa) for certain VM-related traps. That might be related in
which case
the source code for the app will need to catch a different signal.
You can
test this by fiddling with the machdep.prot_fault_translation sysctl.

That wasn't the problem: version 372 already had a patch to use SIGSEGV.

Post by John Baldwin

Post by Andrew Reilly
0x0000000800bdecc6 <GC_init_type_tags+598>: mov %r13,(%rdx,%rax,8)
r13 is 0x803900000
rdx is -1
rax is 0xe40
Any thoughts on why that would be faulting? (Looks like a write
to a very low address (code?) to me, but I'm not up on the VM
map yet.)

rdx should be a pointer to an array or some such, but it is a bogus
pointer and that is why you are faulting.

Yes, that was indeed the problem. The garbage collector was
expecting memory returned by malloc to be zero, which, of course, it
wasn't. It seems as though it was simply luck that the particular
access was reliably zero before this particular change to FreeBSD. I
should get more used to using the debugging support in our malloc, to
grunge up malloc'd memory.

Post by John Baldwin

Those changes were only in HEAD, so if you are seeing them in your
kernel
you are running HEAD and not RELENG_7.

Yes. I did the binary-search thing, and found that the actual change
that broke things was an MFC of src/contrib/gcc/gthr-posix.h, on 5
Jan. There's no obvious (to me) reason why that change would have
affected the malloc system, but there must have been some epsilon of
memory alignment that pushed a non-zero value into the particular
memory that was being returned and tested in that instance. An
entertaining debugging experience...

Patches to drscheme have been accepted up-stream, so all should be
dandy very soon.

Cheers,

Andrew