Discussion:
problem with migrating to shadow master
rumpelkeks
2010-11-22 14:12:41 UTC
Permalink
Hello list,

I have just discovered that I can't seem to migrate my qmaster service
any more (it definitely used to work).

I get the "old qmaster did not write lock file. Cannot migrate qmaster."
error. If I manually touch a file called 'lock' in the spool directory,
all works fine. Oh plus it successfully shuts down the qmaster (just
never starts one). So I've probably traced it down to 'no lock file'.

Have found nothing in the logs to explain it either.

Now. Need some pointers for further debugging.

I don't quite understand this 'locking' mechanism - when and where (and
where to) is the 'lock' written? (I can't really find anything in the
startup script that writes the file, only things that check for it). Is
this something that the old qmaster writes when it's shutting down and
the new one only starts once it appears? (There certainly is no 'lock'
file in the spool directory when the qmaster is running.)

Tina

PS the sgeadmin user that the qmaster runs as can most certainly write
into the spool directory
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
reuti
2010-11-22 19:48:52 UTC
Permalink
Post by rumpelkeks
Hello list,
I have just discovered that I can't seem to migrate my qmaster service
any more (it definitely used to work).
I get the "old qmaster did not write lock file. Cannot migrate qmaster."
error. If I manually touch a file called 'lock' in the spool directory,
all works fine. Oh plus it successfully shuts down the qmaster (just
never starts one). So I've probably traced it down to 'no lock file'.
Have found nothing in the logs to explain it either.
Now. Need some pointers for further debugging.
I don't quite understand this 'locking' mechanism - when and where (and
where to) is the 'lock' written? (I can't really find anything in the
startup script that writes the file, only things that check for it). Is
this something that the old qmaster writes when it's shutting down and
the new one only starts once it appears? (There certainly is no 'lock'
file in the spool directory when the qmaster is running.)
AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and starts a new qmaster.
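Roughly, I picture the takeover logic like this (just a pseudo-shell sketch of the idea, not the actual shadowd code; paths assume the usual cell layout and the interval is made up):

SPOOL=$SGE_ROOT/$SGE_CELL/spool/qmaster
old=$(cat "$SPOOL/heartbeat")
sleep 60                          # wait one monitoring interval
new=$(cat "$SPOOL/heartbeat")
if [ "$old" = "$new" ]; then
    # heartbeat no longer updated -> qmaster presumed dead, take over
    "$SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_qmaster"
fi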

Do you have a file $SGE_ROOT/default/common/shadow_masters?

-- Reuti
Post by rumpelkeks
Tina
PS the sgeadmin user that the qmaster runs as can most certainly write
into the spool directory
rumpelkeks
2010-11-23 09:34:49 UTC
Permalink
Hi Reuti,
Post by reuti
Post by rumpelkeks
Hello list,
I have just discovered that I can't seem to migrate my qmaster service
any more (it definitely used to work).
I get the "old qmaster did not write lock file. Cannot migrate qmaster."
error. If I manually touch a file called 'lock' in the spool directory,
all works fine. Oh plus it successfully shuts down the qmaster (just
never starts one). So I've probably traced it down to 'no lock file'.
Have found nothing in the logs to explain it either.
Now. Need some pointers for further debugging.
I don't quite understand this 'locking' mechanism - when and where (and
where to) is the 'lock' written? (I can't really find anything in the
startup script that writes the file, only things that check for it). Is
this something that the old qmaster writes when it's shutting down and
the new one only starts once it appears? (There certainly is no 'lock'
file in the spool directory when the qmaster is running.)
AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a now qmaster.
Do you have a file $SGE_ROOT/default/common/shadow_masters?
-- Reuti
I do have a shadow_masters file, yes.

Failover in case one dies appears to work (i.e. if I kill the qmaster
process, eventually the shadow master starts it). It's the manual
migrate (by calling the startup script with -migrate) that's not working
at the moment.

I do not appear to have any lock file 'in residence' though. I've
discovered (further testing) that it appears to work when migrating from
'master B' to 'master A', but not from 'master A' to 'master B'. With the
successful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock
makes a brief appearance. It is, however, not there in normal operation
- it looks to me as if the master creates it only as it shuts down, which
puzzled me. (In the case of the unsuccessful migration attempt, there
never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file. Also, if I
manually touch one, migration works.)
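(For reference, what I'm running is essentially this, as sgeadmin, on the machine that is supposed to become the new master - paths assume the standard cell layout:

. $SGE_ROOT/$SGE_CELL/common/settings.sh
$SGE_ROOT/$SGE_CELL/common/sgemaster -migrate

nothing fancier than that.)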

Tina
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
reuti
2010-11-23 21:56:08 UTC
Permalink
Hi Tina,
Post by rumpelkeks
Post by reuti
Post by rumpelkeks
I have just discovered that I can't seem to migrate my qmaster service
any more (it definitely used to work).
I get the "old qmaster did not write lock file. Cannot migrate qmaster."
error. If I manually touch a file called 'lock' in the spool directory,
all works fine. Oh plus it successfully shuts down the qmaster (just
never starts one). So I've probably traced it down to 'no lock file'.
Have found nothing in the logs to explain it either.
Now. Need some pointers for further debugging.
I don't quite understand this 'locking' mechanism - when and where (and
where to) is the 'lock' written? (I can't really find anything in the
startup script that writes the file, only things that check for it). Is
this something that the old qmaster writes when it's shutting down and
the new one only starts once it appears? (There certainly is no 'lock'
file in the spool directory when the qmaster is running.)
AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a now qmaster.
Do you have a file $SGE_ROOT/default/common/shadow_masters?
-- Reuti
I do have a shadow_masters file, yes.
Failover in case one dies appears to work (i.e. if I kill the qmaster
process, eventually the shadow master starts it). It's the manual
migrate (by calling the startup script with -migrate) that's not working
at the moment.
I do not appear to have any lock file 'in residence' though. I've
discovered (further testing) that it appears to work when migrating from
'master B' to 'master A', but not 'master A' to 'master B'. With the
successsful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock
makes a brief appearance. It is, however, not there in normal operation
- looks to me as if the shutting down master creates it only on
shutdown, which puzzled me. (In case of the unsuccessful migration
attempt, there never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file.
Also if I manually touch one, migration works.)
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and it prevents a shadowd from acting on an unchanged heartbeat file, as -migrate will first shut down the current master and then start its own.

Do you want to have two qmaster hosts that can each take over when the other is down (i.e. two-way), with a shadowd running on both of them?
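As far as I understand it, -migrate boils down to roughly this (my simplified reading, not the literal script; the timeout value is made up):

qconf -km                         # ask the running qmaster to shut down
n=0                               # then wait for it to confirm a clean shutdown
while [ ! -f "$SGE_ROOT/$SGE_CELL/spool/qmaster/lock" ] && [ $n -lt 60 ]; do
    sleep 1; n=$((n+1))
done
if [ ! -f "$SGE_ROOT/$SGE_CELL/spool/qmaster/lock" ]; then
    echo "old qmaster did not write lock file. Cannot migrate qmaster."
    exit 1
fi
"$SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_qmaster"    # start the new qmaster locally

So if the old qmaster never writes the lock file, the new one is never started - which matches what you see.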

-- Reuti

rumpelkeks
2010-11-24 10:12:58 UTC
Permalink
Hi,
Post by reuti
Post by rumpelkeks
Post by reuti
Post by rumpelkeks
I have just discovered that I can't seem to migrate my qmaster service
any more (it definitely used to work).
I get the "old qmaster did not write lock file. Cannot migrate qmaster."
error. If I manually touch a file called 'lock' in the spool directory,
all works fine. Oh plus it successfully shuts down the qmaster (just
never starts one). So I've probably traced it down to 'no lock file'.
Have found nothing in the logs to explain it either.
Now. Need some pointers for further debugging.
I don't quite understand this 'locking' mechanism - when and where (and
where to) is the 'lock' written? (I can't really find anything in the
startup script that writes the file, only things that check for it). Is
this something that the old qmaster writes when it's shutting down and
the new one only starts once it appears? (There certainly is no 'lock'
file in the spool directory when the qmaster is running.)
AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a now qmaster.
Do you have a file $SGE_ROOT/default/common/shadow_masters?
-- Reuti
I do have a shadow_masters file, yes.
Failover in case one dies appears to work (i.e. if I kill the qmaster
process, eventually the shadow master starts it). It's the manual
migrate (by calling the startup script with -migrate) that's not working
at the moment.
I do not appear to have any lock file 'in residence' though. I've
discovered (further testing) that it appears to work when migrating from
'master B' to 'master A', but not 'master A' to 'master B'. With the
successsful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock
makes a brief appearance. It is, however, not there in normal operation
- looks to me as if the shutting down master creates it only on
shutdown, which puzzled me. (In case of the unsuccessful migration
attempt, there never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file.
Also if I manually touch one, migration works.)
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).

Yes, I do want (and have) two servers running a shadowd (and one running
a qmaster) so they can take over if one fails - which, from my tests, is
still working (I'll do some more testing).

However, not actually being able to cleanly migrate (well, not unless I
manually fake a lock file) is annoying. It's a useful feature, I think; I
stumbled upon this problem when I wanted to migrate off the current
master to be able to take it down for maintenance. I am sure it used to
work, I tested it a lot when I set up the shadow master.

This was before I upgraded from 6.2u2 to 6.2u4 though - did the way a
migration is handled change between u2 and u4, to anyone's knowledge? I'm
trying to find out if this is a problem within SGE (odd timing or
something), or a problem with my setup (which I don't think has changed
since this was working). I can't find a lot of information about the
actual mechanism (i.e. who is supposed to write the lock file, and when;
stuff like that), which limits my debugging capabilities
a bit :)

Tina
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
reuti
2010-11-25 19:14:47 UTC
Permalink
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? Is the common directory also on the share?

Can both machines write to these shares?
Post by rumpelkeks
Yes I do want (and have) two servers running a shadowd (and one running
a qmaster) so they can take over if one fails. Which from my tests is
stil working (I'll do some more testing).
However, not actually being able to cleanly migrate (well, not unless I
manually fake a lock file to appear) is annoying. It's a useful feature,
I think; I stumbled upon this problem when I wanted to migrate of the
current master to be able to take it down for maintenance. I am sure it
used to work, I tested it a lot when I set up the shadow master.
This was before I upgraded to 6.2u4 from 6.2u2 though - did the
mechanism on how a migration is handled change between u2 and u4 to
anyone's knowledge? I'm trying to find out if this is a problem within
SGE (odd timing or something), or a problem with my setup (which I don't
think changed since this was working). I can't fine a lot of information
about about the actual mechanism (i.e. who is supposed to write the lock
file, and when; stuff like that), which limits my debugging capabilities
If it's a timing issue, you should at least see the lock file on the machine where it was created, as that machine should already have it in its cache; only the NFS server might get the write later.

Maybe running SGE in debug mode will show more; if I read the source right, the creation of the lock file should show up there when it happens.
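Something along these lines should do it (assuming the dl.sh helper is in $SGE_ROOT/util as usual):

. $SGE_ROOT/$SGE_CELL/common/settings.sh
. $SGE_ROOT/util/dl.sh            # defines the 'dl' function that sets SGE_DEBUG_LEVEL
dl 1                              # switch on the first debug class
"$SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_qmaster"

and then trigger the migration; with the debug level set the qmaster should write its trace to the terminal, hopefully including the lock file handling around shutdown.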

-- Reuti
Post by rumpelkeks
a bit :)
Tina
rumpelkeks
2010-11-26 09:20:22 UTC
Permalink
Hello,
Post by reuti
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? The common directory is also on the share?
Yes, yes,
Post by reuti
Both machines can also write to these shares?
and yes:

-bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14
/dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock

(this was as sgeadmin on the current master, into the spool directory)

Accounting & reporting files, and logs etc, are all being written; the
heartbeat file is updated; and, as I said, if I manually create a lock
file prior to calling migrate it is very certainly removed.

<snip>
Post by reuti
If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should have it already in his cache. Only the NFS share might get the final write later.
Good point. It should not be the master going down that removes it but
the new one starting up; so it should turn up eventually (even if too
late for the migration) - which it doesn't. It looks rather like one of
my two shadow hosts can create it and the other can't - but I can write
to the share from both machines (as sgeadmin), and the way they were
installed is the same. The only difference is that one of the machines is
64bit and the other 32bit (the test ones, that is; my 'real' qmasters are both 64bit).
Post by reuti
Maybe running SGE in debug mode will show more, as the creation should show up there when it's happening if I get the source right.
I'll see if I can try that on my test cluster cell.

Tina
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
rumpelkeks
2010-11-26 11:43:12 UTC
Permalink
I have found something, maybe. On my (64bit) qmaster, SGE segfaults when
you try to stop it. Every time you try to stop it - it's quite reproducible.

Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at
000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at
0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at
0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at
000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4

...probably when it tries to write the lock file, as it doesn't segfault
if there is already a lock file. I'll try to trace it to see where, exactly.
My other machine doesn't do this (segfault).
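(What I'll probably try is to let it dump core and look at the backtrace, roughly:

ulimit -c unlimited               # allow core dumps in this shell
$SGE_ROOT/$SGE_CELL/common/sgemaster
# ...stop it again to reproduce the crash, then:
gdb "$SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_qmaster" core

assuming core dumps aren't disabled system-wide; the core should end up in the daemon's working directory, probably the qmaster spool.)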

Tina
Post by rumpelkeks
Hello,
Post by reuti
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? The common directory is also on the share?
Yes, yes,
Post by reuti
Both machines can also write to these shares?
-bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14
/dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
(this was as sgeadmin on the current master, into the spool directory)
Accounting& reporting files, and logs etc, are all being written; the
heartbeat file is updated; and, as I said, if I manually create a lock
file prior to calling migrate it is very certainly removed.
<snip>
Post by reuti
If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should have it already in his cache. Only the NFS share might get the final write later.
Good point. It should not be the master going down that removes it but
the new one starting up; so it should turn up eventually (even if too
late for the migration; which it doesn't. It looks rather like one of my
two shadow hosts can created it, the other can't - but I can write to
the share from both machines (as sgeadmin), and the way it got installed
is the same. Only difference is that one of the machines is 64bit the
other 32bit (of the test ones that is; my 'real' qmasters are both 64bit).
Post by reuti
Maybe running SGE in debug mode will show more, as the creation should show up there when it's happening if I get the source right.
I'll see if I can try that on my test cluster cell.
Tina
rumpelkeks
2010-11-26 13:49:56 UTC
Permalink
Ah, not quite. It seems it segfaults regardless. So my problem isn't exactly
'migrate not working' but rather 'SGE segfaults when stopped'.

I've straced migrate processes... the unsuccessful ones are all like this:

Process 26724 attached - interrupt to quit
futex(0x74302c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x743000, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7314ac, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x731480, 2) = 1
futex(0x731480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 10
write(10, "11/26/2010 13:33:52| main|cs04r"..., 78) = 78
close(10) = 0
futex(0x735d8c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x735d88, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x735db8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x46d129d0, FUTEX_WAIT, 26741, NULL) = 0
futex(0x459109d0, FUTEX_WAIT, 26739, NULL) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 26724 detached

in the successful ones, the same thing looks like this:

Process 18719 attached - interrupt to quit
futex(0x828e2dc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x828e2c0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x828149c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x8281480, 2) = 1
futex(0x8281480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
gettimeofday({1290778579, 129103}, {0, 0}) = 0
write(10, "11/26/2010 13:36:19| main|pc030"..., 67) = 67
close(10) = 0
gettimeofday({1290778579, 129635}, {0, 0}) = 0
futex(0x828541c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8285418, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8285448, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8c93bd8, FUTEX_WAIT, 18737, NULL) = 0
futex(0x9714540, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x9714540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x712dbd8, FUTEX_WAIT, 18735, NULL) = 0
munmap(0xb69fe000, 10489856) = 0
munmap(0x8293000, 10489856) = 0
futex(0x52bdbd8, FUTEX_WAIT, 18734, NULL) = 0
munmap(0x8c94000, 10489856) = 0
futex(0x48bcbd8, FUTEX_WAIT, 18733, NULL) = 0
munmap(0x672d000, 10489856) = 0
munmap(0x712e000, 10489856) = 0
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 9
gettimeofday({1290778589, 3646}, {0, 0}) = 0
write(9, "11/26/2010 13:36:29| main|pc030"..., 61) = 61
close(9) = 0
open("lock", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 9
close(9) = 0
gettimeofday({1290778589, 6348}, NULL) = 0
gettimeofday({1290778589, 6432}, NULL) = 0
gettimeofday({1290778589, 6518}, NULL) = 0
gettimeofday({1290778589, 6595}, NULL) = 0
gettimeofday({1290778589, 6662}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1290778589, 6982}, NULL) = 0
gettimeofday({1290778589, 7043}, NULL) = 0
.
.
.
clock_gettime(CLOCK_REALTIME, {1290778589, 9691258}) = 0
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 7, {0, 999937742}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 9972}, NULL) = 0
gettimeofday({1290778589, 10026}, NULL) = 0
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_REALTIME, {1290778589, 10217977}) = 0
futex(0x96a7298, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 9, {0, 999808023}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 10516}, NULL) = 0
gettimeofday({1290778589, 10590}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x48bd000, 10489856) = 0
futex(0x96a76dc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a76d8, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7698, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x3ebc000, 10489856) = 0
futex(0x96a7494, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a7490, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7450, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x34bb000, 10489856) = 0
gettimeofday({1290778589, 11812}, NULL) = 0
shutdown(3, 2 /* send and receive */) = 0
close(3) = 0
futex(0x96a3e2c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a3e28, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a3de8, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x5d2c000, 10489856) = 0
exit_group(0) = ?
Process 18719 detached

(that's attaching strace to the sge master process & then calling
migrate from the intended master)
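(Concretely, something along these lines - the /tmp path is just an example:

strace -o /tmp/qmaster.strace -p $(pgrep -x sge_qmaster)    # on the current master
$SGE_ROOT/$SGE_CELL/common/sgemaster -migrate               # on the intended new master

the snippets above are the tail ends of those traces.)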

To me that doesn't look as if it gets to the point where it would write
the lock at all; it dies before that - possibly at the point where it
would do the first munmap?

Does that ring a bell with anyone?

Tina
Post by rumpelkeks
I have found something, maybe. On my (64bit) qmaster, SGE segfaults when
you try to stop it. Everytime you try to stop it, is quite reproduceable.
Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at
000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at
0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at
0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at
000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4
...probably when it tries to write the lock file, as it doesn't segfault
if there is a lock file. I'll try to trace it to see where, exactly.
Doesn't do this on my other machine (segfaulting).
Tina
Post by rumpelkeks
Hello,
Post by reuti
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? The common directory is also on the share?
Yes, yes,
Post by reuti
Both machines can also write to these shares?
-bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14
/dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
(this was as sgeadmin on the current master, into the spool directory)
Accounting& reporting files, and logs etc, are all being written; the
heartbeat file is updated; and, as I said, if I manually create a lock
file prior to calling migrate it is very certainly removed.
<snip>
Post by reuti
If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should have it already in his cache. Only the NFS share might get the final write later.
Good point. It should not be the master going down that removes it but
the new one starting up; so it should turn up eventually (even if too
late for the migration; which it doesn't. It looks rather like one of my
two shadow hosts can created it, the other can't - but I can write to
the share from both machines (as sgeadmin), and the way it got installed
is the same. Only difference is that one of the machines is 64bit the
other 32bit (of the test ones that is; my 'real' qmasters are both 64bit).
Post by reuti
Maybe running SGE in debug mode will show more, as the creation should show up there when it's happening if I get the source right.
I'll see if I can try that on my test cluster cell.
Tina
reuti
2010-11-26 14:01:41 UTC
Permalink
Post by rumpelkeks
Ah, not quite. Seems it segfaults whatever. So, my problem isn't exactly
'migrate not working' but more 'SGE segfaults when stopped'.
In the sgemaster script, `qconf -ks` is used to shut down the scheduler, but the man page states that this option is deprecated. Can you replace the command with `qconf -kt scheduler`?

You can also try `qconf -kt scheduler`, plus `qconf -kt jvm` (if used), and `qconf -km` by hand to check in detail where it's crashing.
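I.e. on the master host, step by step, something like:

qconf -kt scheduler               # stop just the scheduler thread
qconf -kt jvm                     # only if the JVM thread is configured
qconf -km                         # finally shut down the qmaster itself

and watch the qmaster messages file / syslog after each step to see which one triggers the segfault.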

-- Reuti
Post by rumpelkeks
Process 26724 attached - interrupt to quit
futex(0x74302c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x743000, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7314ac, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x731480, 2) = 1
futex(0x731480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 10
write(10, "11/26/2010 13:33:52| main|cs04r"..., 78) = 78
close(10) = 0
futex(0x735d8c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x735d88, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x735db8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x46d129d0, FUTEX_WAIT, 26741, NULL) = 0
futex(0x459109d0, FUTEX_WAIT, 26739, NULL) = 0
Process 26724 detached
Process 18719 attached - interrupt to quit
futex(0x828e2dc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x828e2c0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x828149c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x8281480, 2) = 1
futex(0x8281480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
gettimeofday({1290778579, 129103}, {0, 0}) = 0
write(10, "11/26/2010 13:36:19| main|pc030"..., 67) = 67
close(10) = 0
gettimeofday({1290778579, 129635}, {0, 0}) = 0
futex(0x828541c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8285418, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8285448, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8c93bd8, FUTEX_WAIT, 18737, NULL) = 0
futex(0x9714540, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x9714540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x712dbd8, FUTEX_WAIT, 18735, NULL) = 0
munmap(0xb69fe000, 10489856) = 0
munmap(0x8293000, 10489856) = 0
futex(0x52bdbd8, FUTEX_WAIT, 18734, NULL) = 0
munmap(0x8c94000, 10489856) = 0
futex(0x48bcbd8, FUTEX_WAIT, 18733, NULL) = 0
munmap(0x672d000, 10489856) = 0
munmap(0x712e000, 10489856) = 0
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 9
gettimeofday({1290778589, 3646}, {0, 0}) = 0
write(9, "11/26/2010 13:36:29| main|pc030"..., 61) = 61
close(9) = 0
open("lock", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 9
close(9) = 0
gettimeofday({1290778589, 6348}, NULL) = 0
gettimeofday({1290778589, 6432}, NULL) = 0
gettimeofday({1290778589, 6518}, NULL) = 0
gettimeofday({1290778589, 6595}, NULL) = 0
gettimeofday({1290778589, 6662}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1290778589, 6982}, NULL) = 0
gettimeofday({1290778589, 7043}, NULL) = 0
.
.
.
clock_gettime(CLOCK_REALTIME, {1290778589, 9691258}) = 0
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 7, {0, 999937742}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 9972}, NULL) = 0
gettimeofday({1290778589, 10026}, NULL) = 0
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_REALTIME, {1290778589, 10217977}) = 0
futex(0x96a7298, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 9, {0, 999808023}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 10516}, NULL) = 0
gettimeofday({1290778589, 10590}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x48bd000, 10489856) = 0
futex(0x96a76dc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a76d8, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7698, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x3ebc000, 10489856) = 0
futex(0x96a7494, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a7490, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7450, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x34bb000, 10489856) = 0
gettimeofday({1290778589, 11812}, NULL) = 0
shutdown(3, 2 /* send and receive */) = 0
close(3) = 0
futex(0x96a3e2c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a3e28, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a3de8, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x5d2c000, 10489856) = 0
exit_group(0) = ?
Process 18719 detached
(that's attaching strace to the sge master process & then calling
migrate from the intended master)
To me that doesn't look as if it gets to the point where it would write
the lock at all, but dies before that - looks as if this might be the
point where it would try to do the first munmap?
Does that ring a bell with anyone?
Tina
Post by rumpelkeks
I have found something, maybe. On my (64bit) qmaster, SGE segfaults when
you try to stop it. Everytime you try to stop it, is quite reproduceable.
Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at
000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at
0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at
0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at
000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4
...probably when it tries to write the lock file, as it doesn't segfault
if there is a lock file. I'll try to trace it to see where, exactly.
Doesn't do this on my other machine (segfaulting).
Tina
Post by rumpelkeks
Hello,
Post by reuti
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? The common directory is also on the share?
Yes, yes,
Post by reuti
Both machines can also write to these shares?
-bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14
/dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
(this was as sgeadmin on the current master, into the spool directory)
Accounting& reporting files, and logs etc, are all being written; the
heartbeat file is updated; and, as I said, if I manually create a lock
file prior to calling migrate it is very certainly removed.
<snip>
Post by reuti
If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should have it already in his cache. Only the NFS share might get the final write later.
Good point. It should not be the master going down that removes it but
the new one starting up; so it should turn up eventually (even if too
late for the migration; which it doesn't. It looks rather like one of my
two shadow hosts can created it, the other can't - but I can write to
the share from both machines (as sgeadmin), and the way it got installed
is the same. Only difference is that one of the machines is 64bit the
other 32bit (of the test ones that is; my 'real' qmasters are both 64bit).
Post by reuti
Maybe running SGE in debug mode will show more, as the creation should show up there when it's happening if I get the source right.
I'll see if I can try that on my test cluster cell.
Tina
rumpelkeks
2010-11-26 14:39:31 UTC
Permalink
I have tried. 'qconf -kt scheduler' does not appear to make it segfault;
however, the sge_qmaster process doesn't stop either - it just says
'scheduler thread terminated', that's all.

qconf -km leads to the segfault again.

Tina
Post by reuti
Post by rumpelkeks
Ah, not quite. Seems it segfaults whatever. So, my problem isn't exactly
'migrate not working' but more 'SGE segfaults when stopped'.
In the sgemaster script the `qconf -ks` is used to shutdown the scheduler, but on the man page it's stated that this option is deprecated. Can you replace the command with `qconf -kt scheduler`?
You can even try `qconf -kt scheduler` plus `qconf -kt jvm` (if used) and `qconf -km` by hand to check when it's crashing in detail.
-- Reuti
Post by rumpelkeks
Process 26724 attached - interrupt to quit
futex(0x74302c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x743000, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7314ac, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x731480, 2) = 1
futex(0x731480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 10
write(10, "11/26/2010 13:33:52| main|cs04r"..., 78) = 78
close(10) = 0
futex(0x735d8c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x735d88, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x735db8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x46d129d0, FUTEX_WAIT, 26741, NULL) = 0
futex(0x459109d0, FUTEX_WAIT, 26739, NULL) = 0
Process 26724 detached
Process 18719 attached - interrupt to quit
futex(0x828e2dc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x828e2c0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x828149c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x8281480, 2) = 1
futex(0x8281480, FUTEX_WAKE_PRIVATE, 1) = 1
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
gettimeofday({1290778579, 129103}, {0, 0}) = 0
write(10, "11/26/2010 13:36:19| main|pc030"..., 67) = 67
close(10) = 0
gettimeofday({1290778579, 129635}, {0, 0}) = 0
futex(0x828541c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8285418, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8285448, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8c93bd8, FUTEX_WAIT, 18737, NULL) = 0
futex(0x9714540, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x9714540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x712dbd8, FUTEX_WAIT, 18735, NULL) = 0
munmap(0xb69fe000, 10489856) = 0
munmap(0x8293000, 10489856) = 0
futex(0x52bdbd8, FUTEX_WAIT, 18734, NULL) = 0
munmap(0x8c94000, 10489856) = 0
futex(0x48bcbd8, FUTEX_WAIT, 18733, NULL) = 0
munmap(0x672d000, 10489856) = 0
munmap(0x712e000, 10489856) = 0
open("messages", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 9
gettimeofday({1290778589, 3646}, {0, 0}) = 0
write(9, "11/26/2010 13:36:29| main|pc030"..., 61) = 61
close(9) = 0
open("lock", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 9
close(9) = 0
gettimeofday({1290778589, 6348}, NULL) = 0
gettimeofday({1290778589, 6432}, NULL) = 0
gettimeofday({1290778589, 6518}, NULL) = 0
gettimeofday({1290778589, 6595}, NULL) = 0
gettimeofday({1290778589, 6662}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1290778589, 6982}, NULL) = 0
gettimeofday({1290778589, 7043}, NULL) = 0
.
.
.
clock_gettime(CLOCK_REALTIME, {1290778589, 9691258}) = 0
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 7, {0, 999937742}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 9972}, NULL) = 0
gettimeofday({1290778589, 10026}, NULL) = 0
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_REALTIME, {1290778589, 10217977}) = 0
futex(0x96a7298, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x96a729c, FUTEX_WAIT_PRIVATE, 9, {0, 999808023}) = -1 EAGAIN
(Resource temporarily unavailable)
futex(0x96a7258, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1290778589, 10516}, NULL) = 0
gettimeofday({1290778589, 10590}, NULL) = 0
futex(0x96bf85c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96bf858, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96bf818, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x48bd000, 10489856) = 0
futex(0x96a76dc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a76d8, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7698, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x3ebc000, 10489856) = 0
futex(0x96a7494, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a7490, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a7450, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x34bb000, 10489856) = 0
gettimeofday({1290778589, 11812}, NULL) = 0
shutdown(3, 2 /* send and receive */) = 0
close(3) = 0
futex(0x96a3e2c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x96a3e28, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x96a3de8, FUTEX_WAKE_PRIVATE, 1) = 1
munmap(0x5d2c000, 10489856) = 0
exit_group(0) = ?
Process 18719 detached
(that's attaching strace to the sge master process& then calling
migrate from the intended master)
To me that doesn't look as if it gets to the point where it would write
the lock at all, but dies before that - looks as if this might be the
point where it would try to do the first munmap?
Does that ring a bell with anyone?
Tina
Post by rumpelkeks
I have found something, maybe. On my (64bit) qmaster, SGE segfaults when
you try to stop it. Everytime you try to stop it, is quite reproduceable.
Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at
000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at
0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at
0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at
000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4
...probably when it tries to write the lock file, as it doesn't segfault
if there is a lock file. I'll try to trace it to see where, exactly.
Doesn't do this on my other machine (segfaulting).
Tina
Post by rumpelkeks
Hello,
Post by reuti
Post by rumpelkeks
<snip>
it looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to), and will prevent that a shadowd will take action of an unchanged heartbeat file then, as -migrate will first shut down the actual master and then start its own.
Do you want to have two qmasters, which can startup when the other is missing two-way, so you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Do you use classic spooling? The common directory is also on the share?
Yes, yes,
Post by reuti
Both machines can also write to these shares?
-bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
-rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14
/dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
(this was as sgeadmin on the current master, into the spool directory)
Accounting& reporting files, and logs etc, are all being written; the
heartbeat file is updated; and, as I said, if I manually create a lock
file prior to calling migrate it is very certainly removed.
<snip>
Post by reuti
If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should have it already in his cache. Only the NFS share might get the final write later.
Good point. It should not be the master going down that removes it but
the new one starting up; so it should turn up eventually (even if too
late for the migration; which it doesn't. It looks rather like one of my
two shadow hosts can created it, the other can't - but I can write to
the share from both machines (as sgeadmin), and the way it got installed
is the same. Only difference is that one of the machines is 64bit the
other 32bit (of the test ones that is; my 'real' qmasters are both 64bit).
Post by reuti
Maybe running SGE in debug mode will show more, as the creation should show up there when it's happening if I get the source right.
I'll see if I can try that on my test cluster cell.
Tina
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442