Discussion:
Rmpi under SGE
arnuschky
2010-12-17 10:04:23 UTC
Permalink
Hi,

we're having massive problems using Rmpi with OpenMPI under SGE. OpenMPI
is tested and works fine. We're submitting one master Rscript, which
in turn spawns the required slaves using Rmpi. Unfortunately, this
fails:

$ cat testRmpi.e3480556
Warning: Permanently added 'compute-1-13.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-10.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-11.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-12.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-14.local' (RSA) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
--------------------------------------------------------------------------
A daemon (pid 26953) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

We're using openmpi-1.3.3 (--with-sge) and SGE V62u4.

Any hints on what's going wrong here?

Cheers,
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306389

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-12-17 10:28:24 UTC
Permalink
Hi,
Post by arnuschky
Permission denied, please try again.
When Open MPI is tightly integrated into SGE, I would assume that SGE itself is configured to use "ssh". What is the output of `qconf -sconf`? There might be duplicate entries.

http://marc.info/?l=npaci-rocks-discussion&m=126411729709528
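
One way to look for such duplicate entries is to count how often each parameter name occurs in the `qconf -sconf` output. This is only a sketch: `find_duplicate_keys` is a hypothetical helper, the sample configuration text is made up, and the helper assumes one `name value` pair per line, so wrapped continuation lines may be miscounted.

```shell
# Hypothetical helper: print every parameter name that occurs more than once
# in "name value" style text read from stdin (comment lines are skipped).
find_duplicate_keys() {
    awk '$1 !~ /^#/ && NF >= 2 { seen[$1]++ }
         END { for (k in seen) if (seen[k] > 1) print k }' | sort
}

# On a cluster with qconf on PATH one would run:
#   qconf -sconf | find_duplicate_keys
# Self-contained demonstration with made-up configuration text:
printf '%s\n' \
    '#global:' \
    'rsh_command   /usr/bin/ssh' \
    'rsh_daemon    /usr/sbin/sshd -i' \
    'rsh_command   builtin' | find_duplicate_keys
```

Any name printed by the helper is worth checking by hand in the full configuration.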

If you want to, or must, use ssh, you need either passphraseless ssh keys (deprecated) or host-based authentication:

http://gridengine.sunsource.net/howto/hostbased-ssh.html

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306398

arnuschky
2010-12-17 11:58:55 UTC
Permalink
Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
still fail (even with Reuti's fixes):

$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
--------------------------------------------------------------------------
A daemon (pid 8473) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------

Qmaster spool messages list:

12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job

Any idea what's going wrong now? 60 seconds is quite a long timeout, so I
guess that this is not a network timeout...

Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306417

reuti
2010-12-17 14:02:39 UTC
Permalink
Post by arnuschky
Ah. My previous message was slightly premature, Rmpi jobs with > 20 slaves
$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
You are now using the plain -builtin- startup method? Does it happen on all hosts for such a job?

Maybe it's something specific to some nodes, and on those hosts it would also happen with fewer slots.
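
To see whether the failures cluster on particular hosts, the qmaster messages file (its location depends on the installation) can be summarized per host. The sketch below uses a hypothetical helper, `failed_hosts`, and assumes that the task field always has the form `task <n>.<host>`, as in the qmaster message posted in this thread.

```shell
# Hypothetical helper: count "tightly integrated parallel task ... failed"
# entries per host in qmaster messages read from stdin.
# Assumes the task field looks like "task <n>.<host>".
failed_hosts() {
    sed -n 's/.*tightly integrated parallel task [^ ]* task [0-9]*\.\([^ ]*\) failed.*/\1/p' |
        sort | uniq -c | sort -rn
}

# Demonstration with the message from this thread:
echo '12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job' |
    failed_hosts
```

If the same few hosts dominate the list, the problem is more likely node-specific than related to the number of slots.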

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450

matbradford
2010-12-17 15:01:28 UTC
Permalink
Arne,

We are using OpenMPI 1.4.2 and we had a very similar problem, with exactly
the same message, when using more than a single node.

I managed to get rid of the problem by restarting the execution daemons.
Don't know why it fixed the problem, but it did.

The problem has returned a couple of times over a period of about a month,
but restarting the sgeexecd daemons always seems to fix it.

No other messages in the log files indicate anything unusual.

Cheers,

Mat




------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306469

reuti
2010-12-17 15:24:38 UTC
Permalink
Post by matbradford
Arne,
We are using OpenMPI 1.4.2 and we had a very similar problem , with exactly
the same message, when using more than a single node.
I managed to get rid of the problem by restarting the execution daemons.
Don't know why it fixed the problem, but it did.
When you start sgeexecd by hand from the root account, the environment might differ from the one at boot time, and it is inherited by the processes started by SGE. This behavior can also be adjusted in SGE's configuration.

But when some environment variables are missing (and are only set when the daemon is started by hand), this could also be fixed in the job script. For a tightly integrated job, -V is usually used to export the environment of the master task to all slaves (hence it needs to be set only once, in the job script). Here -V is appropriate.

==

This can also be set during job submission time:

$ qsub -V job.sh

But usually I advise against it, as I prefer self-contained scripts (so that a changed variable in the submitting shell cannot affect the actual job; this can be hard to track down in case of an error). The exception is when you name the variable explicitly:

$ qsub -v LD_LIBRARY_PATH job.sh
$ qsub -v LD_LIBRARY_PATH=/usr/local/lib job.sh
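
A self-contained job script in this sense might look like the sketch below. It sets the variable inside the script instead of relying on the submitting shell. The path /usr/local/lib is just the example value from the command above, and the commented-out payload line with the file name `master.R` is hypothetical.

```shell
#!/bin/sh
# Sketch of a self-contained job script: the library path is set here,
# so the job does not depend on the environment of the submitting shell.
#$ -S /bin/sh
#$ -cwd

# /usr/local/lib is the example path from the qsub command above; replace
# it with the directory that holds your Open MPI libraries.
LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# The actual payload would follow here, for example (hypothetical file name):
#   mpirun -np 1 Rscript master.R
```

Printing the variable at the start of the job makes it easy to verify in the job's output file what the job actually saw.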

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306475

arnuschky
2010-12-17 18:07:41 UTC
Permalink
Post by reuti
$ qsub -V job.sh
I tried this as well, but it didn't change anything. Anyway, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (by
init script, I assume). Or am I missing something here?

Cheers,
Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306524

reuti
2010-12-21 09:29:58 UTC
Permalink
Post by arnuschky
Post by reuti
$ qsub -V job.sh
I tried this as well, but didn't change anything. Anyways, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (by
init script I assume). Or am I missing something here?
Sure, he used the script. But when you log in as root, you most likely have a different environment than the machine has when it boots and starts the script automatically. You can check this in /proc/<pid>/environ for the processes, for example for one that was started automatically and one that was started by hand (using the script).
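
The entries in /proc/<pid>/environ are separated by NUL bytes, so they need a little conversion before they can be compared. A minimal sketch, assuming Linux and a hypothetical helper named `env_of`:

```shell
# Hypothetical helper; relies on the Linux /proc file system.
# Reading the environment of another user's process requires root.

# Print the environment of process $1, one sorted VAR=value per line.
env_of() {
    tr '\0' '\n' < "/proc/$1/environ" | sort
}

# Possible usage on each node (assumes the daemon process is named sge_execd):
#   env_of "$(pgrep -o sge_execd)" > "/tmp/execd-env.$(hostname)"
# and then, with the two files collected in one place:
#   diff /tmp/execd-env.nodeA /tmp/execd-env.nodeB
```

The sorted output keeps the diff between a boot-started and a hand-started daemon limited to the variables that really differ.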

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307780

matbradford
2010-12-21 09:36:37 UTC
Permalink
Post by reuti
Sure, he used the script. But when you log in as root, you have most
likely a different environment than the machine when it boots and starts
the script automatically. You can check this in /proc/<pid>/environ for
the processes, maybe for one where it was started automatically and one
where it was started by hand (by using the script).
We always start our SGE daemons by hand.

The SGE directories have a dependency on the GPFS file system having
started, and to prevent any issues, GPFS gets manually started, followed by
SGE.

We just run a distributed ssh (xdsh) command across all the nodes that have
been rebooted.

Cheers,

Mat
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307781

reuti
2010-12-21 09:41:43 UTC
Permalink
Post by matbradford
Post by matbradford
-----Original Message-----
Sent: 21 December 2010 09:30
Subject: Re: [GE users] Rmpi under SGE
Post by arnuschky
Post by reuti
$ qsub -V job.sh
I tried this as well, but didn't change anything. Anyways, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (by
init script I assume). Or am I missing something here?
Sure, he used the script. But when you log in as root, you have most
likely a different environment than the machine when it boots and starts
the script automatically. You can check this in /proc/<pid>/environ for
the processes, maybe for one where it was started automatically and one
where it was started by hand (by using the script).
We always start our SGE daemons by hand.
The SGE directories have a dependency on the GPFS file system having
started, and to prevent any issues, GPFS gets manually started, followed by
SGE.
We just run a distributed ssh (xdsh) command across all the nodes that have
been rebooted.
Cheers,
Mat
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307784

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
matbradford
2010-12-21 09:48:08 UTC
Permalink
Post by matbradford
-----Original Message-----
Sent: 21 December 2010 09:42
Subject: Re: [GE users] Rmpi under SGE
Post by matbradford
Post by matbradford
-----Original Message-----
Sent: 21 December 2010 09:30
Subject: Re: [GE users] Rmpi under SGE
Post by arnuschky
Post by reuti
$ qsub -V job.sh
I tried this as well, but didn't change anything. Anyways, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (by
init script I assume). Or am I missing something here?
Sure, he used the script. But when you log in as root, you have most
likely a different environment than the machine when it boots and starts
the script automatically. You can check this in /proc/<pid>/environ for
the processes, maybe for one where it was started automatically and one
where it was started by hand (by using the script).
We always start our SGE daemons by hand.
Sorry, not very precise here. When I say by hand, I still mean we use the
start-up script in init.d, just not automatically at boot time.
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307786

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-12-21 09:54:36 UTC
Permalink
Post by matbradford
Post by matbradford
-----Original Message-----
Sent: 21 December 2010 09:30
Subject: Re: [GE users] Rmpi under SGE
Post by arnuschky
Post by reuti
$ qsub -V job.sh
I tried this as well, but didn't change anything. Anyways, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (by
init script I assume). Or am I missing something here?
Sure, he used the script. But when you log in as root, you have most
likely a different environment than the machine when it boots and starts
the script automatically. You can check this in /proc/<pid>/environ for
the processes, maybe for one where it was started automatically and one
where it was started by hand (by using the script).
We always start our SGE daemons by hand.
The SGE directories have a dependency on the GPFS file system having
started, and to prevent any issues, GPFS gets manually started, followed by
SGE.
Sorry, wrong button.

Yes, I miss a feature in insserv like "start-last" and "start-second-to-last". Usually I have to reorder sge_execd in rc3.d/rc5.d so that it is started last. In our case the problem is ypbind, which must already be running before sge_execd starts.

-- Reuti
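Where the init system honours LSB headers (as insserv does), the ordering can in principle be declared in the script rather than fixed by hand. A sketch of such a header for the sge_execd init script, assuming the NIS client script registers itself under the name ypbind; the facility names and runlevels below are assumptions and need to be checked against the local init scripts:

```shell
### BEGIN INIT INFO
# Provides:          sgeexecd
# Required-Start:    $network $remote_fs ypbind
# Required-Stop:     $network $remote_fs ypbind
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Description:       Start the Grid Engine execd once NIS binding is up
### END INIT INFO
```

This only expresses "after ypbind", not "last", so it does not replace the start-last feature wished for above.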
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307787

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
arnuschky
2010-12-17 17:01:00 UTC
Permalink
Thanks for the suggestion Mat. I've tested and the problem persists
after restarting the sgeexecds, unfortunately.

Cheers,
Arne
Post by matbradford
Arne,
We are using OpenMPI 1.4.2 and we had a very similar problem, with exactly
the same message, when using more than a single node.
I managed to get rid of the problem by restarting the execution daemons.
Don't know why it fixed the problem, but it did.
The problem has returned a couple of times over a period of about a month,
but restarting the sgeexecd daemons always seems to fix it.
No other messages in the log files indicate anything unusual.
Cheers,
Mat
-----Original Message-----
Sent: 17 December 2010 14:03
Subject: Re: [GE users] Rmpi under SGE
Post by arnuschky
Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves fail:
$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
You are now using the plain -builtin- startup method? Does it happen on all
hosts for such a job?
Maybe it's something special on some nodes and would happen on some hosts
with fewer slots too.
-- Reuti
Post by arnuschky
--------------------------------------------------------------------------
A daemon (pid 8473) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
Any idea what's going wrong now? 60 seconds is quite a long timeout, I
guess that this is not a network timeout...
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306502

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
arnuschky
2010-12-17 17:12:35 UTC
Permalink
Post by reuti
Post by arnuschky
Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves fail:
$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
You are now using the plain -builtin- startup method? Does it happen on all hosts for such a job?
Yes, here's my current config:

$ qconf -sconf
#global:
execd_spool_dir /opt/gridengine/default/spool
mailer /bin/mail
xterm /usr/bin/X11/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail ***@headnode
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params H_MEMORYLOCKED=infinity
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 20000-20100
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 35192
max_jobs 25000
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w

$ qconf -sp mpich_fu
pe_name mpich_fu
slots 128
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Post by reuti
Maybe it's something special on some nodes and would happen on some hosts with fewer slots too.
I don't think that the nodes are different. I reinstalled all of them
yesterday. I tested on 2 different generations of nodes separately (2x2
cores and 2x4 cores per node). The problem just seems to be more likely
the more slots (and thus nodes) I use. But in one generation the nodes
are identical, all using a single switch.

Cheers,
Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306505

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-12-21 12:56:44 UTC
Permalink
Post by reuti
Post by arnuschky
Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves fail:
$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
You are now using the plain -builtin- startup method? Does it happen on all hosts for such a job?
<snip>
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
Fine.
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
$ qconf -sp mpich_fu
pe_name mpich_fu
slots 128
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
Aren't you using Open MPI? Then these two entries can be set to NONE.

Open MPI on its own is working fine with more than 20 nodes? How is Open MPI called by R - just a plain `mpiexec`, or any special arguments?

-- Reuti
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Post by reuti
Maybe it's something special on some nodes and would happen on some hosts with fewer slots too.
I don't think that the nodes are different. I reinstalled all of them
yesterday. I tested on 2 different generations of nodes separately (2x2
cores and 2x4 cores per node). The problem just seems to be more likely
the more slots (and thus nodes) I use. But in one generation the nodes
are identical, all using a single switch.
Cheers,
Arne
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306505
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307827

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
arnuschky
2010-12-22 19:16:22 UTC
Permalink
Hey,
Post by reuti
Post by arnuschky
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
Aren't you using Open MPI? Then these two entries can be set to NONE.
Yes, I do. I will change these entries then.
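One way the change could be scripted, as a sketch only: dump the PE, rewrite the two hook lines to NONE, and load the result back. The helper name pe_without_hooks and the file name mpich_fu.pe are made up for illustration; `qconf -sp` and `qconf -Mp` are the standard show and modify-from-file options.

```shell
# Rewrite a `qconf -sp`-style PE listing read from stdin so that both
# start_proc_args and stop_proc_args are set to NONE; every other line
# passes through unchanged.
pe_without_hooks() {
    sed -e 's|^start_proc_args[[:space:]].*|start_proc_args NONE|' \
        -e 's|^stop_proc_args[[:space:]].*|stop_proc_args NONE|'
}

# Usage on an admin host (not executed here):
#   qconf -sp mpich_fu | pe_without_hooks > mpich_fu.pe
#   qconf -Mp mpich_fu.pe
```

Editing interactively with `qconf -mp mpich_fu` gives the same result; the filter is only convenient when several PEs need the same change.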
Post by reuti
Open MPI on its own is working fine with more than 20 nodes? How is Open MPI called
by R - just a plain `mpiexec`, or any special arguments?
Yes, I tested with 50 nodes, worked both with $fill_up and $round_robin.

Regarding the mpiexec: I have honestly no idea. I will try to read the
source code - I am not using R and the Rmpi package myself, I am (only)
the admin of the cluster. Maybe someone on this mailing list can answer
this question?

Have some happy holidays, if you have some!
Cheers and Thanks for all the replies,
Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=308396

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
seandavi
2010-12-23 14:01:27 UTC
Permalink
Post by arnuschky
Hey,
Post by reuti
Post by arnuschky
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
Aren't you using Open MPI? Then these two entries can be set to NONE.
Yes, I do. I will change these entries then.
Post by reuti
Open MPI on its own is working fine with more than 20 nodes? How is Open MPI called
by R - just a plain `mpiexec`, or any special arguments?
Yes, I tested with 50 nodes, worked both with $fill_up and $round_robin.
Regarding the mpiexec: I have honestly no idea. I will try to read the
source code - I am not using R and the Rmpi package myself, I am (only)
the admin of the cluster. Maybe someone on this mailing list can answer
this question?
You can ask questions like that on the R-sig-hpc mailing list. There are
probably more "users" on that list and the author of Rmpi subscribes to that
list as well.

Sean
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=308737

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
arnuschky
2010-12-17 11:48:32 UTC
Permalink
Hi,

yay, thanks a lot reuti, you're a star! Works fine!

Arne
Post by reuti
Hi,
Post by arnuschky
we're having massive problems using Rmpi with OpenMPI under SGE. OpenMPI
is tested and works fine. We're submitting one master Rscript, which is
in turn spawning the required slaves using Rmpi. Unfortunately, this
fails:
$ cat testRmpi.e3480556
Warning: Permanently added 'compute-1-13.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-10.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-11.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-12.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-14.local' (RSA) to the list of known hosts.
Permission denied, please try again.
When Open MPI is tightly integrated into SGE, I would assume SGE is configured to use "ssh". What is the output of `qconf -sconf`? There might be duplicate entries.
http://marc.info/?l=npaci-rocks-discussion&m=126411729709528
http://gridengine.sunsource.net/howto/hostbased-ssh.html
-- Reuti
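A quick way to look for such duplicates, sketched under the assumption that "double entries" means the same parameter name (for example rsh_command) appearing more than once in the listing. The helper name dup_conf_keys is illustrative.

```shell
# Print every parameter name that occurs more than once in a
# `qconf -sconf`-style listing read from stdin. Blank lines and
# comment lines such as "#global:" are ignored.
dup_conf_keys() {
    awk 'NF && $1 !~ /^#/ { print $1 }' | sort | uniq -d
}

# Usage on a submit or admin host (not executed here):
#   qconf -sconf | dup_conf_keys
```

Empty output means no parameter name is repeated; it says nothing about whether the single remaining value is the intended one.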
Post by arnuschky
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
--------------------------------------------------------------------------
A daemon (pid 26953) died unexpectedly with status 129 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
We're using openmpi-1.3.3 (--with-sge) and SGE V62u4.
Any hints on what's going wrong here?
Cheers,
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306413

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].