Rmpi under SGE

arnuschky

2010-12-17 10:04:23 UTC

Hi,

We're having massive problems using Rmpi with OpenMPI under SGE. OpenMPI
itself is tested and works fine. We're submitting one master Rscript, which
in turn spawns the required slaves using Rmpi. Unfortunately, this
fails:

$ cat testRmpi.e3480556
Warning: Permanently added 'compute-1-13.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-10.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-11.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-12.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-14.local' (RSA) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
--------------------------------------------------------------------------
A daemon (pid 26953) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

We're using openmpi-1.3.3 (--with-sge) and SGE V62u4.

Any hints on what's going wrong here?

Cheers,
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306389

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].

reuti

2010-12-17 10:28:24 UTC

Hi,

Post by arnuschky
We're having massive problems using Rmpi with OpenMPI under SGE. OpenMPI
itself is tested and works fine. We're submitting one master Rscript, which
in turn spawns the required slaves using Rmpi. Unfortunately, this
fails:
$ cat testRmpi.e3480556
Warning: Permanently added 'compute-1-13.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-10.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-11.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-12.local' (RSA) to the list of known hosts.
Warning: Permanently added 'compute-1-14.local' (RSA) to the list of known hosts.
Permission denied, please try again.

When Open MPI is tightly integrated with SGE, I would assume from these messages that SGE is configured to use "ssh". What is the output of `qconf -sconf`? There might be double entries:

http://marc.info/?l=npaci-rocks-discussion&m=126411729709528
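
A minimal sketch of how such a check could look, assuming "double entries" means that one of the remote-startup parameters (rsh_command, rsh_daemon, rlogin_command, rlogin_daemon, qlogin_command, qlogin_daemon) is listed more than once in the configuration. The function name dup_remote_params is made up for this illustration; it only filters text on stdin and does not change anything in SGE.

```shell
#!/bin/sh
# Print every remote-startup parameter that occurs more than once in the
# configuration text read from stdin (one "name value" pair per line).
dup_remote_params() {
    awk '$1 ~ /^(rsh|rlogin|qlogin)_(command|daemon)$/ { seen[$1]++ }
         END { for (p in seen) if (seen[p] > 1) print p }' | sort
}

# On a submit or admin host (needs qconf in PATH):
#   qconf -sconf | dup_remote_params
```

An empty result means no parameter was listed twice in the output that was checked. Host-specific configurations (`qconf -sconf <hostname>`) would have to be checked separately.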

If you want or have to use ssh, you need either passphraseless ssh keys (deprecated) or hostbased authentication:

http://gridengine.sunsource.net/howto/hostbased-ssh.html

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306398

arnuschky

2010-12-17 11:58:55 UTC

Ah. My previous reply was slightly premature: Rmpi jobs with > 20 slaves
still fail (even with Reuti's fixes):

$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
--------------------------------------------------------------------------
A daemon (pid 8473) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------

Qmaster spool messages list:

12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job

Any idea what's going wrong now? 60 seconds is quite a long timeout, so I
guess that this is not a network timeout...

Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306417

reuti

2010-12-17 14:02:39 UTC

Post by arnuschky
Ah. My previous reply was slightly premature: Rmpi jobs with > 20 slaves still fail
$ cat test-mpi-17942.e3480568
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"

Are you now using the plain "builtin" startup method? Does it happen on all hosts of such a job?

Maybe it's something specific to certain nodes, and on those hosts it would also happen with fewer slots.
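
One way to narrow this down might be to probe each host of the job separately before mpirun starts. The following is only a sketch: the function names pe_hosts and probe_hosts are made up here, and it assumes it runs inside a tightly integrated parallel job, where SGE sets PE_HOSTFILE and allows `qrsh -inherit` to the granted hosts.

```shell
#!/bin/sh
# List the distinct hosts named in a PE hostfile (host name is column 1).
pe_hosts() {
    awk '{ print $1 }' "$1" | sort -u
}

# Start one trivial task per host through SGE and report whether it worked.
# Only meaningful inside a running parallel job (PE_HOSTFILE must be set).
probe_hosts() {
    for host in $(pe_hosts "$PE_HOSTFILE"); do
        if qrsh -inherit "$host" hostname >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}

# In the jobscript, before the mpirun line:
#   probe_hosts
```

If only certain hosts report FAILED, the problem is more likely local to those nodes than related to the number of slaves.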

-- Reuti

matbradford

2010-12-17 15:01:28 UTC

Arne,

We are using OpenMPI 1.4.2 and had a very similar problem, with exactly
the same message, when using more than a single node.

I managed to get rid of the problem by restarting the execution daemons.
I don't know why that fixed it, but it did.

The problem has returned a couple of times over a period of about a month,
but restarting the sgeexecd daemons always seems to fix it.

No other messages in the log files indicate anything unusual.

Cheers,

Mat

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306469

reuti

2010-12-17 15:24:38 UTC

When you start sgeexecd by hand from the root account, the environment might differ from the one at boot time, and it is inherited by the processes SGE starts. This behavior can also be adjusted in SGE's configuration.

But if some environment variables are missing (and are only set when the daemon is started by hand), it may also be possible to set them in the jobscript. For a tightly integrated job, -V is usually used to export the environment of the master task to all slaves (hence a variable has to be set only once, in the jobscript). Here -V is appropriate.

==

This can also be set at job submission time:

$ qsub -V job.sh

But I usually advise against it, as I prefer self-contained scripts (so that a changed variable in the interactive shell cannot affect the submitted job, which can be hard to track down in case of an error). The exception is when you name the variable explicitly:

$ qsub -v LD_LIBRARY_PATH job.sh
$ qsub -v LD_LIBRARY_PATH=/usr/local/lib job.sh
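
The same explicit option can also be embedded in the jobscript itself, since qsub reads lines starting with "#$" as options. Below is a sketch of such a self-contained script. The parallel environment name "orte", the slot count, the script name master.R and the `mpirun -np 1` start line are assumptions for illustration, not taken from this thread; only the LD_LIBRARY_PATH value is the one used above.

```shell
#!/bin/sh
# Write a self-contained job script; lines starting with "#$" are read by
# qsub as if the options had been given on the command line.
cat > job.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Hypothetical parallel environment name and slot count:
#$ -pe orte 24
# Name the variable explicitly instead of exporting everything with -V:
#$ -v LD_LIBRARY_PATH=/usr/local/lib

# Start one master; it spawns the slaves through Rmpi (script name assumed).
mpirun -np 1 Rscript master.R
EOF

# Submit without relying on the interactive shell's environment:
#   qsub job.sh
```

With the variable fixed inside the script, the job no longer depends on what the submitting shell happened to have set.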

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306475

arnuschky

2010-12-17 18:07:41 UTC

Post by reuti
$ qsub -V job.sh

I tried this as well, but it didn't change anything. Anyway, I think Mat
didn't start the sgeexecds by hand; he just restarted the daemons (via the
init script, I assume). Or am I missing something here?

Cheers,
Arne