Discussion:
[torqueusers] Question about pbs_submit functions
Bas van der Vlies
2015-03-31 12:28:21 UTC
I have some questions about the submit functions in the different Torque versions. I am building the Python interface on top of the Torque library.

In Torque 2.X it was easy: there was one function, with a man page:
* pbs_submit

Now I have a report that pbs_submit is no longer reliable and that we have to use:
4.X) pbs_submit_hash
5.X) pbs_submit_hash_ext

I cannot find any documentation for these functions or how to use them. They are declared in pbs_ifl.h, so they are API functions. It would be handy to have some kind of documentation.


---
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300 | ***@surfsara.nl | www.surfsara.nl |
Glen Beane
2015-03-31 13:32:26 UTC
Just to follow up with a little more information:

We have found that occasionally pbs_server gets into a state where any job
submitted with pbs_submit() will fail. The failure is typically that the
job script is missing at run time, and the job immediately exits with a "no
such file or directory" error:

"-bash: line 1:
/var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file
or directory"


Far less often, I've seen pbs_submit() fail completely (it seems to fail
while sending the job script to pbs_server), so the job isn't queued; or
the job gets queued but the job script is corrupted.

Note that while our pipelines that use pbs_submit() are failing, jobs
submitted with pbs_submit_hash() via qsub (or a simple test program)
continue to work fine. A restart of pbs_server fixes everything.

Because some of these pipelines are production, we would like to have them
as reliable as possible. It seems that pbs_submit_hash() is now more
reliable than pbs_submit(), so we would like to switch. However, there
isn't any documentation, and the API doesn't seem entirely stable
(pbs_submit_hash vs pbs_submit_hash_ext). We use pbs_python for our
pipelines, so I'm really looking forward to being able to use
pbs_submit_hash from pbs_python.



Ken Nielson
2015-03-31 14:44:09 UTC
Hi all,

This is going to take a little sorting out before I get you a good answer.
I will get back with you later.

Ken
--
Ken Nielson Sr. Software Engineer
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
www.adaptivecomputing.com
Bas van der Vlies
2015-04-20 12:56:02 UTC
Just curious, any news?
Glen Beane
2015-09-25 14:12:04 UTC
Following up on this, Ken.

Any luck sorting out pbs_submit(), pbs_submit_hash(), and
pbs_submit_hash_ext()?

Right now we're stuck using pbs_submit() from pbs_python (at least until
Bas can get info about pbs_submit_hash() so he can implement it in
pbs_python), but occasionally pbs_server gets into a state where jobs
submitted with pbs_submit() always fail until pbs_server is restarted.
While in this state, jobs submitted with pbs_submit_hash() still work, so
we would really like to switch (or have pbs_submit() fixed so this doesn't
happen any more).




Ken Nielson
2015-09-25 15:57:33 UTC
Sorry for the delayed response.

pbs_submit is the API to call for submitting jobs. If it is not working as
reliably as before, we should try to understand why. Have you created a
ticket for the problem?

The engineer who wrote pbs_submit_hash left Adaptive Computing long ago.
We need to go over what was done and see if this is something we want to
support going forward as an API.

So the short answer is: don't use pbs_submit_hash, and let's debug
pbs_submit.

Regards

Ken
David Beer
2015-09-25 16:04:57 UTC
I'm not sure what difference it can make to use pbs_submit() or
pbs_submit_hash() from the server's perspective. The only difference
between the two is how things are stored on the client side before being
sent to pbs_server, but the server receives the exact same information from
either pbs_submit() or pbs_submit_hash().

I personally do not think that the difference in behavior is caused by
using pbs_submit() instead of pbs_submit_hash(). If you did want to try to
use pbs_submit_hash(), you'd want to populate a job_data_container object
as is done in the code. Its definition can be found in
src/include/u_hash_map_structs.h. Mostly, this is a container object of
name and value pairs with some extra flags. The other difference for this
function is that resources are stored separately from the rest of the
attributes: there is one container for the job attributes and a separate
one for the resources. For an example of how these are populated, you can
look at src/cmds/qsub_functions.c, specifically ji.job_attr for the
attributes and ji.res_attr for the resources.
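
To make the shape of that data concrete, here is a minimal sketch of "name and value pairs with some extra flags". It is an illustrative stand-in only, not the job_data or job_data_container definition from src/include/u_hash_map_structs.h. The names kv_entry, kv_container, kv_set and kv_get are hypothetical and do not exist in Torque.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in only: NOT Torque's job_data definition.
 * All names in this block are hypothetical. */
struct kv_entry {
    const char *name;   /* attribute or resource name */
    const char *value;  /* value, kept as text */
    int         flags;  /* stand-in for the "extra flags" */
};

/* Small fixed-size container of name/value pairs. */
struct kv_container {
    struct kv_entry items[16];
    size_t          count;
};

/* Add a pair, or overwrite the value if the name already exists.
 * Returns 0 on success, -1 if the container is full. */
int kv_set(struct kv_container *c, const char *name,
           const char *value, int flags)
{
    for (size_t i = 0; i < c->count; i++) {
        if (strcmp(c->items[i].name, name) == 0) {
            c->items[i].value = value;
            c->items[i].flags = flags;
            return 0;
        }
    }
    if (c->count == sizeof c->items / sizeof c->items[0])
        return -1;
    c->items[c->count].name  = name;
    c->items[c->count].value = value;
    c->items[c->count].flags = flags;
    c->count++;
    return 0;
}

/* Look up a value by name; returns NULL if the name is absent. */
const char *kv_get(const struct kv_container *c, const char *name)
{
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->items[i].name, name) == 0)
            return c->items[i].value;
    return NULL;
}
```

A caller following the split described above would keep two such containers, one for job attributes and one for resources, and fill each with kv_set() before submitting.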

pbs_submit_hash_ext() is a wrapper function for pbs_submit_hash() and it is
never called in the code.

I apologize that the documentation isn't up to snuff on this. I think that
the best way forward on this issue is hammering down why things are failing
the way they are. We recently had another customer report that they are
having a similar issue and we'd like to help in supporting pbs_python, so
perhaps we need to do some collaborative debugging on this. Do you think
you can send in the script that gets things into this state so I can try to
reproduce it locally?
--
David Beer | Senior Software Engineer
Adaptive Computing
Glen Beane
2015-09-25 17:14:08 UTC
I don't think the problem is just with pbs_python. We've duplicated it
with the C API. Once pbs_server gets into this state, submissions using
pbs_submit() directly from C also fail exactly the same way. And we can go
weeks between these events, so it isn't super easy to reproduce. But once
it gets into this state, it stays in this state until a restart.

I think all you need is a program (it could even be C) that uses
pbs_submit(); if you run it hundreds or thousands of times, I think you can
eventually reproduce the problem.


First create a dummy script:

echo "hostname" > /tmp/test.sh

Then this C program will submit jobs that fail once pbs_server gets into
this state (but works fine up to that point, and fine after a pbs_server
restart):

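#include <pbs_ifl.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    int fd = pbs_connect(0);   /* 0 = default server */
    char *new_jobid;

    if (fd < 0) {
        fprintf(stderr, "pbs_connect failed\n");
        return 1;
    }

    /* no attributes, default destination, no extend string */
    new_jobid = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);
    if (new_jobid == NULL) {
        fprintf(stderr, "pbs_submit failed\n");
        return 1;
    }

    printf("%s\n", new_jobid);
    free(new_jobid);           /* the job id string is allocated by pbs_submit */
    pbs_disconnect(fd);
    return 0;
}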




while this program submits jobs that always run, even when jobs submitted
with the program shown above fail (it needs to be compiled as C++, at least
with Torque 4, because memmgr_init is not declared with extern "C" and its
name is mangled):


#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int fd = pbs_connect(0);   /* 0 = default server */
    char *new_jobid = NULL;
    char script[] = "/tmp/test.sh";
    memmgr* mm;
    job_data* job_attrs = 0;

    if (fd < 0) {
        fprintf(stderr, "pbs_connect failed\n");
        return 1;
    }

    memmgr_init(&mm, 0);

    /* pass an empty ATTR_v; I was getting a segfault if job_attrs was
       empty when I called pbs_submit_hash */
    hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);

    pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL,
                    &new_jobid, NULL);

    if (new_jobid == NULL) {
        fprintf(stderr, "pbs_submit_hash failed\n");
        return 1;
    }

    printf("%s\n", new_jobid);

    return 0;
}




When we use pbs_submit() through pbs_python it works fine 99% of the time.
But then suddenly ALL jobs submitted with pbs_submit() start failing
(whether through pbs_python or directly from C), while qsub still works
(qsub uses pbs_submit_hash()). The only fix at this point is to restart
pbs_server, and then it will be fine again for a while (could be weeks).

What happens is that the job submission appears to be successful, but when
the job goes to run, the job script is missing. The job gets executed and
fails immediately with an error for the missing job script:

"-bash: line 1:
/var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file
or directory"


So pbs_mom is trying to exec the .SC file, but for some reason it doesn't
exist. EVERY job submitted with pbs_submit() will fail at run time with
this error, but jobs submitted with pbs_submit_hash() continue to work
fine. This continues until pbs_server is restarted, and then everything
works fine again.
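
For anyone scripting a reproduction, the failure can be recognized from the job's output. Below is a minimal sketch of such a check, assuming the error arrives as a single line of the form quoted above. The function name is_missing_script_error is hypothetical and is not part of Torque or pbs_python.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if a line of job output looks like the
 * missing-job-script failure quoted in this thread, 0 otherwise. */
int is_missing_script_error(const char *line)
{
    const char *sc;

    if (line == NULL)
        return 0;

    /* The job script copy under mom_priv/jobs ends in ".SC". */
    sc = strstr(line, ".SC:");
    if (sc == NULL)
        return 0;

    /* The shell's complaint must follow the script path. */
    return strstr(sc, "No such file or directory") != NULL;
}
```

A test harness could feed each line of a finished job's error file through this function and stop submitting as soon as it returns 1, which would mark the point where pbs_server entered the bad state.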




It looks like in Torque 4 pbs_submit_hash is in libtorque with a plain C
name, but in Torque 5 only pbs_submit_hash_ext has a plain C name, and
pbs_submit_hash appears only under a C++-mangled name.

Torque 5:

$ strings libtorque.so | grep submit_hash
_Z15pbs_submit_hashiPN9container14item_containerIP8job_dataEES4_PcS5_S5_PS5_S6_
pbs_submit_hash_ext

Torque 4:

$ strings libtorque.so | grep submit_hash
pbs_submit_hash
David Beer
2015-09-25 17:32:15 UTC
Glen,

I appreciate the additional info. After looking into this again I noticed
one small difference between what pbs_submit_hash() sends to pbs_server and
what is sent by pbs_submit(). It is small, but it could be the cause.
pbs_submit() never includes the job id, but pbs_submit_hash() always does.
When the job id is empty, the server just grabs the first job in the new
jobs list. This definitely allows for the possibility of an error - just
the kind of error that only happens every so many weeks and is super hard
to reproduce.
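
The lookup behavior described above can be sketched with a toy model. This is not pbs_server's code or its data structures. The names new_job, select_new_job and attach_script are hypothetical, and the model assumes only what is stated above: an empty id selects the first entry in the new jobs list.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: NOT pbs_server's data structures. A "new job" here is
 * one the server has created but not yet received a script for. */
struct new_job {
    const char *id;
    int         has_script;
};

/* An empty id selects the first entry in the list; a non-empty id
 * selects the entry with that id, or NULL if there is none. */
struct new_job *select_new_job(struct new_job *jobs, size_t n, const char *id)
{
    if (n == 0)
        return NULL;
    if (id == NULL || id[0] == '\0')
        return &jobs[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(jobs[i].id, id) == 0)
            return &jobs[i];
    return NULL;
}

/* Attach a script to whichever job the lookup selects.
 * Returns the id of the job that received it, or NULL. */
const char *attach_script(struct new_job *jobs, size_t n, const char *id)
{
    struct new_job *j = select_new_job(jobs, n, id);

    if (j == NULL)
        return NULL;
    j->has_script = 1;
    return j->id;
}
```

In this model, if the list holds more than one new job and a script intended for the second one arrives with an empty id, the script is attached to the first job and the second is left without one. Sending the id, as pbs_submit_hash() does, selects the intended job.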

I haven't reproduced it yet, but here's a patch that makes them exactly the
same in case someone else has the time to test it before I get to it.
Post by Glen Beane
I don't think the problem is just with pbs_python. We've duplicated it
with the C API. Once pbs_server gets into this state, submissions using
pbs_submit() directly from C also fail exactly the same way. And we can go
weeks between these events, so it isn't super easy to reproduce. But once
it gets into this state, it stays in this state until a restart.
I think all you need is a program, I think it could even be C, that uses
pbs_submit() and if you run it hundred/thousands of times I think you can
eventually reproduce the problem.
echo "hostname" /tmp/test.sh
Then this C program will submit jobs that fail once pbs_server gets into
this state (but works fine up to that point, and fine after a pbs_server
#include <pbs_ifl.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int fd = pbs_connect(0);
char *new_jobid;
new_jobid = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);
printf("%s\n", new_jobid);
return 0;
}
while this will submit jobs that always run, even when jobs submitted with
the program shown above fail (needs to be compiled as C++, at least with
Torque 4 because memmgr_init is not declared with "extern C" and its name
#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>
int main (int argc, char **argv)
{
    int fd = pbs_connect(0);    /* 0 selects the default server */
    char *new_jobid = NULL;
    char script[] = "/tmp/test.sh";
    memmgr* mm;
    job_data* job_attrs = 0;
    if (fd < 0)
        return 1;               /* could not connect to pbs_server */
    memmgr_init(&mm, 0);
    /* pass empty ATTR_v, I was getting a segfault if job_attrs was empty
       when I called pbs_submit_hash */
    hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);
    pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL,
                    &new_jobid, NULL);
    if (new_jobid != NULL)      /* stays NULL if pbs_submit_hash did not set it */
        printf("%s\n", new_jobid);
    pbs_disconnect(fd);
    return 0;
}
When we use pbs_submit() through pbs_python it works fine 99% of the
time. But then suddenly ALL jobs submitted with pbs_submit() start failing
(both through pbs_python and directly from C), while qsub still works (qsub
uses pbs_submit_hash()). The only fix at this point is to restart
pbs_server, and then it will be fine again for a while (could be weeks).
What happens is that the job submission appears to be successful, but
when the job goes to run, the job script is missing. The job gets executed
and immediately exits with:
"-bash: line 1:
/var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file
or directory"
so pbs_mom is trying to exec the .SC file, but for some reason it doesn't
exist. EVERY job submitted with pbs_submit() will fail at run time with
this error, but jobs submitted with pbs_submit_hash() continue to work
fine. This continues until pbs_server is restarted, and then everything
works fine again.
It looks like in Torque 4 pbs_submit_hash is in libtorque, but in Torque 5
pbs_submit_hash_ext is in libtorque but pbs_submit_hash is not.
Torque 5:
$ strings libtorque.so | grep submit_hash
_Z15pbs_submit_hashiPN9container14item_containerIP8job_dataEES4_PcS5_S5_PS5_S6_
pbs_submit_hash_ext
Torque 4:
$ strings libtorque.so | grep submit_hash
pbs_submit_hash
Post by David Beer
I'm not sure what difference it can make to use pbs_submit() or
pbs_submit_hash() from the server's perspective. The only difference
between the two is how things are stored on the client side before being
sent to pbs_server, but the server receives the exact same information from
either pbs_submit() or pbs_submit_hash().
I personally do not think that the difference in functionality is caused
by using pbs_submit() instead of pbs_submit_hash(). If you did want to try
to use pbs_submit_hash(), you'd want to populate a job_data_container
object as is done in the code. This can be found in
src/include/u_hash_map_structs.h. Mostly, this is a container object of
name and value pairs with some extra flags. The other difference for this
function is that resources are stored separately from the rest of the
attributes. There is one container for the job attributes and a separate
one for the resources. For an example of how this is populated, you can
look at src/cmds/qsub_functions.c and look at the ji.job_attr for the
attributes and ji.res_attr for the resources.
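
As a rough picture of the layout described above (name and value pairs with extra flags, with attributes and resources kept in separate containers), here is a small self-contained sketch in C. It deliberately does not use Torque's headers: `kv_item`, `kv_container`, `kv_set`, `kv_get` and `submission` are hypothetical names made up for illustration, and the real types in src/include/u_hash_map_structs.h will differ.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 16

/* Hypothetical name/value pair with a flag word; not Torque's job_data. */
struct kv_item {
    const char *name;
    const char *value;
    int         flags;
};

/* Hypothetical fixed-size container of pairs. */
struct kv_container {
    struct kv_item items[MAX_ITEMS];
    size_t         count;
};

/* Attributes and resources live in two separate containers. */
struct submission {
    struct kv_container job_attr;
    struct kv_container res_attr;
};

/* Add a pair, or overwrite it if the name already exists.
 * Returns 0 on success, -1 if the container is full. */
int kv_set(struct kv_container *c, const char *name, const char *value,
           int flags)
{
    size_t i;

    for (i = 0; i < c->count; i++) {
        if (strcmp(c->items[i].name, name) == 0) {
            c->items[i].value = value;
            c->items[i].flags = flags;
            return 0;
        }
    }
    if (c->count == MAX_ITEMS)
        return -1;
    c->items[c->count].name = name;
    c->items[c->count].value = value;
    c->items[c->count].flags = flags;
    c->count++;
    return 0;
}

/* Look up a value by name; returns NULL when the name is absent. */
const char *kv_get(const struct kv_container *c, const char *name)
{
    size_t i;

    for (i = 0; i < c->count; i++) {
        if (strcmp(c->items[i].name, name) == 0)
            return c->items[i].value;
    }
    return NULL;
}
```

A caller would zero a `struct submission`, put something like a job name into `job_attr` and something like a walltime limit into `res_attr`, and hand the two containers over separately; the strings used are illustrative only.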
pbs_submit_hash_ext() is a wrapper function for pbs_submit_hash() and it
is never called in the code.
I apologize that the documentation isn't up to snuff on this. I think
that the best way forward on this issue is hammering down why things are
failing the way they are. We recently had another customer report that they
are having a similar issue and we'd like to help in supporting pbs_python,
so perhaps we need to do some collaborative debugging on this. Do you think
you can send in the script that gets things into this state so I can try to
reproduce it locally?
Post by Glen Beane
Following up on this, Ken:
any luck sorting out pbs_submit(), pbs_submit_hash(), and
pbs_submit_hash_ext()?
Right now we're stuck using pbs_submit() from pbs_python (at least until
Bas can get info about pbs_submit_hash() so he can implement it in
pbs_python), but occasionally pbs_server gets into a state where jobs
submitted with pbs_submit() always fail until pbs_server is restarted.
While in this state, jobs submitted with pbs_submit_hash() still work, so
we would really like to switch (or have pbs_submit() fixed so this doesn't
happen any more).
On Tue, Mar 31, 2015 at 10:44 AM, Ken Nielson <
Post by Ken Nielson
Hi all,
This is going to take a little sorting out before I get you a good
answer. I will get back with you later.
Ken
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Senior Software Engineer
Adaptive Computing
Glen Beane
2015-09-25 18:13:01 UTC
Permalink
Nice catch. This seems like it could be it. I'm going to see about
getting this installed on our test system so we can run a bunch of
submissions through it.


I think pbs_submit() worked fine before the multi-threaded pbs_server, so
I always figured it was some kind of race condition.

Once this happens things must get out of sync because every pbs_submit()
from then on has the same thing happen.
David Beer
2015-09-25 20:44:03 UTC
Permalink
Cool. Let me know if it resolves things for you and I'll get it checked in.
Thanks for helping test this Glen.
--
David Beer | Senior Software Engineer
Adaptive Computing
Bas van der Vlies
2015-09-28 06:52:02 UTC
Permalink
Thanks, both, for the patch and the debugging. Just one question: is this patch for all Torque versions (4.X and 5.X)?
Cool. Let me know if it resolves things for you and I'll get it checked in. Thanks for helping test this Glen.
Nice catch. This seems like it could be it. I'm going to see about getting this installed on our test system so we run a bunch of submissions through it.
I think pbs_submit() worked fine before the multi-threaded pbs_server, so I always figured it was some kind of race condition.
Once this happens things must get out of sync because every pbs_submit() from then on has the same thing happen.
Glen,
I appreciate the additional info. After looking into this again I noticed one small difference between what pbs_submit_hash() sends to pbs_server and what is sent by pbs_submit(). It is small, but it could be it. pbs_submit() never includes the job id, but pbs_submit_hash() always does. When the job id is empty, it just grabs the first job in the new jobs list. This definitely allows for the possibility of an error - just the kind of error that only happens every so many weeks and is super hard to reproduce.
I haven't reproduced it yet, but here's a patch that makes them exactly the same in case someone else has the time to test it before I get to it.
I don't think the problem is just with pbs_python. We've duplicated it with the C API. Once pbs_server gets into this state, submissions using pbs_submit() directly from C also fail exactly the same way. And we can go weeks between these events, so it isn't super easy to reproduce. But once it gets into this state, it stays in this state until a restart.
I think all you need is a program, I think it could even be C, that uses pbs_submit() and if you run it hundred/thousands of times I think you can eventually reproduce the problem.
echo "hostname" /tmp/test.sh
#include <pbs_ifl.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int fd = pbs_connect(0);
char *new_jobid;
new_jobid = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);
printf("%s\n", new_jobid);
return 0;
}
#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int fd = pbs_connect(0);
char *new_jobid;
char script[] = "/tmp/test.sh";
memmgr* mm;
job_data* job_attrs = 0;
memmgr_init(&mm, 0);
/* pass empty ATTR_v, I was getting a segfault if job_attrs was empty when I called pbs_submit_hash */
hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);
pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL, &new_jobid, NULL);
printf("%s\n", new_jobid);
return 0;
}
When we use pbs_submit() through pbs_python it works fine 99% of the time. But then suddenly ALL jobs submitted with pbs_submit() start failing (both with pbs_python or directly from C), however qsub still works (qsub uses pbs_submit_hash()). The only fix at this point is to restart pbs_server, and then it will be fine again for a while (could be weeks).
/var/spool/torque/mom_priv/jobs/
NNNNNNN.host.domain.SC
: No such file
or directory"
so pbs_mom is trying to exec the .SC file, but for some reason it doesn't exist. EVERY job submitted with pbs_submit() will fail at run time with this error, but jobs submitted with pbs_submit_hash() continue to work fine. This continues until pbs_sever is restarted, and then everything works fine again.
It looks like in Torque 4 pbs_submit_hash is in libtorque, but in Torque 5 pbs_submit_hash_ext is in libtorque but pbs_submit_hash is not.
$ strings libtorque.so | grep submit_hash
_Z15pbs_submit_hashiPN9container14item_containerIP8job_dataEES4_PcS5_S5_PS5_S6_
pbs_submit_hash_ext
$ strings libtorque.so | grep submit_hash
pbs_submit_hash
I'm not sure what difference it can make to use pbs_submit() or pbs_submit_hash() from the server's perspective. The only difference between the two is how things are stored on the client side before being sent to pbs_server, but the server receives the exact same information from either pbs_submit() or pbs_submit_hash().
I personally do not thing that the difference in functionality is caused by using pbs_submit() instead of pbs_submit_hash(). If you did want to try to use pbs_submit_hash(), you'd want to populate a job_data_container object as is done in the code. This can be found in src/include/u_hash_map_structs.h. Mostly, this is a container object of name and value pairs with some extra flags. The other difference for this function is that resources are stored separately from the rest of the attributes. There is one container for the job attributes and a separate one for the resources. For an example of how this is populated, you can look at src/cmds/qsub_functions.c and look at the ji.job_attr for the attributes and ji.res_attr for the resources.
pbs_submit_hash_ext() is a wrapper function for pbs_submit_hash() and it is never called in the code.
I apologize that the documentation isn't up to snuff on this. I think that the best way forward on this issue is hammering down why things are failing the way they are. We recently had another customer report that they are having a similar issue and we'd like to help in supporting pbs_python, so perhaps we need to do some collaborative debugging on this. Do you think you can send in the script that gets things into this state so I can try to reproduce it locally?
Following up on this, Ken.
Any luck sorting out pbs_submit(), pbs_submit_hash(), and pbs_submit_hash_ext()?
Right now we're stuck using pbs_submit() from pbs_python (at least until Bas can get info about pbs_submit_hash() so he can implement it in pbs_python), but occasionally pbs_server gets into a state where jobs submitted with pbs_submit() always fail until pbs_server is restarted. While in this state, jobs submitted with pbs_submit_hash() still work, so we would really like to switch (or have pbs_submit() fixed so this doesn't happen any more).
On Tue, Mar 31, 2015 at 10:44 AM, Ken Nielson wrote:
Hi all,
This is going to take a little sorting out before I get you a good answer. I will get back to you later.
Ken
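We have found that occasionally pbs_server gets into a state where any job submitted with pbs_submit() will fail. The failure is typically that the job script is missing at run time, and the job immediately exits with a "no such file or directory" error:
"-bash: line 1: /var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file or directory"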
Far less often, I've seen pbs_submit() fail completely (it seems to fail sending the job script to pbs_server), so the job isn't queued, or the job gets queued but the job script gets corrupted.
Note that while our pipelines that use pbs_submit() are failing, jobs submitted with pbs_submit_hash() via qsub (or a simple test program) continue to work fine. A restart of pbs_server fixes everything.
Because some of these pipelines are production, we would like to have them as reliable as possible. It seems that pbs_submit_hash() is now more reliable than pbs_submit(), so we would like to switch. However, there isn't any documentation, and the API doesn't seem entirely stable (pbs_submit_hash vs pbs_submit_hash_ext). We use pbs_python for our pipelines, so I'm really looking forward to being able to use pbs_submit_hash from pbs_python.
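On Tue, Mar 31, 2015 at 8:28 AM, Bas van der Vlies wrote:
I have some questions about the submit functions for the different Torque versions. I am building the Python interface on top of the Torque library.
In Torque 2.X it was easy; we have one function plus a man page:
* pbs_submit
Now I have a report that pbs_submit is not reliable anymore and that we have to use:
4.X) pbs_submit_hash
5.X) pbs_submit_hash_ext
I cannot find any documentation for them or how to use them. They are in pbs_ifl.h, so these are the API functions. It would be handy if there were some kind of documentation.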
---
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
Ken Nielson Sr. Software Engineer
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
www.adaptivecomputing.com
--
David Beer | Senior Software Engineer
Adaptive Computing
---
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300 | ***@surfsara.nl | www.surfsara.nl |
David Beer
2015-09-28 15:37:21 UTC
Permalink
This patch applies cleanly to 4.2.10 as well. I would imagine it'd also
apply to other 4.X.

On Mon, Sep 28, 2015 at 12:52 AM, Bas van der Vlies wrote:
Thanks both for the patch and debugging. Just a question: is this patch for all Torque versions (4.X and 5.X)?
Cool. Let me know if it resolves things for you and I'll get it checked in. Thanks for helping test this, Glen.
Nice catch. This seems like it could be it. I'm going to see about getting this installed on our test system so we can run a bunch of submissions through it.
I think pbs_submit() worked fine before the multi-threaded pbs_server, so I always figured it was some kind of race condition.
Once this happens things must get out of sync, because every pbs_submit() from then on has the same thing happen.
Glen,
I appreciate the additional info. After looking into this again I noticed one small difference between what pbs_submit_hash() sends to pbs_server and what is sent by pbs_submit(). It is small, but it could be it. pbs_submit() never includes the job id, but pbs_submit_hash() always does. When the job id is empty, pbs_server just grabs the first job in the new jobs list. This definitely allows for the possibility of an error, just the kind of error that only happens every so many weeks and is super hard to reproduce.
I haven't reproduced it yet, but here's a patch that makes them exactly the same in case someone else has the time to test it before I get to it.
I don't think the problem is just with pbs_python. We've duplicated it with the C API. Once pbs_server gets into this state, submissions using pbs_submit() directly from C also fail in exactly the same way. And we can go weeks between these events, so it isn't super easy to reproduce. But once it gets into this state, it stays in this state until a restart.
I think all you need is a program (I think it could even be C) that uses pbs_submit(); if you run it hundreds or thousands of times, I think you can eventually reproduce the problem.
echo "hostname" > /tmp/test.sh
Then this C program will submit jobs that fail once pbs_server gets into this state (but works fine up to that point, and fine after a pbs_server restart):
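#include <pbs_ifl.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char **argv)
{
    char script[] = "/tmp/test.sh";
    char *new_jobid;
    int fd = pbs_connect(0); /* 0 selects the default server */
    if (fd < 0)
        return 1; /* could not reach pbs_server */
    /* no attributes, default queue, no extend string */
    new_jobid = pbs_submit(fd, 0, script, 0, 0);
    if (new_jobid == NULL)
        fprintf(stderr, "pbs_submit failed\n");
    else
        printf("%s\n", new_jobid);
    free(new_jobid); /* the returned job id is malloc'ed; free(NULL) is a no-op */
    pbs_disconnect(fd);
    return 0;
}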
while this will submit jobs that always run, even when jobs submitted
with the program shown above fail (needs to be compiled as C++, at least
with Torque 4 because memmgr_init is not declared with "extern C" and its
#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>
int main (int argc, char **argv)
{
    int fd = pbs_connect(0);
    int rc;
    char *new_jobid = NULL;
    char script[] = "/tmp/test.sh";
    memmgr* mm;
    job_data* job_attrs = 0;
    if (fd < 0)
        return 1; /* could not reach pbs_server */
    memmgr_init(&mm, 0);
    /* pass empty ATTR_v, I was getting a segfault if job_attrs was
       empty when I called pbs_submit_hash */
    hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);
    rc = pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL,
                         &new_jobid, NULL);
    if (new_jobid == NULL) {
        /* no job id came back, so report the return code instead */
        fprintf(stderr, "pbs_submit_hash failed (rc=%d)\n", rc);
        return 1;
    }
    printf("%s\n", new_jobid);
    return 0;
}
When we use pbs_submit() through pbs_python it works fine 99% of the
time. But then suddenly ALL jobs submitted with pbs_submit() start failing
(both with pbs_python or directly from C), however qsub still works (qsub
uses pbs_submit_hash()). The only fix at this point is to restart
pbs_server, and then it will be fine again for a while (could be weeks).
What happens is that the job submission appears to be successful, but when the job goes to run the job script is missing: The job gets executed
"-bash: line 1: /var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file or directory"
Glen Beane
2015-10-06 16:15:27 UTC
Permalink
We've been running this patch on 5.1.1 on my dev cluster. We've rapidly
submitted thousands of jobs using pbs_submit and have been unable to
reproduce the problem. While I don't have any definitive proof this fixes
the issue, it seems like it could have and there is no downside. I'd like
to see this patch incorporated into torque 4 and 5. I think for our
production clusters our sysadmins use the Adaptive yum repo, so it would
be nice if they could get this as an update for torque 4 through that
channel.
David Beer
2015-10-06 19:40:23 UTC
Permalink
Glen,

Thanks for help testing. This has now been checked in to 4 and 5.
Post by Glen Beane
We've been running this patch on 5.1.1 on my dev cluster. We've rapidly
submitted thousands of jobs using pbs_submit and have been unable to
reproduce the problem. While I don't have any definitive proof this fixes
the issue, it seems like it could have and there is no downside. I'd like
to see this patch incorporated into torque 4 and 5. I think for our
production clusters our sysadmins use the Adaptive yum repo, so it would
be nice if they could get this as an update for torque 4 through that
channel.
Post by David Beer
This patch applies cleanly to 4.2.10 as well. I would imagine it'd also
apply to other 4.X.
On Mon, Sep 28, 2015 at 12:52 AM, Bas van der Vlies <
Post by Bas van der Vlies
Thanks both for the patch and debugging. Just a question is this patch
for all torque versions (4 .X and 5.X)?
Post by David Beer
Cool. Let me know if it resolves things for you and I'll get it
checked in. Thanks for helping test this Glen.
Post by David Beer
Nice catch. This seems like it could be it. I'm going to see about
getting this installed on our test system so we run a bunch of submissions
through it.
Post by David Beer
I think pbs_submit() worked fine before the multi-threaded
pbs_server, so I always figured it was some kind of race condition.
Post by David Beer
Once this happens things must get out of sync because every
pbs_submit() from then on has the same thing happen.
Post by David Beer
On Fri, Sep 25, 2015 at 1:32 PM, David Beer <
Glen,
I appreciate the additional info. After looking into this again I
noticed one small difference between what pbs_submit_hash() sends to
pbs_server and what is sent by pbs_submit(). It is small, but it could be
it. pbs_submit() never includes the job id, but pbs_submit_hash() always
does. When the job id is empty, it just grabs the first job in the new jobs
list. This definitely allows for the possibility of an error - just the
kind of error that only happens every so many weeks and is super hard to
reproduce.
Post by David Beer
I haven't reproduced it yet, but here's a patch that makes them
exactly the same in case someone else has the time to test it before I get
to it.
Post by David Beer
I don't think the problem is just with pbs_python. We've duplicated
it with the C API. Once pbs_server gets into this state, submissions
using pbs_submit() directly from C also fail exactly the same way. And we
can go weeks between these events, so it isn't super easy to reproduce.
But once it gets into this state, it stays in this state until a restart.
Post by David Beer
I think all you need is a program, I think it could even be C, that
uses pbs_submit() and if you run it hundred/thousands of times I think you
can eventually reproduce the problem.
Post by David Beer
echo "hostname" /tmp/test.sh
Then this C program will submit jobs that fail once pbs_server gets
into this state (but works fine up to that point, and fine after a
Post by David Beer
#include <pbs_ifl.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int fd = pbs_connect(0);
char *new_jobid;
new_jobid = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);
printf("%s\n", new_jobid);
return 0;
}
while this will submit jobs that always run, even when jobs submitted
with the program shown above fail (needs to be compiled as C++, at least
with Torque 4 because memmgr_init is not declared with "extern C" and its
Post by David Beer
#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int fd = pbs_connect(0);
char *new_jobid;
char script[] = "/tmp/test.sh";
memmgr* mm;
job_data* job_attrs = 0;
memmgr_init(&mm, 0);
/* pass empty ATTR_v, I was getting a segfault if job_attrs was
empty when I called pbs_submit_hash */
Post by David Beer
hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);
pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL,
&new_jobid, NULL);
Post by David Beer
printf("%s\n", new_jobid);
return 0;
}
When we use pbs_submit() through pbs_python it works fine 99% of the
time. But then suddenly ALL jobs submitted with pbs_submit() start failing
(both with pbs_python or directly from C), however qsub still works (qsub
uses pbs_submit_hash()). The only fix at this point is to restart
pbs_server, and then it will be fine again for a while (could be weeks).
Post by David Beer
What happens is that the job submission appears to be successful, but
when the job goes to run the job script is missing: The job gets executed
Post by David Beer
/var/spool/torque/mom_priv/jobs/
NNNNNNN.host.domain.SC
: No such file
or directory"
so pbs_mom is trying to exec the .SC file, but for some reason it
doesn't exist. EVERY job submitted with pbs_submit() will fail at run
time with this error, but jobs submitted with pbs_submit_hash() continue to
work fine. This continues until pbs_sever is restarted, and then
everything works fine again.
Post by David Beer
It looks like in Torque 4 pbs_submit_hash is in libtorque, but in
Torque 5 pbs_submit_hash_ext is in libtorque but pbs_submit_hash is not.
Post by David Beer
$ strings libtorque.so | grep submit_hash
_Z15pbs_submit_hashiPN9container14item_containerIP8job_dataEES4_PcS5_S5_PS5_S6_
Post by David Beer
pbs_submit_hash_ext
$ strings libtorque.so | grep submit_hash
pbs_submit_hash
On Fri, Sep 25, 2015 at 12:04 PM, David Beer <
I'm not sure what difference it can make to use pbs_submit() or
pbs_submit_hash() from the server's perspective. The only difference
between the two is how things are stored on the client side before being
sent to pbs_server, but the server receives the exact same information from
either pbs_submit() or pbs_submit_hash().
I personally do not think that the difference in functionality is
caused by using pbs_submit() instead of pbs_submit_hash(). If you did want
to try to use pbs_submit_hash(), you'd want to populate a
job_data_container object as is done in the code. This can be found in
src/include/u_hash_map_structs.h. Mostly, this is a container object of
name and value pairs with some extra flags. The other difference for this
function is that resources are stored separately from the rest of the
attributes. There is one container for the job attributes and a separate
one for the resources. For an example of how this is populated, you can
look at src/cmds/qsub_functions.c and look at the ji.job_attr for the
attributes and ji.res_attr for the resources.
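To make the two-container layout concrete, here is a minimal, self-contained sketch. It does not use the real Torque types: demo_attrl, demo_pair, demo_container and the demo_* functions are hypothetical stand-ins written for illustration. Keying resources by their bare resource name (for example "walltime") is an assumption about how qsub fills ji.res_attr, so check src/cmds/qsub_functions.c before relying on it.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the flat attribute list that pbs_submit() takes. The field
   names mirror struct attrl in pbs_ifl.h; the type itself is hypothetical. */
struct demo_attrl {
    struct demo_attrl *next;
    const char *name;      /* e.g. "Job_Name" or "Resource_List" */
    const char *resource;  /* set only for resource entries, e.g. "walltime" */
    const char *value;
};

/* Hypothetical name/value/flags entry, loosely modelled on the description
   of job_data above. This is NOT the real Torque structure. */
struct demo_pair {
    const char *name;
    const char *value;
    int flags;             /* where the value came from */
};

#define DEMO_MAX 16
enum { DEMO_CMDLINE = 1 }; /* hypothetical flag value */

struct demo_container {
    struct demo_pair items[DEMO_MAX];
    int count;
};

/* Append one name/value pair to a container. */
static void demo_add(struct demo_container *c, const char *name,
                     const char *value, int flags)
{
    assert(c != NULL && name != NULL && value != NULL);
    assert(c->count < DEMO_MAX);
    c->items[c->count].name = name;
    c->items[c->count].value = value;
    c->items[c->count].flags = flags;
    c->count++;
}

/* Look up a value by name; returns NULL when the name is absent. */
static const char *demo_find(const struct demo_container *c, const char *name)
{
    int i;
    for (i = 0; i < c->count; i++)
        if (strcmp(c->items[i].name, name) == 0)
            return c->items[i].value;
    return NULL;
}

/* Split one flat list into two containers: entries carrying a resource
   name go to res_attr, everything else goes to job_attr. */
static void demo_split(const struct demo_attrl *list,
                       struct demo_container *job_attr,
                       struct demo_container *res_attr)
{
    for (; list != NULL; list = list->next) {
        if (list->resource != NULL && list->resource[0] != '\0')
            demo_add(res_attr, list->resource, list->value, DEMO_CMDLINE);
        else
            demo_add(job_attr, list->name, list->value, DEMO_CMDLINE);
    }
}
```

Given a list holding Job_Name=test and Resource_List.walltime=01:00:00, demo_split() leaves one entry in each container, and demo_find(&res_attr, "walltime") returns "01:00:00". The same information is present in both representations; only the client-side layout differs.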
pbs_submit_hash_ext() is a wrapper function for pbs_submit_hash() and
it is never called in the code.
I apologize that the documentation isn't up to snuff on this. I think
that the best way forward on this issue is hammering down why things are
failing the way they are. We recently had another customer report that they
are having a similar issue and we'd like to help in supporting pbs_python,
so perhaps we need to do some collaborative debugging on this. Do you think
you can send in the script that gets things into this state so I can try to
reproduce it locally?
Following up on this, Ken: any luck sorting out pbs_submit(),
pbs_submit_hash(), and pbs_submit_hash_ext()?
Right now we're stuck using pbs_submit() from pbs_python (at least
until Bas can get info about pbs_submit_hash() so he can implement it in
pbs_python), but occasionally pbs_server gets into a state where jobs
submitted with pbs_submit() always fail until pbs_server is restarted.
While in this state, jobs submitted with pbs_submit_hash() still work, so
we would really like to switch (or have pbs_submit() fixed so this doesn't
happen any more).
On Tue, Mar 31, 2015 at 10:44 AM, Ken Nielson wrote:
Hi all,
This is going to take a little sorting out before I get you a good
answer. I will get back with you later.
Ken
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
Ken Nielson Sr. Software Engineer
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
www.adaptivecomputing.com
--
David Beer | Senior Software Engineer
Adaptive Computing