This patch applies cleanly to 4.2.10 as well. I would imagine it'd also
apply to other 4.X.
Post by Bas van der Vlies
Thanks both for the patch and debugging. Just a question: is this patch
for all Torque versions (4.X and 5.X)?
Post by David Beer
Cool. Let me know if it resolves things for you and I'll get it
checked in. Thanks for helping test this Glen.
Post by David Beer
Nice catch. This seems like it could be it. I'm going to see about
getting this installed on our test system so we run a bunch of submissions
through it.
Post by David Beer
I think pbs_submit() worked fine before the multi-threaded
pbs_server, so I always figured it was some kind of race condition.
Post by David Beer
Once this happens things must get out of sync because every
pbs_submit() from then on has the same thing happen.
Post by David Beer
On Fri, Sep 25, 2015 at 1:32 PM, David Beer <
Glen,
I appreciate the additional info. After looking into this again I
noticed one small difference between what pbs_submit_hash() sends to
pbs_server and what is sent by pbs_submit(). It is small, but it could be
it. pbs_submit() never includes the job id, but pbs_submit_hash() always
does. When the job id is empty, it just grabs the first job in the new jobs
list. This definitely allows for the possibility of an error - just the
kind of error that only happens every so many weeks and is super hard to
reproduce.
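To make the failure mode described above concrete, here is a minimal sketch of a lookup that falls back to the first entry of a new-jobs list when the job id is empty. Everything in it (struct new_job, find_new_job, attach_script) is a hypothetical stand-in written for illustration; it is not Torque's server code, only a model of the behaviour described in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an entry in the server's list of new,
   not yet committed jobs. Not Torque's actual job structure. */
struct new_job {
    const char *jobid;
    const char *script;   /* NULL until a script has been attached */
};

/* Find the job a request refers to. With an empty id, fall back to the
   first entry in the list, which is the behaviour described above. */
static struct new_job *find_new_job(struct new_job *jobs, size_t n,
                                    const char *jobid)
{
    if (n == 0)
        return NULL;
    assert(jobs != NULL);
    if (jobid == NULL || jobid[0] == '\0')
        return &jobs[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(jobs[i].jobid, jobid) == 0)
            return &jobs[i];
    return NULL;
}

/* Attach a script to whichever job the lookup returns; 0 on success. */
static int attach_script(struct new_job *jobs, size_t n,
                         const char *jobid, const char *script)
{
    struct new_job *j = find_new_job(jobs, n, jobid);
    if (j == NULL)
        return -1;
    j->script = script;
    return 0;
}
```

In this model, if two jobs are pending and a script meant for the second one arrives with an empty id, attach_script() puts it on the first job and the second job is left without a script. Sending the id, as pbs_submit_hash() does, removes the ambiguity.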
Post by David Beer
I haven't reproduced it yet, but here's a patch that makes them
exactly the same in case someone else has the time to test it before I get
to it.
Post by David Beer
I don't think the problem is just with pbs_python. We've duplicated
it with the C API. Once pbs_server gets into this state, submissions
using pbs_submit() directly from C also fail exactly the same way. And we
can go weeks between these events, so it isn't super easy to reproduce.
But once it gets into this state, it stays in this state until a restart.
Post by David Beer
I think all you need is a program (it could even be C) that uses
pbs_submit(); if you run it hundreds or thousands of times, I think you
can eventually reproduce the problem.
echo "hostname" > /tmp/test.sh
Then this C program will submit jobs that fail once pbs_server gets
into this state (but works fine up to that point, and fine after a
restart):
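#include <pbs_ifl.h>
#include <stdio.h>
int main (int argc, char **argv)
{
    int fd = pbs_connect(0);
    char *new_jobid;
    if (fd < 0) {
        fprintf(stderr, "pbs_connect failed\n");
        return 1;
    }
    new_jobid = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);
    if (new_jobid == NULL) {
        fprintf(stderr, "pbs_submit returned no job id\n");
        pbs_disconnect(fd);
        return 1;
    }
    printf("%s\n", new_jobid);
    pbs_disconnect(fd);
    return 0;
}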
while this will submit jobs that always run, even when jobs submitted
with the program shown above fail (needs to be compiled as C++, at least
with Torque 4 because memmgr_init is not declared with "extern C" and its
#include <pbs_ifl.h>
#include <u_hash_map_structs.h>
#include <u_memmgr.h>
#include <stdio.h>
int main (int argc, char **argv)
{
    int fd = pbs_connect(0);
    char *new_jobid = NULL;
    char script[] = "/tmp/test.sh";
    memmgr* mm;
    job_data* job_attrs = 0;
    if (fd < 0) {
        fprintf(stderr, "pbs_connect failed\n");
        return 1;
    }
    memmgr_init(&mm, 0);
    /* pass empty ATTR_v, I was getting a segfault if job_attrs was
       empty when I called pbs_submit_hash */
    hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);
    pbs_submit_hash(fd, &mm, job_attrs, NULL, script, NULL, NULL,
                    &new_jobid, NULL);
    /* guard the printf below in case no job id was returned */
    if (new_jobid == NULL) {
        fprintf(stderr, "pbs_submit_hash returned no job id\n");
        pbs_disconnect(fd);
        return 1;
    }
    printf("%s\n", new_jobid);
    pbs_disconnect(fd);
    return 0;
}
When we use pbs_submit() through pbs_python it works fine 99% of the
time. But then suddenly ALL jobs submitted with pbs_submit() start failing
(both with pbs_python or directly from C), however qsub still works (qsub
uses pbs_submit_hash()). The only fix at this point is to restart
pbs_server, and then it will be fine again for a while (could be weeks).
Post by David Beer
What happens is that the job submission appears to be successful, but
when the job goes to run the job script is missing: The job gets executed
/var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file or directory"
so pbs_mom is trying to exec the .SC file, but for some reason it
doesn't exist. EVERY job submitted with pbs_submit() will fail at run
time with this error, but jobs submitted with pbs_submit_hash() continue to
work fine. This continues until pbs_server is restarted, and then
everything works fine again.
Post by David Beer
It looks like in Torque 4 pbs_submit_hash is in libtorque, but in
Torque 5 pbs_submit_hash_ext is in libtorque but pbs_submit_hash is not.
$ strings libtorque.so | grep submit_hash
_Z15pbs_submit_hashiPN9container14item_containerIP8job_dataEES4_PcS5_S5_PS5_S6_
pbs_submit_hash_ext
$ strings libtorque.so | grep submit_hash
pbs_submit_hash
On Fri, Sep 25, 2015 at 12:04 PM, David Beer <
I'm not sure what difference it can make to use pbs_submit() or
pbs_submit_hash() from the server's perspective. The only difference
between the two is how things are stored on the client side before being
sent to pbs_server, but the server receives the exact same information from
either pbs_submit() or pbs_submit_hash().
Post by David Beer
I personally do not think that the difference in functionality is
caused by using pbs_submit() instead of pbs_submit_hash(). If you did want
to try to use pbs_submit_hash(), you'd want to populate a
job_data_container object as is done in the code. This can be found in
src/include/u_hash_map_structs.h. Mostly, this is a container object of
name and value pairs with some extra flags. The other difference for this
function is that resources are stored separately from the rest of the
attributes. There is one container for the job attributes and a separate
one for the resources. For an example of how this is populated, you can
look at src/cmds/qsub_functions.c and look at the ji.job_attr for the
attributes and ji.res_attr for the resources.
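As a rough illustration of the layout described above (name and value pairs with extra flags, and one container for job attributes plus a separate one for resources), here is a small self-contained sketch. The types and functions (struct kv_pair, kv_set, kv_get, struct submit_request) are hypothetical and are not the definitions in src/include/u_hash_map_structs.h; for the real structures, the header and src/cmds/qsub_functions.c remain the reference.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name/value pair with a flag field; NOT Torque's job_data. */
struct kv_pair {
    char *name;
    char *value;
    int flags;
    struct kv_pair *next;
};

/* Attributes and resources live in two separate containers. */
struct submit_request {
    struct kv_pair *job_attr;
    struct kv_pair *res_attr;
};

/* Portable string copy (strdup is POSIX, not ISO C). */
static char *dup_str(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Add a pair, or overwrite the value if the name exists; 0 on success. */
static int kv_set(struct kv_pair **head, const char *name,
                  const char *value, int flags)
{
    assert(head != NULL && name != NULL && value != NULL);
    for (struct kv_pair *p = *head; p != NULL; p = p->next) {
        if (strcmp(p->name, name) == 0) {
            char *copy = dup_str(value);
            if (copy == NULL)
                return -1;
            free(p->value);
            p->value = copy;
            p->flags = flags;
            return 0;
        }
    }
    struct kv_pair *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->name = dup_str(name);
    node->value = dup_str(value);
    if (node->name == NULL || node->value == NULL) {
        free(node->name);
        free(node->value);
        free(node);
        return -1;
    }
    node->flags = flags;
    node->next = *head;
    *head = node;
    return 0;
}

/* Return the value stored under name, or NULL if it is absent. */
static const char *kv_get(const struct kv_pair *head, const char *name)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->name, name) == 0)
            return head->value;
    return NULL;
}
```

A caller would fill req.job_attr with entries such as a job name and req.res_attr with entries such as a walltime limit, and hand both to the submit call separately. The strings used as names here are only examples.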
Post by David Beer
pbs_submit_hash_ext() is a wrapper function for pbs_submit_hash() and
it is never called in the code.
Post by David Beer
I apologize that the documentation isn't up to snuff on this. I think
that the best way forward on this issue is hammering down why things are
failing the way they are. We recently had another customer report that they
are having a similar issue and we'd like to help in supporting pbs_python,
so perhaps we need to do some collaborative debugging on this. Do you think
you can send in the script that gets things into this state so I can try to
reproduce it locally?
Post by David Beer
Following up on this, Ken.
Any luck sorting out pbs_submit(), pbs_submit_hash(), and
pbs_submit_hash_ext()?
Post by David Beer
Right now we're stuck using pbs_submit() from pbs_python (at least
until Bas can get info about pbs_submit_hash() so he can implement it in
pbs_python), but occasionally pbs_server gets into a state where jobs
submitted with pbs_submit() always fail until pbs_server is restarted.
While in this state, jobs submitted with pbs_submit_hash() still work, so
we would really like to switch (or have pbs_submit() fixed so this doesn't
happen any more).
Post by David Beer
On Tue, Mar 31, 2015 at 10:44 AM, Ken Nielson <
Hi all,
This is going to take a little sorting out before I get you a good
answer. I will get back with you later.
Ken
We have found that occasionally pbs_server gets into a state where any
job submitted with pbs_submit() will fail. The failure is typically that
the job script is missing at run time, and the job immediately exits with a
/var/spool/torque/mom_priv/jobs/NNNNNNN.host.domain.SC: No such file or directory"
Far less often I've seen pbs_submit() fail completely (it seems to fail
sending the job script to pbs_server), so the job isn't queued, or seen the
job get queued but with a corrupted job script.
Post by David Beer
Note that while our pipelines that use pbs_submit() are failing, jobs
submitted with pbs_submit_hash() via qsub (or a simple test program)
continue to work fine. A restart of pbs_server fixes everything.
Post by David Beer
Because some of these pipelines are production, we would like to have
them as reliable as possible. It seems that pbs_submit_hash() is now more
reliable than pbs_submit(), so we would like to switch. However, there
isn't any documentation, and the API doesn't seem entirely stable
(pbs_submit_hash vs pbs_submit_hash_ext). We use pbs_python for our
pipelines, so I'm really looking forward to being able to use
pbs_submit_hash from pbs_python.
Post by David Beer
On Tue, Mar 31, 2015 at 8:28 AM, Bas van der Vlies <
I have some questions about the submit functions for the different
torque versions. I am building the python interface above the torque
library.
* pbs_submit
Now I have a report that pbs_submit is not reliable anymore and that we
4.X) pbs_submit_hash
5.X) pbs_submit_hash_ext
I cannot find any documentation for it or how to use it. It is in
pbs_ifl.h, so these are the API functions. It would be handy if there were
some kind of documentation.
---
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 |
1098 XG Amsterdam
www.surfsara.nl |
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
Ken Nielson Sr. Software Engineer
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
www.adaptivecomputing.com
--
David Beer | Senior Software Engineer
Adaptive Computing