[drmaa-wg] Questions

Daniel Templeton Dan.Templeton at Sun.COM
Wed Mar 30 03:33:39 CST 2005

In working on a remote implementation of the Java binding, I have run
into a couple of interesting questions.  What happens when, during a call
to drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), the implementation fails to
perform the given action on more than one job, each for a different
reason?  For example, if I try to hold all jobs, but one job is already
in a hold state, three jobs work OK, and the DRM goes down before acting
on the last job, what is the return code?

When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the
contract on failure, i.e. in what state will the jobs be left?  If the
action fails for any one job, does that mean that all jobs will be left
in the state that they were in before the call?  If so, that's going to
cause serious implementation problems.  If not, that's going to cause
serious usability problems.
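
To make the scenario concrete, here is a toy model of the hold-all example
above.  It is only a sketch: hold_all, the state strings, and the outcome
labels are all made up for illustration and are not part of DRMAA.  It
assumes the no-rollback behavior, and shows that a single call can end
with two different failure reasons plus three jobs already changed.

```python
# Toy model only: none of these names come from the DRMAA API.
def hold_all(jobs, drm_fails_at=None):
    """Try to hold every job; return {job_id: outcome} without rolling back."""
    outcomes = {}
    for index, (job_id, state) in enumerate(list(jobs.items())):
        if drm_fails_at is not None and index >= drm_fails_at:
            outcomes[job_id] = "drm_unreachable"  # DRM went down; job untouched
        elif state == "held":
            outcomes[job_id] = "already_held"     # action not applicable
        else:
            jobs[job_id] = "held"                 # action succeeded
            outcomes[job_id] = "ok"
    return outcomes


# One job already held, three that work, and the DRM dies before the last.
jobs = {"1": "held", "2": "running", "3": "running", "4": "running", "5": "running"}
outcomes = hold_all(jobs, drm_fails_at=4)
distinct_failures = {o for o in outcomes.values() if o != "ok"}
print(outcomes)
print(sorted(distinct_failures))
```

With two distinct failure reasons and a partially changed session, a
single return code cannot describe the result, which is the crux of the
two questions above.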

What happens when a job ends after a thread has called
drmaa_synchronize(DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals"
the job exit info with a call to drmaa_wait()?  I would expect the
synchronize thread to simply treat the job as finished, even though its
job record is gone.  That is what the SGE implementation does.
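
The behavior described above can be sketched with a toy session.  This is
not DRMAA or SGE code: ToySession and its methods are hypothetical, and
the sketch is single-threaded so the interleaving is explicit.  The only
point it illustrates is the rule "a job whose record is gone counts as
finished" from the synchronizing caller's point of view.

```python
# Toy model only: ToySession is hypothetical, not a DRMAA or SGE class.
class ToySession:
    def __init__(self, job_ids):
        # None means "still running"; an int is the exit status.
        self.records = {job_id: None for job_id in job_ids}

    def finish(self, job_id, exit_status):
        self.records[job_id] = exit_status

    def wait(self, job_id):
        """Return the exit status of a finished job and reap its record."""
        if self.records[job_id] is None:
            raise RuntimeError("job still running")
        return self.records.pop(job_id)

    def pending(self, job_ids):
        """Jobs a synchronize call must still wait for.

        A job with no record was reaped by someone else, so it is
        treated as finished rather than as an error.
        """
        return [j for j in job_ids
                if j in self.records and self.records[j] is None]


session = ToySession(["1", "2"])
snapshot = list(session.records)   # the "all jobs" set seen by synchronize
session.finish("1", 0)
stolen = session.wait("1")         # another caller reaps job 1's exit info
still_pending = session.pending(snapshot)
session.finish("2", 1)
all_done = session.pending(snapshot) == []
print(stolen, still_pending, all_done)
```

Here the synchronizing caller never sees job 1's record, yet it still
completes once job 2 finishes.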


*        Daniel Templeton   ERGB01 x60220         *
*       Staff Engineer, Sun N1 Grid Engine        *
* "Roads? Where we're going we don't need roads." *
*                    -Dr. Emmett Brown            *
*                     Back to the Future (1985)   *
