[drmaa-wg] Questions

Daniel Templeton Dan.Templeton at Sun.COM
Wed Mar 30 14:28:50 CST 2005

Rajic, Hrabri wrote:

>>-----Original Message-----
>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>Daniel Templeton
>>Sent: Wednesday, March 30, 2005 3:34 AM
>>To: DRMAA Working Group
>>Subject: [drmaa-wg] Questions
>>In working on a remote implementation of the Java binding, I have run
>>into a couple of interesting questions.  What happens when, during a
>>call to drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), the implementation
>>fails to perform the given action on more than one job, for different
>>reasons?  For example, if I try to hold all jobs, but one job is
>>already in a hold state, three jobs work ok, and the DRM goes down
>>before acting on the last job, what is the return code?
> The routine return code would need to indicate a compound error; BTW we
> do not have such error code defined, and the detailed error message
> would need to detail what happened.

In other words, the spec completely fails to address this case. 
Something to keep in mind for 1.1 or 2.0.
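
To make the hold-all example concrete, here is a minimal sketch of what
per-job reporting could look like.  It is a plain Python simulation, not
the DRMAA API: hold_all, BulkResult, AlreadyHeld, DrmDown and the
"COMPOUND_ERROR" code are all hypothetical names, since the spec defines
no such error code.

```python
from dataclasses import dataclass, field


# Hypothetical exception types; none of these names come from the DRMAA spec.
class AlreadyHeld(Exception):
    """The job was already in a hold state."""


class DrmDown(Exception):
    """The DRM became unreachable before the job could be acted on."""


@dataclass
class BulkResult:
    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # job id -> reason name

    @property
    def code(self):
        # "COMPOUND_ERROR" is an assumption; the spec has no such code.
        return "SUCCESS" if not self.failed else "COMPOUND_ERROR"


def hold_all(job_ids, hold_one):
    """Apply hold_one to every job and record each outcome separately."""
    result = BulkResult()
    for job_id in job_ids:
        try:
            hold_one(job_id)
            result.succeeded.append(job_id)
        except (AlreadyHeld, DrmDown) as exc:
            result.failed[job_id] = type(exc).__name__
    return result


# The scenario from the question: job 1 is already held, jobs 2-4 work,
# and the DRM goes down before job 5 is handled.
def fake_hold(job_id):
    if job_id == "1":
        raise AlreadyHeld(job_id)
    if job_id == "5":
        raise DrmDown(job_id)


result = hold_all(["1", "2", "3", "4", "5"], fake_hold)
```

A single return code can only say "something went wrong"; the per-job map
is what the detailed error message would have to carry.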

>>When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the
>>contract on failure, i.e. in what state will the jobs be left?  In the
>>case of a job failure, does that mean that all jobs will be left in the
>>state that they were in before the call?  If so, that's going to cause
>>serious implementation problems.  If not, that's going to cause serious
>>usability problems.
> Transactional interface would be quite useful here ...
> If a routine exits/fails during the call there is no good recourse.

Exactly the point I'm making.  Without transactions, it's hard to use. 
With transactions, it's hard to implement.
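
A rough sketch of the "hard to implement" side, assuming an all-or-nothing
contract built from compensating actions.  Everything here is hypothetical
(hold_all_or_nothing, ControlError, the hold and release callbacks); it is
a simulation, not how any DRMAA implementation works.

```python
class ControlError(Exception):
    """Hypothetical error raised when a control action on a job fails."""


def hold_all_or_nothing(jobs, hold, release):
    """Hold every job, or undo the holds already made on the first failure.

    Returns (ok, unrestored), where unrestored lists the jobs whose
    rollback also failed and which are therefore left in the held state.
    """
    done = []
    for job in jobs:
        try:
            hold(job)
            done.append(job)
        except ControlError:
            unrestored = []
            for prev in reversed(done):
                try:
                    release(prev)
                except ControlError:
                    unrestored.append(prev)
            return False, unrestored
    return True, []


# Simulated DRM state: holding job "c" always fails.
state = {"a": "running", "b": "running", "c": "running"}


def hold(job):
    if job == "c":
        raise ControlError("cannot hold " + job)
    state[job] = "held"


def release(job):
    state[job] = "running"


ok, unrestored = hold_all_or_nothing(["a", "b", "c"], hold, release)
```

The rollback only helps while the DRM is still reachable.  If release
fails too, the jobs end up in a mixed state anyway, which is exactly the
case that makes a transactional contract hard to honor.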

> Job failure?  Is this a separate question?  
> One analogy would be teaching a university course.  There would be
> students dropping the course, but the rest goes ahead.  In case of
> absences things also go ahead, and when the students reappear the regime
> is known.

That's a typo.  I meant "operation failure", not "job failure".

>>What happens when a job ends after a thread has called
>>drmaa_synchronize(DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals"
>>the job exit info with a call to drmaa_wait()?  I would assume that the
>>synchronize thread should just assume that the job finished, even though
>>its job record is gone.  That is what the SGE implementation does.
> Ha, races with job reaping info.  The developers would need to be
> careful in multithreaded environments ... some guidelines would be
> necessary, but preferably outside of the normative docs.

The reason I bring it up is that this particular case is non-obvious. 
It's clear that waiting for the same job twice is bad, but the problem is
less clear when one thread waits for any job (DRMAA_JOB_IDS_SESSION_ANY) or
for all jobs (DRMAA_JOB_IDS_SESSION_ALL).
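
The behavior I described can be sketched as follows.  This is a
single-threaded Python simulation with hypothetical names (Session,
InvalidJob, synchronize_all); it replays the interleaving by hand rather
than using real threads, and it assumes that a missing job record means
another caller already reaped the job.

```python
class InvalidJob(Exception):
    """No record exists for this job id (unknown, or already reaped)."""


class Session:
    def __init__(self, job_ids):
        # None means the job is still running; otherwise the exit status.
        self._exit_info = {job_id: None for job_id in job_ids}

    def finish(self, job_id, status):
        self._exit_info[job_id] = status

    def wait(self, job_id):
        """Return the exit status and reap the job record."""
        if job_id not in self._exit_info:
            raise InvalidJob(job_id)
        status = self._exit_info[job_id]
        if status is None:
            raise RuntimeError("job still running; a real wait would block")
        del self._exit_info[job_id]
        return status

    def synchronize_all(self, snapshot):
        """Wait for every job in the snapshot taken when the call began."""
        finished = []
        for job_id in snapshot:
            try:
                self.wait(job_id)
            except InvalidJob:
                pass  # record is gone: assume another thread reaped it
            finished.append(job_id)
        return finished


session = Session(["1", "2"])
snapshot = ["1", "2"]            # the synchronize thread's view of "all"
session.finish("1", 0)
session.finish("2", 3)
stolen = session.wait("1")       # another thread "steals" job 1's exit info
done = session.synchronize_all(snapshot)
```

The synchronize call still reports both jobs as finished, but only the
thread that called wait ever sees job 1's exit status.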


*        Daniel Templeton   ERGB01 x60220         *
*       Staff Engineer, Sun N1 Grid Engine        *
* "Roads? Where we're going we don't need roads." *
*                    -Dr. Emmett Brown            *
*                     Back to the Future (1985)   *
