[DRMAA-WG] Conference call -Feb 3rd - 17:00 UTC

Andre Merzky andre at merzky.net
Tue Feb 3 01:17:28 CST 2009

Quoting [Daniel Templeton] (Feb 02 2009):
> Since I won't make the meeting, here's my feedback.
> Peter Tröger wrote:
> To represent the temporarily undetermined state, we expand the 
> TryAgainLaterException to apply to drmaa_job_ps() as well.
> >3. Voting about separate "TERMINATED" vs. "FAILED" state
> >    - Semantics
> >  
> A job that exits via the terminated state has the potential to succeed 
> if resubmitted.  It entered the terminated state due to an action taken 
> by the job owner, an administrator, or the DRM system itself, possibly 
> on behalf of the terminated job.  A job that exits via the failed state 
> is unlikely to succeed if resubmitted.  It entered the failed state due 
> to an error in the job or a misconfiguration of the machine on which it ran.

You can't always know whether a resubmit has a chance of
success: a broken file system, or insufficient or bad
memory, may lead to internal fail states, and may well allow
the job to succeed next time.  An endless loop in the
application may always occur, and trigger the scheduler or
the system to eventually kill the job, without any chance
that a later instance does any better.

So, maybe it is better to distinguish based on the
information you _do_ have?

  - FAILED:     the job terminated for internal reasons (i.e.
                the application met an internal error condition)

  - TERMINATED: the job termination was triggered by an
                external entity (e.g. by the user, scheduler, system, ...)
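
To make the distinction above concrete, here is a minimal Python sketch.
It assumes the DRM can report who ended the job; the origin labels
("application", "user", ...) and the names FinalState and classify_exit
are hypothetical and come from no DRMAA or SAGA document.

```python
from enum import Enum


class FinalState(Enum):
    FAILED = "FAILED"          # the job ended for internal reasons
    TERMINATED = "TERMINATED"  # the end was triggered from outside the job


# Hypothetical labels for external entities that can end a job.
EXTERNAL_ORIGINS = {"user", "administrator", "scheduler", "system"}


def classify_exit(origin: str) -> FinalState:
    """Map the entity that ended the job to a final state."""
    if origin == "application":
        return FinalState.FAILED
    if origin in EXTERNAL_ORIGINS:
        return FinalState.TERMINATED
    raise ValueError(f"unknown termination origin: {origin!r}")
```

Note that the sketch says nothing about whether a resubmit would
succeed; it only uses the information the DRM actually has.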

> There is a problem with my clean could-succeed/won't-succeed division.  
> What if a job failed because the machine it ran on was wonky?  That is 
> clearly a failure, not a termination, but if the job were resubmitted 
> and landed on any other machine, it would succeed.  In that case, do we 
> actually care if there was a difference between failure and termination?
> >    - Resulting new job state transitions
> >  
> There's one more thing we may want to consider.  In SGE, a job can exit 
> one of four ways.  It can succeed.  It can fail, which includes 
> termination.  It can request to be rescheduled.  And it can be set into 
> error state.  The first two are handled fine by drmaa_wait().  The third 
> can be recognized by drmaa_job_ps(), but it's not ideal.  The fourth is 
> completely unknowable from DRMAA.  To the DRMAA client, it will look 
> like the job was requeued to be rescheduled, but is never actually 
> scheduled to run again.  We might want to consider supporting some 
> additional states, such as rescheduled or error, or maybe those states 
> are something that the state/substate model would enable.
> I vote for making the substate as generic as possible.  I think forcing 
> it to be an integer is unnecessarily limiting.  Taking some Java APIs as 
> examples, sometimes the substates are really just text messages that 
> explain what's going on.  I think that's valid and something we should 
> allow.

"If the only tool you have is a hammer, every problem starts
to look like a nail."  So, my apologies for pulling the same
string every time I post to this list *blush*

Anyway, you may want to have a look at the SAGA state model
again: substates are defined as strings, but SAGA
implementations are encouraged to define these strings, and to
adhere to a namespace.  So, an SGE implementation would
document the substates of RUNNING as

Well, SGE:ERROR should go into a final state, not into
RUNNING, right?  But you got the picture. (GFD-90 p.65, last

Cheers, Andre.

> >4. Further DRMAA2 discussion
> >  
> See the attached email from a few weeks ago.
> Daniel

> Date: Tue, 20 Jan 2009 08:46:24 -0800
> From: Daniel Templeton <Dan.Templeton at Sun.COM>
> Subject: DRMAA v2
> To: DRMAA Working Group <drmaa-wg at gridforum.org>
> A few proposals for the meeting today:
> PT12:
> < A language binding SHOULD specify numeric values for all DRMAA error 
> constants.
> ---
> > Such a language binding SHOULD specify numeric values for all DRMAA 
> error constants.
> PT13:
> I definitely agree that PartialTimestamp is a boondoggle.  I'm not sure 
> I agree with using ISO8601, though, mostly because it presupposes a 
> date/time *string*.  In a higher-level language, I want to be able to use 
> the native date/time object.  How about specifying that a language 
> should use a date/time object or primitive if it has one, and an ISO8601 
> string if it doesn't?
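
As an illustration of the PT13 proposal above, a minimal Python sketch:
the binding works with the native datetime object and falls back to an
ISO 8601 string only at a boundary that needs text.  The helper names
to_wire and from_wire are hypothetical and not part of any DRMAA binding.

```python
from datetime import datetime, timezone


def to_wire(value: datetime) -> str:
    """Render a native timestamp as an ISO 8601 string."""
    return value.isoformat()


def from_wire(text: str) -> datetime:
    """Parse a string produced by to_wire back into a datetime."""
    return datetime.fromisoformat(text)


# The native object is what client code would handle directly.
start = datetime(2009, 2, 3, 17, 0, tzinfo=timezone.utc)
encoded = to_wire(start)  # '2009-02-03T17:00:00+00:00'
decoded = from_wire(encoded)
```

The round trip preserves the value, so a binding without a native
date/time type could carry the same information as a string.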
> PT20:
> I think we can handle the resource request pretty easily, and I think we 
> need it.  We just need to add a resourceRequest attribute of type 
> Dictionary and treat any such resource request as a hard request.  
> Alternatively, we could have a hardResourceRequest and a 
> softResourceRequest.  The former is simpler, but the latter saves us from 
> talking about this again for DRMAAv3. :)
> Thinking about whether a resource request should be an optional 
> attribute created in me a doubt about the value of the 
> UnsupportedAttributeException.  Should it be possible to have the 
> implementation just ignore unsupported optional attributes?  It would 
> certainly be easier than repeatedly attempting to submit until all the 
> offending attributes are removed from the template.  Maybe it would help 
> to have the exception detail *all* unsupported attributes at once.  Just 
> thinking out loud here...
> Daniel
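
The idea of reporting *all* unsupported attributes at once could look
roughly like the Python sketch below.  It is only an illustration: the
SUPPORTED set, the check_template function, and the exception class are
hypothetical, and the attribute names merely follow the style used in
the discussion above.

```python
class UnsupportedAttributeError(Exception):
    """Raised once, naming every unsupported attribute in the template."""

    def __init__(self, names):
        self.names = sorted(names)
        super().__init__("unsupported attributes: " + ", ".join(self.names))


# Hypothetical set of attributes an imaginary implementation supports.
SUPPORTED = {"remoteCommand", "args", "resourceRequest"}


def check_template(template: dict) -> None:
    """Validate the whole template before submission, not one key at a time."""
    unsupported = set(template) - SUPPORTED
    if unsupported:
        raise UnsupportedAttributeError(unsupported)
```

A client could then catch the exception once, drop every attribute
listed in names, and resubmit, instead of retrying once per offending
attribute.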
Nothing is ever easy.
