[DRMAA-WG] TERMINATED vs. FAILED - reloaded
mamonski at man.poznan.pl
Tue May 5 16:24:37 CDT 2009
2009/5/5 Peter Tröger <peter at troeger.eu>:
> Dear all,
> The March 31st conference call decided upon the following strategy
> regarding job state model extension:
> --- snip
>> 4. TERMINATED vs. FAILED state discussion:
> Option 2 from the original mail is now highly preferred. TERMINATED
> state should express that an external entity (e.g. user or DRM system)
> stopped the job before finishing. For POSIX-aligned systems, this
> could be formulated as reception of a signal by "the job". In
> contrast, FAILED state now expresses that the application stopped on
> its own before finishing. For POSIX-aligned systems, this could be
> formulated as reception of a signal "by the job's application process".
> We ask for comments from PBS and LSF experts (FedStage ?!?). Do these
> systems provide enough error information to distinguish between these
> two states ? For SGE and Condor, Dan and Peter already agreed.
In LSF this seems to be feasible (by checking all events related to the job).
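
To make the Option 2 distinction quoted above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the JobEnd record and its fields are hypothetical and not part of DRMAA or of any DRM system's API, and killed_by_external stands for whatever the DRM system's event log can tell us about who stopped the job. Treating a non-zero exit status as "stopped on its own" is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class FinalState(Enum):
    DONE = "DONE"
    FAILED = "FAILED"
    TERMINATED = "TERMINATED"


@dataclass
class JobEnd:
    # Hypothetical record; the field names are assumptions for illustration.
    exit_status: int          # exit status of the job's application process
    killed_by_external: bool  # True if a user or the DRM system stopped the job


def classify(end: JobEnd) -> FinalState:
    """Map the end-of-job information to a final state as in Option 2."""
    if end.killed_by_external:
        # An external entity stopped the job before it finished.
        return FinalState.TERMINATED
    if end.exit_status != 0:
        # The application stopped on its own before finishing.
        return FinalState.FAILED
    return FinalState.DONE
```

The whole distinction rests on being able to fill in killed_by_external reliably, which is exactly the question put to the PBS and LSF experts.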
> --- snip
> Piotr from FedStage informed me that the proposed distinction does not
> seem to be implementable in PBS. One solution could be to detect the
> 'requested' termination only in the DRMAA library. Dan already expressed
> that this would not reflect the original idea. An intentional job
> termination by another user would then lead to FAILED instead of TERMINATED.
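
The limitation described above can be sketched as follows. This is a hypothetical illustration in Python, not code from any DRMAA implementation: the class LibraryOnlyTracker and its methods are made-up names, and the sketch assumes the library can only remember termination requests that went through its own terminate call.

```python
class LibraryOnlyTracker:
    """Hypothetical tracker that only sees terminations requested via this library."""

    def __init__(self):
        self._requested = set()

    def terminate(self, job_id):
        # A real library would also forward the request to the DRM system;
        # here we only record that this session asked for the termination.
        self._requested.add(job_id)

    def final_state(self, job_id, finished_normally):
        if finished_normally:
            return "DONE"
        # The library cannot see kills issued elsewhere, so they fall
        # through to FAILED even when they were intentional.
        return "TERMINATED" if job_id in self._requested else "FAILED"


tracker = LibraryOnlyTracker()
tracker.terminate("job-1")                           # stopped through the library
state_own = tracker.final_state("job-1", False)      # "TERMINATED"
state_foreign = tracker.final_state("job-2", False)  # killed by another user: "FAILED"
```

The second case is the one quoted above: a job stopped on purpose by someone outside the session is reported as FAILED.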
> Since we already rejected Options 1 and 3 in the last phone calls, we
> arrive at Option 4 as the last solution: There will be no new TERMINATED
> state. The new job sub-state concept will allow expressing the job
> failure details, but only in a DRM-specific way.
> We will finally vote on this in the next call.
> Best regards,