[DRMAA-WG] Report from D-Grid conference
mamonski at man.poznan.pl
Fri Mar 26 05:34:26 CDT 2010
It's very nice feedback. I took this mail as an opportunity to share
2010/3/26 Peter Tröger <peter at troeger.eu>:
> Dear all,
> this week, I had a DRMAAv2 presentation at the conference of the German grid
> initiative (D-Grid). Even though it was the last session on the last day,
> attendance was pretty good. I got some interesting remarks I wanted to
> - Typical D-Grid installations have PBS or SGE, sometimes Torque. No Condor.
> LSF is on the agenda.
> - With the ability to check for core dump file existence in JobInfo, they
> wondered if DRMAA could also offer to actually get this file.
Yes, this is also a use case of one of our DRMAA for LSF users. For now
this is realized by setting the core file limit in nativeSpecification. This
attribute is explicitly supported by both SGE and LSF. I believe
for Torque/PBS Pro it could be quite easily implemented on top of the
DRMS. So why not add it as a JobTemplate attribute?
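
A minimal sketch of the nativeSpecification workaround mentioned above, assuming the usual submit options (`-l h_core=...` for SGE, `-C` for LSF bsub). The helper name `core_limit_native_spec` is hypothetical, and the resulting string would be assigned to the job template's nativeSpecification attribute:

```python
def core_limit_native_spec(drms: str, limit_kb: int) -> str:
    """Return a native option string that caps the core file size.

    Hypothetical helper for illustration. The option names are assumptions
    taken from the qsub (SGE) and bsub (LSF) command lines.
    """
    if limit_kb < 0:
        raise ValueError("limit_kb must be non-negative")
    name = drms.strip().lower()
    if name == "sge":
        # SGE exposes the hard core file size limit as the h_core resource.
        return f"-l h_core={limit_kb}K"
    if name == "lsf":
        # bsub -C takes the per-process core file size limit (KB by default).
        return f"-C {limit_kb}"
    # Torque/PBS Pro would need extra work on top of the DRMS.
    raise NotImplementedError(f"no core limit mapping for {drms!r}")


if __name__ == "__main__":
    print(core_limit_native_spec("SGE", 0))
    print(core_limit_native_spec("LSF", 1024))
```

A dedicated JobTemplate attribute would move this per-DRMS mapping out of the application and into the DRMAA library.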
> - One user community in D-Grid typically has "pre-jobs" that prepare a node
> for the real work with some software installation. DRMAAv2 with its
> waitAnyTerminated() looked good enough for them.
> - One request from the audience was automated re-queueing - if a job goes to
> Failed state, it should be re-queued automatically. This is a typical
> massive scale cluster or grid problem, where machine outages are normal.
> Condor (of course) has that, I am not sure about the others.
For me this is only a DRMS configuration issue, not a DRMAA one. However,
as I remember, in many systems a job must be marked as rerunnable in
order for the DRM to do this (rerunning a job may cause the partial
results from the failed run to be overwritten). I will try to do some
research on this topic.
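
As a rough illustration of marking a job as rerunnable at submission time, the sketch below maps a DRMS name to the corresponding submit option. The helper name `rerunnable_native_spec` is hypothetical, and the flags (`-r y`/`-r n` for SGE and Torque/PBS qsub, `-r`/`-rn` for LSF bsub) are assumptions that should be checked against the installed version:

```python
def rerunnable_native_spec(drms: str, rerunnable: bool) -> str:
    """Return a native option string that marks a job (not) rerunnable.

    Hypothetical helper for illustration. Whether the DRMS actually
    re-queues a failed job still depends on its own configuration.
    """
    name = drms.strip().lower()
    if name in ("sge", "torque", "pbs"):
        # qsub in SGE and Torque/PBS takes -r with a y/n argument.
        return "-r y" if rerunnable else "-r n"
    if name == "lsf":
        # bsub uses -r for rerunnable and -rn for not rerunnable.
        return "-r" if rerunnable else "-rn"
    raise ValueError(f"unknown DRMS: {drms!r}")


if __name__ == "__main__":
    for system in ("sge", "torque", "lsf"):
        print(system, rerunnable_native_spec(system, True))
```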
> - Another commonly agreed request was intermediate result preview. The
> problem is that some simulations run for hours, and you want to know pretty
> early if it is worthwhile to complete the run. LSF has a feature where you
> can look at a job's stdout while it runs, even with non-interactive jobs. I
> don't know about other systems.
We observed the same. This is also vital for SaaS use cases, as it
allows emulating remote execution of an application as a local one. In LSF
there is a special command/API function called bpeek (as
stdout/stderr are redirected to temporary files until the job ends).
In SGE, stdout/stderr are simply redirected into the stdout/stderr
file names given upon submission, so the user can simply read them
(tested!). Torque can be configured to do the same (by default, during
execution stdout/stderr are redirected to files in the worker node
spool directory, which is not accessible from the frontend).
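
For the SGE-style case, where the output file given at submission is readable while the job runs, an intermediate preview could be sketched as an incremental read of that file. This is only an illustration: the function name `read_new_output` is hypothetical, and it assumes the file lives on a file system visible from the submitting host:

```python
def read_new_output(path: str, offset: int = 0) -> tuple[str, int]:
    """Return the text appended to `path` since `offset`, plus the new offset.

    Hypothetical helper for illustration. A missing file is treated as
    "no output yet", since the job may not have started writing.
    """
    try:
        with open(path, "rb") as handle:
            handle.seek(offset)
            data = handle.read()
    except FileNotFoundError:
        return "", offset
    # Undecodable bytes are replaced rather than raising, because the
    # job may still be in the middle of writing a line.
    return data.decode("utf-8", errors="replace"), offset + len(data)


if __name__ == "__main__":
    import sys

    # Usage: python preview.py <stdout file of the running job>
    text, _ = read_new_output(sys.argv[1])
    sys.stdout.write(text)
```

Calling the function repeatedly with the returned offset yields only the output produced since the previous call, which is enough to decide early whether a long simulation is worth completing.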
> - One SLA expert in the auditorium was happy about the startTime / endTime /
> duration approach in the AR template. He called that "relaxed reservation".
> - Another guest recommended GLUE2 as input for our monitoring attributes.
> It's like JSDL and DCIM - everything optional, but maybe good for semantics.
A long time ago I was thinking of providing the monitoring info in our
service using the GLUE schema. I found it way too complex, but
maybe I did not put enough effort into understanding it...
> - It was requested that we check the monitoring attributes against Globus
> MDS and Unicore TSI.
> I was also asked about the time frame for DRMAAv2 implementations - really.
> Not only the D-Grid audience seems to be highly interested in using DRMAAv2,
> I got the same kind of feedback also at OGF28. I hope this is enough
> motivation for everybody in the upcoming finalization phase ...
> Slides are attached, feel free to re-use them.