[DRMAA-WG] OGF 28 Report

Peter Tröger peter at troeger.eu
Mon Mar 22 07:48:28 CDT 2010

Dear all,

after returning from a productive OGF 28 event in Munich, here is the long list of decisions we made. We (Mariusz, Daniel, Peter) had three sessions with continuous participation of the SAGA group. Huge thanks must go to Thilo and Andre, who resolved all open SAGA-related issues with us. We got great feedback from Yves Caniou about user requirements in parallel job execution.

I also had one hour of intense discussion with Thijs Metsch from the OCCI working group. OCCI defines a RESTful interface for controlling cloud IaaS resources - virtual machines, networks, and storage. They would like to add task control to the OCCI use case landscape, and DRMAA looks like a surprisingly good candidate. We bring the semantics, they bring the protocol. When DRMAAv2 is fixed, Thijs and I intend to work out a DRMAA language binding spec for OCCI. This would bring us the long-demanded Remote-DRMAA variant.


--- snip

(Everything you see here should also be implemented in the Wiki)

- categoryName --renamed--> jobCategory (people used the old term all the time)
- startReservation --renamed--> requestReservation
- Replaced global occurrences of the term "host" with "machine"
- New queue support
	- Added support for queue name specification in JobTemplate  
	- Only one name supported - only LSF and SGE support multiple queue names; precedence rules would be unclear
	- Three new monitoring attributes (drmQueueNames, maxWallclockTime, maxSlotsAllowed) on queue level
	- New monitoring attributes demand notion of infinity ->  NO_LIMIT constant
- Parallel job support
	- Two classes: "spawns itself" vs "is spawned"
		- First class: OpenMP, pthread, self-managed (shell script submitted)
		- Second class: PVM / MPI jobs, categorization based on GFD.115 
	- General design approach: User defines the parallel application binary in the cmdLine argument (in contrast to SGE thinking!)
		- jobCategory attribute decides upon all infrastructure-relevant settings for parallel execution (libraries, paths, launch programs)
		- led to a corresponding "drmJobCategoryNames" counterpart in MonitoringSession, in order to check DRMS capabilities
		- supported job categories are site-specific, DRMAA web site offers standardized names
		- Examples will follow soon on http://www.drmaa.org/jobcategories/
		- DRMAA implementation most likely creates a shell script based on job category, and submits this one
	- The application decides upon process spawning, but the scheduler still needs the information
		- new job template attributes minSlots / maxSlots
		- if minSlots > 1, you MUST define a jobCategory
		- no need to have final slot count as placeholder macro (comes out of the parallel programming API anyway)
- MonitoringSession::machineLoad
	- Removed coreNumber parameter, since the OS on the host migrates jobs between cores - no real sense in core load index
	- Added a comment that this information should not be used for user-side scheduling decisions; just a gadget to implement qmon on top of DRMAA
- New job template attribute "accountingId", as in SAGA, JSDL, and the majority of systems
	- not relevant in ReservationTemplate, since advance reservations do not count for job accounting
- File purging on execution host (demanded by SAGA) was rejected, no overall support in DRM systems
- New job template attributes for resource requirements - minPhysMemory, machineOS, machineArch, candidateMachines
	- candidateMachines semantic: use a subset or all of these hosts for execution; if not possible, reject job
- Advance Reservation interfaces
	- use case for ReservationTemplate::nativeOptions -> SGE demands queue name in advance reservation
	- state model for reservations rejected for DRMAA
- Introduced "AbsoluteTime" abstraction data type in IDL text
- Job state model
	- New StagedIn / StagedOut / Re-Scheduled state in job state model rejected, give hint in the spec to use sub states for this
	- Going from running to queued is a special case (only for PBS), no new state; if needed, emulate the intermediate step in the library
- JobInfo will not be merged into Job -> information should be consistent, always get one "performance snapshot"
	- JobInfo becomes a value type, in order to express this more clearly
- getting the list of valid contact strings was rejected (not implementable)
- Bulk index placeholder support (in the API) will not be extended
	- Instead, a new placeholder allows inserting the DRM system's bulk index env. variable name into the template (BULK_TASK_ID_VARNAME)
	- Idea is that applications can assign the variable name to their own environment variable, and perform an "eval" on it later
	- Peter fought against the alternative idea: standardizing which environment variables a DRMAA library must define implicitly
- JobInfo: masterMachine and slaveMachines attributes are merged to an ordered string list (allocatedMachines)
	- Implementation can assign some semantic to the ordering
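
To make the parallel job rules above more tangible, here is a minimal sketch in Python of how a library might check a job template before submission. It is not taken from the spec or from any existing DRMAA binding. The attribute names (cmdLine, jobCategory, minSlots, maxSlots, candidateMachines) come from the list above; the class layout, the default slot counts of 1, the maxSlots >= minSlots check, the treatment of an empty candidateMachines list as "no restriction", the helper names validate and pick_machines, and the category name "mpi" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class JobTemplate:
    """Illustrative subset of the job template attributes named in the report."""
    cmdLine: str
    jobCategory: Optional[str] = None
    minSlots: int = 1  # default of 1 is an assumption
    maxSlots: int = 1  # default of 1 is an assumption
    candidateMachines: List[str] = field(default_factory=list)


def validate(jt: JobTemplate, known_categories: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the template passed.

    known_categories stands in for what drmJobCategoryNames would report.
    """
    known = set(known_categories)
    problems = []
    if jt.minSlots < 1:
        problems.append("minSlots must be at least 1")
    if jt.maxSlots < jt.minSlots:
        # Assumption: the report does not state this rule explicitly.
        problems.append("maxSlots must not be smaller than minSlots")
    if jt.minSlots > 1 and jt.jobCategory is None:
        # Rule from the report: minSlots > 1 demands a jobCategory.
        problems.append("minSlots > 1 requires a jobCategory")
    if jt.jobCategory is not None and jt.jobCategory not in known:
        problems.append("jobCategory is not offered by this DRM system")
    return problems


def pick_machines(candidates: List[str], available: Iterable[str]) -> List[str]:
    """Use a subset or all of the candidate machines, otherwise reject the job."""
    pool = list(available)
    if not candidates:
        # Assumption: an empty list means the user set no restriction.
        return pool
    usable = [m for m in candidates if m in pool]
    if not usable:
        raise ValueError("job rejected: no candidate machine is available")
    return usable


if __name__ == "__main__":
    jt = JobTemplate(cmdLine="./my_parallel_app", minSlots=4, maxSlots=8)
    print(validate(jt, ["mpi"]))   # one problem: jobCategory is missing
    jt.jobCategory = "mpi"         # hypothetical category name
    print(validate(jt, ["mpi"]))   # no problems left
```

The point of the sketch is only the ordering of concerns: the application binary stays in cmdLine, the slot range goes to the scheduler, and the job category carries everything infrastructure-specific.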
