[DRMAA-WG] OGF 28 Report

Peter Tröger peter at troeger.eu
Tue Mar 23 12:29:04 CDT 2010

> Thanks for the recap Peter.  I'm sure we'll discuss this list on a call,
> but I can't help giving a little preliminary feedback.

Everybody, please do NOT wait for the call. We need to clarify these
points on the list beforehand, so that we can focus on the truly nasty
issues during the call.


> Daniel
> On 03/22/10 05:48, Peter Tröger wrote:
>> Dear all,
>> after being back from a productive OGF 28 event in Munich, here is
>> the long list of decisions we made. We (Mariusz, Daniel, Peter) had
>> three sessions with continuous participation from the SAGA group.
>> Huge thanks must go to Thilo and Andre, who resolved all open
>> SAGA-related issues with us. We got great feedback from Yves Caniou
>> about user requirements in parallel job execution.
>> I also had one hour of intense discussion with Thijs Metsch from
>> the OCCI working group. OCCI defines a RESTful interface for
>> controlling cloud IaaS resources - virtual machines, networks, and
>> storage. They would like to add task control to the OCCI use case
>> landscape, and DRMAA looks like a surprisingly good candidate. We
>> bring the semantics, they bring the protocol. Once DRMAAv2 is
>> finalized, Thijs and I intend to work out a DRMAA language binding
>> spec for OCCI. This would bring us the long-demanded Remote-DRMAA
>> variant.
>> Best,
>> Peter.
>> --- snip
>> (Everything you see here should also be implemented in the Wiki)
>> - categoryName --renamed-->  jobCategory (people used the old term  
>> all the time)
>> - startReservation --renamed-->  requestReservation
>> - Globally replaced the term "host" with "machine"
>> - New queue support
>> 	- Added support for queue name specification in JobTemplate
>> 	- Only one name supported - only LSF and SGE support multiple
>> queue names; precedence rules would be unclear
>> 	- Three new monitoring attributes (drmQueueNames,
>> maxWallclockTime, maxSlotsAllowed) on queue level
>> 	- New monitoring attributes demand a notion of infinity ->
>> NO_LIMIT constant
>> - Parallel job support
>> 	- Two classes: "spawns itself" vs "is spawned"
>> 		- First class: OpenMP, pthread, self-managed (shell script  
>> submitted)
>> 		- Second class: PVM / MPI jobs, categorization based on GFD.115
>> 	- General design approach: User defines the parallel application
>> binary in the cmdLine argument (in contrast to SGE thinking!)
>> 		- jobCategory attribute decides upon all infrastructure-relevant
>> settings for parallel execution (libraries, paths, launch programs)
>> 		- led to a corresponding "drmJobCategoryNames" counterpart in
>> MonitoringSession, in order to check DRMS capabilities
>> 		- supported job categories are site-specific, DRMAA web site  
>> offers standardized names
>> 		- Examples will follow soon on http://www.drmaa.org/jobcategories/
>> 		- DRMAA implementation most likely creates a shell script based
>> on the job category, and submits it
>> 	- The application decides upon process spawning, but the scheduler  
>> still needs the information
>> 		- new job template attributes minSlots / maxSlots
>> 		- if minSlots > 1, you MUST define a jobCategory
>> 		- no need to have final slot count as placeholder macro (comes  
>> out of the parallel programming API anyway)
>> - MonitoringSession::machineLoad
>> 	- Removed coreNumber parameter, since the OS on the host migrates
>> jobs between cores - a per-core load index makes no real sense
>> 	- Added a comment that this information should not be used for
>> user-side scheduling decisions; it is just a gadget for implementing
>> qmon on top of DRMAA
>> - New job template attribute "accountingId", as in SAGA, JSDL, and  
>> the majority of systems
>> 	- not relevant in ReservationTemplate, since advance reservations  
>> do not count for job accounting
>> - File purging on execution host (demanded by SAGA) was rejected,
>> since DRM systems do not support it across the board
>> - New job template attributes for resource requirements -  
>> minPhysMemory, machineOS, machineArch, candidateMachines
>> 	- candidateMachines semantics: use a subset or all of these hosts
>> for execution; if not possible, reject the job
>> - Advance Reservation interfaces
>> 	- use case for ReservationTemplate::nativeOptions ->  SGE demands  
>> queue name in advance reservation
>> 	- state model for reservations rejected for DRMAA
>> - Introduced "AbsoluteTime" abstraction data type in IDL text
>> - Job state model
>> 	- New StagedIn / StagedOut / Re-Scheduled states in the job state
>> model rejected; the spec should give a hint to use sub-states for this
>> 	- Going from running to queued is a special case (only for PBS),
>> no new state; if needed, emulate the intermediate step in the library
>> - JobInfo will not be merged into Job ->  information should be  
>> consistent, always get one "performance snapshot"
>> 	- JobInfo becomes a value type, in order to express this more  
>> clearly
>> - getting the list of valid contact strings was rejected (not  
>> implementable)
>> - Bulk index placeholder support (in the API) will not be extended
>> 	- Instead, a new placeholder (BULK_TASK_ID_VARNAME) inserts the name
>> of the DRM system's bulk index environment variable into the template
>> 	- Idea is that applications can assign the variable name to their
>> own environment variable, and perform an "eval" on it later
>> 	- Peter fought against the alternative idea: standardizing which
>> environment variables a DRMAA library must define implicitly
>> - JobInfo: masterMachine and slaveMachines attributes are merged into
>> an ordered string list (allocatedMachines)
>> 	- Implementations can assign some semantics to the ordering
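
To make the bulk index placeholder item easier to discuss, here is a
minimal Python sketch of how a library could expand it. Everything
beyond the name BULK_TASK_ID_VARNAME is an assumption for illustration:
the "$...$" spelling of the placeholder, the expand_bulk_placeholder
helper, and the per-DRM variable table (the variable names should be
checked against each DRM system's documentation).

```python
# Hypothetical table: DRM system -> name of the environment variable that
# holds the bulk (array) task index. Verify each name for your DRM system.
BULK_INDEX_VARS = {
    "sge": "SGE_TASK_ID",
    "lsf": "LSB_JOBINDEX",
    "torque": "PBS_ARRAYID",
}

# Assumed placeholder spelling; the notes only give the name.
PLACEHOLDER = "$BULK_TASK_ID_VARNAME$"


def expand_bulk_placeholder(value: str, drm: str) -> str:
    """Replace the placeholder with the DRM-specific variable name."""
    return value.replace(PLACEHOLDER, BULK_INDEX_VARS[drm])


# The application maps the DRM-specific name to its own variable ...
template_env = {"MY_INDEX_VAR": PLACEHOLDER}
job_env = {
    key: expand_bulk_placeholder(val, "sge")
    for key, val in template_env.items()
}

# ... and the job script dereferences it later with "eval".
job_script = r'eval "echo task \$$MY_INDEX_VAR"'
```

Under these assumptions, a job on SGE would see
MY_INDEX_VAR=SGE_TASK_ID, and the eval line should print the value of
SGE_TASK_ID, so the application code never hard-codes a DRM-specific
variable name.
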
>> --
>>   drmaa-wg mailing list
>>   drmaa-wg at ogf.org
>>   http://www.ogf.org/mailman/listinfo/drmaa-wg
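
The parallel job and candidateMachines rules from the list can be
sketched as a small template check. This is only an illustration in
Python, not IDL or binding text: the JobTemplate class, the
check_template helper, and the category name "mpi" are hypothetical, and
the candidateMachines check is deliberately simplified to "at least one
candidate is known".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobTemplate:
    """Hypothetical subset of the job template attributes in the notes."""
    cmdLine: str
    jobCategory: Optional[str] = None
    minSlots: int = 1
    maxSlots: int = 1
    candidateMachines: List[str] = field(default_factory=list)


def check_template(jt: JobTemplate,
                   drm_job_category_names: List[str],
                   known_machines: List[str]) -> List[str]:
    """Return a list of rule violations; an empty list means acceptable."""
    errors = []
    # Rule from the notes: parallel jobs must name a job category.
    if jt.minSlots > 1 and not jt.jobCategory:
        errors.append("minSlots > 1 requires a jobCategory")
    # Categories are site-specific, so check them against what the
    # monitoring side reports (drmJobCategoryNames in the notes).
    if jt.jobCategory and jt.jobCategory not in drm_job_category_names:
        errors.append("jobCategory not supported at this site")
    # Simplified candidateMachines semantics: reject the job when none
    # of the requested machines can be used.
    if jt.candidateMachines and not any(
            m in known_machines for m in jt.candidateMachines):
        errors.append("no candidate machine available, reject job")
    return errors


ok = JobTemplate(cmdLine="./solver", jobCategory="mpi",
                 minSlots=4, maxSlots=8)
bad = JobTemplate(cmdLine="./solver", minSlots=4, maxSlots=8)
```

Here check_template(ok, ["mpi"], ["node1"]) yields no violations, while
the same call with bad reports the missing jobCategory.
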
