Open Grid Forum

Open Forum | Open Standards

Data Format Description Language (DFDL)

This is a tutorial page for the DFDL standard of the Open Grid Forum. The base document for this standard is GFD.240, "Data Format Description Language (DFDL) v1.0 Specification", which obsoletes and replaces the earlier documents GFD.207 and GFD.174.

The DFDL group also hosts internal working group pages on the OGF GitHub and helps to maintain a public DFDL schemas web community repository, further described at this link, where anyone interested can download and use the schemas subject to the accompanying license. Anybody can apply to the DFDL WG to create a repository for a new format (or simply inform the WG that such a repository exists so that it can be publicized), or to contribute to an existing repository.

DFDL Overview

Data Format Description Language (DFDL) is a language for describing text and binary data formats. A DFDL description allows any text or binary data to be read from its native format and to be presented as an instance of an information set. DFDL also allows data to be taken from an instance of an information set and written out to its native format. DFDL achieves this by leveraging W3C XML Schema Definition Language (XSDL) 1.0. It is therefore very easy to use DFDL to convert text and binary data to a corresponding XML document.

An XML schema is written for the logical model of the data. The schema is augmented with special DFDL annotations. These annotations are used to describe the native (non-XML) format of the data. This is an established approach that is already being used today in commercial systems. DFDL evolves this approach into an open standard capable of describing almost any format of text or binary data.

A good place to start learning about DFDL is the Getting Started With DFDL page on the IBM Community web pages. There is also a set of recently updated tutorials available on the XFront tutorial web site.

Background

Data interchange is critically important for most computing. Cloud computing and all forms of distributed computing require distributed software and hardware resources to work together. Inevitably, these resources read and write data in a variety of formats. General tools for data interchange are essential to solving such problems. Scalable and High Performance Computing (HPC) applications require high-performance data handling, so data interchange standards must enable efficient representation of data. Data Format Description Language (DFDL) enables powerful data interchange and very high-performance data handling. The DFDL Working Group envisages three dominant kinds of data in the future, as follows:

  1. Textual data defined by a format-specific schema such as XML or JSON.
  2. Binary data in standard formats.
  3. Data with DFDL descriptors.

Textual XML data is the most successful data interchange standard to date. All such data are by definition new, by which we mean created in the XML era. Because of the large overhead that XML tagging imposes, there is often a need to compress and decompress XML data. However, compression and decompression carry a high cost that is unacceptable to some applications. Standardized binary data are also relatively new, and are suitable for larger data because of their reduced encoding costs and more compact size. Examples of standard binary formats are data described by modern versions of ASN.1, or the use of XDR. These techniques lack the fully self-describing nature of XML data. Scientific formats, such as NetCDF and HDF, are used by some communities to provide self-describing binary data. In the future, there may be standardized binary-encoded XML data, as a W3C working group has been formed on this subject.

It is an important observation that both XML and standardized binary formats are prescriptive: they specify, or prescribe, a representation of the data. To use them, applications must be written to conform to their encodings and mechanisms of expression.

DFDL suggests an entirely different scheme. The approach is descriptive: one chooses an appropriate data representation for an application based on its needs, and then describes that format using DFDL so that multiple programs can directly interchange the described data. DFDL descriptions can be provided by the creator of the format, or developed as needed by third parties intending to use the format. That is, DFDL is not a format for data; it is a way of describing any data format. DFDL is intended for data commonly found in scientific and numeric computations, as well as record-oriented representations found in commercial data processing.

DFDL can be used to describe legacy data files, to simplify transfer of data across domains without requiring global standard formats, or to allow third-party tools to easily access multiple formats. DFDL can also be a powerful tool for supporting backward compatibility as formats evolve.

DFDL is designed to provide flexibility and also permit implementations that achieve very high levels of performance. DFDL descriptions are separable and native applications do not need to use DFDL libraries to parse their data formats. DFDL parsers can also be highly efficient. The DFDL language is designed to permit implementations that use lazy evaluation of formats and to support seekable, random access to data. The following goals can be achieved by DFDL implementations:

  • Density. Fewest bytes to represent information content (without resorting to compression). Fastest possible I/O.
  • Optimized I/O. Applications can write data aligned to byte, word, or even page boundaries and use memory-mapped I/O to ensure access to data content with the smallest number of machine cycles for common use cases, without sacrificing general access.
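The access pattern behind the Optimized I/O goal can be sketched in Python. This is only an illustration of aligned, memory-mapped record access in an application, not a DFDL processor or a DFDL feature. The record layout (two 32-bit integers and one double), the file name, and the function names are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-length record: two big-endian 32-bit integers and one
# IEEE 754 double. That is 16 bytes in total, so every record starts on an
# 8-byte boundary.
RECORD = struct.Struct(">iid")


def write_records(path, records):
    """Write the records back to back in their binary representation."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_record(path, index):
    """Fetch one record directly by offset, without parsing earlier records."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return RECORD.unpack_from(view, index * RECORD.size)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")  # hypothetical file name
    write_records(path, [(1, 10, 0.5), (2, 20, 1.5), (3, 30, 2.5)])
    print(read_record(path, 2))
```

Because the records have a fixed length, the offset of any record is a simple multiplication, and the memory map lets the operating system page in only the part of the file that is actually touched.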

DFDL can describe the same types of abstract data that other binary or textual data formats can describe and, furthermore, it can describe almost any possible representation scheme for those data. For example, DFDL can describe multiple representations of the same logical data, each optimized for a specific use. It is in the spirit of DFDL to support canonical data descriptions that correspond closely to the original in-memory representation of the data, and also to provide sufficient information to write as well as to read the given format.
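To make the idea of one logical model with several representations concrete, here is a minimal Python sketch. It hand-codes two representations of the same record, a comma-delimited text form and a big-endian binary form, and shows that both yield the same logical values. The field names and both layouts are hypothetical; in DFDL the two formats would instead be described by two annotated schemas sharing one logical structure.

```python
import struct

# Hypothetical logical record with two integer fields (names are illustrative).
FIELD_NAMES = ("id", "count")


def parse_text(data: bytes) -> dict:
    """Parse a comma-delimited ASCII representation such as b"7,-42"."""
    values = data.decode("ascii").split(",")
    return {name: int(value) for name, value in zip(FIELD_NAMES, values)}


def unparse_text(record: dict) -> bytes:
    """Write the record as comma-delimited ASCII text."""
    return ",".join(str(record[name]) for name in FIELD_NAMES).encode("ascii")


def parse_binary(data: bytes) -> dict:
    """Parse two big-endian 32-bit two's-complement integers."""
    return dict(zip(FIELD_NAMES, struct.unpack(">ii", data)))


def unparse_binary(record: dict) -> bytes:
    """Write the record as two big-endian 32-bit integers."""
    return struct.pack(">ii", *(record[name] for name in FIELD_NAMES))


record = {"id": 7, "count": -42}
# Both representations carry the same logical content.
print(parse_text(unparse_text(record)) == parse_binary(unparse_binary(record)))
```

The application logic only ever sees the logical record. Which physical representation is used becomes a matter of choosing a description rather than rewriting the application.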

DFDL 1.0

DFDL 1.0 is the initial release of DFDL. The specification may be found on the OGF documents page (/gf/docs) or by direct link (/documents/GFD.174.pdf, PDF). Release 1.0 includes the following language features:

  • Subset of XML Schema 1.0
  • Rich text content including bi-di support
  • Rich binary content including bit support
  • Text and binary delimiters
  • Ordered and unordered sequences
  • Scoping rules to allow modular construction and re-use
  • Validation
  • Defaults for missing values
  • Nil capability for out-of-band values
  • Expression language including variables to model dynamic data
  • Stratagems to resolve choices, optionality and other points of uncertainty
  • One-dimensional arrays
  • Hidden elements and calculated values
  • Very general parsing and serializing capability

In future releases, it is intended to enhance DFDL with further features:

  • Direct access by offset
  • Multi-dimensional arrays
  • Multi-layered models
  • Custom language extensions

Example

Consider the following XML data:

The logical model for this data can be described by the following fragment of an XML schema document that simply provides a description of the name and type of each element:

Now, suppose we have the same data but represented in a non-XML format. A binary representation of the data could be visualized like this (shown as hexadecimal):

To describe this using DFDL, we take our original XML schema document that described the data model and we annotate the element declarations as follows:

These simple DFDL annotations express that the data are represented in a binary format where integers are two's complement and floats are IEEE, and that the byte order will be big endian.
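As a rough illustration of what such annotations mean in practice, the following Python sketch hand-codes the same decoding rules: big-endian byte order, two's-complement integers, and IEEE floating point. It uses Python's struct module rather than a DFDL processor. The element names (w, x, y, z), their types, the sample values, and the root element name are assumptions chosen for illustration.

```python
import struct
import xml.etree.ElementTree as ET

# Assumed logical model (hypothetical names and types): two 32-bit integers,
# one 64-bit double, and one 32-bit float.
# struct codes: ">" = big endian, "i" = 32-bit two's-complement integer,
# "d" = IEEE 754 binary64, "f" = IEEE 754 binary32.
FIELDS = [("w", "i"), ("x", "i"), ("y", "d"), ("z", "f")]
LAYOUT = struct.Struct(">" + "".join(code for _, code in FIELDS))


def unparse(values: dict) -> bytes:
    """Serialize the logical values to the big-endian binary representation."""
    return LAYOUT.pack(*(values[name] for name, _ in FIELDS))


def parse(data: bytes) -> dict:
    """Read the binary representation back into name/value pairs."""
    names = (name for name, _ in FIELDS)
    return dict(zip(names, LAYOUT.unpack(data)))


def to_xml(values: dict) -> str:
    """Present the parsed values as a small XML document."""
    root = ET.Element("record")  # hypothetical root element name
    for name, _ in FIELDS:
        ET.SubElement(root, name).text = str(values[name])
    return ET.tostring(root, encoding="unicode")


sample = {"w": 5, "x": 7839372, "y": 8.6e-200, "z": -7.1e8}
binary = unparse(sample)
print(binary.hex())           # hexadecimal view of the binary record
print(to_xml(parse(binary)))  # the same data presented as XML
```

With this assumed layout the binary record occupies 20 bytes, while the XML presentation of the same values is several times larger. A DFDL processor derives the equivalent of FIELDS and LAYOUT from the annotated schema instead of having them hard-coded.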

DFDL Implementations

Implementations of DFDL processors exist that can parse and serialize data using DFDL schemas.

  • Several IBM products including IBM App Connect Enterprise and IBM z/Transaction Processing Facility now include a DFDL 1.0 streaming parser and modeler. A developer edition of IBM App Connect Enterprise is available to download without charge.
  • Apache Daffodil is an open source DFDL processor under active development with several official releases that implement streaming parsing and serializing. New contributors to its code base are always welcome.
  • The European Space Agency has created DFDL4S, a DFDL implementation targeted at satellite communications formats.

A public repository for DFDL schemas that describe commercial and scientific data formats has been established on GitHub.

For flexibility and ease of implementation, the DFDL language is divided into core features and optional features. A DFDL processor can choose to be a minimal conforming processor (all core features), an extended conforming processor (all core and some optional features), or a fully conforming processor (all core and optional features). Additionally, a DFDL processor can choose to provide a parser, a serializer, or both a parser and a serializer.

About OGF and DFDL Working Group

The Open Grid Forum (OGF) is an open community committed to the rapid evolution and adoption of applied distributed computing through activities which explore trends, share best practices and consolidate these best practices into standards. In OGF, people from around the world work together to drive openness and interoperability in the technology and protocols that enable discovery in science and the creation of value in business through applied distributed computing. The DFDL Working Group, an OGF group co-chaired by Steve Hanson of IBM and Mike Beckerle, is part of the Data Area within the OGF Standards function.