<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16825" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff
size=2></FONT> </DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>The problem with choice 2 is that once a string carries an
encoding, it is unclear what you get when you index into it at, say, position 3.
Do you get the 3rd byte of the encoded form? Or is the encoded form somehow
decoded into individual character codepoints? For many encodings that is not
crisply defined.</FONT></SPAN></DIV>
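<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>A minimal Python sketch of the indexing question above. The
sample text and the choice of UTF-8 and UTF-16 are assumptions made purely for
illustration; nothing here comes from the DFDL spec. It shows that "position 3"
(zero-based) names three different things depending on whether you index
codepoints, UTF-8 bytes, or UTF-16 code units.</FONT></SPAN></DIV>
<PRE>
```python
# Illustration only: sample text and encodings are assumptions.
# Characters: 'a', e-acute (U+00E9), CJK U+65E5, emoji U+1F600, 'b'
text = "a\u00e9\u65e5\U0001F600b"

utf8 = text.encode("utf-8")        # variable width: 1 to 4 bytes per character
utf16 = text.encode("utf-16-be")   # 2 bytes per code unit, 1 or 2 units per character

# Indexing the decoded string yields a whole codepoint.
codepoint_at_3 = ord(text[3])                      # U+1F600

# Indexing the UTF-8 bytes yields one byte of some character's encoding.
byte_at_3 = utf8[3]                                # 0xE6, lead byte of U+65E5

# Indexing UTF-16 code units can land on half of a surrogate pair.
unit_at_3 = int.from_bytes(utf16[6:8], "big")      # 0xD83D, a high surrogate

print(hex(codepoint_at_3), hex(byte_at_3), hex(unit_at_3))
```
</PRE>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>None of the three answers is wrong, which is the point: the
infoset has to say which one it means.</FONT></SPAN></DIV>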
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>If we go with choice 2 we should flat out say that the
string is an array of bytes representing a string by way of the
encoding.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>There's a variation we didn't explore: implementations
can supply the strings in whatever form they want, provided they make the
encoding available. This allows an implementation to always provide, say,
UTF-16 if it chooses. </FONT></SPAN></DIV>
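<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>One way to picture this variation, as a rough Python sketch.
The names InfosetString and to_utf16_always are hypothetical and exist only for
this illustration; they are not part of DFDL or of any implementation. The string
is held as bytes plus the name of the encoding that explains them, and one
possible implementation policy normalizes everything to UTF-16.</FONT></SPAN></DIV>
<PRE>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfosetString:
    """Hypothetical infoset string: raw bytes plus the encoding that explains them."""
    data: bytes
    encoding: str

    def decode(self):
        # The caller converts to whatever form it needs.
        return self.data.decode(self.encoding)


def to_utf16_always(raw, source_encoding):
    """Sketch of an implementation that always supplies UTF-16 (big-endian here)."""
    text = raw.decode(source_encoding)
    return InfosetString(text.encode("utf-16-be"), "utf-16-be")


# "As encoded": the bytes stay exactly as they appeared in the data.
as_encoded = InfosetString("caf\u00e9".encode("latin-1"), "latin-1")

# "UTF-16 always": same characters, different bytes, encoding still reported.
normalized = to_utf16_always(as_encoded.data, as_encoded.encoding)

print(as_encoded.data, normalized.data, normalized.encoding)
```
</PRE>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>Either way the consumer can recover the same characters,
because the encoding travels with the bytes.</FONT></SPAN></DIV>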
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>I'm in favor of the simplest possible thing here. So, for
example, if you guys have a UTF-16 constraint, then I'd be happy just picking
that as the encoding that is always used by the infoset.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=625023015-05052009><FONT face=Arial
color=#0000ff size=2>...mike</FONT></SPAN></DIV>
<DIV> </DIV>
<P align=left><B><SPAN
style="FONT-SIZE: 10pt; COLOR: navy; FONT-FAMILY: Arial">Mike Beckerle |
OGF DFDL WG Co-Chair | CTO | Oco, Inc.</SPAN></B><BR><SPAN
style="FONT-SIZE: 10pt; COLOR: gray; FONT-FAMILY: Arial">Tel:
781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451 |</SPAN> <A
href="mailto:mbeckerle.dfdl@gmail.com"><SPAN
style="FONT-SIZE: 10pt; COLOR: gray"><FONT
face=Arial>mbeckerle.dfdl@gmail.com</FONT></SPAN></A></P>
<DIV> </DIV><BR>
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Alan Powell
[mailto:alan_powell@uk.ibm.com] <BR><B>Sent:</B> Tuesday, May 05, 2009 11:14
AM<BR><B>To:</B> DFDL<BR><B>Cc:</B> dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org;
Steve Hanson<BR><B>Subject:</B> Re: [DFDL-WG] Infoset
codepage<BR></FONT><BR></DIV>
<DIV></DIV><BR><FONT face=sans-serif size=2>Isn't choice 2 the most flexible?
The caller can convert to what they need.</FONT> <BR><FONT face=sans-serif
size=2><BR>Alan Powell<BR><BR>MP 211, IBM UK Labs, Hursley, Winchester,
SO21 2JN, England<BR>Notes Id: Alan Powell/UK/IBM email:
alan_powell@uk.ibm.com <BR>Tel: +44 (0)1962 815073
Fax: +44 (0)1962
816898<BR></FONT><BR><BR><BR>
<TABLE width="100%">
<TBODY>
<TR vAlign=top>
<TD><FONT face=sans-serif color=#5f5f5f size=1>From:</FONT>
<TD><FONT face=sans-serif size=1>DFDL
&lt;mbeckerle.dfdl@gmail.com&gt;</FONT>
<TR vAlign=top>
<TD><FONT face=sans-serif color=#5f5f5f size=1>To:</FONT>
<TD><FONT face=sans-serif size=1>Steve Hanson/UK/IBM@IBMGB</FONT>
<TR>
<TD vAlign=top><FONT face=sans-serif color=#5f5f5f size=1>Cc:</FONT>
<TD><FONT face=sans-serif size=1>Alan Powell/UK/IBM@IBMGB,
"dfdl-wg@ogf.org" &lt;dfdl-wg@ogf.org&gt;, "dfdl-wg-bounces@ogf.org"
&lt;dfdl-wg-bounces@ogf.org&gt;</FONT>
<TR vAlign=top>
<TD><FONT face=sans-serif color=#5f5f5f size=1>Date:</FONT>
<TD><FONT face=sans-serif size=1>05/05/2009 15:35</FONT>
<TR vAlign=top>
<TD><FONT face=sans-serif color=#5f5f5f size=1>Subject:</FONT>
<TD><FONT face=sans-serif size=1>Re: [DFDL-WG] Infoset
codepage</FONT></TR></TBODY></TABLE><BR>
<HR noShade>
<BR><BR><BR><FONT size=3><BR>How about we specify Unicode codepoints, but allow
implementations to limit the numeric range of codepoints they support?
</FONT> <BR><BR><FONT size=3>Reason: it keeps us out of the codepoints vs.
encodings morass. </FONT><BR><BR><FONT size=3>...mikeb</FONT> <BR><BR><FONT
size=3><BR>On May 5, 2009, at 10:20 AM, Steve Hanson &lt;</FONT><A
href="mailto:smh@uk.ibm.com"><FONT color=blue
size=3><U>smh@uk.ibm.com</U></FONT></A><FONT size=3>&gt;
wrote:<BR></FONT><BR><FONT face=sans-serif size=2><BR>There is a 4th option -
remain silent and leave it up to the implementation.</FONT><FONT size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>Reason: Within IBM we have
different products that will embed a DFDL parser/unparser. WMB requires strings in
UTF-16, but that is not always the case for the others.</FONT><FONT size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>Regards<BR><BR>Steve
Hanson<BR>Programming Model Architect<BR>WebSphere Message Brokers<BR>Hursley,
UK<BR>Internet: </FONT><A href="mailto:smh@uk.ibm.com"></A><A
href="mailto:smh@uk.ibm.com"><FONT face=sans-serif color=blue
size=2><U>smh@uk.ibm.com</U></FONT></A><FONT face=sans-serif size=2><BR>Phone
(+44)/(0) 1962-815848</FONT><FONT size=3> <BR><BR></FONT>
<TABLE width="100%">
<TBODY>
<TR vAlign=top>
<TD width="48%"><FONT face=sans-serif size=1><B>"Mike Beckerle"
&lt;</B></FONT><A href="mailto:mbeckerle.dfdl@gmail.com"><FONT
face=sans-serif color=blue
size=1><B><U>mbeckerle.dfdl@gmail.com</U></B></FONT></A><FONT
face=sans-serif size=1><B>&gt;</B> <BR>Sent by: </FONT><A
href="mailto:dfdl-wg-bounces@ogf.org"></A><A
href="mailto:dfdl-wg-bounces@ogf.org"><FONT face=sans-serif color=blue
size=1><U>dfdl-wg-bounces@ogf.org</U></FONT></A><FONT size=3> </FONT>
<P><FONT face=sans-serif size=1>05/05/2009 14:09</FONT><FONT size=3>
</FONT>
<P><BR>
<TABLE border=1>
<TBODY>
<TR vAlign=top>
<TD bgColor=white>
<DIV align=center><FONT face=sans-serif size=1>Please respond
to</FONT><FONT face=sans-serif color=blue
size=1><U><BR></U></FONT><A
href="mailto:mbeckerle.dfdl@gmail.com"></A><A
href="mailto:mbeckerle.dfdl@gmail.com"><FONT face=sans-serif
color=blue
size=1><U>mbeckerle.dfdl@gmail.com</U></FONT></A></DIV></TR></TBODY></TABLE><BR></P>
<TD width="51%">
<TABLE width="100%">
<TBODY>
<TR vAlign=top>
<TD width="13%">
<DIV align=right><FONT face=sans-serif size=1>To</FONT></DIV>
<TD width="86%"><FONT face=sans-serif size=1>Alan
Powell/UK/IBM@IBMGB, &lt;</FONT><A
href="mailto:dfdl-wg@ogf.org"><FONT face=sans-serif color=blue
size=1><U>dfdl-wg@ogf.org</U></FONT></A><FONT face=sans-serif
size=1>&gt;</FONT><FONT size=3> </FONT>
<TR vAlign=top>
<TD>
<DIV align=right><FONT face=sans-serif size=1>cc</FONT></DIV>
<TD>
<TR vAlign=top>
<TD>
<DIV align=right><FONT face=sans-serif size=1>Subject</FONT></DIV>
<TD><FONT face=sans-serif size=1>[DFDL-WG] Infoset
codepage</FONT></TR></TBODY></TABLE><BR><BR>
<TABLE>
<TBODY>
<TR vAlign=top>
<TD>
<TD></TR></TBODY></TABLE><BR></TR></TBODY></TABLE><BR><FONT
size=3><BR><BR></FONT><FONT face=sans-serif size=2><BR><BR>4. Infoset codepage
and encoding <BR><BR>The spec does not say what codepage and encoding is used
for string fields. </FONT>
<P><FONT face=Arial color=blue size=2>I wanted to comment on this.</FONT><FONT
size=3> </FONT>
<P><FONT face=Arial color=blue size=2>There are three choices here: </FONT><FONT
face=sans-serif size=2><BR>1. </FONT><FONT face=Arial
color=blue size=2>Unicode codepoints - we may need to preserve the mapping table
(from representation encoding to Unicode) as part of the infoset.</FONT><FONT
size=3> </FONT><FONT face=sans-serif size=2><BR>2.
</FONT><FONT face=Arial color=blue size=2>"As Encoded" codepoints -
we must add the encoding to the infoset.</FONT><FONT size=3> </FONT><FONT
face=sans-serif size=2><BR>3. </FONT><FONT face=Arial
color=blue size=2>Both</FONT><FONT size=3> </FONT><FONT face=Arial color=blue
size=2><BR>In favor of Unicode codepoints - simplicity. A minor issue is that some
mappings lose information, making perfect round-tripping of string contents
impossible.</FONT><FONT size=3> </FONT><FONT face=Arial color=blue
size=2><BR>E.g., EBCDIC has two different line endings, both of which are normally
translated to the ASCII/Unicode linefeed. Hence, translating back is
ambiguous.</FONT><FONT size=3> <BR> </FONT><FONT face=Arial color=blue
size=2><BR>In favor of "as encoded" - simplicity. We just add an encoding
attribute to the string infoset object, which returns the information that the
dfdl:encoding representation property contained. Note that the encoding
information is already available via the schema component associated with
the string, so there is some redundancy here. There is also the question of
whether one wants codepoints or raw access to the bytes.
E.g., if the encoding is UTF-8 or Shift JIS, then each character takes up 1 or
more bytes. Do you want the bytes, the interpreted codepoints, or
both?</FONT><FONT size=3> <BR> </FONT><FONT face=Arial color=blue
size=2><BR>In favor of "both" - complexity, but eliminates all the
ambiguity.</FONT><FONT size=3> <BR> </FONT><FONT face=Arial color=blue
size=2><BR>My suggestion: keep it simple for v1.0 and choose number 1, because we
can always expand the capabilities later by providing access to the as-encoded
representation one way or another. </FONT><FONT size=3><BR> </FONT><FONT
face=Arial color=blue size=2><BR>If you badly need infoset-level contents which
expose the actual representation character codes, you can always model this as
an array of bytes instead of a character string. </FONT><FONT
size=3><BR> </FONT><FONT face=Arial color=blue
size=2><BR>...mike</FONT><FONT size=3> <BR> </FONT>
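<P><FONT face=Arial color=blue size=2>The EBCDIC line-ending point above can be
sketched in Python as follows. The mapping tables are hand-written and
deliberately tiny, and they are an assumption for illustration: real EBCDIC code
page tables differ in how they map these two control codes. The sketch folds both
EBCDIC NL (0x15) and LF (0x25) into a single Unicode linefeed, so the reverse
mapping has to pick one of them.</FONT></P>
<PRE>
```python
# Hand-written mapping tables, for illustration only (an assumption).
EBCDIC_NL = 0x15
EBCDIC_LF = 0x25

# Both EBCDIC line endings decode to the same Unicode linefeed.
TO_UNICODE = {EBCDIC_NL: "\n", EBCDIC_LF: "\n", 0xC1: "A", 0xC2: "B"}

# The reverse table can hold only one byte for "\n"; this sketch picks NL.
FROM_UNICODE = {"\n": EBCDIC_NL, "A": 0xC1, "B": 0xC2}


def decode(raw):
    """Map each byte to its Unicode character using the lossy table."""
    return "".join(TO_UNICODE[b] for b in raw)


def encode(text):
    """Map each character back to a byte; the original line ending is gone."""
    return bytes(FROM_UNICODE[ch] for ch in text)


original = bytes([0xC1, EBCDIC_LF, 0xC2])   # "A", LF, "B" in EBCDIC
round_tripped = encode(decode(original))

print(original.hex(), round_tripped.hex())  # c125c2 c115c2
```
</PRE>
<P><FONT face=Arial color=blue size=2>Preserving the mapping table, or the
as-encoded bytes, in the infoset is what would let an unparser restore the
original byte.</FONT></P>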
<P><FONT face=Arial color=#000080 size=2><B>Mike Beckerle | OGF DFDL WG Co-Chair
| CTO | Oco, Inc.</B></FONT><FONT face=Arial color=#808080 size=2><BR>Tel:
781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451
|</FONT><FONT face=Arial color=blue size=2> </FONT><A
href="mailto:mbeckerle.dfdl@gmail.com"><FONT face=Arial color=#808080
size=2><U>mbeckerle.dfdl@gmail.com</U></FONT></A><FONT face=Arial color=#808080
size=2> </FONT><TT><FONT size=2>--<BR>dfdl-wg mailing list<BR></FONT></TT><A
href="mailto:dfdl-wg@ogf.org"></A><A href="mailto:dfdl-wg@ogf.org"><TT><FONT
color=blue size=2><U>dfdl-wg@ogf.org</U></FONT></TT></A><TT><FONT
size=2><BR></FONT></TT><A
href="http://www.ogf.org/mailman/listinfo/dfdl-wg"></A><A
href="http://www.ogf.org/mailman/listinfo/dfdl-wg"><TT><FONT color=blue
size=2><U>http://www.ogf.org/mailman/listinfo/dfdl-wg</U></FONT></TT></A><FONT
size=3> </FONT><FONT face=sans-serif size=2><BR></FONT><FONT
size=3><BR></FONT><FONT face=sans-serif size=2><BR></FONT>
<P>
<HR>
<FONT face=sans-serif size=2><I><BR></I></FONT>
<P><FONT face=sans-serif size=2><I>Unless stated otherwise above:<BR>IBM United
Kingdom Limited - Registered in England and Wales with number 741598.
<BR>Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU</I></FONT><FONT size=3> </FONT>
<P><FONT face=sans-serif size=2><BR></FONT><FONT size=3><BR><BR></FONT><FONT
face=sans-serif size=2><BR></FONT>
<P><BR><FONT face=sans-serif size=2><BR></FONT><BR><FONT face=sans-serif
size=2><BR></FONT>
<P><FONT face=sans-serif size=2><BR><BR></FONT><BR><BR><FONT face=sans-serif
size=2><BR></FONT></P></BODY></HTML>