[llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang

Wed Jun 17 05:06:03 PDT 2020

Tom Honermann <Thomas.Honermann at synopsys.com> wrote on 16.06.2020 
19:09:18:

> From: Tom Honermann <Thomas.Honermann at synopsys.com>
> To: Kai Peter Nacke <kai.nacke at de.ibm.com>
> Cc: Corentin <corentin.jabot at gmail.com>, "llvm-dev at lists.llvm.org" 
> <llvm-dev at lists.llvm.org>
> Date: 16.06.2020 19:09
> Subject: [EXTERNAL] RE: [llvm-dev] RFC: Adding support for the z/OS 
> platform to LLVM and clang
> 
> > -----Original Message-----
> > From: Kai Peter Nacke <kai.nacke at de.ibm.com>
> > Sent: Tuesday, June 16, 2020 11:17 AM
> > To: Tom Honermann <thonerma at synopsys.com>
> > Cc: Corentin <corentin.jabot at gmail.com>; llvm-dev at lists.llvm.org
> > Subject: RE: [llvm-dev] RFC: Adding support for the z/OS platform to
> > LLVM and clang
> > 
> > Tom Honermann <Thomas.Honermann at synopsys.com> wrote on 16.06.2020
> > 16:53:33:
> > 
> > > > > > > 2) Add patches to Clang to allow EBCDIC and ASCII (ISO-8859-1)
> > > > > > > encoded input source files. This would be done at the file
> > > > > > > open time to allow the rest of Clang to operate as if the
> > > > > > > source was UTF-8 and so require no changes downstream.
> > > > > > > Feedback on this plan is welcome from the Clang community.
> > > > > >
> > > > > > Would it be correct to assume that this EBCDIC -> UTF-8 mapping
> > > > > > would be as prescribed by UTF-EBCDIC / IBM CDRA, notably for the
> > > > > > control characters that do not map exactly?
> > > > > > Notably, if the execution encoding is EBCDIC, is '0x06'
> > > > > > equivalent to '0086', etc?
> > > > > >
> > > > > > The question "Is Unicode sufficient to represent all characters
> > > > > > present in the input source without using the Private Use Area?"
> > > > > > is one that is relevant to both Clang and the C/C++ standard.
> > > > > > (I do hope that it is the case!)
> > > >
> > > > The current goal is to make only minimal changes to the frontend to
> > > > enable reading of EBCDIC encoded files. For this, we use the
> > > > auto-conversion service of z/OS UNIX System Services (
> > > > https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.bpxb200/xpascii.htm
> > > > ), together with file tagging and setting the CCSID for the program
> > > > and for opened files. The auto-conversion service supports
> > > > round-trip conversion between EBCDIC and Enhanced ASCII. With it,
> > > > bootstrapping with EBCDIC source files is possible.
> > > > Of course, more complete UTF-8 support is a valid implementation
> > > > alternative.
> > >
> > > Other good references:
> > > - The 'chtag' utility
> > >   https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
> > > - File tagging overview
> > >   https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm
> > >
> > > Kai, would use of auto conversion require that users set the
> > > _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment
> > > variables?  Or do you envision having the clang driver set them
> > > before invocation of the compiler?  If the latter, that would imply
> > > that users (and tests) are responsible for setting them for direct
> > > 'clang -cc1' invocations.
> > 
> > Hi Tom,
> > the current approach is to enable auto conversion only if
> > _BPXK_AUTOCVT is set to ON. If the variable is not set, then all input
> > files are treated as EBCDIC. The rationale behind this is that we do
> > not want to outsmart the user.
> > So there is no problem with direct `clang -cc1` invocations. It's a
> > good hint that we need to describe this setup somewhere.
> 
> That seems reasonable.  How would you handle _BPXK_AUTOCVT being set to
> ALL?
>
> (
> For anyone following along, the difference between ON and ALL is
> described at
> https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/setenv.htm#setenv:
> > When _BPXK_AUTOCVT is ON, automatic conversion can only take place
> > between IBM-1047 and ISO8859-1 code sets. Other CCSID pairs are not
> > supported for automatic text conversion. To request automatic
> > conversion for any CCSID pairs that Unicode service supports, set
> > _BPXK_AUTOCVT to ALL.
> )
> 
> Tom.
> 

That's a bit more complicated. For reading files, I can imagine the
following approach:
- the application is still using the ASCII execution mode (to link against
  the ASCII version of the library)
- on each file handle, the program CCSID is set to UTF-8 (1208);
  auto-conversion on the file is turned on if
  - _BPXK_AUTOCVT is set to ALL
  - the file is untagged (assuming EBCDIC 1047) or the file tag is not 1208
Writing text files would need a default encoding. Using UTF-8 (1208) would
make sense.
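
For illustration only, the per-file decision described above can be
sketched in Python. This is a minimal sketch under stated assumptions, not
the code that was tried: the names `autocvt_mode` and `needs_conversion`
are hypothetical, the two sub-conditions are assumed to both be required,
and the value of _BPXK_AUTOCVT is assumed to be matched case-insensitively.
The z/OS calls that would actually read the file tag and switch conversion
on for a file handle are deliberately left out.

```python
import os
from typing import Mapping, Optional

CCSID_UTF8 = 1208    # program CCSID proposed above
CCSID_EBCDIC = 1047  # IBM-1047, assumed for untagged files


def autocvt_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return 'ON', 'ALL', or 'OFF' based on _BPXK_AUTOCVT.

    Assumption: an unset or unrecognized value counts as 'OFF', and the
    comparison ignores case.
    """
    value = env.get("_BPXK_AUTOCVT", "").strip().upper()
    return value if value in ("ON", "ALL") else "OFF"


def needs_conversion(file_ccsid: Optional[int], mode: str) -> bool:
    """Decide whether auto-conversion would be turned on for one file.

    file_ccsid is the CCSID from the file tag, or None if the file is
    untagged. Hypothetical helper: it only models the decision, it does
    not touch any file.
    """
    if mode != "ALL":
        return False
    # Untagged files are assumed to be EBCDIC 1047.
    effective = CCSID_EBCDIC if file_ccsid is None else file_ccsid
    # Files already tagged as UTF-8 need no conversion.
    return effective != CCSID_UTF8
```

With this sketch, an untagged file read while _BPXK_AUTOCVT=ALL would be
converted, while a file already tagged 1208 would be read as-is.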

This is really a "rough" first thought. I gave it a quick try, and it 
failed. Most likely I overlooked something.

Best regards,
Kai Nacke
IT Architect
