Handling Non-Windows Text Files

Contents of Article


Introduction

Common settings for non-Windows files

Determining file attributes through experimentation

EBCDIC files

Custom EBCDIC translation tables, prior to Version 7.1

Custom Translation Tables, from version 7.1 onward

Using the EOL settings AUTO and AUTONL

Handling files with lone CR characters


Introduction


We are normally unconcerned with the file format of common text files or source programs, since most Windows functions and tools naturally work with this format, which is simple variable-length text.  That format is often compatible with other, non-Windows systems, but not always.


Occasionally you will attempt to open a file from another system, only to get a disorganized or unreadable screen display - so you know something is wrong with the data's content or format, or with SPFLite's understanding of the file.  The problem could be one of several factors:


Record Format

There are three basic formats used to organize how multiple text lines are stored in a file, plus two variants of the Variable format.  The format is stored in the Profile and is set with the RECFM command.  The formats are:


U - Undefined

Undefined is the norm for Windows text files; the individual text lines may be any length and are separated by unique delimiter characters.  See Line Delimiters below.


F - Fixed

Fixed indicates that all text lines are the same length, known as the Logical Record Length.  See Logical Record Length below.   Note that Fixed records may be stored with or without Line Delimiters as well.


V - Variable

Variable indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line.   The prefix field is known as an RDW for Record Descriptor Word.


VBI - Variable Big-Endian

Variable Big-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line.   The prefix field is known as an RDW for Record Descriptor Word. VBI specifies an RDW of 4 bytes in big-endian format containing the length of the data record not including the RDW itself.


VLI - Variable Little-Endian

Variable Little-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line.   The prefix field is known as an RDW for Record Descriptor Word. VLI specifies an RDW of 4 bytes in little-endian format containing the length of the data record not including the RDW itself.
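The VBI and VLI layouts described above can be sketched in a few lines of Python.  This is only an illustration of the record layout, not SPFLite's own code; the function name split_rdw_records is hypothetical, and it assumes every RDW is exactly 4 bytes and holds the data length without counting the RDW itself.

```python
import struct


def split_rdw_records(data: bytes, big_endian: bool = True) -> list[bytes]:
    """Split a byte string into records using 4-byte length prefixes (RDWs).

    Assumes each RDW holds the length of the data only, excluding the RDW,
    as described for RECFM VBI (big-endian) and VLI (little-endian).
    """
    fmt = ">I" if big_endian else "<I"
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError(f"truncated RDW at offset {pos}")
        (length,) = struct.unpack_from(fmt, data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError(f"record at offset {pos} runs past end of data")
        records.append(data[pos:pos + length])
        pos += length
    return records
```

If the lengths come out absurdly large, the byte order is probably the other one, or the file is not in either of these formats.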


Line Delimiters

Windows text files typically separate text lines from each other by using a pair of special characters, the CR (Carriage Return) and LF (Line Feed) characters, having the hex value X'0D0A'.   This pair is normally referred to as CRLF for short.  


Unix, Linux and newer Macintosh systems use just the LF character.  Older Macintosh systems and a few others use just the CR character, but that usage is rare.


Some systems even have their own 1 or 2 byte unique delimiter strings.  You might also wish to temporarily define your own line delimiters.  For example, it could be convenient to temporarily treat a comma as the end of the line if you had a type of file called a CSV, or Comma-Separated-Value.


SPFLite allows you to process all these types (including files without delimiters) by setting the Profile option EOL to the requirements of your file.


There are also files which are not even consistent within themselves, seemingly mixing CRLF, LF, FF, and CR in odd combinations.  Examples of this are SYSOUT files from mainframe emulators such as Hercules.  SPFLite addresses such files by providing the EOL types of AUTO and AUTONL, discussed later in this article.
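When it is unclear which delimiter a file uses, counting the candidates can help in choosing an EOL setting.  The following Python sketch is an illustration only (the function name count_line_terminators is hypothetical, and it is not how SPFLite itself detects delimiters).  It counts CRLF pairs, bare LF, bare CR and FF characters in the raw bytes of an ANSI file.

```python
import re

# CRLF is listed first so that a CR followed by LF is counted as one pair.
TERMINATOR = re.compile(rb"\r\n|\n|\r|\x0c")
NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR", b"\x0c": "FF"}


def count_line_terminators(data: bytes) -> dict[str, int]:
    """Count each kind of line terminator found in raw ANSI file data."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0, "FF": 0}
    for match in TERMINATOR.finditer(data):
        counts[NAMES[match.group()]] += 1
    return counts
```

A file that reports only one non-zero count is a good candidate for that specific EOL value; several non-zero counts suggest trying EOL AUTO or AUTONL.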


Character Set

Even though the Line delimiter (EOL) may be correct, the data in a text file may be using some other character set encoding technique.  If the data seems to be total gibberish, or perhaps has a few extraneous characters at the beginning that don't seem to belong, this could be the cause.   SPFLite supports the following character sets via the Profile SOURCE variable.  


Note regarding Unicode


The Unicode support in SPFLite is quite limited.  SPFLite is not a true Unicode text editor, as all editing functions take place internally in native ANSI (Windows MS-1252) mode.  The SPFLite support is provided to allow reading in and writing Unicode files which only contain characters that map to normal ANSI characters.  This may sound very restrictive, but many Unicode files are stored in UTF format because web servers demand that format, not because there is a true requirement for the additional (non-ANSI) characters supported in Unicode.  For this type of application, the SPFLite support is perfectly adequate.


If you truly need the ability to edit the full Unicode character set, then SPFLite will not handle your requirements.
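One way to judge in advance whether a UTF-8 file falls within this limitation is to test whether every character in it can be represented in Windows 1252.  The sketch below uses Python's built-in cp1252 codec as a stand-in for SPFLite's ANSI handling, which is an assumption; the function name fits_in_ansi is hypothetical.

```python
def fits_in_ansi(utf8_bytes: bytes) -> bool:
    """Return True if UTF-8 data contains only characters found in cp1252."""
    text = utf8_bytes.decode("utf-8")
    try:
        # Assumption: Python's cp1252 codec approximates SPFLite's ANSI set.
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True
```

A result of False means the file contains at least one character outside the ANSI set, so it would not survive a round trip through an ANSI-only editor unchanged.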


The following are supported:


ANSI

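This is the default and is the normal Windows character set.  ANSI is also known as MS-1252 (Windows 1252), which is a superset of the printable characters of ISO-8859-1.  It matches the first 256 code points of Unicode except in the range X'80' through X'9F', where MS-1252 defines printable characters such as the Euro sign.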


UTF8

This format of encoding (one of the Unicode types) is the most 'economical', as the basic characters (values 0 through 127) are stored as single 8-bit bytes, and only characters above 127 are encoded as longer multi-byte sequences.


UTF16

UTF16LE

Another Unicode format, this format of encoding uses 16 bits for each character. There are two types of UTF16, LE (Little Endian) and BE (Big Endian) which refer to the byte order of the encoded 16 bit value. The normal value for Windows based systems is UTF16/UTF16LE, since the Intel processors that run Windows are Little-Endian devices.


UTF16BE

And another Unicode format, this one uses 16 bits for each character. The encoded 16 bit value is stored in Big-Endian format.  UTF16BE is used on IBM mainframe systems like z/OS when they process Unicode data as 16-bit values.


EBCDIC

This is the standard 8-bit encoding used by IBM mainframe systems.  Further information on EBCDIC appears later in this article.
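The "few extraneous characters at the beginning" mentioned under Character Set are often a Unicode byte order mark (BOM).  As a rough aid for choosing a SOURCE value, the Python sketch below checks the first bytes of a file for the standard BOMs.  The function name guess_source_from_bom is hypothetical, the returned strings simply mirror the SOURCE names listed above, and many Unicode files carry no BOM at all, so a result of None proves nothing.

```python
import codecs
from typing import Optional


def guess_source_from_bom(data: bytes) -> Optional[str]:
    """Suggest a SOURCE value from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF8"
    # Note: a UTF-32 little-endian BOM also begins with FF FE.
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF16LE"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF16BE"
    return None
```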


Logical Record Length

When the Record Format is set to F (for fixed), the length of the fixed record must be specified using the LRECL command.  The LRECL is set to 0 (zero) for conventional Windows variable-length records that are terminated by a CRLF pair.


If you need to enforce a minimum logical record length that is greater than zero, see Managing Line Lengths and MINLEN - Set Minimum Record Length for more information.



Common settings for non-Windows files


What should you specify when something is obviously not right?  It is best to check with the provider of the file, or determine the type of system on which it was created.


If it was a Unix/Linux system or newer Macintosh, then try the following:

RECFM U,  EOL LF,  SOURCE ANSI,  LRECL 0


If it was an IBM mainframe, then try either:

RECFM V,  EOL NONE,  SOURCE EBCDIC,  LRECL 0

 or

RECFM F,  EOL NONE,  SOURCE EBCDIC,  LRECL nn

where nn is a record length based on what the file usage appears to be.
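To check a guessed LRECL outside of SPFLite, a fixed-length EBCDIC file can be split and decoded with a short Python sketch such as the one below.  It relies on Python's built-in cp1140 codec, which decodes to Unicode rather than to Windows 1252, so treat it as an approximation of what SPFLite would display; the function name read_fixed_ebcdic is hypothetical.

```python
def read_fixed_ebcdic(data: bytes, lrecl: int) -> list[str]:
    """Split undelimited fixed-length EBCDIC data into decoded text lines."""
    if lrecl <= 0:
        raise ValueError("LRECL must be greater than zero")
    if len(data) % lrecl != 0:
        # A remainder usually means the guessed LRECL is wrong.
        raise ValueError("file size is not a multiple of LRECL")
    return [
        data[start:start + lrecl].decode("cp1140")
        for start in range(0, len(data), lrecl)
    ]
```

For card-image data, trying a length of 80 first is a common starting point; readable text that drifts sideways from line to line usually indicates a wrong length.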


If it is a web document such as HTML, then try:

RECFM U,  EOL CRLF,  SOURCE UTF8,  LRECL 0

 

Determining file attributes through experimentation


If you lack exact information about the file's format, resolving it may become a matter of trial and error.  Sometimes this requires loading the file as a simple normal text file and carefully examining the data.  It is sometimes helpful here to set SPFLite to use a good ANSI font such as Raster (included in the optional Font Package on the web site), since it shows line delimiter control characters like CR and LF in a visible form, making it easier to work out what you're looking at.


For very difficult file issues, you may need to exit SPFLite and examine the data with a Hex Editor.  Several free versions are available on the Internet.  One such hex editor may be found at  http://www.catch22.net/


When experimenting with these parameters to find the correct settings, it is sometimes helpful to rename the problem file to a unique file type (e.g. PROBLEM.TRYIT) so that you don't interfere with the options for a currently valid file profile.


Be sure to issue a PROFILE UNLOCK on this test profile.  Then you can simply EDIT the file, examine it, alter one or more of the Profile variables you are experimenting with, CANCEL out and EDIT it again to try the new settings.  You can repeat this in a trial-and-error approach, or see if the originator of the file has documentation on the file's format.


Note:  Once you 'get it right', remember that SPFLite can now, using the same Profile, successfully write files in this format.  For example, say you have a profile for a file type of EBC (for EBCDIC).  You can now 'convert' a normal Windows text file to this format by simply editing the Windows text file, entering PROFILE USING EBC, and then using CREATE/REPLACE to write the new file.  Then CANCEL out of the edit session.  Make sure the original file's Profile is LOCKED before doing this to prevent making the USING permanent.


EBCDIC files


SPFLite internally handles all data as Ansi characters.  This is equivalent to the Windows 1252 character set, which matches the first 256 code points of Unicode except in the range X'80' through X'9F'.


Note:  IBM has its own ideas about code pages.  What Microsoft calls Windows 1252 isn't as simple to IBM.  The reason is that 1252 has changed over the years, most recently to add the Euro character.  Even though the exact characters present in this code page have changed, Microsoft still calls it 1252.  However, IBM considers "1252" to mean the Windows code page 1252 of the past, prior to the advent of the Euro and some other changes.  IBM calls the Windows 1252 code page of today 5348.  They take this position because they have to support things like DB2 databases, where database administrators have to know precisely what data is, and is not, present in CHARACTER database fields, and the old and new 1252 are just not the same thing.  For the record, SPFLite's idea of 1252 is the same as IBM's idea of 5348.  That is, we support the current Windows 1252 code page definition.


You can also edit EBCDIC files, by setting up a PROFILE for a given file type (that is, a file name extension) that has SOURCE EBCDIC associated with it.  Only Ansi characters are displayed on the edit screen.


In order for SPFLite to handle EBCDIC data, it must translate it from EBCDIC to Ansi while editing, and from Ansi back to EBCDIC for storing externally.  To do that, a translation table is required.  SPFLite has already defined such a table, and normally there is nothing you need to do, but the following just explains the process.


Presently, only one translation table is supplied with SPFLite, which converts between Windows 1252 (Ansi) and IBM EBCDIC code page 1140.  Code page 1140 is a modern code set, consisting of the earlier code page 037 plus the Euro character.  Page 1140 is commonly used in North America for nearly all IBM z/OS mainframe installations.  The particular tables used by SPFLite are two-way lossless tables that are based on published IBM code-table documentation.  Any otherwise unallocated characters have a unique one-to-one translation, so that no data will be lost or mistranslated while editing the Ansi version of your EBCDIC data, even for “binary” data.   (Even the "unallocated characters" have lossless translations based on IBM specifications; we did not "make up" any rules for this just for SPFLite.)  If for any reason this table is not suitable for your use (if you need EBCDIC national characters outside the North American and/or European characters in Code Page 1140) it is possible to provide your own table.  See Custom EBCDIC translation tables below.


This default table performs a translation which is identical to the translation performed in the Hercules mainframe emulator when using a configuration file parameter line of CODEPAGE 1252/1140.


SPFLite does not dictate how lines are terminated in EBCDIC files.  You decide how you want this to be handled.  All the existing CR/LF combinations can be used with EBCDIC, and their EBCDIC equivalents are used to terminate lines.  SPFLite also supports the EBCDIC New Line character NL = X'15' as an EOL value.  Additionally, you may use the non-standard file formats noted below.  Users of the Hercules mainframe emulator who need to edit EBCDIC files may find cases where fixed-length files and EOL NONE are required.


There is nothing to prevent you from specifying EOL NL in an ANSI file.  However, the ANSI equivalent of EBCDIC X'15' is X'85', which is not a standard ANSI text delimiter, and so NL may be of limited usefulness outside of EBCDIC files.


Custom EBCDIC translation tables, prior to Version 7.1


Note: Translation tables are undergoing transition.  Starting in version 7.1, translation tables have undergone many changes.


If you have existing procedures to create translation tables, you can continue to use them, but any .TXT-based  table must be converted to the new format.  SPFLite is distributed with a command-line program called TxtToSource.exe.  This program will not allow you to create a table actually called EBCDIC.SOURCE, or to overwrite any file that already exists, as a safeguard.  Suppose you had made your own table, and now you want to use it in the current version.  Open a command prompt in the SPFLite directory, and issue this command:


       TxtToSource EBCDIC.TXT MYEBCDIC.SOURCE


When you want to use your custom table, you'd issue the command SOURCE MYEBCDIC in the edit profile of the file type you want to edit.


If you decide to create your own custom EBCDIC translation table, you should be sure that the EBCDIC to ANSI and ANSI to EBCDIC tables are lossless.  That is, translating data back and forth between the two tables should result in no effective change taking place.  The IBM terminology for this is called "Round Trip Translation Mode".  If this is not the case, you may run into problems such as data loss or data corruption.  It is possible to create translation tables in "Enforce Subset Translation Mode" using SUB characters where there isn't a one-to-one match, but you may run into problems.  For most users, Round Trip mode is the correct way to handle this.


To create the table do the following:




Because code pages and translation tables are very critical and tedious to get exactly right, it is common to use a program to generate and verify such tables, rather than trying to do it manually.  The default ANSI to EBCDIC translation table supplied with SPFLite was generated by such a program, which verified the table as being two-way lossless.  


If you were to write such a program yourself, you would have to set up two 256-entry arrays, one for ANSI to EBCDIC and one for EBCDIC to ANSI.  You populate one of the tables, ensuring that all 256 entries are set, with none unassigned and none assigned more than once.  (If the arrays hold integers, you could initialize each entry to -1; then, before setting a value, ensure the entry is still -1, and afterwards review the table to be sure no -1 entries are left.)  Assume that the ANSI to EBCDIC table is set up first.  If you didn't have definitions for all 256 entries, you would have to decide on some mapping scheme for the "left over" values, so that you would end up with a two-way lossless translation.  Then, you would populate the EBCDIC to ANSI table from the first, using a loop like the following pseudo-code:


       for a = 0 to 255

               e = AnsiToEbcdic [a]

               EbcdicToAnsi [e] = a

       end for


Finally, output the tables using the format described above.
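For readers who prefer something runnable, the loop and the -1 check described above might look as follows in Python.  This is a minimal sketch, not the generator program used for the supplied table; the function name invert_table is hypothetical, and the table is assumed to be a list of 256 integers.

```python
def invert_table(forward: list[int]) -> list[int]:
    """Build the reverse table, verifying the mapping is two-way lossless."""
    if len(forward) != 256:
        raise ValueError("a translation table needs exactly 256 entries")
    inverse = [-1] * 256  # -1 marks an entry that has not been assigned yet
    for source, target in enumerate(forward):
        if not 0 <= target <= 255:
            raise ValueError(f"entry {source:02X} is out of range")
        if inverse[target] != -1:
            raise ValueError(f"value {target:02X} is assigned more than once")
        inverse[target] = source
    # With 256 distinct in-range targets, no -1 entries can remain.
    return inverse
```

Translating any byte through the forward table and then through the result returns the original byte, which is the Round Trip property described earlier.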



Custom Translation Tables, from version 7.1 onward


Note: Translation tables are undergoing transition.  Starting in version 7.1, translation tables have undergone the following changes:


As noted above, the command-line program TxtToSource.exe is provided to convert from the old to the new format.


Why was the format changed?  There were a number of problems with the old format:



The new format addresses all of these problems, and has a number of benefits.  One of them is that when a translation table is in Round-Trip/Lossless mode, only one 'side' of the translation table is needed, since the two 'halves' are like mirror images of each other anyway.  When SPFLite knows you have such a table, it validates that the values in the table are consistent with that.  That means, for each of 256 possible character values in a table, each one must appear once and only once for the table to be valid.


Independent of SPFLite, software is being written to directly generate SPFLite translation tables from IBM-provided code page data, known as UCM files.  This software is in development and will be available in the near future.  Rather than describe here how that program generates translation tables, the current advice is to just create old-style tables for now and convert them.  Once this new software is ready, a full discussion of all these issues will be documented.


Meantime, to get an idea about the new .SOURCE format, here is an example of what the TxtToSource.exe conversion program does to the old EBCDIC.TXT file, showing the key features of the new format:


TT  TITLE='SPFLITE TRANSLATION TABLE'  MODE=RT

TT  GENDATE='2013-11-29 14:36:46'


**  SOURCE file was created from conversion of 'EBCDIC.TXT'


**  AE comment:  ASCII 1252 => EBCDIC 1140

**  EA comment:  EBCDIC 1140 => ASCII 1252


**  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  EA


0*  00  01  02      œ  09   †  7F  97  8D  8E  0B  0C  0D        0*

1*                      …  08   ‡           ’                     1*

2*   ¤  81   ‚   ƒ   „  0A          ˆ   ‰   Š   ‹   Œ  05  06  07  2*

3*  90   ‘      “   ”   •   –      ˜    ™   š   ›  14  15   ž  1A  3*

4*      A0   â   ä   à   á   ã   å   ç   ñ   ¢   .   <   (   +   |  4*

5*   &   é   ê   ë   è   í   î   ï   ì   ß   !   $   *   )   ;   ¬  5*

6*   -   /   Â   Ä   À   Á   Ã   Å   Ç   Ñ   ¦   ,   %   _   >   ?  6*

7*   ø   É   Ê   Ë   È   Í   Î   Ï   Ì   `   :   #   @   '   =   "  7*

8*   Ø   a   b   c   d   e   f   g   h   i   «   »   ð   ý   þ   ±  8*

9*   °   j   k   l   m   n   o   p   q   r   ª   º   æ   ¸   Æ   €  9*

A*   µ   ~   s   t   u   v   w   x   y   z   ¡   ¿   Ð   Ý   Þ   ®  A*

B*   ^   £   ¥   ·   ©   §   ¶   ¼   ½   ¾   [   ]   ¯   ¨   ´   ×  B*

C*   {   A   B   C   D   E   F   G   H   I   ­    ô   ö   ò   ó   õ  C*

D*   }   J   K   L   M   N   O   P   Q   R   ¹   û   ü   ù   ú   ÿ  D*

E*   \   ÷   S   T   U   V   W   X   Y   Z   ²   Ô   Ö   Ò   Ó   Õ  E*

F*   0   1   2   3   4   5   6   7   8   9   ³   Û   Ü   Ù   Ú   Ÿ  F*


EA  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  EA


0_  00  01  02  03  9C  09  86  7F  97  8D  8E  0B  0C  0D  0E  0F  0_

1_  10  11  12  13  9D  85  08  87  18  19  92  8F  1C  1D  1E  1F  1_

2_  A4  81  82  83  84  0A  17  1B  88  89  8A  8B  8C  05  06  07  2_

3_  90  91  16  93  94  95  96  04  98  99  9A  9B  14  15  9E  1A  3_

4_  20  A0  E2  E4  E0  E1  E3  E5  E7  F1  A2  2E  3C  28  2B  7C  4_

5_  26  E9  EA  EB  E8  ED  EE  EF  EC  DF  21  24  2A  29  3B  AC  5_

6_  2D  2F  C2  C4  C0  C1  C3  C5  C7  D1  A6  2C  25  5F  3E  3F  6_

7_  F8  C9  CA  CB  C8  CD  CE  CF  CC  60  3A  23  40  27  3D  22  7_

8_  D8  61  62  63  64  65  66  67  68  69  AB  BB  F0  FD  FE  B1  8_

9_  B0  6A  6B  6C  6D  6E  6F  70  71  72  AA  BA  E6  B8  C6  80  9_

A_  B5  7E  73  74  75  76  77  78  79  7A  A1  BF  D0  DD  DE  AE  A_

B_  5E  A3  A5  B7  A9  A7  B6  BC  BD  BE  5B  5D  AF  A8  B4  D7  B_

C_  7B  41  42  43  44  45  46  47  48  49  AD  F4  F6  F2  F3  F5  C_

D_  7D  4A  4B  4C  4D  4E  4F  50  51  52  B9  FB  FC  F9  FA  FF  D_

E_  5C  F7  53  54  55  56  57  58  59  5A  B2  D4  D6  D2  D3  D5  E_

F_  30  31  32  33  34  35  36  37  38  39  B3  DB  DC  D9  DA  9F  F_


//  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  //
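As an illustration of how the hexadecimal half of this example could be read by a program, the Python sketch below parses the rows labeled 0_ through F_ and checks the Round-Trip property.  Everything about the layout is inferred from the example above rather than from a format specification: rows are assumed to carry 16 two-digit hex values between matching row labels, with the row label as the high digit and the column as the low digit.  The function name parse_hex_rows is hypothetical.

```python
import re

ROW_LABEL = re.compile(r"^[0-9A-F]_$")


def parse_hex_rows(text: str) -> list[int]:
    """Read rows like '0_  00  01 ... 0F  0_' into a 256-entry table."""
    table = [-1] * 256
    for line in text.splitlines():
        tokens = line.split()
        # Assumed row shape: label, 16 hex values, the same label again.
        if len(tokens) != 18 or not ROW_LABEL.match(tokens[0]):
            continue
        if tokens[-1] != tokens[0]:
            continue
        high = int(tokens[0][0], 16)
        for low, token in enumerate(tokens[1:17]):
            table[high * 16 + low] = int(token, 16)
    if sorted(table) != list(range(256)):
        raise ValueError("table is not a one-to-one mapping of all 256 values")
    return table
```

Under these assumptions, entry X'C1' of the parsed example table would be X'41', matching the letter A in both character sets.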



Using the EOL settings AUTO and AUTONL


The End of Line profile options AUTO and AUTONL allow for automatic detection of line terminations, possibly containing inconsistent and spurious line terminators, so that files edited across different systems, mainframe SYSOUT files, and other inconsistently-terminated text files can be opened, viewed and edited in a reasonable way.  EOL AUTO/AUTONL may be applied to non-mainframe files as well, to handle situations where a file's line termination is inconsistent for some reason.  A possible cause of this is a file shared between Windows and Unix on a network and edited with different editors that apply different line endings.  Line terminations under EOL AUTO/AUTONL are handled as follows:











PAGE Profile Support


An extension to the AUTO / AUTONL support is the PAGE Profile option.  When selected (ON) and an AUTO / AUTONL file is processed, the screen display will, when fewer lines exist on a page than the screen height, leave the bottom of the screen page blank rather than display the beginning lines of the next page.   This presents a more normal 'print page' format for viewing.  


If only UP PAGE and DOWN PAGE commands are used to scroll, this 'page mode' will be retained.  Scrolling via the mouse-wheel, or via the arrow keys, will suspend PAGE mode till the next time an UP PAGE or DOWN PAGE command is used.


Handling files with lone CR characters


A “lone CR” character – that is, a CR not followed by LF, FF or another CR – is sometimes produced by older software that attempted to overprint the data in order to simulate underscores or bold print.  It may also exist in non-Windows text files; older versions of Macintosh and some lesser-known systems used CR as a line termination.  Because of this, a lone CR character might be used for two different, conflicting purposes.  To handle this, you may choose between EOL AUTO and EOL AUTONL.  These two options work as follows:


For all but the “lone CR” situation, EOL AUTO and EOL AUTONL work identically.


When EOL is set to AUTONL and SPFLite detects a lone CR in a file, it is considered to be a “new line” and is treated as if the lone CR were actually a normal CR/LF line termination.
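The AUTONL treatment of a lone CR can be pictured with a small Python sketch that rewrites each lone CR as a CR/LF pair.  It follows only the definition given above (a CR not followed by LF, FF or another CR) and leaves every other sequence untouched; it is not SPFLite's implementation, and the function name lone_cr_to_crlf is hypothetical.

```python
import re

# A CR that is not followed by LF, FF or another CR.
LONE_CR = re.compile(rb"\r(?![\n\x0c\r])")


def lone_cr_to_crlf(data: bytes) -> bytes:
    """Replace each lone CR with a normal CR/LF line termination."""
    return LONE_CR.sub(b"\r\n", data)
```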


When EOL is set to AUTO and SPFLite detects a lone CR in a file, it is considered to be an overprint request.  At this point, SPFLite will buffer the lines involved in the overprint request until a ‘normal’ line terminator is found, and then it attempts to simulate an overprint.  This means that, on a column-by-column basis, it examines the characters that are attempting to ‘occupy’ the same column at the same time.  For each two characters involved in this way:





