Contents of Article


Introduction

Common settings for non-Windows files

Determining file attributes through experimentation

EBCDIC files

Custom Translation Tables

Using the EOL settings AUTO and AUTONL

Handling files with lone CR characters


Introduction


We are normally unconcerned with the file format of common text files or source programs, since most Windows functions and tools naturally work with this format, which is simple variable-length text. That format is often compatible with other, non-Windows systems, but not always.


Occasionally you will attempt to open a file from another system, only to get a disorganized or unreadable screen display - so you know something is wrong with the data's content or format, or with SPFLite's understanding of the file. The problem could be one of several factors:


Record Format

There are five record formats used to organize how multiple text lines are stored in a file. The format is held in the Profile and is set with the RECFM command. They are:


U - Undefined

Undefined is the norm for Windows text files; the individual text lines may be any length and are separated by unique delimiter characters. See Line Delimiters below.


F - Fixed

Fixed indicates that all text lines are the same length, known as the Logical Record Length. See Logical Record Length below. Note that Fixed records may be stored with or without Line Delimiters as well.


V - Variable

Variable indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word.


VBI - Variable Big-Endian

Variable Big-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word. VBI specifies an RDW of 4 bytes in big-endian format containing the length of the data record not including the RDW itself.


VLI - Variable Little-Endian

Variable Little-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word. VLI specifies an RDW of 4 bytes in little-endian format containing the length of the data record not including the RDW itself.
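
The VBI and VLI layouts described above can be illustrated with a short Python sketch. It is not SPFLite code; the function name read_rdw_records and its parameters are hypothetical, and it only assumes what is stated above: a 4-byte RDW holding the data length, not counting the RDW itself. It does not cover plain RECFM V, whose RDW layout is not described here.

```python
import struct

def read_rdw_records(data: bytes, byteorder: str = "big") -> list:
    """Split a byte string into records using 4-byte length prefixes.

    byteorder="big" corresponds to the VBI description,
    byteorder="little" to the VLI description.
    """
    fmt = ">I" if byteorder == "big" else "<I"
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated RDW at offset %d" % pos)
        (length,) = struct.unpack_from(fmt, data, pos)
        pos += 4  # the length does not include the RDW itself
        if pos + length > len(data):
            raise ValueError("record at offset %d runs past end of data" % pos)
        records.append(data[pos:pos + length])
        pos += length
    return records
```

If a file read this way raises an error or yields implausibly long records, that suggests the wrong byte order (or the wrong RECFM) was assumed.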


Line Delimiters

Windows text files typically separate text lines from each other by using a pair of special characters, the CR (Carriage Return) and LF (Line Feed) characters, having the hex value X'0D0A'. This pair is normally referred to as CRLF for short.


Unix, Linux and newer Macintosh systems use just the LF character. A few other systems, including older Macintosh systems, use just the CR character, but that usage is rare.


Some systems even have their own unique 1- or 2-byte delimiter strings. You might also wish to temporarily define your own line delimiters. For example, it could be convenient to temporarily treat a comma as the end of the line when working with a CSV (Comma-Separated Values) file.


SPFLite allows you to process all these types (including files without delimiters) by setting the Profile option EOL to the requirements of your file.


There are also files which are not even consistent within themselves, seemingly mixing CRLF, LF, FF, and CR sequences in unpredictable ways. Examples of this are SYSOUT files from mainframe emulators such as Hercules. SPFLite addresses such files by providing the EOL types of AUTO and AUTONL, discussed later in this article.


Character Set

Even though the Line delimiter (EOL) may be correct, the data in a text file may be using some other character set encoding technique. If the data seems to be total gibberish, or perhaps has a few extraneous characters at the beginning that don't seem to belong, this could be the cause. SPFLite supports the following character sets via the Profile SOURCE variable.


Note regarding Unicode


The Unicode support in SPFLite is quite limited. SPFLite is not a true Unicode text editor, as all editing functions take place internally in native ANSI (Windows MS-1252) mode. The SPFLite support is provided to allow reading in and writing Unicode files which only contain characters that map to normal ANSI characters. This may sound very restrictive, but many Unicode files are stored in UTF format because web servers demand that format, not because there is a true requirement for the additional (non-ANSI) characters supported in Unicode. For this type of application, the SPFLite support is perfectly adequate.


If you truly need the ability to edit the full Unicode character set, then SPFLite will not handle your requirements.


The following are supported:


ANSI

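This is the default and is the normal Windows character set. ANSI is also known as MS-1252, which is a superset of ISO-8859-1. It matches the first 256 Unicode code points everywhere except the range X'80' through X'9F', where MS-1252 defines printable characters such as the Euro sign.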


UTF8

This format of encoding (one of the Unicode types) is the most 'economic', as the base ASCII characters (values 0 through 127) are stored as single 8-bit bytes, and only characters above 127 are encoded as longer multi-byte sequences.


UTF16

UTF16LE

Another Unicode format, this format of encoding uses 16 bits for each character. There are two types of UTF16, LE (Little Endian) and BE (Big Endian) which refer to the byte order of the encoded 16 bit value. The normal value for Windows based systems is UTF16/UTF16LE, since the Intel processors that run Windows are Little-Endian devices.


UTF16BE

And another Unicode format, this one uses 16 bits for each character. The encoded 16 bit value is stored in Big-Endian format. UTF16BE is used on IBM mainframe systems like z/OS when they process Unicode data as 16-bit values.


EBCDIC

This is the standard 8-bit encoding used by IBM mainframe systems. Some further information on EBCDIC is given later in this article, under EBCDIC files.
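
The "few extraneous characters at the beginning" mentioned above are often a Unicode byte order mark (BOM). As a rough aid, the following Python sketch inspects the first bytes of a file and suggests a SOURCE value. The function name guess_source_from_bom is hypothetical, and the mapping from BOM to SOURCE value is an assumption for illustration, not a description of how SPFLite itself decides. Files without a BOM return None and still need to be judged by eye.

```python
import codecs

# BOM byte sequences from the Python standard library, paired with the
# SOURCE values named in this article.
BOMS = [
    (codecs.BOM_UTF8, "UTF8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF16BE"),  # FE FF
]

def guess_source_from_bom(head: bytes):
    """Return a suggested SOURCE value, or None when no BOM is present."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

For example, guess_source_from_bom(open("PROBLEM.TRYIT", "rb").read(4)) examines only the first four bytes of the file.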


Logical Record Length

When the Record Format is set to F (for fixed), the length of the fixed record must be specified using the LRECL command. The LRECL is set to 0 (zero) for conventional Windows variable-length records that are terminated by a CRLF pair.


If you need to enforce a minimum logical record length that is greater than zero, see Managing Line Lengths and MINLEN - Set Minimum Record Length for more information.



Common settings for non-Windows files


What should you specify when something is obviously not right?  It is best to check with the provider of the file, or determine the type of system on which it was created.


If it was a Unix/Linux system or newer Macintosh, then try the following:

RECFM U,  EOL LF,  SOURCE ANSI,  LRECL 0


If it was an IBM mainframe, then try either:

RECFM V,  EOL NONE,  SOURCE EBCDIC,  LRECL 0

 or

RECFM F,  EOL NONE,  SOURCE EBCDIC,  LRECL nn

where nn is a record length based on what the file usage appears to be.


If it is a web document such as HTML, then try:

RECFM U,  EOL CRLF,  SOURCE UTF8,  LRECL 0
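
For the fixed-length mainframe case above (RECFM F, EOL NONE, SOURCE EBCDIC, LRECL nn), a candidate record length can be checked outside SPFLite with a small Python sketch. The function name split_fixed_records is hypothetical. The sketch uses Python's built-in cp1140 codec, which decodes to Unicode rather than to Windows 1252, so control characters may not match SPFLite's own translation table exactly; printable text should still be readable.

```python
def split_fixed_records(data: bytes, lrecl: int, encoding: str = "cp1140") -> list:
    """Split undelimited data into fixed-length records and decode each one."""
    if lrecl <= 0:
        raise ValueError("lrecl must be positive")
    if len(data) % lrecl != 0:
        # A leftover partial record usually means the guessed LRECL is wrong.
        raise ValueError(
            "file size %d is not a multiple of LRECL %d" % (len(data), lrecl)
        )
    return [
        data[start:start + lrecl].decode(encoding)
        for start in range(0, len(data), lrecl)
    ]
```

A file size that is an exact multiple of a common length such as 80 is a good first hint for nn.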

 

Determining file attributes through experimentation


If you lack exact information about the file's format, it may become a matter of trial and error to resolve. Sometimes it will require loading the file as a simple normal text file and carefully examining the data. It is sometimes helpful here to set SPFLite to use a good ANSI font such as Raster (included in the optional Font Package on the web site), since it will correctly show line delimiter control characters like CR and LF in a visible format, making it easier to work out what you're looking at.


For very difficult file issues, you may need to exit SPFLite and examine the data with a hex editor. Several free versions are available on the Internet. One such hex editor may be found at http://www.catch22.net/
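
As a lighter alternative to a hex editor, a few lines of Python can count which delimiter bytes a file actually contains. This is only a sketch: the function name count_delimiters is hypothetical, and it assumes ASCII/ANSI byte values (CR = X'0D', LF = X'0A', FF = X'0C'), so it is not meaningful for EBCDIC data.

```python
def count_delimiters(data: bytes) -> dict:
    """Count CRLF pairs, lone LF, lone CR and FF bytes in raw file data."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
        "FF": data.count(b"\x0c"),
    }
```

A file showing only LF suggests EOL LF; a mixture of several kinds suggests trying EOL AUTO or AUTONL; all zeros suggests a file without delimiters.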


When experimenting with these parameters to find the correct settings, it is sometimes helpful to rename the problem file to a unique file type (e.g. PROBLEM.TRYIT) so that you don't interfere with the options for a currently valid file profile.


Be sure to issue a PROFILE UNLOCK on this test profile. Then you can simply EDIT the file, examine it, alter one or more of the Profile variables you are experimenting with, CANCEL out and EDIT it again to try the new settings. You can repeat this in a trial-and-error approach, or see if the originator of the file has documentation on the file's format.


Note: Once you 'get it right', remember that SPFLite can now, using the same Profile, write files in this format as well. For example, say you have a profile for a file type of EBC (for EBCDIC). You can 'convert' a normal Windows text file to this format by editing the Windows text file, entering PROFILE USING EBC, and then using CREATE/REPLACE to write the new file. Then CANCEL out of the edit session. Make sure the original file's Profile is LOCKED before doing this, to prevent the USING from becoming permanent.


EBCDIC files


SPFLite internally handles all data as Ansi characters. This is equivalent to the Windows 1252 character set, which matches the first 256 characters of Unicode except in the range X'80' through X'9F', where Windows 1252 defines additional printable characters.


Note: IBM has its own ideas about code pages. What Microsoft calls Windows 1252 isn't as simple to IBM. The reason is that 1252 has changed over the years, most recently to add the Euro character. Even though the exact characters present in this code page have changed, Microsoft still calls it 1252. However, IBM considers "1252" to mean the Windows code page 1252 of the past, prior to the advent of the Euro and some other changes Microsoft made. IBM calls the Windows 1252 code page of today 5348. They take this position because they have to support things like DB2 databases, where database administrators have to know precisely what data is, and is not, present in CHARACTER database fields, and the old and new 1252 are just not the same thing. For the record, SPFLite's idea of 1252 is the same as IBM's idea of 5348. That is, we support the current Windows 1252 code page definition.


You can also edit EBCDIC files, by setting up a PROFILE for a given file type (that is, a file name extension) that has SOURCE EBCDIC associated with it. Only Ansi characters are displayed on the edit screen.


In order for SPFLite to handle EBCDIC data, it must translate it from EBCDIC to Ansi while editing, and from Ansi back to EBCDIC for storing externally. To do that, a translation table is required. SPFLite has already defined such a table, and normally there is nothing you need to do, but the following just explains the process.


Presently, only one translation table is supplied with SPFLite, which converts between Windows 1252 (Ansi) and IBM EBCDIC code page 1140. Code page 1140 is a modern code set, consisting of the earlier code page 037 plus the Euro character. Page 1140 is commonly used in North America for nearly all IBM z/OS mainframe installations. The particular tables used by SPFLite are two-way lossless tables that are based on published IBM code-table documentation. Any otherwise unallocated characters have a unique one-to-one translation, so that no data will be lost or mistranslated while editing the Ansi version of your EBCDIC data, even for “binary” data. (Even the "unallocated characters" have lossless translations based on IBM specifications; we did not "make up" any rules for this just for SPFLite.)  If for any reason this table is not suitable for your use (for example, if you need EBCDIC national characters outside the North American and European characters in Code Page 1140), it is possible to provide your own table. See Custom Translation Tables below.


This default table performs a translation which is identical to the translation performed in the Hercules mainframe emulator when using a configuration file parameter line of CODEPAGE 1252/1140.


SPFLite does not dictate how lines are terminated in EBCDIC files. You decide how you want this to be handled. All the existing CR/LF combinations can be used with EBCDIC, and their EBCDIC equivalents are used to terminate lines. SPFLite also supports the EBCDIC New Line character NL = X'15' as an EOL value. Additionally, you may use the non-standard file formats noted below. Users of the Hercules mainframe emulator who need to edit EBCDIC files may find cases where fixed-length files and EOL NONE are required.


There is nothing to prevent you from specifying EOL NL in an ANSI file. However, the ANSI equivalent of EBCDIC X'15' is X'85', which is not a standard ANSI text delimiter, and so NL may be of limited usefulness outside of EBCDIC files.


Custom Translation Tables


Note: Translation tables.

    • File extension is .SOURCE
    • The default file name for EBCDIC is EBCDIC.SOURCE
    • Other translation tables can exist at the same time alongside EBCDIC.SOURCE
    • Translation tables need not translate between ANSI and EBCDIC. They may be used to translate between different variants of ASCII code pages, such as between ASCII 437 or ASCII 850 and ANSI 1252.
    • The new .SOURCE format of the translation tables differs considerably from the format used in earlier versions


The new table format has a number of benefits over the format used in predecessor versions. One of them is that when a translation table is in Round-Trip/Lossless mode, only one 'side' of the translation table is needed, since the two 'halves' are mirror images of each other anyway. When SPFLite knows you have such a table, it validates that the values in the table are consistent with that. That means each of the 256 possible character values must appear once and only once for the table to be valid.
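
The "once and only once" rule above is the same as saying the table is a permutation of the values 0 through 255, which is also why the other 'half' can be derived from it. A minimal Python sketch of both ideas follows; the function names are hypothetical and this is not SPFLite's own validation code.

```python
def is_round_trip_table(table) -> bool:
    """True when the table has 256 entries and each value 0..255 appears exactly once."""
    return len(table) == 256 and sorted(table) == list(range(256))

def invert_table(table) -> list:
    """Derive the reverse translation from one side of a round-trip table."""
    if not is_round_trip_table(table):
        raise ValueError("table is not a lossless one-to-one mapping")
    inverse = [0] * 256
    for source_value, target_value in enumerate(table):
        inverse[target_value] = source_value
    return inverse
```

Translating any byte through a valid table and then through its inverse returns the original byte, which is what makes editing lossless.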


Independent of SPFLite, software is being written to directly generate SPFLite translation tables from IBM-provided code page data, known as UCM files. This software is in development and will be made available when ready. Once this new software is ready, a full discussion of all these issues will be documented.


Meantime, to get an idea about the new .SOURCE format, here is an example of what the TxtToSource.exe conversion program does to the old EBCDIC.TXT file, showing the key features of the new format:


TT  TITLE='SPFLITE TRANSLATION TABLE'  MODE=RT

TT  GENDATE='2013-11-29 14:36:46'


**  SOURCE file was created from conversion of 'EBCDIC.TXT'


**  AE comment:  ASCII 1252 => EBCDIC 1140

**  EA comment:  EBCDIC 1140 => ASCII 1252


**  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  EA


0*  00  01  02      œ  09   †  7F  97  8D  8E  0B  0C  0D        0*

1*                      …  08   ‡           ’                     1*

2*   ¤  81   ‚   ƒ   „  0A          ˆ   ‰   Š   ‹   Œ  05  06  07  2*

3*  90   ‘      “   ”   •   –      ˜    ™   š   ›  14  15   ž  1A  3*

4*      A0   â   ä   à   á   ã   å   ç   ñ   ¢   .   <   (   +   |  4*

5*   &   é   ê   ë   è   í   î   ï   ì   ß   !   $   *   )   ;   ¬  5*

6*   -   /   Â   Ä   À   Á   Ã   Å   Ç   Ñ   ¦   ,   %   _   >   ?  6*

7*   ø   É   Ê   Ë   È   Í   Î   Ï   Ì   `   :   #   @   '   =   "  7*

8*   Ø   a   b   c   d   e   f   g   h   i   «   »   ð   ý   þ   ±  8*

9*   °   j   k   l   m   n   o   p   q   r   ª   º   æ   ¸   Æ   €  9*

A*   µ   ~   s   t   u   v   w   x   y   z   ¡   ¿   Ð   Ý   Þ   ®  A*

B*   ^   £   ¥   ·   ©   §   ¶   ¼   ½   ¾   [   ]   ¯   ¨   ´   ×  B*

C*   {   A   B   C   D   E   F   G   H   I   ­    ô   ö   ò   ó   õ  C*

D*   }   J   K   L   M   N   O   P   Q   R   ¹   û   ü   ù   ú   ÿ  D*

E*   \   ÷   S   T   U   V   W   X   Y   Z   ²   Ô   Ö   Ò   Ó   Õ  E*

F*   0   1   2   3   4   5   6   7   8   9   ³   Û   Ü   Ù   Ú   Ÿ  F*


EA  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  EA


0_  00  01  02  03  9C  09  86  7F  97  8D  8E  0B  0C  0D  0E  0F  0_

1_  10  11  12  13  9D  85  08  87  18  19  92  8F  1C  1D  1E  1F  1_

2_  A4  81  82  83  84  0A  17  1B  88  89  8A  8B  8C  05  06  07  2_

3_  90  91  16  93  94  95  96  04  98  99  9A  9B  14  15  9E  1A  3_

4_  20  A0  E2  E4  E0  E1  E3  E5  E7  F1  A2  2E  3C  28  2B  7C  4_

5_  26  E9  EA  EB  E8  ED  EE  EF  EC  DF  21  24  2A  29  3B  AC  5_

6_  2D  2F  C2  C4  C0  C1  C3  C5  C7  D1  A6  2C  25  5F  3E  3F  6_

7_  F8  C9  CA  CB  C8  CD  CE  CF  CC  60  3A  23  40  27  3D  22  7_

8_  D8  61  62  63  64  65  66  67  68  69  AB  BB  F0  FD  FE  B1  8_

9_  B0  6A  6B  6C  6D  6E  6F  70  71  72  AA  BA  E6  B8  C6  80  9_

A_  B5  7E  73  74  75  76  77  78  79  7A  A1  BF  D0  DD  DE  AE  A_

B_  5E  A3  A5  B7  A9  A7  B6  BC  BD  BE  5B  5D  AF  A8  B4  D7  B_

C_  7B  41  42  43  44  45  46  47  48  49  AD  F4  F6  F2  F3  F5  C_

D_  7D  4A  4B  4C  4D  4E  4F  50  51  52  B9  FB  FC  F9  FA  FF  D_

E_  5C  F7  53  54  55  56  57  58  59  5A  B2  D4  D6  D2  D3  D5  E_

F_  30  31  32  33  34  35  36  37  38  39  B3  DB  DC  D9  DA  9F  F_


//  _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F  //



Using the EOL settings AUTO and AUTONL


The End of Line profile options AUTO and AUTONL allow for automatic detection of line terminations, even when a file contains inconsistent or spurious line terminators, so that files edited across different systems, mainframe SYSOUT files, and other inconsistently-terminated text files can be opened, viewed and edited in a reasonable way. EOL AUTO/AUTONL may be applied to non-mainframe files as well, to handle situations where a file's line termination is inconsistent for some reason. A possible cause of this is a file shared between Windows and Unix on a network and edited with different editors that apply different line endings. Line terminations under EOL AUTO/AUTONL are handled as follows:


    • FF (form feed) characters delimit lines, and cause a =PAGE> marker to be placed in the sequence area. Notes:


      • Scrolling commands PAGE UP and PAGE DOWN will locate these marked lines.


      • Because PAGE UP and PAGE DOWN move to these =PAGE> marker lines, the number of lines scrolled varies. To regain the 'full screen motion' that PAGE UP and PAGE DOWN provide in other files, use a scroll amount of HALF or DATA, or enter a numeric value for a specific number of lines. For most users who would otherwise use PAGE UP and PAGE DOWN, scrolling UP/DOWN by the DATA scroll amount should work well.


      • When you have a file that shows this =PAGE> marker on a line, and you PRINT this file, you will have a Form Feed sent to the printer for every line containing the =PAGE> marker.


    • A lone LF (line feed) is treated as a line delimiter equivalent to CR,LF.


    • “Spurious” CR characters that seemingly don't belong there, such as CR,CR,LF are ignored. For example, in the sequence CR,CR,LF, the first CR is spurious; the remaining CR,LF pair is a normal line termination.


    • A CR,FF or CR,LF,FF sequence is considered as the end of one line, followed by a page separator line.


    • A hex value of X'1A' at the end of the file is ignored.



PAGE Profile Support


An extension to the AUTO / AUTONL support is the PAGE Profile option. When selected (ON) and an AUTO / AUTONL file is processed, the screen display will, when fewer lines exist on a page than the screen height, leave the bottom of the screen page blank rather than display the beginning lines of the next page. This presents a more normal 'print page' format for viewing.


If only UP PAGE and DOWN PAGE commands are used to scroll, this 'page mode' will be retained. Scrolling via the mouse-wheel, or via the arrow keys, will suspend PAGE mode till the next time an UP PAGE or DOWN PAGE command is used.


Handling files with lone CR characters


A “lone CR” character – that is, a CR not followed by LF, FF or another CR – is sometimes produced by older software that attempted to overprint the data in order to simulate underscores or bold print. It may also exist in non-Windows text files; older versions of Macintosh and some lesser-known systems used CR as a line termination. Because of this, a lone CR character might be used for two different, conflicting purposes. To handle this, you may choose between EOL AUTO and EOL AUTONL. These two options work as follows:


For all but the “lone CR” situation, EOL AUTO and EOL AUTONL work identically.


When EOL is set to AUTONL and SPFLite detects a lone CR in a file, it is considered to be a “new line” and is treated as if the lone CR were actually a normal CR/LF line termination.


When EOL is set to AUTO and SPFLite detects a lone CR in a file, it is considered to be an overprint request. At this point, SPFLite will buffer the lines involved in the overprint request until a ‘normal' line terminator is found, and then it attempts to simulate an overprint. This means that, on a column-by-column basis, it examines the characters that are attempting to ‘occupy' the same column at the same time. For each two characters involved in this way:


    • When a blank and a non-blank character are in the same column, the non-blank character ‘wins out'.


    • When two non-blank characters are in the same column, and they are identical, it is recognized as a “bold font” type of overprint, and is not a problem; the non-blank character is retained.


    • When two non-blank characters are in the same column, but one of the non-blank characters is an underscore, the non-underscore character ‘wins out'. That way, the underscore character will never ‘obliterate' the meaningful data.


    • When two non-blank characters are in the same column, and they are not identical, and neither is an underscore, an “overprint clash” has occurred. When this happens, the first such character is retained and any others are discarded. Where SPFLite detects this, it will issue a warning message. If you frequently see this warning, it suggests the file having lone CR characters was not written to overprint, but either has foreign line delimiters or is perhaps ‘damaged' in some way. The way to address this is to change the setting from EOL AUTO to EOL AUTONL.
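
The column-by-column rules above can be sketched in Python as follows. This illustrates the described behavior only; it is not SPFLite's implementation, and the function name merge_overprint is hypothetical. The input is the list of text segments that were separated by lone CR characters within one line.

```python
def merge_overprint(segments) -> tuple:
    """Merge overprinted segments; return (merged_line, clash_count)."""
    width = max((len(segment) for segment in segments), default=0)
    merged = [" "] * width
    clashes = 0
    for segment in segments:
        for col, ch in enumerate(segment):
            current = merged[col]
            if ch == " " or ch == current:
                continue  # a blank never wins; identical characters are "bold"
            if current == " " or current == "_":
                merged[col] = ch  # non-blank beats blank; data beats underscore
            elif ch == "_":
                continue  # an underscore never obliterates data
            else:
                clashes += 1  # overprint clash: the first character is kept
    return "".join(merged), clashes
```

For example, merge_overprint("Total\r_____".split("\r")) keeps the word and drops the underscores, with no clashes. A high clash count on real data is the same signal as the warning message described above, and suggests trying EOL AUTONL instead.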

