Non-Windows Text Files

Contents of Article

Determining file attributes through experimentation

Using the End-of-Line settings AUTO and AUTONL

Introduction

We are normally unconcerned with the file format of common text files or source programs, since most Windows functions and tools naturally work with this format, which is simple variable-length text. That format is often compatible with other, non-Windows systems, but not always.

Occasionally you will attempt to open a file from another system, only to get a disorganized or unreadable screen display - so you know something is wrong with the data's content or format, or with SPFLite's understanding of the file. The problem could be one of several factors:

Record Format

There are three basic formats (U, F and V) used to organize how multiple text lines are stored in a file, and the V format has two additional variants (VBI and VLI). The format is a Profile setting, specified using the DCB RECFM command. They are:

U - Undefined	Undefined is the norm for Windows text files; the individual text lines may be any length and are separated by unique delimiter characters. See Line Delimiters below.
F - Fixed	Fixed indicates that all text lines are the same length, known as the Logical Record Length. See Logical Record Length below. Note that Fixed records may be stored with or without Line Delimiters as well.
V - Variable	Variable indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word.
VBI - Variable Big-Endian	Variable Big-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word. VBI specifies an RDW of 4 bytes in big-endian format containing the length of the data record not including the RDW itself.
VLI - Variable Little-Endian	Variable Little-Endian indicates the text lines may be of any length and are not separated by unique delimiters, but are written with a prefix field for each line which indicates the length of that particular line. The prefix field is known as an RDW for Record Descriptor Word. VLI specifies an RDW of 4 bytes in little-endian format containing the length of the data record not including the RDW itself.
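The VBI and VLI layouts described above can be illustrated with a short Python sketch. It is not SPFLite code: the function name `split_rdw_records` and its `byteorder` parameter are hypothetical, and it assumes only what is stated above, namely a 4-byte RDW holding the data length, not counting the RDW itself.

```python
def split_rdw_records(data: bytes, byteorder: str = "big") -> list[bytes]:
    """Split a byte string into records using 4-byte length prefixes.

    byteorder="big" corresponds to the VBI description,
    byteorder="little" to the VLI description.
    """
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError(f"truncated RDW at offset {pos}")
        # The RDW holds the length of the data only, not of the RDW itself.
        length = int.from_bytes(data[pos:pos + 4], byteorder)
        pos += 4
        if pos + length > len(data):
            raise ValueError(f"truncated record at offset {pos}")
        records.append(data[pos:pos + length])
        pos += length
    return records


vbi = b"\x00\x00\x00\x03abc\x00\x00\x00\x02de"
vli = b"\x03\x00\x00\x00abc\x02\x00\x00\x00de"
print(split_rdw_records(vbi, "big"))
print(split_rdw_records(vli, "little"))
```

Both calls yield the same two records, which shows that only the byte order of the prefix differs between the two variants.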

Line Delimiters

Windows text files typically separate text lines from each other by using a pair of special characters, the CR (Carriage Return) and LF (Line Feed) characters, having the hex value X'0D0A'. This pair is normally referred to as CRLF for short.

Unix, Linux and newer Macintosh systems use just the LF character. A few other systems, including older Macintosh systems, use just the CR character, but that usage is rare.

Some systems even have their own 1 or 2 byte unique delimiter strings. You might also wish to temporarily define your own line delimiters. For example, it could be convenient to temporarily treat a comma as the end of the line if you had a type of file called a CSV, or Comma-Separated-Value.

SPFLite allows you to process all these types (including files without delimiters) by setting the Profile option DCB EOL to the requirements of your file.

There are also files which are not even consistent within themselves, mixing CRLF, LF, FF, and CR in seemingly arbitrary ways. SYSOUT files from mainframe emulators such as Hercules are one example. SPFLite addresses such files by providing the DCB EOL type of AUTONL, discussed later in this article.
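When it is unclear which delimiters a file uses, counting them outside the editor can help. The following Python sketch is only an illustration; the function name `count_terminators` is hypothetical, and it assumes a single-byte ASCII-compatible file (it would give misleading counts for UTF16 or EBCDIC data).

```python
import re
from collections import Counter

# Map each delimiter byte sequence to a readable name.
_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR", b"\x0c": "FF"}


def count_terminators(data: bytes) -> Counter:
    """Count CRLF pairs, lone LF, lone CR and FF characters in the data."""
    # CRLF is listed first so a CR,LF pair is counted once, not as CR plus LF.
    found = re.findall(rb"\r\n|\n|\r|\x0c", data)
    return Counter(_NAMES[item] for item in found)


sample = b"first\r\nsecond\nthird\rfourth\r\n"
print(count_terminators(sample))
```

A file that reports only one kind of delimiter is a candidate for the matching DCB EOL value; a mixture suggests trying AUTO or AUTONL.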

Character Set

Even though the Line delimiter in DCB EOL may be correct, the data in a text file may be using some other character set encoding technique. If the data seems to be total gibberish, or perhaps has a few extraneous characters at the beginning that don't seem to belong, this could be the cause. SPFLite supports the following character sets via the Profile SOURCE variable.

Note regarding Unicode

The Unicode support in SPFLite is quite limited. SPFLite is not a true Unicode text editor, as all editing functions take place internally in native ANSI (Windows MS-1252) mode. The SPFLite support is provided to allow reading in and writing Unicode files which only contain characters that map to normal ANSI characters. This may sound very restrictive, but many Unicode files are stored in UTF format because web servers demand that format, not because there is a true requirement for the additional (non-ANSI) characters supported in Unicode. For this type of application, the SPFLite support is perfectly adequate.

If you truly need the ability to edit the full Unicode character set, then SPFLite will not handle your requirements.

The following are supported:

ANSI

This is the default and is the normal Windows character set. ANSI is also known as MS-1252 (Windows-1252). It is a superset of the printable characters of ISO-8859-1, which in turn corresponds to the first 256 code points of Unicode.

UTF8

This encoding (one of the Unicode types) is the most economical, as the basic characters (values 0 to 127) are stored as single 8-bit bytes, and only characters above 127 are encoded as longer multi-byte sequences.

UTF16

UTF16LE

Another Unicode format, this format of encoding uses 16 bits for each character. There are two types of UTF16, LE (Little Endian) and BE (Big Endian) which refer to the byte order of the encoded 16 bit value. The normal value for Windows based systems is UTF16/UTF16LE, since the Intel processors that run Windows are Little-Endian devices.

UTF16BE

And another Unicode format, this one uses 16 bits for each character. The encoded 16 bit value is stored in Big-Endian format. UTF16BE is used on IBM mainframe systems like z/OS when they process Unicode data as 16-bit values.
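The difference between these Unicode encodings is easiest to see in the stored bytes. This Python sketch uses Python's own codec names (`utf-8`, `utf-16-le`, `utf-16-be`), which are assumed here to correspond to the SPFLite SOURCE values UTF8, UTF16LE and UTF16BE; it encodes the letter A followed by é.

```python
# "A" is U+0041 and "\u00e9" (e with acute accent) is U+00E9.
text = "A\u00e9"

samples = {enc: text.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-16-be")}

for enc, raw in samples.items():
    print(f"{enc:10} {raw.hex(' ').upper()}")
# utf-8      41 C3 A9
# utf-16-le  41 00 E9 00
# utf-16-be  00 41 00 E9
```

In UTF-8 the plain letter takes one byte and the accented letter two, while both UTF-16 forms use two bytes per character and differ only in byte order.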

EBCDIC

This is the standard 8-bit encoding used by IBM mainframe systems. Some further information on EBCDIC is at the end of this article.

UTF BOM Markers

To identify the various types of UTF files, the true data is optionally prefixed with a 2 or 3 byte BOM (Byte Order Mark). This identifies the particular type of UTF coding used.

However some, like UTF8, do not need a BOM, and in fact, sometimes the presence of a BOM can cause problems. SPFLite provides a Profile option named BOM (why not?) which can be set to ON or OFF. This controls whether you wish SPFLite to create the BOM when writing the file. See BOM.
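To check whether a file starts with a BOM, and which one, a few lines of Python are enough. This is a sketch and not part of SPFLite: the function name `detect_bom` is hypothetical, and the returned labels simply reuse the SOURCE names from this article. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# (label, BOM bytes): UTF-8 uses a 3-byte BOM, UTF-16 a 2-byte BOM.
_BOMS = (
    ("UTF8", codecs.BOM_UTF8),         # EF BB BF
    ("UTF16LE", codecs.BOM_UTF16_LE),  # FF FE
    ("UTF16BE", codecs.BOM_UTF16_BE),  # FE FF
)


def detect_bom(data: bytes):
    """Return (label, bom_length), or (None, 0) when no BOM is present."""
    for label, bom in _BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom(b"\xff\xfeh\x00i\x00"))
print(detect_bom(b"plain text"))
```

A result of (None, 0) does not prove the file is ANSI, since UTF8 files are often written without a BOM.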

Logical Record Length

When the Record Format is set to F (for fixed), the length of the fixed record must be specified using the DCB LRECL command. The record-length is set to 0 (zero) for conventional Windows variable-length records that are terminated by a CRLF pair.

If you need to enforce a minimum logical record length that is greater than zero, see Managing Line Lengths and MINLEN - Set Minimum Record Length for more information.

Common settings for non-Windows files

What should you specify when something is obviously not right? It is best to check with the provider of the file, or determine the type of system on which it was created.

If it was a Unix/Linux system or newer Macintosh, then try the following:

DCB RECFM U LRECL 0 EOL LF and SOURCE ANSI

If it was an IBM mainframe, then try either:

DCB RECFM V LRECL 0 EOL NONE and SOURCE EBCDIC

DCB RECFM F LRECL nn EOL NONE and SOURCE EBCDIC

where nn is a record length based on what the file usage appears to be.
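To see what RECFM F with EOL NONE means in terms of the raw bytes, the following Python sketch cuts a buffer into fixed-length records and decodes each one. It is an illustration only: the function name `read_fixed_ebcdic` is hypothetical, and it uses Python's built-in `cp1140` codec, which is Python's own EBCDIC table and not SPFLite's translation table.

```python
def read_fixed_ebcdic(data: bytes, lrecl: int) -> list[str]:
    """Split data into records of lrecl bytes and decode each from EBCDIC."""
    if lrecl <= 0:
        raise ValueError("lrecl must be greater than zero")
    if len(data) % lrecl != 0:
        raise ValueError("data length is not a multiple of lrecl")
    return [
        data[start:start + lrecl].decode("cp1140")
        for start in range(0, len(data), lrecl)
    ]


# Two 8-byte records: "HELLO" and "WORLD", each padded with EBCDIC spaces (X'40').
raw = bytes.fromhex("C8C5D3D3D6404040" "E6D6D9D3C4404040")
print(read_fixed_ebcdic(raw, 8))
```

If the chosen record length is wrong, the decoded lines appear shifted against each other, which is a useful hint when guessing the value of nn.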

If it is a web document such as HTML, then try:

DCB RECFM U LRECL 0 EOL CRLF and SOURCE UTF8

Determining file attributes through experimentation

If you lack exact information about the file's format, it may become a matter of trial and error to resolve. Sometimes it will require loading the file as a simple normal text file and carefully examining the data. It is sometimes helpful here to set SPFLite to use a good ANSI font such as Raster (included in the optional Font Package on the web site), since it will show line delimiter control characters like CR and LF in a visible format, making it easier to work out what you're looking at.

For very difficult file issues, you may need to exit SPFLite and examine the data with a Hex Editor. Several free versions are available on the Internet. One such hex editor may be found at http://www.catch22.net/
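If no hex editor is at hand, a rough look at the first bytes of a file can also be had from a few lines of Python. This sketch is not a replacement for a real hex editor; the function name `hex_dump` is hypothetical, and bytes outside the printable ASCII range are shown as dots.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format data as offset, hex bytes and printable ASCII, one row per width bytes."""
    pad = width * 3 - 1  # width of the hex column for a full row
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        text = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        rows.append(f"{offset:08X}  {hex_part:<{pad}}  {text}")
    return "\n".join(rows)


print(hex_dump(b"Line 1\r\nLine 2\n"))
```

In the output, 0D 0A pairs indicate CRLF delimiters and lone 0A bytes indicate LF, while a file full of values such as C1 to C9 and 40 is likely EBCDIC.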

When experimenting with these parameters to find the correct settings, it is sometimes helpful to rename the problem file to a unique file type (e.g., PROBLEM.TRYIT) so that you don't interfere with the options for a currently valid file profile.

Be sure to issue a PROFILE UNLOCK on this test profile. Then you can simply EDIT the file, examine it, alter one or more of the Profile variables you are experimenting with, CANCEL out and EDIT it again to try the new settings. You can repeat this in a trial-and-error approach, or see if the originator of the file has documentation on the file's format.

EBCDIC files

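SPFLite internally handles all data as Ansi characters. This is equivalent to the Windows 1252 character set, which matches the first 256 characters of Unicode except in the range X'80' to X'9F', where 1252 assigns printable characters such as the Euro sign in place of control codes.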

Note: IBM has its own ideas about code pages. What Microsoft calls Windows 1252 isn't as simple to IBM. The reason is that 1252 has changed over the years, most recently to add the Euro character. Even though the exact characters present in this code page have changed, Microsoft still calls it 1252. However, IBM considers "1252" to mean a Windows code page 1252 of the past, prior to the advent of the Euro and some other changes they made. IBM considers the Windows 1252 code page of today to be called 5348. They take this position because they have to support things like DB/2 databases, where database administrators have to know precisely what data is, and is not, present in CHARACTER database fields, and the old vs. new 1252 are just not the same thing. For the record, SPFLite's idea of 1252 is the same as IBM's idea of 5348. That is, we support the current Windows 1252 code page definition of today.

You can also edit EBCDIC files, by setting up a PROFILE for a given file type (that is, a file name extension) that has SOURCE EBCDIC associated with it. Only Ansi characters are displayed on the edit screen.

In order for SPFLite to handle EBCDIC data, it must translate it from EBCDIC to Ansi while editing, and from Ansi back to EBCDIC for storing externally. To do that, a translation table is required. SPFLite has already defined such a table, and normally there is nothing you need to do, but the following just explains the process.

Presently, only one translation table is supplied with SPFLite, which converts between Windows 1252 (Ansi) and IBM EBCDIC code page 1140. Code page 1140 is a modern code set, consisting of the earlier code page 037 with the Euro character added. Page 1140 is commonly used in North America for nearly all IBM z/OS mainframe installations. The particular tables used by SPFLite are two-way lossless tables that are based on published IBM code-table documentation. Any otherwise unallocated characters have a unique one-to-one translation, so that no data will be lost or mistranslated while editing the Ansi version of your EBCDIC data, even for “binary” data. (Even the "unallocated characters" have lossless translations based on IBM specifications; we did not "make up" any rules for this just for SPFLite.) If for any reason this table is not suitable for your use (if you need EBCDIC national characters outside the North American and/or European characters in Code Page 1140) it is possible to provide your own table. See Custom Translation Tables below.

This default table performs a translation which is identical to the translation performed in the Hercules mainframe emulator when using a configuration file parameter line of CODEPAGE 1252/1140.

SPFLite does not dictate how lines are terminated in EBCDIC files. You decide how you want this to be handled. All the existing CR/LF combinations can be used with EBCDIC, and their EBCDIC equivalents are used to terminate lines. SPFLite also supports the EBCDIC New Line character NL = X'15' as an End-of-Line value. Additionally, you may use the non-standard file formats noted below. Users of the Hercules mainframe emulator who need to edit EBCDIC files may find cases where fixed-length files and End-of-Line NONE are required.

There is nothing to prevent you from specifying End-of-Line NL in an ANSI file. However, the ANSI equivalent of EBCDIC X'15' is X'85', which is not a standard ANSI text delimiter, and so NL may be of limited usefulness outside of EBCDIC files.

Custom Translation Tables

Notes on translation tables:

The file extension is .SOURCE
The default file name for EBCDIC is EBCDIC.SOURCE
More than one translation table can exist at the same time besides EBCDIC.SOURCE
Translation tables need not translate between ANSI and EBCDIC. They may be used to translate between different variants of ASCII code pages, such as between ASCII 437 or ASCII 850 and ANSI 1252.
The new format of the translation tables is very different from that of earlier versions

The new table format has a number of benefits over that of predecessor versions. One of them is that when a translation table is in Round-Trip/Lossless mode, only one 'side' of the translation table is needed, since the two 'halves' are mirror images of each other anyway. When SPFLite knows you have such a table, it validates that the values in the table are consistent with that. That means each of the 256 possible character values must appear in the table once and only once for the table to be valid.
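The "once and only once" rule is what makes one half of a round-trip table sufficient: the other half can be derived from it. The Python sketch below illustrates the idea. It is not SPFLite's validation code; the names `is_round_trip` and `invert` are hypothetical, and the toy table is made up for the example (an identity table with two values swapped), not taken from EBCDIC.SOURCE.

```python
def is_round_trip(table: bytes) -> bool:
    """True when the table has 256 entries and every byte value appears exactly once."""
    return len(table) == 256 and len(set(table)) == 256


def invert(table: bytes) -> bytes:
    """Derive the reverse table: if table[a] == b, then inverse[b] == a."""
    inverse = bytearray(256)
    for source, target in enumerate(table):
        inverse[target] = source
    return bytes(inverse)


# Toy table for illustration only: identity, except X'15' and X'85' are swapped.
toy = bytearray(range(256))
toy[0x15], toy[0x85] = 0x85, 0x15
toy = bytes(toy)

forward = toy
backward = invert(toy)

data = b"line one\x15line two"
translated = data.translate(forward)
print(is_round_trip(toy), translated.translate(backward) == data)
```

A table with a duplicated value fails the check, because two different input bytes would then translate to the same output and could not be told apart on the way back.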

Independent of SPFLite, software is being written to directly generate SPFLite translation tables from IBM-provided code page data, known as UCM files. This software is in development and will be made available when ready. Once this new software is ready, a full discussion of all these issues will be documented.

In the meantime, to give an idea of the new .SOURCE format, here is an example of what the TxtToSource.exe conversion program produces from the old EBCDIC.TXT file, showing the key features of the new format:

TT TITLE='SPFLITE TRANSLATION TABLE' MODE=RT

TT GENDATE='2013-11-29 14:36:46'

** SOURCE file was created from conversion of 'EBCDIC.TXT'

** AE comment: ASCII 1252 => EBCDIC 1140

** EA comment: EBCDIC 1140 => ASCII 1252

** _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F EA

0* 00 01 02 œ 09 † 7F 97 8D 8E 0B 0C 0D 0*

1* … 08 ‡ ’ 1*

2* ¤ 81 ‚ ƒ „ 0A ˆ ‰ Š ‹ Œ 05 06 07 2*

3* 90 ‘ “ ” • – ˜ ™ š › 14 15 ž 1A 3*

4* A0 â ä à á ã å ç ñ ¢ . < ( + | 4*

5* & é ê ë è í î ï ì ß ! $ * ) ; ¬ 5*

6* - / Â Ä À Á Ã Å Ç Ñ ¦ , % _ > ? 6*

7* ø É Ê Ë È Í Î Ï Ì ` : # @ ' = " 7*

8* Ø a b c d e f g h i « » ð ý þ ± 8*

9* ° j k l m n o p q r ª º æ ¸ Æ € 9*

A* µ ~ s t u v w x y z ¡ ¿ Ð Ý Þ ® A*

C* { A B C D E F G H I ô ö ò ó õ C*

D* } J K L M N O P Q R ¹ û ü ù ú ÿ D*

E* \ ÷ S T U V W X Y Z ² Ô Ö Ò Ó Õ E*

F* 0 1 2 3 4 5 6 7 8 9 ³ Û Ü Ù Ú Ÿ F*

EA _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F EA

0_ 00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F 0_

1_ 10 11 12 13 9D 85 08 87 18 19 92 8F 1C 1D 1E 1F 1_

2_ A4 81 82 83 84 0A 17 1B 88 89 8A 8B 8C 05 06 07 2_

3_ 90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A 3_

4_ 20 A0 E2 E4 E0 E1 E3 E5 E7 F1 A2 2E 3C 28 2B 7C 4_

5_ 26 E9 EA EB E8 ED EE EF EC DF 21 24 2A 29 3B AC 5_

6_ 2D 2F C2 C4 C0 C1 C3 C5 C7 D1 A6 2C 25 5F 3E 3F 6_

7_ F8 C9 CA CB C8 CD CE CF CC 60 3A 23 40 27 3D 22 7_

8_ D8 61 62 63 64 65 66 67 68 69 AB BB F0 FD FE B1 8_

9_ B0 6A 6B 6C 6D 6E 6F 70 71 72 AA BA E6 B8 C6 80 9_

A_ B5 7E 73 74 75 76 77 78 79 7A A1 BF D0 DD DE AE A_

B_ 5E A3 A5 B7 A9 A7 B6 BC BD BE 5B 5D AF A8 B4 D7 B_

C_ 7B 41 42 43 44 45 46 47 48 49 AD F4 F6 F2 F3 F5 C_

D_ 7D 4A 4B 4C 4D 4E 4F 50 51 52 B9 FB FC F9 FA FF D_

E_ 5C F7 53 54 55 56 57 58 59 5A B2 D4 D6 D2 D3 D5 E_

F_ 30 31 32 33 34 35 36 37 38 39 B3 DB DC D9 DA 9F F_

// _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F //

Using the End-of-Line settings AUTO and AUTONL

The End-of-Line profile options AUTO and AUTONL allow for automatic detection of line terminations, possibly containing inconsistent and spurious line terminators, so that files edited across different systems, mainframe SYSOUT files, and other inconsistently-terminated text files can be opened, viewed and edited in a reasonable way. End-of-Line AUTO/AUTONL may be applied to non-mainframe files as well, to handle situations where a file's line termination is inconsistent for some reason. A possible cause of this is a file shared between Windows and Unix on a network and edited with different editors that apply different line endings. Line terminations under End-of-Line AUTO/AUTONL are handled as follows:

FF (form feed) characters delimit lines, and cause a =PAGE> marker to be placed in the sequence area. Notes:

Scrolling commands PAGE UP and PAGE DOWN will locate these marked lines.

Because PAGE UP and PAGE DOWN move to these =PAGE> marker lines, the distance scrolled varies with the number of lines on each page. To regain the 'full screen motion' that PAGE UP and PAGE DOWN provide in other files, you can use a scroll amount of HALF or DATA, or enter a numeric value for a specific number of lines. For most users who would have used PAGE UP and PAGE DOWN, scrolling UP/DOWN by the DATA scroll amount should work well.

When you have a file that shows this =PAGE> marker on a line, and you PRINT this file, you will have a Form Feed sent to the printer for every line containing the =PAGE> marker.

A lone LF (line feed) or lone CR (carriage return) is treated as a line delimiter equivalent to CR,LF.

“Spurious” CR characters that seemingly don't belong there, such as CR,CR,LF are ignored. For example, in the sequence CR,CR,LF, the first CR is spurious; the remaining CR,LF pair is a normal line termination.

A CR,FF or CR,LF,FF sequence is considered as the end of one line, followed by a page separator line.

A hex value of X'1A' at the end of the file is ignored.
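The rules above can be approximated in a short Python sketch. This is one possible reading of the rules, not SPFLite's actual algorithm: the function name `split_auto` is hypothetical, a page break is represented as a flag on the line that follows the FF rather than as a separate =PAGE> line, and any run of CR characters directly before an LF is assumed to be spurious.

```python
import re

# One terminator is: an optional line end followed by FF (page break),
# or any number of CRs followed by LF, or a lone CR.
_TERMINATOR = re.compile(rb"(?:\r*\n|\r)?\x0c|\r*\n|\r")


def split_auto(data: bytes) -> list[tuple[bytes, bool]]:
    """Return (line, starts_new_page) pairs for inconsistently terminated data."""
    if data.endswith(b"\x1a"):
        data = data[:-1]  # a trailing X'1A' is ignored
    lines = []
    pos = 0
    new_page = False
    for match in _TERMINATOR.finditer(data):
        lines.append((data[pos:match.start()], new_page))
        # A terminator ending in FF means the next line starts a new page.
        new_page = match.group().endswith(b"\x0c")
        pos = match.end()
    if pos < len(data) or new_page:
        lines.append((data[pos:], new_page))
    return lines


sample = b"one\r\ntwo\nthree\r\r\nfour\r\n\x0cnext page\r\n\x1a"
for line, page in split_auto(sample):
    print(("=PAGE> " if page else "       ") + line.decode("ascii"))
```

With the sample data, the mixed CRLF, LF and CR,CR,LF endings each produce a single line break, and only the line after the form feed is flagged as starting a page.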

PAGE Profile Support

An extension to the AUTO / AUTONL support is the PAGE Profile option. When selected (ON) and an AUTO / AUTONL file is processed, the screen display will, when fewer lines exist on a page than the screen height, leave the bottom of the screen page blank rather than display the beginning lines of the next page. This presents a more normal 'print page' format for viewing.

If only UP PAGE and DOWN PAGE commands are used to scroll, this 'page mode' will be retained. Scrolling via the mouse wheel, or via the arrow keys, will suspend PAGE mode until the next time an UP PAGE or DOWN PAGE command is used.
