Working with Word and Delimiter Characters

Contents of Article


Setting the WORD characters

Word Syntax


Many SPFLite functions must decide whether a character is a Word character or a non-Word (delimiter) character. This support is provided by the WORD string, which is part of a file's Profile. Because the string is stored per Profile, the characters that make up a Word can differ between file types. For example, different programming languages allow different characters in variable names (such as . dot, _ underscore, etc.), and IBM mainframe languages often use $ # and @ as letter characters in names.

In addition, handling international characters that are part of the extended ANSI character set can be a problem. A decision is needed as to whether these should be considered Word characters (as would be normal for western European languages) or non-Word (delimiter) characters (as would typically be done for English-based data). This issue is important because, in files containing English data, the presence of international characters is often a symptom of a corrupted file, whereas in a file containing French, German or Spanish data such characters are normal and expected.

Setting the WORD characters

To set the WORD characters, perform the following:

Word Syntax

The WORD string consists of a series of operands separated by spaces. Each operand is a 1-character string, a 2-character string, or a range string. A range string is simply two values separated by a - (dash). Two-character strings are assumed to be hex values and must consist only of valid hex characters, 0-9 and A-F; case is unimportant for hex values. Note that a range can be a character range or a hex range, but you cannot mix hex and character notation in the same range, so 0-39 and 30-9 are illegal ranges.

The following would be valid operand strings for WORD:

A        the uppercase letter A

FF        the hex value FF (the ÿ character)

a-z        the whole range of characters from lowercase 'a' through lowercase 'z'

0-9        the numbers '0' through '9'

30-39        the numbers '0' through '9' (expressed in hex)

Range strings must be expressed in ascending sequence; for example, a range of Z-A is invalid.

The operands of WORD do not themselves have to be in sequence; it is perfectly valid to have a string of:

A-Z FF FE FD a-z _ - 0-9

It is not an error to repeat a character in a WORD specification. The following is valid:

A-Z a-z 0-9 a-c D X z 32-35 a-z
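The syntax rules above can be sketched as a small Python parser. This is only an illustration of the rules as described in this article, not SPFLite's own implementation; the function names parse_word_string, _char_byte and _hex_byte are hypothetical, and the sketch assumes that "ANSI" means Windows code page 1252. It does not model the automatic addition of international characters.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def _char_byte(ch):
    """Return the byte value of one character (cp1252 is an assumption)."""
    encoded = ch.encode("cp1252")  # raises a ValueError subclass if unmappable
    if len(encoded) != 1:
        raise ValueError(f"not a single-byte character: {ch!r}")
    return encoded[0]


def _hex_byte(text):
    """Return the value of a two-digit hex string; case does not matter."""
    if len(text) != 2 or not all(c in HEX_DIGITS for c in text):
        raise ValueError(f"invalid hex value: {text!r}")
    return int(text, 16)


def parse_word_string(spec):
    """Return the set of byte values (0-255) selected by a WORD string."""
    codes = set()
    for token in spec.split():
        if len(token) == 1:                       # single character, e.g. A or _
            lo = hi = _char_byte(token)
        elif len(token) == 2:                     # hex value, e.g. FF
            lo = hi = _hex_byte(token)
        elif len(token) == 3 and token[1] == "-":  # character range, e.g. a-z
            lo, hi = _char_byte(token[0]), _char_byte(token[2])
        elif len(token) == 5 and token[2] == "-":  # hex range, e.g. 30-39
            lo, hi = _hex_byte(token[:2]), _hex_byte(token[3:])
        else:                                     # mixed notation such as 0-39
            raise ValueError(f"invalid WORD operand: {token!r}")
        if lo > hi:
            raise ValueError(f"range must be ascending: {token!r}")
        codes.update(range(lo, hi + 1))           # repeats are harmless in a set
    return codes
```

Because the result is a set, operand order and repeated characters make no difference: under these assumptions, parse_word_string("0-9") and parse_word_string("30-39") return the same set, while "Z-A", "0-39" and "30-9" are rejected with a ValueError.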

If you have chosen to include the International characters under Options => General as described above, they will be handled as follows.   The uppercase International characters will be added if A-Z is included in the WORD string; and the lowercase International characters will be added if a-z is included in the WORD string.  

Because international characters are added automatically this way, you don't have to list them all yourself (which is a good thing, because there are a lot of them, and they are not in contiguous character ranges either). The characters are added internally; the WORD string you see is not modified to show them, but SPFLite's processing logic for WORD takes them into account.

Note that the default WORD string used by SPFLite is:   A-Z a-z 0-9

Use of specific non-English language characters  

If you want to define only the letters of a specific non-English language (like Spanish) and not those of other non-English languages, leave the International Characters checkbox disabled, and then manually enter each non-English letter you want treated as a "word letter" on the WORD command.

Why would you want to set up WORD characters that way? Wouldn't it be simpler just to include all international letters and be done with it? Simpler, yes, but not necessarily better. Just as non-English letters in an English-based file could indicate file corruption, non-Spanish letters like Ð or ß could indicate file corruption in a Spanish-based file. For most users, allowing all international letters is far too lenient a test.

The following describes an approach for doing this.

SPFLite will repopulate the Normal Characters string in the General Options setup dialog when it is deleted.  If you enable the International Characters checkbox, clear this field, close the dialog, then reopen it, the Normal Characters field will have every possible displayable character.  You could then highlight that string and copy it with Ctrl-C to place the string in the clipboard.  Then, paste that string somewhere in an SPFLite edit screen and delete the international characters you didn't want, and reformat that string as a WORD operand.  Finally, go back and disable the International Characters checkbox and define your WORD setting as needed.
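The final "reformat that string as a WORD operand" step could also be done with a short script. The following Python sketch turns a string of wanted letters into space-separated hex operands. The function name to_word_operands is hypothetical, and the sketch assumes the ANSI code page is Windows-1252; entering the letters as single-character operands is equally valid under the syntax described earlier.

```python
def to_word_operands(letters, encoding="cp1252"):
    """Return space-separated two-digit hex operands for the given letters.

    Assumes every letter maps to a single byte in the chosen ANSI code page
    (cp1252 here, which is an assumption about the target system).
    """
    operands = []
    for ch in letters:
        encoded = ch.encode(encoding)
        if len(encoded) != 1:
            raise ValueError(f"not a single-byte character: {ch!r}")
        operand = f"{encoded[0]:02X}"
        if operand not in operands:  # keep input order, skip repeats
            operands.append(operand)
    return " ".join(operands)


# Example: the accented letters used in Spanish, lowercase only
spanish = to_word_operands("áéíóúüñ")
print("A-Z a-z 0-9 " + spanish)
```

The printed line is a candidate WORD string that accepts the English letters, digits, and only the listed Spanish letters.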

Use of WORD hex ranges in EBCDIC files  

Users of EBCDIC files should be mindful that SPFLite does all of its internal editing in ANSI.  EBCDIC files are translated to and from ANSI as needed to make the process appear transparent.  What this means for the WORD command is that if you specify a hex range for characters, you must presently use ANSI encoding, not EBCDIC encoding.  So, a hex range of digits must be defined as 30-39, even though the data values appear in HEX mode edit displays as F0-F9.  As this process could be confusing and non-intuitive, it may be best to avoid hex word ranges for EBCDIC files.  Otherwise, choose these values carefully.
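The difference between the two encodings can be checked with a few lines of Python. This sketch assumes Windows-1252 for ANSI and IBM code page 037 for EBCDIC; the translation table SPFLite actually uses for a given file may differ, and the function name hex_range is hypothetical.

```python
def hex_range(first, last, encoding):
    """Return a WORD-style hex range for two characters in an encoding."""
    lo = first.encode(encoding)[0]
    hi = last.encode(encoding)[0]
    return f"{lo:02X}-{hi:02X}"


ansi_digits = hex_range("0", "9", "cp1252")   # the form WORD expects
ebcdic_digits = hex_range("0", "9", "cp037")  # what HEX mode displays
print("ANSI  :", ansi_digits)
print("EBCDIC:", ebcdic_digits)
```

Under these assumptions the ANSI range is 30-39 and the EBCDIC range is F0-F9, matching the values quoted above; only the ANSI form belongs in the WORD string.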

This is a limitation in the implementation of the WORD command, which may or may not be addressed in a future release.
