Never Handle non-Text Binary Data as a String

October 30, 2018

The bytes of a binary file, such as a JPG or PDF, should never be treated as a string. Loading a binary file into a string and then saving it back to a file will very likely produce a different, corrupted file. This rule applies in every programming language: don’t treat binary non-text bytes as text characters.

Consider this Visual FoxPro code:

LOCAL x
x = FILETOSTR( "in.pdf" )
STRTOFILE( x, "out.pdf" )

It is likely that out.pdf is corrupt. Here’s why:

When reading a text file, the bytes must be interpreted according to some character encoding.

For example, consider this character: É
In the windows-1252 character encoding, it is represented by a single byte: 0xC9
In the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89
In the utf-16 character encoding, it is represented by two bytes: 0x00 0xC9 (big-endian order; little-endian is 0xC9 0x00)
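
The byte values above can be checked with a few lines of code. This is a minimal sketch in Python, used here only for illustration (it is not part of the FoxPro example); the variable names are arbitrary.

```python
# The character É is U+00C9; the escape avoids depending on source-file encoding.
ch = "\u00c9"

cp1252_bytes = ch.encode("cp1252")       # 1 byte:  C9
utf8_bytes = ch.encode("utf-8")          # 2 bytes: C3 89
utf16_be_bytes = ch.encode("utf-16-be")  # 2 bytes: 00 C9
utf16_le_bytes = ch.encode("utf-16-le")  # 2 bytes: C9 00

for name, raw in [("windows-1252", cp1252_bytes), ("utf-8", utf8_bytes),
                  ("utf-16 (BE)", utf16_be_bytes), ("utf-16 (LE)", utf16_le_bytes)]:
    print(name, raw.hex(" ").upper())
```

The same character yields different bytes under each encoding, which is why bytes only mean something as text once an encoding has been chosen.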

In this case, FoxPro is probably assuming ANSI (i.e. Windows-1252 for USA/Western European computers, 1 byte per char). Internally, FoxPro most likely holds strings in the utf-16 byte representation. Therefore, each incoming ANSI byte is converted to a utf-16 char of 2 bytes.

Now have a look at the Windows-1252 charset:

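Notice the “NOT USED” bytes. In the current Windows-1252 definition these are 81, 8D, 8F, 90, and 9D. (Older charts also show 8E and 9E as unused; those values were later assigned to Ž and ž.)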

These byte values will never appear in valid Windows-1252 text. However, they will likely appear in a binary non-text file. If the binary file is large enough, you can be sure these bytes will be present. They’ll likely get converted to a “?” char. That’s why you see “?” or some other standard char when non-text is loaded.
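
One way to see which byte values have no meaning in Windows-1252 is to try decoding each of the 256 possible values. The sketch below uses Python and its built-in "cp1252" codec as a stand-in for whatever conversion a given language runtime performs; it is an illustration, not a description of FoxPro's internals.

```python
# Collect every single-byte value that the cp1252 codec refuses to decode.
unused = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        unused.append(value)

print([format(v, "02X") for v in unused])  # ['81', '8D', '8F', '90', '9D']
```

With five unassigned values out of 256, any binary file of more than a few hundred bytes with reasonably varied content is very likely to contain at least one of them.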

When you write the text back to the file, all of the “NOT USED” bytes are written as “?” chars. This is the corruption. By trying to handle binary data as text, incoming bytes are implicitly converted to the byte representation used to hold strings (likely utf-8 or utf-16). Writing the file (in the case of STRTOFILE) involves an implicit conversion to the 1-byte per char ANSI representation.

The round-trip of ANSI –> Internal Representation –> ANSI corrupts the data.
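
The round-trip can be reproduced on a small scale. The following Python sketch assumes the conversion substitutes a replacement char for undecodable bytes and writes “?” for anything it cannot encode (Python's errors="replace" behavior); other runtimes may substitute or drop bytes differently, but the outcome is the same kind of damage.

```python
# Stand-in for a binary file: one of every possible byte value.
original = bytes(range(256))

# "Load as text": bytes -> internal string. Unassigned bytes become U+FFFD.
text = original.decode("cp1252", errors="replace")

# "Save as text": internal string -> bytes. U+FFFD cannot be encoded, so "?" is written.
written = text.encode("cp1252", errors="replace")

# Positions where the output differs from the input.
damaged = [i for i in range(256) if original[i] != written[i]]

print(written == original)                   # False
print([format(i, "02X") for i in damaged])   # ['81', '8D', '8F', '90', '9D']
print({written[i] for i in damaged})         # {63}, i.e. the "?" char (0x3F)
```

The file length is unchanged, so the damage is easy to miss until a PDF or JPG reader rejects the output.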

Also, it doesn’t matter what charsets are involved. It could be utf-8, utf-16, etc. Reading a text file implicitly involves interpreting bytes according to a charset. If those bytes don’t actually represent text in the given charset, impossible byte values or byte sequences will be present. Each of these causes some sort of error char to be substituted (or the error sequence is simply dropped), and the round-trip results in corruption.
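
The same effect shows up with utf-8. As a small illustration in Python, take the 8-byte signature that begins every PNG file; its first byte, 0x89, is not valid on its own in utf-8. The two error policies shown (substitute or drop) are Python's "replace" and "ignore" handlers, used here as examples of what a runtime might do.

```python
# The standard 8-byte PNG file signature.
png_signature = b"\x89PNG\r\n\x1a\n"

# Policy 1: substitute an error char (U+FFFD), then write the string back as utf-8.
replaced = png_signature.decode("utf-8", errors="replace").encode("utf-8")

# Policy 2: silently drop the invalid byte, then write the string back as utf-8.
dropped = png_signature.decode("utf-8", errors="ignore").encode("utf-8")

print(replaced)  # b'\xef\xbf\xbdPNG\r\n\x1a\n'  (10 bytes instead of 8)
print(dropped)   # b'PNG\r\n\x1a\n'              (7 bytes instead of 8)
```

Either way the output no longer starts with a valid PNG signature, and here even the file length changes.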

The rule that should never be broken is: never treat binary data as text. Don’t use string data types to hold non-text binary data.
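
For contrast, here is a minimal sketch of a copy that follows the rule, written in Python for illustration. The helper name copy_binary and the temporary file names are hypothetical; the point is only that the file is opened in binary mode and held in a byte type, so no charset conversion ever takes place.

```python
import os
import tempfile


def copy_binary(src_path, dst_path):
    """Copy a file byte-for-byte; the data is never treated as text."""
    with open(src_path, "rb") as src:   # "rb": read raw bytes, no decoding
        data = src.read()               # data is a bytes object, not a string
    with open(dst_path, "wb") as dst:   # "wb": write raw bytes, no encoding
        dst.write(data)


# Demonstration with a throwaway file containing every possible byte value.
payload = bytes(range(256))
with tempfile.TemporaryDirectory() as folder:
    src_path = os.path.join(folder, "in.bin")
    dst_path = os.path.join(folder, "out.bin")
    with open(src_path, "wb") as f:
        f.write(payload)
    copy_binary(src_path, dst_path)
    with open(dst_path, "rb") as f:
        round_tripped = f.read()

print(round_tripped == payload)  # True
```

Most languages offer an equivalent: a byte-array or binary-data type plus file functions that read and write it without any charset conversion.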

admin
