Never Try to Handle Binary Data as a String

This issue comes up frequently. Hopefully these C# and VB.NET examples will help people understand what not to do.

Here’s the C# example (the VB.NET example is further below):


// Never try to store non-text binary data as a string.
// This applies to all programming languages where the string data type is
// an object, such as C#, VB.NET, Java, VB6, FoxPro, etc.
//
// If the semantics of the programming language are such that a "string"
// is just a sequence of bytes terminated by a 0 byte, such as in C/C++,
// there are still problems, because the first 0 byte in the non-text data
// (such as a JPG or PDF) would terminate the "string".

// Both C# and VB.NET are languages where strings are objects.  If you wish to
// set the contents of a C# or VB.NET string from a byte array, you MUST tell .NET
// the character encoding of the byte array -- otherwise it does not know how
// to interpret the bytes.  (For example, the bytes might be utf-8, iso-8859-1, Shift_JIS, etc.)
// If the bytes don't actually represent characters (as with image data or a zip archive), then
// it makes no sense to try to convert the bytes into "chars", because there will be innumerable
// sequences of bytes that don't represent any possible char in the charset encoding.

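// For example, decoding the utf-8 file below is OK, but decoding the JPG is not.
// (The Encoding class used here lives in System.Text, so the file needs "using System.Text;".)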

byte[] utf8Bytes = System.IO.File.ReadAllBytes("utf8_sampler.htm");
byte[] jpgBytes = System.IO.File.ReadAllBytes("starfish.jpg");

textBox1.Text = "num utf8 bytes = " + utf8Bytes.Length.ToString() + "\r\n";
textBox1.Text += "num JPG bytes = " + jpgBytes.Length.ToString() + "\r\n";

// Interpret the bytes according to the utf-8 encoding and return the string object.
// The number of chars in the string may differ from the number of bytes if there
// were chars with multi-byte utf-8 representations.
string s1 = Encoding.UTF8.GetString(utf8Bytes);
textBox1.Text += "num chars = " + s1.Length + "\r\n";

// This is garbage, because the JPG bytes don't represent chars in the utf-8 encoding.
string s2 = Encoding.UTF8.GetString(jpgBytes);
textBox1.Text += "num chars = " + s2.Length + "\r\n";

// Go back to utf-8 bytes:
byte[] utf8Bytes2 = Encoding.UTF8.GetBytes(s1);
byte[] jpgBytes2 = Encoding.UTF8.GetBytes(s2);

textBox1.Text += "num utf8 bytes 2 = " + utf8Bytes2.Length.ToString() + "\r\n";
textBox1.Text += "num JPG bytes 2 = " + jpgBytes2.Length.ToString() + "\r\n";

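// Here's the output of this program. The text file comes back with the same number
// of bytes (62417), but the JPG comes back as 10710 bytes instead of 6229: it is corrupted.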
//num utf8 bytes = 62417
//num JPG bytes = 6229
//num chars = 55731
//num chars = 5962
//num utf8 bytes 2 = 62417
//num JPG bytes 2 = 10710
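
The growth from 6229 to 10710 bytes in the output above has a simple cause. By default, Encoding.UTF8 silently substitutes the replacement character U+FFFD for byte sequences that are not valid utf-8, and U+FFFD takes three bytes when it is encoded back. Here is a minimal sketch on a three-byte array. The class name RoundTripDemo and the sample bytes are made up for illustration and are not part of the program above.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Decode the bytes as utf-8, then encode the resulting string back to utf-8.
    public static byte[] RoundTrip(byte[] data)
    {
        string s = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(s);
    }

    public static void Main()
    {
        // 'A', then 0xFF (a byte that never occurs in valid utf-8), then 'B'.
        byte[] original = { 0x41, 0xFF, 0x42 };

        string s = Encoding.UTF8.GetString(original);
        byte[] roundTripped = RoundTrip(original);

        Console.WriteLine("num bytes = " + original.Length);                 // 3
        Console.WriteLine("num chars = " + s.Length);                        // 3
        Console.WriteLine("middle char = U+" + ((int)s[1]).ToString("X4"));  // U+FFFD
        Console.WriteLine("num bytes 2 = " + roundTripped.Length);           // 5
        Console.WriteLine(BitConverter.ToString(roundTripped));              // 41-EF-BF-BD-42
    }
}
```

The original 0xFF byte is gone for good: nothing in the string records which byte was replaced, so no later conversion can recover it.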

VB.NET Example:

' Never try to store non-text binary data as a string.
' This applies to all programming languages where the string data type is
' an object, such as C#, VB.NET, Java, VB6, FoxPro, etc.
'
' If the semantics of the programming language are such that a "string"
' is just a sequence of bytes terminated by a 0 byte, such as in C/C++,
' there are still problems, because the first 0 byte in the non-text data
' (such as a JPG or PDF) would terminate the "string".

' But C# and VB.NET are languages where strings are objects.  If you wish to 
' set the contents of a C# or VB.NET string from a byte array, you MUST tell .NET
' the character encoding of the byte array -- otherwise it does not know how
' to interpret the bytes.  (For example, the bytes might be utf-8, iso-8859-1, Shift_JIS, etc.)
' If the bytes don't actually represent characters (as with image data or a zip archive), then
' it makes no sense to try to convert the bytes into "chars", because there will be innumerable
' sequences of bytes that don't represent any possible char in the charset encoding.

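' For example, decoding the utf-8 file below is OK, but decoding the JPG is not.
' (The Encoding class used here lives in System.Text, so the file needs "Imports System.Text".)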

Dim utf8Bytes As Byte() = System.IO.File.ReadAllBytes("utf8_sampler.htm")
Dim jpgBytes As Byte() = System.IO.File.ReadAllBytes("starfish.jpg")

textBox1.Text = "num utf8 bytes = " & utf8Bytes.Length.ToString() & vbCrLf
textBox1.Text &= "num JPG bytes = " & jpgBytes.Length.ToString() & vbCrLf

' Interpret the bytes according to the utf-8 encoding and return the string object.
' The number of chars in the string may differ from the number of bytes if there
' were chars with multi-byte utf-8 representations.
Dim s1 As String = Encoding.UTF8.GetString(utf8Bytes)
textBox1.Text &= "num chars = " & s1.Length.ToString() & vbCrLf

' This is garbage, because the JPG bytes don't represent chars in the utf-8 encoding.
Dim s2 As String = Encoding.UTF8.GetString(jpgBytes)
textBox1.Text &= "num chars = " & s2.Length.ToString() & vbCrLf

' Go back to utf-8 bytes:
Dim utf8Bytes2 As Byte() = Encoding.UTF8.GetBytes(s1)
Dim jpgBytes2 As Byte() = Encoding.UTF8.GetBytes(s2)

textBox1.Text &= "num utf8 bytes 2 = " & utf8Bytes2.Length.ToString() & vbCrLf
textBox1.Text &= "num JPG bytes 2 = " & jpgBytes2.Length.ToString() & vbCrLf

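' Here's the output of this program. The text file comes back with the same number
' of bytes (62417), but the JPG comes back as 10710 bytes instead of 6229: it is corrupted.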
'num utf8 bytes = 62417
'num JPG bytes = 6229
'num chars = 55731
'num chars = 5962
'num utf8 bytes 2 = 62417
'num JPG bytes 2 = 10710
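
When binary data genuinely has to travel through a string (a text field, JSON, XML), one common approach is to encode the bytes as Base64 text rather than decode them as characters. Below is a minimal C# sketch using the framework's Convert class. The class name Base64Demo, the helper names, and the six sample bytes are made up for illustration. The same Convert methods can be called from VB.NET.

```csharp
using System;
using System.Linq;

public static class Base64Demo
{
    // Represent arbitrary bytes as plain ASCII text.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recover the exact original bytes from the Base64 text.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }

    public static void Main()
    {
        // Sample non-text bytes, including 0xFF and a 0 byte.
        byte[] original = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 };

        string text = ToText(original);
        byte[] restored = FromText(text);

        Console.WriteLine(text);                              // /9j/4AAQ
        Console.WriteLine(restored.SequenceEqual(original));  // True
    }
}
```

The string here is only a transport format for the bytes. It is never interpreted as characters of the data itself, so no character encoding is involved and nothing is lost.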