Never Try to Handle Binary Data as a String
This issue comes up frequently, and hopefully these C# and VB.NET examples will help people understand what not to do.
Here's the C# example (the VB.NET example is further below):
// Never try to store non-text binary data as a string.
// This applies to all programming languages where the string data type is
// an object, such as C#, VB.NET, Java, VB6, FoxPro, etc.
//
// If the semantics of the programming language are such that a "string"
// is just a sequence of bytes terminated by a 0 byte, such as in C/C++,
// there are still problems because the first 0 byte in the non-text data
// (such as a JPG or PDF) would terminate the "string".

// Both C# and VB.NET are languages where strings are objects. If you wish to
// set the contents of a C# or VB.NET string from a byte array, you MUST tell .NET
// the character encoding of the byte array; otherwise it does not know how
// to interpret the bytes. (For example, the bytes might be utf-8, iso-8859-1, Shift_JIS, etc.)
// If the bytes don't actually represent characters (such as image data, or a zip archive), then
// it makes no sense to try to convert the bytes into "chars", because there will be innumerable
// sequences of bytes that don't represent any possible char in the charset encoding.

// Note: Encoding lives in the System.Text namespace.

// For example, this is OK:
byte[] utf8Bytes = System.IO.File.ReadAllBytes("utf8_sampler.htm");
byte[] jpgBytes = System.IO.File.ReadAllBytes("starfish.jpg");

textBox1.Text = "num utf8 bytes = " + utf8Bytes.Length.ToString() + "\r\n";
textBox1.Text += "num JPG bytes = " + jpgBytes.Length.ToString() + "\r\n";

// Interpret the bytes according to the utf-8 encoding and return the string object:
// The number of chars in the string may be different than the number of bytes if there
// were chars with multi-byte utf-8 representations.
string s1 = Encoding.UTF8.GetString(utf8Bytes);
textBox1.Text += "num chars = " + s1.Length.ToString() + "\r\n";

// This is garbage, because the JPG bytes don't represent chars in the utf-8 encoding.
string s2 = Encoding.UTF8.GetString(jpgBytes);
textBox1.Text += "num chars = " + s2.Length.ToString() + "\r\n";

// Go back to utf-8 bytes:
byte[] utf8Bytes2 = Encoding.UTF8.GetBytes(s1);
byte[] jpgBytes2 = Encoding.UTF8.GetBytes(s2);

textBox1.Text += "num utf8 bytes 2 = " + utf8Bytes2.Length.ToString() + "\r\n";
textBox1.Text += "num JPG bytes 2 = " + jpgBytes2.Length.ToString() + "\r\n";

// Here's the output of this program:
//num utf8 bytes = 62417
//num JPG bytes = 6229
//num chars = 55731
//num chars = 5962
//num utf8 bytes 2 = 62417
//num JPG bytes 2 = 10710
//
// The text survives the round trip (62417 bytes both ways), but the JPG
// does not: 6229 bytes went in and 10710 came out.
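To see why the JPG byte count changes, here is a small sketch that needs no files. It assumes a console program rather than the form used above, and the class name, method name, and four-byte sample (the first bytes of a typical JPEG header) are made up for illustration. When Encoding.UTF8 meets bytes that are not valid utf-8, its default behavior is to substitute the replacement character U+FFFD, and U+FFFD takes three bytes when encoded back to utf-8. The original bytes are gone at that point and cannot be recovered from the string.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical demo class; not part of the original example.
public static class RoundTripDemo
{
    // Decode bytes as utf-8 into a string, then encode the string back to utf-8.
    public static byte[] RoundTripThroughUtf8(byte[] data)
    {
        string s = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(s);
    }

    public static void Main()
    {
        // Real text: the bytes are valid utf-8, so nothing is lost.
        byte[] textBytes = Encoding.UTF8.GetBytes("héllo");
        byte[] textBytes2 = RoundTripThroughUtf8(textBytes);
        Console.WriteLine(textBytes.SequenceEqual(textBytes2)); // prints True

        // Assumed sample: the first four bytes of a typical JPEG file.
        // 0xFF can never appear in valid utf-8, so decoding must substitute.
        byte[] jpgBytes = { 0xFF, 0xD8, 0xFF, 0xE0 };
        byte[] jpgBytes2 = RoundTripThroughUtf8(jpgBytes);
        Console.WriteLine(jpgBytes.SequenceEqual(jpgBytes2)); // prints False

        // Show both sequences in hex so the substitution is visible.
        Console.WriteLine(BitConverter.ToString(jpgBytes));
        Console.WriteLine(BitConverter.ToString(jpgBytes2));
    }
}
```

The exact number of replacement characters produced for a given invalid sequence can vary between .NET versions, so the sketch only checks whether the bytes came back unchanged.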
VB.NET Example:
' Never try to store non-text binary data as a string.
' This applies to all programming languages where the string data type is
' an object, such as C#, VB.NET, Java, VB6, FoxPro, etc.
'
' If the semantics of the programming language are such that a "string"
' is just a sequence of bytes terminated by a 0 byte, such as in C/C++,
' there are still problems because the first 0 byte in the non-text data
' (such as a JPG or PDF) would terminate the "string".

' But C# and VB.NET are languages where strings are objects. If you wish to
' set the contents of a C# or VB.NET string from a byte array, you MUST tell .NET
' the character encoding of the byte array; otherwise it does not know how
' to interpret the bytes. (For example, the bytes might be utf-8, iso-8859-1, Shift_JIS, etc.)
' If the bytes don't actually represent characters (such as image data, or a zip archive), then
' it makes no sense to try to convert the bytes into "chars", because there will be innumerable
' sequences of bytes that don't represent any possible char in the charset encoding.

' Note: Encoding lives in the System.Text namespace.

' For example, this is OK:
Dim utf8Bytes As Byte() = System.IO.File.ReadAllBytes("utf8_sampler.htm")
Dim jpgBytes As Byte() = System.IO.File.ReadAllBytes("starfish.jpg")

textBox1.Text = "num utf8 bytes = " & utf8Bytes.Length.ToString() & vbCrLf
textBox1.Text &= "num JPG bytes = " & jpgBytes.Length.ToString() & vbCrLf

' Interpret the bytes according to the utf-8 encoding and return the string object:
' The number of chars in the string may be different than the number of bytes if there
' were chars with multi-byte utf-8 representations.
Dim s1 As String = Encoding.UTF8.GetString(utf8Bytes)
' Use & and ToString() here: "text" + Integer is not string concatenation in VB.NET.
textBox1.Text &= "num chars = " & s1.Length.ToString() & vbCrLf

' This is garbage, because the JPG bytes don't represent chars in the utf-8 encoding.
Dim s2 As String = Encoding.UTF8.GetString(jpgBytes)
textBox1.Text &= "num chars = " & s2.Length.ToString() & vbCrLf

' Go back to utf-8 bytes:
Dim utf8Bytes2 As Byte() = Encoding.UTF8.GetBytes(s1)
Dim jpgBytes2 As Byte() = Encoding.UTF8.GetBytes(s2)

textBox1.Text &= "num utf8 bytes 2 = " & utf8Bytes2.Length.ToString() & vbCrLf
textBox1.Text &= "num JPG bytes 2 = " & jpgBytes2.Length.ToString() & vbCrLf

' Here's the output of this program:
'num utf8 bytes = 62417
'num JPG bytes = 6229
'num chars = 55731
'num chars = 5962
'num utf8 bytes 2 = 62417
'num JPG bytes 2 = 10710
'
' The text survives the round trip (62417 bytes both ways), but the JPG
' does not: 6229 bytes went in and 10710 came out.
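If binary data really does have to travel through something that only accepts strings (a text field, XML, JSON), one common approach is to encode it as Base64 rather than pretend the bytes are characters. The following is only a sketch of that idea in C#, assuming a console program; the class name, method names, and six-byte sample are hypothetical. Convert.ToBase64String and Convert.FromBase64String are standard .NET methods and can be called the same way from VB.NET.

```csharp
using System;
using System.Linq;

// Hypothetical demo class; not part of the original example.
public static class Base64Demo
{
    // Produce a text-safe representation of arbitrary bytes.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recover the exact original bytes from the Base64 text.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }

    public static void Main()
    {
        // Assumed sample: six bytes as found at the start of a typical JPEG file,
        // including a 0 byte that would end a C-style string.
        byte[] original = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 };

        string text = ToText(original);
        Console.WriteLine(text); // prints /9j/4AAQ

        byte[] restored = FromText(text);
        Console.WriteLine(original.SequenceEqual(restored)); // prints True
    }
}
```

The Base64 text is about a third larger than the raw data, but unlike the utf-8 round trip above it gives back exactly the bytes that went in.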