Understanding EncryptStringENC and DecryptStringENC in Python and C/C++

Chilkat provides APIs that are identical across a variety of programming languages. One difficulty in doing this is handling strings, because different languages pass strings in different ways. In some languages, such as Python or C/C++, a "string" is simply a sequence of bytes terminated by a null. (I'm referring to "multibyte" strings, not Unicode (utf-16) strings. The term "multibyte" means any charset in which each letter or symbol is represented by one or more bytes without using nulls.) A Python or C/C++ application must indicate how those bytes are to be interpreted, and there are two choices: ANSI or utf-8. Each Chilkat class has a "Utf8" property that controls whether the bytes are interpreted as ANSI or utf-8. Note: the Utf8 property only exists in programming languages where strings are passed as a sequence of bytes. For example, in .NET strings are objects and are always passed (and returned) as objects. If the ActiveX is used, strings are always passed as utf-16. In Python or C/C++, however, strings are simply sequences of bytes, and some additional mechanism must be used to indicate how the bytes are to be interpreted.
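
In the C++ edition, for example, the Utf8 property is exposed as the get_Utf8/put_Utf8 methods on each Chilkat object. A minimal sketch (using CkCrypt2 purely for illustration):

    CkCrypt2 crypt;

    // Tell this object instance that any "const char *" strings passed to
    // its methods (or returned by them) are utf-8 byte sequences.
    // The default is false, meaning the bytes are interpreted as ANSI.
    crypt.put_Utf8(true);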

To encrypt a string, we must specify the exact byte representation of the string to be encrypted. This is achieved via the Charset property. For example, maybe it is the ANSI byte representation that is to be encrypted, or the utf-16 representation, or utf-8, or anything else. The mechanism for specifying the byte representation to be encrypted is entirely separate from the mechanism used to unambiguously pass the string to the Chilkat method. These are two separate things. Therefore, string encryption/decryption happens in these steps:

Encrypting a String (EncryptStringENC)

1) Unambiguously pass the string to the EncryptStringENC method.
2) (Internal to the Chilkat method) Convert the string to the byte representation specified by the Charset property.
3) Encrypt those bytes.
4) Encode the binary encrypted bytes according to the EncodingMode property (which can be base64, hex, etc.) and return this string.
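
A rough C++ sketch of these four steps (the hex key and IV below are placeholders, not values from anything above):

    CkCrypt2 crypt;

    crypt.put_CryptAlgorithm("aes");
    crypt.put_CipherMode("cbc");
    crypt.put_KeyLength(256);

    // Placeholder 256-bit key and 128-bit IV, hex encoded.
    crypt.SetEncodedKey("000102030405060708090A0B0C0D0E0F000102030405060708090A0B0C0D0E0F","hex");
    crypt.SetEncodedIV("000102030405060708090A0B0C0D0E0F","hex");

    // Step 2: the string is first converted to this byte representation.
    crypt.put_Charset("utf-8");

    // Step 4: the encrypted bytes are returned encoded in this form.
    crypt.put_EncodingMode("base64");

    // Steps 1 through 4 all happen within this single call:
    const char *encryptedB64 = crypt.encryptStringENC("Hello, world!");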

Decrypting a String (DecryptStringENC)

1) Pass the encoded string to the DecryptStringENC method. Note that all possible encodings (base64, hex, etc.) use only us-ascii chars. In all multibyte charsets, only the non-us-ascii chars differ; us-ascii chars are always represented by a single byte less than 0x80. Therefore, the Utf8 property can be either true or false, because us-ascii chars have the same byte representation in both utf-8 and ANSI.
2) (Internal to the Chilkat method) Decode the base64/hex/etc. to get the binary encrypted bytes.
3) Decrypt to get the string in the byte representation as was indicated by the Charset property when encrypting. (The Charset property must be set to this same value when decrypting.)
4) Unambiguously return the string. For languages such as Python or C/C++, this means examining the Utf8 property setting and performing whatever conversion is necessary (if any) from the charset indicated by the Charset property, so that the string is returned in either the ANSI or utf-8 encoding. (For languages such as C#, Chilkat converts as appropriate to return a string object to the .NET language.)
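
Continuing the sketch above (same crypt object, same key and IV), decryption is the mirror image:

    // The Charset and EncodingMode must match the values used when encrypting.
    crypt.put_Charset("utf-8");
    crypt.put_EncodingMode("base64");

    // The base64 input is pure us-ascii, so the Utf8 property does not affect
    // how it is read; it only controls whether the decrypted string is
    // returned as ANSI or utf-8 bytes.
    const char *originalStr = crypt.decryptStringENC(encryptedB64);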

BASE64 Decode with Charset GB2312

Question:
I have a Base64 decode error, as follows:

CkString str;
str.setString("16q");
str.base64Decode("gb2312");
const char *strResult = str.getString();

The conversion result is { cb f2 }, but the correct result should be { d7 aa }.

What’s wrong?

The platform is WinCE 6.0, using Chilkat_PPC_M5.lib.

Answer:

The following code shows how to do it correctly:

    CkString str;
    str.setString("16q");

    // The following line of code tells the CkString object to 
    // decode the base64 to raw bytes, then interpret those bytes as
    // GB2312 encoded characters and store them within the string.
    // Internally, the string is stored as utf-8.
    str.base64Decode("gb2312");

    // getString returns either ANSI or utf-8 bytes, depending on the
    // setting of get_Utf8/put_Utf8.  With the default (Utf8 = false),
    // this is an implicit conversion to ANSI, which is why the original
    // code produced { cb f2 } instead of the expected GB2312 bytes.
    const char *strAnsi = str.getString();

    // Instead, fetch the string as GB2312 bytes:
    const char *strGb2312 = str.getEnc("gb2312");

    const unsigned char *c = (const unsigned char *) strGb2312;
    while (*c != '\0') {  printf("%02x ",*c); c++; }
    printf("\n");

    // The output is "d7 aa "


    // Another way to decode using CkByteData...
    CkByteData data;
    data.appendEncoded("16q","base64");
    c = data.getData();
    unsigned long i;
    unsigned long sz = data.getSize();
    for (i=0; i<sz; i++) { printf("%02x ",*c); c++; }
    printf("\n");

    // The output is "d7 aa "

Encrypting Chinese Characters

Question:
Why is the return value blank when encrypting Chinese characters?
Here’s a snippet of my code:

  crypt.KeyLength := 256;
  crypt.SecretKey := Password;
  crypt.CryptAlgorithm := 'aes';
  crypt.EncodingMode := 'base64';
  OutPutStr := crypt.EncryptStringENC(StringToEncrypt);

Answer:

Strings in some programming languages, such as Visual Basic, C#, VB.NET, Delphi, Foxpro, etc., should be thought of as objects.  The object contains a string (i.e. a sequence of characters that renders to a sequence of glyphs).  The representation of the string within the object is private; the application shouldn't care.  For these languages it happens to be Unicode (the 2-byte-per-char utf-16 encoding), so the string object is capable of containing characters in any spoken language.  (Of course, just because the string may contain characters in any spoken language doesn't mean glyphs of every language are renderable. This is a big problem in older programming languages such as VB6, Delphi, etc., where the visual controls cannot mix glyphs of different languages; they are not Unicode-capable controls, even though the string data type holds characters represented internally in Unicode.)

OK, back to the main point…

The representation of the string (i.e. the encoding used to represent each character as a sequence of 1 or more bytes) within the string object is private; the application shouldn't care.  With encryption, however, it matters greatly.  Encryption algorithms operate on bytes.  (The same goes for hash algorithms.)  Therefore, when you encrypt Chinese characters, did you intend to encrypt the 2-byte-per-char Unicode representation?  The utf-8 representation of the characters?  What about the "big5" or "gb2312" character encodings?  Each would produce different results (of course).

The Crypt.Charset property controls the charset (character encoding) used for encrypting strings.  The string passed to EncryptString* is first converted (internally) to a byte array using the specified character encoding, and then this byte array is encrypted.  The default value for Crypt.Charset is "ANSI".  In most cases this is what you expect: a typical European accented character is represented as a single byte in the default charset of the computer.  This doesn't work with Chinese (or other Asian languages), or with any language that doesn't match the locale of the computer.  The internal conversion from Unicode to ANSI drops the characters that have no 1-byte/char representation, which is why the returned string is blank.

The solution: set Crypt.Charset equal to the desired encoding.  For Chinese it would be one of the following: "utf-8", "Unicode", "big5", "gb2312".
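
The original snippet is Delphi, but the same fix in the C++ edition might look like the sketch below. The hex key is a placeholder standing in for however the secret key is actually derived, and the Delphi property assignments map onto the corresponding put_* setters:

    CkCrypt2 crypt;

    crypt.put_CryptAlgorithm("aes");
    crypt.put_KeyLength(256);
    crypt.put_EncodingMode("base64");

    // Placeholder 256-bit key, hex encoded.  (In the question, the key
    // was derived from a password instead.)
    crypt.SetEncodedKey("000102030405060708090A0B0C0D0E0F000102030405060708090A0B0C0D0E0F","hex");

    // The important part: pick the byte representation of the Chinese text
    // that should actually be encrypted.
    crypt.put_Charset("utf-8");   // or "unicode", "big5", "gb2312"

    // Tell Chilkat that the bytes being passed in are utf-8.
    // This literal is "ni hao" in utf-8.
    crypt.put_Utf8(true);
    const char *stringToEncrypt = "\xE4\xBD\xA0\xE5\xA5\xBD";

    const char *outPutStr = crypt.encryptStringENC(stringToEncrypt);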