Understanding EncryptStringENC and DecryptStringENC in Python and C/C++

Chilkat provides APIs that are identical across a variety of programming languages. One difficulty in doing this is handling strings, because different programming languages pass strings in different ways. In some programming languages, such as Python or C/C++, a “string” is simply a sequence of bytes terminated by a null. (I’m referring to “multibyte” strings, not Unicode (utf-16) strings. The term “multibyte” means any charset in which each letter or symbol is represented by one or more bytes without using nulls.) A Python or C/C++ application must indicate how the bytes are to be interpreted. There are two choices: ANSI or utf-8. Each Chilkat class has a “Utf8” property that controls whether the bytes are interpreted as ANSI or utf-8.

Note: The Utf8 property exists only in programming languages where strings are passed as a sequence of bytes. For example, in .NET strings are objects and are always passed and returned as objects. If the ActiveX is used, strings are always passed as utf-16. In Python or C/C++, however, strings are simply sequences of bytes, and some additional mechanism must be used to indicate how the bytes are to be interpreted.
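To make the ambiguity concrete, here is a minimal Python 3 sketch that uses bytes objects to stand in for the raw byte strings described above. It assumes Windows-1252 (cp1252) as the ANSI code page; the real ANSI code page depends on the system locale.

```python
# The same text has different byte representations in ANSI and utf-8.
# Assumption: "ANSI" means Windows-1252 (cp1252) in this sketch.
text = "Zürich"

ansi_bytes = text.encode("cp1252")  # ü becomes the single byte 0xFC
utf8_bytes = text.encode("utf-8")   # ü becomes the two bytes 0xC3 0xBC

# The bytes carry no label saying which charset produced them, so the
# receiver has to be told how to interpret them.
from_ansi = ansi_bytes.decode("cp1252")
from_utf8 = utf8_bytes.decode("utf-8")

print(ansi_bytes.hex())  # 5afc72696368
print(utf8_bytes.hex())  # 5ac3bc72696368
print(from_ansi == from_utf8 == text)  # True
```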

To encrypt a string, we must specify the exact byte representation of the string that is to be encrypted. This is done via the Charset property. For example, maybe it is the ANSI byte representation that is to be encrypted. Or maybe it is the utf-16 byte representation, or utf-8, or anything else. The mechanism that specifies the byte representation of the string to be encrypted must be entirely separate from the mechanism used to unambiguously pass the string to the Chilkat method. These are two separate things. Therefore, string encryption/decryption happens in these steps:

Encrypting a String (EncryptStringENC)

1) Unambiguously pass the string to the EncryptStringENC method.
2) (Internal to the Chilkat method) Convert the string to the byte representation specified by the Charset property.
3) Encrypt
4) Encode the binary encrypted bytes according to the EncodingMode property (which can be base64, hex, etc.) and return this string.
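
The four steps above can be sketched in Python 3 as follows. This is only an illustration of the data flow, not Chilkat's implementation: the function names are hypothetical, the XOR "cipher" is an insecure placeholder for the real encryption step, and the encoding is fixed to base64.

```python
import base64


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Insecure placeholder for step 3: XOR with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_string_enc_sketch(text: str, charset: str, key: bytes) -> str:
    # Step 1: the text arrives unambiguously (a Python 3 str here).
    # Step 2: convert to the byte representation named by the charset.
    raw = text.encode(charset)
    # Step 3: encrypt the bytes (placeholder cipher, see above).
    encrypted = toy_cipher(raw, key)
    # Step 4: encode the binary result as a us-ascii string (base64 here).
    return base64.b64encode(encrypted).decode("ascii")


key = b"k3y"
enc_utf8 = encrypt_string_enc_sketch("é", "utf-8", key)
enc_ansi = encrypt_string_enc_sketch("é", "cp1252", key)  # cp1252 assumed as ANSI

# The same text gives different output because different bytes were encrypted.
print(enc_utf8 != enc_ansi)  # True
```

The charset argument plays the role of the Charset property: it decides which bytes get encrypted, independently of how the text was passed in.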

Decrypting a String (DecryptStringENC)

1) Pass the encoded string to the DecryptStringENC method. Note that all possible encodings (base64, hex, etc.) use only us-ascii chars. In all multibyte charsets, it is only the non-us-ascii chars that are different. us-ascii chars are always represented by a single byte that is less than 0x80. Therefore, the Utf8 property can be either true or false because us-ascii chars have the same byte representation in both utf-8 and ANSI.
2) (Internal to the Chilkat method) Decode the base64/hex/etc. to get the binary encrypted bytes.
3) Decrypt to get the string in the byte representation that was indicated by the Charset property when encrypting. (The Charset property must be set to this same value when decrypting.)
4) Unambiguously return the string. For languages such as Python or C/C++, this means examining the Utf8 property setting and performing whatever conversion is necessary (if any) to convert from the charset indicated by the Charset property to the ANSI or utf-8 encoding in which the string is returned. (For languages such as C#, Chilkat will convert as appropriate to return a string object to the .NET language.)
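
The decryption steps can be sketched the same way. As before, this is a hypothetical model rather than Chilkat's code: the XOR routine is an insecure stand-in for the real cipher, base64 is the only encoding handled, and cp1252 is assumed as the ANSI code page. The utf8 argument models the Utf8 property and the charset argument models the Charset property.

```python
import base64


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Insecure placeholder cipher: XOR with a repeating key (symmetric)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def decrypt_string_enc_sketch(encoded: str, charset: str, key: bytes, utf8: bool) -> bytes:
    # Step 2: decode the base64 to get the binary encrypted bytes.
    encrypted = base64.b64decode(encoded)
    # Step 3: decrypt; the result is in the charset used when encrypting.
    text = toy_cipher(encrypted, key).decode(charset)
    # Step 4: return the bytes as utf-8 or ANSI, depending on the flag.
    return text.encode("utf-8" if utf8 else "cp1252")


key = b"k3y"
# Build an input the way the encryption steps would, with Charset = utf-8.
encoded = base64.b64encode(toy_cipher("é".encode("utf-8"), key)).decode("ascii")

# Step 1: the encoded string is pure us-ascii, so its bytes are the same
# whether it is passed as utf-8 or as ANSI.
same_bytes = encoded.encode("utf-8") == encoded.encode("cp1252")

as_utf8 = decrypt_string_enc_sketch(encoded, "utf-8", key, utf8=True)
as_ansi = decrypt_string_enc_sketch(encoded, "utf-8", key, utf8=False)

print(same_bytes)     # True
print(as_utf8.hex())  # c3a9
print(as_ansi.hex())  # e9
```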

Understanding a typical 8-bit character problem (such as with European language accented chars)

If a single accented European character is incorrectly displayed as two seemingly random characters, then the issue is that at some point utf-8 bytes were incorrectly interpreted as ANSI bytes.

For example, consider the character “é”.

In the utf-8 encoding, this character is represented in two bytes: 0xC3 0xA9
In the typical ANSI encoding (such as Windows-1252 or iso-8859-1) it is a single byte: 0xE9

For example, if the word “appliquée” is represented in utf-8 bytes, but the bytes are interpreted as ANSI chars, you would see this: “appliquÃ©e”.

The reason for “Ã©” is that each of the bytes 0xC3 and 0xA9 is interpreted as a separate ANSI char. If you examine the iso-8859-1 code chart at http://en.wikipedia.org/wiki/ISO/IEC_8859-1, you’ll find that:

0xC3 = Ã
and
0xA9 = ©

The solution is to determine how/why the utf-8 chars were mistakenly being interpreted as ANSI.
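
The misinterpretation is easy to reproduce in Python 3. The sketch below assumes Windows-1252 (cp1252) as the ANSI charset. It also shows that the garbled text can be reversed when no bytes were lost along the way, although fixing the original misinterpretation remains the real solution.

```python
word = "appliquée"

# Represent the word in utf-8: é becomes the two bytes 0xC3 0xA9.
utf8_bytes = word.encode("utf-8")

# Misinterpret those bytes as ANSI: each byte becomes its own character.
garbled = utf8_bytes.decode("cp1252")

# Reverse the mistake: recover the original bytes, then decode as utf-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)   # appliquÃ©e
print(repaired)  # appliquée
```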

One common issue with the FTP2 component is that it is not always possible to know automatically the character encoding of directory listings returned by the FTP server. The Ftp2.DirListingCharset property provides a way to tell the FTP2 component how to interpret the bytes returned in a directory listing. The default is ANSI. However, if the directory listing actually contains utf-8 bytes, then this misinterpretation will occur. The solution is to set the DirListingCharset property to “utf-8”.

Utf8 C++ property allows for utf-8 or ANSI “const char *”

All Chilkat C++ classes have a Utf8 property. For example:

class CkEmail : public CkObject
{
    public:

	CkEmail();
	virtual ~CkEmail();

...
	bool get_Utf8(void) const;
	void put_Utf8(bool b);

...
	const char *addFileAttachment(const char *fileName);
...
};

The Utf8 property controls how the bytes pointed to by “const char *” arguments are interpreted. By default, “const char *” strings are interpreted as ANSI bytes. If the Utf8 property is set to true by calling put_Utf8(true), then “const char *” inputs are interpreted as utf-8. This allows any application to pass either ANSI or utf-8 strings to any Chilkat method.

The Utf8 property also controls whether utf-8 or ANSI strings are returned by methods that return a “const char *”.
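
The behavior described above can be modeled with a small Python 3 class. This is a hypothetical sketch, not Chilkat's code: the class name, its methods, and the choice to hold the value internally as Unicode text are all assumptions made for illustration, and cp1252 stands in for ANSI.

```python
class Utf8FlagModel:
    """Hypothetical model of an object with a Utf8 property."""

    def __init__(self):
        self.utf8 = False    # default: byte strings are treated as ANSI
        self._subject = ""   # assumed internal storage: Unicode text

    def _charset(self) -> str:
        return "utf-8" if self.utf8 else "cp1252"

    def put_subject(self, raw: bytes) -> None:
        # Input bytes are interpreted according to the current flag.
        self._subject = raw.decode(self._charset())

    def subject(self) -> bytes:
        # Returned bytes are encoded according to the current flag.
        return self._subject.encode(self._charset())


obj = Utf8FlagModel()
obj.put_subject(b"caf\xe9")   # ANSI bytes, flag is False
ansi_out = obj.subject()

obj.utf8 = True               # same stored text, now returned as utf-8
utf8_out = obj.subject()

print(ansi_out.hex())  # 636166e9
print(utf8_out.hex())  # 636166c3a9
```

The stored value never changes between the two calls; only the byte representation crossing the boundary does.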