Avoid Non-usascii Literal Strings in Source Code

It’s generally not a good idea to use literal strings containing non-usascii chars in your source code, regardless of programming language.

For example, in C++ a literal string would be like this:

const char *s = "44ης Οδός, αρ.2";

Or perhaps in another programming language, such as DataFlex, it looks like this:

Move "44ης Οδός, αρ.4" to streetAddr

If you choose to put such strings directly in your source code anyway, you need to be aware of a few things:

(1) How does your editor save the source file? Does it save in utf-8? ANSI? utf-16? Make sure you understand that a charset (such as “utf-8”) defines the byte(s) that represent any given character, and also defines the set of characters that can be represented at all. ANSI charsets are typically 1-byte-per-char encodings capable of representing only the chars of the languages for the default locale (for example, Western European languages).

Consider this character: É

In the ANSI (iso-8859-1) character encoding, it is represented by a single byte: 0xC9

In the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89

In the utf-16 character encoding, it is also represented by two bytes: 0x00 0xC9 (big-endian) or 0xC9 0x00 (little-endian)
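To make the byte sequences above concrete, here is a minimal C++ sketch that produces them for É (U+00C9). The helper names are hypothetical (they do not come from any library), and the functions cover only the small code point ranges needed for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers: each returns the raw bytes that encode one code point.

// iso-8859-1 maps code points 0x00..0xFF directly to a single byte.
std::string encode_latin1(char32_t cp) {
    assert(cp <= 0xFF);  // higher code points have no iso-8859-1 representation
    return std::string(1, static_cast<char>(cp));
}

// utf-8 for code points up to U+07FF only (one or two bytes).
std::string encode_utf8_2byte(char32_t cp) {
    assert(cp <= 0x7FF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // 0xxxxxxx
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));  // 10xxxxxx
    }
    return out;
}

// utf-16 for code points up to U+FFFF (no surrogate pairs), either byte order.
std::string encode_utf16(char32_t cp, bool big_endian) {
    assert(cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    const char hi = static_cast<char>(cp >> 8);
    const char lo = static_cast<char>(cp & 0xFF);
    return big_endian ? std::string{hi, lo} : std::string{lo, hi};
}

// Checks the byte sequences listed in the text for U+00C9.
void check_encodings() {
    const char32_t e_acute = 0xC9;
    assert(encode_latin1(e_acute) == std::string("\xC9"));
    assert(encode_utf8_2byte(e_acute) == std::string("\xC3\x89"));
    assert(encode_utf16(e_acute, true) == std::string("\x00\xC9", 2));
    assert(encode_utf16(e_acute, false) == std::string("\xC9\x00", 2));
}
```

Note that the string literals in the sketch use hex escapes, so the sketch itself contains only us-ascii bytes.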

(2) What charset is the compiler or language interpreter expecting? Even if you save your source file containing non-usascii chars using utf-8, if the compiler interprets the bytes as something else, such as ANSI, then the chars in the string will not be correct. For example, if you save a source file containing “É” as utf-8, but the compiler interprets the bytes as iso-8859-1, then the utf-8 byte sequence 0xC3 0x89 will be interpreted as 2 individual ANSI chars: 0xC3 (“Ã”) and 0x89 (a control char in iso-8859-1, or “‰” in Windows-1252), which is not what you want.
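The mismatch described in (2) can be reproduced with a short C++ sketch. The helper latin1_to_utf8 is hypothetical: it does what a tool would do if it assumed its input was iso-8859-1 and converted it to utf-8. Given genuine iso-8859-1 input it works; given bytes that are already utf-8, it turns one character into two.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: treat every input byte as one iso-8859-1 character
// and re-encode the result as utf-8.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char b : latin1) {
        if (b < 0x80) {
            out += static_cast<char>(b);                  // us-ascii passes through
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation byte
        }
    }
    return out;
}

void check_mojibake() {
    // Correct use: the input really is iso-8859-1, so 0xC9 becomes 0xC3 0x89.
    assert(latin1_to_utf8("\xC9") == "\xC3\x89");

    // Wrong assumption: the input is already utf-8. Each of its two bytes is
    // taken for a separate character, so the output is four bytes long.
    assert(latin1_to_utf8("\xC3\x89") == "\xC3\x83\xC2\x89");
}
```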

(3) Is your programming language such that “strings” are actually nothing more than a pointer to a sequence of bytes terminated by a null? Or are “strings” actually objects? For example, in C++, PHP, and other languages, a literal string in source code is stored as bytes, whereas in C# and other languages, the bytes of the literal string are first interpreted according to the charset and then stored in an opaque way within the object.

For a language where the string is an object, such as C#, you’re OK as long as the compiler interprets the source code bytes using the correct charset (such as utf-8). Once the compiler/interpreter has correctly created the string object, it can be passed as an argument to functions without a problem, because the bytes are not exposed. (You’ll notice that these languages provide functions to get the string as a byte array in a particular charset. For example, in C# you have Encoding.UTF8.GetBytes.)

However, for a language where the string is just a pointer to a null-terminated sequence of bytes (or effectively the same, such as with PHP), you still have the problem where you must be careful when passing the pointer to functions. You need to know if the function is expecting utf-8 bytes, or ANSI bytes, or whatever. The byte representation (charset) must match what the function is expecting to receive. You might consider adding the charset name to the variable name to make it clear in the source code what you actually have. For example: const char *s_utf8; or in PHP: $str_utf8 = "44ης Οδός, αρ.2";
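As a small illustration of why the charset must match, consider the following C++ sketch. The function utf8_length is hypothetical and stands in for any function that expects utf-8 input. The variable names carry the charset, as suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical function that expects utf-8: counts characters (code points)
// by skipping continuation bytes, which have the bit pattern 10xxxxxx.
std::size_t utf8_length(const char* s_utf8) {
    std::size_t n = 0;
    for (; *s_utf8 != '\0'; ++s_utf8) {
        if ((static_cast<unsigned char>(*s_utf8) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

void check_byte_strings() {
    const char* s_utf8 = "\xC3\x89";  // "É" as utf-8 bytes
    const char* s_latin1 = "\xC9";    // "É" as iso-8859-1 bytes

    // strlen counts bytes, not characters, so the answer depends on the charset.
    assert(std::strlen(s_utf8) == 2);
    assert(std::strlen(s_latin1) == 1);

    // A function written for utf-8 gives the right answer only for utf-8 bytes.
    assert(utf8_length(s_utf8) == 1);
}
```

Nothing in the type const char* distinguishes s_utf8 from s_latin1, which is why the naming convention (or some other discipline) is needed.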

The Solution: Binary Encode the Literal Strings in your Source Code

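Instead of putting the actual literal string in your source code, binary encode it using either quoted-printable or base64. The encoded form contains only us-ascii chars, so it survives any editor or compiler charset setting. Put the encoded literal string in your source code and decode it at runtime. See the following examples: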

Android™ Decode Literal String
Classic ASP Decode Literal String
AutoIt Decode Literal String
C Decode Literal String
Chilkat2-Python Decode Literal String
C++ Decode Literal String
C# Decode Literal String
DataFlex Decode Literal String
Delphi DLL Decode Literal String
Visual FoxPro Decode Literal String
Go Decode Literal String
Java Decode Literal String
Node.js Decode Literal String
Objective-C Decode Literal String
Perl Decode Literal String
PHP Extension Decode Literal String
PowerBuilder Decode Literal String
PowerShell Decode Literal String
CkPython Decode Literal String
Ruby Decode Literal String
SQL Server Decode Literal String
Swift Decode Literal String
Tcl Decode Literal String
Unicode C Decode Literal String
Unicode C++ Decode Literal String
Visual Basic 6.0 Decode Literal String
VB.NET Decode Literal String
VBScript Decode Literal String
Xojo Plugin Decode Literal String
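Independent of the examples listed above, the decode-at-runtime idea can be sketched in plain C++ with no library at all. This is only an illustration under stated assumptions: decode_qp and hex_value are hypothetical helpers, the decoder handles just the "=XX" escapes of quoted-printable (no soft line breaks), and the encoded literal assumes the address string is utf-8 with precomposed Greek letters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode the "=XX" escapes of quoted-printable.
// Every other char is copied unchanged. Soft line breaks are not handled.
std::string decode_qp(const std::string& qp) {
    std::string out;
    for (std::size_t i = 0; i < qp.size(); ++i) {
        if (qp[i] == '=' && i + 2 < qp.size()) {
            const int hi = hex_value(qp[i + 1]);
            const int lo = hex_value(qp[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += qp[i];
    }
    return out;
}

// The source file contains only us-ascii bytes; the utf-8 bytes of
// "44ης Οδός, αρ.2" (assumed encoding) exist only after decoding.
const char* const kStreetAddrQp =
    "44=CE=B7=CF=82 =CE=9F=CE=B4=CF=8C=CF=82, =CE=B1=CF=81.2";

void check_decode() {
    const std::string street_addr_utf8 = decode_qp(kStreetAddrQp);
    assert(street_addr_utf8.size() == 23);  // 15 characters, 23 utf-8 bytes
}
```

Because the decoded bytes are utf-8 by construction, the result can be handed to any function that expects utf-8, no matter how the editor saved the file or what charset the compiler assumed.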