Get Started: Unicode in Delphi XE3


Delphi has moved to the era of Unicode. Since Delphi 2009, the string type is an alias for UnicodeString instead of AnsiString. How does this affect project migration? I have an ongoing Delphi project that was originally compiled under Delphi 2007, and I am now moving the code to Delphi XE3 and compiling it as 64-bit. In my case there are two issues to deal with: Unicode and 64-bit. The 64-bit side turned out to be less trouble than Unicode. The advantage of Unicode is that international characters (such as Chinese) are handled easily, because each character is stored in 2 or 4 bytes rather than the single byte per element of AnsiString. As a result, there are conversion issues between the two string types. In my experience, I got 400 warnings when I first compiled the source with Unicode strings but, surprisingly, 2000 warnings when I kept the old-style strings (replacing every String with AnsiString).

Since Delphi 3 we have had WideString, whose basic element is WideChar (two bytes); an AnsiString is an array of AnsiChar (one byte each). Before Delphi 2009, string was an alias for AnsiString and Char an alias for AnsiChar. From Delphi 2009 onwards they are aliases for UnicodeString and WideChar respectively.
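One practical consequence of Char becoming WideChar is that the byte size of a Char buffer and its element count no longer coincide. The following is a minimal sketch, assuming a Unicode-enabled compiler (Delphi 2009 or later); the program name and the Buf variable are made up for illustration.

```delphi
program CharSizes;
{$APPTYPE CONSOLE}
var
  Buf: array[0..255] of Char;   // Char is WideChar on Delphi 2009 and later
begin
  Writeln(SizeOf(Char));        // 2
  Writeln(SizeOf(AnsiChar));    // 1
  Writeln(Length(Buf));         // 256 (number of elements)
  Writeln(SizeOf(Buf));         // 512 (number of bytes)
  // When an API expects a byte count, use SizeOf; when it expects a
  // character count, use Length. Here the whole buffer is cleared by bytes.
  FillChar(Buf, SizeOf(Buf), 0);
end.
```

Old code that assumed one byte per Char (for example, passing 256 as the byte size of this buffer) is the kind of place worth checking during migration.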

If the project uses Char arrays as buffers, such as arr: array[0..255] of Char, then you have to check whether the code still works under Delphi XE3, because each Char now takes 2 bytes. WideString is intended for holding international characters but, unfortunately, it is not reference-counted, so it performs worse than AnsiString. Delphi uses copy-on-write for assignments of reference-counted strings: assigning one string variable to another only copies the pointer at first, and the string data is copied only when the content changes. For example:

program stringtest;
{$APPTYPE CONSOLE}
var
  a, b: string;
begin
  a := 'Hello';
  b := a; // not copy yet
  Writeln(StringRefCount(a), StringRefCount(b));   // prints 22
  a[1] := 't';  // copy now
  Writeln(StringRefCount(a), StringRefCount(b));   // prints 11
end.

We can use the StringRefCount function to see how many string variables refer to the same string data. When the reference count drops to zero, Delphi automatically frees the memory.

If the code copies strings between variables a lot but rarely changes their content, copy-on-write will be faster than using WideString.

UnicodeString is reference-counted, just like AnsiString. Therefore, migrating to Unicode will not incur much of a performance hit, even though the strings take twice the memory.

Embrace Unicode! Don’t be afraid of the performance hit. Performance suffers mainly when you have lots of conversions between AnsiString, UnicodeString, WideString and other string types such as ShortString (maximum 255 characters, kept for backward compatibility, e.g. string[255]). Since Delphi XE3 ships a Unicode version of the VCL (Visual Component Library) and of the runtime functions, you can simply keep the string declarations in most cases, and the compiler will pick the Unicode version of each function for you.
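Where a conversion between string types cannot be avoided, making it explicit keeps it visible in the code and avoids the compiler's implicit string cast warnings. A minimal sketch, assuming Delphi 2009 or later; the variable names are made up for illustration.

```delphi
program ExplicitCasts;
{$APPTYPE CONSOLE}
var
  U: string;        // UnicodeString on Delphi 2009 and later
  A: AnsiString;
begin
  U := 'Hello';
  // Explicit narrowing: converts to the ANSI code page and may lose
  // characters that the code page cannot represent.
  A := AnsiString(U);
  // Explicit widening back to UnicodeString.
  U := string(A);
  Writeln(U);       // Hello
end.
```

Each cast still costs a conversion at run time, so the aim is to keep such lines at the boundaries of the program (file formats, legacy APIs) rather than inside loops.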

Most string functions, such as Copy, Length and Pos, have Unicode versions that behave sensibly. Take a look at the following code.

[Image: unicode2, source code of the example]

We define two string variables: s0 is a UnicodeString and s1 is an old-style AnsiString. Both are assigned four Chinese characters followed by a ‘!’. Without Unicode/WideString, the string is treated as 9 single-byte characters, which is not what we wanted in the first place. With Unicode, we can handle each two-byte character independently, as a single WideChar. The following output illustrates the differences.

[Image: unicode1, program output]

The Length function returns 5 (= 4 + 1) for the UnicodeString and 9 (= 4 * 2 + 1) for the AnsiString. For both of these reference-counted types, SizeOf returns the size of the pointer to the string data (8 in this case, because the program runs as 64-bit). For a ShortString, however, it returns the size of the string itself, since a ShortString is laid out like a static array of characters.

The StringElementSize function returns the size in bytes of a single element of the string. The Copy function clearly shows the difference: in the Unicode version it counts in WideChar units, whereas in the Ansi version it counts single bytes. That is fine if the string contains only two-byte characters (you can multiply the index by two), but in most cases letters and two-byte characters are mixed, and a two-byte character can be cut in half, giving unexpected results.
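The behaviour described above can be reproduced with a small console program. This is only a sketch in the spirit of the example, not the original listing: the type TGbkString, the use of code page 936 (GBK, two bytes per Chinese character) and the particular four characters are assumptions chosen so that the result does not depend on the system locale. It assumes code page 936 is available on the machine.

```delphi
program UnicodeSketch;
{$APPTYPE CONSOLE}
type
  // Assumption: an AnsiString tied to code page 936 (GBK).
  TGbkString = type AnsiString(936);
var
  s0: string;        // UnicodeString
  s1: TGbkString;    // old-style, byte-based string
begin
  // Four Chinese characters given as UTF-16 code units, followed by '!'.
  s0 := #$4F60#$597D#$4E16#$754C'!';
  s1 := TGbkString(s0);              // explicit conversion to GBK bytes

  Writeln(Length(s0));               // 5 = 4 + 1
  Writeln(Length(s1));               // 9 = 4 * 2 + 1
  Writeln(SizeOf(s0));               // pointer size: 4 on Win32, 8 on Win64
  Writeln(SizeOf(ShortString));      // 256: length byte plus 255 characters
  Writeln(StringElementSize(s0));    // 2
  Writeln(StringElementSize(s1));    // 1
  Writeln(Length(Copy(s0, 1, 3)));   // 3 whole characters
  Writeln(Length(Copy(s1, 1, 3)));   // 3 bytes, i.e. one and a half characters
end.
```

Only numbers are printed, so the sketch does not depend on the console being able to display Chinese text.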

In Delphi 2007, the third parameter of Copy (the count) can be omitted, in which case the result runs to the end of the string. In my Delphi XE3 build, however, the two-parameter form returned an unexpected result, so it is safer to pass the count explicitly (MaxInt means "up to the end of the string").

s := '123456';
Writeln(Copy(s, 2));          // 23456 in Delphi 2007; empty string in my Delphi XE3 build
Writeln(Copy(s, 2, MaxInt));  // 23456: explicit count, copies to the end of the string

The Pos function works on UnicodeString, so passing it an AnsiString can cause an implicit cast from AnsiString to UnicodeString. In the example above, Pos returns 3 (the third Chinese character) for the UnicodeString, whereas the byte-based position in the AnsiString is 5.
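A minimal sketch of the Unicode case, assuming Delphi 2009 or later; the four characters and the variable names are made up for illustration and are not taken from the original listing.

```delphi
program PosSketch;
{$APPTYPE CONSOLE}
var
  s, sub: string;    // both UnicodeString
begin
  // Four Chinese characters given as UTF-16 code units, followed by '!'.
  s := #$4F60#$597D#$4E16#$754C'!';
  sub := #$4E16;     // the third character of s
  // Pos counts in WideChar units, so the index is the character position.
  Writeln(Pos(sub, s));   // 3
end.
```

In a byte-based encoding that stores each of these characters in two bytes, the same character starts at byte 2 * 2 + 1 = 5, which is where the difference described above comes from.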

In Delphi XE3, PChar (pointer to Char) is an alias for PWideChar. PAnsiChar is still available for pointing to AnsiChar (single-byte characters).
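The difference shows up in pointer arithmetic. A minimal sketch, assuming Delphi 2009 or later; the variable names are made up for illustration.

```delphi
program PCharSketch;
{$APPTYPE CONSOLE}
var
  s: string;         // UnicodeString
  a: AnsiString;
  p: PChar;          // PWideChar on Delphi 2009 and later
  pa: PAnsiChar;
begin
  s := 'Hi';
  a := 'Hi';
  p := PChar(s);
  pa := PAnsiChar(a);
  Writeln(SizeOf(p^));    // 2: each element is a WideChar
  Writeln(SizeOf(pa^));   // 1: each element is an AnsiChar
  Inc(p);                 // moves forward by one Char, i.e. two bytes
  Writeln(p^);            // i
end.
```

Code that passes PChar to an external routine expecting single-byte text is a typical place where PAnsiChar (together with an AnsiString) is needed after migration.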
