Unicode

Forget all about bytes and think of Unicode strings as sequences of symbols. Now, there are at least 4 ways to encode the Greek symbol Omega (Ω, or U+03A9) as binary:

Encoding name   Binary representation              Notes
ISO-8859-7      \xD9                               “Native” Greek encoding
UTF-8           \xCE\xA9
UTF-16          \xFF\xFE\xA9\x03
UTF-32          \xFF\xFE\x00\x00\xA9\x03\x00\x00
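
The table can be checked from Python itself. The following is a minimal sketch, written so that it runs on both Python 2 and Python 3. It assumes the UTF-16 and UTF-32 rows show a byte order mark (BOM) followed by little-endian data, so it builds those two values explicitly with the standard codecs module; the plain 'utf-16' and 'utf-32' codecs would pick the byte order of whatever machine runs the code. The variable names are illustrative only.

```python
import codecs

omega = u'\u03a9'  # GREEK CAPITAL LETTER OMEGA

encodings = {
    'iso-8859-7': omega.encode('iso-8859-7'),
    'utf-8': omega.encode('utf-8'),
    # BOM followed by the little-endian payload, as in the table rows
    'utf-16': codecs.BOM_UTF16_LE + omega.encode('utf-16-le'),
    'utf-32': codecs.BOM_UTF32_LE + omega.encode('utf-32-le'),
}

for name in sorted(encodings):
    # %r shows the raw bytes as escape sequences
    print("%-10s %r" % (name, encodings[name]))
```

One symbol, four different byte sequences: the bytes alone do not tell you which symbol they stand for unless you also know the encoding.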

The \u escape sequence denotes a Unicode code point by its hexadecimal number, so u'\u03A9' is the symbol Ω. This is somewhat like the traditional C-style \xNN escape, which inserts a single byte value.

When you convert Unicode symbols to bytes you are encoding; when you convert bytes back to Unicode symbols you are decoding. To encode the symbol Ω (u'\u03A9') to UTF-8 you use u'\u03A9'.encode('utf-8'), which results in '\xce\xa9'. To decode this bytestring back into Unicode you use unicode('\xce\xa9', 'utf-8'), which once again results in u'\u03a9'.
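
The round trip can be sketched in a few lines. This version uses the .decode() method instead of unicode(); in Python 2 the two are equivalent for this purpose, and .decode() also runs on Python 3, where unicode() no longer exists. The variable names are illustrative only.

```python
text = u'\u03a9'               # one Unicode symbol
data = text.encode('utf-8')    # encoding: symbols -> bytes
back = data.decode('utf-8')    # decoding: bytes -> symbols

print(len(text))       # length counted in symbols
print(len(data))       # length counted in bytes
print(back == text)    # the round trip loses nothing
```

Note that the two lengths differ: the single symbol occupies two bytes once it is encoded as UTF-8.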

Specific Notes About GTK/GLib/GObject

These libraries are used extensively in RabbitVCS. The standard way for GLib to pass around textual information is as a UTF-8 encoded string (a Glib::ustring in the C++ bindings).

This means that any time you get a textual result from a GLib/GTK/GObject method, you will need to decode it, like:

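from os.path import realpath
import gnomevfs

# item is assumed to be a file object handed to us by the file manager
path = realpath(
    # This! This is the decoding part:
    unicode(
        # The result of a GLib call:
        gnomevfs.get_local_path_from_uri(item.get_uri()),
        # UTF-8 because that's what the GLib docs say
        "utf-8"))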

Python does not know that this stream of bytes that came from a C library is actually a UTF-8 string. This is how you tell it.
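
To see why the decoding step matters, here is a small sketch that needs no GLib binding at all. The byte string raw_path is a made-up stand-in for what a GLib call might hand back; the path itself is hypothetical. The code runs on both Python 2 and Python 3.

```python
# Hypothetical stand-in for the result of a GLib call: raw UTF-8 bytes.
raw_path = b'/home/user/\xce\xa9.txt'

# Python has no idea which encoding these bytes use. Guessing wrong fails:
try:
    raw_path.decode('ascii')
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False

# Telling Python the real encoding gives proper text.
decoded_path = raw_path.decode('utf-8')

print(ascii_worked)
print(len(raw_path))      # counts bytes
print(len(decoded_path))  # counts symbols
```

Until it is decoded, the value is only a run of bytes, and anything that counts, slices, or compares it works on bytes rather than on the symbols a user sees.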

BEWARE! You need to think about this any time you manipulate data in the RabbitVCS code. Do not forget it just because you're rolling some utility function deep in the bowels of the library itself. The variable that you are working on, wherever it is, may actually be pulled from a list that came from a dict that was part of a tuple that resulted from a GLib function call several miles away.
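
One way to guard against this is to normalise values wherever they are about to be used. The helper below is a hypothetical sketch, not a function from the RabbitVCS code base: the name to_unicode, its default encoding, and the sample data are all assumptions made for illustration. It runs on both Python 2 and Python 3.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    # On Python 2, bytes is an alias for str, so this catches bytestrings
    # there as well; values that are already text pass through untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up data shaped like the warning above: a list inside a dict
# inside a tuple, mixing raw bytes with text that was already decoded.
status = ({'paths': [b'/tmp/\xce\xa9', u'/tmp/already-text']},)

paths = [to_unicode(p) for p in status[0]['paths']]
print(paths)
```

Because text passes through unchanged, calling the helper on a value that was already decoded is harmless, which makes it safe to apply deep inside utility functions that cannot know where their input came from.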

Resources