How to Validate UTF-8 Encoding? (The Simple UTF-8 Validation Algorithm)


bart-simpson-utf8 How to Validate UTF-8 Encoding? (The Simple UTF-8 Validation Algorithm) algorithms bit hacks c / c++ string

bart-simpson-utf8

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For 1-byte character, the first bit is a 0, followed by its unicode code.
For n-bytes character, the first n-bits are all one’s, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note:
The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:
data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.
Return true.

It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:
data = [235, 140, 4], which represented the octet sequence: 11101011 10001100 00000100.
Return false.

The first 3 bits are all one’s and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that’s correct.
But the second continuation byte does not start with 10, so it is invalid.

All you have to do is follow the rules. For a given integer, obtain its binary representation in the string form and work with the rules given in the problem.

An integer can either represent the start of a UTF-8 character, or a part of an existing UTF-8 character. There are two separate rules for these two scenarios in the problem.

If an integer is a part of an existing UTF-8 character, simply check the 2 most significant bits of in the binary representation string. They should be 10. If the integer represents the start of a UTF-8 character, then the first few bits would be 1 followed by a 0. The number of initial bits (most significant) bits determines the length of the UTF-8 character.

Note: The array can contain multiple valid UTF-8 characters.

String manipulation will work fine here. But, it is too slow. Can we instead use bit manipulation to do the validations instead of string manipulations?

We can use bit masking to check how many initial bits are set for a given number. We only need to work with the 8 least significant bits as mentioned in the problem.

1
2
3
4
mask = 1 << 7
while mask & num:
    n_bytes += 1
    mask = mask >> 1
mask = 1 << 7
while mask & num:
    n_bytes += 1
    mask = mask >> 1

Can you use bit-masking to perform the second validation as well i.e. checking if the most significant bit is 1 and the second most significant bit a 0?

To check if the most significant bit is a 1 and the second most significant bit is a 0, we can make use of the following two masks.

1
2
3
4
5
mask1 = 1 << 7
mask2 = 1 << 6
 
if not (num & mask1 and not (num & mask2)):
    return False
mask1 = 1 << 7
mask2 = 1 << 6

if not (num & mask1 and not (num & mask2)):
    return False

Simple UTF-8 Validation Algorithm in C++

The key to validate a UTF-8 encoded-stream is to check the bits. To check if a bit is set, we use the logical AND operator. For example, to check if the first bit (most significant) of a 8-bit byte is set, we can use (n & (1 << 7)). For example, we can use the following is10x function to check if a given byte starts with 10 (in binary).

1
2
3
4
5
inline bool is10x(int a) {
    int bit1 = (a >> 7) & 1;
    int bit2 = (a >> 6) & 1;
    return (bit1 == 1) && (bit2 == 0);
}
inline bool is10x(int a) {
    int bit1 = (a >> 7) & 1;
    int bit2 = (a >> 6) & 1;
    return (bit1 == 1) && (bit2 == 0);
}

With this, it is straightforward: by following the four rules, we check each possibility. If there is no enough data left, it is not a valid UTF-8 for instance, when we meet 11110xxx, there should be at least 3 bytes that start with 10xxx.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
class Solution {
public:
    bool validUtf8(vector<int>& data) {
        for (int i = 0; i < data.size(); ++ i) {
            // 0xxxxxxx
            int bit1 = (data[i] >> 7) & 1;
            if (bit1 == 0) continue;
            // 110xxxxx 10xxxxxx
            int bit2 = (data[i] >> 6) & 1;
            if (bit2 == 0) return false;
            // 11
            int bit3 = (data[i] >> 5) & 1;
            if (bit3 == 0) {
                // 110xxxxx 10xxxxxx
                if ((++ i) < data.size()) {
                    if (is10x(data[i])) {
                        continue;
                    }
                    return false;
                } else {
                    return false;
                }
            }                        
            int bit4 = (data[i] >> 4) & 1;
            if (bit4 == 0) {
                // 1110xxxx 10xxxxxx 10xxxxxx
                if (i + 2 < data.size()) {
                    if (is10x(data[i + 1]) && is10x(data[i + 2])) {
                        i += 2;
                        continue;
                    }
                    return false;
                } else {
                    return false;
                }                
            }            
            int bit5 = (data[i] >> 3) & 1;
            if (bit5 == 1) return false;
            if (i + 3 < data.size()) {
                if (is10x(data[i + 1]) && is10x(data[i + 2]) && is10x(data[i + 3])) {
                    i += 3;
                    continue;
                }
                return false;
            } else {
                return false;
            }                            
        }
        return true;
    }
private:
    inline bool is10x(int a) {
        int bit1 = (a >> 7) & 1;
        int bit2 = (a >> 6) & 1;
        return (bit1 == 1) && (bit2 == 0);
    }
};
class Solution {
public:
    bool validUtf8(vector<int>& data) {
        for (int i = 0; i < data.size(); ++ i) {
            // 0xxxxxxx
            int bit1 = (data[i] >> 7) & 1;
            if (bit1 == 0) continue;
            // 110xxxxx 10xxxxxx
            int bit2 = (data[i] >> 6) & 1;
            if (bit2 == 0) return false;
            // 11
            int bit3 = (data[i] >> 5) & 1;
            if (bit3 == 0) {
                // 110xxxxx 10xxxxxx
                if ((++ i) < data.size()) {
                    if (is10x(data[i])) {
                        continue;
                    }
                    return false;
                } else {
                    return false;
                }
            }                        
            int bit4 = (data[i] >> 4) & 1;
            if (bit4 == 0) {
                // 1110xxxx 10xxxxxx 10xxxxxx
                if (i + 2 < data.size()) {
                    if (is10x(data[i + 1]) && is10x(data[i + 2])) {
                        i += 2;
                        continue;
                    }
                    return false;
                } else {
                    return false;
                }                
            }            
            int bit5 = (data[i] >> 3) & 1;
            if (bit5 == 1) return false;
            if (i + 3 < data.size()) {
                if (is10x(data[i + 1]) && is10x(data[i + 2]) && is10x(data[i + 3])) {
                    i += 3;
                    continue;
                }
                return false;
            } else {
                return false;
            }                            
        }
        return true;
    }
private:
    inline bool is10x(int a) {
        int bit1 = (a >> 7) & 1;
        int bit2 = (a >> 6) & 1;
        return (bit1 == 1) && (bit2 == 0);
    }
};

The above UTF-8 validation algorithm written in C++ runs at O(N) time and O(1) constant space.

–EOF (The Ultimate Computing & Technology Blog) —

GD Star Rating
loading...
946 words
Last Post: Algorithms to Count the Number of Palindromic Substrings
Next Post: The Custom Sort String Algorithm with Count and Write

The Permanent URL is: How to Validate UTF-8 Encoding? (The Simple UTF-8 Validation Algorithm)

One Response

  1. Anon

Leave a Reply