Write Code vs. Write Poetry: UTF-8 Validation (Array)

Question: A string is encoded in UTF-8 as a sequence of bytes, where each character is composed of one or more bytes in the encoding. To figure out how many bytes are in a character, look at the leading number of ones in the binary representation of the character’s first byte:
0xxxxxxx - 1-byte character (i.e., 0xxxxxxx)
10xxxxxx - continuation byte (not valid as a leading byte)
110xxxxx - starts a 2-byte character (i.e., 110xxxxx 10xxxxxx)
1110xxxx - starts a 3-byte character (i.e., 1110xxxx 10xxxxxx 10xxxxxx)
11110xxx - starts a 4-byte character
…
11111111 - starts a 8-byte character
For example:
[11001101, 10000111, 00110000, 11101111, 10010011, 10000110] is a valid three-character string.
First character is 11001101, 10000111.
Second character is 00110000.
Third character is 11101111, 10010011, 10000110.
[01001110, 11000011, 11010101] is an invalid string.
3rd byte should be continuation byte.
[10111000, 00010010] is an invalid string.
1st byte is should be a non-continuation byte.
Write a function that determines whether a given byte array is a valid UTF-8-encoded string.

Idea: Brute force. For each input string array, scan from left to right. If it is a start character, check the next x strings whether it is 10xxxxxx format.

Time: O(n) Space: O(1)

Code:

 public class UTF8 {  
      public UTF8()  
      {  
           String[] input1={"11001101", "10000111", "00110000", "11101111", "10010011", "10000110"};  
           String[] input2={"01001110", "11000011", "11010101"};  
           String[] input3={"10111000", "00010010"};  
           String[] input4={"11001101", "10000111", "00110000", "11101111", "10010011", "10000110","10010011"};  
           System.out.println(checkUTF8(input1));  
           System.out.println(checkUTF8(input2));  
           System.out.println(checkUTF8(input3));  
           System.out.println(checkUTF8(input4));  
      }  
      public boolean checkUTF8(String[] input)  
      {  
           if(input==null||input.length==0)  
                return true;  
           int index=0;  
           while(index<input.length)  
           {  
                int type=getType(input[index]);  
                if(type==0)  
                     return false;  
                int end=index+type-1;  
                if(end>=input.length)  
                     return false;  
                while(end>index)  
                {  
                     if(getType(input[end])!=0)  
                          return false;  
                     end--;  
                }  
                index=index+type;  
           }  
           return true;  
      }  
      public int getType(String s)  
      {  
           int i=0;  
           for(;i<s.length();i++)  
           {  
                if(s.charAt(i)!='1')  
                     break;  
           }  
           if(i==0)  
                return 1;  
           else if(i==1)  
                return 0;  
           else  
                return i;  
      }  
 }

Write Code vs. Write Poetry

Tuesday, January 27, 2015

UTF-8 Validation (Array)

No comments:

Post a Comment