0xxxxxxx - 1-byte character (i.e., 0xxxxxxx)
10xxxxxx - continuation byte (not valid as a leading byte)
110xxxxx - starts a 2-byte character (i.e., 110xxxxx 10xxxxxx)
1110xxxx - starts a 3-byte character (i.e., 1110xxxx 10xxxxxx 10xxxxxx)
11110xxx - starts a 4-byte character
…
11111111 - starts a 8-byte character
For example:
[11001101, 10000111, 00110000, 11101111, 10010011, 10000110] is a valid three-character string.
First character is 11001101, 10000111.
Second character is 00110000.
Third character is 11101111, 10010011, 10000110.
[01001110, 11000011, 11010101] is an invalid string.
3rd byte should be continuation byte.
[10111000, 00010010] is an invalid string.
1st byte is should be a non-continuation byte.
Write a function that determines whether a given byte array is a valid UTF-8-encoded string.
Idea: Brute force. For each input string array, scan from left to right. If it is a start character, check the next x strings whether it is 10xxxxxx format.
Time: O(n) Space: O(1)
Code:
public class UTF8 {
public UTF8()
{
String[] input1={"11001101", "10000111", "00110000", "11101111", "10010011", "10000110"};
String[] input2={"01001110", "11000011", "11010101"};
String[] input3={"10111000", "00010010"};
String[] input4={"11001101", "10000111", "00110000", "11101111", "10010011", "10000110","10010011"};
System.out.println(checkUTF8(input1));
System.out.println(checkUTF8(input2));
System.out.println(checkUTF8(input3));
System.out.println(checkUTF8(input4));
}
public boolean checkUTF8(String[] input)
{
if(input==null||input.length==0)
return true;
int index=0;
while(index<input.length)
{
int type=getType(input[index]);
if(type==0)
return false;
int end=index+type-1;
if(end>=input.length)
return false;
while(end>index)
{
if(getType(input[end])!=0)
return false;
end--;
}
index=index+type;
}
return true;
}
public int getType(String s)
{
int i=0;
for(;i<s.length();i++)
{
if(s.charAt(i)!='1')
break;
}
if(i==0)
return 1;
else if(i==1)
return 0;
else
return i;
}
}
No comments:
Post a Comment