Guava Strings.java

Google Guava

Strings#repeat(String string, int count)

As I understood it, repeat(String string, int count) could be implemented simply with logic like the following:

// Naive version: append the input string `count` times.
final StringBuilder buf = new StringBuilder(string.length() * count);
for (int i = 0; i < count; i++) {
    buf.append(string);
}
return buf.toString();

Guava's Strings#repeat(String string, int count), however, is implemented like this:

public static String repeat(String string, int count) {
    checkNotNull(string); // eager for GWT.

    if (count <= 1) {
        checkArgument(count >= 0, "invalid count: %s", count);
        return (count == 0) ? "" : string;
    }

    // IF YOU MODIFY THE CODE HERE, you must update StringsRepeatBenchmark
    final int len = string.length();
    final long longSize = (long) len * (long) count;
    final int size = (int) longSize;
    if (size != longSize) {
        throw new ArrayIndexOutOfBoundsException(
            "Required array size too large: " + longSize);
    }

    final char[] array = new char[size];
    string.getChars(0, len, array, 0);
    int n;
    for (n = len; n < size - n; n <<= 1) {
        System.arraycopy(array, 0, array, n, n);
    }
    System.arraycopy(array, 0, array, n, size - n);
    return new String(array);
}

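The code itself is not hard to follow. Note the comment that loudly declares that if you modify the code here, you must update the benchmark as well. Evidently this code has been heavily optimized.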

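Briefly, here is what the code does. The real work starts right after that imposing comment. The first three lines widen the int operands to long and then narrow the product back to int, to make sure the length of the repeated string still fits in an int, the type Java uses for array and String lengths. Widening first and then truncating is an efficient way to detect overflow, and the same trick is common in C.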
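To see the widen-then-narrow check in isolation, here is a minimal sketch. The class OverflowCheck and the helper checkedSize are hypothetical names made up for this illustration, and the sketch assumes both arguments are non-negative, as they are at that point in repeat.

```java
public final class OverflowCheck {

    /**
     * Returns len * count as an int, or throws if the product does not fit.
     * Hypothetical helper; it mirrors the three lines quoted from repeat.
     */
    public static int checkedSize(int len, int count) {
        long longSize = (long) len * (long) count; // exact: the product of two ints always fits in a long
        int size = (int) longSize;                 // narrowing keeps only the low 32 bits
        if (size != longSize) {                    // a mismatch means high bits were lost
            throw new ArrayIndexOutOfBoundsException(
                "Required array size too large: " + longSize);
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(checkedSize(3, 4)); // prints 12

        // A plain int multiplication wraps around silently instead of failing.
        int a = 65536;
        int b = 65536;
        System.out.println(a * b); // prints 0

        try {
            checkedSize(a, b);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println(e.getMessage()); // Required array size too large: 4294967296
        }
    }
}
```

Since Java 8, Math.multiplyExact(int, int) offers a similar guarantee by throwing an ArithmeticException on overflow.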

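Next, instead of a StringBuilder, the code uses a char[] for performance and allocates an array of the final size up front. When copying, the length of the copied source doubles on each iteration, so the loop finishes after roughly log2(count) copies rather than count appends. System#arraycopy is a native method, implemented inside the JVM rather than in Java, so its performance seems a bit more trustworthy.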
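The doubling is easier to appreciate with numbers. The sketch below re-creates the copy loop in a hypothetical class DoublingCopyDemo and adds a hypothetical helper copyCalls that counts how many System.arraycopy calls the loop would make. It assumes count >= 1 and that len * count fits in an int, so the overflow check is left out.

```java
public final class DoublingCopyDemo {

    /** Repeats string by doubling the already-filled prefix of the array. */
    public static String repeatByDoubling(String string, int count) {
        int len = string.length();
        int size = len * count;               // assumption: no overflow
        char[] array = new char[size];
        string.getChars(0, len, array, 0);    // seed the array with one copy
        int n;
        for (n = len; n < size - n; n <<= 1) {
            System.arraycopy(array, 0, array, n, n); // filled prefix grows from n to 2n
        }
        System.arraycopy(array, 0, array, n, size - n); // top up whatever is left
        return new String(array);
    }

    /** Counts the System.arraycopy calls the loop above makes for the given sizes. */
    public static int copyCalls(int len, int count) {
        int size = len * count;
        int calls = 1; // the final top-up copy always runs
        for (int n = len; n < size - n; n <<= 1) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(repeatByDoubling("ab", 5)); // prints ababababab
        System.out.println(copyCalls(2, 5));           // prints 3
        System.out.println(copyCalls(2, 1000));        // prints 10
    }
}
```

Repeating a two-character string 1000 times therefore takes 10 bulk copies, where the StringBuilder loop would call append 1000 times.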

Strings#commonPrefix(CharSequence a, CharSequence b)

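What this method does is obvious: it finds the common prefix of two strings. The seemingly simple logic hides a subtlety, though, as a look at the implementation shows: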

/**
 * Returns the longest string {@code prefix} such that
 * {@code a.toString().startsWith(prefix) && b.toString().startsWith(prefix)},
 * taking care not to split surrogate pairs. If {@code a} and {@code b} have
 * no common prefix, returns the empty string.
 *
 * @since 11.0
 */
public static String commonPrefix(CharSequence a, CharSequence b) {
    checkNotNull(a);
    checkNotNull(b);

    int maxPrefixLength = Math.min(a.length(), b.length());
    int p = 0;
    while (p < maxPrefixLength && a.charAt(p) == b.charAt(p)) {
        p++;
    }
    if (validSurrogatePairAt(a, p - 1) || validSurrogatePairAt(b, p - 1)) {
        p--;
    }
    return a.subSequence(0, p).toString();
}

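You have surely noticed that the while loop is followed by a puzzling if. The function name contains a term I did not recognize, surrogate pair. After some searching, it turns out that supplementary characters on the Java platform are to blame. Put simply, the check tests whether the char at p - 1 together with the char after it forms a valid surrogate pair, that is, whether the prefix would end on the first half of a supplementary character. A supplementary character occupies two char values (two 16-bit UTF-16 code units, not two bytes): a high surrogate followed by a low surrogate. The method's Javadoc also says "taking care not to split surrogate pairs", and with that it all makes sense.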
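The post does not quote validSurrogatePairAt itself, so the body below is an assumption: a sketch of what such a check plausibly looks like, built only from the standard Character.isHighSurrogate and Character.isLowSurrogate methods. The class name SurrogatePairDemo is likewise made up for this illustration.

```java
public final class SurrogatePairDemo {

    /**
     * Returns true if the chars at index and index + 1 form a
     * high-surrogate/low-surrogate pair. Assumed implementation, not quoted from Guava.
     */
    public static boolean validSurrogatePairAt(CharSequence string, int index) {
        return index >= 0
            && index <= string.length() - 2
            && Character.isHighSurrogate(string.charAt(index))
            && Character.isLowSurrogate(string.charAt(index + 1));
    }

    public static void main(String[] args) {
        // "a" followed by one supplementary character (code point U+2F81B).
        String s = "a" + new String(Character.toChars(0x2F81B));

        System.out.println(s.length());                      // prints 3: char values
        System.out.println(s.codePointCount(0, s.length())); // prints 2: code points

        System.out.println(validSurrogatePairAt(s, 0)); // prints false: 'a' is not a surrogate
        System.out.println(validSurrogatePairAt(s, 1)); // prints true: a pair starts here
        System.out.println(validSurrogatePairAt(s, 2)); // prints false: no char follows index 2
    }
}
```

In commonPrefix the check runs at p - 1, so if the last matching char is a high surrogate, p is decremented and the half-finished pair is left out of the result.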

Understanding is one thing, but let's try it out ourselves:

import static com.google.common.base.Preconditions.checkNotNull;

import com.google.common.base.Strings;

public class CommonPrefixDemo {

    /**
     * Same as commonPrefix, but with the supplementary-character check removed.
     *
     * @see Strings#commonPrefix(CharSequence, CharSequence)
     */
    public static String commonPrefixWithoutIf(CharSequence a, CharSequence b) {
        checkNotNull(a);
        checkNotNull(b);

        int maxPrefixLength = Math.min(a.length(), b.length());
        int p = 0;
        while (p < maxPrefixLength && a.charAt(p) == b.charAt(p)) {
            p++;
        }
        return a.subSequence(0, p).toString();
    }

    public static void main(String[] args) {
        // Supplementary character 1
        String surrogatePair1 = String.valueOf(Character.toChars(0x2F81B));
        System.out.println("surrogatePair1:" + surrogatePair1 + " surrogatePair1 length: " + surrogatePair1.length());

        // Supplementary character 2
        String surrogatePair2 = String.valueOf(Character.toChars(0x2F81A));
        System.out.println("surrogatePair2:" + surrogatePair2 + " surrogatePair2 length: " + surrogatePair2.length());

        // commonPrefixWithoutIf drops this logic from commonPrefix:
        //   if (validSurrogatePairAt(a, p - 1) || validSurrogatePairAt(b, p - 1)) { p--; }
        // which makes the effect of supplementary characters easier to see.
        String containSurrogatePair1 = "汉字和数字123" + surrogatePair1 + "456";
        String containSurrogatePair2 = "汉字和数字123" + surrogatePair2 + "456";

        System.out.println("containSurrogatePair1: " + containSurrogatePair1);
        System.out.println("containSurrogatePair2: " + containSurrogatePair2);
        System.out.println("containSurrogatePair2.equals( containSurrogatePair1 ): " + containSurrogatePair2.equals(containSurrogatePair1));
        System.out.println("commonPrefix:" + Strings.commonPrefix(containSurrogatePair1, containSurrogatePair2));
        System.out.println("commonPrefixWithoutIf:" + commonPrefixWithoutIf(containSurrogatePair1, containSurrogatePair2));
    }
}

// Output:
//surrogatePair1:况 surrogatePair1 length: 2
//surrogatePair2:冬 surrogatePair2 length: 2
//containSurrogatePair1: 汉字和数字123况456
//containSurrogatePair2: 汉字和数字123冬456
//containSurrogatePair2.equals( containSurrogatePair1 ): false
//commonPrefix:汉字和数字123
//commonPrefixWithoutIf:汉字和数字123?

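Much easier to understand now, isn't it? The trailing ? printed by commonPrefixWithoutIf is the unpaired high surrogate left behind when the pair is split. The two supplementary characters share the same high surrogate, so it counts as part of the common prefix, but it cannot be rendered on its own.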
