A question about regular expressions, for the SCJP exam

sinkeler · posted 2008-1-25 13:46

Well, following your tips, I'm gonna go through the JDK source code. Catch you later.

silver_xie · posted 2008-1-25 13:57

I ran into another problem.
With pattern="\\d*":
m.groupCount()=0

And with:
Pattern is (\d)*
0
0 ><1
1
1 ><1
2
2 >34<1
3
4 ><1
4
5 ><1
5
6 ><1

m.groupCount()=1
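
The two groupCount() values above can be checked directly. This is only a minimal sketch: the class name GroupCountDemo and the helper countGroups are made up for illustration. groupCount() reports how many capturing groups the pattern declares, so it does not depend on the input or on whether find() has been called yet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {

    // Hypothetical helper: number of capturing groups declared in the pattern.
    static int countGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.groupCount(); // group zero (the whole match) is not counted
    }

    public static void main(String[] args) {
        System.out.println(countGroups("\\d*", "ab34ef"));   // pattern without parentheses
        System.out.println(countGroups("(\\d)*", "ab34ef")); // pattern with one pair of parentheses
    }
}
```

So the difference between 0 and 1 comes from the parentheses in the pattern, not from what was matched.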

sinkeler · posted 2008-1-25 14:13

OMG, this is getting out of control.

I've run into the same problem as you, too.

[ Last edited by sinkeler on 2008-1-25 14:16 ]

silver_xie · posted 2008-1-25 14:16

aha......go~~~~~~~~~~

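int start(int group)
Returns the start index of the subsequence captured by the given group during the previous match operation.
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.start(0) is equivalent to m.start().

Parameters:
group - The index of a capturing group in this matcher's pattern
Returns:
The index of the first character captured by the group, or -1 if the match was successful but the group itself did not match anything
Throws:
IllegalStateException - If no match has yet been attempted, or if the previous match operation failed
IndexOutOfBoundsException - If there is no capturing group in the pattern with the given index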
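
To see the -1 case of start(int group) in practice, here is a small sketch using the (\d)* pattern and the "ab34ef" input from this thread. The class name StartGroupDemo and the helper startOf are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartGroupDemo {

    // Hypothetical helper: start(group) after the nth successful find() (1-based).
    static int startOf(String regex, String input, int nth, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        for (int i = 0; i < nth; i++) {
            if (!m.find()) {
                throw new IllegalArgumentException("fewer than " + nth + " matches");
            }
        }
        return m.start(group);
    }

    public static void main(String[] args) {
        String regex = "(\\d)*";
        String input = "ab34ef";
        // 1st match: empty match at index 0, so group 1 never took part
        System.out.println(startOf(regex, input, 1, 0)); // 0, same as start()
        System.out.println(startOf(regex, input, 1, 1)); // -1
        // 3rd match: "34"; group 1 holds the last digit it matched, the "4"
        System.out.println(startOf(regex, input, 3, 0)); // 2
        System.out.println(startOf(regex, input, 3, 1)); // 3
    }
}
```

Group 1 is the first (and here the only) pair of parentheses, and group 0 is the whole match, which is what "starting at one" refers to.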

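What does the part in red mean??? Starting at one???
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)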
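
The numbering can be confirmed by matching "ABC" against ((A)(B(C))) and reading each group back. This is a sketch only; GroupNumberingDemo and groupsOf are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingDemo {

    // Hypothetical helper: group(0)..group(groupCount()) after a full match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match the pattern");
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int g = 0; g <= m.groupCount(); g++) {
            groups[g] = m.group(g);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("((A)(B(C)))", "ABC");
        for (int g = 0; g < groups.length; g++) {
            // Index 0 is the whole match; 1..4 follow the opening parentheses left to right
            System.out.println(g + "    " + groups[g]);
        }
    }
}
```

Groups 0 and 1 both come back as "ABC" here, because the outermost parentheses wrap the entire expression.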

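Group zero always stands for the entire expression. group(0) is always the entire match, so start(0) = start() and likewise group(0) = group().

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.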
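
The "aba" example from the documentation can be tried out as follows. This is a minimal sketch, with RetainedGroupDemo and groupAfterMatch as made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RetainedGroupDemo {

    // Hypothetical helper: the text held by one group after a full match.
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match the pattern");
        }
        return m.group(group);
    }

    public static void main(String[] args) {
        String regex = "(a(b)?)+";
        // The 1st pass of the outer group consumes "ab", the 2nd pass consumes only "a"
        System.out.println(groupAfterMatch(regex, "aba", 0)); // whole match
        System.out.println(groupAfterMatch(regex, "aba", 1)); // most recent pass of group 1
        System.out.println(groupAfterMatch(regex, "aba", 2)); // value kept from the 1st pass
    }
}
```

Group 1 ends up as "a" (its most recent pass), while group 2 keeps the "b" from the first pass because (b)? matched nothing the second time.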

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
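
As a rough illustration of that last point, the sketch below compares a capturing group with the non-capturing form (?:...). The names NonCapturingDemo, countGroups and firstGroup are assumptions for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {

    // Hypothetical helper: how many capturing groups the pattern declares.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: text of group 1 after a full match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match the pattern");
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(countGroups("(\\d)*"));   // capturing: counted
        System.out.println(countGroups("(?:\\d)*")); // non-capturing: not counted
        // (?:a) is skipped in the numbering, so group 1 is (b)
        System.out.println(firstGroup("(?:a)(b)", "ab"));
    }
}
```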

[ Last edited by silver_xie on 2008-1-25 14:24 ]

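sinkeler · posted 2008-1-25 14:30

I think a capturing group is just the object that Pattern builds, that is, the regular expression.

It's only that the expression is complicated, so we break it apart to understand it; put the pieces back together and you get the capturing groups.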

silver_xie · posted 2008-1-25 14:35

The way I understand it, it's the set of results that can be matched...

Argh, not understanding something really weighs on my mind...

[ Last edited by silver_xie on 2008-1-25 14:36 ]

mzh120120 · posted 2008-1-25 15:26

Learning from this thread.

sinkeler · posted 2008-1-25 15:29

I found some information on this question.

After some tips, I modified the code as below.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex2 {

    public static void main(String[] args) {
        // "\\w" matches exactly one word character per find()
        Pattern p = Pattern.compile("\\w");
        Matcher m = p.matcher("ab34ef");

        int i = 0;
        // find() moves on to the next match and returns false when none is left
        while (m.find()) {
            System.out.println("///////////////////////////////");
            System.out.println("Time is " + ++i);
            System.out.println("m.start is " + m.start());
            System.out.println("m.end is " + m.end());
            System.out.println("m.group is " + m.group());
            System.out.println("///////////////////////////////");
        }
    }
}

So the output is:
///////////////////////////////
Time is 1
m.start is 0
m.end is 1
m.group is a
///////////////////////////////
///////////////////////////////
Time is 2
m.start is 1
m.end is 2
m.group is b
///////////////////////////////
///////////////////////////////
Time is 3
m.start is 2
m.end is 3
m.group is 3
///////////////////////////////
///////////////////////////////
Time is 4
m.start is 3
m.end is 4
m.group is 4
///////////////////////////////
///////////////////////////////
Time is 5
m.start is 4
m.end is 5
m.group is e
///////////////////////////////
///////////////////////////////
Time is 6
m.start is 5
m.end is 6
m.group is f
///////////////////////////////

If the pattern is changed to "\\w*", the output becomes:

///////////////////////////////
Time is 1
m.start is 0
m.end is 6
m.group is ab34ef
///////////////////////////////
///////////////////////////////
Time is 2
m.start is 6
m.end is 6
m.group is
///////////////////////////////

Well, judging from this output, maybe I had misunderstood what the index means.

  a   b   3   4   e   f
0   1   2   3   4   5   6

As the drawing above shows (it does look like a drawing, doesn't it?), the string "a" lies between index 0 and index 1. So we can conclude that two indexes, a start and an end, locate a substring.

[ Last edited by sinkeler on 2008-1-25 16:43 ]

sinkeler · posted 2008-1-25 15:31

But back to our problem: the pattern is actually "\\d*".

Then the output is:

///////////////////////////////
Time is 1
m.start is 0
m.end is 0
m.group is
///////////////////////////////
///////////////////////////////
Time is 2
m.start is 1
m.end is 1
m.group is
///////////////////////////////
///////////////////////////////
Time is 3
m.start is 2
m.end is 4
m.group is 34
///////////////////////////////
///////////////////////////////
Time is 4
m.start is 4
m.end is 4
m.group is
///////////////////////////////
///////////////////////////////
Time is 5
m.start is 5
m.end is 5
m.group is
///////////////////////////////
///////////////////////////////
Time is 6
m.start is 6
m.end is 6
m.group is
///////////////////////////////

sinkeler · posted 2008-1-25 15:35

So here is a working assumption: the empty string "" is among the things the regular expression engine can match, especially when * is present.

Here is the relevant passage from the JDK 1.5 documentation:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
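group
public String group()
Returns the input subsequence matched by the previous match.
For a matcher m with input sequence s, the expressions m.group() and s.substring(m.start(), m.end()) are equivalent.

Note that some patterns, for example a*, match the empty string. This method will return the empty string when the pattern successfully matches the empty string in the input.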
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
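
Both statements in that passage can be checked with a few lines. This is only a sketch under the assumption that a*, the pattern the documentation mentions, is run against a made-up input "baab"; GroupSubstringDemo and its two helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSubstringDemo {

    // Hypothetical helper: true if every match satisfies
    // m.group().equals(s.substring(m.start(), m.end())).
    static boolean groupEqualsSubstring(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        boolean sawMatch = false;
        while (m.find()) {
            sawMatch = true;
            if (!m.group().equals(s.substring(m.start(), m.end()))) {
                return false;
            }
        }
        return sawMatch;
    }

    // Hypothetical helper: number of matches whose start equals their end.
    static int emptyMatches(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        int count = 0;
        while (m.find()) {
            if (m.start() == m.end()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(groupEqualsSubstring("a*", "baab"));
        // Empty matches occur wherever no 'a' can be consumed, including the end of the input
        System.out.println(emptyMatches("a*", "baab"));
    }
}
```

An empty match has start equal to end, so substring(start, end) is "" and group() returns "" as well.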

To my mind, that means something special happens when the symbol "*" is present.

So the empty string gets matched too. The greedy quantifier "*" consumes as much of the input as possible, so, taking "\\w*" as an example, the first match swallows the whole string and the index moves to the end of the data, which is 6, the position just after "f". The next find() then starts at 6 and matches the empty string there, which is why start and end are both 6 for the second match.
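
The two outputs discussed in this thread can be reproduced in compact form by recording each match as start-end:text. This is a minimal sketch; the class name MatchSpansDemo, the helper spans and the output format are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpansDemo {

    // Hypothetical helper: one "start-end:text" entry per successful find().
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy \w* takes all six characters, then matches "" once more at index 6
        System.out.println(spans("\\w*", "ab34ef"));
        // \d* can only consume "34"; at every other position it matches ""
        System.out.println(spans("\\d*", "ab34ef"));
    }
}
```

Entries whose start and end are equal are the empty matches, so "\\w*" yields one of them and "\\d*" yields five.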

[ Last edited by sinkeler on 2008-1-25 16:58 ]
