Thursday, June 16, 2011

JDK7: Named Capturing Group in RegEx


In a complicated regular expression, which might have many capturing and
non-capturing groups, the left to right group number counting might get a little confusing, and the expression itself (the group and its back reference) becomes hard to understand and trace. A natural solution is to, instead of counting it manually one by one and left to right, give each group a name, and then back reference it (in the same regex) or access the capturing match result (from MatchResult) by the assigned NAME, as what Python, PHP, .Net and Perl (5.10.0) do in their regex engines. This convenient feature has been missed in Java RegEx for years, now it finally got itself in JDK7 b50.

The newly added RegEx constructs to support the named capturing group are:
(1) (?<NAME>X) to define a named group NAME"                      
(2) \\k<Name> to backref a named group "NAME"                   
(3) <$<NAME> to reference to captured group in matcher's replacement str 
(4) group(String NAME) to return the captured input subsequence by the given "named group"

With these new constructs, now you can write something like 
    String pStr = "0x(?<bytes>\\\\p{XDigit}{1,4})\\\\s++u\\\\+(?<char>\\\\p{XDigit}{4})(?:\\\\s++)?";
    Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
    if (m.matches()) {
        int bs = Integer.valueOf(m.group("bytes"), 16);
        int c =  Integer.valueOf(m.group("char"), 16);
        System.out.printf("[%x] -> [%04x]%n", bs, c);
    }
    or
    System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+$<char> 0x$<bytes>"));

OK, examples above just show how to use these new constructs, not necessary mean they are "better":-) more "easy" to understand for me though. 

The method group(String name) is NOT added into MatchResult interface for the compatibility concern ( personally I don't think it's a big deal, my guess is the majority of RegEx users can just live with the Matcher class, the compatibility weighs more here. Let me know otherwise).

No comments :

Post a Comment