Monday, 18 December 2017

Named capture groups with NSRegularExpression in iOS 11 / High Sierra

After years of no change, Apple slipped a small improvement to NSRegularExpression into iOS 11 / High Sierra. The macOS 10.13 and iOS 11 Release Notes for the Cocoa Foundation Framework mention the updates to NSRegularExpression, but give little detail. So let’s explore and see if we can find out more. We’ve always had the ability to use indexed capture groups. However, as the complexity of a regular expression grows, numbered indexes can become unmanageable. They are also fragile: a change to the pattern can throw the numbering out of kilter.

Apple’s class reference for NSRegularExpression links to the ICU user guide, which lists the syntax for named groups as (?<name>pattern). So let’s experiment to see it in action.

NSRegularExpression is still very much an API that works with Objective-C’s NSString and its UTF-16 code unit model, so I’ll use a simple extension on String to make life easier.
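A minimal sketch of such an extension, assuming all we need is to convert between Swift string ranges and NSRange. The names fullNSRange and substring(nsRange:) are illustrative choices, not Foundation API; only the NSRange and Range initialisers they wrap come from Foundation.

```swift
import Foundation

// Illustrative helpers; these names are assumptions, not Foundation API.
extension String {
    /// NSRange spanning the whole string, counted in UTF-16 code units.
    var fullNSRange: NSRange {
        return NSRange(startIndex..<endIndex, in: self)
    }

    /// Substring for an NSRange, or nil when the range does not map onto
    /// this string (for example a location of NSNotFound).
    func substring(nsRange: NSRange) -> String? {
        guard let range = Range(nsRange, in: self) else { return nil }
        return String(self[range])
    }
}
```

The point of going through NSRange(_:in:) rather than using count is that NSRegularExpression measures in UTF-16 code units, so a single emoji can occupy two units even though Swift counts it as one Character.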


Consider a regex pattern for matching formatted U.S. domestic telephone numbers, such as (123) 456-7890.
\(\d{3}\)\s\d{3}-\d{4}
This is chosen for simplicity in demonstrating the point, rather than as the most flexible pattern for general-purpose matching. The pattern matches the following:
  1. an opening parenthesis
  2. three digits
  3. a closing parenthesis
  4. a single whitespace character
  5. three digits
  6. a hyphen
  7. four digits
Let’s say that we’re particularly interested in the area code, which is number 2 on the above list. While the pattern will match the whole telephone number, we can create a capture group around a section of interest, in this case the area code, by adding parentheses around it.
\((\d{3})\)\s\d{3}-\d{4}
This is what we’ve always been able to do, and we’d access this capture group at index 1, using the method range(at:) on NSTextCheckingResult. Index zero is reserved for matching the whole pattern.
let areaCodeRange = match.range(at: 1)
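To put that line in context, here is a small self-contained sketch of the indexed approach. The function name areaCodeByIndex is hypothetical; the pattern is the one above, with backslashes doubled for a Swift string literal.

```swift
import Foundation

// Hypothetical helper: returns the area code using the numbered capture group.
func areaCodeByIndex(in text: String) -> String? {
    // Same pattern as above; each backslash is escaped for the Swift literal.
    let pattern = "\\((\\d{3})\\)\\s\\d{3}-\\d{4}"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return nil
    }
    // NSRegularExpression works in UTF-16 based NSRange values.
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole) else {
        return nil
    }
    // Index 0 is the whole match; index 1 is the first capture group.
    guard let range = Range(match.range(at: 1), in: text) else { return nil }
    return String(text[range])
}
```

Calling areaCodeByIndex(in: "(123) 456-7890") should give "123", while input that does not fit the pattern gives nil.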
With named capture groups, however, rather than thinking of it as the capture group at index 1, we can name the capture group like so:
\((?<areacode>\d{3})\)\s\d{3}-\d{4}
This allows us to extract the area code using the new method range(withName:) on NSTextCheckingResult:
let areaCodeRange = match.range(withName: "areacode")
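The same lookup by name, as a self-contained sketch. The function name areaCodeByName is hypothetical, and the code assumes a deployment target of iOS 11 / macOS 10.13 or later, since range(withName:) is not available before that.

```swift
import Foundation

// Hypothetical helper: returns the area code using the named capture group.
// Assumes iOS 11 / macOS 10.13 or later for range(withName:).
func areaCodeByName(in text: String) -> String? {
    // The group is now labelled "areacode" instead of being addressed as index 1.
    let pattern = "\\((?<areacode>\\d{3})\\)\\s\\d{3}-\\d{4}"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return nil
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole) else {
        return nil
    }
    // Range(_:in:) yields nil if the NSRange cannot be mapped onto the string.
    guard let range = Range(match.range(withName: "areacode"), in: text) else {
        return nil
    }
    return String(text[range])
}
```

The named group is still group number 1 as well, so range(at: 1) keeps working on the same match; the name simply survives if further groups are later added in front of it.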
  

  
Named back references
Named capture groups are not just for extraction; they can also be used in back references, using the syntax \k<name>. For example, if we wanted to match a balanced pair of HTML tags, we can use a named capture group for the tag name and then use that name as a back reference to match the closing tag.

<(?<tag>\w+)[^>]*>(?<content>.*?)</\k<tag>>
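A short sketch of that pattern in use. The function name balancedTag is hypothetical, and as before the code assumes iOS 11 / macOS 10.13 or later.

```swift
import Foundation

// Hypothetical helper: returns the tag name and inner content of the first
// element whose closing tag repeats the opening tag name.
func balancedTag(in text: String) -> (tag: String, content: String)? {
    // \k<tag> must match the same text that the "tag" group captured.
    let pattern = "<(?<tag>\\w+)[^>]*>(?<content>.*?)</\\k<tag>>"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return nil
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole),
          let tagRange = Range(match.range(withName: "tag"), in: text),
          let contentRange = Range(match.range(withName: "content"), in: text) else {
        return nil
    }
    return (String(text[tagRange]), String(text[contentRange]))
}
```

With "<em>hi</em>" this should return the tag "em" and the content "hi", whereas "<em>hi</strong>" should return nil because the closing tag does not repeat the captured name. One caveat with the pattern as written: \w+ can backtrack to a shorter name and let [^>]* absorb the rest, so "<div>x</d>" would still match with the tag captured as "d". Adding \b after \w+ is one way to tighten it.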
