Source Code

[This article assumes that you have a basic understanding of regular expressions.]

Regular expressions (referred to as regex in the rest of the article) are essentially a tiny, highly specialized programming language embedded in Python. The 're' module, added in Python 1.5, provides regular expression support. It is recommended to compile frequently used regular expressions with re.compile(), which converts the regex pattern into bytecode that is executed by a matching engine written in C. This article covers some of the lesser-known or less used features of regex in Python.
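As a small sketch of the compile-once, reuse-many-times idea, the snippet below compiles a pattern a single time and applies it to several strings. The pattern and the sample lines are made up for illustration and are not part of the original examples.

```python
import re

# Hypothetical pattern: whole words ending in 'ing'. Compiled once, reused below.
ing_expr = re.compile(r'\b\w+ing\b')

# Hypothetical sample input.
lines = ['Monty is flying a circus', 'nothing to see here', 'no match']

# The same compiled object is applied to every line.
matches = [ing_expr.findall(line) for line in lines]
print(matches)
```

The compiled object exposes the same search(), match(), findall() and finditer() methods that the module-level functions offer.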

Using compilation flags

Compilation flags allow users to alter the way a regex works, e.g. re.IGNORECASE (short form re.I) performs case-insensitive matching. Note that I'm using raw Python strings (r'') for specifying the regular expression pattern to avoid messy backslashes.

>>> import re

>>> re.search(r'(python)', "Monty Python's flying circus", flags=re.I).groups()

('Python',)

>>> re.search(r'(python)', "Monty Python's flying circus", flags=re.IGNORECASE).groups()

('Python',)
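Flags can also be combined with the bitwise OR operator. The sketch below pairs re.IGNORECASE with re.VERBOSE, which ignores whitespace in the pattern and allows comments inside it. The pattern itself is an illustrative assumption, not taken from the examples above.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats '#' as a comment start.
# Flags are combined with '|'.
circus_expr = re.compile(r'''
    (monty)      # first word, in any case
    \s+          # one or more whitespace characters
    (python)     # second word, in any case
''', re.IGNORECASE | re.VERBOSE)

groups = circus_expr.search("Monty Python's flying circus").groups()
print(groups)
```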

Search all

findall() and finditer() find all non-overlapping substrings where the regex matches the string and return them as a list or an iterator of match objects respectively. You can use the start() and end() methods of a match object to find the start and end index of the matched substring. The following example prints the start and end index of every instance of 'eat' in the string 'rhyme'.

>>> rhyme = 'I like to eat, eat, eat, apples and bananas.'

>>> eat_expr = re.compile(r'eat')

>>> for match in eat_expr.finditer(rhyme):
...     print("Match found from %d-%d" % (match.start(), match.end()))

Match found from 10-13
Match found from 15-18
Match found from 20-23
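For comparison, here is a short sketch of the same search using findall(), which returns only the matched strings, next to finditer(), whose match objects also carry the positions. It reuses the 'rhyme' string from above; span() returns the (start, end) pair in one call.

```python
import re

rhyme = 'I like to eat, eat, eat, apples and bananas.'
eat_expr = re.compile(r'eat')

# findall() gives the matched text only.
all_matches = eat_expr.findall(rhyme)

# finditer() gives match objects, so positions are available too.
spans = [match.span() for match in eat_expr.finditer(rhyme)]

print(all_matches)
print(spans)
```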

Using symbolic group names

(?P<name>...) assigns a user-defined name to a capturing group, which is much cleaner to use than Perl's $1, $2, etc. magic variables. The following example finds the number of individuals per race in the Fellowship of the Ring.

>>> fellowship = "#human=2 #dwarf=1 #hobbit=4 #elf=1 #wizard=1"

>>> count_expr = re.compile(r'(?P<race>#[a-z]+?)=(?P<count>\d+)')

>>> for match in count_expr.finditer(fellowship):
...     print("race=%s count=%d" % (match.group('race'), int(match.group('count'))))

race=#human count=2

race=#dwarf count=1

race=#hobbit count=4

race=#elf count=1

race=#wizard count=1
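Named groups can also be read all at once with the match object's groupdict() method. The sketch below builds a dictionary of counts from the same 'fellowship' string. As a small variation on the pattern above, I've assumed the '#' should sit outside the 'race' group so that the keys are bare race names.

```python
import re

fellowship = "#human=2 #dwarf=1 #hobbit=4 #elf=1 #wizard=1"

# Variation on the article's pattern: '#' is matched but not captured.
count_expr = re.compile(r'#(?P<race>[a-z]+)=(?P<count>\d+)')

counts = {}
for match in count_expr.finditer(fellowship):
    fields = match.groupdict()          # e.g. {'race': 'human', 'count': '2'}
    counts[fields['race']] = int(fields['count'])

total = sum(counts.values())
print(counts)
print(total)
```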

Lookahead assertions

(?=...), a.k.a. lookahead assertion, matches if ... matches at the current position, but does not consume any of the string.

Similarly (?!...), a.k.a. negative lookahead assertion, matches if ... does not match at the current position, and again consumes nothing. The following examples demonstrate the usage of these expressions.

>>> hobbits = ['Frodo Baggins', 'Meriadoc Brandybuck', 'Bilbo Baggins', 'Samwise Gamgee', 'Peregrin Took']

# Find all Hobbits whose last name is Baggins.

>>> baggins_expr = re.compile(r'(\w+) (?=Baggins)')

>>> [hobbit for hobbit in hobbits if baggins_expr.match(hobbit)]

['Frodo Baggins', 'Bilbo Baggins']

# Find all Hobbits whose last name is not Baggins.

>>> non_baggins_expr = re.compile(r'(\w+) (?!Baggins)')

>>> [hobbit for hobbit in hobbits if non_baggins_expr.match(hobbit)]

['Meriadoc Brandybuck', 'Samwise Gamgee', 'Peregrin Took']
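To see what "does not consume" means in practice, the sketch below inspects the match object for 'Frodo Baggins'. The lookahead confirms that 'Baggins' follows, but the matched text stops just before it.

```python
import re

baggins_expr = re.compile(r'(\w+) (?=Baggins)')
match = baggins_expr.match('Frodo Baggins')

matched_text = match.group(0)   # the whole match: first name plus the space
first_name = match.group(1)     # the captured first name
end_index = match.end()         # where the match stops, before 'Baggins'

print(repr(matched_text), first_name, end_index)
```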

Matching an already matched group

(?P=name) is a backreference: it matches the same text that the group named 'name' matched earlier in the pattern.

>>> hierarchical_str = '<outer> <inner> enclosed text <inner><outer>'

>>> re.search(r'(?P<outer><outer>)\s*?(?P<inner><inner>)(?P<enclosed>[^<>]*?)(?P=inner)\s*?(?P=outer)', hierarchical_str).groups()

('<outer>', '<inner>', ' enclosed text ')
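Another common use of (?P=name) is spotting accidentally repeated words. The following is a minimal sketch with a made-up sentence. The group name 'word' and the pattern are my own illustration.

```python
import re

# A word, some whitespace, then the very same word again.
double_expr = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')

sentence = 'One ring to to rule them all'   # hypothetical input with a typo
match = double_expr.search(sentence)

repeated = match.group('word')
print(repeated, match.group(0))
```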

To be greedy or not

By default '*' and '+' are greedy, i.e. they match as many characters as possible. This is often undesirable, as the example below shows. '*?' and '+?' are the non-greedy versions, which match as few characters as possible.

# Sample input, reconstructed from the output shown below.

>>> html = '<head>First html</head><body>Hello World!</body>'

# Greedy version consumes too much!

>>> re.search(r'<(?P<tag>.+)>', html).group('tag')

'head>First html</head><body>Hello World!</body'

# We want the non-greedy version

>>> re.search(r'<(?P<tag>.+?)>', html).group('tag')

'head'
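The difference is even easier to see with findall(). In the sketch below, the 'html' string is an assumed sample consistent with the output above. The greedy pattern yields a single oversized match, while the non-greedy one yields each tag separately.

```python
import re

# Assumed sample input, consistent with the output shown in the article.
html = '<head>First html</head><body>Hello World!</body>'

greedy_tags = re.findall(r'<(.+)>', html)    # one match spanning the whole string
lazy_tags = re.findall(r'<(.+?)>', html)     # one match per tag

print(greedy_tags)
print(lazy_tags)
```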

For more information on regular expressions in Python, visit http://docs.python.org/howto/regex.html#regex-howto.


Saturday, August 6, 2011

More regular expressions ...