Notes and tips on Python’s regular expression library.

Regex Module

import re

There is also an alternative third-party module, “regex”, which is intended as a more capable replacement for re.



re.match
    Matching always starts at the beginning of the string; "^" is implied.
    re.match(r"^[a-z]", "something")
re.search
    Similar to Perl's regex search: the match can start anywhere in the string.
    re.search(r"\w+", "___  anyword")   # OK
    re.search(r"^[^a-z]+$", "__a")      # no match: the entire string must contain no lower-case letters
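The difference between the two calls can be checked with a small sketch. The sample string is the one used above; re.fullmatch is not mentioned in these notes and is added here only as a related standard-library call.

```python
import re

text = "___  anyword"

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(r"[a-z]+", text)       # None: the string starts with "_"

# re.search scans forward until the pattern matches somewhere.
anywhere = re.search(r"[a-z]+", text)      # match object for "anyword"

# re.fullmatch requires the whole string to match the pattern.
whole = re.fullmatch(r"[^a-z]+", "__a")    # None: "a" is a lower-case letter

print(at_start, anywhere.group(0), whole)
```

All three calls return None when nothing matches, so the result should be tested before calling group() on it.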

Once match() or search() returns a match object, use group() to retrieve the matched text.

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

# group(0) returns the entire matched string, not just the parts captured in parentheses ( )
# group(1) returns 1st match group...
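As a minimal sketch of the points above, the same match object can be read by index, by group name, or all at once. The variable names are illustrative only.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

whole = m.group(0)               # the entire match
by_index = m.group(1)            # first parenthesized group
by_name = m.group("last_name")   # a named group can be read by its name
all_groups = m.groups()          # tuple of all parenthesized groups

print(whole, by_index, by_name, all_groups)
```

Named groups are still numbered, so group(1) and group("first_name") return the same text.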

Example from the Python documentation:

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')


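mystr = re.match(r".*?(run).*?", line).groups()[0]
# mystr is always "run" when line contains "run"; the call extracts it from the string.
# If line has no "run", re.match returns None and .groups() raises AttributeError.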
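The one-liner above fails on lines without a match. A more defensive version might look like the sketch below; extract_run is a hypothetical helper name, and it uses re.search so the leading ".*?" is not needed.

```python
import re

def extract_run(line):
    """Return the captured "run" text, or None if the line has no "run"."""
    m = re.search(r"(run)", line)
    # m is None when nothing matched, so check it before calling group().
    return m.group(1) if m else None

print(extract_run("they run fast"))   # line containing "run"
print(extract_run("they walk"))       # line without "run"
```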

Use re.compile(pat) when the same pattern is applied several times. The compiled pattern's match() and search() methods also accept an optional start position, as in the examples below.

# from python doc
import re
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
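One consequence of the examples above is that passing a start position is not the same as slicing the string first: "^" still refers to the real start of the string (or, with re.M, to the position after a newline). A small sketch, with illustrative variable names:

```python
import re

starts_with_a = re.compile(r"^a")   # compiled once, reused below

# Start searching at index 1: "^" still means the real start of "ba".
with_pos = starts_with_a.search("ba", 1)

# Slice first: the new string "a" really does begin with "a".
with_slice = starts_with_a.search("ba"[1:])

print(with_pos, with_slice.group(0))
```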

Use re.finditer() to iterate over every match in a string.

for match in re.finditer(pattern, string):
    # runs once for each regex match...
    print(match.group(0))
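A concrete run of the loop above might look like this sketch. The sample text and pattern are made up for illustration; each item yielded by re.finditer is a match object, so positions are available as well as the text.

```python
import re

text = "cat 12, dog 7, emu 305"   # made-up sample data

found = []
for m in re.finditer(r"\d+", text):
    # Record the matched text with its start and end offsets.
    found.append((m.group(0), m.start(), m.end()))

print(found)
```

When only the matched strings are needed, re.findall(r"\d+", text) returns them directly as a list.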

re.split: advanced split

import re
re.split(r"REGEX of Delimiters", "TEXT ....")

re.split(r"\W+", "TEXT...")  # split on any run of non-word characters
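A short sketch of both forms, using made-up sample strings. Note that when the text ends with a delimiter, re.split leaves an empty string at the end of the result, which usually needs filtering out.

```python
import re

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", "apple, banana;cherry  date")

# Split on non-word characters; the trailing "!" produces an empty last item.
words = re.split(r"\W+", "Hello, world!")

# Drop empty strings left by leading or trailing delimiters.
clean = [w for w in words if w]

print(parts, words, clean)
```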

Dealing with “-” (dash)

  • Splitting on \W+ treats “-” as a delimiter, so “e-mail” becomes “e” and “mail”.
  • To keep “-” inside words, exclude it from the set of delimiters; see below.

    re.split(r'\W+', "e-mail")       ==> ['e', 'mail']

Exclude “-” from the separators by using a negated character class ([^...]) that lists \- alongside \w.

re.split(r'[^\w\-]+', "e-mail")    ==> ['e-mail']
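The two patterns can be compared on a longer sentence. The sentence is made up, and the re.findall line is an alternative not in the notes above: it collects the words themselves instead of splitting on what lies between them.

```python
import re

text = "send e-mail to the help-desk, please"   # made-up sample

# "-" counts as a non-word character, so hyphenated words are broken up.
plain = re.split(r"\W+", text)

# Anything that is neither a word character nor "-" is a delimiter.
keep_dash = re.split(r"[^\w\-]+", text)

# Alternative: match runs of word characters and dashes directly.
found = re.findall(r"[\w\-]+", text)

print(plain, keep_dash, found)
```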