Step 1 ~ Continuation
Start from the last example's print_contents(url) and save it as tutorial-02.py

# Beginning of file
from urllib2 import urlopen

# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    print( contents ) # print file
    return

print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file

Step 2 ~ Regular expressions
Basics

  • '^' = beginning of the line
  • '$' = end of the line
  • '.' = any character
  • '[x]' = any character x
  • '[xy]' = any character x or y
  • '[x-z]' = any character x through z (x,y,z)
  • '[^x]' = any character except x (not x)
  • '\d' = any digit, same as [0-9]
  • '\s' = any whitespace character, same as [ \t\n\r\f\v]
  • '*' = 0 or more ('.*' = any character 0 or more times)
  • '+' = 1 or more ('.+' = any character 1 or more times)
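  • '?' = 0 or 1 of the preceding item; placed after '*' or '+' it makes them non-greedy ('.*?' = any character, as few times as possible)
  • '()' = group (a "parameter") within the expression, retrieved with group(n)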
  • Examples using '1abc2def3'
    • '^.*$' -> 1abc2def3
    • '^\d' -> 1
    • '\d$' -> 3
    • '^\d.*\d' -> 1abc2def3
    • '^\d.*?\d' -> 1abc2
    • '\d.*?\d$' -> 1abc2def3 (the search starts at the leftmost digit; non-greedy only affects where a match ends)
    • '[a-z]+' -> abc
    • '^\d[a-z]+\d[a-z]+\d$' -> 1abc2def3
  • Repeat with parameters
    • '^(.*)$' -> group(1) = 1abc2def3
    • '^(\d)' -> group(1) = 1
    • '(\d)$' -> group(1) = 3
    • '^\d(.*)\d' -> group(1) = abc2def
    • '^\d(.*?)\d' -> group(1) = abc
    • '(\d.*?\d)$' -> group(1) = 1abc2def3 (the match still starts at the leftmost digit)
    • '([a-z]+)' -> group(1) = abc
    • '^\d([a-z]+)\d([a-z]+)\d$' -> group(1) = abc, group(2) = def
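
A few of the examples above can be checked directly with Python's re module. This is only a minimal sketch using the same sample string '1abc2def3'; the helper name first_match is made up for illustration and is not part of the tutorial's script.

```python
import re

SAMPLE = '1abc2def3'

def first_match(pattern, text=SAMPLE):
    """Return the text of the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Whole-match examples
print(first_match(r'^\d'))        # the leading digit
print(first_match(r'^\d.*\d'))    # greedy: runs on to the last digit
print(first_match(r'^\d.*?\d'))   # non-greedy: stops at the next digit
print(first_match(r'[a-z]+'))     # first run of lowercase letters

# Group examples: groups() returns every parenthesized part as a tuple
groups = re.search(r'^\d([a-z]+)\d([a-z]+)\d$', SAMPLE).groups()
print(groups)
```

The r'' prefix (a raw string) keeps Python from interpreting the backslashes before the regular expression engine sees them.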

Step 3 ~ Identify Patterns
Output of search term...

... <content type="html"><![CDATA[<ol> <li><span class="Volcanic"><a href="http://www.google.com/trends/hottrends?q=bowflex+revolution&date=2008-2-15&sa=X">bowflex revolution</a></span></li> ...

Step 4 ~ Create Expression
We need to create a search (regular) expression that finds 'what' we're looking for.

  • Line starts with ^<li>
  • Line ends with </a></span></li>$
  • What we're looking for is >(bowflex revolution)

So our expression could look like this: ^<li>.*>([^<]+)</a></span></li>$
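
Before wiring the expression into the script, it can be tried against the sample line from Step 3. This is a sketch that assumes the <li> line stands on its own exactly as shown in the excerpt; the variable names are illustrative.

```python
import re

# One <li> line copied from the feed excerpt in Step 3
line = ('<li><span class="Volcanic">'
        '<a href="http://www.google.com/trends/hottrends'
        '?q=bowflex+revolution&date=2008-2-15&sa=X">'
        'bowflex revolution</a></span></li>')

pattern = re.compile(r'^<li>.*>([^<]+)</a></span></li>$')

match = pattern.search(line)
if match:
    # group(1) is the text between the last '>' and '</a>'
    print(match.group(1))
```

The greedy '.*' swallows the span and anchor tags, and '[^<]+' then captures everything up to the next '<', which is the search term itself.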

Step 5 ~ Add Expression and Test
This will find the first occurrence of our expression in the contents.


# Beginning of file
from urllib2 import urlopen
import re

# Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match

# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    match = reHot.search( contents )
    if match:
        print( match.group(1) )
    return

print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file

Save and run...
python tutorial-02.py
bowflex revolution
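
The re.MULTILINE flag matters because the downloaded contents arrive as one long string containing many lines. A small sketch of the difference, using a made-up three-line string rather than the real feed:

```python
import re

# Made-up text standing in for the downloaded contents
text = 'first line\n<li>item</li>\nlast line'

# Without MULTILINE, ^ and $ only anchor at the start and end of the whole string
plain = re.search(r'^<li>(.*)</li>$', text)

# With MULTILINE, ^ and $ also anchor at the start and end of each line
multi = re.search(r'^<li>(.*)</li>$', text, re.MULTILINE)

print(plain)           # no match: the string does not start with <li>
print(multi.group(1))  # the middle line matches on its own
```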

Step 6 ~ Change search to finditer
This allows us to search for all matches (not just the first match) and loop through them.

# Beginning of file
from urllib2 import urlopen
import re

# Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match

# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    matches = reHot.finditer( contents )
    for match in matches:
        print( match.group(1) )
    return

print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file

Save and run...
python tutorial-02.py
bowflex revolution
corcoran laboratories
kristin chenoweth
adrienne curry
kings cheerleaders
bonnie mcfarlane
withleather.com
...
[ total 100 lines ]
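
The difference between search and finditer can also be seen without downloading anything. In this sketch the contents string is a made-up stand-in for the feed, with placeholder href values:

```python
import re

# Made-up stand-in for the downloaded feed contents
contents = '\n'.join([
    '<ol>',
    '<li><span class="Volcanic"><a href="#">bowflex revolution</a></span></li>',
    '<li><span class="Volcanic"><a href="#">kristin chenoweth</a></span></li>',
    '</ol>',
])

reHot = re.compile(r'^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE)

# search stops at the first match
first = reHot.search(contents).group(1)

# finditer yields one match object per matching line
all_terms = [m.group(1) for m in reHot.finditer(contents)]

print(first)
print(all_terms)
```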

Step 7 ~ Create a list of matches
This list will become what the function returns (i.e. what it passes to our template so it can display the results), because we cannot simply print to the screen in Django!

# Beginning of file
from urllib2 import urlopen
import re

# Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match

# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    matches = reHot.finditer( contents )
    results = []
    for match in matches:
        results.append( match.group(1) )
    for term in results[:25]: # Limit to 25
        print( term )
    return

print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file

Save and run...
python tutorial-02.py
bowflex revolution
corcoran laboratories
kristin chenoweth
adrienne curry
kings cheerleaders
bonnie mcfarlane
withleather.com
...
[ total 25 lines ]
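
Since the goal is to hand the list to a template rather than print it, the collecting part can be separated from the download. This is a sketch only: find_terms is a hypothetical helper, not a function from the tutorial, and the sample string is made up to stand in for the feed.

```python
import re

reHot = re.compile(r'^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE)

def find_terms(contents, limit=25):
    """Return up to `limit` search terms found in contents (hypothetical helper)."""
    results = []
    for match in reHot.finditer(contents):
        results.append(match.group(1))
    return results[:limit]

# Made-up sample standing in for the feed: 30 matching lines
sample = '\n'.join(
    '<li><span class="Volcanic"><a href="#">term %d</a></span></li>' % n
    for n in range(30)
)

terms = find_terms(sample)
print(len(terms))
print(terms[0])
```

Returning the list keeps the function usable from a view, which can pass the result on to the template.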

Step 8 ~ Fix words
Let's loop through the words of each term and capitalize the first letter of each.

# Beginning of file
from urllib2 import urlopen
import re

# Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match
reWord = re.compile(r'(\w+)')

# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    results = []
    matches = reHot.finditer( contents )
    # loop through matches
    for match in matches:
        # find words
        word_matches = reWord.finditer( match.group(1) )
        # loop through words
        term_list = [] # create a list of words as term_list
        for word_match in word_matches:
            term_list.append( word_match.group(1).capitalize() )
        # back together
        print( ' '.join(term_list) )
    return

print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file

Save and run...
python tutorial-02.py
...
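
The word loop can also be tried on its own, without downloading anything. This sketch assumes a hypothetical helper named capitalize_term. Note that '\w+' skips punctuation, so a term such as withleather.com comes back as two words, and that capitalize() lowercases the rest of each word.

```python
import re

reWord = re.compile(r'(\w+)')

def capitalize_term(term):
    """Capitalize each word in term and join the words with spaces (hypothetical helper)."""
    words = [m.group(1).capitalize() for m in reWord.finditer(term)]
    return ' '.join(words)

print(capitalize_term('bowflex revolution'))
print(capitalize_term('withleather.com'))
```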

