Step 1 ~ Continuation
Start from the last example's print_contents(url) function and save the file as tutorial-02.py.
# Beginning of file
from urllib2 import urlopen
# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    print( contents ) # print file
    return
print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file
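The import above only works on Python 2, where the urllib2 module exists. As a hedged sketch for readers on Python 3 (an assumption; the tutorial itself is written for Python 2), the import can fall back to urllib.request, and the downloaded bytes can be decoded to text before they are searched. The helper name to_text and the UTF-8 default are illustrative choices, not part of the original example.

```python
try:
    from urllib2 import urlopen            # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3 (assumed fallback)


def to_text(data, encoding='utf-8'):
    """Return data as text, decoding it first if it is a Python 3 bytes object.

    The encoding default is an assumption; use the one the server reports.
    """
    if isinstance(data, bytes) and not isinstance(data, str):
        return data.decode(encoding)
    return data
```

With this in place, contents = to_text(f.read()) would give the later steps a text string on either version.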
Step 2 ~ Regular expressions
Basics
- '^' = beginning of the line
- '$' = end of the line
- '.' = any character
- '[x]' = the character x
- '[xy]' = any character x or y
- '[x-z]' = any character x through z (x,y,z)
- '[^x]' = any character except x (not x)
- '\d' = any digit, same as [0-9]
- '\s' = any whitespace character, same as [ \t\r\n\f\v]
- '*' = 0 or more ('.*' = any character 0 or more times)
- '+' = 1 or more ('.+' = any character 1 or more times)
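- '?' = 0 or 1; placed after '*' or '+' it makes them non-greedy ('.*?' = any character, as few times as possible)
- '()' = group (parameter) within the expression; the text it matches is available as group(1), group(2), ...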
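- Examples using '1abc2def3'
  - '^.*$' -> 1abc2def3
  - '^\d' -> 1
  - '\d$' -> 3
  - '^\d.*\d' -> 1abc2def3
  - '^\d.*?\d' -> 1abc2
  - '\d.*?\d$' -> 1abc2def3 (the match starts at the leftmost digit, so the non-greedy '.*?' still has to stretch to the last digit)
  - '[a-z]+' -> abc
  - '^\d[a-z]+\d[a-z]+\d$' -> 1abc2def3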
- Repeat with groups (parameters)
  - '^(.*)$' -> group(1) = 1abc2def3
  - '^(\d)' -> group(1) = 1
  - '(\d)$' -> group(1) = 3
  - '^\d(.*)\d' -> group(1) = abc2def
  - '^\d(.*?)\d' -> group(1) = abc
  - '(\d.*?\d)$' -> group(1) = 1abc2def3 (the match still starts at the leftmost digit)
  - '([a-z]+)' -> group(1) = abc
  - '^\d([a-z]+)\d([a-z]+)\d$' -> group(1) = abc, group(2) = def
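A few of the patterns above can be checked directly in Python. This is a minimal sketch using re.search on the same test string; the helper name show is made up for illustration. It returns the whole match, group(0), together with a tuple of any groups.

```python
import re

text = '1abc2def3'


def show(pattern):
    """Search text for pattern; print and return (whole match, groups)."""
    match = re.search(pattern, text)
    result = (match.group(0), match.groups()) if match else None
    print('%-28s -> %r' % (pattern, result))
    return result


# Greedy versus non-greedy
show(r'^\d.*\d')     # '.*' grabs as much as it can
show(r'^\d.*?\d')    # '.*?' stops at the first digit that lets the match succeed

# The same patterns with a group around the middle part
show(r'^\d(.*)\d')
show(r'^\d(.*?)\d')

# Two groups in one expression
show(r'^\d([a-z]+)\d([a-z]+)\d$')
```

Writing the patterns as raw strings (r'...') keeps backslashes such as \d from being interpreted by Python before the re module sees them.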
Step 3 ~ Identify Patterns
Output of search term...
...
<content type="html"><![CDATA[<ol>
<li><span class="Volcanic"><a href="http://www.google.com/trends/hottrends?q=bowflex+revolution&date=2008-2-15&sa=X">bowflex revolution</a></span></li>
...
Step 4 ~ Create Expression
We need to create a search (regular) expression that finds what we're looking for.
- Line starts with ^<li>
- Line ends with </a></span></li>$
- What we're looking for is >(bowflex revolution)
So our expression could look like this:
^<li>.*>([^<]+)</a></span></li>$
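Before wiring the expression into the script, it can be tried on the sample line from Step 3. The sketch below copies that line into a string by hand and stores the result in a variable named term, which is an illustrative name only.

```python
import re

# The sample line from Step 3, copied as a single string.
line = ('<li><span class="Volcanic"><a href="http://www.google.com/trends/'
        'hottrends?q=bowflex+revolution&date=2008-2-15&sa=X">'
        'bowflex revolution</a></span></li>')

pattern = re.compile('^<li>.*>([^<]+)</a></span></li>$')

match = pattern.search(line)
term = match.group(1) if match else None
print(term)
```

The greedy '.*' first runs to the end of the line and then backs up to the last '>' that is followed by text containing no '<' and then by the closing tags, which is why the group holds only the link text.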
Step 5 ~ Add Expression and Test
This will find the first occurrence of our expression in the contents
# Beginning of file
from urllib2 import urlopen
import re # Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match
# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    match = reHot.search( contents )
    if match:
        print( match.group(1) )
    return
print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file
Save and run...
python tutorial-02.py
bowflex revolution
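To see what re.MULTILINE contributes, here is a small sketch that runs the same expression with and without the flag. The sample string is a made-up stand-in for the downloaded feed, with the links shortened to "#".

```python
import re

# Made-up stand-in for the downloaded feed (hrefs shortened to "#").
sample = '\n'.join([
    '<content type="html"><![CDATA[<ol>',
    '<li><span class="Volcanic"><a href="#">bowflex revolution</a></span></li>',
    '<li><span class="Volcanic"><a href="#">corcoran laboratories</a></span></li>',
    '</ol>',
])

expression = '^<li>.*>([^<]+)</a></span></li>$'

# Without the flag, '^' only matches at the very start of the whole string.
without_flag = re.search(expression, sample)

# With the flag, '^' and '$' match at the start and end of every line.
with_flag = re.search(expression, sample, re.MULTILINE)

print(without_flag)
print(with_flag.group(1))
```

Without the flag there is no match, because the string begins with the content tag rather than with an li tag. With the flag, search returns the first matching line only.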
Step 6 ~ Change search to finditer
This allows us to search for all matches (not just the first match) and loop through them.
# Beginning of file
from urllib2 import urlopen
import re # Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match
# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    matches = reHot.finditer( contents )
    for match in matches:
        print( match.group(1) )
    return
print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file
Save and run...
python tutorial-02.py
bowflex revolution
corcoran laboratories
kristin chenoweth
adrienne curry
kings cheerleaders
bonnie mcfarlane
withleather.com
...
[ total 100 lines ]
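The difference between search and finditer can be seen without downloading anything. This sketch assumes a short made-up sample in place of the feed and collects the results in variables (first, all_terms) named for illustration.

```python
import re

reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE)

# Made-up stand-in for the downloaded feed (hrefs shortened to "#").
sample = '\n'.join([
    '<li><span class="Volcanic"><a href="#">bowflex revolution</a></span></li>',
    '<li><span class="Volcanic"><a href="#">corcoran laboratories</a></span></li>',
    '<li><span class="Volcanic"><a href="#">kristin chenoweth</a></span></li>',
])

# search() stops at the first match.
first = reHot.search(sample).group(1)

# finditer() yields one match object per matching line.
all_terms = [match.group(1) for match in reHot.finditer(sample)]

print(first)
print(all_terms)
```

When the expression contains exactly one group, reHot.findall(sample) returns the same list of strings directly; finditer is the better fit when the match objects themselves are needed.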
Step 7 ~ Create a list of matches
This list will become the function's return value (i.e. what gets passed to our template so it can display the results), because in Django we cannot simply print to the screen.
# Beginning of file
from urllib2 import urlopen
import re # Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match
# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    matches = reHot.finditer( contents )
    results = []
    for match in matches:
        results.append( match.group(1) )
    for term in results[:25]: # Limit to 25
        print( term )
    return
print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file
Save and run...
python tutorial-02.py
bowflex revolution
corcoran laboratories
kristin chenoweth
adrienne curry
kings cheerleaders
bonnie mcfarlane
withleather.com
...
[ total 25 lines ]
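Since the list is meant to be returned rather than printed, the collecting part can be sketched as a function of its own. The name get_terms and the limit parameter are hypothetical, and the input is generated sample text rather than the real feed.

```python
import re

reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE)


def get_terms(contents, limit=25):
    """Return up to `limit` matched terms from contents instead of printing them."""
    results = []
    for match in reHot.finditer(contents):
        results.append(match.group(1))
    return results[:limit]


# Generated sample: 30 lines in the same shape as the feed (hrefs shortened).
lines = ['<li><span class="Volcanic"><a href="#">term %d</a></span></li>' % i
         for i in range(30)]
terms = get_terms('\n'.join(lines))

print(len(terms))
print(terms[0])
```

A caller such as a Django view could then hand the returned list to a template; the printing loop stays useful only for testing at the command line.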
Step 8 ~ Fix words
Let's loop through the words of each term and capitalize the first letter of each word.
# Beginning of file
from urllib2 import urlopen
import re # Regular Expressions
reHot = re.compile('^<li>.*>([^<]+)</a></span></li>$', re.MULTILINE) # MULTILINE makes each line have a ^ and $ match
reWord = re.compile(r'(\w+)') # raw string so the backslash reaches the re module unchanged
# Step 1
def print_contents(url):
    '''
    function print_contents: url is the url to download and print
    '''
    f = urlopen(url)
    contents = f.read()
    f.close()
    # print( contents )
    results = []
    matches = reHot.finditer( contents )
    # loop through matches
    for match in matches:
        # find words
        word_matches = reWord.finditer( match.group(1) )
        # loop through words
        term_list = [] # create a list of words as term_list
        for word_match in word_matches:
            term_list.append( word_match.group(1).capitalize() )
        # back together
        print( ' '.join(term_list) )
    return
print_contents('http://www.google.com/trends/hottrends/atom/hourly')
# End of file
Save and run...
python tutorial-02.py
...
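The word-fixing step can be tried on its own with a few of the terms from the earlier output. In this sketch the helper name capitalize_words is made up for illustration; it applies the same reWord expression and capitalize() call as the loop above.

```python
import re

reWord = re.compile(r'(\w+)')


def capitalize_words(term):
    """Capitalize every run of word characters in term and join them with spaces."""
    words = [match.group(1).capitalize() for match in reWord.finditer(term)]
    return ' '.join(words)


for term in ['bowflex revolution', 'kristin chenoweth', 'withleather.com']:
    print(capitalize_words(term))
```

Because only the '\w+' runs are kept, punctuation between words is dropped: 'withleather.com' comes out as 'Withleather Com'.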