2010
10.13

Using Beautiful Soup, we can rewrite the code from the previous post as shown below. It works much better: simply replacing the regex matching with Beautiful Soup's parser returns a cleaner result:

#!/usr/bin/python
# otoy -- http://otoyrood.wordpress.com
# 0x102010

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

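# Download the page and parse it into a tree
text = urlopen('http://otoyrood.wordpress.com').read()
soup = BeautifulSoup(text)

# Collect every distinct link target; href=True skips <a> tags without an href
pages = set()
for link in soup('a', href=True):
    pages.add(link['href'])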

print '\n'.join(sorted(pages))
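
The script above targets Python 2 and the old BeautifulSoup 3 package. As a rough sketch only, the same idea (walk every anchor tag, collect the distinct href values, sort them) can be written for Python 3 with the standard library's html.parser instead of Beautiful Soup. The names LinkCollector and extract_links and the sample HTML string are made up for this illustration and are not from the original post:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set drops duplicate links automatically

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.links.add(value)


def extract_links(html):
    """Return the sorted, de-duplicated href values found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return sorted(parser.links)


if __name__ == '__main__':
    # Assumed sample input; a real run would fetch the page first,
    # e.g. with urllib.request.urlopen(url).read().decode('utf-8').
    sample = ('<p><a href="/b">B</a> <a name="x">no link</a> '
              '<a href="/a">A</a> <a href="/b">B again</a></p>')
    print('\n'.join(extract_links(sample)))
```

Anchors without an href are skipped here, which avoids the lookup error that a bare header['href'] would raise on such tags.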
