Creating a Wikipedia watchlist RSS feed with Python and Twill
20 Nov 2005 18:27 - (2) comments
Wikipedia doesn't have an RSS feed for the watchlist, so I created one with Python and Twill. You can run it with wikipedia_rss.py username password.
After running the script, the RSS feed can be found at whatever path you define as temp_rss.
It's my first Python script. I had trouble getting it to work from my crontab: I kept getting ImportError: No module named. It turned out cron was running a different Python than the one /usr/bin/env python finds in my shell. Running which python from the command line showed me that my working version could be found at /opt/local/bin/python.
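A quick way to confirm this kind of mismatch is to have the job itself report which interpreter it is running under. The following is a minimal sketch, not part of the script below; interpreter_report is a hypothetical helper name, and it only uses the standard sys module.

```python
import sys


def interpreter_report():
    """Describe the interpreter and module search path of the running process."""
    lines = [
        "executable: " + str(sys.executable),
        "version: " + sys.version.split()[0],
        "search path:",
    ]
    # Every directory Python will search when resolving an import.
    lines.extend("  " + str(p) for p in sys.path)
    return "\n".join(lines)


print(interpreter_report())
```

Running it once from a shell and once from the crontab entry, then comparing the two outputs, should show whether both use the same executable and whether the module's directory is on the search path in both cases.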
#!/opt/local/bin/python
from twill.commands import go, follow, showforms, fv, submit, find, code, show, save_html
import twill, xml.dom.minidom, sys, string, datetime

# Read the Wikipedia credentials from the command line.
try:
    username = sys.argv[1]
    password = sys.argv[2]
except IndexError:
    print "Please supply username password"
    sys.exit(1)

remote_html = "http://en.wikipedia.org/wiki/Special:Watchlist"
temp_html = "/Users/petrik/Scripts/tmp/wikipedia.html"
temp_rss = "/Library/WebServer/Documents/crap/wikipedia.rss"
rss_title = "Wikipedia watchlist"
rss_link = "http://en.wikipedia.org"

#main methods
def loginWikiPedia(username, password):
    print "Logging in with username " + username
    go("http://en.wikipedia.org/w/index.php?title=Special:Userlogin")
    # Fill in the first form on the page (the login form) and submit it.
    fv("1", "wpName", username)
    fv("1", "wpPassword", password)
    submit("wpLoginattempt")

def getSavedHtmlPage(filename):
    return xml.dom.minidom.parseString(open(filename, 'r').read())

def saveHtmlPage(html_page, filename):
    go(html_page)
    save_html(filename)

#rss utils
def createRss(html):
    r = "<rss version=\"0.92\">\n"
    r += "<channel>\n"
    r += getTitle(html)
    r += handleTag(rss_link, "link")
    r += handleTag(rss_title, "title")
    r += handleTag(getTitle(html), "description")
    r += handleTag("en", "language")
    # The timestamp is labelled +0000, so take it in UTC rather than local time.
    r += handleTag(datetime.datetime.utcnow().strftime('%a, %d %b %Y %X +0000'), "pubDate")
    r += handleList(html.getElementsByTagName("li"))
    r += "</channel>\n</rss>"
    # Escape ampersands (e.g. in diff URLs) so the feed stays well-formed XML.
    return string.replace(r, '&', '&amp;')

def handleTag(content, tag):
    return "<" + tag + ">" + str(content) + "</" + tag + ">\n"

def handleLink(node):
    return rss_link + str(getAttribute(node.childNodes[1], "href"))

def getTitle(html):
    return getText(html.getElementsByTagName("title")[0].childNodes)

def handleList(listItems):
    i = 0
    r = ""
    for li in listItems:
        # Use "and", not "&": the bitwise operator binds tighter than the comparisons.
        if i < 10 and li.childNodes.length > 1:
            r += handleTag(handleItem(li), "item")
            # Count emitted items only, so the feed holds at most 10 entries.
            i += 1
    return r

def handleItem(item):
    contentIndex = filterContent(5, item, "m")
    r = handleTag(getText(item.childNodes[contentIndex].childNodes), "title")
    r += handleTag(handleLink(item), "link")
    r += handleTag(handleDescription(item, contentIndex), "description")
    return r

def handleDescription(item, contentIndex):
    # Join the text of every child node that follows the title node.
    i = 0
    description = ""
    for child in item.childNodes:
        if i > contentIndex:
            description += " " + str(getText(child.childNodes))
        i += 1
    return description

def filterContent(index, item, pattern):
    # Skip two nodes ahead when the node at index holds the pattern (e.g. the "m" minor-edit marker).
    if getText(item.childNodes[index].childNodes) == pattern:
        index += 2
    return index

#xml utils
def getText(nodelist):
    # Concatenate the data of all direct text nodes in the list.
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data
    return rc

def getAttribute(node, attribute):
    attrs = node.attributes
    for attrName in attrs.keys():
        if attrName == attribute:
            attrNode = attrs.get(attrName)
            return attrNode.nodeValue

loginWikiPedia(username, password)
saveHtmlPage(remote_html, temp_html)
rss = createRss(getSavedHtmlPage(temp_html))
print rss

f = open(temp_rss, 'w')
f.write(rss)
f.close()
print "saved rss to " + temp_rss
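Because createRss builds the feed by string concatenation, any &, < or > in a page title, diff URL or edit summary has to be escaped for the output to be well-formed XML. The sketch below shows one way to do that per field with xml.sax.saxutils.escape from the standard library. It is an illustration rather than the script's own code: it assumes Python 3, and handle_tag and handle_item are hypothetical names loosely modelled on handleTag and handleItem.

```python
from xml.sax.saxutils import escape


def handle_tag(content, tag):
    """Wrap content in <tag>...</tag>, escaping &, < and > in the content."""
    return "<" + tag + ">" + escape(str(content)) + "</" + tag + ">\n"


def handle_item(title, link, description):
    """Build one <item> element from three plain-text fields."""
    # Escape each field on its own, then wrap the finished children in
    # <item> without escaping them a second time.
    body = (handle_tag(title, "title")
            + handle_tag(link, "link")
            + handle_tag(description, "description"))
    return "<item>\n" + body + "</item>\n"


# Sample values taken from the watchlist output quoted in the comments.
item = handle_item(
    "Talk:Capitalism",
    "http://en.wikipedia.org/w/index.php?title=Talk:Capitalism&curid=5417&diff=28908365&oldid=28859534",
    "Christofurio Talk ()",
)
print(item)
```

Escaping each field before it is wrapped, instead of running one replacement over the finished document, makes it possible to escape < and > as well without damaging the tags themselves.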
Comments
The biggest problem is that this script does not make RSS. Two smaller issues: 1.) You should make a plain-text version available so that whitespace is not lost. 2.) This script does not work if the user has chosen the "Enhance recent changes" preference.
Sample output, where are the tags?:
My watchlist - Wikipedia, the free encyclopediahttp://en.wikipedia.org
Wikipedia watchlist
My watchlist - Wikipedia, the free encyclopedia
en
Mon, 21 Nov 2005 10:23:53 +0000
Wikimedia
http://en.wikipedia.org/w/index.php?title=Wikimedia&curid=198862&diff=28909879&oldid=28758482
Zanimum Talk ()
Talk:Capitalism
http://en.wikipedia.org/w/index.php?title=Talk:Capitalism&curid=5417&diff=28908365&oldid=28859534
Christofurio Talk ()
TWiki
http://en.wikipedia.org/w/index.php?title=TWiki&curid=275801&diff=28907609&oldid=28905411
147.188.192.41 Talk ( - added line on structured info)
Raymond Aron
http://en.wikipedia.org/w/index.php?title=Raymond_Aron&curid=316129&diff=28903980&oldid=26350148
86.55.5.214 Talk ()
On 21 Nov 16:35 by Forest Gregg
Sorry about the missing tags. Something went wrong with the damn HTML entities. I fixed the script.
On 21 Nov 18:15 by p8