Creating a spam list with a web crawler in Python
- by user313623
Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer, I'm using Python 3.0, and I'm having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while. Here's the assignment:
Write a recursive method spam(url, n) that takes as input the url of a web page and a non-negative integer n, collects all the email addresses contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http hyperlink contained in the web page. You will use a dictionary so only one copy of every email address is saved; your dictionary will store (key, value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but no recursive calls should be made. The parameter n is used to limit the recursion to at most depth n. You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well.
Notes:
1. Running spam() directly will produce no output on the screen; to find your spam_dict, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam.
2. Recall how global variables are used.
Usage:
spam_dict = {}
spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
spam_dict.keys()
dict_keys([])
spam_dict = {}
spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
spam_dict.keys()
dict_keys(['[email protected]', '[email protected]'])
So far, I've written a function that traverses a web page and puts all the links in a nice little list, and what I wanted to do was call that function from spam(). But why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.
from urllib.request import urlopen
from urllib.parse import urljoin

def links2(url):
    # MyHTMLParser is my link-collecting parser from the earlier problem
    content = str(urlopen(url).read())
    myparser = MyHTMLParser()
    myparser.feed(content)
    lst = myparser.get()
    mergelst = []
    for link in lst:
        # resolve relative links against the page's own url
        mergelst.append(urljoin(url, link))
    print(mergelst)
    return mergelst   # return the list too so spam() can actually use it
Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better; if you have a way to improve it, I'm all ears. However, all I really need is for the program to produce the proper output.
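For what it's worth, after re-reading the assignment, here's the rough skeleton I'm imagining for spam(). I'm not at all sure it's right, and it assumes that emails() (my solution to the earlier problem, not shown here) takes a url and returns a list of the addresses on that page, and that links2() returns its list instead of just printing it:

spam_dict = {}

def spam(url, n):
    # use the global dictionary so every recursive call adds to the same one
    global spam_dict
    # emails() is from the earlier problem; I'm assuming here that it takes
    # a url and returns the email addresses found on that page
    for address in emails(url):
        spam_dict[address] = address   # (email, email) pairs, so duplicates collapse
    # n limits the depth: when n == 0, collect addresses but make no recursive calls
    if n > 0:
        for link in links2(url):
            spam(link, n - 1)

Is that roughly the idea, or am I misreading how n is supposed to limit the recursion?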