Creating a spam list with a web crawler in python

Posted by user313623 on Stack Overflow
Published on 2010-05-01T00:54:53Z Indexed on 2010/05/01 0:57 UTC

Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer, I'm using Python 3.0, and I'm having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while. Here's the assignment:

  1. Write a recursive method spam(url, n) that takes the URL of a web page and a non-negative integer n as input, collects all the email addresses contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http hyperlink contained in the web page. You will use a dictionary so that only one copy of every email address is saved; your dictionary will store (key, value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but make no recursive calls. The parameter n is used to limit the recursion to at most depth n. You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well. Notes: 1. Running spam() directly will produce no output on the screen; to find your spam_dict, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam. 2. Recall how global variables are used.

    Usage:

    >>> spam_dict = {}
    >>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
    >>> spam_dict.keys()
    dict_keys([])
    >>> spam_dict = {}
    >>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
    >>> spam_dict.keys()
    dict_keys(['[email protected]', '[email protected]'])

So far, I've written a function that traverses a web page and puts all of its links in a nice little list, and what I wanted to do was call that function. But why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.

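from urllib.request import urlopen
from urllib.parse import urljoin

def links2(url):
    # MyHTMLParser is defined elsewhere (not shown); get() returns the links it found
    # decoding assumes the page is UTF-8; undecodable bytes are replaced
    content = urlopen(url).read().decode('utf-8', 'replace')
    myparser = MyHTMLParser()
    myparser.feed(content)
    lst = myparser.get()
    mergelst = []
    for link in lst:
        # resolve relative links against the page URL, not against the first link
        mergelst.append(urljoin(url, link))
    # return the list (instead of printing it) so another function can use it
    return mergelst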
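
The post does not show the MyHTMLParser class that links2() relies on. A minimal sketch of what such a class might look like, assuming it subclasses the standard library's HTMLParser and that get() returns the raw href values. The helper absolute_http_links() is a hypothetical name introduced here so the idea can be tried on an HTML string without any network access:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MyHTMLParser(HTMLParser):
    """Assumed shape of the class: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def get(self):
        return self.links


def absolute_http_links(base_url, html):
    """Hypothetical helper: resolve links against base_url, keep http(s) ones."""
    parser = MyHTMLParser()
    parser.feed(html)
    resolved = [urljoin(base_url, href) for href in parser.get()]
    # Drop mailto:, javascript: and similar non-http links
    return [u for u in resolved if u.startswith(("http://", "https://"))]
```

Resolving each link against the URL of the page itself is what turns a relative link such as test2.html into something urlopen() can fetch.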

Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better; if you have a better way to do it, I'm all ears. However, all I really need is for the program to produce the proper output.
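
One possible reading of the assignment, as a rough sketch rather than the official solution: the recursion is over pages, not over the dictionary, and n only counts how many more levels of links may still be followed. To keep the sketch runnable without a network, links2() and emails() are replaced here by stand-ins that read from a made-up in-memory table (FAKE_WEB and all example.com addresses are assumptions for illustration only):

```python
# Hypothetical in-memory "web": url -> (emails on the page, http links on the page)
FAKE_WEB = {
    "http://example.com/a": (["alice@example.com"], ["http://example.com/b"]),
    "http://example.com/b": (["bob@example.com", "alice@example.com"],
                             ["http://example.com/c"]),
    "http://example.com/c": (["carol@example.com"], []),
}


def emails(url):
    """Stand-in for the assignment's emails(): addresses found on the page."""
    return FAKE_WEB.get(url, ([], []))[0]


def links2(url):
    """Stand-in for the assignment's links2(): http links found on the page."""
    return FAKE_WEB.get(url, ([], []))[1]


spam_dict = {}  # global result; empty it before every run


def spam(url, n):
    """Collect the emails at url, then visit its links with depth n - 1."""
    # No `global` statement is needed here: the dict is mutated, never rebound.
    for address in emails(url):
        spam_dict[address] = address  # keys are unique, so duplicates collapse
    if n == 0:
        return  # base case: collect on this page, follow no links
    for link in links2(url):
        spam(link, n - 1)  # each level of links uses up one unit of depth
```

With this table, spam('http://example.com/a', 0) stores only the address found on page a, while a depth of 1 also picks up the addresses on page b. Because n shrinks by one on every call, the recursion stops even if pages link back to each other.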

© Stack Overflow or respective owner
