Parsing html for domain links
Posted by Hallik on Stack Overflow
Published on 2010-05-07T01:56:46Z
python
I have a script that parses an html page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. A sample list contains
domains = ['www.domain.com', 'sub.domain.com']  # renamed from "list" to avoid shadowing the built-in
But I may have a list of links that look like
http://domain.com
http://sub.domain.com/some/other/page
I can strip off the http:// easily enough, but both of the example links above should match: the first against www.domain.com and the second against sub.domain.com in the list.
Right now I am using urllib2 for fetching and parsing the html. What are my options for completing this task?
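One approach (a sketch, not the asker's code) is to parse each link's hostname with the standard library's `urlparse` and normalize both sides before comparing, so that a bare `domain.com` matches the `www.domain.com` entry. The function names here are illustrative:

```python
from urllib.parse import urlparse  # "urlparse" module in Python 2

def normalize_host(host):
    """Lowercase and strip a leading 'www.' so 'www.domain.com'
    and 'domain.com' compare equal."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host

def link_matches(link, domains):
    """Return True if the link's hostname matches any entry in domains."""
    host = urlparse(link).hostname or ""
    targets = {normalize_host(d) for d in domains}
    return normalize_host(host) in targets

domains = ['www.domain.com', 'sub.domain.com']
link_matches('http://domain.com', domains)                  # True
link_matches('http://sub.domain.com/some/other/page', domains)  # True
link_matches('http://other.example.com', domains)           # False
```

Note that only the exact `www.` prefix is treated as equivalent here; if you also want `deep.sub.domain.com` to match the `sub.domain.com` entry, you would compare with an ends-with check on the dot-separated labels instead of set membership.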
© Stack Overflow or respective owner