-
as seen on Stack Overflow
Does anybody know where I can get a free web crawler that actually works with minimal coding on my part? I've googled it and can only find really old ones that don't work, or OpenWebSpider, which doesn't seem to work.
Ideally I'd like to store just the web addresses and which links each page contains.
Any suggestions…
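The core of what this question asks for — a page's address plus the links it contains — is link extraction. A minimal sketch using only the Python standard library (the example HTML and URLs are illustrative, not from the question):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="/about">About</a> <a href="http://example.org/x">X</a>'
print(extract_links(page, "http://example.com/"))
# → ['http://example.com/about', 'http://example.org/x']
```

The fetching loop around this (a queue of URLs, a visited set, `urllib.request` to download each page) is what turns it into a crawler; storing each `(url, links)` pair covers the stated requirement.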
-
as seen on Stack Overflow
I am building a web application crawler that's meant not only to find all the links or pages in a web application, but also to perform all the allowed actions in the app (such as pushing buttons, filling in forms, and noticing changes in the DOM even if they did not trigger a request, etc.).
Basically, this is…
-
as seen on Stack Overflow
I built an App Engine web app, cricket.hover.in. The web app consists of about 15k URLs
linked within it, but even a long time after launch, no pages are indexed on Google.
Any link placed on my root site hover.in is indexed within minutes,
but I placed the same link on the home page of the root…
-
as seen on Programmers
I would like to extract data from the internet the way www.mozenda.com does, but I want to write my own program to do it. The specific data I'm looking for is various event data.
Based on my research, I think a custom web crawler is my answer, but I would like to confirm that and see if there are any suggestions…
-
as seen on Stack Overflow
I want to crawl useful resources (like background pictures, etc.) from certain websites. It is not a hard job, especially with the help of wonderful projects like Scrapy.
The problem here is that I don't just want to crawl this site ONE TIME. I also want to keep my crawl long-running and crawl the updated…
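Keeping a crawl long-running usually comes down to a revisit policy: how soon to fetch a page again. Scrapy has no built-in scheduler for this, so the heuristic below — halve the interval when a page's content changed, double it (up to a cap) when it didn't — is an assumption, sketched as plain Python:

```python
import hashlib

class RevisitScheduler:
    """Adaptive re-crawl interval for one URL: shrink it when the page
    changed since the last visit, grow it (up to a cap) when it did not.
    All interval values are in seconds and are illustrative defaults."""
    def __init__(self, initial=3600, minimum=600, maximum=86400):
        self.interval = initial
        self.minimum = minimum
        self.maximum = maximum
        self.last_hash = None

    def record_fetch(self, body: bytes) -> bool:
        """Record a fetched response body; return True if it changed."""
        digest = hashlib.sha256(body).hexdigest()
        changed = digest != self.last_hash
        self.last_hash = digest
        if changed:
            self.interval = max(self.minimum, self.interval // 2)
        else:
            self.interval = min(self.maximum, self.interval * 2)
        return changed
```

In practice one such scheduler per URL feeds the crawl queue; conditional requests (`If-Modified-Since` / `ETag`) can avoid re-downloading unchanged pages entirely.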
-
as seen on Stack Overflow
I just downloaded Scrapy (a web crawler) on 32-bit Windows and created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command:
scrapy-ctl.py genspider myspider myspdier-domain.com
but it did not work and…
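For context, `scrapy-ctl.py` belongs to very old Scrapy releases; current versions drive the same steps through a single `scrapy` entry point. A sketch of the modern equivalents (project, spider, and domain names are copied from the question, typo included):

```shell
# Modern Scrapy replaces scrapy-ctl.py with the `scrapy` command:
scrapy startproject dmoz
cd dmoz
scrapy genspider myspider myspdier-domain.com
```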
-
as seen on Server Fault
It's been suggested that we use MySQL for our site's search, as it'd be running on the same server that hosts our web server (nginx) and our DB (MySQL).
Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site and toss each page URL…
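The MySQL side of this setup is typically a full-text index over the crawled page text. A minimal sketch — the table and column names are hypothetical, and `FULLTEXT` on InnoDB requires MySQL 5.6 or later:

```sql
-- Hypothetical table the crawler fills with one row per page.
CREATE TABLE pages (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    url  VARCHAR(2048) NOT NULL,
    body TEXT,
    FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;

-- Natural-language full-text search over the crawled text:
SELECT url
FROM pages
WHERE MATCH(body) AGAINST ('search terms' IN NATURAL LANGUAGE MODE);
```

The crawler's only job is then to insert or update `(url, body)` rows; MySQL handles ranking and matching.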
-
as seen on Stack Overflow
I am new to Python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded some sample code. Unfortunately, it does not work and gives me the error:
"AttributeError: 'MyShell' object has no attribute 'loaded'"
I am…
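Without the sample code the exact cause can't be known, but this error almost always means the attribute was read before anything assigned it — typically because `__init__` never set it, or set it only inside a branch that didn't run. A hypothetical `MyShell` illustrating the fix:

```python
class MyShell:
    def __init__(self):
        # Assigning the attribute up front guarantees it exists; if this
        # line were missing, reading shell.loaded before load() ran would
        # raise: AttributeError: 'MyShell' object has no attribute 'loaded'
        self.loaded = False

    def load(self):
        self.loaded = True

shell = MyShell()
print(shell.loaded)  # False — the attribute exists because __init__ set it
shell.load()
print(shell.loaded)  # True
```

If the sample code was written for a different Python version, a method that was supposed to set the attribute may simply be failing earlier; the traceback above the AttributeError usually shows where.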
-
as seen on Stack Overflow
Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only... I don't need links, descriptions, etc.
What is the best way to do this without getting too technical? I guess it could even be a cron job that runs a PHP script grabbing URLs…
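"URLs only" means the script just has to pull absolute URLs out of whatever text it fetches. A naive sketch (shown in Python rather than PHP; the regex is a deliberate simplification — real URL matching has many edge cases such as trailing punctuation):

```python
import re

# Naive pattern for absolute http(s) URLs in arbitrary text; it stops at
# whitespace, quotes, and angle brackets, and will over-match things like
# a sentence-ending period glued to a URL.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def find_urls(text):
    """Return every absolute http(s) URL found in the given text."""
    return URL_RE.findall(text)

print(find_urls('see http://example.com/a and https://example.org/b end'))
# → ['http://example.com/a', 'https://example.org/b']
```

Run from cron against each fetched page, appending results to a store, this is essentially the whole robot the question describes.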
-
as seen on Pro Webmasters
How do I set up a user-agent string block by regular expression in the config files of my Apache web server?
For example: I would like to block from Apache on my Debian server all bots whose user-agent matches the regular expression /\b\w+[Bb]ot\b/ or /Spider/.
Those bots should not be able…
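One standard way to do this is `SetEnvIf` (which takes a regex against the User-Agent header) combined with an access-control rule. A sketch for Apache 2.4 using the question's own expressions:

```apache
# SetEnvIf matches are case-sensitive regexes, so [Bb]ot stays as written.
SetEnvIf User-Agent "\b\w+[Bb]ot\b" bad_bot
SetEnvIf User-Agent "Spider"        bad_bot

# Apache 2.4 access control: allow everyone except flagged agents.
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>

# On Apache 2.2 the equivalent is:
#   Order Allow,Deny
#   Allow from all
#   Deny from env=bad_bot
```

Note that blocking by user-agent only deters bots that honestly identify themselves; anything can send a browser-like User-Agent string.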