CAPTCHA blocking for my scraping script?
- by Surabhil Sergy
I am working on a scraping project which involves getting web data and parsing them for further use. I have been working using PHP and CURL to make scraping scripts which crawls web data and I make use of either PHP Dom or Simple HTML DOM Parser library for these kinds of projects.
On a recent project I encountered some challenges; initially I found the target website blocked my server IP such that the server could not make any successful requests to the site. Understanding these issues as common I bought a set of private proxies and tried to make request calls using them.
Though this could get successful response, I noticed the script is getting some kind of blocks after 2-3 continuous requests. On printing and checking the response I could see a pop-up asking for CAPTCHA validation. I could not see any captcha characters to be entered and it also shows an error “input error: invalid referrer”. On examining the source I could see some Google recaptcha scripts within. I’m stuck at this point and I m not able to execute my script.
My script is used for gathering data and it needs to go through a large number of pages periodically over the site. But in the current scenario I am not able to proceed with my script. I could see there are some options to overcome these captcha issues and scraping these kinds of sites too are common.
I have been checking my script performance and responses over last two months. I could see during first month I was able to execute very large number of requests from a single IP and I was able to get results. Later I get an IP block and used private proxies which could get me some results. Later I am facing now with the captcha trouble.
I would appreciate any help or suggestions in this regard.
(Often in this kind of questions I used to get a first comment as, ‘Have you asked for prior permission from the target?’ .I haven’t ,but I know there are many sites doing so to get the details out of sites and target sites may not often give access to them. I respect the legality and scraping etiquettes but I would like to know at what point I stuck and how could I overcome that! )
I could provide any supporting information if needed.