Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 6/18

  • SwingWorker in Java (beginner question)

    - by Malachi
    I am relatively new to multi-threading and want to execute a background task using a SwingWorker thread. The method that is called does not actually return anything, but I would like to be notified when it has completed. The code I have so far doesn't appear to be working:

        private void crawl(ActionEvent evt) {
            try {
                SwingWorker<Void, Void> crawler = new SwingWorker<Void, Void>() {
                    @Override
                    protected Void doInBackground() throws Exception {
                        Discoverer discover = new Discoverer();
                        discover.crawl();
                        return null;
                    }

                    @Override
                    protected void done() {
                        JOptionPane.showMessageDialog(jfThis, "Finished Crawling",
                                "Success", JOptionPane.INFORMATION_MESSAGE);
                    }
                };
                crawler.execute();
            } catch (Exception ex) {
                JOptionPane.showMessageDialog(this, ex.getMessage(),
                        "Exception", JOptionPane.ERROR_MESSAGE);
            }
        }

    Any feedback/advice would be greatly appreciated, as multi-threading is a big area of programming that I am weak in.
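
    One pitfall worth noting: the try/catch above wraps only the construction and the execute() call, so an exception thrown inside doInBackground() is never caught there; done() still runs and reports success. A minimal sketch of one way to surface the failure, by calling get() inside done() (everything except the SwingWorker API itself is taken from the question's own code):

        // requires: import java.util.concurrent.ExecutionException;
        @Override
        protected void done() {
            try {
                get(); // rethrows anything doInBackground() threw
                JOptionPane.showMessageDialog(jfThis, "Finished Crawling",
                        "Success", JOptionPane.INFORMATION_MESSAGE);
            } catch (InterruptedException | ExecutionException ex) {
                Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
                JOptionPane.showMessageDialog(jfThis, cause.getMessage(),
                        "Exception", JOptionPane.ERROR_MESSAGE);
            }
        }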

    Read the article

  • Can EC2 instances be set up to come from different IP ranges?

    - by Joshua Frank
    I need to run a web crawler, and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. So I thought distributing the work across EC2 instances might help, but I can't find any information about what the outbound IP range will be. I don't want to go to the trouble of figuring out the extra complexity of EC2 and distributed data, only to find that all the instances use the same address block and I get blocked by the server anyway. Note: this isn't for a DoS attack or anything. I'm trying to harvest data for a legitimate business purpose, I'm respecting robots.txt, and I'm only making one request per second, but the host is still shutting me down. Edit: Commenter Paul Dixon suggests that the act of blocking even my modest crawl indicates that the host doesn't want me to crawl them, and therefore that I shouldn't do it (even assuming I can work around the blocking). Do people agree with this?
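
    On the addressing question itself: each EC2 instance gets its own public IP from Amazon's pool, so separate instances do make requests from different addresses, though all of them fall within Amazon's known address blocks, which a determined host can block wholesale. An instance's public address can also be swapped by reassigning Elastic IPs; a sketch with the AWS CLI (the IDs are hypothetical):

        # allocate a fresh Elastic IP and attach it to a running instance
        aws ec2 allocate-address --domain vpc
        aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0abc1234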

    Read the article

  • Best way to store data for Greasemonkey based crawler?

    - by Björn
    I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue. What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages. My current plan is to open each profile in its own tab, so that it looks more like a normal browsing person (i.e. CSS, scripts and images will be loaded by the browser), then store the Twitter URL with GM_setValue. Once all profile pages have been crawled, create a page using the stored values. I am not so happy with the storage option, though. Maybe there is a better way? I have considered inserting the user profiles into the current page so that I could process them all with the same script instance, but I am not sure whether an XMLHttpRequest looks indistinguishable from a normal user-initiated request.

    Read the article

  • Writing a PHP web crawler using cron

    - by Horse
    Hi all. I have written myself a web crawler using simplehtmldom and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta-refreshes the page to carry on to the next page. That keeps going until it runs out of links. That works fine, but obviously the crawl time for larger websites is pretty tedious. I want to speed things up a bit, and possibly make it a cron job. Any ideas on making it as quick and efficient as possible, other than setting the memory limit / execution time higher?
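
    Much of the overhead in the meta-refresh approach is one full page load per crawled URL; a cron-launched CLI process can work through the whole queue in a single long-lived loop instead. A minimal sketch of that shape (in Java for illustration, since that is the one language already shown on this page; the in-memory queue stands in for the database table, and all names are hypothetical):

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.util.*;
        import java.util.regex.*;

        public class CrawlLoop {
            public static void main(String[] args) throws Exception {
                HttpClient client = HttpClient.newHttpClient();
                Deque<String> queue = new ArrayDeque<>(List.of("https://example.com/"));
                Set<String> seen = new HashSet<>(queue);
                // naive link extraction; a real crawler would use an HTML parser
                Pattern href = Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

                while (!queue.isEmpty()) {
                    String url = queue.poll();
                    try {
                        HttpResponse<String> resp = client.send(
                                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                                HttpResponse.BodyHandlers.ofString());
                        Matcher m = href.matcher(resp.body());
                        while (m.find()) {
                            String link = m.group(1);
                            if (seen.add(link)) queue.add(link); // enqueue unseen links only
                        }
                    } catch (Exception e) {
                        System.err.println("skipping " + url + ": " + e.getMessage());
                    }
                    Thread.sleep(1000); // politeness delay between requests
                }
            }
        }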

    Read the article

  • Guide on crawling the entire web?

    - by bohohasdhfasdf
    I just had this thought and was wondering: is it possible to crawl the entire web (just like the big boys!) on a single dedicated server (like a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps)? I've come across a paper where this was done, but I cannot recall its title; it was about crawling the entire web on a single dedicated server using some statistical model. Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl. Is it possible? I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open source solution out there already? For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?
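
    A rough back-of-envelope for that hardware (assuming an average fetched page of ~100 KB, an assumption rather than a measured figure): 100 Mbps is about 12.5 MB/s, so at full saturation roughly 125 pages/s, or around 10 million pages per day. A web-scale crawl of, say, one billion pages would therefore need about three months of continuous, perfectly efficient fetching, and at ~10 KB of compressed text per page the 750 GB disk tops out near 75 million pages. A single server can crawl a meaningful slice of the web, but nothing close to all of it.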

    Read the article

  • Open Source PHP search engine

    - by Ravi Gupta
    I am looking for an open source search engine plugin written in PHP for my website (eCommerce). Before anybody answers, I have a doubt regarding the search engine. Usually a search engine crawls web pages, creates indexes, and then uses them when looking for content. But will the same model work for eCommerce websites? Yeah, it can crawl product pages and index them, but don't you think it would be better if it indexed the products stored in the database directly? And when a user searches for any product, it would simply return the rows of the table that match the user's query. Maybe this is a stupid question, but I am new to web development, so kindly help me understand the concept. I have looked at a search engine called Sphider but didn't get what I have to do to make it work with an eCommerce website.
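
    On the "index the database directly" idea: when the catalogue already lives in a database, the usual shortcut is the database's own full-text index rather than a crawler at all. A minimal sketch using MySQL's MATCH ... AGAINST (MyISAM, or InnoDB from MySQL 5.6; table and column names are hypothetical):

        -- one-time: build a full-text index over the searchable columns
        CREATE FULLTEXT INDEX idx_products_search ON products (name, description);

        -- per query: return matching product rows, best matches first
        SELECT id, name, price
        FROM products
        WHERE MATCH(name, description) AGAINST ('red running shoes' IN NATURAL LANGUAGE MODE);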

    Read the article

  • Why deny access to website for msnbot/bingbot?

    - by Quandary
    I've seen quite a lot of tutorials that recommend banning user agents containing the strings libwww-perl and msnbot. I understand why one would ban libwww-perl; it's mainly if not only used for hacking and spamming. But why are there so many sites recommending banning msnbot/bingbot? Since it's a search engine, even if only one with a marginal market share, I would expect one would want this bot to crawl one's sites. What is it that msnbot does that makes people ban it?
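
    Worth noting: when the complaint is server load rather than abuse, a gentler alternative to banning is the robots.txt Crawl-delay directive, which msnbot/bingbot honors (Googlebot ignores it). A sketch:

        User-agent: msnbot
        Crawl-delay: 10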

    Read the article

  • What bots are really worth letting onto a site?

    - by blunders
    Having written a number of bots, and seen the massive number of random bots that crawl any given site, I am wondering: if the point of allowing bots onto a site is the potential for them to send real traffic back, is there any reason to allow bots that are not known to send real traffic back? And how do you spot these "good" bots, based on how they identify themselves, the IPs they come from, their behavior, etc.?
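
    On spotting them: the major engines document forward-confirmed reverse DNS as the way to verify their crawlers, since user-agent strings are trivially forged. A minimal Java sketch (the IP and domain suffixes are illustrative):

        import java.net.InetAddress;

        public class BotCheck {
            // reverse-resolve the IP, check the claimed domain, then
            // forward-resolve the hostname and confirm it maps back to the IP
            static boolean isVerifiedBot(String ip, String... suffixes) throws Exception {
                String host = InetAddress.getByName(ip).getCanonicalHostName();
                boolean claims = false;
                for (String s : suffixes) if (host.endsWith(s)) claims = true;
                if (!claims) return false;
                for (InetAddress a : InetAddress.getAllByName(host))
                    if (a.getHostAddress().equals(ip)) return true;
                return false;
            }

            public static void main(String[] args) throws Exception {
                System.out.println(isVerifiedBot("66.249.66.1", ".googlebot.com", ".google.com"));
            }
        }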

    Read the article

  • wordpress feeds not indexing in webmaster tools

    - by jogesh_p
    I don't have much experience with Webmaster Tools; I just know the basics, and I am not from an SEO background. I just want to know: Why are my blog's RSS feeds not being indexed, according to Webmaster Tools? Regarding the crawl stats, are mine good or bad? And is submitting the RSS feed to Webmaster Tools good for getting the pages indexed or not? I also submitted the sitemap. The link of the website is Webtech Eleven.

    Read the article

  • Should I add a "nofollow" attribute to download links, or disallow the URLs in robots.txt?

    - by Laurent
    I have a download link very similar to Opera's: it's just a script that sends the file. It doesn't have an extension, and there's no obvious way to tell that it's actually a download link. Since I don't want robots to crawl this link, do I need to add it to robots.txt, or maybe add a "nofollow" attribute to it? I see that on Opera's website they did neither, so perhaps it's not necessary?
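
    The two mechanisms do different things: a robots.txt Disallow tells compliant crawlers not to fetch the URL at all, while rel="nofollow" only marks that one link as not to be followed, so the URL can still be discovered and crawled via other links. Assuming the script lives under a path like /download (a hypothetical path), the two options look like:

        # robots.txt
        User-agent: *
        Disallow: /download

        <!-- or, per link -->
        <a href="/download?product=setup" rel="nofollow">Download</a>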

    Read the article

  • adding tagged / dynamic pages in sitemap

    - by sam
    I've got a blog that's been running for about a year. I've made about 200 posts, and there should be about 220 pages to index (additional pages for about / contact etc.). When I crawl the site I get 1,900 pages because of all the pages related to the tags I've used in my posts; about 70% of these pages contain only one blog post. When submitting my sitemap to Google, should I exclude all pages with /tagged/ in the URL so I'm only submitting unique pages, or should I submit the full sitemap?
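
    A common companion to trimming the sitemap is to keep the thin tag archives out of the index altogether while still letting crawlers follow their links, via a robots meta tag on each /tagged/ page; the sitemap then lists only the ~220 unique pages. A sketch of what each tag page would carry:

        <meta name="robots" content="noindex, follow">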

    Read the article

  • Can the update manager download only a single package at a time?

    - by SaultDon
    I need the update manager to download only a single package at a time and not try to download multiple packages at once. My slow internet cannot handle multiple connections; they slow the download to a crawl, and some packages reset themselves halfway through when they time out. Edit: apt-get update checks multiple repositories at once, and apt-get upgrade downloads multiple packages at once.
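
    One setting worth trying, based on apt.conf(5): APT's Acquire::Queue-Mode controls how downloads are parallelized ("host" opens one connection per target host; "access" opens only one connection per URI type, which effectively serializes HTTP downloads), and Acquire::http::Dl-Limit caps bandwidth. A sketch (the file name is arbitrary):

        # /etc/apt/apt.conf.d/99serial-downloads
        Acquire::Queue-Mode "access";
        Acquire::http::Dl-Limit "56";   # optional cap, in kilobytes per second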

    Read the article

  • How do I control how often search engines visit my site?

    - by Nick
    I've been using the following line in the <head> of my sites for years: <meta name="revisit-after" content="3 days" /> I recently discovered that it's not one of the meta tags that Google understands, which I take to mean that there's no point in including it, and that it's been doing no good at all for years. How often do search engines crawl a website by default, and what reliable ways are there to increase or decrease that frequency?
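
    The closest documented equivalent is the sitemaps.org protocol, whose <changefreq> and <lastmod> elements exist precisely as recrawl hints, though search engines treat changefreq as advisory at best. A minimal entry (URL and date illustrative):

        <url>
          <loc>http://www.example.com/news/</loc>
          <lastmod>2012-06-15</lastmod>
          <changefreq>daily</changefreq>
        </url>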

    Read the article

  • Can a site recover by itself after dropping google page rank for 404 errors?

    - by Jeff
    I recently redid a website and changed the directory/URL structure. I did some .htaccess redirects for the main landing pages; however, when reviewing Webmaster Tools, I received 404 errors for the rest of the changed URLs and noticed that Google had dropped my site from the #1 position to around the 5th page. I corrected all the 404s by providing redirects in the .htaccess, resubmitted the sitemap, and tested the Google crawl of the site. Will my page regain its rank by itself, or am I going to have to put time into it like I originally did?
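
    For reference, the redirects in question are plain 301s in .htaccess; both the one-off and the pattern form are shown below (paths hypothetical):

        # one-off moves
        Redirect 301 /old-section/page.html /new-section/page/

        # or a pattern covering a whole renamed directory
        RewriteEngine On
        RewriteRule ^old-section/(.*)$ /new-section/$1 [R=301,L]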

    Read the article

  • The Importance of SEO Articles

    If you have ever tried creating your own website or blog, then you have probably heard about the importance of search engine optimization (SEO). This is the process of optimizing your site so that search engines such as Google and Yahoo can find and crawl it easily. This allows your site to rank higher in the search results when people search for the keywords you are targeting.

    Read the article

  • Google Caffeine - How Will it Affect Your Web Site?

    Google, one of the most used search engines, is rolling out a new indexing system that will make its search faster for searchers and let it crawl the web faster, so that rankings are updated more quickly. Many people are concerned about the changes and how they will affect their search engine optimization.

    Read the article

  • Keyphrases - How to Use Them

    Keywords and phrases are words which trigger a response from the search engine spiders (automated programs that crawl the web looking for new content to index). They are effective if they are tuned to what people are typing into the search engines right now, and you can find this out through the Google AdWords Keyword Tool.

    Read the article

  • SEO - Coaching Newbies

    All search engines use algorithms, and each search engine has its own. An algorithm is the formula that the search engine uses to evaluate your web pages. The robots will crawl all pages on your site, but not all pages will be indexed.

    Read the article

  • 7 Ways to Get Ranked in Google Within 24 Hours

    In order for a website to receive more targeted visits, it needs to be indexed by major search engines like Google. But if you don't know the right strategy, it can take weeks before search engine spiders crawl your pages. Listed below are proven techniques for getting indexed in less than 24 hours.

    Read the article

  • When Canonicalization is an Issue

    Although extremely hard to pronounce, canonicalization is a hot topic right now. If there are a lot of URLs that lead to essentially the same page, you're going to make the search engines work extra hard and spend a lot more time crawling all the different URLs. Often, this means they'll miss the important pages of your website, because the time crawlers spend on your site is limited and your site may be too slow.

    Read the article

  • How RSS Feeds Help in SEO Optimization

    RSS, which stands for Really Simple Syndication, is a web feed format designed to publish frequently updated content such as blog posts, podcasts and videos. Submitting your RSS feed to blog directories allows the search engines to crawl your blog more often, so that they pick up new content sooner.

    Read the article

  • 7 Ways to Get Your Website Noticed

    Putting your website on the internet is only the first step to getting it noticed. Sooner or later, the web spiders will crawl around it, and a while after that you'll start appearing in the search results, but probably only for very specific searches such as your company name. And, let's face it, if someone knows your company name they probably already know a bit about you.

    Read the article
