Search Results

Search found 287 results on 12 pages for 'crawling pasta hellion'.

Page 9/12 | < Previous Page | 5 6 7 8 9 10 11 12 | Next Page >

Do or can robots cause considerable performance issues?

- by Anicho

So the question in the title is exactly what I am trying to find out. My case is: At work we are in a discussion with team members who seem to think bots will cause us problems relating to performance when running on our services website. Out setup: Lets say I have site www.mysite.co.uk this is a shop window to our online services which sit on www.mysiteonline.co.uk. When people search in google for mysite they see mysiteonline.co.uk as well as mysite.co.uk. Cases against stopping bots crawling: We don't store gb's of data publicly available on the web Most friendly bots, if they were to cause issues would have done so already In our instance the bots can't crawl the site because it requires username & password Stopping bots with robot .txt causes an issue with seo (ref.1) If it was a malicious bot, it would ignore robot.txt or meta tags anyway Ref 1. If we were to block mysiteonline.co.uk from having robots crawl this will affect seo rankings and make it inconvenient for users who actively search for mysite to find mysiteonline. Which we can prove is the case for a good portion of our users.

Read the article
Guide to the web development ecosystem

- by acjohnson55

I'm a long-time software developer, and I've been thrown in the deep, deep end of developing from the ground up what will hopefully be a highly scalable and interactive web application. I've been out of the web game for about 8 years, and even when I was last in it, I wasn't exactly on the cutting edge. I think I've made judicious design decisions and I'm quite happy with the progress I've been making so far, but new, hot web technologies keep crawling out of the woodwork and into my headspace, forcing me to continually revalidate my implementation decisions. Complicating things even further is the preponderance of out-of-date information and the difficulty of knowing what is out of date in the first place. What I'm wondering is, are there any comprehensive books or guides dedicated to compiling and comparing the technologies out there, end-to-end in the web application stack? I'm happy to learn new techs on demand, but I don't like learning about them after I've already spent time going in another direction. I'm looking for the sort of executive info a CTO might read to make sure the best architectural decisions are being made. And just to be clear, this is a question about resources, not about specific technology suggestions.

Read the article
ArchBeat Link-o-Rama for 2012-04-03

- by Bob Rhubart

Crawling a Content Folio | Kyle Hatlestad blogs.oracle.com Kyle Hatlestad shares detials on a component developed by Ed Bryant that simplifies the task of "consuming and publishing that folio on a Site Studio page or in your portal using RIDC." Northeast Ohio Oracle Users Group 2 Day Seminar - May 14-15 - Cleveland, OH www.neooug.org More than 20 sessions over 4 tracks, featuring 18 speakers, including Oracle ACE Director Cary Millsap, Oracle ACE Director Rich Niemiec, and Oracle ACE Stewart Brand. Register before April 15 and save. OTN Member discounts for April www.oracle.com Save up to 40% on titles from Oracle Press, Pearson, O'Reilly, Apress, and more. The Java EE 6 Example - Galleria - Part 1 | Markus Eisele blog.eisele.net Oracle ACE Director Markus Eisele heaps praise on Vineet Reynolds' Java EE 6 Galleria demo application, which demonstrates the use of JSF 2.0 and JPA 2.0 in a Java EE project using Domain Driven Design. Reminder: JavaOne Call For Papers Closing April 9th, 11:59pm | Arun Gupta blogs.oracle.com One week left to submit your JavaOne papers. Narrowing the gap between UI design and ADF development | Jack Ritzen www.nl.capgemini.com "Joining my first demo project I was confronted with two traditional contradictory worlds," says Jack Ritzen. "In the left corner; me, as a beginning GUI designer. And in the right, a heavyweight ADF developer. Let the game begin!" Thought for the Day "Operating systems are like underwear — nobody really wants to look at them." — Bill Joy

Read the article
ArchBeat Link-o-Rama Top 20 for April 1-9, 2012

- by Bob Rhubart

The top 20 most popular items shared via my social networks for the week of April 1 - 8, 2012. Webcast: Oracle Maximum Availability Architecture Best Practices w/Tom Kyte - April 12 Oracle Cloud Conference: dates and locations worldwide Bad Practice Use Case for LOV Performance Implementation in ADF BC | Oracle ACE Director Andresjus Baranovskis How to create a Global Rule that stores a document’s folder path in a custom metadata field | Nicolas Montoya MySQL Cluster 7.2 GA Released How to deal with transport level security policy with OSB | Jian Liang Webcast Series: Data Warehousing Best Practices http://bit.ly/I0yUx1 Interactive Webcast and Live Chat: Oracle Enterprise Manager Ops Center 12c Launch - April 12 Is This How the Execs React to Your Recommendations? | Rick Ramsey Unsolicited login with OAM 11g | Chris Johnson Event: OTN Developer Day: MySQL - New York - May 2 OTN Member discounts for April: Save up to 40% on titles from Oracle Press, Pearson, O'Reilly, Apress, and more Get Proactive with Fusion Middleware | Daniel Mortimer How to use the Human WorkFlow Web Services | Oracle ACE Edwin Biemond Northeast Ohio Oracle Users Group 2 Day Seminar - May 14-15 - Cleveland, OH IOUG Real World Performance Tour, w/Tom Kyte, Andrew Holdsworth, Graham Wood WebLogic Server Performance and Tuning: Part I - Tuning JVM | Gokhan Gungor Crawling a Content Folio | Kyle Hatlestad The Java EE 6 Example - Galleria - Part 1 | Oracle ACE Director Markus Eisele Reminder: JavaOne Call For Papers Closing April 9th, 11:59pm | Arun Gupta Thought for the Day "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

Read the article
.Net search engine architecture and technology choice

- by shrivb

I am in the process of designing a search engine for an asp.net site. The site currently uses Microsoft Indexing Server to index and search content which range from simple text files to MS documents to PDFs. MIS is also used to crawl File servers. MIS in tandem with Index Server Companion crawls for content from external sites. I intend to replace MIS with the indexer/crawler I am trying to build. Since my platform is completely on the Microsoft stack, I cant afford to have a Java application server. Thus, Solr, and effectively, SolrNet is ruled out. With this being the context, I have couple of questions. 1.Technology choice I had done my initial investigation and looked at Lucene.Net. There seemed to be 2 issues in using Lucene.Net. First being, it cant crawl external content. There doesn't seem to be a direct port of Nutch in .Net. Second, since it is just an indexer, it cant parse various document types. The parsing is left to the developer. So, what would be best technology choice on the .Net platform to achieve indexing & crawling? Are there any .Net open source libraries available for document parsing? 2.Architectural pattern Is there any general architectural pattern or best practice that needs to be followed in designing such a search engine? Thanks in advance.

Read the article
Turn-based JRPG battle system architecture resources

- by BenoitRen

The past months I've been busy programming a 2D JRPG (Japanese-style RPG) in C++ using the SDL library. The exploration mode is more or less done. Now I'm tackling the battle mode. I have been unable to find any resources about how a classic turn-based JRPG battle system is structured. All I find are discussions about damage formula. I've tried googling, searching gamedev.net's message board, and crawling through C++-related questions here on Stack Exchange. I've also tried reading source code of existing open source RPGs, but without a guide of some sort it's like trying to find a needle in a haystack. I'm not looking for a set of rules like D&D or anything similar. I'm talking purely about code and object structure design. A battle system asks the player for input using menus. Next the battle turn is executed as the heroes and the enemies execute their actions. Can anyone point me in the right direction? Thanks in advance.

Read the article
JavaScript loaded external content SEO

- by user005569871

I wonder what is the best way to have Javascript loaded content indexed by search engines. I know that search engines don't execute Javascript, but I am thinking more of an progressive enchantment. I am creating a responsive website, and on the home page I will have some sections about most visited products and recommended product that I plan to load depending on the device detected. These products will be in sliders with thumbnail images and names of the products. If mobile is detected slider content will not load, ant the link to the external page will be shown. I know that external content will be indexed via link to those resources. Where will the users be directed from search in this case? To the external page or home page? Will it be bad for SEO if I show only product names on front page so they can be indexed and hide them with CSS? What is the best way to index that content and possibly direct users from search to home page? Also, i've seen the Ajax crawling but iI would like not to use that if there is any better way.

Read the article
Ranking drop after using reverse proxy for blog subdirectory and robots.txt for old blog subdomain

- by user40387

We have a 3Dcart store and a WordPress blog hosted on a separate server. Originally, we had a CNAME set up to point the blog to http://blog.example.com/. However, in our attempt to boost link-based and traffic-based authority on the main site, we've opted to do a reverse proxy to http://www.example.com/blog/. It’s been about two months since we finished the reverse proxy migration. It appears that everything is technically working as intended, including some robots and sitemap changes; the new URLs are even generating some traffic, as indicated on Google Analytics. While Google has been indexing the new URL locations, they’re ranking very poorly, even for non-competitive, long-tail keywords. Meanwhile, the old subdomain URLs are still ranking mostly as well as they used to (even though they aren’t showing meta titles and descriptions due to being blocked by robots.txt). Our working theory is that Google has an old index of the subdomain URLs, and is considering the new URLs to be duplicate content, since it’s being told not to crawl the subdomain and therefore can’t see the rel canonicals we have in place. To resolve this, we’ve updated the subdomain’s robot.txt to no longer block crawling and indexing. Theoretically, seeing the canonical tag on the subdomain pages will resolve any perceived duplicate content issues. In the meantime, we were wondering if anyone would have any other ideas. We are very concerned that we’ll be losing valuable traffic, as we’re entering our on season at the moment.

Read the article
De-index URL paremeters

- by Doug Firr

Upon reading over this question is lengthy so allow me to provide a one sentence summary: I need to get Google to de-index URLs that have certain parameters appended I have a website example.com with language translations. There used to be many translations but I deleted them all so that only English (Default) and French options remain. When one selects a language option a parameter is aded to the URL. For example, the home page: https://example.com (default) https://example.com/main?l=fr_FR (French) I added a robots.txt to stop Google from crawling any of the language translations: # robots.txt generated at http://www.mcanerin.com User-agent: * Disallow: Disallow: /cgi-bin/ Disallow: /*?l= So any pages containing "?l=" should not be crawled. I checked in GWT using the robots testing tool. It works. But under html improvements the previously crawled language translation URLs remain indexed. The internet says to add a 404 to the header of the removed URLs so the Googles knows to de-index it. I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist. This URL was listed in GWT under duplicate title tags (One of the reasons I want to scrub up my URLS) https://example.com/reports/view/884?l=vi_VN&l=hy_AM This URL should not exist - I removed the language translations. The page loads when it should not! I played around. I typed example.com?whatever123 It seems that parameters always load as long as everything before the question mark is a real URL. So if Google has indexed all these URLS with parameters how do I remove them? I cannot check if a 404 is being generated because the page always loads because it's a parameter that needs to be de-indexed.

Read the article
Sudden drop in Total Indexed pages and increase in 'Not Selected' number.

- by Pravin

My blog is around 1 year old and have PR2. The average daily pageviews upto last 1 week were 1800. The total number of posts are 180. Though I have only 180 total posts, the total number of Indexed URL was increasing and it was as high as 510. But in the month of Sept2012, the total number of Indexed pages dropped from 510 to 214. The drop was sudden and it is now increasing very slowly. Also, the other main concern is huge increase in 'Not Selected' number. It is currently 814. I have never posted any post again and never copied any idea from any other blog. But I do use internal linking to some older post those are related to the new posts. The questions are:; Why there is sudden drop in the 'Total Indexed' pages. Why there was increase in total indexed pages to 500 even though the total posts were only 180. As the drop in 'Total Indexed' was in the month of sept2012, I was getting same organic traffic and it was steadily increasing till last week and then there was a 50 drop in the total pageviews. Why. Now, again the traffic is becoming to normal but still there is a problem. Is increase in the 'Not selected' number is a cause of drop in 'Total Indexed'? How to prevent or reduce the number of 'Not Selected' even though I do not have any duplicate post withing blog. Is the 'internal linking' to older post creating 'Not selected' problem? Should I edit my 'Robot.txt' to avoid crawling of labes that may be creating duplicate posts or something like that, if so, what is correct robot.txt. I have uploaded the screenshot of the graph of Webmaster Tools. Please take a look and give suggestions. Please help. Thank you in advance.

Read the article
How can I tell GoogleBot that a subdirectory is now a subdomain?

- by cwd

I had about a million pages of a catalog indexed under a subdirectory, and now that's moved to a subdomain. GoogleBot is crawling each one of them and getting a 301 redirect to the new location. Even though I have set up the redirect rule in the apache sites-enabled configuration file, (i.e. it's early on when apache does the redirect - PHP is not even getting loaded), even though I have done that, the server isn't handling the load well. GoogleBot is making around 5 requests per second, and on top of my normal traffic that is hiking up the CPU for a few hours at a time. I checked in Webmaster Tools and the corresponding documentation for a way to let Google know that the content had been moved from a subdirectory to a subdomain, but with little luck. Basically the most helpful thing I saw said to just send 301 headers for the new location. How can I tell GoogleBot that a subdirectory is now a subdomain? If that is not an option, how can I more efficiently send 301 redirects out for a particular subdomain? I was thinking perhaps the Nginx server but I'm not sure that I can run both Apache and Nginx side by side on port 80 for different subdomains.

Read the article
Disallow robots.txt from being accessed in a browser but still accessible by spiders?

- by Michael Irigoyen

We make use of the robots.txt file to prevent Google (and other search spiders) from crawling certain pages/directories in our domain. Some of these directories/files are secret, meaning they aren't linked (except perhaps on other pages encompassed by the robots.txt file). Some of these directories/files aren't secret, we just don't want them indexed. If somebody browses directly to www.mydomain.com/robots.txt, they can see the contents of the robots.txt file. From a security standpoint, this is not something we want publicly available to anybody. Any directories that contain secure information are set behind authentication, but we still don't want them to be discoverable unless the user specifically knows about them. Is there a way to provide a robots.txt file but to have it's presence masked by John Doe accessing it from his browser? Perhaps by using PHP to generate the document based on certain criteria? Perhaps something I'm not thinking of? We'd prefer a way to centrally do it (meaning a <meta> tag solution is less than ideal).

Read the article
mod_rewrite and SEO friendliness

- by John Doe

My website has an atypical structure and I'm not sure if this could create problems in the long run, specially for SEO positioning purposes. I have a unique, large PHP script, and I use the Apache module mod_rewrite in the .htaccess file to create friendly URLs, for example: RewriteRule ^$ /index.php?section=Main RewriteRule ^createArticle$ /index.php?section=Main&view=CreateArticle RewriteRule ^configuration$ /index.php?section=Configuration RewriteRule ^article/([0-9]{1,10})$ /index.php?section=Article&view=Default&id=$1 RewriteRule ^deleteArticle/([0-9]{1,10})$ /index.php?section=Article&view=Delete&id=$1 RewriteRule ^reportArticle/([0-9]{1,10})$ /index.php?section=Article&view=Report&id=$1 RewriteRule ^logIn$ /index.php?section=Authentication ... So, www.example.com/index.php?section=Article&view=Default&id=105 would become www.example.com/article/105. The only real physical file is index.php, in which the parameters of the URL queried is processed and the corresponding result is outputted. My question is, do the crawling robots (e.g. Googlebot) recognize these links? Do they index the resulting HTML outputted by index.php with the specified parameters as if it was a actual HTML file? Also, would this become a problem when creating a Sitemap?

Read the article
Can I use nofollow for offsite links without it affecting my page rank?

- by Jack

What I have is a page with almost all offsite links. Each clicked link is forwarded on to the destination. What I would like the search engines to do is to index the text between the anchor tag and not follow the link itself. <a href="somelink">Index This Text Only</a> I've read several articles and they all seem to contradict themselves as to when to use nofollow. What's been happening over the past 2 months that the site has been live is that both Google and Bing are crawling the site as well as all the links on the site that it has been forwarded to. The search engines are now generating a lot of 404s for images and files that never existed on my site but rather seems to correlate to the site it was forwarded to. The search engines don't seem to honor the 302 header when forwarded. I would like to get a definitive answer on the nofollow tag as it relates to my situation. Can I use nofollow to stop the 404s and if so, will it affect my page ranking negatively?

Read the article
Dualboot (Win 8 / Ubuntu 13) is stuck at 'switching to clocksource'

- by Daniel Puscht

for days I have been crawling the web for solutions to my problem, but couln't find any. Here it is: I got a new Laptop (ASUS Vivobook S200E) with Win 8 OEM preinstalled. I wanted to create a dual-boot system with Ubuntu 13 next to it. I read about UEFI and that I have to turn of Secure Boot and use the existing EFI partition as bootloader for Ubuntu. So I did. I also ran boot-repair reinstalling the GRUB. The result is when I start the computer I get into the boot menu. So far, so good. When I pick Win everthing is fine. But when I choose Ubuntu (recovery) the system starts, but gets stuck at the line '[1.806366] Switching to clocksource tsc'. I already tried other versions of Ubuntu (12.04.2, 12.10). I played with boot-repair (using the recommended fix, setting everything manually). But nothing works. It's always the same issue. I read that it could be a problem concerning graphic drivers, but this I can hardly believe. If this is any help, boot-repair gave me this link to post in fora. http://paste.ubuntu.com/5810391/ Thanks for any help in advance

Read the article
ASP.NET MVC, Web API, Razor, e Open Source (Código Aberto)

- by Leniel Macaferi

A Microsoft tornou o código fonte da ASP.NET MVC disponível sob uma licença open source (de código aberto) desde a primeira versão V1. Nós também integramos uma série de grandes tecnologias de código aberto no produto, e agora entregamos jQuery, jQuery UI, jQuery Mobile, jQuery Validation, Modernizr.js, NuGet, Knockout.js e JSON.NET como parte integrante dos lançamentos da ASP.NET MVC. Estou muito animado para anunciar hoje que também iremos liberar o código fonte da ASP.NET Web API e ASP.NET Web Pages (também conhecido como Razor) sob uma licença open source (Apache 2.0), e que iremos aumentar a transparência do desenvolvimento de todos os três projetos hospedando seus repositórios de código no CodePlex (usando o novo suporte ao Git anunciado na semana passada - em Inglês). Isso permitirá um modelo de desenvolvimento mais aberto, onde toda a comunidade será capaz de participar e fornecer feedback nos checkins (envios de código), corrigir bugs, desenvolver novos recursos, e construir e testar os produtos diariamente usando a versão do código-fonte e testes mais atualizada possível. Nós também pela primeira vez permitiremos que os desenvolvedores de fora da Microsoft enviem correções e contribuições de código que a equipe de desenvolvimento da Microsoft irá rever para potencial inclusão nos produtos. Nós anunciamos uma abordagem de desenvolvimento semelhantemente aberta com o Windows Azure SDK em Dezembro passado, e achamos que essa abordagem é um ótimo caminho para estreitar as relações, pois permite um excelente ciclo de feedback com os desenvolvedores - e, finalmente, permite a entrega de produtos ainda melhores, como resultado. Muito importante - ASP.NET MVC, Web API e o Razor continuarão a ser totalmente produtos suportados pela Microsoft que são lançados tanto independentemente, bem como parte do Visual Studio (exatamente da mesma maneira como é feito hoje em dia). Eles também continuarão a ser desenvolvidos pelos mesmos desenvolvedores da Microsoft que os constroem hoje (na verdade, temos agora muito mais desenvolvedores da Microsoft trabalhando na equipe da ASP.NET). Nosso objetivo com o anúncio de hoje é aumentar ainda mais o ciclo de feedback/retorno sobre os produtos, para nos permitir oferecer produtos ainda melhores. Estamos realmente entusiasmados com as melhorias que isso trará. Saiba mais Agora você pode navegar, sincronizar e construir a árvore de código fonte da ASP.NET MVC, Web API, e Razor através do website http://aspnetwebstack.codeplex.com. O repositório Git atual no site refere-se à árvore de desenvolvimento do marco RC (release candidate/candidata a lançamento) na qual equipe vem trabalhando nas últimas semanas, e esta mesma árvore contém ambos o código fonte e os testes, e pode ser construída e testada por qualquer pessoa. Devido aos binários produzidos serem bin-deployable (DLLs instaladas diretamente na pasta bin sem demais dependências), isto permite a você compilar seus próprios builds e experimentar as atualizações do produto, tão logo elas sejam adicionadas no repositório. Agora você também pode contribuir diretamente para o desenvolvimento dos produtos através da revisão e envio de feedback sobre os checkins de código, enviando bugs e ajudando-nos a verificar as correções tão logo elas sejam enviadas para o repositório, sugerindo e dando feedback sobre os novos recursos enquanto eles são implementados, bem como enviando suas próprias correções ou contribuições de código. Note que todas as submissões de código serão rigorosamente analisadas ??e testadas pelo Time da ASP.NET MVC, e apenas aquelas que atenderem a um padrão elevado de qualidade e adequação ao roadmap (roteiro) definido para as próximas versões serão incorporadas ao código fonte do produto. Sumário Todos nós da equipe estamos realmente entusiasmados com o anúncio de hoje - isto é algo no qual nós estivemos trabalhando por muitos anos. O estreitamento no relacionamento entre a comunidade e os desenvolvedores nos permitirá construir produtos ainda melhores levando a ASP.NET para o próximo nível em termos de inovação e foco no cliente. Obrigado! Scott P.S. Além do blog, eu uso o Twitter para disponibilizar posts rápidos e para compartilhar links. Meu apelido no Twitter é: @scottgu Texto traduzido do post original por Leniel Macaferi.

Read the article
Where to store front-end data for "object calculator"

- by Justin Grahn

I recently have completed a language library that acts as a giant filter for food items, and flows a bit like this :Products -> Recipes -> MenuItems -> Meals and finally, upon submission, creates an Order. I have also completed a database structure that stores all the pertinent information to each class, and seems to fit my needs. The issue I'm having is linking the two. I imagined all of the information being local to each instance of the product, where there exists one backend user who edits and manipulates data, and multiple front end users who select their Meal(s) to create an Order. Ideally, all of the front end users would have all of this information stored locally within the library, and would update the library on startup from a database. How should I go about storing the data so that I can load it into the library every time the user opens the application? Do I package a database onboard and just load and populate every time? The only method I can currently conceive of doing this, even if I only have 500 possible Product objects, would require me to foreach the list for every Product that I need to match to a Recipe and so on and so forth every time I relaunch the program, which seems like a lot of wasteful loading. Here is a general flow of my architecture: Products: public class Product : IPortionable { public Product(string n, uint pNumber = 0) { name = n; productNumber = pNumber; } public string name { get; set; } public uint productNumber { get; set; } } Recipes: public Recipe(string n, decimal yieldAmt, Volume.Unit unit) { name = n; yield = new Volume(yieldAmt, unit); yield.ConvertUnit(); } /// <summary> /// Creates a new ingredient object /// </summary> /// <param name="n">Name</param> /// <param name="yieldAmt">Recipe Yield</param> /// <param name="unit">Unit of Yield</param> public Recipe(string n, decimal yieldAmt, Weight.Unit unit) { name = n; yield = new Weight(yieldAmt, unit); } public Recipe(Recipe r) { name = r.name; yield = r.yield; ingredients = r.ingredients; } public string name { get; set; } public IMeasure yield; public Dictionary<IPortionable, IMeasure> ingredients = new Dictionary<IPortionable,IMeasure>(); MenuItems: public abstract class MenuItem : IScalable { public static string title = null; public string name { get; set; } public decimal maxPortionSize { get; set; } public decimal minPortionSize { get; set; } public Dictionary<IPortionable, IMeasure> ingredients = new Dictionary<IPortionable, IMeasure>(); and Meal: public class Meal { public Meal(int guests) { guestCount = guests; } public int guestCount { get; private set; } //TODO: Make a new MainCourse class that holds pasta and Entree public Dictionary<string, int> counts = new Dictionary<string, int>(){ {MainCourse.title, 0}, {Side.title , 0}, {Appetizer.title, 0} }; public List<MenuItem> items = new List<MenuItem>(); The Database just stores and links each of these basic names and amounts together usings ID's (RecipeID, ProductID and MenuItemID)

Read the article
Cross-domain jQuery using YQL gives robots.txt error

- by Jens Roland

On the page http://qxlapps.dk/test.htm I am trying to perform an Ajax load from another domain, qxlapp.dk. I am using James Padolsey's xdomainajax.js plugin from: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/ When I open my test page, I get no output, but FireBug shows the JSON result, including the error message: "forbidden":"robots.txt for the domain disallows crawling for url: http://qxlapp.dk/projects/dagens_kup/show.php". The robots.txt on the qxlapp.dk domain contains the following: User-agent: Yahoo Pipes 2.0 Allow: / User-agent: * Allow: / So I don't see what the problem is? Shouldn't it pull the page just fine with those settings?

Read the article
SQL Server 2005, Sudden increase of connections - SharePoint 2007

- by CrazyNick

We observed that sudden increase of SQL connections during a specific hour, it is a backend of a SharePoint 2007 Farm. From SharePoint 2007 Perspective: 1. Incremental crawling is scheduled at that time and few of the Timer jobs (normal timer jobs) are scheduled to run every mins / per 10mins. 2. Number of user requests are less. From SQL Server 2005 Perspective: 1. Transaction log backup is scheduled at that time 2. No other scheduled jobs are running at that time. so, how to narrow down the issue, what would be causing the sudden SQL connection increase?

Read the article
Extracting pure content / text from HTML Pages by excluding navigation and chrome content

- by Ankur Gupta

Hi, I am crawling news websites and want to extract News Title, News Abstract (First Paragraph), etc I plugged into the webkit parser code to easily navigate webpage as a tree. To eliminate navigation and other non news content I take the text version of the article (minus the html tags, webkit provides api for the same). Then I run the diff algorithm comparing various article's text from same website this results in similar text being eliminated. This gives me content minus the common navigation content etc. Despite the above approach I am still getting quite some junk in my final text. This results in incorrect News Abstract being extracted. The error rate is 5 in 10 article i.e. 50%. Error as in Can you Suggest an alternative strategy for extraction of pure content, Would/Can learning Natural Language rocessing help in extracting correct abstract from these articles ? How would you approach the above problem ?. Are these any research papers on the same ?. Regards Ankur Gupta

Read the article
What is the right license for tutorial source code ?

- by devdude

Putting sourcecode from tutorials or books online requires the author to add some kind of disclaimer or license (otherwise people would use it make lots of $$$ or break a power plant IT control system and sue you as author). But what is the right license or disclaimer statement ? Can I use BSD license with ... IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,... We are talking about tutorials ! Released to teach and share knowledge. Or do I need to follow the (potentially different) licenses of the libraries I use ? Might be insignificant now but I feel we will face a license hunt in "public" sourcecode (aka OSS) in future, similar to companies/lawyers currently crawling the web for pictures with wrong copyright statement or infringing their IP (and suing someone using a picture in a personal blog,etc..).

Read the article
What are the best measures to protect content from being crawled?

- by Moak

I've been crawling a lot of websites for content recently and am surprised how no site so far was able to put up much resistance. Ideally the site I'm working on should not be able to be harvested so easily. So I was wondering what are the best methods to stop bots from harvesting your web content. Obvious solutions: Robots.txt (yea right) IP blacklists What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them crap data? Just looking for ideas, no right/wrong answer

Read the article
iTunes Visualization -- What type of code is it written in and what does that code look like?

- by Christopher Altman

Being a web developer, I know how event driven user interfaces are written, but do not have insight into other families of code (embedded software like automotive software, automation software on assembly lines, drivers, or the crawling lower-thirds on CNN, etc.) I was looking at the iTunes visualizer (example) and am curious: What code is used to write the visualizer? Objective C? Does it use Core Animation? What type of abstraction does that library offer? What does the code look like? Is it a list of mathematical equations for producing the crazy graphics? Is it a list of key frames with tweening? Is there an array of images, fractals, worm holes, flowers, sparkles, and some magic mixes them together. Or something totally different? I am not looking for a tutorial, just an understanding of how something very different than web development works. Oh yah, I know iTunes is closed source, so all of this is conjecture.

Read the article
Millions of anonymous ASP.Net profiles!?

- by Mantorok

Hi all, some advice needed! Our website receives approximately 50,000 hits a day, and we use anonymous ASP.Net membership profiles/users, this is resulting in millions (4.5m currently) of "active" profiles and the database is 'crawling', we have a nightly task that cleans up all the inactive ones. There is no way that we have 4.5m unique visitors (our county population is only 1/2 million), could this be caused by crawlers and spiders? Also, if we have to live with this huge number of profiles is there anyway of optimising the DB? Thanks Kev

Read the article
Django : proper way to use model, duplicates!

- by llazzaro

Hello, I have a question about the proper, best way to manage the model. I am relative newbie to django, so I think I need to read more docs, tutorials,etc (suggestions for this would be cool!). Anyway, this is my question : I have a python web crawler, that is "connected" with django model. Crawling is done once a day, so its really common to find "duplicates". To avoid duplicates I do this : cars = Car.Objects.filter(name=crawledItem['name']) if len(cars) 0: #object already exists, update it car = cars[0] else: car = Car() #some non-relevant code here car.save() I want to know, if this is the proper/correct way to do it or its any "automatic" way to do it. Its possible to put the logic inside the Car() constructor also, should I do that? Thanks a lot!

Read the article

Search Results

Search found 287 results on 12 pages for 'crawling pasta hellion'.

Page 9/12 | < Previous Page | 5 6 7 8 9 10 11 12 | Next Page >

- by Anicho

- by acjohnson55

- by Bob Rhubart

- by Bob Rhubart

- by shrivb

- by BenoitRen

- by user005569871

- by user40387

- by Doug Firr

- by Pravin

- by cwd

- by Michael Irigoyen

- by John Doe

- by Jack

- by Daniel Puscht

- by Leniel Macaferi

- by Justin Grahn

- by Jens Roland

- by CrazyNick

- by Ankur Gupta

- by devdude

- by Moak

- by Christopher Altman

- by Mantorok

- by llazzaro

< Previous Page | 5 6 7 8 9 10 11 12 | Next Page >