crawling - Page 7 - Developer IT

SQL Server 2005, Sudden increase of connections - SharePoint 2007

- by CrazyNick

We observed that sudden increase of SQL connections during a specific hour, it is a backend of a SharePoint 2007 Farm. From SharePoint 2007 Perspective: 1. Incremental crawling is scheduled at that time and few of the Timer jobs (normal timer jobs) are scheduled to run every mins / per 10mins. 2. Number of user requests are less. From SQL Server 2005 Perspective: 1. Transaction log backup is scheduled at that time 2. No other scheduled jobs are running at that time. so, how to narrow down the issue, what would be causing the sudden SQL connection increase?

Read the article

SQL Server 2005, Sudden increase of connections - SharePoint 2007

- by CrazyNick

We observed that sudden increase of SQL connections during a specific hour, it is a backend of a SharePoint 2007 Farm. From SharePoint 2007 Perspective: 1. Incremental crawling is scheduled at that time and few of the Timer jobs (normal timer jobs) are scheduled to run every mins / per 10mins. 2. Number of user requests are less. From SQL Server 2005 Perspective: 1. Transaction log backup is scheduled at that time 2. No other scheduled jobs are running at that time. so, how to narrow down the issue, what would be causing the sudden SQL connection increase?

Read the article

Sharepoint Search crawl not working

- by Satish

Search Crawling is error out on my MOSS 2007 installation. I get the following error for all the web apps I have following error in Crawl logs. http://mysites.devserver URL could not be resolved. The host may be unavailable, or the proxy settings are not configured correctly on the index server. The Application Event log also has the following corresponding error The start address http://mysites.devserver cannot be crawled. Context: Application 'SSPMain', Catalog 'Portal_Content' Details: The URL of the item could not be resolved. The repository might be unavailable, or the crawler proxy settings are not configured. To configure the crawler proxy settings, use the Proxy and Timeout page in search administration. (0x80041221) I'm using Windows 2008 server. I tried accessing the site using the above mentioned url and its available. I did the registry setting for loop back issue found here http://support.microsoft.com/kb/896861 still not luck. Any Ideas?

Read the article

What sources do spammers use to get email addresses?

- by Andrew Grimm

From what sources do email spammers get their addresses? Wikipedia mentions the following: Harvesting email addresses from publicly available sources. This includes web pages (web crawling), usenet posts, mailing list archives, DNS and WHOIS records Guessing email addresses (directory harvest attack) Asking people for their emails for one purpose, such as jokes of the day, and selling the email addresses elsewhere Getting access to people's address books (which Quechup utilized) Scanning an infected computer for email addresses. Are there any other techniques used? Are any of the techniques above now obsolete?

Read the article

How do I rate limit google's crawl of my class C IP block?

- by Zak

I have several sites in a class C network that all get crawled by google on a pretty regular basis. Normally this is fine. However, when google starts crawling all the sites at the same time, the small set of servers that back this IP block can take a pretty big hit on load. With google webmaster tools, you can rate limit the googlebot on a given domain, but I haven't found a way to limit the bot across an IP network yet. Anyone have experience with this? How did you fix it?

Read the article

My HP-Vista based laptop has become very slow recently

- by goldenmean

My HP laptop which has Vista Home premium. When I try to start Firefox, internet explorer, it becomes very slow. No other app. When i checked the Performance in Task Manager. It shows the Physical memory , Free as 0 bytes, almost always. This has been recently. Earlier it didn't used to be zero. Laptop has 2GB of RAM. I have nothing running in my tray except - Sound control, Laptop power plan indicator,Network status indicator. There are no other processes whose memory usage adds up to so high to make Free memory as 0. Then what could be hogging the memory and make the laptop very slow. Any pointers would help as it is crawling at the moment.

Read the article

How do multiple displays work on a AMD 785G / ATI HD 4200 motherboard?

- by aireq

I just ordered a ASUS M4A785TD-V EVO which has the AMD 785G chipset and HD4200 integrated graphic. The board has VGA, DVI, and HDMI outputs. I'm wondering how many outputs I can run at once, and from what connectors? My guess is that I can only use the VGA, and either the DVI or the HDMI in a dual setup. But not the HDMI and the DVI at the same time. Is this correct? If I have devices plugged into both the HDMI and the DVI ports is there a way to choose between which port I want to use? I have a dual 19" monitor setup, as well as a LCD TV. I'd like to run the VGA and the DVI into my two monitors, and then the HDMI to my TV. Then when I want to watch something on the TV I'd like to be able to switch over from the DVI to the HDMI. Is this possible with out crawling under my desk and unplugging/plugging things in?

Read the article

Do virtual machines perform better on the host HDD or USB drive?

- by Jeremy Ricketts

The question I'm asking is kind of general, and I'll give more specifics about my specific setup. Here's the main question though: Do virtual machines generally perform better on the host HDD or is it better to operate them from an external disk? My specific setup: A Macbook Pro with a nearly full internal SATA drive that spins at 7200. On this system I'm running large programs like Photoshop and some other RAM-intense applications. I've dedicated 2 of my 8 gigs of RAM to my VMware Fusion virtual machine, which runs Windows 7 and Visual Studio, sits on the same drive. When that thing boots up, my system really starts crawling. I have an external USB (specifics of that drive are here) which I'm thinking about moving the VM to. Obviously a USB drive is slower than my internal HDD, but maybe having two operating systems using the same disk is WORSE than putting one of them on a separate (albiet slower) disk. This a bad idea?

Read the article

Merging and re-formatting paragraphs in Microsoft Word 2007

- by thkala

After a copy/paste mishap in Microsoft Word 2007, I ended up with text looking like this: This line breaks up here continues here, and so on here, when it should all be in a single line without all the random whitespace. I confirmed that there are paragraph separators and extra whitespace between each line - probably due to hard-coded newlines in the original source. Is there a (preferrably easy) way to merge paragraphs in Microsoft Word? Is there a way to re-format a paragraph so that extraneous whitespace is removed? I can change the flush style, but the whitespace remains. I (obviously?) do not have any experience with Word, being more of a TeX person, but I have been searching Google and crawling the menus for a few hours and I have yet to find a solution...

Read the article

How much HDD space would I need to cache the web while respecting robot.txts?

- by Koning Baard XIV

I want to experiment with creating a web crawler. I'll start with indexing a few medium sized website like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robot.txts. I save all html, pdf, word, excel, powerpoint, keynote, etc... documents (not exes, dmgs etc, just documents) in a MySQL DB. Next to that, I'll have a second table containing all restults and descriptions, and a table with words and on what page to find those words (aka an index). How much HDD space do you think I need to save all the pages? Is it as low as 1 TB or is it about 10 TB, 20? Maybe 30? 1000? Thanks

Read the article

How to display recently installed programs and when they were installed?

- by salvationishere

I have a Windows XP and I just installed about 12 new programs. Big mistake! Before I installed these programs, my internet connection was running great. But now after installing and restarting my laptop, the internet is crawling. How can I see what was changed? Hint: prior to installing these 12 programs, I installed IE version 8. So probably if I removed that it would fix it, but the problem is I need IE in order for my SQL/C# web application to work properly.

Read the article

OWA no longer accessing 1 backend exchange server

- by Morchuboo

We have IIS hosting OWA that is the web frontend to 3 backend exchange servers. Yesterday we got a lot of event 9791 warnings: "Cleanup of the DeliveredTo table for database 'Second Storage Group\Mailbox Store EUROPE 2' was pre-empted because the database engine's version store was growing too large. 0 entries were purged. At this point the server was crawling. Our Mail admin is currently away and not contactable so we rebooted the server. Everything seems ok when reading mail from outlook and evolution-mapi clients but OWA and active-sync connections cannot access. When logging into OWA, users whos mailboxes are not on this backend server are fine but users on this server can log into the OWA frontend but once submitting their credentials the page returns a 503 service unavailable error. We have since rebstarted the affected exchange server and the IIS server as well as iisreset /noforce but problem persists. Can anyone suggest what we should look at...

Read the article

Private Git repo using Smart HTTP with LDAP authentification

- by ALOToverflow

I've been crawling the interwebz and getting my hands dirty for the last few days, but I can't seem to make it all work together. I managed to get a HTTP repo working with Ubuntu 10.04 over Smart HTTP (pull and push over HTTP) for a single repo. This means that I do the initial setup over SSH to the server (git init --bare) and after that the clients can pull and push to it (git clone http://servername/allgitrepos/repo.git). Unfortunately it's impossible to add a new repo without SSHing to the server and adding it manually) i.e. git push http://servername/allgitrepos/repo2.git (allgitrepos is available for everyone to read-write and execute) would fail talking about git update-server-info (which seems to be a general error message). So far the repository is anonymous, so I would like to authenticate using LDAP and also use the LDAP creds to make the git commit. So, how can I push new repos to the server and how can I use the LDAP creds to make the git commit. Thanks

Read the article

Building intranet search

- by gmkv

At work, we have lots of information squirreled away in many different sites -- wikis, product docs, ticketing system, etc -- many of which require authentication. I'm very interested in having a single way to search all our various silos, and in my spare time have looked at Nutch, Grub, Django + Haystack, etc. None of these is a complete solution a la Google Mini or Google Search Appliance. Has anybody built a basic intranet search engine out of a mixture of these tools? Would you have recommendations about how to go about it? I like Django, and Haystack seems to be a mildly popular search solution for it, but I'd need to wire up a crawler that can support crawling authenticated sites to it.

Read the article

De-index URL parameters by value

- by Doug Firr

Upon reading over this question is lengthy so allow me to provide a one sentence summary: I need to get Google to de-index URLs that have parameters with certain values appended I have a website example.com with language translations. There used to be many translations but I deleted them all so that only English (Default) and French options remain. When one selects a language option a parameter is aded to the URL. For example, the home page: https://example.com (default) https://example.com/main?l=fr_FR (French) I added a robots.txt to stop Google from crawling any of the language translations: # robots.txt generated at http://www.mcanerin.com User-agent: * Disallow: Disallow: /cgi-bin/ Disallow: /*?l= So any pages containing "?l=" should not be crawled. I checked in GWT using the robots testing tool. It works. But under html improvements the previously crawled language translation URLs remain indexed. The internet says to add a 404 to the header of the removed URLs so the Googles knows to de-index it. I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist. This URL was listed in GWT under duplicate title tags (One of the reasons I want to scrub up my URLS) https://example.com/reports/view/884?l=vi_VN&l=hy_AM This URL should not exist - I removed the language translations. The page loads when it should not! I played around. I typed example.com?whatever123 It seems that parameters always load as long as everything before the question mark is a real URL. So if Google has indexed all these URLS with parameters how do I remove them? I cannot check if a 404 is being generated because the page always loads because it's a parameter that needs to be de-indexed.

Read the article

Huge google impression drop after cleaning html

- by olgatorresfoundation

Good morning, I am the webmaster of a non-profit organization that donates grants to colorectal cancer research projects and funds various colorectal cancer information campaigns. We have three domains: www.fundacioolgatorres dot org (Catalan) www.fundacionolgatorres dot org (Spanish) www.olgatorresfoundation dot org (English) So what happened? I redesigned olgatorresfoundation on the 20th and the fundacionolgatorres on the 30th of May. In both cases, exactly two days later, the number of impressions on both dropped to a halt. Granted, we did not have the traffic of Microsoft, but a 90% decrease a disaster of incredible proportions for us. My only real changes were cleaning up the old ineffective HTML to a cleaner form (mostly moving away from redundant table construction to a table-less view). Here is a before and after snapshot of what the change looks like: Before: http://www.fundacioolgatorres.org/aparell_digestiu/introduccio/ (unchanged page in Catalan) After: http://www.olgatorresfoundation.org/digestive_system/introduction/ (changed page in English) Anybody has a clue to what just happened? Why should a normal, sane html improvement be punished and so dramatically? No URLs have been changed, neither have page names or descriptions. Possible secondary question: If it is so that Google sees it as a major overhaul and decides to drop the pagerank sharply, does it come back to pre-change levels if the content "checks out" or will the page start over from scratch earning those pagerank points (which would mean that we would have to wait 6 months for the pages to recover to the level they had two weeks ago)? (duplicated from productforums.google dot com/forum/#!category-topic/webmasters/crawling-indexing--ranking/YsnyX0JzOpY, hoping to reach a wider audience)

Read the article

Do or can robots cause considerable performance issues?

- by Anicho

So the question in the title is exactly what I am trying to find out. My case is: At work we are in a discussion with team members who seem to think bots will cause us problems relating to performance when running on our services website. Out setup: Lets say I have site www.mysite.co.uk this is a shop window to our online services which sit on www.mysiteonline.co.uk. When people search in google for mysite they see mysiteonline.co.uk as well as mysite.co.uk. Cases against stopping bots crawling: We don't store gb's of data publicly available on the web Most friendly bots, if they were to cause issues would have done so already In our instance the bots can't crawl the site because it requires username & password Stopping bots with robot .txt causes an issue with seo (ref.1) If it was a malicious bot, it would ignore robot.txt or meta tags anyway Ref 1. If we were to block mysiteonline.co.uk from having robots crawl this will affect seo rankings and make it inconvenient for users who actively search for mysite to find mysiteonline. Which we can prove is the case for a good portion of our users.

Read the article

Guide to the web development ecosystem

- by acjohnson55

I'm a long-time software developer, and I've been thrown in the deep, deep end of developing from the ground up what will hopefully be a highly scalable and interactive web application. I've been out of the web game for about 8 years, and even when I was last in it, I wasn't exactly on the cutting edge. I think I've made judicious design decisions and I'm quite happy with the progress I've been making so far, but new, hot web technologies keep crawling out of the woodwork and into my headspace, forcing me to continually revalidate my implementation decisions. Complicating things even further is the preponderance of out-of-date information and the difficulty of knowing what is out of date in the first place. What I'm wondering is, are there any comprehensive books or guides dedicated to compiling and comparing the technologies out there, end-to-end in the web application stack? I'm happy to learn new techs on demand, but I don't like learning about them after I've already spent time going in another direction. I'm looking for the sort of executive info a CTO might read to make sure the best architectural decisions are being made. And just to be clear, this is a question about resources, not about specific technology suggestions.

Read the article

ArchBeat Link-o-Rama for 2012-04-03

- by Bob Rhubart

Crawling a Content Folio | Kyle Hatlestad blogs.oracle.com Kyle Hatlestad shares detials on a component developed by Ed Bryant that simplifies the task of "consuming and publishing that folio on a Site Studio page or in your portal using RIDC." Northeast Ohio Oracle Users Group 2 Day Seminar - May 14-15 - Cleveland, OH www.neooug.org More than 20 sessions over 4 tracks, featuring 18 speakers, including Oracle ACE Director Cary Millsap, Oracle ACE Director Rich Niemiec, and Oracle ACE Stewart Brand. Register before April 15 and save. OTN Member discounts for April www.oracle.com Save up to 40% on titles from Oracle Press, Pearson, O'Reilly, Apress, and more. The Java EE 6 Example - Galleria - Part 1 | Markus Eisele blog.eisele.net Oracle ACE Director Markus Eisele heaps praise on Vineet Reynolds' Java EE 6 Galleria demo application, which demonstrates the use of JSF 2.0 and JPA 2.0 in a Java EE project using Domain Driven Design. Reminder: JavaOne Call For Papers Closing April 9th, 11:59pm | Arun Gupta blogs.oracle.com One week left to submit your JavaOne papers. Narrowing the gap between UI design and ADF development | Jack Ritzen www.nl.capgemini.com "Joining my first demo project I was confronted with two traditional contradictory worlds," says Jack Ritzen. "In the left corner; me, as a beginning GUI designer. And in the right, a heavyweight ADF developer. Let the game begin!" Thought for the Day "Operating systems are like underwear — nobody really wants to look at them." — Bill Joy

Read the article

ArchBeat Link-o-Rama Top 20 for April 1-9, 2012

- by Bob Rhubart

The top 20 most popular items shared via my social networks for the week of April 1 - 8, 2012. Webcast: Oracle Maximum Availability Architecture Best Practices w/Tom Kyte - April 12 Oracle Cloud Conference: dates and locations worldwide Bad Practice Use Case for LOV Performance Implementation in ADF BC | Oracle ACE Director Andresjus Baranovskis How to create a Global Rule that stores a document’s folder path in a custom metadata field | Nicolas Montoya MySQL Cluster 7.2 GA Released How to deal with transport level security policy with OSB | Jian Liang Webcast Series: Data Warehousing Best Practices http://bit.ly/I0yUx1 Interactive Webcast and Live Chat: Oracle Enterprise Manager Ops Center 12c Launch - April 12 Is This How the Execs React to Your Recommendations? | Rick Ramsey Unsolicited login with OAM 11g | Chris Johnson Event: OTN Developer Day: MySQL - New York - May 2 OTN Member discounts for April: Save up to 40% on titles from Oracle Press, Pearson, O'Reilly, Apress, and more Get Proactive with Fusion Middleware | Daniel Mortimer How to use the Human WorkFlow Web Services | Oracle ACE Edwin Biemond Northeast Ohio Oracle Users Group 2 Day Seminar - May 14-15 - Cleveland, OH IOUG Real World Performance Tour, w/Tom Kyte, Andrew Holdsworth, Graham Wood WebLogic Server Performance and Tuning: Part I - Tuning JVM | Gokhan Gungor Crawling a Content Folio | Kyle Hatlestad The Java EE 6 Example - Galleria - Part 1 | Oracle ACE Director Markus Eisele Reminder: JavaOne Call For Papers Closing April 9th, 11:59pm | Arun Gupta Thought for the Day "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

Read the article

.Net search engine architecture and technology choice

- by shrivb

I am in the process of designing a search engine for an asp.net site. The site currently uses Microsoft Indexing Server to index and search content which range from simple text files to MS documents to PDFs. MIS is also used to crawl File servers. MIS in tandem with Index Server Companion crawls for content from external sites. I intend to replace MIS with the indexer/crawler I am trying to build. Since my platform is completely on the Microsoft stack, I cant afford to have a Java application server. Thus, Solr, and effectively, SolrNet is ruled out. With this being the context, I have couple of questions. 1.Technology choice I had done my initial investigation and looked at Lucene.Net. There seemed to be 2 issues in using Lucene.Net. First being, it cant crawl external content. There doesn't seem to be a direct port of Nutch in .Net. Second, since it is just an indexer, it cant parse various document types. The parsing is left to the developer. So, what would be best technology choice on the .Net platform to achieve indexing & crawling? Are there any .Net open source libraries available for document parsing? 2.Architectural pattern Is there any general architectural pattern or best practice that needs to be followed in designing such a search engine? Thanks in advance.

Read the article

Turn-based JRPG battle system architecture resources

- by BenoitRen

The past months I've been busy programming a 2D JRPG (Japanese-style RPG) in C++ using the SDL library. The exploration mode is more or less done. Now I'm tackling the battle mode. I have been unable to find any resources about how a classic turn-based JRPG battle system is structured. All I find are discussions about damage formula. I've tried googling, searching gamedev.net's message board, and crawling through C++-related questions here on Stack Exchange. I've also tried reading source code of existing open source RPGs, but without a guide of some sort it's like trying to find a needle in a haystack. I'm not looking for a set of rules like D&D or anything similar. I'm talking purely about code and object structure design. A battle system asks the player for input using menus. Next the battle turn is executed as the heroes and the enemies execute their actions. Can anyone point me in the right direction? Thanks in advance.

Read the article

JavaScript loaded external content SEO

- by user005569871

I wonder what is the best way to have Javascript loaded content indexed by search engines. I know that search engines don't execute Javascript, but I am thinking more of an progressive enchantment. I am creating a responsive website, and on the home page I will have some sections about most visited products and recommended product that I plan to load depending on the device detected. These products will be in sliders with thumbnail images and names of the products. If mobile is detected slider content will not load, ant the link to the external page will be shown. I know that external content will be indexed via link to those resources. Where will the users be directed from search in this case? To the external page or home page? Will it be bad for SEO if I show only product names on front page so they can be indexed and hide them with CSS? What is the best way to index that content and possibly direct users from search to home page? Also, i've seen the Ajax crawling but iI would like not to use that if there is any better way.

Read the article

Ranking drop after using reverse proxy for blog subdirectory and robots.txt for old blog subdomain

- by user40387

We have a 3Dcart store and a WordPress blog hosted on a separate server. Originally, we had a CNAME set up to point the blog to http://blog.example.com/. However, in our attempt to boost link-based and traffic-based authority on the main site, we've opted to do a reverse proxy to http://www.example.com/blog/. It’s been about two months since we finished the reverse proxy migration. It appears that everything is technically working as intended, including some robots and sitemap changes; the new URLs are even generating some traffic, as indicated on Google Analytics. While Google has been indexing the new URL locations, they’re ranking very poorly, even for non-competitive, long-tail keywords. Meanwhile, the old subdomain URLs are still ranking mostly as well as they used to (even though they aren’t showing meta titles and descriptions due to being blocked by robots.txt). Our working theory is that Google has an old index of the subdomain URLs, and is considering the new URLs to be duplicate content, since it’s being told not to crawl the subdomain and therefore can’t see the rel canonicals we have in place. To resolve this, we’ve updated the subdomain’s robot.txt to no longer block crawling and indexing. Theoretically, seeing the canonical tag on the subdomain pages will resolve any perceived duplicate content issues. In the meantime, we were wondering if anyone would have any other ideas. We are very concerned that we’ll be losing valuable traffic, as we’re entering our on season at the moment.

Read the article

Sudden drop in Total Indexed pages and increase in 'Not Selected' number.

- by Pravin

My blog is around 1 year old and have PR2. The average daily pageviews upto last 1 week were 1800. The total number of posts are 180. Though I have only 180 total posts, the total number of Indexed URL was increasing and it was as high as 510. But in the month of Sept2012, the total number of Indexed pages dropped from 510 to 214. The drop was sudden and it is now increasing very slowly. Also, the other main concern is huge increase in 'Not Selected' number. It is currently 814. I have never posted any post again and never copied any idea from any other blog. But I do use internal linking to some older post those are related to the new posts. The questions are:; Why there is sudden drop in the 'Total Indexed' pages. Why there was increase in total indexed pages to 500 even though the total posts were only 180. As the drop in 'Total Indexed' was in the month of sept2012, I was getting same organic traffic and it was steadily increasing till last week and then there was a 50 drop in the total pageviews. Why. Now, again the traffic is becoming to normal but still there is a problem. Is increase in the 'Not selected' number is a cause of drop in 'Total Indexed'? How to prevent or reduce the number of 'Not Selected' even though I do not have any duplicate post withing blog. Is the 'internal linking' to older post creating 'Not selected' problem? Should I edit my 'Robot.txt' to avoid crawling of labes that may be creating duplicate posts or something like that, if so, what is correct robot.txt. I have uploaded the screenshot of the graph of Webmaster Tools. Please take a look and give suggestions. Please help. Thank you in advance.

Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 7/10 | < Previous Page | 3 4 5 6 7 8 9 10 | Next Page >

- by CrazyNick

- by CrazyNick

- by Satish

- by Andrew Grimm

- by Zak

- by goldenmean

- by aireq

- by Jeremy Ricketts

- by thkala

- by Koning Baard XIV

- by salvationishere

- by Morchuboo

- by ALOToverflow

- by gmkv

- by Doug Firr

- by olgatorresfoundation

- by Anicho

- by acjohnson55

- by Bob Rhubart

- by Bob Rhubart

- by shrivb

- by BenoitRen

- by user005569871

- by user40387

- by Pravin

< Previous Page | 3 4 5 6 7 8 9 10 | Next Page >