Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 9/10 | < Previous Page | 5 6 7 8 9 10 | Next Page >

Anonymous users support vs Google bot

- by Andy

I have a User class in my web app that represents a user currently logged in. Every time a user vists a page, a User instance is populated based on authentication data supplied in cookies. A User instance is created even if an anonymous user logs in - and a corresponding new record is created in the User table in the database. This approach allows me to save some state info for the current user regardless of its type. The problem however with this approach is the Google bot, and other non-human web organisms crawling my pages. Every time a bot starts to walk around the site, thousands of useless records will be created in the database, each of them only to be used for a single page. Question: what is the best trade off? How to support anonymous users, save their state, and don't get too much overhead because of cookieless bots?

Read the article
Create an seo and web accessibility analyzer

- by rebellion

I'm thinking of making a little web tool for analyzing the search engine optimization and web accessiblity of a whole website. First of all, this is just a private tool for now. Crawling a whole website takes up alot of resources and time. I've found out that wget is the best option for downloading the markup for a whole site. I plan on using PHP/MySQL (maybe even CodeIgniter), but I'm not quite sure if that's the right way to do it. There's always someone who recommends Python, Ruby or Perl. I only know PHP and a little bit Rails. I've also found a great HTML DOM parser class in PHP on SourceForge. But, the thing is, I need some feedback on what I should and should not do. Everything from how I should make the crawl process to what I should be checking for in regards to SEO and WCAG. So, what comes to your mind when you hear this?

Read the article
Symfony: Pre filtering submitted values before/after validation

- by Rob

Hi All I've been scouring the net and i have found nothing! I am using symfonys form framework to build a simple 'Create' form. Validation is fine. However i'd like to pre-filter my submitted values, so adding ucfirst, strtoupper, and the like. I'm not sure if im missing something crucial here, but the way i see it is the only way to do this would be to create my own custom validators and utilizing the doClean method, which seems daft since i'd have hundreds of validators for each php function! Hope you guys can help, i've been crawling through source code, api's, numerous books and blogs and i haven't found a thing :( Either it's impossible, or it's really easy, i hope its the latter!

Read the article
Hiding part of a page from Search Server 2010 Express

- by Jonathan

I'm working on a soon-to-be-public-facing site, and we want to have our search live on day 1, and want it to be searchable but non-public during testing, so we're planning to use something whose crawling we can control -- Search Server 2010 Express. However, if I search for something in my top navigation bar, I get nearly every page as a hit. It kinda makes sense, as every page has that content, but it's completely irrelevant on most pages. I want it to crawl through my navigation, but ignore the text within the navigation for search results. I was hoping that it'd just figure that out on it's own (the HTML for the top nav is static), but it's apparently not. Is there some standard thing I can put in my HTML that will achieve the effect I'm going for? On a side note: when I go live, will I have the same problem with public search engines, or do they tend to be smarter?

Read the article
IE Hanging on jQuery code

- by OrangeRind

Here's another clichéd problem, but I couldn't find an exact match to this. I haven't posted any source here, as you can freely see all that is there on the link. :-) Statement:I have a web page at http://agrimgupta.com/antaragni/ Disclaimer: Pardon me for the pathetic coding on that page. ;-) It was done on a very short interval. Improvements will be done at a later stage. Observation: This page is functioning normally on my localhost on all browsers. Problem: IE 8 is crawling (nearly hanging) while loading this page from the website. Although it is working fine on localhost. When on the website, It fails to render the mouseover effects, doing them in almost what seems like a minute. Question: How to resolve this stuck up of IE? It is necessary to resolve this. Thanks in Advance

Read the article
The Sitemap Paradox

- by Jeff Atwood

We use a sitemap on Stack Overflow, but I have mixed feelings about it. Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site. Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap: Sitemaps are intended for sites that are hard to crawl properly. If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap it gives the sitemap link no weight and will not index it! That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you! Google goes out of their way to make no sitemap guarantees: "We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index" citation "We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap." citation "submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results" citation Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees. By the time you have done that, and are getting spidered nice and thoroughly so Google can see that your own site links to these pages, and would be willing to crawl the links -- uh, why do we need a sitemap, again? The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience. That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions. Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?

Read the article
Site Search Engine for 1,000 page website

- by Ian

I manage a website with about 1,000 articles that need to be searchable by my members. The site search engines I've tried all had their own problems: Fluid Dynamics Search Engine Since it's written in perl, it was a bit hacky to integrate with my PHP-based CMS. I basically had to file_get_contents the search results page. However, FDSE had the best search results. Google CSE Ugh, the search results SUCK. It can't find documents even using unique strings. I'm so surprised that a Google search product is this bad. Nor can I get any answers on their 'help' forums, and I am a paying user. Boo, Google. Boo. Sphider Again, bad search results. Unable to locate some phrases used in link text. Better results than Google CSE though. Shame on Google that a free PHP script has better search results than their paid application. IndexTank This one looked really promising. I got all set up with their PHP API client. But it would only randomly add articles that I submitted. Out of 700+ articles I pushed to the index through their API, only 8 made it in. Unable to find any help on this subject. Update for IndexTank -- Got the above issue fixed, so this looks most promising so far. The site itself runs on php/mysql and FreeBSD, though this shouldn't matter for a web crawling indexer. I've looked at Lucene, but I don't know anything about Java or installing Java programs on my web server. I also do not have root access on my web server, if this would be required for installation. I really don't need a lot of fancy features. It just needs to be able to crawl my web site and return great (even decent!) search results. I don't need any crazy search operators. It doesn't need to index off my primary domain. It just needs to work! Thanks, Hive Mind!

Read the article
Live CD / Live USB much faster than full install

- by user29347

I've observed it on both laptops I own! HP Compaq nx6125 and Ubuntu 11.04 x64 - somewhat solved Lenovo Thinkpad T500 and Ubuntu 11.10 x64 - help needed! I'm still struggling with the Thinkpad to get performance level similar to that of 10 y.o. laptops... All in all a really serious issue with multiple versions of Ubuntu that renders computers with perfectly compatible hardware unusable, as far as out of the box experience is concerned. Troubleshooting resultant issues seems to be a hard case even for users with some experience with installing graphics drivers. EDIT: I can't really post additional details. Two different ubuntu versions, two laptops, two different set of graph. drivers (OS vs ATI prop.) - all with the same symptoms. Also I can't stress enough how massive the performance degradation is compared to a healthy system. For that reason I ask for input from people who may know roughly what are we dealing with here. I can post more details if we were to focus on my current Thinkpad T500. In that case my current system details: Lenovo Thinkpad T500 Ubuntu 11.10 x64 ATI Mobility Radeon HD 3650 (also see the "What I have already tried" section about Intel graphics tested) ATI Catalyst 11.10 drivers OCZ Agility 3 SSD but! same with the default driver for ATI the card same with the prop. driver for the ATI card from Jockey (Additional drivers applet) What I have already tried: 0. Switching to Intel integrated card (Intel GMA 4500M HD) with the default driver - same effects = may indicate not driver related problem but a problem with something of global influence like e.g. nomodeset or other I don't even know about. (What you can read above) ATI Catalyst 11.10 and radeon.modeset=0 boot parameter + disabled Wait for VBlank. Unity 2D Ubuntu 10.04 LTS tested (ubuntu-10.04.3-desktop-i386.iso): Both live USB and installed version blazing fast! (on the default drivers - without even installing the proprietary fglrx drivers). re2 a) seems to give me the only significant results (still poor) - perfect Unity elements performance with the same crawling stuttering/lagging when dragging windows around. re2 b) this happens often http://i17.photobucket.com/albums/b68/Bucic/ubuntuforumsorg/Screenshotat2011-10-28083140.png re2 c) Sometimes I am able to witness a normal performance when dragging a window around but only for a second or two. When I try to shake it longer it starts to lag and it will keep lagging like that with an increased probability of what you see in the sshot in point re2 b). re2 d) I can't establish the radeon.modeset=0 influence though. Once it seems to work be smooth with it, the other time - without it. Really can't tell.

Read the article
How do multiple displays work on a AMD 785G / ATI HD 4200 motherboard?

- by aireq

I just ordered a ASUS M4A785TD-V EVO which has the AMD 785G chipset and HD4200 integrated graphic. The board has VGA, DVI, and HDMI outputs. I'm wondering how many outputs I can run at once, and from what connectors? My guess is that I can only use the VGA, and either the DVI or the HDMI in a dual setup. But not the HDMI and the DVI at the same time. Is this correct? If I have devices plugged into both the HDMI and the DVI ports is there a way to choose between which port I want to use? I have a dual 19" monitor setup, as well as a LCD TV. I'd like to run the VGA and the DVI into my two monitors, and then the HDMI to my TV. Then when I want to watch something on the TV I'd like to be able to switch over from the DVI to the HDMI. Is this possible with out crawling under my desk and unplugging/plugging things in? Update I found the following in the manual off ASUS's website, which confirms my original suspicion that HDMI and DVI can't be used at the same time. But I'd still like to know if it's possible through software to switch between using the HDMI and DVI.

Read the article
Googlebot repeatedly looks for files that aren't on my server

- by John at CashCommons

I'm hosting a site for a volunteer organization. I've moved the site to WordPress, but it wasn't always that way. I suspect at one point it was hacked badly. My Apache error log file has grown to 122 kB in just the past 18 hours. The large majority of the errors logged are of this form -- it's repeated hundreds of times today alone in my log files: [Mon Nov 12 18:29:27 2012] [error] [client xx.xxx.xx.xxx] File does not exist: /home/*******/public_html/*******.org/calendar.php [Mon Nov 12 18:29:27 2012] [error] [client xx.xxx.xx.xxx] File does not exist: /home/*******/public_html/*******.org/404.shtml (I verified that xx.xxx.xx.xxx was a Google server.) I suspect there was a security hole somewhere before, likely in calendar.php, that was exploited. The files don't exist anymore, but there may be many backlinks that exist that reference here, hence why googlebot is so interested in crawling them. How do I fix this gracefully? I still would like Google to index the site. I just want to tell it somehow not to look for these files anymore.

Read the article
How can I tell GoogleBot that a subdirectory is now a subdomain? [migrated]

- by cwd

I had about a million pages of a catalog indexed under a subdirectory, and now that's moved to a subdomain. GoogleBot is crawling each one of them and getting a 301 redirect to the new location. Even though I have set up the redirect rule in the apache sites-enabled configuration file, (i.e. it's early on when apache does the redirect - PHP is not even getting loaded), even though I have done that, the server isn't handling the load well. GoogleBot is making around 5 requests per second, and on top of my normal traffic that is hiking up the CPU for a few hours at a time. I checked in Webmaster Tools and the corresponding documentation for a way to let Google know that the content had been moved from a subdirectory to a subdomain, but with little luck. Basically the most helpful thing I saw said to just send 301 headers for the new location. How can I tell GoogleBot that a subdirectory is now a subdomain? If that is not an option, how can I more efficiently send 301 redirects out for a particular subdomain? I was thinking perhaps the Nginx server but I'm not sure that I can run both Apache and Nginx side by side on port 80 for different subdomains.

Read the article
sharepoint crawl not indexing main site

- by user22215

Guys I'm having some strange search issues' going on with my main portal application. First off let me give you a little back ground on the problem web app. Our Sharepoint environment was originally set up by a consultant that did not follow best practices. She used one web app to house our companies' intranet site, ssp, and mysites. Since than I have provisioned a new ssp that I have segmented correctly I moved all of our other sites over to the new ssp with out any problems . However, I could not assign the main portal app to the new ssp since the portal app housed the ssp site collection. So I deleted the ssp site collection after that I deleted the ssp and assigned the portal app to my new ssp. Now this is where the problem starts when I attempt to crawl this application the crawl starts than stops 5 seconds later with a status of success also it reports that 1 item was successfully crawled. The funny thing is the main portal app has nearly 30000 items. I have tracked the problem down to the web app if I create a test web app than restore the content I have no problem crawling all 30000 items. Also all of my other web apps that use the same ssp have no problem completing crawls. I don't see anything in the ULS logs or server 2003's event viewer. Also I'm using a separate dedicated index server that's configured to crawl itself via host file configuration. I would like to fix this problem with out having to recreate our main portal site due to the fact that we have several custom code modifications where DLL's were registered to the IIS bin folder also I don't even want to get into the Silverlight mods that were done. Any help with this problem is much appreciated Same problem as minehttp://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/MS-SharePoint/Q_23885820.html

Read the article
Removing Paths/ Landing Pages From SharePoint Search Results

- by j.strugnell

Hi there, We've been asked by a client to remove a number of pages from being shown up in their public website search results page. I've been into the SSP and created Crawl Rules to remove these pages. All seemed to have worked ok but we have an issue in that landing pages are still showing up in their "www.domain.com/sitearea/" form but not in their "www.domain.com/sitearea/pages/default.aspx". For each of this type of page we have created one rule to "Exclude" the "aspx" path and another rule to include the "/" path but to "Follow links on the URL without crawling the URL itself". We tried adding rules to exclude the "/" format but that only resulted in all results underneath that being excluded. Does anybody know how to remove the "area/pages/default.aspx" and the "area/" pats from Search Results? I'm not sure if it's the "done thing" to ask 2 questions in one but this is in a similar vein so it should be ok. I was wondering if anyone knew of a tool (or if it is possible) to allow site admins to exclude pages from search results (not via SSP/Crawl Rules). I know they can do it at the site level but I was wondering if anything out there enabled this to be done at the page level through either Page or Site Settings?

Read the article
System Issues and Major Malfuctions after Failed hibernation Exit

- by Sarah Seguin

I have a HP G71-340US that went into hibernation mode for a while and when I tried coming out of it, I got an error message: You're computer cannot come out if hibernation . Status: 0xc000009a Info: A fatal error occurred processing the restoration data. File: \hiberfil.sys Any information that was not saved before the computer went into hybernation will be lost enter=continue So I hit continue and it ran soooo super slow it. It was seriously crawling. Finally I gave up and turned it off manually (IE press and hold the button). It's been a week or two since then and EVERY SINGLE TIME I have tried to to do ANYTHING it takes forever. When I say forever, I literally mean takes 5-7 minutes to load the internet, then the page itself, then to click a link, so on so forth. Eventually everything just goes not responding and I have to give up (4-6 HOURS later). I also cannot access my thumb/jump drives once I've managed to load windows. I was going to try runing malware bytes incase of a virus, but it's windows explorer developes errors and goes not responding on me. Currently I'm running scan disk or check disk and like every file is coming back unreadable. I let it run the last 2 hours straight in chkdesk and I'm only at 6 percent with around 500+ errors and still going. Yes, I've taken logs of the errors via cell phone camera and patience. A week or two prior to this happening I had to change our the hard drive due to blunt force trama next to the mouse. OH! Running on Windows 7: ) And I've tried loading the computer in safe mode and it makes absolutely no difference. Any and all help would be appreciated. I really don't know what to do from here and I'm kind of freaking out. I've googled different part of the error and things that I've done/seen and there are so many different answers/topics that I thought it best to just post the questions.

Read the article
How to create managed properties at site collection level in SharePoint2013

- by ybbest

In SharePoint2013, you can create managed properties at site collection. Today, I’d like to show you how to do so through PowerShell. 1. Define your managed properties and crawled properties and managed property Type in an external csv file. PowerShell script will read this file and create the managed and the mapping. 2. As you can see I also defined variant Type, this is because you need the variant type to create the crawled property. In order to have the crawled properties, you need to do a full crawl and also make sure you have data populated for your custom column. However, if you do not want to a full crawl to create those crawled properties, you can create them yourself by using the PowerShell; however you need to make sure the crawled properties you created have the same name if created by a full crawl. Managed properties type: Text = 1 Integer = 2 Decimal = 3 DateTime = 4 YesNo = 5 Binary = 6 Variant Type: Text = 31 Integer = 20 Decimal = 5 DateTime = 64 YesNo = 11 3. You can use the following script to create your managed properties at site collection level, the differences for creating managed property at site collection level is to pass in the site collection id. param( [string] $siteUrl="http://SP2013/", [string] $searchAppName = "Search Service Application", $ManagedPropertiesList=(IMPORT-CSV ".\ManagedProperties.csv") ) Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue $searchapp = $null function AppendLog { param ([string] $msg, [string] $msgColor) $currentDateTime = Get-Date $msg = $msg + " --- " + $currentDateTime if (!($logOnly -eq $True)) { # write to console Write-Host -f $msgColor $msg } # write to log file Add-Content $logFilePath $msg } $scriptPath = Split-Path $myInvocation.MyCommand.Path $logFilePath = $scriptPath + "\CreateManagedProperties_Log.txt" function CreateRefiner {param ([string] $crawledName, [string] $managedPropertyName, [Int32] $variantType, [Int32] $managedPropertyType,[System.GUID] $siteID) $cat = Get-SPEnterpriseSearchMetadataCategory –Identity SharePoint -SearchApplication $searchapp $crawledproperty = Get-SPEnterpriseSearchMetadataCrawledProperty -Name $crawledName -SearchApplication $searchapp -SiteCollection $siteID if($crawledproperty -eq $null) { Write-Host AppendLog "Creating Crawled Property for $managedPropertyName" Yellow $crawledproperty = New-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $searchapp -VariantType $variantType -SiteCollection $siteID -Category $cat -PropSet "00130329-0000-0130-c000-000000131346" -Name $crawledName -IsNameEnum $false } $managedproperty = Get-SPEnterpriseSearchMetadataManagedProperty -Identity $managedPropertyName -SearchApplication $searchapp -SiteCollection $siteID -ErrorAction SilentlyContinue if($managedproperty -eq $null) { Write-Host AppendLog "Creating Managed Property for $managedPropertyName" Yellow $managedproperty = New-SPEnterpriseSearchMetadataManagedProperty -Name $managedPropertyName -Type $managedPropertyType -SiteCollection $siteID -SearchApplication $searchapp -Queryable:$true -Retrievable:$true -FullTextQueriable:$true -RemoveDuplicates:$false -RespectPriority:$true -IncludeInMd5:$true } $mappedProperty = $crawledproperty.GetMappedManagedProperties() | ?{$_.Name -eq $managedProperty.Name } if($mappedProperty -eq $null) { Write-Host AppendLog "Creating Crawled -> Managed Property mapping for $managedPropertyName" Yellow New-SPEnterpriseSearchMetadataMapping -CrawledProperty $crawledproperty -ManagedProperty $managedproperty -SearchApplication $searchapp -SiteCollection $siteID } $mappedProperty = $crawledproperty.GetMappedManagedProperties() | ?{$_.Name -eq $managedProperty.Name } #Get-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $managedproperty } $searchapp = Get-SPEnterpriseSearchServiceApplication $searchAppName $site= Get-SPSite $siteUrl $siteId=$site.id Write-Host "Start creating Managed properties" $i = 1 FOREACH ($property in $ManagedPropertiesList) { $propertyName=$property.managedPropertyName $crawledName=$property.crawledName $managedPropertyType=$property.managedPropertyType $variantType=$property.variantType Write-Host $managedPropertyType Write-Host "Processing managed property $propertyName $($i)..." $i++ CreateRefiner $crawledName $propertyName $variantType $managedPropertyType $siteId Write-Host "Managed property created " $propertyName } Key Concepts Crawled Properties: Crawled properties are discovered by the search index service component when crawling content. Managed Properties: Properties that are part of the Search user experience, which means they are available for search results, advanced search, and so on, are managed properties. Mapping Crawled Properties to Managed Properties: To make a crawled property available for the Search experience—to make it available for Search queries and display it in Advanced Search and search results—you must map it to a managed property. References Administer search in SharePoint 2013 Preview Managing Metadata New-SPEnterpriseSearchMetadataCrawledProperty New-SPEnterpriseSearchMetadataManagedProperty Remove-SPEnterpriseSearchMetadataManagedProperty Overview of crawled and managed properties in SharePoint 2013 Preview Remove-SPEnterpriseSearchMetadataManagedProperty SharePoint 2013 – Search Service Application

Read the article
How to create managed properties at site collection level in SharePoint2013

- by ybbest

In SharePoint2013, you can create managed properties at site collection. Today, I’d like to show you how to do so through PowerShell. 1. Define your managed properties and crawled properties and managed property Type in an external csv file. PowerShell script will read this file and create the managed and the mapping. 2. As you can see I also defined variant Type, this is because you need the variant type to create the crawled property. In order to have the crawled properties, you need to do a full crawl and also make sure you have data populated for your custom column. However, if you do not want to a full crawl to create those crawled properties, you can create them yourself by using the PowerShell; however you need to make sure the crawled properties you created have the same name if created by a full crawl. Managed properties type: Text = 1 Integer = 2 Decimal = 3 DateTime = 4 YesNo = 5 Binary = 6 Variant Type: Text = 31 Integer = 20 Decimal = 5 DateTime = 64 YesNo = 11 3. You can use the following script to create your managed properties at site collection level, the differences for creating managed property at site collection level is to pass in the site collection id. param( [string] $siteUrl="http://SP2013/", [string] $searchAppName = "Search Service Application", $ManagedPropertiesList=(IMPORT-CSV ".\ManagedProperties.csv") ) Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue $searchapp = $null function AppendLog { param ([string] $msg, [string] $msgColor) $currentDateTime = Get-Date $msg = $msg + " --- " + $currentDateTime if (!($logOnly -eq $True)) { # write to console Write-Host -f $msgColor $msg } # write to log file Add-Content $logFilePath $msg } $scriptPath = Split-Path $myInvocation.MyCommand.Path $logFilePath = $scriptPath + "\CreateManagedProperties_Log.txt" function CreateRefiner {param ([string] $crawledName, [string] $managedPropertyName, [Int32] $variantType, [Int32] $managedPropertyType,[System.GUID] $siteID) $cat = Get-SPEnterpriseSearchMetadataCategory –Identity SharePoint -SearchApplication $searchapp $crawledproperty = Get-SPEnterpriseSearchMetadataCrawledProperty -Name $crawledName -SearchApplication $searchapp -SiteCollection $siteID if($crawledproperty -eq $null) { Write-Host AppendLog "Creating Crawled Property for $managedPropertyName" Yellow $crawledproperty = New-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $searchapp -VariantType $variantType -SiteCollection $siteID -Category $cat -PropSet "00130329-0000-0130-c000-000000131346" -Name $crawledName -IsNameEnum $false } $managedproperty = Get-SPEnterpriseSearchMetadataManagedProperty -Identity $managedPropertyName -SearchApplication $searchapp -SiteCollection $siteID -ErrorAction SilentlyContinue if($managedproperty -eq $null) { Write-Host AppendLog "Creating Managed Property for $managedPropertyName" Yellow $managedproperty = New-SPEnterpriseSearchMetadataManagedProperty -Name $managedPropertyName -Type $managedPropertyType -SiteCollection $siteID -SearchApplication $searchapp -Queryable:$true -Retrievable:$true -FullTextQueriable:$true -RemoveDuplicates:$false -RespectPriority:$true -IncludeInMd5:$true } $mappedProperty = $crawledproperty.GetMappedManagedProperties() | ?{$_.Name -eq $managedProperty.Name } if($mappedProperty -eq $null) { Write-Host AppendLog "Creating Crawled -> Managed Property mapping for $managedPropertyName" Yellow New-SPEnterpriseSearchMetadataMapping -CrawledProperty $crawledproperty -ManagedProperty $managedproperty -SearchApplication $searchapp -SiteCollection $siteID } $mappedProperty = $crawledproperty.GetMappedManagedProperties() | ?{$_.Name -eq $managedProperty.Name } #Get-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $managedproperty } $searchapp = Get-SPEnterpriseSearchServiceApplication $searchAppName $site= Get-SPSite $siteUrl $siteId=$site.id Write-Host "Start creating Managed properties" $i = 1 FOREACH ($property in $ManagedPropertiesList) { $propertyName=$property.managedPropertyName $crawledName=$property.crawledName $managedPropertyType=$property.managedPropertyType $variantType=$property.variantType Write-Host $managedPropertyType Write-Host "Processing managed property $propertyName $($i)..." $i++ CreateRefiner $crawledName $propertyName $variantType $managedPropertyType $siteId Write-Host "Managed property created " $propertyName } Key Concepts Crawled Properties: Crawled properties are discovered by the search index service component when crawling content. Managed Properties: Properties that are part of the Search user experience, which means they are available for search results, advanced search, and so on, are managed properties. Mapping Crawled Properties to Managed Properties: To make a crawled property available for the Search experience—to make it available for Search queries and display it in Advanced Search and search results—you must map it to a managed property. References Administer search in SharePoint 2013 Preview Managing Metadata

Read the article
Tips & Tricks: How to crawl a SSL enabled Oracle E-Business Suite

- by Rajesh Ghosh

Oracle E-Business Suite can be integrated with Oracle Secure Enterprise Search for a superior end user experience and enhanced data retrieval capabilities. Before end-users can perform search operations, data has to be crawled and indexed into Oracle SES server. However if the Oracle E-Business Suite instance is on SSL, some additional configurations are needed in Oracle SES server as well as in Oracle Search Modeler, before a search object can be deployed and crawled. The process involves the following steps: Step 1: Export the SSL certificate of Oracle E-Business Suite Access the Oracle E-business Suite instance from a web browser. You should be able to locate a security or certificate icon somewhere in the browser toolbar or status bar, depending on which browser you are using. Click on it and you should be able to view the certificate as well as export it to a local file. While exporting make sure that you use “DER encoded” format. Step 2: Import the SSL certificate into Oracle Secure Enterprise server’s java key-store Oracle SES (10.1.8.4) by default ships a JDK under $ORACLE_HOME. The Oracle SES mid-tier uses this jdk to start the oc4j container services. In this step the Oracle E-Business Suite’s SSL certificate which has been exported in step #1, has to be imported into the Oracle SES server’s java key store. Perform the following: Copy the certificate file onto the server where Oracle SES server is running; under $ORACLE_HOME/jdk/jre/lib/security/cacerts. “ORACLE_HOME” points to the Oracle SES oracle home. Set the JAVA_HOME environment variable to $ORACLE_HOME/jdk. Append $JAVA_HOME/bin to the PATH environment variable Issue the command : “keytool -import -keystore keystore.jks -trustcacerts -alias myOHS –file ebs.crt” . Please substitute “ebs.crt” with the name of the certificate file you copied in step #2.1. The default key-store password “changeit”. Enter the same when prompted. If successful this process will end with a message saying “certificate successfully imported”. Step 3: Import the SSL certificate into Search Modeler java key-store Unlike Oracle SES, Search Modeler is not shipped with a bundled JDK. If you are using standalone OC4J, then you actually use an external JDK to start the oc4j container services. If you are using IAS instance then the JDK comes bundled with the IAS installation. Perform the following: Copy the certificate file onto the server where Search Modeler application is running; under $JDK_HOME/jre/lib/security/cacerts. “JDK_HOME” points to the JDK directory depending on whether you are using external JDK or a bundled one. Set the JAVA_HOME environment variable to JDK directory. Append $JAVA_HOME/bin to the PATH environment variable Issue the command : “keytool -import -keystore keystore.jks -trustcacerts -alias myOHS –file ebs.crt” . Please substitute “ebs.crt” with the name of the certificate file you copied in step #3.1. The default key-store password “changeit”. Enter the same when prompted. If successful this process will end with a message saying “certificate successfully imported”. Once you have completed the above steps successfully, you can deploy the search objects using Search Modeler and then start crawling them as well.

Read the article
Scrapy issue with iTunes' AppStore

- by Eric

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8 In the following code I have used the simplest regex which targets all apps in the US store. from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.contrib.spiders import CrawlSpider, Rule class AppStoreSpider(CrawlSpider): domain_name = 'itunes.apple.com' start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8'] rules = ( Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'), 'parse_app', follow=True, ), ) def parse_app(self, response): .... SPIDER = AppStoreSpider() When I run it I receive the following: [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None) [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8> As you can see, when it starts crawling the first page it says: "Filtered offsite request to 'itunes.apple.com'". and then the spider stops.. it also returns this message: [ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug! Traceback (most recent call last): File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies parse_ns_headers(ns_hdrs), request) File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set cookie = self._cookie_from_cookie_tuple(tup, request) File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple if version is not None: version = int(version) ValueError: invalid literal for int() with base 10: '"1"' I have used the same script for other website and I didn't have this problem. Any suggestion?

Read the article
TypeError: coercing to Unicode: need string or buffer, User found

- by Clemens

hi, i have to crawl last.fm for users (university exercise). I'm new to python and get following error: Traceback (most recent call last): File "crawler.py", line 23, in <module> for f in user_.get_friends(limit='200'): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 2717, in get_friends for node in _collect_nodes(limit, self, "user.getFriends", False): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 3409, in _collect_nodes doc = sender._request(method_name, cacheable, params) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 969, in _request return _Request(self.network, method_name, params).execute(cacheable) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 721, in __init__ self.sign_it() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 727, in sign_it self.params['api_sig'] = self._get_signature() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 740, in _get_signature string += self.params[name] TypeError: coercing to Unicode: need string or buffer, User found i use the pylast lib for crawling. what i want to do: i want to get a users friends and the friends of the users friends. the error occurs, when i have a for loop in another for loop. here's the code: network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET, username = username, password_hash = password_hash) user = network.get_user("vidarnelson") friends = user.get_friends(limit='200') i = 1 for friend in friends: user_ = network.get_user(friend) print '#%d %s' % (i, friend) i = i + 1 for f in user_.get_friends(limit='200'): print f any advice? thanks in advance. regards!

Read the article
crawl websites out of java web application without using bin/nutch

- by Marcel

hi :) i am trying to using nutch (1.1) without bin/nutch from my (java) mojarra 2.0.2 webapp... i am searching at google for examples, but there are no examples how i can realize this :/ ... i get an exception and the job fails :/ (i think of cause something with hadoop)... here is my code: public void run() throws Exception { final String[] args = new String[] { String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS), "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL), "-threads", this.preferences.get("threads"), "-depth", this.preferences.get("depth"), "-topN", this.preferences.get("topN"), "-solr", this.preferences.get("solr") }; Crawl.main(args); } and a part of the logging: 10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1 10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001 10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1 10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1 10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100 java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.crawl.Injector.inject(Injector.java:211) at org.apache.nutch.crawl.Crawl.main(Crawl.java:124) at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108) at lan.localhost.main.Index.indexing(Index.java:71) at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25) .... can someone help me or tell me how i can crawling from a java application? i have increased the Xms to 256m and Xmx to 768m, but nothing changed... best regards marcel

Read the article
Asp.net Crawler Webresponse Operation Timed out.

- by Leon

Hi I have built a simple threadpool based web crawler within my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and their meta content. Here's the problem. When I run the crawler from a debug server instance of Visual Studio Express, and provide the starting instance as the IIS url, it works fine. However, when I do not provide the IIS instance and it takes its own url to start the crawl process(ie. crawling its own domain space), I get hit by operation timed out exception on the Webresponse statement. Could someone please guide me into what I should or should not be doing here? Here is my code for fetching the page. It is executed in the multithreaded environment. private static string GetWebText(string url) { string htmlText = ""; HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url); request.UserAgent = "My Crawler"; using (WebResponse response = request.GetResponse()) { using (Stream stream = response.GetResponseStream()) { using (StreamReader reader = new StreamReader(stream)) { htmlText = reader.ReadToEnd(); } } } return htmlText; } And the following is my stacktrace: at System.Net.HttpWebRequest.GetResponse() at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366 at CSharpCrawler.Crawler.CrawlPage(String url, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105 at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89 at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108 at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state) at System.Threading.ExecutionContext.runTryCode(Object userData) at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData) at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx) at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() at System.Threading.ThreadPoolWorkQueue.Dispatch() at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() Thanks and cheers, Leon.

Read the article
Entity Framework - Foreign key constraints not added for inherited entity

- by Tri Q

Hello, It appears to me that a strange phenomenon is occurring with inherited entities (TPT) in EF4. I have three entities. 1. Asset 2. Property 3. Activity Property is a derived-type of Asset. Property has many activities (many-to-many) When modeling this in my EDMX, everything seems fine until I try to insert a new Property into the database. If the property does not contain any Activity, it works, but all hell breaks loose when I add some new activities to the new Property. As it turns out after 2 days of crawling the web and fiddling around, I noticed that in the EF store (SSDL) some of the constraints between entities were not picked up during the update process. Property_Activity table which links properties and activities show only one constraint FK_Property_Activity_Activity but FK_Property_Activity_Property was missing. I knew this is an Entity Framework anomoly because when I switched the relationship in the database to: Asset <-- Asset_Activity <-- Activity After an update, all foreign key constraints are picked up and the save is successful, with or without activities in the new property. Is this intended or a bug in EF? How do I get around this problem? Should I abandon inheritance altogether?

Read the article
Erlang OTP application design

- by Toby Hede

I am struggling a little coming to grips with the OTP development model as I convert some code into an OTP app. I am essentially making a web crawler and I just don't quite know where to put the code that does the actual work. I have a supervisor which starts my worker: -behaviour(supervisor). -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}). init(_Args) -> Children = [ ?CHILD(crawler, worker) ], RestartStrategy = {one_for_one, 0, 1}, {ok, {RestartStrategy, Children}}. In this design, the Crawler Worker is then responsible for doing the actual work: -behaviour(gen_server). start_link() -> gen_server:start_link(?MODULE, [], []). init([]) -> inets:start(), httpc:set_options([{verbose_mode,true}]), % gen_server:cast(?MODULE, crawl), % ok = do_crawl(), {ok, #state{}}. do_crawl() -> % crawl! ok. handle_cast(crawl}, State) -> ok = do_crawl(), {noreply, State}; do_crawl spawns a fairly large number of processes and requests that handle the work of crawling via http. Question, ultimately is: where should the actual crawl happen? As can be seen above I have been experimenting with different ways of triggering the actual work, but still missing some concept essential for grokering the way things fit together. Note: some of the OTP plumbing is left out for brevity - the plumbing is all there and the system all hangs together

Read the article
How Search Engine Bots Crawl Forums?

- by Waleed Eissa

If I have a forums site with a large number of threads, will the search engine bot crawl the whole site every time? Say I have over 1,000,000 threads in my site, will they get crawled every time the bot crawls my site? or how does it work? I want my website to be indexed but I don't want the bot to kill my website! In other words I don't want the bot to keep crawling the old threads again and again every time it crawls my website. Also, what about the pages crawled before? Will the bot request them every time it crawls my website to make sure they are still on the site? I'm asking this because I only link to the latest threads, i.e. there's a page that contains a list of all the latest threads, but I don't link to the older threads, they have to be explicitly requested by URL, e.g. http://www.mysite.com/showthread.aspx?threadid=7 , will this work to stop the bot from bringing my site down and consuming all my bandwidth? P.S. The site is still under development but I want to know in order to design the site so that search engine bots don't bring it down. Thanks

Read the article
Is it possible to do A/B testing by page rather than by individual?

- by mojones

Lets say I have a simple ecommerce site that sells 100 different t-shirt designs. I want to do some a/b testing to optimise my sales. Let's say I want to test two different "buy" buttons. Normally, I would use AB testing to randomly assign each visitor to see button A or button B (and try to ensure that that the user experience is consistent by storing that assignment in session, cookies etc). Would it be possible to take a different approach and instead, randomly assign each of my 100 designs to use button A or B, and measure the conversion rate as (number of sales of design n) / (pageviews of design n) This approach would seem to have some advantages; I would not have to worry about keeping the user experience consistent - a given page (e.g. www.example.com/viewdesign?id=6) would always return the same html. If I were to test different prices, it would be far less distressing to the user to see different prices for different designs than different prices for the same design on different computers. I also wonder whether it might be better for SEO - my suspicion is that Google would "prefer" that it always sees the same html when crawling a page. Obviously this approach would only be suitable for a limited number of sites; I was just wondering if anyone has tried it?

Read the article

Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 9/10 | < Previous Page | 5 6 7 8 9 10 | Next Page >

- by Andy

- by rebellion

- by Rob

- by Jonathan

- by OrangeRind

- by Jeff Atwood

- by Ian

- by user29347

- by aireq

- by John at CashCommons

- by cwd

- by user22215

- by j.strugnell

- by Sarah Seguin

- by ybbest

- by ybbest

- by Rajesh Ghosh

- by Eric

- by Clemens

- by Marcel

- by Leon

- by Tri Q

- by Toby Hede

- by Waleed Eissa

- by mojones

< Previous Page | 5 6 7 8 9 10 | Next Page >