Search Results

Search found 260 results on 11 pages for 'a barking spider'.


  • Data extract from website URL

    - by user2522395
    Using the script below I can extract all the links from a particular website, but I need to know how to pull data such as email addresses and phone numbers (where present) out of the extracted links. Please explain how I should modify the existing script to get that result, or share a full sample script if you have one.

        Imports System.Collections
        Imports System.Net
        Imports System.Text.RegularExpressions

        Private Sub btnGo_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnGo.Click
            'url must be in this format: http://www.example.com/
            Dim aList As ArrayList = Spider("http://www.qatarliving.com", 1)
            For Each url As String In aList
                lstUrls.Items.Add(url)
            Next
        End Sub

        Private Function Spider(ByVal url As String, ByVal depth As Integer) As ArrayList
            'aReturn is used to hold the list of urls
            Dim aReturn As New ArrayList
            'aStart is used to hold the new urls to be checked
            Dim aStart As ArrayList = GrabUrls(url)
            'temp array to hold data being passed to new arrays
            Dim aTemp As ArrayList
            'aNew is used to hold new urls before being passed to aStart
            Dim aNew As New ArrayList
            'add the first batch of urls
            aReturn.AddRange(aStart)
            'if depth is 0 then only return 1 page
            If depth < 1 Then Return aReturn
            'loops through the levels of urls
            For i = 1 To depth
                'grabs the urls from each url in aStart
                For Each tUrl As String In aStart
                    'grabs the urls and returns non-duplicates
                    aTemp = GrabUrls(tUrl, aReturn, aNew)
                    'add the urls to be checked to aNew
                    aNew.AddRange(aTemp)
                Next
                'swap urls to aStart to be checked
                aStart = aNew
                'add the urls to the main list
                aReturn.AddRange(aNew)
                'clear the temp array
                aNew = New ArrayList
            Next
            Return aReturn
        End Function

        Private Overloads Function GrabUrls(ByVal url As String) As ArrayList
            'will hold the urls to be returned
            Dim aReturn As New ArrayList
            Try
                'regex string used: thanks google
                Dim strRegex As String = "<a.*?href=""(.*?)"".*?>(.*?)</a>"
                'i used a webclient to get the source
                'web requests might be faster
                Dim wc As New WebClient
                'put the source into a string
                Dim strSource As String = wc.DownloadString(url)
                Dim HrefRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
                'parse the urls from the source
                Dim HrefMatch As Match = HrefRegex.Match(strSource)
                'used later to get the base domain without subdirectories or pages
                Dim BaseUrl As New Uri(url)
                'while there are urls
                While HrefMatch.Success = True
                    'loop through the matches
                    Dim sUrl As String = HrefMatch.Groups(1).Value
                    'if it's a page or sub directory with no base url (domain)
                    If Not sUrl.Contains("http://") AndAlso Not sUrl.Contains("www") Then
                        'add the domain plus the page
                        Dim tURi As New Uri(BaseUrl, sUrl)
                        sUrl = tURi.ToString
                    End If
                    'if it's not already in the list then add it
                    If Not aReturn.Contains(sUrl) Then aReturn.Add(sUrl)
                    'go to the next url
                    HrefMatch = HrefMatch.NextMatch
                End While
            Catch ex As Exception
                'catch ex here. I left it blank while debugging
            End Try
            Return aReturn
        End Function

        Private Overloads Function GrabUrls(ByVal url As String, ByRef aReturn As ArrayList, ByRef aNew As ArrayList) As ArrayList
            'overloads function to check duplicates in aNew and aReturn
            'temp url arraylist
            Dim tUrls As ArrayList = GrabUrls(url)
            'used to return the list
            Dim tReturn As New ArrayList
            'check each item to see if it exists, so not to grab the urls again
            For Each item As String In tUrls
                If Not aReturn.Contains(item) AndAlso Not aNew.Contains(item) Then
                    tReturn.Add(item)
                End If
            Next
            Return tReturn
        End Function
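The step the question asks about, pulling email addresses and phone numbers out of each downloaded page, could be sketched roughly as follows. This is a Python illustration rather than a change to the VB.NET script: `fetch`, `extract_contacts`, and both regular expressions are assumptions made for this sketch, and the phone pattern in particular is deliberately loose, so it will also pick up other digit runs such as dates.

```python
import re
from urllib.request import urlopen

# Simplified email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern (an assumption): optional "+", then at least 8 digits
# that may be separated by spaces, dots or dashes.
PHONE_RE = re.compile(r"\+?\d[\d\s.-]{6,}\d")


def fetch(url: str) -> str:
    """Download a page and return its source as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_contacts(html: str) -> dict:
    """Return the de-duplicated emails and phone-like strings found in html."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    phones = sorted({match.strip() for match in PHONE_RE.findall(html)})
    return {"emails": emails, "phones": phones}
```

In the crawler above, the equivalent call would run once per collected URL on the string that `wc.DownloadString(url)` returns, for example `extract_contacts(fetch(url))` in this sketch.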


  • JSON.parse vs. eval()

    - by Kevin Major
    My Spider Sense warns me that using eval() to parse incoming JSON is a bad idea. I'm just wondering if JSON.parse() - which I assume is a part of JavaScript and not a browser-specific function - is more secure.


  • Blocking 'good' bots in nginx with multiple conditions for certain off-limits URL's where humans can go

    - by Glenn Plas
    After 2 days of searching, trying and failing I decided to post this here. I haven't found an example of anyone doing the same, and nothing I tried seems to work. I'm trying to send a 403 to bots that do not respect the robots.txt file (even after downloading it several times), specifically Googlebot. Googlebot supports the following robots.txt definition:

        User-agent: *
        Disallow: /*/*/page/

    The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep nesting these links endlessly, adding paging block after paging block:

        my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//page/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    It's a WordPress site, by the way. I don't want those pages to show up. After the robots.txt info got through, the requests stopped for a while, only to begin again later. It just never stops. I do want real people to see these pages. As you can see, Google gets a 403, but when I try the same URL myself in a browser I get a 404 back. I want browsers to pass.

        root@my_domain:# nginx -V
        nginx version: nginx/1.2.0

    I tried different approaches, using a map and plain old "no-no" ifs, and they both act the same.

    Under the http section:

        map $http_user_agent $is_bot {
            default 0;
            ~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1;
        }

    Under the server section:

        location ~ /(\d+)/(\d+)/page/ {
            if ($is_bot) {
                return 403; # Please respect the robots.txt file !
            }
        }

    I recently had to polish up my Apache skills for a client, where I did roughly the same thing like this:

        # Block real Engines , not respecting robots.txt but allowing correct calls to pass
        # Google
        RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR]
        # Bing
        RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR]
        # msnbot
        RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR]
        # Slurp
        RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC]
        # block all page searches, the rest may pass
        RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR]
        # or with the wpmp_switcher=mobile parameter set
        RewriteCond %{QUERY_STRING} wpmp_switcher=mobile
        # ISSUE 403 / SERVE ERRORDOCUMENT
        RewriteRule .* - [F,L]
        # End if match

    This does a bit more than I asked nginx to do, but it's the same principle, and I'm having a hard time figuring this out for nginx. So my question is: why would nginx serve my browser a 404? Why isn't the request passing through? The regex isn't matching my UA:

        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"

    There are tons of examples that block based on UA alone, and that's easy. It also looks like the matching location is final, i.e. it's not "falling through" for regular users, and I'm pretty certain this has some correlation with the 404 I get in the browser.

    As a cherry on top of things, I also want Google to disregard the parameter wpmp_switcher=mobile. wpmp_switcher=desktop is fine, but I just don't want the same content being crawled multiple times. I ended up adding wpmp_switcher=mobile via the Google Webmaster Tools pages (which required me to sign up); that also stopped the crawling for a while, but today they are back spidering the mobile sections.

    So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone spare a few minutes of their life and push me in the right direction, please? I really appreciate ANY response that makes me think harder ;-)
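To check the claim that the bot regex does not match the browser's UA, the two patterns from the nginx config can be replayed outside nginx. The sketch below is Python, not nginx configuration; `is_bot` and `should_block` are hypothetical helper names, and the patterns and user-agent strings are copied from the question. If it holds, the 404 would come from how the matching location handles non-bot requests rather than from the bot test, though that reading is an inference and has not been verified against nginx itself.

```python
import re

# Patterns copied from the question. nginx's "~" prefix means a
# case-sensitive regex match, which re.search mirrors by default.
BOT_PATTERN = re.compile(
    r"crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider"
)
PAGE_PATTERN = re.compile(r"/(\d+)/(\d+)/page/")

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
CHROME_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"
)


def is_bot(user_agent: str) -> bool:
    """Mirror of the map block: True when the UA matches the bot regex."""
    return BOT_PATTERN.search(user_agent) is not None


def should_block(path: str, user_agent: str) -> bool:
    """Hypothetical helper: 403 only when a bot requests a paging URL."""
    return PAGE_PATTERN.search(path) is not None and is_bot(user_agent)
```

Here `should_block("/2011/06/page/3/", CHROME_UA)` is false while the same path with `GOOGLEBOT_UA` is true, so the user-agent test behaves as the question intends.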


  • AWS Load balancer connection reset

    - by joshmmo
    I have an ELB set up with two instances. The issue I have with it is that when I do not add www. to the hostname, the ELB just hangs. This is some info I get when I spider with wget:

        Spider mode enabled. Check if remote file exists.
        --2013-06-20 13:40:54-- http://learning.example.com/
        Resolving learning.example.com... 54.xxx.x.x53, 50.xx.xxx.x71
        Connecting to learning.example.com|54.xxx.x.x53|:80... connected.
        HTTP request sent, awaiting response... No data received.
        Retrying.

    When I add www. it works great. I have a GoDaddy SSL cert that I added to the listener section; it covers 3 domains: www.learning.example.com, files.learning.example.com and learning.example.com. These are my listener settings:

        - HTTP 80 HTTPS 443 N/A N/A
        - SSL 443 SSL 443 Change canvasNew (Change)

    My EC2 instances are running apache2 on Ubuntu 12.04. I will be happy to post my vhosts file if needed. However, when I ran the server with the domains pointing to just one EC2 instance, things worked fine. How can I fix this issue for learning.example.com? Why does www work just fine? A second question: what is the difference between instance protocol and load balancer protocol?

    EDIT: Here are the dig results for learning.example.com from yesterday. I changed the DNS entry to point to one instance to make sure it was the ELB. When I switch it back I will do it for www.learning.example.com.

        ; <<>> DiG 9.9.1-P2 <<>> learning.example.com
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20210
        ;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

        ;; QUESTION SECTION:
        ;learning.example.com. IN A

        ;; ANSWER SECTION:
        learning.example.com. 2559 IN CNAME canvas-22222222222.us-west-1.elb.amazonaws.com.
        canvas-22222222222.us-west-1.elb.amazonaws.com. 60 IN A 54.xxx.x.x53
        canvas-22222222222.us-west-1.elb.amazonaws.com. 60 IN A 50.xx.xxx.x71

        ;; Query time: 83 msec
        ;; SERVER: 10.x.xx.20#53(10.x.xx.20)
        ;; WHEN: Thu Jun 20 13:40:47 2013
        ;; MSG SIZE rcvd: 137

    EDIT 2: Here is some more info that might be helpful.

        Port Configuration:
        80 (HTTP) forwarding to 443 (HTTPS) Backend Authentication: Disabled Stickiness: Disabled(edit)
        443 (SSL, Certificate: canvasNew) forwarding to 443 (SSL) Backend Authentication: Disabled

    So I switched everything to one EC2 IP address to bypass the ELB and make sure things are working. It's running great: the www and the non-www URLs work perfectly fine. It's only when I switch things to the ELB that learning.example.com hangs and www.learning.example.com works. Hopefully you can get some ideas flowing.


  • How to set _optimizer_search_limit and _optimizer_max_permutations in Oracle10g.

    - by user52856
    I am working on a product that must support both MSSQL and Oracle (10g and 11g). I have some very complex queries that run without issue on MSSQL 2005/2008 but run very, very slowly on Oracle. The CPU on the Oracle server skyrockets for long periods of time, and it seems the optimizer may be trying to find the best execution plan for the very complex query. I did some Googling to figure out how to limit the amount of time the optimizer spends on this, and came up with _optimizer_search_limit and _optimizer_max_permutations. Both of these parameters are hidden in Oracle 10g, and setting them in init.ora doesn't seem to make any difference. How do I set these parameters in Oracle? Or am I just totally barking up the wrong tree with the assumption that the optimizer is spending several minutes finding an execution plan? Thanks.


  • nginx + php-fpm “504 Gateway Time-out” after compiling with curl support

    - by Brian
    We recently switched to managing php with php-fpm. It was working great, but is now giving me issues. The most recent change was to install libcurl-devel and re-compile php (5.3.3) using --with-curl. Now I'm getting 504 timeouts with nginx and the pages won't load. HTML pages load fine, phpinfo() loads as well. Tried backing out the changes and re-compiling without curl support, but still not having any luck. Also tried adjusting request_terminate_timeout per some of the other posts here on SF without change. This is on a test machine that has no other clients hitting it. I also tried switching to unix socket instead of tcp--same result. What am I missing here? Am I barking up the wrong tree with curl?


  • Can I easily use a VPN to duplicate SSH Tunneling functionality?

    - by Steve V.
    Right now, when I want to use an unsecured wireless connection with my (Linux) laptop, I secure my connection using a variation of the method provided here. However, to the best of my knowledge, the (non-jailbroken) iPad does not allow applications to tunnel traffic through local ports. However, it does seem to allow certain VPN traffic. I have never set up, or even used, a VPN before. I'm looking for confirmation that I'm not barking up the wrong tree before I invest significant effort into setting up my own VPN server. If I want to secure my wireless iPad traffic over an unsecure wireless connection, would I be on the right track by looking at a VPN?


  • Samba networking with a domain-joined computer

    - by MCS
    I have two computers connected to the same linksys router - one running Ubuntu 8.04 and one running Windows 7. The Windows computer is part of a work domain (when connected via VPN). I'm trying to also create a home network so I can access the linux server from Windows. From what I understand, I have to first create a Windows workgroup and then configure Samba on linux to join the Windows network. So in Windows, I went to Control Panel - Network and Internet - HomeGroup and got the following message: There is currently no homegroup on your network. Because this computer also belongs to a domain, you can't create your own homegroup, but can join one created by someone on your network. Can I use Samba to create a homegroup? Is there any other way to create a Windows workgroup? Or am I barking up the wrong tree completely?


  • Will Hotmail DELIVER to spam-friendly ISP?

    - by Oddthinking
    I have a user who is using Hotmail to communicate with one of our suppliers. Recently, Hotmail started bouncing emails to that address (while other ISPs are working fine). Hotmail bounce messages contain very little detail. Investigating, I found that the supplier's ISP has appeared on a blacklist of ISPs who do not do enough to prevent spamming by their users. (Note: neither my user nor the supplier is mass-mailing.) In the circumstances, I can understand Hotmail declining to accept emails from the host, or accepting them but being quick to mark them as spam. However, I am surprised that they might refuse delivery to it. Does this sound plausible to you, or am I barking up the wrong tree?


  • Installing the Apple Root Certificate Authority on CentOS CLI

    - by Daniel Hollands
    I could be barking up the wrong tree here, but I'm looking for help installing Apple's Root certificate (http://www.apple.com/certificateauthority/) on a CentOS server via the command line, which I need in order to send messages to their APNS system. The code I'm using for this purpose is a variation on this: https://github.com/jPaolantonio/SimplePush/blob/master/simplepush.php. It works perfectly well on a Windows server, but as soon as we try to use it on a CentOS one, it falls over. We're led to believe this has something to do with not having the CA installed on our CentOS box, but all efforts to install it have failed. As the CentOS server is headless, we need to be able to do this via the command line. Can someone help?


  • How do I re-enable the backlight?

    - by Scott Severance
    Since Oneiric, if I leave my machine (HP Mini 110 netbook) unattended and it goes into power-save mode, the backlight gets disabled. How can I turn it back on? Note that the backlight controls on the keyboard (Fn+F4 and Fn+F3) don't have any effect in this situation. I've already filed a bug, but filing a bug doesn't fix my problem. I tried this workaround posted in this bug report dealing with Acer laptops: sudo setpci -s 00:02.0 F4.B=0 However, if anything, that command makes things worse. In the general case, I can see a little bit if I'm in a dark room with a flashlight aimed just so. But after running setpci I can't see anything. And I find the setpci documentation to be utterly incomprehensible, so I don't know whether I need to tweak my command somehow or whether I'm completely barking up the wrong tree. Update: I've found a workaround: I'm now booting with the kernel parameter acpi=off. This disables power management, which prevents the machine from going into power-saving mode and thus failing to come back up correctly. Of course, not having power management means that I can't use suspend or do anything to manage power other than powering the machine off (even then, I have to manually use the power switch). It also prevents me from using Unity 3D or Gnome Shell, forcing me into Unity 2D or Gnome Classic. So, I'd really like to be able to stop using this hack.


  • Persisting high score table in flash game without a network. (Featuring: HttpListenerException)

    - by bearcdp
    Hi everyone, this question is very programming-centric, but it's for a game so I figured I might as well post it here. I'm doing polishing work on a GGJ '11 game because it will be shown at an indie arcade tomorrow afternoon, and they're expecting our final build in the morning. We'd like to have a high score table that displays during attract mode, but since it's Flash (Flixel) it would require some networking, Mochi, or something else to keep a record of these scores. The only problem is that the machine we'd be running on probably won't have network access.

    As a quick solution, I thought I'd just write up a dinky little high score server in C#/.NET that could take basic GET requests for submitting scores and getting the score list. We're talking REAL basic: blocking while waiting for an incoming request, a run-and-forget console app, etc. There's no guarantee that our .swf won't get reloaded, and we'd like the scores to persist, so this server would pretty much exist to keep a safe copy of the scores that the game can add to and request, and occasionally the server will write the scores to a flat text file.

    But HttpListener is giving me Error Code 87, 'The parameter is incorrect.' Do you have any idea what I'm doing wrong? Or better yet, am I barking up the wrong tree and missing an obviously simpler solution?

    This is all I've got so far in my Main():

        HttpListener listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:66666/");
        listener.Start();

    The exception happens at listener.Start(); and the stack trace is:

        at System.Net.HttpListener.AddAllPrefixes()
        at System.Net.HttpListener.Start()
        at WOSEBCE_ScoreServer.Program.Main(String[] args) in C:\Users\Michael\Documents\Visual Studio 2010\VS2010 Projects\WOSEBCE_ScoreServer\WOSEBCE_ScoreServer\Program.cs:line 24
        at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
        at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
        at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
        at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
        at System.Threading.ThreadHelper.ThreadStart()
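One detail in the snippet stands out, offered here as a likely cause rather than a confirmed diagnosis: the prefix uses port 66666, while TCP port numbers are 16-bit values, so the largest usable port is 65535. The Python sketch below only illustrates that range check; `port_of` and `is_valid_port` are hypothetical helpers and are not part of the .NET HttpListener API.

```python
from urllib.parse import urlsplit

MAX_TCP_PORT = 65535  # ports are 16-bit unsigned values


def port_of(prefix: str) -> int:
    """Return the port written in a URL prefix, without range-checking it."""
    netloc = urlsplit(prefix).netloc
    _host, sep, port_text = netloc.rpartition(":")
    if not sep:
        raise ValueError(f"no explicit port in {prefix!r}")
    return int(port_text)


def is_valid_port(port: int) -> bool:
    """True when the port fits in the 1..65535 range."""
    return 1 <= port <= MAX_TCP_PORT
```

Under that assumption, a prefix such as `http://localhost:8080/` passes the check while the one in the question does not.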


  • Is my htaccess setting hurting SEO?

    - by Ramanonos
    I have a site that I have redirecting to https. I do this to leverage wildcard SSL for my password-protected pages. Everything seems to work fine in testing. For example, whether you type in http or www, you always get redirected to the SSL https version. That said, I have about 200-300 external backlinks, many of them high quality, yet Google Webmaster Tools (along with SEOMoz) shows I have just 4. Huh? I'm embarrassed to say I just discovered this. This has led me to hypothesize that maybe my settings in htaccess are messed up, so Google isn't recognizing a link because it's recorded on another site as http instead of https. Maybe? At any rate, here is my simple htaccess setting for 301 redirects from www to http, and from http to https:

        RewriteCond %{SERVER_PORT} !443
        RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
        RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]

        RewriteCond %{SERVER_PORT} 443
        RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
        RewriteRule ^(.*)$ https://example.com/$1 [L,R=301]

    Like I said, everything works fine for the redirect to https, so I'd rather not screw up what works. On the other hand, something is very wrong with Google finding all my backlinks, so I need to fix something. I'm just wondering whether Google isn't picking up my backlinks from other websites that record me as http because I'm at https. Maybe Google doesn't care and it's some other issue. Am I barking up the right tree? If so, any quick fixes? Thanks as always!
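To see what the two rule blocks quoted above actually do, it can help to model them as a small function. This is a Python sketch of the rules as posted, not Apache behaviour verified on a server: `redirect_target` is a hypothetical name, and the path argument is assumed to arrive without a leading slash, as `$1` does in a per-directory .htaccess context.

```python
from typing import Optional


def redirect_target(host: str, port: int, path: str) -> Optional[str]:
    """Return the 301 target the posted rules would produce, or None."""
    # [NC] makes the host comparison case-insensitive.
    is_www = host.lower() == "www.example.com"
    # "443" and "!443" are unanchored patterns, so they test whether
    # the port string contains 443.
    on_443 = "443" in str(port)
    if is_www and not on_443:
        return f"http://example.com/{path}"
    if is_www and on_443:
        return f"https://example.com/{path}"
    return None  # neither rule applies
```

In this model a plain `http://example.com/...` request gets no redirect at all, which suggests the http-to-https step described in the question is handled somewhere other than these six lines.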


  • Video acceleration problem with Windows 7 games and PPTX files

    - by Jordan 1GT
    I have a Dell xps M1330 which originally ran Vista, but I upgraded to Windows 7. When I try to run a Win 7 game like spider solitaire I receive the following message: The game is running in software rendering mode. Hardware acceleration is either disabled or not supported by your video card driver which could slow down game performance. Make sure you have the latest video card driver installed and that hardware acceleration is turned on. I confirmed that hardware acceleration is turned on. When I go to Dell's site, I'm told there is no later video driver. When I run the game it runs very choppy. I have a .pptx file which is doing strange things in normal view and I suspect it may be related to the same video acceleration problem.


  • Video problem with Windows 7 Games

    - by Jordan 1GT
    I have a Dell xps M1330 which originally ran Vista, but I upgraded to Windows 7. When I try to run a Win 7 game like spider solitaire I receive the following message: "The game is running in software rendering mode. Hardware acceleration is either disabled or not supported by your video card driver which could slow down game performance. Make sure you have the latest video card driver installed and that hardware acceleration is turned on." I confirmed that hardware acceleration is turned on. When I go to Dell's site, I'm told there is no later video driver. When I run the game it runs very choppy. I wouldn't care, but I loaded a .pptx file which is doing strange things in normal view and I suspect may be related to the same video problem. Any ideas?


  • Mongo Client RedHat EL5 UTF8 Support

    - by Michael Irey
        # mongo
        MongoDB shell version: 1.6.4
        Fri Mar 16 11:55:46 *** warning: spider monkey build without utf8 support. consider rebuilding with utf8 support
        connecting to: test

    Mongo Server seems to handle the utf8 characters fine, as does my php-mongo-client driver. But when I try to query a record that has a utf8 character from the mongo command line client I get:

        > db.Users.find({age:33});
        error:non ascii character detected
        Fri Mar 16 11:55:43 mongo got signal 11 (Segmentation fault), stack trace:
        Fri Mar 16 11:55:43 0x440b50 0x3664c302d0 0x3f47e7b6e0 0x3f47e83bbd 0x3f47e254f3 0x3f47e25660 0x3f47e256ee 0x3f47e25792 0x3f47e2876e 0x4b031d 0x443b72 0x445476 0x3664c1d994 0x43fd39
        mongo(_Z12quitAbruptlyi+0x3b0) [0x440b50]
        /lib64/libc.so.6 [0x3664c302d0]
        /usr/lib64/libjs.so.1 [0x3f47e7b6e0]
        /usr/lib64/libjs.so.1(js_CompileTokenStream+0x3d) [0x3f47e83bbd]
        /usr/lib64/libjs.so.1 [0x3f47e254f3]
        /usr/lib64/libjs.so.1(JS_CompileUCScriptForPrincipals+0x60) [0x3f47e25660]
        /usr/lib64/libjs.so.1(JS_EvaluateUCScriptForPrincipals+0x3e) [0x3f47e256ee]
        /usr/lib64/libjs.so.1(JS_EvaluateUCScript+0x22) [0x3f47e25792]
        /usr/lib64/libjs.so.1(JS_EvaluateScript+0x6e) [0x3f47e2876e]
        mongo(_ZN5mongo7SMScope4execERKSsS2_bbbi+0xed) [0x4b031d]
        mongo(_Z5_mainiPPc+0x14a2) [0x443b72]
        mongo(main+0x26) [0x445476]
        /lib64/libc.so.6(__libc_start_main+0xf4) [0x3664c1d994]
        mongo(__gxx_personality_v0+0x269) [0x43fd39]

    Any ideas or suggestions would be welcome.


  • Download web server structure with empty files

    - by golimar
    I want to make a mirror of a Web server, but downloading the actual files will take too long. So I thought of having just the directory and file structure, and when I need the actual contents of the file, I can download just that file. I have tried wget --spider URL and in a short time it has created in my local disk the directory structure with no files. But I've checked all of wget's or curl's switches and there is nothing like what I need. Can this be done with wget, curl or any other tool?
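The idea described above, a local tree of empty files that mirrors the server's layout, could be sketched as follows once a list of the server's URLs is available. How that list is gathered is left open here and is an assumption of the sketch; `placeholder_path`, `create_placeholders`, and the `index.html` name used for directory URLs are likewise hypothetical choices, not features of wget or curl.

```python
import os
from urllib.parse import urlsplit


def placeholder_path(root: str, url: str) -> str:
    """Map a URL to a local file path under root."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed file name for directory URLs
    # Drop empty, "." and ".." segments so paths cannot escape root.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    return os.path.join(root, *parts)


def create_placeholders(root: str, urls: list) -> list:
    """Create one file per URL (empty if new) and return the paths."""
    created = []
    for url in urls:
        target = placeholder_path(root, url)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Append mode creates the file if missing and never truncates
        # content that was downloaded earlier.
        open(target, "a").close()
        created.append(target)
    return created
```

A file's real contents can then be fetched on demand and written over its placeholder, since the local path for any URL is recomputable with `placeholder_path`.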


  • SEO - Google and link cleaning / cloaking [closed]

    - by Jens Törnell
    Possible Duplicate: Does the Google spider render JavaScript?

    This is an SEO-related question, not a code-related one.

    Google's own link cleaning / cloaking

    Go to http://www.google.com and search for something. Hover over the title and you will see a link to the page you want to go to. The URL you see when hovering is NOT the link you are clicking on. Instead of clicking, you can drag the title a little bit and then hover over it. Then you will see the real URL.

    My own link cleaning / cloaking

    Go to http://jsfiddle.net/NvmER/1/ and click the link, or look at the code below. You will be "redirected" to http://www.test.com. The real link is http://www.test.com/?event=23

    Working code in case jsfiddle doesn't work

    If you need to see how it works, I pasted the code below.

        <a class="direct" href="http://www.test.com/?event=23" data-redirect="http://www.test.com">Länk</a>

        $(document).ready(function() {
            $("a.direct").live("mousedown", function(e){
                var oldurl = $(this).attr('href');
                var newurl = $(this).attr('data-redirect');
                $(this).attr('href', newurl);
            });
        });

    Question

    Is this OK with Google? It's done with JavaScript. If you have an answer, link to a source or test to support it.


  • Idea of an algorithm to detect a website's navigation structure?

    - by Uwe Keim
    Currently I am in the process of developing an importer of any existing, arbitrary (static) HTML website into the upcoming release of our CMS. While downloading the files is solved, I'm pulling my hair out when it comes to detecting a site structure (pages and subpages) purely from the HTML files, without the user specifying additional hints. Basically I want to get a tree like:

        + Root page 1
          + Child page 1
          + Child page 2
            + Child child page1
          + Child page 3
        + Root page 2
          + Child page 4
        + Root page 3
        + ...

    I.e. I want to be able to detect the menu structure from the links inside the pages. This does not have to be 100% accurate, but I want to achieve more than just a flat list. I thought of comparing multiple pages to find similar areas, identifying these as menu areas and parsing the links there, but I'm not that satisfied with this idea.

    My question: Can you think of an algorithm for detecting such a structure?

    Update 1: What I'm looking for is not a web spider, but an algorithm to create a logical tree of the relationships between the pages, so that I can create pages and subpages inside my CMS when importing them.

    Update 2: Following Robert's suggestion, I'll solve this by starting at the root page, then simply parsing links as I go and treating every link inside a page as a child page. I'll probably recurse not in a depth-first manner but rather in a breadth-first manner to get a more balanced navigation structure.
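The approach settled on in Update 2 can be sketched as a breadth-first walk in which each page becomes a child of the first page found linking to it. The Python below is a minimal illustration under stated assumptions: `build_nav_tree` is a hypothetical name, and the link graph is passed in as a plain dictionary rather than parsed from HTML files.

```python
from collections import deque


def build_nav_tree(links: dict, root: str) -> dict:
    """Return {page: [child pages]} from a breadth-first walk of links.

    links maps each page to the pages it links to, in document order.
    A page is attached to the first parent that reaches it, so pages
    closer to the root claim their children before deeper pages do.
    """
    children = {root: []}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in children:  # first sighting wins
                children[target] = []
                children[page].append(target)
                queue.append(target)
    return children
```

Because shared menu links appear on the root page first, breadth-first order keeps them at the top level instead of burying them under whichever subpage happens to be visited first, which is the balancing effect the update mentions.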


  • Handling SEO for Infinite pages that cause external slow API calls

    - by Noam
    I have an 'infinite' number of pages on my site which rely on an external API. Generating each page takes time (1 minute). Links on the site point to such pages, and when a user clicks one, the page is generated while the user waits. Considering I cannot pre-create them all, I am trying to figure out the best SEO approach to handle these pages. Options:

    - Create really simple pages for the web spiders, so that only real users fetch the data and generate the page. I'm a little bit 'afraid' Google will see this as low-quality content, which might also look duplicated.
    - Put them under a directory on my site (e.g. /non-generated/) and add a disallow rule in robots.txt. The problem here is that I don't want users to have to deal with a different URL when they want to share this page or make sense of it. I thought about redirecting real users from this URL back to the regular hierarchy and that way 'fooling' Google into not reaching them. Again, I'm not sure Google will like me for that.
    - Let Google crawl these pages. The main problem is that I can't control the rate of the API calls, and my site also seems slower than it should from a spider's perspective (if it only crawled the generated pages, it would think the site is much faster).

    Which approach would you suggest?


  • CodePlex Daily Summary for Tuesday, November 30, 2010

    Popular Releases

    - Sense/Net Enterprise Portal & ECMS: SenseNet 6.0.1 Community Edition: Sense/Net 6.0.1 Community Edition. This half year we have been working quite fiercely to bring you the long-awaited release of Sense/Net 6.0. Download this Community Edition to see what we have been up to. These months we have worked on getting the WebCMS capabilities of Sense/Net 6.0 up to par. New features include: New, powerful page and portlet editing experience. HTML and CSS cleanup, new, powerful site skinning system. Upgraded, lightning-fast indexing and query via Lucene. Limita...
    - Minecraft GPS: Minecraft GPS 1.1.1: New Features: Compass! New style. Set opacity on main window to allow overlay of Minecraft. Open World in any folder. Fixes: Fixed style so listbox won't grow the window size. Fixed open file dialog issue on non-vista kernel machines.
    - DotSpatial: DotSpatial 11-28-2001: This release introduces some exciting improvements. Support for big raster, both in display and changing the scheme. Faster raster scheme creation for all rasters. Caching of the "sample" values so once obtained the raster symbolizer dialog loads faster. Reprojection supported for raster and image classes. Affine transform fully supported for images and rasters, so skewed images are now possible. Projection uses better checks when loading unprojected layers. GDAL raster support f...
    - Virtu: Virtu 0.9.0: Source Requirements: .NET Framework 4, Visual Studio 2010 or Visual Studio 2010 Express, Silverlight 4 Tools for Visual Studio 2010, Windows Phone 7 Developer Tools (which includes XNA Game Studio 4). Binaries Requirements: Silverlight 4, .NET Framework 4, XNA Framework 4.
    - SuperWebSocket: SuperWebSocket(60438): It is the first release of SuperWebSocket. Because it is based on SuperSocket, most features of SuperSocket are supported in SuperWebSocket. The source code includes a LiveChat demo.
    - MDownloader: MDownloader-0.15.25.7002: Fixed updater. Fixed FileServe. Fixed LetItBit.
    - Notepad.NET: Notepad.NET 0.7 Preview 1: What's New? * Optimized Code Generation: Which means it will run significantly faster. * Preview of Syntax Highlighting: Only VB.NET highlighting is supported, C# and Ruby will come in Preview 2. * Improved Editing Updates (when the line number, etc updates) to be more graceful. * Recent Documents works! * Images can be inserted but they're extremely large. Known Bugs: * The Update Process hangs: This is a bug apparently spawning since 0.5. It will be fixed in Preview 2. Until then, perform a fr...
    - Cropper: 1.9.4: Mostly fixes for issues with a few feature requests. Fixed Issues: 2730 & 3638 & 14467, 11044, 11447, 11448, 11449, 14665. Implemented Features: 6123, 11581.
    - PFC: PFC for PB 11.5: This is just a migration from the 11.0 code. No changes have been made yet (and they are needed) for it to work properly with 11.5.
    - PDF Rider: PDF Rider 0.5: This release does not add any new feature for pdf manipulation, but enables automatic updates checking, so it is recommended to install it in order to stay updated with next releases. Prerequisites: * Microsoft Windows Operating Systems (XP - Vista - 7) * Microsoft .NET Framework 3.5 runtime * A PDF rendering software (i.e. Adobe Reader) that can be opened inside Internet Explorer. Installation instructions: Choose one of the following methods: 1. Download and run the "pdfRider0...
    - BCLExtensions: BCL Extensions v1.0: The files associated with v1.0 of the BCL Extensions library.
    - XamlQuery/WPF - The Write Less, Do More, WPF Library: XamlQuery-WPF v1.2 (Runtime, Source): This is the first release of popular XamlQuery library for WPF. XamlQuery has already gained recognition among Silverlight developers.
    - Math.NET Numerics: Beta 1: First beta of Math.NET Numerics. Only contains the managed linear algebra provider. Beta 2 will include the native linear algebra providers along with better documentation and examples.
    - Microsoft All-In-One Code Framework: Visual Studio 2010 Code Samples 2010-11-25: Code samples for Visual Studio 2010.
    - Wii Backup Fusion: Wii Backup Fusion 0.8.5 Beta: - WBFS repair (default) options fixed - Transfer to image fixed - Settings ui widget names fixed - Some little bug fixes. You need to reset the settings! Delete WiiBaFu's config file or registry entries on windows: Linux: ~/.config/WiiBaFu/wiibafu.conf Windows: HKEY_CURRENT_USER\Software\WiiBaFu\wiibafu Mac OS X: ~/Library/Preferences/com.wiibafu.wiibafu.plist Caution: This is a BETA version! Errors, crashes and data loss not impossible! Use in test environments only, not on productive syste...
    - Minemapper: Minemapper v0.1.3: Added process count and world size calculation progress to the status bar. Added View->'Status Bar' menu item to show/hide the status bar. Status bar is automatically shown when loading a world. Added a prompt, when loading a world, to use or clear cached images.
    - Sexy Select: sexy select v0.4: Changes in v0.4: Added method: elements. This returns all the option elements that are currently added to the select list. Added method: selectOption. This method accepts two values, the element to be modified and the selected state (true/false).
    - Deep Zoom for WPF: First Release: This first release of the Deep Zoom control has the same source code, binaries and demos as the CodeProject article (http://www.codeproject.com/KB/WPF/DeepZoom.aspx).
    - BlogEngine.NET: BlogEngine.NET 2.0 RC: This is a Release Candidate version for BlogEngine.NET 2.0. The most current, stable version of BlogEngine.NET is version 1.6. Find out more about the BlogEngine.NET 2.0 RC here. If you want to extend or modify BlogEngine.NET, you should download the source code. To get started, be sure to check out our installation documentation and the installation screencast. If you are upgrading from a previous version, please take a look at the Upgrading to BlogEngine.NET 2.0 instructions. As this ...
    - NodeXL: Network Overview, Discovery and Exploration for Excel: NodeXL Excel Template, version 1.0.1.156: The NodeXL Excel template displays a network graph using edge and vertex lists stored in an Excel 2007 or Excel 2010 workbook. What's New: This release adds a feature for aggregating the overall metrics in a folder full of NodeXL workbooks, adds geographical coordinates to the Twitter import features, and fixes a memory-related bug. See the Complete NodeXL Release History for details. Please Note: There is a new option in the setup program to install for "Just Me" or "Everyone." Most people...

    New Projects

    - ActiveRecordTest: ActiveRecordTest is a sample project that is really a quick guide for start using Castle ActiveRecord within an ASP.NET web application.
    - BacteriaManage: just test codeplex
    - DS CMS: Diamond Shop - open source project. 1. ASP.NET MVC 3.0 2. Entity Framework 3. Jquery 4. Linq
    - General Media Access WebService: This project is focused on building a general purpose media access webservice based on WCF.
    - JavaEE server for XUNU: This is the web server for Choupie's site.
    - Learning management system: Learning management system to help teachers on their work.
    - LogWriterReader using Named pipe: LogWriterReader using Named pipe
    - NMix: NMix???EntLib,NHibernate,log4net??????????,????????????????,?????????、?????、????、????、?????????。
    - Nosso Rico Dinheirinho: Financial control system like Microsoft Money, but via web.
    - Post Template: Post Template (for now) is for craigslist posters looking to make their posts more visually appealing. Abstracting the styling and layout details of HTML and CSS, Post Template eliminates the need to know these languages when posting.
Post Template is mostly written in C#.SharePoint Silverlight Clock: SharePoint Silverlight ClockSilverlight MVVM wizard using Caliburn Micro: This MVVM style Silverlight 4 wizard shows some Caliburn Micro features, as well as the use of MEF and MVVM style unit testing. The UI and code are based on the code accompanying the "Code Project" article "Creating an Internationalized Wizard in WPF" from dec. 2008.Spider Framework: A ruler-based spider framework developing with C#syx Open Source Project: syx Open Source ProjectTigerCat: TigerCat will support application development as infrastructure and RAD tools.TitleNetSolution: This my team Solution.!Uploadert: UploadertWidget Suite for DotNetNuke: This project is intended to hold a suite of useful widgets to make your skinning easier, and raise the level of interactivity with DotNetNuke website visitors.ZenBridge for Picasa: ZenBridge for Picasa makes it easy for Zenfolio users to upload edited images directly to a chosen Zenfolio gallery. It's developed in C#.NET 4.

    Read the article

  • Recovering a lost website with no backup?

    - by Jeff Atwood
    Unfortunately, our hosting provider experienced 100% data loss, so I've lost all content for two hosted blog websites: http://blog.stackoverflow.com and http://www.codinghorror.com. (Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)

    I am beginning the slow, painful process of recovering the website from web crawler caches. There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, like Warrick, but I had some bad results using this: my IP address was quickly banned from Google for using it; I get lots of 500 and 503 errors and "waiting 5 minutes…"; and ultimately, I can recover the text content faster by hand.

    I've had much better luck by using a list of all blog posts, clicking through to the Google cache and saving each individual file as HTML. While there are a lot of blog posts, there aren't that many, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that I've had good luck getting the blog post text this way, and I am definitely able to get the text of the web pages out of the Internet caches. Based on what I've done so far, I am confident I can recover all the lost blog post text and comments. However, the images that go with each blog post are proving…more difficult.

    Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages? (And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)
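
    Once the cached pages are saved as HTML, one way to see which images still need recovering is to list the image URLs each saved page references. The following is a minimal sketch in Python using only the standard library; the names `ImageCollector` and `image_urls`, and the image paths in the usage note, are hypothetical and not part of the original question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the absolute URL of every <img src=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page's original address.
            absolute = urljoin(self.base_url, src)
            if absolute not in self.images:
                self.images.append(absolute)


def image_urls(html, base_url):
    """Return the de-duplicated image URLs referenced by one saved page."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images
```

    Calling `image_urls(saved_html, "http://www.codinghorror.com/blog/")` on each saved file would give a checklist of image addresses to look up in whatever archives are available; it does not fetch anything itself.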

    Read the article

  • How should I deal with user agent parsing in logs?

    - by Mr. Jefferson
    My web app project includes logging functionality so we can see where visitors are coming from (referrer URL), what the popular user agents are, what pages are most popular, etc. The log is stored in SQL Server, and when I query the user agents I use a large (almost 100 lines) and growing CASE statement to separate the user agents using string matching (e.g. if the user agent contains the string "Firefox/9" then it's Firefox 9). Is there a better way to do this so I don't have to continually add to that CASE statement to deal with new browser releases? Also, how should I deal with less common, weird/unknown user agents? I've seen the following in the logs and been unable to find good information online about what they are: "WordPress/3.3.1; http://www.facecolony.org" and "Mozilla/4.0 ( http://www.hairirons.org redips; <a href=http://hairirons.org/>chi hair iron</a>)". I'd guess they're bots/crawlers, but the sites they point to don't appear to reference web crawlers (or even be available sometimes). I've seen other user agents that aren't familiar to me, but I know they're bots because they include "bot" or "spider" or something similar in them.
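
    One possible alternative to a per-version CASE statement is a short, ordered table of patterns that capture the version number generically, so a new browser release needs no new rule. The sketch below is in Python rather than SQL and is only an illustration: the `RULES` table, the `classify` function, and the choice of patterns are assumptions, not an exhaustive or authoritative list.

```python
import re

# Ordered rules: the first pattern that matches wins. The bot check comes
# first because many crawlers also put "Mozilla" in their user agent.
RULES = [
    ("Bot", re.compile(r"bot|spider|crawler", re.IGNORECASE)),
    ("WordPress", re.compile(r"WordPress/(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
    ("MSIE", re.compile(r"MSIE (\d+)")),
]


def classify(user_agent):
    """Return (family, major_version); version is None when not captured."""
    for family, pattern in RULES:
        match = pattern.search(user_agent)
        if match:
            # Only rules with a capture group report a version.
            version = match.group(1) if pattern.groups else None
            return family, version
    # Anything unmatched is kept as "Unknown" for later manual review.
    return "Unknown", None
```

    With this layout, "Firefox/9.0" and any later Firefox release fall under the same rule, and unrecognized strings such as the hairirons.org agent above land in an "Unknown" bucket that can be reviewed separately.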

    Read the article

  • Googlebot visit but no cache update - why?

    - by Mick
    I have made a new plain vanilla HTML website. I have been making regular modifications to it on an almost daily basis. The site is hosted by hostmonster and as part of their service they offer "awstats" to let you know assorted details of visitors to the site. One thing is puzzling me. According to awstats, a "robot/spider" calling itself "Googlebot" visited my site as recently as today (28th June 2011), but when I find my site on google (e.g. by searching for "full reserve banking") the cache is dated only the 5th June. I always thought that a visit from the google robot was synonymous with a cache update. Am I wrong? Or have I accidentally put something in the site telling google that nothing has been updated? EDIT: It seems a moderator has removed the name of my website, so there is now no chance that anyone could check out if I had made some error on my site :-( ... but anyway, in answer to paulmorriss' question, here is what aw stats was telling me:

    Read the article
