Search Results

Search found 210 results on 9 pages for 'scrape'.

Page 4 of 9

  • PHP Curl Not formatting quotes properly; producing weird character set for single/double quotes

    - by user595052
    I wrote an HTML scraper to scrape my various social identities so I can make a real-time 'biography' website. However, after fetching pages with PHP's curl_exec, I find that text I have quoted ends up in a mangled character set. For example, "I love dogs" gets formatted to ’I love dogs ’ and "I hate cheese" gets formatted to “I hate cheese�. How do I either scrub these characters or stop cURL from mangling quotes like this? I have already turned off magic_quotes.
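
    The mangled output is the classic signature of UTF-8 text (curly quotes are three bytes each) being decoded as Windows-1252 somewhere in the pipeline; in PHP, mb_convert_encoding or honoring the source page's charset addresses it. A minimal sketch of the repair, shown here in Python and assuming that UTF-8-read-as-Windows-1252 diagnosis:

        # round-trip the mojibake back through the wrong codec to recover the
        # original text (assumes the damage was utf-8 bytes decoded as cp1252)
        broken = 'â€™I love dogsâ€™'
        fixed = broken.encode('windows-1252').decode('utf-8')
        print(fixed)   # 'I love dogs' with proper curly quotes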

  • Do not filter outlinks in Nutch?

    - by sigpwned
    I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the links crawled at every step filtered, but the outlinks captured from each page crawled were filtered as well. Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

  • Full disk encryption on linux (ubuntu) w/o re-installing - possible?

    - by sa125
    Hi - I work at a company that takes security very seriously (like most). Our IT guy came in today to prepare us mentally to re-install our systems once he applies the new encryption policy (which will basically scrape our HDs clean). For our team this means about a week of re-configuring, installing, and tweaking our desktops until we are back to working capacity - anyone who has had to re-install a development machine probably knows what I'm talking about. So, I guess my question is whether there's any way to perform full disk encryption on a Linux (Ubuntu 9.04) system without having to re-install EVERYTHING [sigh]. The IT guy said there isn't - please prove him wrong. Thanks :)

  • help with Outlook Exchange server and curl

    - by stib
    I work on a Mac in a building full of PCs, and the IT department here doesn't have IMAP access turned on on the Exchange servers. So I miss a lot of meetings: I don't get reminders because I access my mail via Outlook Web Access. I had written a script to scrape my Outlook Web Access calendar and turn it into iCal format, so I could get my reminders via Thunderbird or iCal.app. It basically downloaded the calendar page via curl, parsed the HTML, and reformatted all the appointments as iCal. It wasn't elegant, but it worked. Then they changed to Outlook 2007, and it doesn't work any more. I have a sketchy knowledge of curl and almost zero knowledge of how Outlook works. Can anyone point me towards a reference for getting calendar info out of an Exchange server without using Outlook? If I can configure curl to get the HTML I will be happy, but if there's a more elegant way, such as getting the calendar info as XML, I'll be delirious.

  • Identify malicious subnet

    - by Macros
    I have been experiencing performance issues on a website for a while, and they always seem to hit around the same time. Having analysed the logs, I've found a big spike in requests which corresponds with this slowdown, with all requests coming from the same subnet. It feels to me like an attempt to scrape the site (it is a car hire site, and the requests are sequential for each IP and with incremental search criteria) and I would like to identify the source. The subnet in question is 209.67.89.x, which I can see is owned by Savvis; however, I can't reverse-DNS any of the IPs. Is there any other way I can gain more info on this (other than contacting them directly, which I am also doing)?
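
    Beyond reverse DNS, a whois lookup on the block returns the registry's ownership record (netblock owner, abuse contact), which is usually the fastest way to put a name to a subnet. A small sketch of both steps in Python, relying on the standard whois command-line tool:

        import socket
        import subprocess

        # reverse DNS for a sample of addresses in the block (the asker saw none)
        for last_octet in (1, 2, 3):
            ip = '209.67.89.%d' % last_octet
            try:
                print(ip, socket.gethostbyaddr(ip)[0])   # PTR record, if any
            except socket.herror:
                print(ip, 'no reverse DNS')

        # block ownership and abuse contacts come from the regional registry;
        # the whois CLI ships with most Unix-like systems
        print(subprocess.check_output(['whois', '209.67.89.0']).decode())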

  • Parsing HTML Documents with the Html Agility Pack

    Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what the various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML. The .NET Framework offers a variety of classes for accessing data from a remote website, such as the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf and String.Substring, or on regular expressions. Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.) This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!
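
    The article's core pattern - fetch the markup, build a DOM, query it with XPath instead of string gymnastics - is not .NET-specific. For readers outside that ecosystem, here is the same pattern sketched in Python with lxml (the URL is hypothetical):

        import urllib.request
        from lxml import html

        markup = urllib.request.urlopen('https://example.com/products').read()
        doc = html.fromstring(markup)        # tolerant DOM, like the Agility Pack's

        # XPath over the DOM instead of String.IndexOf/Substring gymnastics
        for link in doc.xpath('//a[@href]'):
            print(link.get('href'), link.text_content().strip())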

  • Html Agility Pack for Reading “Real World” HTML

    - by WeigeltRo
    In an ideal world, all data you need from the web would be available via well-designed services. In the real world you sometimes have to scrape the data off a web page. Ugly, dirty - but if you really want that data, you have no choice. Just don't write (yet another) HTML parser. I stumbled across the Html Agility Pack (HAP) a long time ago, but just now had the need for a robust way to read HTML. A quote from the website: This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). Using the HAP was a simple matter of getting the NuGet package, taking a look at the example, and dusting off some of my XPath knowledge from years ago. The documentation on the Codeplex site is non-existent, but if you've queried a DOM or used XPath or XSLT before, you shouldn't have problems finding your way around using Intellisense (ReSharper tip: press Ctrl+Shift+F1 on class members for reading the full doc comments).

  • Killer content for my Kindle - The Economist with no need for an iPad - yipeee!

    - by Liam Westley
    I admit it: I was jealous of someone's iPad. They were reading The Economist, for free, as they were a print subscriber. I'm a print subscriber too. However, I don't have an iPad or an iPhone, just an Android phone and a Kindle. As soon as I got the Kindle, I looked up how to get The Economist on it. £9.99 per month. Hmmm, twice as much again as my print subscription, and I wanted to maintain the print subscription. No way, Amazon. Fortunately some nice person wrote similar comments on The Economist subscription for Kindle, but added a very important additional nugget of information: there is no need, because as a print subscriber you can just use the free Calibre e-book creation tool instead. So I downloaded it, searched for The Economist online 'recipe', entered my login name and password (part of my print subscription), and off went Calibre to screen scrape every single article from the Christmas 2010 issue into a .mobi file, complete with front cover image and full indexing. It's wonderful. Truly wonderful. Every section individually indexed, with each article separated and all inline images preserved. It even feels wonderfully retro, back to the days when The Economist only used black and white images. So many thanks to the guys behind Calibre and The Economist recipe creators. Finally, I have the essential Kindle content I've been waiting for.

  • Ubuntu 13.04 to 13.10: Filesystem check or mount failed

    - by SamHuckaby
    I attempted to upgrade from Ubuntu 13.04 to 13.10 today, and mid-upgrade the system started flaking out and eventually locked up entirely. I was forced to restart the computer, and am now unable to get it to boot at all. When I boot currently, it takes me to the GRUB menu, and I can choose to boot normally or boot an older version. I have tried several things, which I list below, but no matter what, when I try to finish booting into Ubuntu, I receive the following error:

        Filesystem check or mount failed.
        A maintenance shell will now be started.
        CONTROL-D will terminate this shell and continue booting after re-trying filesystems.
        Any further errors will be ignored
        root@ubuntu-computername:~#

    I have run fsck -f and everything appears correct: no errors are reported and it passes all 5 checks. If I run fdisk -l, I get the following information:

        Disk /dev/sda: 320.1 GB, 320072933376 bytes
        255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
        Units = sectors of 1 * 512 = 512 bytes
        Sector size (logical/physical): 4096 bytes / 4096 bytes
        Disk identifier: 0x00010824

           Device Boot      Start         End      Blocks   Id  System
        /dev/sda1   *        2048   608456703   304227328   83  Linux
        /dev/sda2       608458750   625141759     8341505    5  Extended
        Partition 2 does not start on physical sector boundary.
        /dev/sda5       608458752   625141759     8341504   82  Linux swap / Solaris

        Disk /dev/sdb: 320.1 GB, 320072933376 bytes
        255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
        Units = sectors of 1 * 512 = 512 bytes
        Sector size (logical/physical): 512 bytes / 4096 bytes
        I/O size (minimum/optimal): 4096 bytes / 4096 bytes
        Disk identifier: 0x0fb4b7e8

           Device Boot      Start         End      Blocks   Id  System
        /dev/sdb1            8192   625139711   312565760    7  HPFS/NTFS/exFAT

    I am considering just installing a new OS on the other disk, which currently has nothing on it, and then attempting to scrape my data off the old disk (thankfully I didn't encrypt the files). Really my question is this: can I salvage this Ubuntu install, or should I give up and just reinstall?

  • Get external metadata with streamripper using python script

    - by user72379
    Hi, I like using streamripper to rip music from the web. I have a favorite radio station that doesn't provide metadata for the songs, so I have to screen scrape it from its website manually. I created this neat Python script in the format that the docs suggest, and linked its address in the streamripper GUI. But it still doesn't work; anyone know how to make it work? I know it used to work. The site gives a sample here: http://streamripper.sourceforge.net/history.php

        import http
        import time
        import re

        u = 'SEE IMAGE FOR THIS URL'
        s = http.Session(0, 0)
        s.add_headers(h, persistent=1)

        while 1:
            c = unicode(s.get(u))
            pat = r'class="title"([^<]+)([^<]+)'
            m = re.search(pat, c)
            title = m.group(1)
            artist = m.group(2)
            print 'TITLE=' + title + '\n' + 'ARTIST=' + artist + '\n.\n'
            time.sleep(30)

    (Screenshot of the URL: http://s11.postimage.org/sok928lsz/urlstream.png) I put the address to the script here: (screenshot: http://s17.postimage.org/4bhmhi4yn/streamripper.png) I've tried putting it in the root of the streamripper application and pointing to it as lax.py. I've even compiled it to an EXE and tried linking to that - nothing. What am I doing wrong?

  • Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 3)

    - by hinkmond
    In Part 2, I described what connections you need to make for this demo using a Hexbug Spider Robot, a Raspberry Pi, and Java SE Embedded for programming. Here are some photos of me doing the soldering. Software engineers should not be afraid of a little soldering work. It's all good. See: Skynet Big Data Demo (Part 2). One thing to watch out for when you open the remote is that there may be some glue covering the contact points. Make sure to use an X-Acto knife or small screwdriver to scrape away any glue or non-conductive material covering each place where you need to solder. And after you are done with your soldering and have given the solder enough time to cool, make sure all your connections are marked so that you know which wire goes where. Give each wire a very light tug to make sure it is soldered correctly and is making good contact. There are lots of videos on the Web to help you if this is your first time soldering. Check out Lady Ada's (from adafruit.com) links on how to solder if you need some additional help: http://www.ladyada.net/learn/soldering/thm.html If everything looks good, zip everything back up and meet back here for how to connect these wires to your Raspberry Pi. That will be it for the hardware part of this project. See, that wasn't so bad. Hinkmond

  • c# Network Programming - HTTPWebRequest Scraping

    - by masterguru
    Hi, I am building a web scraping application. It should scrape a complex web site with concurrent HttpWebRequests from a single host to a single target web server, and it will run on Windows Server 2008. A single HttpWebRequest can take from 1 to 4 minutes to complete (because of long-running DB operations on the target). I need at least 100 parallel requests to the target web server, but I have noticed that when I use more than 2-3 long-running requests I get big performance issues (request timeouts/hanging). How many concurrent requests can I have in this scenario from a single host to a single target web server? Can I use thread pools in the application to run parallel HttpWebRequests to the server? Will I have any issues with the default outbound HTTP connection/request limits? What about request timeouts when I reach outbound connection limits? What would be the best setup for my scenario? Any help would be appreciated. Thanks
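
    The 2-3 request ceiling is consistent with .NET's default per-host connection limit (ServicePointManager.DefaultConnectionLimit defaults to 2 for HTTP/1.1), which would need raising for this workload. As a language-neutral illustration of the fan-out itself, here is a minimal thread-pool sketch in Python - the URL list is hypothetical:

        from concurrent.futures import ThreadPoolExecutor
        import urllib.request

        urls = ['https://target.example/page/%d' % i for i in range(100)]

        def fetch(url):
            # generous timeout: each request may run 1-4 minutes server-side
            with urllib.request.urlopen(url, timeout=300) as resp:
                return resp.read()

        # one worker per in-flight request; the pool caps total concurrency
        with ThreadPoolExecutor(max_workers=100) as pool:
            pages = list(pool.map(fetch, urls))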

  • Obtaining IP addresses in Bittorrent

    - by Legend
    I am trying to get a list of IP addresses serving or downloading a file. What I did was contact a tracker like openbittorrent.com to get the following (as part of the scrape file): B%00%00%0C%5F%B1%B1l%CAGa%84S%CB%B0%9BG%84%3BE:0:1 Now, the long string at the beginning is the info hash. As a next step, I made this announce request: http://tracker.sometracker.com/announce?info_hash=B%00%00%0C%5F%B1%B1l%CAGa%84S%CB%B0%9BG%84%3BE So far so good - it gave me back a message containing this: d8:completei0e10:downloadedi0e10:incompletei2e8:intervali1931e12:min intervali965e5:peers12:U????????^@^@e Can someone tell me what I should be doing after this to get the IP addresses currently serving or downloading the file?
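
    The announce response is bencoded, and in the default compact format the peers value is a binary blob of 6 bytes per peer: a 4-byte IPv4 address followed by a 2-byte big-endian port (so the peers12: above holds exactly two peers). A minimal sketch that extracts the addresses without a full bencode parser, assuming the compact format and treating the response as raw bytes:

        import re
        import socket
        import struct

        def compact_peers(resp):
            # locate the compact 'peers' value: '5:peers' is followed by
            # '<length>:<that many raw bytes>'
            m = re.search(rb'5:peers(\d+):', resp)
            blob = resp[m.end():m.end() + int(m.group(1))]
            # each peer is 6 bytes: 4-byte IPv4 address + 2-byte big-endian port
            for i in range(0, len(blob), 6):
                ip = socket.inet_ntoa(blob[i:i + 4])
                port = struct.unpack('!H', blob[i + 4:i + 6])[0]
                yield ip, port

        # usage, given the raw tracker reply as bytes:
        #   for ip, port in compact_peers(tracker_response):
        #       print(ip, port)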

  • How to store Ruby method references in a database?

    - by Mad Wombat
    I am writing my first rails app. It needs to aggregate some data from multiple sites and for each site I have a unique way of getting the data (some provide RSS, some JSON, for some I scrape the HTML etc.). These will run on schedule, probably as a rake task from cron. It seems logical to store the sites and relevant information in a model, but I am not sure where to put unique data retrieval methods. Do I store method names in the model? Do I just name the methods the same as site name and call them that way? Basically, I need a way to read a list of sites and call appropriate method for each site. What is the Ruby on Rails way to do it?
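
    The dispatch itself is the straightforward part: store a method name per site in the model and look the method up by name at runtime - in Ruby that is Object#public_send (or send). A language-agnostic sketch of that dispatch, shown here in Python with hypothetical site data:

        class SiteScraper:
            def fetch_rss(self, site):
                print('fetching RSS for', site)

            def fetch_json(self, site):
                print('fetching JSON for', site)

            def run(self, sites):
                # sites: (name, method_name) pairs, as rows from the model might look
                for name, method_name in sites:
                    getattr(self, method_name)(name)   # dispatch by stored name

        SiteScraper().run([('example.com', 'fetch_rss'), ('other.org', 'fetch_json')])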

  • Creating a Stack Overflow notifier

    - by Trey
    I could not find a Stack Overflow notifier Android app, so I am planning on making one. I hope that my app will serve a similar purpose to the Stack Overflow Notifier Chrome extension. This will be my first Android app, so I am still unfamiliar with the platform. My main concern when creating this app is: what is the proper way to access the user's Recent Activity page? I thought of two different approaches, but I'm not sure how to implement either one: (1) make the user log in to Stack Overflow through the Browser application or an embedded browser and scrape their recent activity page occasionally for updates, or (2) ask the user for their username and password and forward this information to Stack Overflow for authentication, storing cookies somehow to keep the session active. I think Astrid uses something similar to the first approach, but I haven't been able to figure it out yet from skimming their code. What is the correct way to handle a notification application like this that requires session management?

  • Web scraping with Python

    - by Jack
    I'm currently trying to scrape a website that has fairly poorly-formatted HTML (often missing closing tags, no use of classes or ids, so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely) I run into a page where BeautifulSoup creates the HTML tree a bit differently from (for example) Firefox or WebKit. While this is understandable, since the formatting of the HTML leaves it ambiguous, if I were able to get the same parse tree as Firefox or WebKit produce I would be able to parse things much more easily. The problems are usually something like the site opening a <b> tag twice: when BeautifulSoup sees the second <b> tag, it immediately closes the first, while Firefox and WebKit nest the <b> tags. Is there a web scraping library for Python (or even any other language - I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit, or at least get closer than BeautifulSoup in cases of ambiguity?
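
    html5lib is the usual answer here: it implements the WHATWG HTML5 parsing algorithm - the same tree-construction and error-recovery rules Firefox and WebKit follow - and BeautifulSoup can use it as its tree builder. A small sketch using the current bs4 API (newer than the BeautifulSoup the question likely refers to):

        # pip install beautifulsoup4 html5lib
        from bs4 import BeautifulSoup

        snippet = '<p><b>one<b>two'   # ambiguous markup: neither <b> is closed

        # html5lib builds the tree with the browsers' own recovery rules
        print(BeautifulSoup(snippet, 'html5lib').prettify())

        # compare with Python's built-in parser on the same markup
        print(BeautifulSoup(snippet, 'html.parser').prettify())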

  • RegEx - Match optional groups

    - by Maurizio
    I know regex is not the best way to scrape HTML, but this is it... I have something like:

        <td>
        Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>
        Art: <a href="creator.php?c=LBo">Luciano Bottaro</a>
        </td>

    I need to match the Writing and Art parts, but they aren't guaranteed to be there, and there could be other parts like Ink and Pencils... How do I do this? I need to use pure regex, no additional Python libs... Thanks!
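
    One way to sidestep the optional-group problem entirely: instead of one pattern with an optional group per role, capture every (role, name) pair generically, so Ink, Pencils, and friends come along for free. A minimal sketch using only the built-in re module:

        import re

        html = '''<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>
        Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>'''

        # capture (role, name) pairs rather than hard-coding Writing/Art
        pairs = re.findall(r'(\w+):\s*<a href="[^"]*">([^<]+)</a>', html)
        print(dict(pairs))  # {'Writing': 'Carlo Chendi', 'Art': 'Luciano Bottaro'}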

  • I would like to build an app that alerts me when road traffic is high or low. Where can I get the r

    - by MedicineMan
    I'm sitting at work, waiting for traffic to die down. The thought occurred to me: I know when I want to go home, so why don't I have an app that watches traffic for me? I also know that there are a lot of smart people on Stack Overflow. Where can I get live traffic data for the San Francisco Bay Area region? The data source should be timely, accurate, and as high-resolution as possible. I would like to build an app on top of a service, rather than watch Google Maps or watch another website. I would prefer not to have to scrape the data, but I have been known to do this in the past when no other option exists.

  • Working with a Java Mail Server for Testing

    - by Charlie
    I'm in the process of testing an application that takes mail out of a mailbox, performs some action based on the content of that mail, and then sends a response mail depending on the result of the action. I'm looking for a way to write tests for this application. Ideally, I'd like these tests to bring up their own mail server, push my test emails to a folder on this mail server, and have my application scrape the mail out of the mail server that my test started. Configuring the application to use the mail server is not difficult, but I do not know where to look for a programmatic way of starting a mail server in Java. I've looked at JAMES, but I am unable to figure out how to start the server from within my test. So the question is this: what can I use for a mail server in Java that I can configure and start entirely within Java?
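
    In the Java world, an embeddable test mail server such as GreenMail (or JAMES run embedded) is the usual fit. For illustration of the start-a-server-inside-the-test pattern itself, here is a hedged sketch in Python using aiosmtpd - same shape, different language: the test boots a throwaway SMTP server, the code under test talks to it, and the test asserts on what arrived.

        # pip install aiosmtpd
        import smtplib
        from email.message import EmailMessage
        from aiosmtpd.controller import Controller

        class Inbox:
            def __init__(self):
                self.messages = []

            async def handle_DATA(self, server, session, envelope):
                self.messages.append(envelope.content)   # raw message bytes
                return '250 Message accepted for delivery'

        inbox = Inbox()
        controller = Controller(inbox, hostname='127.0.0.1', port=8025)
        controller.start()                    # server runs in a background thread
        try:
            msg = EmailMessage()
            msg['From'], msg['To'], msg['Subject'] = 'test@local', 'app@local', 'ping'
            msg.set_content('trigger the action under test')
            with smtplib.SMTP('127.0.0.1', 8025) as client:
                client.send_message(msg)
            assert len(inbox.messages) == 1   # the test owns the mailbox
        finally:
            controller.stop()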

  • Are there any free .NET OCR libraries that will perform OCR on an application window directly?

    - by Kelsey
    I am looking for a free .NET OCR library that will be able to do OCR on a given application window, or even an image in memory (I can take a snapshot of the application window myself). I have looked at tessnet2 and MODI, but both require an image located on disk. I need to use OCR because the application I am trying to write a script for does some wacky stuff that cannot be read using the Windows API, and I need to scrape data from the screen. I have tested both tessnet2 and MODI and they can mostly read the text, but because this has to run in an environment that cannot write to disk, I need it to be able to read directly from the application window or some type of memory stream. I am thinking OCR is my only solution, but there could be other methods that I am not thinking of. Suggestions?
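
    For comparison, here is what the whole capture-to-OCR round trip looks like without touching disk, sketched in Python with Pillow and pytesseract (wrappers around the same Tesseract engine that tessnet2 uses; the capture coordinates are hypothetical):

        # pip install pillow pytesseract  (plus a tesseract binary on the PATH)
        from PIL import ImageGrab
        import pytesseract

        # grab a screen region into an in-memory image
        # (left, top, right, bottom) -- hypothetical window coordinates
        shot = ImageGrab.grab(bbox=(0, 0, 800, 600))

        # OCR straight from the image object; nothing is written to disk
        print(pytesseract.image_to_string(shot))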

  • Problem with eastern european characters when scraping data from the European Parliaments Website

    - by Thomas Jensen
    Dear experts, I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name): ANDRIKIENĖ, Laima Liucija - Group of the European People's Party (Christian Democrats). So far I have been using pyparsing and the following code:

        from pyparsing import Word, Suppress, ZeroOrMore, alphanums, alphas8bit

        name = Word(alphanums + alphas8bit)
        begin, end = map(Suppress, "<>")
        names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

        for tokens in names.searchString(page):
            print(tokens)

    However, this does not catch the name from the example above. Any advice on how to proceed? Best, Thomas
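
    A likely culprit: pyparsing's alphas8bit only covers the Latin-1 accented range (U+00C0-U+00FF), and letters such as Ė (U+0116) fall outside it, so Word(alphanums + alphas8bit) stops short of the full name. A hedged sketch of one fix - a unicode-aware Regex token in place of Word - assuming the page has been decoded to unicode first:

        import re
        from pyparsing import Regex, Suppress, ZeroOrMore

        # \w with re.UNICODE matches letters from any script, not just Latin-1
        name = Regex(r'\w+', re.UNICODE)
        begin, end = map(Suppress, '<>')
        names = begin + ZeroOrMore(name) + ',' + ZeroOrMore(name) + end

        page = u'<ANDRIKIEN\u0116, Laima Liucija>'   # sample input
        for tokens in names.searchString(page):
            print(tokens)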

  • Python GUI Scraper hanging issues.

    - by bball
    I wrote a scraper using Python a while back, and it worked fine from the command line. I have now made a GUI for the application, but I am having trouble with one issue. When I attempt to update text inside the GUI (e.g. 'fetching URL 12/50'), I can't, seeing as the scraper function is busy grabbing 100+ links. Also, when going from one scraping function, to a function that should update the GUI, to another function, the GUI update seems to be skipped over while the next scrape function runs. An example would be:

        scrapeLinksA()              # takes 20 seconds
        updateInfo("LinksA done")
        scrapeLinksB()              # takes another 20 seconds

    In the above example, updateInfo is never executed unless I end the program with a KeyboardInterrupt. I'm thinking my solution is threading, but I'm not sure. What can I do to fix this? I am using PyQt4, urllib2, and BeautifulSoup.
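
    Threading is indeed the usual fix with PyQt: long-running scraping blocks the Qt event loop, so paint and update events never run. A minimal sketch moving the scraping into a QThread and reporting progress back through a signal (the scrape methods and widget are hypothetical stand-ins):

        from PyQt4 import QtCore

        class ScrapeWorker(QtCore.QThread):
            progress = QtCore.pyqtSignal(str)      # delivered on the GUI thread

            def run(self):
                self.scrapeLinksA()                # takes 20 seconds
                self.progress.emit('LinksA done')  # GUI repaints while we work
                self.scrapeLinksB()                # takes another 20 seconds
                self.progress.emit('LinksB done')

            def scrapeLinksA(self):
                pass                               # placeholder for the real scraping

            def scrapeLinksB(self):
                pass

        # in the GUI code:
        #   worker = ScrapeWorker()
        #   worker.progress.connect(status_label.setText)
        #   worker.start()                         # event loop stays responsive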
