information extraction - Page 31

Extract news links from news website

- by Ali

Is there a reliable method to find the set of links that lead to detail news pages? In other words, after visiting the first page of a website, I want only those links that refer to a news item. Any solution?

Read the article

Get the rendered text from HTML (Delphi)

- by Daisetsu

I have some HTML and I need to extract the actual written text from the page. So far I have tried rendering the page in a web browser control, then reading the text from its document property. This works, but only where the browser is supported (the IE COM object). The problem is that I want this to run under Wine as well, so I need a solution that doesn't use IE COM. There must be a reasonable programmatic way to do this.

Read the article

What is the best way to parse html in C#?

- by Luke

I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.

Read the article

Getting BeautifulSoup to find a specific <p>

- by Ryan

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html. I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't seem to figure out how to isolate them. I thought it would be something simple like

    from BeautifulSoup import BeautifulSoup
    import re
    import urllib2

    address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    abstract = soup.find('p', attrs={'class' : 'lead'})
    print abstract

Using Python 2.5, BeautifulSoup 3.0.8, running this returns 'None'. I have no option of using anything else that needs to be compiled/installed (like lxml). Is BeautifulSoup confused, or am I?

Read the article
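For the <p class="lead"> question above, one way to rule out the parser is to try the same lookup with only the standard library. The following is a minimal sketch in Python 3 (not the asker's Python 2.5 setup); the names LeadParagraphParser and find_lead_paragraphs are made up for illustration, and it assumes the abstract really is inside a <p> whose class list contains "lead" in the HTML that gets downloaded. Note that the question cites the .../abs/... URL while the code fetches .../full/..., so it may be worth checking that the fetched page contains that paragraph at all.

```python
from html.parser import HTMLParser


class LeadParagraphParser(HTMLParser):
    """Collect the text of every <p> whose class list contains "lead"."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while inside a matching <p>
        self._chunks = []      # text pieces of the current paragraph
        self.paragraphs = []   # finished paragraph texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "lead" in classes:
            self._inside = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self._inside = False
            self.paragraphs.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def find_lead_paragraphs(html):
    """Return the text of all <p class="lead"> elements in html."""
    parser = LeadParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample markup, not taken from the Nature page.
sample = '<div><p class="lead">First <b>bold</b> claim.</p><p>Other.</p></div>'
print(find_lead_paragraphs(sample))  # ['First bold claim.']
```

If this returns an empty list for the downloaded page, the paragraph is probably not in the markup being parsed, which would point at the URL or the page structure rather than at the parser.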

How do you parse HTML in VB.NET?

- by tooleb

I would like to know if there is a simple way to parse HTML in VB.NET. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way in VB.NET?

Read the article

Looking for an alternative to WebZinc .NET (screen scraping, web automation library for .NET)

- by gpow

I came across this .NET library: http://www.webzinc.com/online/faq.aspx. However, I was wondering if there is a free alternative out there?

Read the article

parsing HTML on the iPhone

- by Ben Alpert

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions?

Read the article

How to extract data from a PDF?

- by Fermin

Hi, my company receives data from an external company via Excel. We export this into SQL Server to run reports on the data. They are now changing to PDF format. Is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database? Would this require writing an app, or is there an automated way of doing this?

Read the article

How do I extract HTML content using Regex in PHP

- by gAMBOOKa

I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store a regex in the database for each website. I'm not sure how XML parsers would work with multiple websites; you'd need a separate function for each website. In any case, I don't know much about regexes, so bear with me. I've got an HTML page in a format similar to this:

    <html>
    <head>...</head>
    <body>
    <div class=nav>...</div><p id="someshit" />
    <div class=body>....</div>
    <div class=footer>...</div>
    </body>

I need to extract the contents of the body class container. I tried this:

    $pattern = "/<div class=\"body\">$.*?$<\/div>/sui"
    $text = $htmlPageAsIs;
    if (preg_match($pattern, $text, $matches))
        echo "MATCHED!";
    else
        echo "Sorry gambooka, but your text is in another castle.";

What am I doing wrong? My text ends up in another castle.

Read the article

Extracting Window Contents

- by user293392

I need to extract a window's content if it is text-based, or at least the file path associated with that window. To date, I have considered:

1. win32api
2. 3rd party libraries
3. wrapper classes

However, I am not satisfied with these solutions. Any ideas how this can be done in a clean way?

Read the article

How can I read from an std::istream (using operator>>)?

- by dehmann

How can I read from an std::istream using operator>>? I tried the following:

    void foo(const std::istream& in) {
        std::string tmp;
        while(in >> tmp) {
            std::cout << tmp;
        }
    }

But it gives an error:

    error: no match for 'operator>>' in 'in >> tmp'

Read the article

Extract part of a git repository?

- by Riobard

Assume my git repository has the following structure:

    /.git
    /Project
    /Project/SubProject-0
    /Project/SubProject-1
    /Project/SubProject-2

and the repository has quite a few commits. Now one of the subprojects (SubProject-0) has grown pretty big, and I want to take SubProject-0 out and set it up as a standalone project. Is it possible to extract all the commit history involving SubProject-0 from the parent git repository and move it to a new one?

Read the article

Access Adobe InDesign files

- by PeterMmm

I need some directions for the following problem: I have a lot of InDesign files and I have to set up a process that will track whether a certain paragraph or text block has changed between different versions of the file. If the text block has changed, I want to extract that text block in a "portable" format (HTML, PDF, TXT). Is there an Adobe product that would do that? Is there any public API to access an InDesign file? Is there a possibility to export InDesign to, say, HTML?

Read the article

Is there anything for Python that is like readability.js?

- by Emre Sevinç

Hi, I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js (http://lab.arc90.com/experiments/readability, http://lab.arc90.com/experiments/readability/js/readability.js), so that I can give it some input.html and get back a cleaned-up version of that HTML page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser). Any ideas? PS: I have tried Rhino + env.js, and that combination works, but the performance is unacceptable: it takes minutes to clean up most of the HTML content :( (I still couldn't find out why there is such a big performance difference).

Read the article

How to extract common / significant phrases from a series of text entries

- by arronsky

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review page on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try the hamburger" (in 44 reviews), e.g., the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/. I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.

Read the article
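As a starting point for the phrase question above, a simple baseline is to count word n-grams and report in how many entries each one appears, which mirrors the "(in 44 reviews)" format. This is only a sketch using the standard library, not how Yelp does it and not an NLTK recipe: it matches word for word (the asker would ideally like something looser), strips tags with a crude regex, and the function name and the sample reviews are invented for illustration.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")    # crude tag stripper, fine for a sketch
WORD_RE = re.compile(r"[a-z']+")   # lowercase words, keeping apostrophes


def phrases_by_entry_count(entries, n=3, top=5):
    """Return the `top` n-word phrases ranked by how many entries contain them."""
    counts = Counter()
    for entry in entries:
        words = WORD_RE.findall(TAG_RE.sub(" ", entry).lower())
        # A set, so a phrase repeated inside one entry is counted once.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(ngrams)
    return counts.most_common(top)


# Hypothetical entries standing in for rows from the database.
reviews = [
    "<p>Try the hamburger, it is great.</p>",
    "You should try the hamburger!",
    "<b>Try the hamburger</b> and the fries.",
]
print(phrases_by_entry_count(reviews, n=3, top=1))  # [('try the hamburger', 3)]
```

Filtering out phrases made mostly of stop words, or trying several values of n, would be natural next steps before reaching for heavier tooling.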

Extract strings in python

- by shadyabhi

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file:

    ...... (other text goes here).....
    <TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
    ..... (useless text here).....
    <TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
    ....(more text).....
    <TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
    <TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
    ......(more text).....

I want something like this: if I do

    data = foo("file.txt")

I get

    data = ['AAA','BBB','CCC','DDD']

What is the best possible way? My file is not big.

Read the article
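For the "AAA", "BBB" question above, a regular expression is probably enough as long as the markup is as regular as in the excerpt. The sketch below assumes every wanted string sits directly inside a <font class='textfont'> element with no nested tags; extract_textfont is a hypothetical helper name, and foo is kept only because the question uses it.

```python
import re

# Matches <font class='textfont'>...</font> with either quote style.
FONT_RE = re.compile(
    r"<font\s+class=['\"]textfont['\"]\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)


def extract_textfont(text):
    """Return the contents of every textfont <font> element, in order."""
    return [match.strip() for match in FONT_RE.findall(text)]


def foo(path):
    """Read a small file into memory and extract the strings from it."""
    with open(path, encoding="utf-8") as handle:
        return extract_textfont(handle.read())


sample = """
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""
print(extract_textfont(sample))  # ['AAA', 'BBB', 'CCC', 'DDD']
```

If the real file has nested tags or inconsistent attributes, an HTML parser would be the safer choice.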

Self-extracting Delphi program

- by Steve

I'm writing an updater program in Delphi 7 which will be run once, but needs many files to run. What I'd like to achieve:

1. The user runs the exe.
2. The exe unpacks the files and runs the updater.
3. If the updater detects an error, it prompts the user to send the log by e-mail.
4. After the updater has run, the temporary files are deleted (some of these files are DLLs used by the updater program, so the updater has to be closed before the files can be deleted).

Can anyone recommend a good solution? I've thought about using Inno Setup (too complicated for such an easy task) or using a self-extracting zip file (but how would the files be deleted afterwards?). Thanks!

Read the article

Python module for converting PDF to text

- by cnu

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces in between and was of no use.

Read the article

C# Use Linq to Extract a single XML attribute for each XML file in a directory.

- by guazz

Hi all, how do I use LINQ to extract a single XML attribute from each XML file in a directory and put that attribute in a C# list? Do I have to loop through each file one by one? The XML files are quite large, so I'd like to do this without loading the entire file into memory. Thanks, j

Read the article

A specific string format with a number and character together representing a certain item

- by sil3nt

Hello there, I have a string which looks like this: "a 3e,6s,1d,3g,22r,7c 3g,5r,9c 19.3". How do I go through it, extract the integers, and assign each to its corresponding letter variable? (I have integer variables d, r, e, g, s and c.) The first letter in the string represents a function; "3e,6s,1d,3g,22r,7c" and "3g,5r,9c" are two separate containers. The last decimal value represents a number which needs to be broken down into those variable numbers. My problem is extracting the integers with the letters after them and assigning them to their corresponding letters. Any number with a negative sign, or with a space between the number and the letter, is invalid. How on earth do I do this?

Read the article
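The format described in the question above can be parsed in two passes: split the line on whitespace, then validate each comma-separated item against a strict pattern. The sketch below is in Python rather than the asker's (unstated) language, and it rests on assumptions read off the single example: the first field is the function, the last is the decimal total, everything in between is a container, and only the letters d, r, e, g, s and c are allowed. parse_container and parse_line are hypothetical names.

```python
import re

ITEM_RE = re.compile(r"(\d+)([a-z])")   # digits immediately followed by one letter
LETTERS = set("dregsc")                 # the variables named in the question


def parse_container(text):
    """Turn '3g,5r,9c' into {'g': 3, 'r': 5, 'c': 9}, rejecting bad items."""
    counts = {}
    for item in text.split(","):
        match = ITEM_RE.fullmatch(item)
        # fullmatch rejects signs, stray spaces, and anything else unexpected
        if match is None or match.group(2) not in LETTERS:
            raise ValueError(f"invalid item: {item!r}")
        counts[match.group(2)] = int(match.group(1))
    return counts


def parse_line(line):
    """Split a line into (function, list of containers, total)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected a function, at least one container, and a total")
    function, *containers, total = fields
    return function, [parse_container(c) for c in containers], float(total)


print(parse_line("a 3e,6s,1d,3g,22r,7c 3g,5r,9c 19.3"))
```

Because the line is split on whitespace first, an input such as "3 e" ends up as two separate fields, neither of which matches the item pattern, so it is rejected along with negative numbers such as "-3e".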

Why is lua crashing after extracting zip files?

- by Brian T Hannan

I have the following code. It successfully extracts all the files and puts them in the right location, but it crashes every time it reaches the end of the function.

    require "zip"

    function ExtractZipAndCopyFiles(zipPath, zipFilename, destinationPath)
        local zfile, err = zip.open(zipPath .. zipFilename)

        -- iterate through each file inside the zip file
        for file in zfile:files() do
            local currFile, err = zfile:open(file.filename)
            local currFileContents = currFile:read("*a") -- read entire contents of current file
            local hBinaryOutput = io.open(destinationPath .. file.filename, "wb")

            -- write current file inside zip to a file outside zip
            if(hBinaryOutput)then
                hBinaryOutput:write(currFileContents)
                hBinaryOutput:close()
            end
        end

        zfile:close()
    end

    -- call the function
    ExtractZipAndCopyFiles("C:\\Users\\bhannan\\Desktop\\LUA\\", "example.zip", "C:\\Users\\bhannan\\Desktop\\ZipExtractionOutput\\")

Why does it crash every time it reaches the end?

Read the article

How to extract comment out of header file using python, perl, or sed?

- by WilliamKF

I have a header file like this:

    /*
     * APP 180-2 ALG-254/258/772 implementation
     * Last update: 03/01/2006
     * Issue date: 08/22/2004
     *
     * Copyright (C) 2006 Somebody's Name here
     * All rights reserved.
     *
     * Redistribution and use in source and binary forms, with or without
     * modification, are permitted provided that the following conditions
     * are met:
     * 1. Redistributions of source code must retain the above copyright
     * notice, this list of conditions and the following disclaimer.
     * 2. Redistributions in binary form must reproduce the above copyright
     * notice, this list of conditions and the following disclaimer in the
     * documentation and/or other materials provided with the distribution.
     * 3. Neither the name of the project nor the names of its contributors
     * may be used to endorse or promote products derived from this software
     * without specific prior written permission.
     *
     * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
     * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
     * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
     * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
     * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
     * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
     * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
     * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
     * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
     * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
     * SUCH DAMAGE.
     */
    #ifndef HEADER_H
    #define HEADER_H
    /* More comments and C++ code here. */
    #endif /* End of file. */

And I wish to extract the contents of the first C-style comment only and drop the " *" at the start of each line to get a file with the following contents:

    APP 180-2 ALG-254/258/772 implementation
    Last update: 03/01/2006
    Issue date: 08/22/2004

    Copyright (C) 2006 Somebody's Name here
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:
    1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
    3. Neither the name of the project nor the names of its contributors
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
    ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    SUCH DAMAGE.

Please suggest an easy way to do this with Python, Perl, sed, or some other way on Unix. Preferably as a one-liner.

Read the article
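The header-comment question above can be handled in Python with two regular expressions: one that grabs the first /* ... */ block non-greedily, and one that strips the leading " *" from each line. This is a sketch rather than a one-liner, and it assumes the first "/*" in the file really opens the comment of interest (a "/*" inside a string literal earlier in the file would fool it). The function name first_comment and the short sample header are made up for illustration.

```python
import re

FIRST_COMMENT_RE = re.compile(r"/\*(.*?)\*/", re.DOTALL)            # non-greedy: first comment only
LEADING_STAR_RE = re.compile(r"^[ \t]*\*[ \t]?", re.MULTILINE)      # the ' * ' line prefix


def first_comment(source):
    """Return the text of the first C-style comment without its ' *' prefixes."""
    match = FIRST_COMMENT_RE.search(source)
    if match is None:
        return ""
    body = LEADING_STAR_RE.sub("", match.group(1))
    return body.strip()


# Hypothetical short header standing in for the real file.
sample = """/*
 * Title line
 *
 * Second paragraph.
 */
#ifndef HEADER_H
#define HEADER_H
/* later comment */
#endif
"""
print(first_comment(sample))
```

To run it over a real file, reading the file into a string and passing it to first_comment is enough; the blank " *" lines become empty lines, so paragraph breaks in the comment are kept.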

What's the best way to write a maintainable web scraping app?

- by Benj

I wrote a perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it just using perl and curl and it was quite complicated and hard to maintain. After a few instances of my bank changing their webpage I got fed up of debugging it to keep it up to date. So what's the best way of writing such a program in such a way that it's easy to maintain? I'd like to write a nice well engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddle with their web site.

Read the article

Extracting text from PDF with Poppler (C++)

- by nico

I'm trying to get my way through Poppler and its (lack of) documentation. What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here. So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there just a very simple function that would output the PDF text in order (maybe line by line?). Thank you Nicola

Read the article

How to extract the lat/lng of pins in google maps?

- by BCS

I want to extract the latitude and longitude of a set of about 50-100 pins in a Google Maps web page. I don't control the page and I don't need to do it more than once, so I'm looking for something quick and dirty. I've got Firefox with Firebug as well as Chrome, and all I need is something that's easier than retyping all the numbers.

Search Results

Search found 24744 results on 990 pages for 'information extraction'.
