Search Results

Search found 7251 results on 291 pages for 'pdf parsing'.


  • Regex for capturing numbered text list

    - by crunchyt
    I have a test list that I am trying to capture data from using a regex. Here is a sample of the text format: (1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/ I have a Regex that currently captures this correctly, but I am having some difficulty with making it work under outlier conditions. Here is my regex /\(?\d\d?\)([^\)]+)(\/|\z)/ Unfortunately some of the data contains parentheses like this: (1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/ The substrings '(1998-1999)' and '(blah)' make it fail! Anyone care to have a crack at this one? Thank you :D
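
    One way around the inner parentheses is to stop each capture at the next numbered marker rather than at a closing parenthesis. A minimal sketch in Python (the question does not name a regex flavor, so treat the exact syntax as an assumption; the sample string is taken from the question):

        import re

        text = ("(1) this is a sample string (1998-1999) /(2) something strange (blah) "
                "/(3) another bit of text /(4) the last one/ something!/")

        # Capture the item number, then lazily take everything up to the next
        # "/(n)" marker or the end of the string; inner parentheses no longer matter.
        pattern = re.compile(r"\((\d{1,2})\)\s*(.*?)(?=/\(\d{1,2}\)|$)", re.S)

        for number, body in pattern.findall(text):
            print(number, body.strip().rstrip("/").strip())

    The lookahead plays the role of the original (\/|\z) alternative without consuming the delimiter.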

    Read the article

  • HTMLAgilitypack getting <P> and <STRONG> text

    - by StealthRT
    Hey all i am looking for a way to get this HTML code: <DIV class=channel_row><SPAN class=channel> <DIV class=logo><IMG src='/images/channel_logos/WGNAMER.png'></DIV> <P><STRONG>2</STRONG><BR>WGNAMER </P></SPAN> using the HtmlAgilityPack. I have been trying this: With channel info!Logo = .SelectSingleNode(".//img").Attributes("src").Value info!Channel = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(0).InnerText info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(2).InnerText End With I can get the Logo but it comes up with a blank string for the Channel and for the Station it says Index was out of range. Must be non-negative and less than the size of the collection. I've tried all types of combinations: info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(1).InnerText info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(3).InnerText info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(1).InnerText info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(2).InnerText info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(3).InnerText What do i need to do in order to correct this?
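
    One likely culprit is that the whitespace between tags is itself a text child node, so hard-coded ChildNodes indexes land on blanks; selecting the <STRONG> element directly (and the text following the <BR>) is more robust. Purely as an illustration of that selection, here is a rough sketch in Python with lxml rather than HtmlAgilityPack; the variable names are invented:

        from lxml import html

        snippet = """<DIV class=channel_row><SPAN class=channel>
          <DIV class=logo><IMG src='/images/channel_logos/WGNAMER.png'></DIV>
          <P><STRONG>2</STRONG><BR>WGNAMER </P></SPAN></DIV>"""

        span = html.fromstring(snippet).xpath("//span[@class='channel']")[0]
        logo = span.xpath(".//img/@src")[0]            # '/images/channel_logos/WGNAMER.png'
        channel = span.xpath(".//strong/text()")[0]    # '2'
        station = span.xpath(".//br")[0].tail.strip()  # 'WGNAMER', the text after the <BR>
        print(logo, channel, station)

    The analogous fix in HtmlAgilityPack would be to select the strong element by XPath instead of relying on positional ChildNodes indexes.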

    Read the article

  • scraping text from multiple html files into a single csv file

    - by Lulu
    I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need, but "misses" some of the data within the table. My input: e.g. file 1500.html. My code:

        #!/usr/bin/env python
        import glob
        import codecs
        from BeautifulSoup import BeautifulSoup

        with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
            for file in glob.glob('*html*'):
                print 'Processing', file
                soup = BeautifulSoup(open(file).read())
                rows = soup.findAll('tr')
                for tr in rows:
                    cols = tr.findAll('td')
                    #print >> csvfile, "#".join(col.string for col in cols)
                    #print >> csvfile, "#".join(td.find(text=True))
                    for col in cols:
                        print >> csvfile, col.string
                    print >> csvfile, "==="
                print >> csvfile, "***"

    Output: one CSV file with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some of it; e.g. the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in * and === separators, and I then use Perl to turn the result into a clean CSV file. Unfortunately, I'm not sure how to rework my code to get all the data I'm looking for!
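
    A common reason for cells coming out empty with this approach is that col.string returns None whenever a <td> contains nested tags (a <br>, a <span>, a link), which is typical of address cells. Joining all of the cell's text nodes and writing proper CSV rows avoids both that problem and the need for the === and *** separators. A sketch that stays with the Python 2 / BeautifulSoup 3 style of the original (the output file name and per-row layout are just placeholders):

        #!/usr/bin/env python
        import csv
        import glob
        from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

        with open('dump2.csv', 'wb') as out:
            writer = csv.writer(out)
            for name in sorted(glob.glob('*.html')):
                soup = BeautifulSoup(open(name).read())
                for tr in soup.findAll('tr'):
                    row = []
                    for td in tr.findAll('td'):
                        # Join every text node in the cell; td.string alone is None
                        # as soon as the cell contains any nested markup.
                        text = ' '.join(td.findAll(text=True)).strip()
                        row.append(text.encode('utf-8'))
                    if row:
                        writer.writerow([name] + row)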

    Read the article

  • Extracting URIs from RDF web page in Java using Jena Library

    - by Prannoy Mittal
    I have written the following code for extracting URIs from a web page with content type application/rdf+xml, for a Linked Data application. public static void test(String url) { try { Model read = ModelFactory.createDefaultModel().read(url); System.out.println("to go"); StmtIterator si; si = read.listStatements(); System.out.println("to go"); while(si.hasNext()) { Statement s=si.nextStatement(); Resource r=s.getSubject(); Property p=s.getPredicate(); RDFNode o=s.getObject(); System.out.println(r.getURI()); System.out.println(p.getURI()); System.out.println(o.asResource().getURI()); } } catch(JenaException | NoSuchElementException c) { } } But the above code is not extracting all URIs; it provides only a few of them. Please guide me on where I went wrong. For example, for this XML file: <?xml version="1.0"?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ex="http://example.org/stuff/1.0/"> <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar" dc:title="RDF/XML Syntax Specification (Revised)"> <ex:editor> <rdf:Description ex:fullName="Dave Beckett"> <ex:homePage rdf:resource="http://purl.org/net/dajobe/" /> </rdf:Description> </ex:editor> </rdf:Description> </rdf:RDF> the output is: Subject URI is http://www.w3.org/TR/rdf-syntax-grammar Predicate URI is http://example.org/stuff/1.0/editor Object URI is null Subject URI is http://www.w3.org/TR/rdf-syntax-grammar Predicate URI is http://purl.org/dc/elements/1.1/title Website is read

    Read the article

  • Scrape HTML tables from a given URL into CSV

    - by dreeves
    I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV. Potential additional features: To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be good place to start for writing the tool I have in mind. An option to asciify any unicode. An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table. Related questions: http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv http://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv http://stackoverflow.com/questions/2861/options-for-html-scraping
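
    For the simple cases this is close to a one-liner with pandas, whose read_html returns one DataFrame per table on the page (it needs lxml or html5lib installed underneath). A rough command-line sketch; the argument handling and summary format below are my own assumptions, not an existing tool:

        #!/usr/bin/env python
        """tablescrape URL [n] -- summarize the tables on a page, or dump table n as CSV."""
        import sys
        import pandas as pd

        def main():
            url = sys.argv[1]
            tables = pd.read_html(url)              # one DataFrame per HTML table
            if len(sys.argv) > 2 or len(tables) == 1:
                n = int(sys.argv[2]) if len(sys.argv) > 2 else 0
                tables[n].to_csv(sys.stdout, index=False)
            else:
                for i, t in enumerate(tables):      # header row and row count per table
                    print(i, list(t.columns), len(t), "rows")

        if __name__ == "__main__":
            main()

    Unicode normalization and regex clean-up passes could then be bolted on as post-processing of the DataFrame before to_csv.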

    Read the article

  • Library issue: How do I set up QtWebKit to parse HTML?

    - by user560106
    Nick Presta showed that you can parse HTML with qt here: Library Recommendation: C++ HTML Parser However, when I attempt to build this, I get an access violation on the "QWebFrame* frame = page.mainFrame();" line. What am I doing wrong? #include <QtWebKit\QWebElement> #include <QtWebKit\QWebView> #include <QtWebKit\QWebFrame> #include <QtWebKit\QWebPage> #include <iostream> int main() { QWebPage page; QWebFrame* frame = page.mainFrame(); frame->setHtml( "<html><head></head><body></body></html>" ); QWebElement document = frame->documentElement(); return 0; }

    Read the article

  • View .txt, .pdf files on iPhone

    - by Ekra
    Hi friends, I am getting the data of the file from the network and receiving it in NSData (not saving it anywhere). I want to view the files without saving them anywhere. I tried it with UIWebView but with no success: [webView loadData:data_ MIMEType:@"text" textEncodingName:@"UTF-8" baseURL:nil]; Any hint in the right direction would be highly appreciated.

    Read the article

  • Haskell Parsec Numeration

    - by Martin
    I'm using Text.ParserCombinators.Parsec and Text.XHtml to parse an input like this:

        - First type A\n
        -- First type B\n
        - Second type A\n
        -- First type B\n
        --Second type B\n

    And my output should be:

        <h1>1 First type A\n</h1>
        <h2>1.1 First type B\n</h2>
        <h1>2 Second type A\n</h1>
        <h2>2.1 First type B\n</h2>
        <h2>2.2 Second type B\n</h2>

    I have come to this part, but I cannot get any further:

        title1= do{ ;(count 1 (char '-'))
                    ;s <- many1 anyChar newline
                    ;return (h1 << s) }

        title2= do{ ;(count 2 (char '--'))
                    ;s <- many1 anyChar newline
                    ;return (h1 << s) }

        text=do { ;many (choice [try(title1),try(title2)]) }

        main :: IO ()
        main = do t <- ...
                  case ... of
                    Left err -> do putStr "Error: "
                                   print err
                    Right x -> putStrLn $ prettyHtml x

    This is OK, but it does not include the numbering. Any ideas? Thanks!

    Read the article

  • jquery selector logical AND?

    - by taber
    In jQuery I'm trying to select only mount nodes where a's and b's text values are 64 and "test" respectively. I'd also like to fall back to 32 if no 64 and "test" exist. What I'm seeing with the code below, though, is that the 32 mount is being returned instead of the 64. The XML:

        <thingses>
          <thing>
            <a>32</a> <!-- note, a here is 32 and not 64 -->
            <other>...</other>
            <mount>sample 1</mount>
            <b>test</b>
          </thing>
          <thing>
            <a>64</a>
            <other>...</other>
            <mount>sample 2</mount>
            <b>test</b>
          </thing>
          <thing>
            <a>64</a>
            <other>...</other>
            <mount>sample 3</mount>
            <b>unrelated</b>
          </thing>
          <thing>
            <a>128</a>
            <other>...</other>
            <mount>sample 4</mount>
            <b>unrelated</b>
          </thing>
        </thingses>

    And unfortunately I don't have control over the XML, as it comes from somewhere else. What I'm doing now is:

        var ret_val = '';
        $data.find('thingses thing').each(function(i, node) {
            var $node = $(node),
                found_node = $node.find('b:first:is(test), a:first:is(64)').end().find('mount:first').text();
            if(found_node) { ret_val = found_node; return; }
            found_node = $node.find('b:first:is(test), a:first:is(32)').end().find('mount:first').text();
            if(found_node) { ret_val = found_node; return; }
            ret_val = 'not found';
        });
        // expected result is "sample 2", but if sample 2's parent "thing" was missing, the result would be "sample 1"
        alert(ret_val);

    For my ":is" selector I'm using:

        if(jQuery){
            jQuery.expr[":"].is = function(obj, index, meta, stack){
                return (obj.textContent || obj.innerText || $(obj).text() || "").toLowerCase() == meta[3].toLowerCase();
            };
        }

    There has to be a better way than how I'm doing it. I wish I could replace the "," with "AND" or something. :) Any help would be much appreciated. Thanks!

    Read the article

  • JavaScript parser in JavaScript

    - by emk
    I need to add some lightweight syntactic sugar to JavaScript source code, and process it using a JavaScript-based build system. Are there any open source JavaScript parsers written in JavaScript? And are they reasonably fast when run on top of V8 or a similar high-performance JavaScript implementation? Thank you for any pointers you can provide!

    Read the article

  • What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?

    - by Jason Baker
    I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete. Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
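
    A small concrete illustration of the gap: a CST keeps every token the grammar needed (parentheses, keywords, punctuation), while an AST keeps only what later phases act on. Python's ast module, for example, builds an AST, and redundant parentheses survive only as tree shape; the snippet below is just to make that point, not part of the question:

        import ast

        # Two concretely different expressions, distinguished only by grouping.
        print(ast.dump(ast.parse("(1 + 2) * 3", mode="eval")))
        print(ast.dump(ast.parse("1 + 2 * 3", mode="eval")))
        # In the first dump, the Add node sits inside the Mult node; in the second
        # it is the other way around. The '(' and ')' tokens themselves are gone,
        # encoded only in the tree shape. A concrete syntax tree would still carry
        # the parentheses as leaf nodes.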

    Read the article

  • How do I get 3 lines of text from a paragraph

    - by Keltex
    I'm trying to create a "snippet" from a paragraph. I have a long paragraph of text with a word highlighted in the middle. I want to get the line containing the word, plus the line before it and the line after it. I have the following pieces of information: the text (in a string), the fact that lines are delimited by a newline character \n, and the index into the string of the text I want to highlight. A couple of other criteria:

        If my word falls on the first line of the paragraph, it should show the first 3 lines.
        If my word falls on the last line of the paragraph, it should show the last 3 lines.
        It should show the entire paragraph in the degenerate cases (the paragraph only has 1 or 2 lines).

    Here's an example:

        This is the 1st line of CAT text in the paragraph
        This is the 2nd line of BIRD text in the paragraph
        This is the 3rd line of MOUSE text in the paragraph
        This is the 4th line of DOG text in the paragraph
        This is the 5th line of RABBIT text in the paragraph

    For example, if my index points to BIRD, it should show lines 1, 2, and 3 as one complete string like this:

        This is the 1st line of CAT text in the paragraph
        This is the 2nd line of BIRD text in the paragraph
        This is the 3rd line of MOUSE text in the paragraph

    If my index points to DOG, it should show lines 3, 4, and 5 as one complete string like this:

        This is the 3rd line of MOUSE text in the paragraph
        This is the 4th line of DOG text in the paragraph
        This is the 5th line of RABBIT text in the paragraph

    etc. Anybody want to help tackle this?
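
    The mechanics are mostly index arithmetic: split on \n, find which line the character index falls on, then clamp a three-line window to the paragraph bounds. A small sketch of that logic, in Python only because the question does not name a language (the function name is made up):

        def snippet_around(text, index, window=3):
            lines = text.split('\n')
            # Find which line the character index falls on.
            pos, line_no = 0, 0
            for i, line in enumerate(lines):
                if pos <= index < pos + len(line) + 1:   # +1 accounts for the '\n'
                    line_no = i
                    break
                pos += len(line) + 1
            # Clamp a `window`-line span to the paragraph bounds; short paragraphs
            # (1 or 2 lines) simply come back whole.
            start = max(0, min(line_no - 1, len(lines) - window))
            return '\n'.join(lines[start:start + window])

    With the five-line example above, an index inside BIRD returns lines 1 through 3 and an index inside DOG returns lines 3 through 5.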

    Read the article

  • Boolean logic parser for SQL

    - by d03boy
    This is going to sound crazy, but does anyone have techniques that would allow me to parse boolean logic strings in SQL Server 2005 without extraordinary/ridiculous effort? Here is an example: (SOMEVAR=4 OR SOMEVAR=5) AND (NOT OTHERVAR=Y) I feel like recursion would help a lot if that were possible in SQL, but I'm not really sure how to go about that sort of thing. If not, maybe there's a way to attach an external system to do the recursion for me? Don't worry, I'm not getting my hopes up.
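
    The usual escape hatch here is exactly the "external system" mentioned: a small recursive-descent evaluator reached via SQLCLR or an outside script, since T-SQL itself has no convenient recursion over strings. Purely to illustrate the shape of that recursion (this is not a SQL Server feature, and the tokenizer below is a naive assumption about the input format):

        import re

        TOKEN = re.compile(r'\(|\)|AND|OR|NOT|[A-Z_]+=\w+')

        def evaluate(expr, variables):
            tokens = TOKEN.findall(expr.upper())
            pos = [0]

            def peek():
                return tokens[pos[0]] if pos[0] < len(tokens) else None

            def eat():
                tok = tokens[pos[0]]
                pos[0] += 1
                return tok

            def atom():              # '(' expr ')' | NOT atom | NAME=VALUE
                if peek() == '(':
                    eat(); value = or_expr(); eat(); return value
                if peek() == 'NOT':
                    eat(); return not atom()
                name, literal = eat().split('=', 1)
                return str(variables.get(name)) == literal

            def and_expr():
                value = atom()
                while peek() == 'AND':
                    eat(); value = atom() and value
                return value

            def or_expr():
                value = and_expr()
                while peek() == 'OR':
                    eat(); value = and_expr() or value
                return value

            return or_expr()

        print(evaluate("(SOMEVAR=4 OR SOMEVAR=5) AND (NOT OTHERVAR=Y)",
                       {"SOMEVAR": 5, "OTHERVAR": "N"}))   # True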

    Read the article

  • Extracting a URL in Python

    - by Kyle Hayes
    In regard to http://stackoverflow.com/questions/720113/find-hyperlinks-in-text-using-python-twitter-related: how can I extract just the URL so I can put it into a list/array? Edit: Let me clarify, I don't want to parse the URL into pieces. I want to extract the URL from the text of the string and put it into an array. Thanks!
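
    If the goal is just to pull every http(s) link out of the text into a list, re.findall with a deliberately loose pattern is usually enough; the pattern below is an assumption and will need tightening for edge cases such as trailing punctuation or bare domains:

        import re

        URL_RE = re.compile(r'https?://[^\s<>"]+')

        def extract_urls(text):
            """Return every URL found in `text`, in order, as a list of strings."""
            return URL_RE.findall(text)

        print(extract_urls("new post http://example.com/p/1 and https://example.org too"))
        # ['http://example.com/p/1', 'https://example.org']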

    Read the article

  • How to parse AMF data in Ruby?

    - by Matchu
    So I see that there are a few Rails plugins for serving AMF. However, is there a library that I can use in a Ruby environment to act as an AMF client: to read AMF data, and deserialize it into a Ruby object? If not, how could I best go about using tools built in other languages? I suppose I could write something in Python or Java or whatever, and call it from Ruby directly via backticks... but I'd first like to ensure that there isn't really any better option. Thanks!

    Read the article

  • Simplest way to add HTML as a String to a new Nokogiri HTML document body?

    - by viatropos
    I have a bunch of content from the body of one HTML file. How do I put that into the body of a new blank-slate HTML document using Nokogiri? Something like this, but with Nokogiri: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>Default Title</title> </head> <body class='default-class'> <%= yield :body %> </body> </html>

    Read the article

  • Element-Based XML Parsing

    - by demos
    I have an XML document which reads like this: <xml> <web:Web> <web:Total>4000</web:Total> <web:Offset>0</web:Offset> </web:Web> </xml> My question is: how do I access these elements using a library like BeautifulSoup in Python? Something like xmlDom.web["Web"].Total does not work.
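
    Part of the trouble is that the web: prefix is never declared, so strict XML parsers tend to reject the document outright, and attribute-style access fails because the tag names contain a colon. Looking the tags up by their full string name with a lenient parser works; a small sketch using the current bs4 package (the lowercased names reflect how the html.parser backend normalizes tag names, which is an assumption worth verifying on your data):

        from bs4 import BeautifulSoup

        xml = """<xml>
          <web:Web>
            <web:Total>4000</web:Total>
            <web:Offset>0</web:Offset>
          </web:Web>
        </xml>"""

        soup = BeautifulSoup(xml, 'html.parser')
        print(soup.find('web:total').string)    # 4000
        print(soup.find('web:offset').string)   # 0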

    Read the article

  • Vote on Pros and Cons of Java HTML to XML cleaners

    - by George Bailey
    I am looking to allow HTML emails (and other HTML uploads) without letting in scripts and stuff. I plan to have a white list of safe tags and attributes as well as a whitelist of CSS tags and value regexes (to prevent automatic return receipt). I asked a question: Parse a badly formatted XML document (like an HTML file) I found there are many many ways to do this. Some systems have built in sanitizers (which I don't care so much about). This page is a very nice listing page but I get kinda lost http://java-source.net/open-source/html-parsers It is very important that the parsers never throw an exception. There should always be best guess results to the parse/clean. It is also very important that the result is valid XML that can be traversed in Java. I posted some product information and said Community Wiki. Please post any other product suggestions you like and say Community Wiki so they can be voted on. Also any comments or wiki edits on what part of a certain product is better and what is not would be greatly appreciated. (for example,, speed vs accuracy..) It seems that we will go with either jsoup (seems more active and up to date) or TagSoup (compatible with JDK4 and been around awhile). A +1 for any of these products would be if they could convert all style sheets into inline style on the elements.

    Read the article

  • Extract / Parse Tags from Mixed Content String

    - by Andreas
    Hello, I want to parse tags from a mixed-content string. The string goes like this: "<PERSON>yasir arafat</PERSON> , the president of the <LOCATION>palestinian authority</LOCATION> , on the defensive , mr . sharon believes , a government official" I only want to use JAXP. Does anybody have an idea for this? Maybe there is an easy way with regular expressions, but I need the element names as well. Best regards, Andreas

    Read the article

  • Java .split() Method To Split XML Parameters

    - by Buzz Lightyear
    I have this line from an XML document: <?xml version="1.0" encoding="UTF-8"?> <svg xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" contentScriptType="text/ecmascript" width="1024" zoomAndPan="magnify" contentStyleType="text/css" viewBox="0 0 1024 768" height="768" preserveAspectRatio="xMidYMid meet" version="1.0"> I want to be able to split it up using the split method. For example, I want to save each parameter into a String array, so I'd like contentScriptType="text/ecmascript" width="1024" zoomAndPan="magnify" contentStyleType="text/css" viewBox="0 0 1024 768" height="768" and so on to be saved into a string array. Is there any way to do this using the split method, or can anybody suggest an easier, more efficient way to do this?
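
    A plain split on spaces will break values such as viewBox="0 0 1024 768", so matching name="value" pairs with a regex is a safer starting point; the same pattern carries over to Java's Pattern and Matcher. Sketched in Python here only for brevity (the pattern is an approximation and ignores single-quoted or unquoted attributes):

        import re

        line = ('<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="1024" '
                'zoomAndPan="magnify" viewBox="0 0 1024 768" height="768" version="1.0">')

        # One match per attribute; quoted values may contain spaces.
        pairs = re.findall(r'([\w:-]+)="([^"]*)"', line)
        attrs = ['%s="%s"' % (name, value) for name, value in pairs]
        print(attrs)
        # ['xmlns:xlink="http://www.w3.org/1999/xlink"', 'width="1024"', ...]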

    Read the article

  • How do I create Document Fragments with Nokogiri?

    - by viatropos
    I have an HTML document like this: <div class="something"> <textarea name="another"/> <div class="nested"> <label>Nested Label</label> <input name="nested_input"/> </div> </div> I have gone through and modified some of the HTML tree by building it into a Nokogiri::HTML::Document like so: html = Nokogiri::HTML(IO.read("test.html")) html.children.each do ... Now I want to be able to extract the nested part into a document so I can apply a stylesheet to it, or so I can manipulate it as if it were a Rails partial. Something like this: fragment = Nokogiri::HTML(html.xpath("//div[@class='nested']").first) Is there a way to do that, such that when I output it, it doesn't wrap it in <html> tags and turn it into an HTML document? I just want HTML, no document. Is this possible?

    Read the article

  • Custom whiteSpace using Haskell Parsec

    - by fryguybob
    I would like to use Parsec's makeTokenParser to build my parser, but I want to use my own definition of whiteSpace. Doing the following replaces whiteSpace with my definition, but all the lexeme parsers still use the old definition (e.g. P.identifier lexer will use the old whiteSpace). ... lexer :: P.TokenParser () lexer = l { P.whiteSpace = myWhiteSpace } where l = P.makeTokenParser myLanguageDef ... Looking at the code for makeTokenParser I think I understand why it works this way. I want to know if there are any workarounds to avoid completely duplicating the code for makeTokenParser?

    Read the article

  • Android XML Parser isn't working

    - by Bram
    I am writing an android application with a XML parser. I have a parser that used to work but when I run it it isnt doing anything. This is my class: import java.net.URL; import javax.xml.parsers.DocumentBuilder; import javax.xml.parsers.DocumentBuilderFactory; import org.w3c.dom.Document; import org.w3c.dom.Element; import org.w3c.dom.Node; import org.w3c.dom.NodeList; import org.xml.sax.InputSource; import android.app.Activity; import android.os.Bundle; import android.widget.LinearLayout; import android.widget.TextView; import android.widget.Toast; public class XMLParsingUsingDomeActivity extends Activity { @Override public void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); LinearLayout layout = new LinearLayout(this); layout.setOrientation(1); TextView ID[]; TextView vraag[]; TextView category[]; TextView a1[]; TextView p1[]; TextView a2[]; TextView p2[]; TextView a3[]; TextView p3[]; try { URL url = new URL( "http://128.140.217.126/vragen.xml"); DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); DocumentBuilder dbu= dbf.newDocumentBuilder(); Document doc = dbu.parse(new InputSource(url.openStream())); doc.getDocumentElement().normalize(); NodeList nodeList = doc.getElementsByTagName("item"); ID = new TextView[nodeList.getLength()]; vraag = new TextView[nodeList.getLength()]; category = new TextView[nodeList.getLength()]; a1 = new TextView[nodeList.getLength()]; p1 = new TextView[nodeList.getLength()]; a2 = new TextView[nodeList.getLength()]; p2 = new TextView[nodeList.getLength()]; a3 = new TextView[nodeList.getLength()]; p3 = new TextView[nodeList.getLength()]; for (int i = 0; i < nodeList.getLength(); i++) { Node node = nodeList.item(i); ID[i] = new TextView(this); vraag[i] = new TextView(this); category[i] = new TextView(this); a1[i] = new TextView(this); p1[i] = new TextView(this); a2[i] = new TextView(this); p2[i] = new TextView(this); a3[i] = new TextView(this); p3[i] = new TextView(this); Element fstElmnt = (Element) node; NodeList nameList = fstElmnt.getElementsByTagName("ID"); Element nameElement = (Element) nameList.item(0); nameList = nameElement.getChildNodes(); ID[i].setText(((Node) nameList.item(0)).getNodeValue()); NodeList vraagList = fstElmnt.getElementsByTagName("vraag"); Element vraagElement = (Element) vraagList.item(0); vraagList = vraagElement.getChildNodes(); vraag[i].setText(((Node) vraagList.item(0)).getNodeValue()); NodeList a1List = fstElmnt.getElementsByTagName("a1"); Element a1Element = (Element) a1List.item(0); a1List = a1Element.getChildNodes(); a1[i].setText(((Node) a1List.item(0)).getNodeValue()); NodeList p1List = fstElmnt.getElementsByTagName("p1"); Element p1Element = (Element) p1List.item(0); p1List = p1Element.getChildNodes(); p1[i].setText(((Node) p1List.item(0)).getNodeValue()); NodeList a2List = fstElmnt.getElementsByTagName("a2"); Element a2Element = (Element) a2List.item(0); a2List = a2Element.getChildNodes(); a2[i].setText(((Node) a2List.item(0)).getNodeValue()); NodeList p2List = fstElmnt.getElementsByTagName("p2"); Element p2Element = (Element) p2List.item(0); p2List = p2Element.getChildNodes(); p2[i].setText(((Node) p2List.item(0)).getNodeValue()); NodeList a3List = fstElmnt.getElementsByTagName("a3"); Element a3Element = (Element) a3List.item(0); a3List = a3Element.getChildNodes(); a3[i].setText(((Node) a3List.item(0)).getNodeValue()); NodeList p3List = fstElmnt.getElementsByTagName("p3"); Element p3Element = (Element) p3List.item(0); p3List = p3Element.getChildNodes(); p3[i].setText(((Node) 
p3List.item(0)).getNodeValue()); layout.addView(category[i]); Toast.makeText(this, "ID: " + i + "\n" + "Vraag: " + ((Node) vraagList.item(0)).getNodeValue() + "\n" + "A1: " + ((Node) a1List.item(0)).getNodeValue() + "\n" + "P2: " + ((Node) p1List.item(0)).getNodeValue() + "\n" + "A2: " + ((Node) a2List.item(0)).getNodeValue() + "\n" + "P2: " + ((Node) p2List.item(0)).getNodeValue() + "\n" + "A3: " + ((Node) a3List.item(0)).getNodeValue() + "\n" + "P3: " + ((Node) p3List.item(0)).getNodeValue(), Toast.LENGTH_LONG).show(); } } catch (Exception e) { System.out.println("XML Pasing Excpetion = " + e); } /** Set the layout view to display */ setContentView(layout); } } And my manifest: <?xml version="1.0" encoding="utf-8"?> <manifest xmlns:android="http://schemas.android.com/apk/res/android" package="your.pace.namace" android:versionCode="1" android:versionName="1.0" > <uses-sdk android:minSdkVersion="10" /> <uses-permission android:name="android.permission.INTERNET"></uses-permission> <application android:icon="@drawable/ic_launcher" android:label="@string/app_name" > <activity android:name=".XMLParsingUsingDomeActivity" android:label="@string/app_name" > <intent-filter> <action android:name="android.intent.action.MAIN" /> <category android:name="android.intent.category.LAUNCHER" /> </intent-filter> </activity> </application> </manifest> And the logcat output is worthless. I didnt change the code but its just not working anymore.

    Read the article
