Search Results

Search found 7251 results on 291 pages for 'pdf parsing'.

Page 50/291 | < Previous Page | 46 47 48 49 50 51 52 53 54 55 56 57  | Next Page >

  • Does JAXP natively parse HTML?

    - by ikmac
    So, I whip up a quick test case in Java 7 to grab a couple of elements from random URIs, and see if the built-in parsing stuff will do what I need. Here's the basic setup (with exception handling etc omitted): DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance(); DocumentBuilder dbuild = dbfac.newDocumentBuilder(); Document doc = dbuild.parse("uri-goes-here"); With no error handler installed, the parse method throws exceptions on fatal parse errors. When getting the standard Apache 2.2 directory index page from a local server: a SAXParseException with the message White spaces are required between publicId and systemId. The doctype looks ok to me, whitespace and all. When getting a page off a Drupal 7 generated site, it never finishes. The parse method seems to hang. No exceptions thrown, never returns. When getting http://www.oracle.com, a SAXParseException with the message The element type "meta" must be terminated by the matching end-tag "</meta>". So it would appear that the default setup I've used here doesn't handle HTML, only strictly written XML. My question is: can JAXP be used out-of-the-box from openJDK 7 to parse HTML from the wild (without insane gesticulations), or am I better off looking for an HTML 5 parser? PS this is for something I may not open-source, so licensing is also an issue :(

    Read the article

  • Can the csv format be defined by a regex?

    - by Spencer Rathbun
    A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, and separator char. The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case. I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing. I am looking for something along the lines of: ... the csv format is a context free grammar and as such, it is impossible to parse with regex alone ... Or am I wrong? Is it possible to parse csv with just a POSIX regex? For example, if both the escape char and the quote char are ", then these two lines are valid csv: """this is a test.""","" "and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."

    Read the article

  • Sentence Tree v/s Words List

    - by Rohit Jose
    I was recently tasked with building a Name Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities. One approach that was suggested was to keep a lookup table for all the know connector words like articles and conjunctions, remove them from the words list after splitting the sentence on the basis of the spaces. This would leave out the Name Entities in the sentence. A lookup is then done for these identified entities on another lookup table that associates them to the entity type, for example if the sentence was: Remember the Titans was a movie directed by Boaz Yakin, the possible outputs would be: {Remember the Titans,Movie} was {a movie,Movie} directed by {Boaz Yakin,director} {Remember the Titans,Movie} was a movie directed by Boaz Yakin {Remember the Titans,Movie} was {a movie,Movie} directed by Boaz Yakin {Remember the Titans,Movie} was a movie directed by {Boaz Yakin,director} Remember the Titans was {a movie,Movie} directed by Boaz Yakin Remember the Titans was {a movie,Movie} directed by {Boaz Yakin,director} Remember the Titans was a movie directed by {Boaz Yakin,director} Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by {Boaz Yakin,director} Remember the {the titans,Movie,Sports Team} was a movie directed by Boaz Yakin Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by Boaz Yakin Remember the {the titans,Movie,Sports Team} was a movie directed by {Boaz Yakin,director} The entity lookup table here would contain the following data: Remember the Titans=Movie a movie=Movie Boaz Yakin=director the Titans=Movie the Titans=Sports Team Another alternative logic that was put forward was to build a crude sentence tree that would contain the connector words in the lookup table as parent nodes and do a lookup in the entity table for the leaf node that might contain the entities. The tree that was built for the sentence above would be: The question I am faced with is the benefits of the two approaches, should I be going for the tree approach to represent the sentence parsing, since it provides a more semantic structure? Is there a better approach I should be going for solving it?

    Read the article

  • Haskell: Best tools to validate textual input?

    - by Ana
    In Haskell, there are a few different options to "parsing text". I know of Alex & Happy, Parsec and Attoparsec. Probably some others. I'd like to put together a library where the user can input pieces of a URL (scheme e.g. HTTP, hostname, username, port, path, query, etc.) I'd like to validate the pieces according to the ABNF specified in RFC 3986. In other words, I'd like to put together a set of functions such as: validateScheme :: String -> Bool validateUsername :: String -> Bool validatePassword :: String -> Bool validateAuthority :: String -> Bool validatePath :: String -> Bool validateQuery :: String -> Bool What is the most appropriate tool to use to write these functions? Alex's regexps is very concise, but it's a tokenizer and doesn't straightforwardly allow you to parse using specific rules, so it's not quite what I'm looking for, but perhaps it can be wrangled into doing this easily. I've written Parsec code that does some of the above, but it looks very different from the original ABNF and unnecessarily long. So, there must be an easier and/or more appropriate way. Recommendations?

    Read the article

  • How exactly is an Abstract Syntax Tree created?

    - by Howcan
    I think I understand the goal of an AST, and I've build a couple of tree structures before, but never an AST. I'm mostly confused because the nodes are text and not number, so I can't think of a nice way to input a token/string as I'm parsing some code. For example, when I looked at diagrams of AST's, the variable and its value were leaf nodes to an equal sign. This makes perfect sense to me, but how would I go about implementing this? I guess I can do it case by case, so that when I stumble upon an "=" I use that as a node, and add the value parsed before the "=" as the leaf. It just seems wrong, because I'd probably have to make cases for tons and tons of things, depending on the syntax. And then I came upon another problem, how is the tree traversed? Do I go all the way down the height, and go back up a node when I hit the bottom, and do the same for it's neighbor? I've seen tons of diagrams on ASTs, but I couldn't find a fairly simple example of one in code, which would probably help.

    Read the article

  • Any good reason open files in text mode?

    - by Tinctorius
    (Almost-)POSIX-compliant operating systems and Windows are known to distinguish between 'binary mode' and 'text mode' file I/O. While the former mode doesn't transform any data between the actual file or stream and the application, the latter 'translates' the contents to some standard format in a platform-specific manner: line endings are transparently translated to '\n' in C, and some platforms (CP/M, DOS and Windows) cut off a file when a byte with value 0x1A is found. These transformations seem a little useless to me. People share files between computers with different operating systems. Text mode would cause some data to be handled differently across some platforms, so when this matters, one would probably use binary mode instead. As an example: while Windows uses the sequence CR LF to end a line in text mode, UNIX text mode will not treat CR as part of the line ending sequence. Applications would have to filter that noise themselves. Older Mac versions only use CR in text mode as line endings, so neither UNIX nor Windows would understand its files. If this matters, a portable application would probably implement the parsing by itself instead of using text mode. Implementing newline interpretation in the parser might also remove some overhead of using text mode, as buffers would need to be rewritten (and possibly resized) before returning to the application, while this may be less efficient than when it would happen in the application instead. So, my question is: is there any good reason to still rely on the host OS to translate line endings and file truncation?

    Read the article

  • How do I parse a header with two different version [ID3] avoiding code duplication?

    - by user66141
    I really hope you can give me some interesting viewpoints for my situation, because I am not satisfied with my current approach. I am writing an MP3 parser, starting with an ID3v2 parser. Right now I`m working on the extended header parsing, my issue is that the optional header is defined differently in version 2.3 and 2.4 of the tag. The 2.3 version optional header is defined as follows: struct ID3_3_EXTENDED_HEADER{ DWORD dwExtHeaderSize; //Extended header size (either 6 or 8 bytes , excluded) WORD wExtFlags; //Extended header flags DWORD dwSizeOfPadding; //Size of padding (size of the tag excluding the frames and headers) }; While the 2.4 version is defined : struct ID3_4_EXTENDED_HEADER{ DWORD dwExtHeaderSize; //Extended header size (synchsafe int) BYTE bNumberOfFlagBytes; //Number of flag bytes BYTE bFlags; //Flags }; How could I parse the header while minimizing code duplication? Using two different functions to parse each version sounds less great, using a single function with a different flow for each occasion is similar, any good practices for this kind of issues ? Any tips for avoiding code duplication? Any help would be appreciated.

    Read the article

  • Extracting useful information from free text

    - by insta
    We filter and analyse seats for events. Apparently writing a domain query language for the floor people isn't an option. I'm using C# 4.0 & .NET 4.0, and have relatively free reign to use whatever open-source tools are available. </background-info> If a request comes in for "FLOOR B", the sales people want it to show up if they've entered "FLOOR A-FLOOR F" in a filter. The only problem I have is that there's absolutely no structure to the parsed parameters. I get the string already concatenated (it actually uses a tilde instead of dash). Examples I've seen so far with matches after each: 101WC-199WC (needs to match 150WC) AAA-ZZZ (needs to match AAA, BBB, ABC but not BB) LOGE15-LOGE20 (needs to match LOGE15 but not LOGE150) At first I wanted to try just stripping off the numeric part of the lower and upper, and then incrementing through that. The problem I have is that only some entries have numbers, sometimes the numbers AND letters increment, sometimes its all letters that increment. Since I can't impose any kind of grammar to use (I really wanted [..] expansion syntax), I'm stuck using these entries. Are there any suggestions for how to approach this parsing problem?

    Read the article

  • Any good reason to open files in text mode?

    - by Tinctorius
    (Almost-)POSIX-compliant operating systems and Windows are known to distinguish between 'binary mode' and 'text mode' file I/O. While the former mode doesn't transform any data between the actual file or stream and the application, the latter 'translates' the contents to some standard format in a platform-specific manner: line endings are transparently translated to '\n' in C, and some platforms (CP/M, DOS and Windows) cut off a file when a byte with value 0x1A is found. These transformations seem a little useless to me. People share files between computers with different operating systems. Text mode would cause some data to be handled differently across some platforms, so when this matters, one would probably use binary mode instead. As an example: while Windows uses the sequence CR LF to end a line in text mode, UNIX text mode will not treat CR as part of the line ending sequence. Applications would have to filter that noise themselves. Older Mac versions only use CR in text mode as line endings, so neither UNIX nor Windows would understand its files. If this matters, a portable application would probably implement the parsing by itself instead of using text mode. Implementing newline interpretation in the parser might also remove some overhead of using text mode, as buffers would need to be rewritten (and possibly resized) before returning to the application, while this may be less efficient than when it would happen in the application instead. So, my question is: is there any good reason to still rely on the host OS to translate line endings and file truncation?

    Read the article

  • High-level strategy for distinguishing a regular string from invalid JSON (ie. JSON-like string detection)

    - by Jonline
    Disclaimer On Absence of Code: I have no code to post because I haven't started writing; was looking for more theoretical guidance as I doubt I'll have trouble coding it but am pretty befuddled on what approach(es) would yield best results. I'm not seeking any code, either, though; just direction. Dilemma I'm toying with adding a "magic method"-style feature to a UI I'm building for a client, and it would require intelligently detecting whether or not a string was meant to be JSON as against a simple string. I had considered these general ideas: Look for a sort of arbitrarily-determined acceptable ratio of the frequency of JSON-like syntax (ie. regex to find strings separated by colons; look for colons between curly-braces, etc.) to the number of quote-encapsulated strings + nulls, bools and ints/floats. But the smaller the data set, the more fickle this would get look for key identifiers like opening and closing curly braces... not sure if there even are more easy identifiers, and this doesn't appeal anyway because it's so prescriptive about the kinds of mistakes it could find try incrementally parsing chunks, as those between curly braces, and seeing what proportion of these fractional statements turn out to be valid JSON; this seems like it would suffer less than (1) from smaller datasets, but would probably be much more processing-intensive, and very susceptible to a missing or inverted brace Just curious if the computational folks or algorithm pros out there had any approaches in mind that my semantics-oriented brain might have missed. PS: It occurs to me that natural language processing, about which I am totally ignorant, might be a cool approach; but, if NLP is a good strategy here, it sort of doesn't matter because I have zero experience with it and don't have time to learn & then implement/ this feature isn't worth it to the client.

    Read the article

  • Reliable and fast way to convert a zillion ODT files in PDF?

    - by Marco Mariani
    I need to pre-produce a million or two PDF files from a simple template (a few pages and tables) with embedded fonts. Usually, I would stay low level in a case like this, and compose everything with a library like ReportLab, but I joined late in the project. Currently, I have a template.odt and use markers in the content.xml files to fill with data from a DB. I can smoothly create the ODT files, they always look rigth. For the ODT to PDF conversion, I'm using openoffice in server mode (and PyODConverter w/ named pipe), but it's not very reliable: in a batch of documents, there is eventually a point after which all the processed files are converted into garbage (wrong fonts and letters sprawled all over the page). Problem is not predictably reproducible (does not depend on the data), happens in OOo 2.3 and 3.2, in Ubuntu, XP, Server 2003 and Windows 7. My Heisenbug detector is ticking. I tried to reduce the size of batches and restarting OOo after each one; still, a small percentage of the documents are messed up. Of course I'll write about this on the Ooo mailing lists, but in the meanwhile, I have a delivery and lost too much time already. Where do I go? Completely avoid the ODT format and go for another template system. Suggestions? Anything that takes a few seconds to run is way too slow. OOo takes around a second and it sums to 15 days of processing time. I had to write a program for clustering the jobs over several clients. Keep the format but go for another tool/program for the conversion. Which one? There are many apps in the shareware or commercial repositories for windows, but trying each one is a daunting task. Some are too slow, some cannot be run in batch without buying it first, some cannot work from command line, etc. Open source tools tend not to reinvent the wheel and often depend on openoffice. Converting to an intermediate .DOC format could help to avoid the OOo bug, but it would double the processing time and complicate a task that is already too hairy. Try to produce the PDFs twice and compare them, discarding the whole batch if there's something wrong. Although the documents look equal, I know of no way to compare the binary content. Restart OOo after processing each document. it would take a lot more time to produce them it would lower the percentage of the wrong files, and make it very hard to identify them. Go for ReportLab and recreate the pages programmatically. This is the approach I'm going to try in a few minutes. Learn to properly format bulleted lists Thanks a lot.

    Read the article

  • Is Prince the best way to PDF in Ruby on Rails?

    - by Angela
    After several Google searches, it appears that the way to create PDF's in Rails from HTML and CSS (versus a new markup language) is to use Prince. With licensing at $3800 for my non-big-commercial app, I'm wondering if this is, in fact, consensus or people have an alternative they can share the what's and how's.

    Read the article

  • Can I have a user submit a form and receive a PDF copy?

    - by thatryan
    Not quite sure how to ask this :) But what I need is to have some forms online that people can fill out, like registration form etc. They will then submit it and be emailed it as well as me and have it stored in a database. Is there a way for the files to be PDFs? Can I have an editable PDF form online? Thank you.

    Read the article

  • Java API to create pdf with tables: any recommendations?

    - by jack
    I need to create a PDF containing some tables. When looking on google/stackoverflow the most frequent API seems to be iText but that's under the AGPL licence and thus not desirable for my purposes. I also frequently see apache pdfbox but that does not seem to have native support for tables (although a slightly hacky way was posted at Apache PDFBox Java library - Is there an API for creating tables? ) Does anyone have any recommendations?

    Read the article

< Previous Page | 46 47 48 49 50 51 52 53 54 55 56 57  | Next Page >