Search Results

Search found 13659 results on 547 pages for 'email parsing'.

Page 136/547 | < Previous Page | 132 133 134 135 136 137 138 139 140 141 142 143  | Next Page >

  • How to read XML using XPath in Java

    - by kaibuki
    Hi guys, I want to read XML data using XPath in Java, so for the information I have gathered I am not able to parse XML according to my requirement. here is what I want to do: Get XML file from online via its URL, then use XPath to parse it, I want to create two methods in it. One is in which I enter a specific node attribute id, and I get all the child nodes as result, and second is suppose I just want to get a specific child node value only <?xml version="1.0"?><howto> <topic name="Java"> <url>http://www.rgagnonjavahowto.htm</url> <car>taxi</car> </topic> <topic ame="PowerBuilder"> <url>http://www.rgagnon/pbhowto.htm</url> <url>http://www.rgagnon/pbhowtonew.htm</url> </topic> <topic name="Javascript"> <url>http://www.rgagnon/jshowto.htm</url> </topic> <topic name="VBScript"> <url>http://www.rgagnon/vbshowto.htm</url> </topic></howto> In above example I want to read all the elements if I search via @name and also one function in which I just want the url from @name 'Javascript' only return one node element. I hope I cleared my question :) Thanks. Kai

    Read the article

  • `strip`ing the results of a split in python

    - by Igor
    i'm trying to do something pretty simple: line = "name : bob" k, v = line.lower().split(':') k = k.strip() v = v.strip() is there a way to combine this into one line somehow? i found myself writing this over and over again when making parsers, and sometimes this involves way more than just two variables. i know i can use regexp, but this is simple enough to not really have to require it...

    Read the article

  • Unable to Parse Date using NSDateFormatter

    - by Ansari
    Hi, I am fetching a RSS, in which i receive the following Date stamp: 2010-05-10T06:11:14.000Z Now i am using NSDateFormatter to parse this datetime stamp. [parseFormatter setDateFormat:@"yyyy-MM-dTH:m:s.z"]; But its not working fine if just remove the time stamp part it works for the date [parseFormatter setDateFormat:@"yyyy-MM-d"]; But if i add the rest of the stuff it returns nil. Any idea ? Thanks in Advance....

    Read the article

  • Send email to the planet Jupiter

    - by Chacha102
    My boss has a really strange request. He wants to start marketing to the residents of the planet Jupiter. However, I'm not sure the IP addresses I should send the email to. What are some existing tools that allow me to send emails to other planets?

    Read the article

  • Email notification when an error occurs

    - by user279521
    I need to design a bug alert system, where the web support team is notified via email when a user of our website encounters an error of any sort (database exception, or a 404) What would be the best way to design this section of the project? Any ideas would be appreciated.

    Read the article

  • Lexing partial SQL in C#

    - by Chris T
    I'd need to parse partial SQL queries (it's for a SQL injection auditing tool). For example '1' AND 1=1-- Should break down into tokens like [0] => [SQL_STRING, '1'] [1] => [SQL_AND] [2] => [SQL_INT, 1] [3] => [SQL_AND] [4] => [SQL_INT, 1] [5] => [SQL_COMMENT] [6] => [SQL_QUERY_END] Are their any at least lexers for SQL that I base mine off of or any good tools like bison for C# (though I'd rather not write my own grammar as I need to support most if not all the grammar of MySQL 5)

    Read the article

  • Why the double.Parse throw error in live server and how to track?

    - by Kovu
    Hi, I build a website, that: reads data from a website by HttpWebRequest Sort all Data Parse values of the data and give out newly On local server it works perfect, but when I push it to my live server, the double.Parse fails with an error. So: - how to track what the double.parse is trying to parse? - how to debug live server? Lang is ASP.Net / C#.net 2.0

    Read the article

  • Code editor with autocomplete

    - by Andrey
    I need to create a code editor for my own simple language: className.MethodName(parameterName = 2, ... ) I've created the appropriate grammar and autogenerate parser using ANTLR tool. Now I would like to have an autocomplete for class, method, variables and parameter names. This list should be context dependent, f.e. for "class." it should display methods and for "class.Method(" - parameters. I was going to parse the text and display the list depending on in which node the cursor is. The problem is that for incomplete code like "aaa.bbb(" the parser produces an error instead of a syntax tree. Any idea how to solve this problem? Maybe I'm on the wrong way and I shouldn't parse code to display autocomplete?

    Read the article

  • In Exim, is RBL spam rejected prior to being scanned by SpamAssassin?

    - by user955664
    I've recently been battling spam issues on our mail server. One account in particular was getting hammered with incoming spam. SpamAssassin's memory use is one of our concerns. What I've done is enable RBLs in Exim. I now see many rejection notices in the Exim log based on the various RBLs, which is good. However, when I run Eximstats, the numbers seem to be the same as they were prior to the enabling of the RBLs. I am assuming because the email is still logged in some way prior to the rejection. Is that what's happening, or am I missing something else? Does anyone know if these emails are rejected prior to being processed by SpamAssassin? Or does anyone know how I'd be able to find out? Is there a standard way to generate SpamAssassin stats, similar to Eximstats, so that I could compare the numbers? Thank you for your time and any advice. Edit: Here is the ACL section of my Exim configuration file ###################################################################### # ACLs # ###################################################################### begin acl # ACL that is used after the RCPT command check_recipient: # to block certain wellknown exploits, Deny for local domains if # local parts begin with a dot or contain @ % ! / | deny domains = +local_domains local_parts = ^[.] : ^.*[@%!/|] # to restrict port 587 to authenticated users only # see also daemon_smtp_ports above accept hosts = +auth_relay_hosts condition = ${if eq {$interface_port}{587} {yes}{no}} endpass message = relay not permitted, authentication required authenticated = * # allow local users to send outgoing messages using slashes # and vertical bars in their local parts. # Block outgoing local parts that begin with a dot, slash, or vertical # bar but allows them within the local part. # The sequence \..\ is barred. The usage of @ % and ! is barred as # before. The motivation is to prevent your users (or their virii) # from mounting certain kinds of attacks on remote sites. deny domains = !+local_domains local_parts = ^[./|] : ^.*[@%!] : ^.*/\\.\\./ # local source whitelist # accept if the source is local SMTP (i.e. not over TCP/IP). # Test for this by testing for an empty sending host field. accept hosts = : # sender domains whitelist # accept if sender domain is in whitelist accept sender_domains = +whitelist_domains # sender hosts whitelist # accept if sender host is in whitelist accept hosts = +whitelist_hosts accept hosts = +whitelist_hosts_ip # envelope senders whitelist # accept if envelope sender is in whitelist accept senders = +whitelist_senders # accept mail to postmaster in any local domain, regardless of source accept local_parts = postmaster domains = +local_domains # accept mail to abuse in any local domain, regardless of source accept local_parts = abuse domains = +local_domains # accept mail to hostmaster in any local domain, regardless of source accept local_parts = hostmaster domains =+local_domains # OPTIONAL MODIFICATIONS: # If the page you're using to notify senders of blocked email of how # to get their address unblocked will use a web form to send you email so # you'll know to unblock those senders, then you may leave these lines # commented out. However, if you'll be telling your senders of blocked # email to send an email to [email protected], then you should # replace "errors" with the left side of the email address you'll be # using, and "example.com" with the right side of the email address and # then uncomment the second two lines, leaving the first one commented. # Doing this will mean anyone can send email to this specific address, # even if they're at a blocked domain, and even if your domain is using # blocklists. # accept mail to [email protected], regardless of source # accept local_parts = errors # domains = example.com # deny so-called "legal" spammers" deny message = Email blocked by LBL - to unblock see http://www.example.com/ # only for domains that do want to be tested against RBLs domains = +use_rbl_domains sender_domains = +blacklist_domains # deny using hostname in bad_sender_hosts blacklist deny message = Email blocked by BSHL - to unblock see http://www.example.com/ # only for domains that do want to be tested against RBLs domains = +use_rbl_domains hosts = +bad_sender_hosts # deny using IP in bad_sender_hosts blacklist deny message = Email blocked by BSHL - to unblock see http://www.example.com/ # only for domains that do want to be tested against RBLs domains = +use_rbl_domains hosts = +bad_sender_hosts_ip # deny using email address in blacklist_senders deny message = Email blocked by BSAL - to unblock see http://www.example.com/ domains = +use_rbl_domains senders = +blacklist_senders # By default we do NOT require sender verification. # Sender verification denies unless sender address can be verified: # If you want to require sender verification, i.e., that the sending # address is routable and mail can be delivered to it, then # uncomment the next line. If you do not want to require sender # verification, leave the line commented out #require verify = sender # deny using .spamhaus deny message = Email blocked by SPAMHAUS - to unblock see http://www.example.com/ # only for domains that do want to be tested against RBLs domains = +use_rbl_domains dnslists = sbl.spamhaus.org # deny using ordb # deny message = Email blocked by ORDB - to unblock see http://www.example.com/ # # only for domains that do want to be tested against RBLs # domains = +use_rbl_domains # dnslists = relays.ordb.org # deny using sorbs smtp list deny message = Email blocked by SORBS - to unblock see http://www.example.com/ # only for domains that do want to be tested against RBLs domains = +use_rbl_domains dnslists = dnsbl.sorbs.net=127.0.0.5 # Next deny stuff from more "fuzzy" blacklists # but do bypass all checking for whitelisted host names # and for authenticated users # deny using spamcop deny message = Email blocked by SPAMCOP - to unblock see http://www.example.com/ hosts = !+relay_hosts domains = +use_rbl_domains !authenticated = * dnslists = bl.spamcop.net # deny using njabl deny message = Email blocked by NJABL - to unblock see http://www.example.com/ hosts = !+relay_hosts domains = +use_rbl_domains !authenticated = * dnslists = dnsbl.njabl.org # deny using cbl deny message = Email blocked by CBL - to unblock see http://www.example.com/ hosts = !+relay_hosts domains = +use_rbl_domains !authenticated = * dnslists = cbl.abuseat.org # deny using all other sorbs ip-based blocklist besides smtp list deny message = Email blocked by SORBS - to unblock see http://www.example.com/ hosts = !+relay_hosts domains = +use_rbl_domains !authenticated = * dnslists = dnsbl.sorbs.net!=127.0.0.6 # deny using sorbs name based list deny message = Email blocked by SORBS - to unblock see http://www.example.com/ domains =+use_rbl_domains # rhsbl list is name based dnslists = rhsbl.sorbs.net/$sender_address_domain # accept if address is in a local domain as long as recipient can be verified accept domains = +local_domains endpass message = "Unknown User" verify = recipient # accept if address is in a domain for which we relay as long as recipient # can be verified accept domains = +relay_domains endpass verify=recipient # accept if message comes for a host for which we are an outgoing relay # recipient verification is omitted because many MUA clients don't cope # well with SMTP error responses. If you are actually relaying from MTAs # then you should probably add recipient verify here accept hosts = +relay_hosts accept hosts = +auth_relay_hosts endpass message = authentication required authenticated = * deny message = relay not permitted # default at end of acl causes a "deny", but line below will give # an explicit error message: deny message = relay not permitted # ACL that is used after the DATA command check_message: accept

    Read the article

  • translating play in HTML to python

    - by aharon
    So, I'd like to represent one of Shakespeare's plays, Hamlet, into the following objects (maybe this isn't the best representation, if so please tell me): class Play(): acts = [] ... def add_act(self, act): acts.append(act) class Act(): scenes = [] ... def add_scene(self, scene): scenes.append(scene) class Scene(): elems = [] def __init__(self, title, setting=""): ... def add_elem(self, elem): elems.append(elem) ... class StageDirection(): # elem def __init__(self, text): ... class Line(): # elem def __init__(self, id, text, character = None): ... # A None character represents a continuation from the previous line # id could be, for example, 1.1.1 There are other methods, of course, for printing and such in each of the classes. The question is, how do I get a structure based on these classes (or something like them) from HTML 4 code that looks like this: <H3>ACT I</h3> <h3>SCENE I. Elsinore. A platform before the castle.</h3> <p><blockquote> <i>FRANCISCO at his post. Enter to him BERNARDO</i> </blockquote> <A NAME=speech1><b>BERNARDO</b></a> <blockquote> <A NAME=1.1.1>Who's there?</A><br> </blockquote> <A NAME=speech2><b>FRANCISCO</b></a> <blockquote> <A NAME=1.1.2>Nay, answer me: stand, and unfold yourself.</A><br> </blockquote> <A NAME=speech3><b>BERNARDO</b></a> <blockquote> <A NAME=1.1.3>Long live the king!</A><br> </blockquote> <A NAME=speech4><b>FRANCISCO</b></a> <blockquote> <A NAME=1.1.4>Bernardo?</A><br> </blockquote> <A NAME=speech5><b>BERNARDO</b></a> <blockquote> <A NAME=1.1.5>He.</A><br> </blockquote> <!-- for more, see the source of shakespeare.mit.edu/hamlet/full.html --> translating that into something like this: play = Play() actI = Act() sceneI = Scene("Scene I", "Elsinore. A platform before the castle.") sceneI.add_elem(StageDirection("Francisco at his post. Enter to him Bernardo.")) sceneI.add_elem(Line("Bernardo", "Who's there?")) ... Of course, I don't expect all the code—but what libraries and, when there aren't libraries, logic should I use? Thanks. (This is for a future opensource project and me learning Python for fun—not homework.)

    Read the article

  • How can I parse a C header file with Perl?

    - by Alphaneo
    Hi, I have a header file in which there is a large struct. I need to read this structure using some program and make some operations on each member of the structure and write them back. For example I have some structure like const BYTE Some_Idx[] = { 4,7,10,15,17,19,24,29, 31,32,35,45,49,51,52,54, 55,58,60,64,65,66,67,69, 70,72,76,77,81,82,83,85, 88,93,94,95,97,99,102,103, 105,106,113,115,122,124,125,126, 129,131,137,139,140,149,151,152, 153,155,158,159,160,163,165,169, 174,175,181,182,183,189,190,193, 197,201,204,206,208,210,211,212, 213,214,215,217,218,219,220,223, 225,228,230,234,236,237,240,241, 242,247,249}; Now, I need to read this and apply some operation on each of the member variable and create a new structure with different order, something like: const BYTE Some_Idx_Mod_mul_2[] = { 8,14,20, ... ... 484,494,498}; Is there any Perl library already available for this? If not Perl, something else like Python is also OK. Can somebody please help!!!

    Read the article

  • Email tracking techniques in php

    - by pandora
    Hi Everyone, I am doing a newsletter Management in php. I need to track the visitors who opens our newsletter, I have inserted the tracking image in newsletter thats seems to work little. Using Shift mailer there is option to embed an inline image to newsletter. Is it possible to track using this inline image? Is there any other techniques for tracking email.

    Read the article

  • Why does this JSON.parse code not work?

    - by SuZi
    I am trying to pass json encoded values from a php script to a, GnuBookTest.js, javascript file that initiates a Bookreader object and use the values i have passed in via the variable i named "result". The php script is sending the values like: <div id="bookreader"> <div id="BookReader" style="left:10px; right:10px; top:30px; bottom:30px;">x</div> <script type="text/javascript">var result = {"istack":"zi94sm65\/BUCY\/BUCY200707170530PM","leafCount":"14","wArr":"[893,893,893,893,893,893,893,893,893,893,893,893,893,893]","hArr":"[1155,1155,1155,1155,1155,1155,1155,1155,1155,1155,1155,1155,1155,1155]","leafArr":"[0,1,2,3,4,5,6,7,8,9,10,11,12,13]","sd":"[\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\",\"RIGHT\",\"LEFT\"]"}</script> <script type="text/javascript" src="http://localhost:8080/application/js/GnuBookTest.js"></script> </div> </div> and in the GnuBookTest.js file i am trying to use the values like: br = new BookReader(); // Return the width of a given page. br.getPageWidth = function(index) { return this.pageW[index]; } // Return the height of a given page. br.getPageHeight = function(index) { return this.pageH[index]; } br.pageW = JSON.parse(result.wArr); br.pageH = JSON.parse(result.hArr); br.leafMap = JSON.parse(result.leafArr); //istack is an url fragment for location of image files var istack = result.istack; . . . Using JSON.parse as i have written it above loads the Bookreader and uses my values correctly in a few web-browsers: Firefox, IE8, and desktop-Safari; but does not work at all in mac-Chrome, mobile-Safari, plus older versions of IE. Mobile safari keeps giving me a reference error msg: can't find variable: JSON. The other browsers just do not load the Bookreader and show the "x" instead, like they did not get the values from the php script. Where is the problem?

    Read the article

  • How to use Wordpress' http.php in external projects?

    - by NJTechGuy
    I am trying to parse data from a pipe-delimited text file hosted on another server which in turn will be inserted in a database. My host (1and1) disabled allow_url_fopen in php.ini I guess. Error message : Warning: fopen() [function.fopen]: URL file-access is disabled in the server configuration in Code : <? // make sure curl is installed if (function_exists('curl_init')) { // initialize a new curl resource $ch = curl_init(); // set the url to fetch curl_setopt($ch, CURLOPT_URL, 'http://abc.com/data/output.txt'); // don't give me the headers just the content curl_setopt($ch, CURLOPT_HEADER, 0); // return the value instead of printing the response to browser curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // use a user agent to mimic a browser curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0'); $content = curl_exec($ch); // remember to always close the session and free all resources curl_close($ch); } else { // curl library is not installed so we better use something else } //$contents = fread ($fd,filesize ($filename)); //fclose ($fd); $delimiter = "|"; $splitcontents = explode($delimiter, $contents); $counter = ""; ?> <font color="blue" face="arial" size="4">Complete File Contents</font> <hr> <? echo $contents; ?> <br><br> <font color="blue" face="arial" size="4">Split File Contents</font> <hr> <? foreach ( $splitcontents as $color ) { $counter = $counter+1; echo "<b>Split $counter: </b> $colorn<br>"; } ?> Wordpress has this cool http.php file. Is there a better way of doing it? If not, how do I use http.php for this task? Thank you guys..

    Read the article

  • Perl - Read XML

    - by chinna_82
    XML <?xml version='1.0'?> <employee> <name>Smith</name> <age>43</age> <sex>M</sex> <department role='manager'>Operations</department> </employee> Perl use XML::Simple; use Data::Dumper; $xml = new XML::Simple; foreach my $data1 ($data = $xml->XMLin("test.xml")) { print Dumper($data1); } Above code managed to all the xml value like this. Output $VAR1 = { 'department' => { 'content' => 'Operations', 'role' => 'manager' }, 'name' => 'John Doe', 'sex' => 'M', 'age' => '43' }; How do I do, if I only want to get the role value. For this example I need to get Role = manager. Any advice or reference link is highly appreciated.

    Read the article

  • libxml2 on iPhone

    - by mellkord
    I'm trying to parse HTML file with libxml2. Usually this works fine, but not in this case: <p> <b>Titles</b> (Some Text) <table> <tr> <td valign="top"> …Something1... </td> <td align="right" valign="top"> …Something2... </td> </tr> </table> </p> I do this query to get the first <td> //p[b='Titles']/table/tr/td[0] but nothing is returned because libxml think that <table> tag is not a child of a tag <p> and following him. And finally the question WHY?

    Read the article

  • Changing href atributes with nokogiri and ruby on rails

    - by fool
    Hi, I Have a HTML document with links links, for exemple: <html> <body> <ul> <li><a href="http://someurl.com/etc/etc">teste1</a></li> <li><a href="http://someurl.com/etc/etc">teste2</a></li> <li><a href="http://someurl.com/etc/etc">teste3</a></li> <ul> </body> </html> I want with Ruby on Rails, with nokogiri or some other method, to have a final doc like this: <html> <body> <ul> <li><a href="http://myproxy.com/?url=http://someurl.com/etc/etc">teste1</a></li> <li><a href="http://myproxy.com/?url=http://someurl.com/etc/etc">teste2</a></li> <li><a href="http://myproxy.com/?url=http://someurl.com/etc/etc">teste3</a></li> <ul> </body> </html> What's the best strategy to achieve this?

    Read the article

  • mailto fails in IE where there is a long body text. Is there any way to resolve this?

    - by MedicineMan
    I am having a problem using Internet Explorer 8 (IE8) to open mailto links with long messages. After the user clicks on the link, IE changes to an about:blank page and never completes the call to outlook to create an email Here's an example: <a href="mailto:[email protected]?subject=123456789&amp;body=111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111">mailto fails in IE8</a> If I shorten the list of 1's, the email is generated and can be sent. Is this a known IE issue? What are the limitations?

    Read the article

  • ICalendar parser in PHP that supports timezones

    - by Vincent Robert
    I am looking for a PHP class that can parse an ICalendar (ICS) file and correctly handle timezones. I already created an ICS parser myself but it can only handle timezones known to PHP (like 'Europe/Paris'). Unfortunately, ICS file generated by Evolution (default calendar software of Ubuntu) does not use default timezone IDs. It exports events with its a specific timezone ID exporting also the full definition of the timezone: daylight saving dates, recurrence rule and all the hard stuff to understand about timezones. This is too much for me. Since it was only a small utility for my girlfriend, I won't have time to investigate further the ICalendar specification and create a full blown ICalendar parser myself. So is there any known implementation in PHP of ICalendar file format that can parse timezones definitions?

    Read the article

  • Input string was not in the correct format using int.Parse

    - by JDWebs
    I have recently been making a login 'representation' which is not secure. So before answering, please note I am aware of security risks etc., and this will not be on a live site. Also note I am a beginner :P. For my login representation, I am using LINQ to compare values of a DDL to select a username and a Textbox to enter a password, when a login button is clicked. However, an error is thrown 'Input string was not in the correct format', when using int.Parse. Front End: <%@ Page Language="C#" AutoEventWireup="true" CodeFile="Login_Test.aspx.cs" Inherits="Login_Login_Test" %> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head runat="server"> <title>Login Test</title> </head> <body> <form id="LoginTest" runat="server"> <div> <asp:DropDownList ID="DDL_Username" runat="server" Height="20px" DataTextField="txt"> </asp:DropDownList> <br /> <asp:TextBox ID="TB_Password" runat="server" TextMode="Password"></asp:TextBox> <br /> <asp:Button ID="B_Login" runat="server" onclick="B_Login_Click" Text="Login" /> <br /> <asp:Literal ID="LI_Result" runat="server"></asp:Literal> </div> </form> </body> </html> Back End: using System; using System.Collections.Generic; using System.Linq; using System.Web; using System.Web.UI; using System.Web.UI.WebControls; public partial class Login_Login_Test : System.Web.UI.Page { protected void Page_Load(object sender, EventArgs e) { if (!Page.IsPostBack) { Binder(); } } private void Binder() { using (DataClassesDataContext db = new DataClassesDataContext()) { DDL_Username.DataSource = from x in db.DT_Honeys select new { id = x.UsernameID, txt = x.Username }; DDL_Username.DataValueField = "id"; DDL_Username.DataTextField = "txt"; DDL_Username.DataBind(); } } protected void B_Login_Click(object sender, EventArgs e) { if (TB_Password.Text != "") { using (DataClassesDataContext db = new DataClassesDataContext()) { DT_Honey blah = new DT_Honey(); blah = db.DT_Honeys.SingleOrDefault(x => x.UsernameID == int.Parse(DDL_Username.SelectedValue.ToString())); if (blah == null) { LI_Result.Text = "Something went wrong :/"; } if (blah.Password == TB_Password.Text) { LI_Result.Text = "Credentials recognised :-)"; } else { LI_Result.Text = "Error with credentials :-("; } } } } } I am aware this problem is very common, but none of the help I have found online is useful/relevant. Any help/suggestions appreciated; thank you for your time :-).

    Read the article

  • Best way to get back to using the power of lxml after having to use a regex to find something in an

    - by PyNEwbie
    I am trying to rip some text out of a large number of html documents (numbers in the hundreds of thousands). The documents are really forms but they are prepared by a very large group of different organizations so there is significant variation in how they create the document. For example, the documents are divided into chapters. I might want to extract the contents of Chapter 5 from every document so I can analyze the content of the chapter. Initially I thought this would be easy but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content so that Chapter n could be displayed using td tags inside a table. Or they might use other elements such as p tags H tags, div tags or any other block level element. After trying repeatedly to use lxml to help me identify the beginning and end of each chapter I have determined that it is a lot cleaner to use a regular expression because in every case, no matter what the enclosing html element is the chapter label is always in the form of >Chapter # It is a little more complicated in that there might be some white space or non-breaking space represented in different ways (  or   or just spaces). Nonetheless it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.) But now I want to use lxml to get the text out. My thought is that I have really no choice but to walk along my string to find the close tag for the element that encloses the text I am using to find the relevant section. That is here is one example where the element holding the Chapter name is a div <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div> So I am imagining that I would begin at the location where I found the match for chapter 1 and set up a regular expressions to find the next </div|</td|</p|</h1 . . . So at this point I have identified the type of element holding my chapter heading I can use the same logic to find all of the text that is within that element that is set up a regular expression to help me mark from >Chapter 1.&#160;&#160;&#160;Our Beginnings.< So I have identified where my Chapter 1 begins I can do the same for chapter 2 (which is where Chapter 1 ends) Now I am imagining that I am going to snip the document beginning at the opening of the element that I identified as the element the indicates where chapter 1 begins and ending just before the opening of the element that I identified as the element that indicates where Chapter 2 begins. The string that I have identified will then be fed to lxml to use its power to get the content. I am going to all of this trouble because I have read over and over - never use a regular expression to extract content from html documents and I have not hit on a way to be as accurate with lxml to identify the starting and ending locations for the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings it could be Our Red Canary. Let me say that I spent two solid days trying with lxml to be confident that I had the beginning and ending elements and I could only be accurate <60% of the time but a very short regular expression has given me better than 95% success. I have a tendency to make things more complicated than necessary so I am wondering if anyone has seen or solved a similar problems and if they had an approach (not the details mind you) that they would like to offer.

    Read the article

  • Python/YACC: Resolving a shift/reduce conflict

    - by Rosarch
    I'm using PLY. Here is one of my states from parser.out: state 3 (5) course_data -> course . (6) course_data -> course . course_list_tail (3) or_phrase -> course . OR_CONJ COURSE_NUMBER (7) course_list_tail -> . , COURSE_NUMBER (8) course_list_tail -> . , COURSE_NUMBER course_list_tail ! shift/reduce conflict for OR_CONJ resolved as shift $end reduce using rule 5 (course_data -> course .) OR_CONJ shift and go to state 7 , shift and go to state 8 ! OR_CONJ [ reduce using rule 5 (course_data -> course .) ] course_list_tail shift and go to state 9 I want to resolve this as: if OR_CONJ is followed by COURSE_NUMBER: shift and go to state 7 else: reduce using rule 5 (course_data -> course .) How can I fix my parser file to reflect this? Do I need to handle a syntax error by backtracking and trying a different rule?

    Read the article

< Previous Page | 132 133 134 135 136 137 138 139 140 141 142 143  | Next Page >