Search Results

Search found 68 results on 3 pages for 'htmlagilitypack'.


HtmlAgilityPack giving problems with malformed html

- by Kapil

I want to extract meaningful text out of an HTML document, and I was using html-agility-pack for that. Here is my code:

    string convertedContent = HttpUtility.HtmlDecode(ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)));

ConvertHtml:

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

ConvertTo:

    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }

Now, in some cases the HTML pages are malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). In those cases my code is puking at the point where I try to load the HTML document.

Can someone suggest how I can overcome these malformed HTML pages and still extract the relevant text from an HTML document? Thanks, Kapil

HTMLAgilitypack usage

- by Jepe d Hepe

How can I select all HTML tags using HtmlAgilityPack? Thanks, jepe

HTMLAgilityPack, HTML duplicate IDs

- by Sha Le

Hi: This is similar to this one here, but it needs to be done at the server level rather than at the client level. Currently I use HTMLAgilityPack. Is there any way I could detect duplicate IDs? Thanks in advance.
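One possible server-side approach, as a minimal sketch: collect every element that carries an id attribute and group the values with LINQ. The sample markup and the FindDuplicateIds helper are made up for illustration; the only HtmlAgilityPack members assumed are LoadHtml, Descendants and GetAttributeValue.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class DuplicateIdCheck
{
    // Hypothetical helper: returns each id value that occurs on more than one element.
    public static List<string> FindDuplicateIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()                                   // every node in the tree
            .Where(n => n.NodeType == HtmlNodeType.Element)  // skip text and comment nodes
            .Select(n => n.GetAttributeValue("id", ""))      // "" when the attribute is missing
            .Where(id => id.Length > 0)
            .GroupBy(id => id)
            .Where(g => g.Count() > 1)                       // keep ids used more than once
            .Select(g => g.Key)
            .ToList();
    }

    static void Main()
    {
        // Illustrative input only.
        string html = "<div id='a'></div><p id='b'></p><span id='a'></span>";
        foreach (string id in FindDuplicateIds(html))
            Console.WriteLine("Duplicate id: " + id);
    }
}
```

The comparison here is case-sensitive; whether that is the right rule depends on how the page's ids are consumed.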

HTMLAgilityPack ChildNodes index works, named node does not

- by XgenX

I am parsing an XML API response with HTMLAgilityPack. I am able to select the result items from the API call. Then I loop through the items and want to write the ChildNodes to a table. When I select ChildNodes by index, like this:

    sItemId = dnItem.ChildNodes(0).innertext

I get the proper itemId result. But when I try:

    sItemId = dnItem.ChildNodes("itemId").innertext

I get "Referenced object has a value of 'Nothing'." I have tried "itemID[1]", "/itemId[1]" and a variety of other strings. I have tried SelectSingleNode and ChildNodes.Item("itemId").innertext. The only one that has worked is using the index. The problem with using the index is that sometimes child elements are omitted from the results, and that throws off the index. Does anybody know what I am doing wrong?

Select only items in a specific DIV using HtmlAgilityPack

- by Adam Haile

I'm trying to use the HtmlAgilityPack to pull all of the links from a page that are contained within a div declared as <div class='content'>. However, when I use the code below I simply get ALL links on the entire page. This doesn't really make sense to me, since I am calling SelectNodes from the sub-node I selected earlier (which, when viewed in the debugger, only shows the HTML from that specific div). So it's like it's going back to the very root node every time I call SelectNodes. The code I use is below:

    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(@"http://example.com");
    HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='content']");
    foreach(HtmlNode link in node.SelectNodes("//a[@href]"))
    {
        Console.WriteLine(link.Value);
    }

Is this the expected behavior? And if so, how do I get it to do what I'm expecting?
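This looks like standard XPath behaviour rather than a bug: an expression that begins with // is evaluated from the document root, whichever node SelectNodes is called on. Prefixing it with a dot (.//a[@href]) keeps the search inside the selected div. A minimal sketch, with inline markup standing in for the real page (the markup and class name ContentLinks are illustrative):

```csharp
using System;
using HtmlAgilityPack;

class ContentLinks
{
    static void Main()
    {
        // Illustrative markup; the real page would come from HtmlWeb.Load.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<a href='/outside'>outside</a>" +
            "<div class='content'><a href='/inside'>inside</a></div>");

        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='content']");
        if (node == null) return; // no matching div

        // ".//" is relative to `node`; "//" would restart from the document root.
        HtmlNodeCollection links = node.SelectNodes(".//a[@href]");
        if (links == null) return; // SelectNodes can return null when nothing matches

        foreach (HtmlNode link in links)
        {
            // Read the href attribute, falling back to "" if it is absent.
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

With the relative path, only the link inside the content div should be listed.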

HtmlAgilityPack - Vs 2010 - c# ASP - File Not found

- by Janosch Geigowskoskilu

First, I've already searched the web and StackOverflow for hours. I found a lot about troubleshooting HtmlAgilityPack and tried most of it, but nothing worked.

The situation: I'm developing a C# ASP.NET WebPart in SharePoint Foundation. Everything works fine; now I want to parse an HTML page to get all image paths and save the images to HD/Temp. To do that I downloaded the current version of HtmlAgilityPack and added a reference to the project. Everything looks OK and IntelliSense works fine.

The exception: When I run the section where HtmlAgilityPack is used, my browser shows a FileNotFoundException: the file or assembly could not be found.

What I tried: After my first searches I tried including v1.4.0 of HtmlAgilityPack, because I read that the current version is in some cases not really stable. This also works fine up to the point where I want to use HtmlAgilityPack, and then I get the same exception. I also tried moving HtmlAgilityPack directly into the solution directory; nothing changed. I tried importing HtmlAgilityPack with a using directive, and I tried calling it directly, e.g. HtmlAgilityPack.HtmlDocument.

Conclusion: When I compile, no error occurs and the reference is set correctly. When I trace HtmlAgilityPack.dll with ProcMon the path is shown correctly, but sometimes the result is 'File Locked with only Readers'. I don't know enough about ProcMon to know what this means or whether it is critical. It shouldn't have anything to do with file permissions, because when I check the DLL all the permissions are granted.

Select all links from a Html table using XPath (and HtmlAgilityPack)

- by Adam Asham

What I am trying to achieve is to extract all links with an href attribute that starts with http://, https:// or /. These links lie within a table (tbody, tr, td, etc.) with a certain class. I thought I could specify just the a element without the whole path to it, but it does not seem to work. I get a NullReferenceException at the line that selects the links:

    var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
    if (table != null)
    {
        foreach (HtmlNode item in table.SelectNodes("a[starts-with(@href, 'https://')]"))
        {
            //not working

I don't know about any recommendations or best practices when it comes to XPath. Do I create overhead when I query the document two times?
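Two things are probably at play here. The path a[...] only matches a elements that are direct children of the table, and SelectNodes returns null rather than an empty collection when nothing matches, which would explain the NullReferenceException. A sketch under those assumptions, using a single query with a descendant step and an or of the three prefixes (the markup is illustrative):

```csharp
using System;
using HtmlAgilityPack;

class TableLinks
{
    static void Main()
    {
        // Illustrative markup standing in for the real page.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class='containerTable'><tr><td>" +
            "<a href='https://example.com/a'>a</a>" +
            "<a href='/local'>b</a>" +
            "<a href='mailto:x@example.com'>c</a>" +
            "</td></tr></table>");

        // One query: any <a> below the table whose href starts with one of the prefixes.
        const string xpath =
            "//table[@class='containerTable']" +
            "//a[starts-with(@href, 'http://') or " +
            "starts-with(@href, 'https://') or " +
            "starts-with(@href, '/')]";

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(xpath);
        if (links == null)
        {
            Console.WriteLine("No matching links.");
            return;
        }

        foreach (HtmlNode item in links)
            Console.WriteLine(item.GetAttributeValue("href", ""));
    }
}
```

Querying twice (table first, then .//a relative to it) would also work; the single expression simply avoids the intermediate null check on the table.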

XPATH query, HtmlAgilityPack and Extracting Text

- by Soham

I had been trying to extract links from a class called "tim_new", and I have been given a solution as well. The solution, the snippet and the necessary information are given here. The XPATH query was "//a[@class='tim_new']". My question is, how did this query differentiate between the first line of the snippet (given in the link above) and the second line of the snippet? More specifically, what is the literal translation (in English) of this XPATH query?

Furthermore, I want to write a few lines of code to extract the text written against NSE:

    <div class="FL gL_12 PL10 PT15">BSE: 523395   |   NSE: 3MINDIA   |   ISIN: INE470A01017</div>

I would appreciate help in forming the necessary selection query. My code is written as:

    IEnumerable<string> NSECODE = doc.DocumentNode.SelectSingleNode("//div[@NSE:]");

But this doesn't look right. Would appreciate some help.

Extracting table rows with a particular attribute from an HTML File, using HTMLAgilityPack

- by Soham

Consider this HTML snippet:

    <tr>
      <td valign=top class="tim_new"><a href="/stocks/company_info/pricechart.php?sc_did=MI42" class="tim_new">3M India</a></td>
      <td class="tim_new" valign=top><a href='/stocks/marketstats/indcomp.php?optex=NSE&indcode=Diversified' class=tim>Diversified</a></td>
      <td class="tim_new" align=right valign=top>2,487.25</td>
      <td class="tim_new" align=right valign=top><font color=#16a903>187.25</font></td>
      <td class="tim_new" align=right valign=top><font color=#16a903>8.14</font></td>
      <td class="tim_new" align=right valign=top>2,801.90</td>
      <td class="tim_new" align=right valign=top>0.06</td>
    </tr>

Realize these three things:

1. The HTML file from which this snippet has been taken contains multiple HTML tables.
2. The table from which this snippet has been extracted doesn't contain only rows of the shown format, but also rows of other formats, for example:

    <tr><td colspan=7><img src="http://img1.moneycontrol.com/images/blank.gif"height="5"></td></tr>

3. This same table contains multiple rows of the format which I need to extract.

So given this scenario, is it possible to run code which extracts the link with the class name "tim_new"? Help appreciated, Soham

How would I use HTMLAgilityPack to extract the value I want

- by Nai

For the given HTML I want the value of id:

    <div class="name" id="john-5745844">
    <div class="name" id="james-6940673">

UPDATE: This is what I have at the moment:

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.Load(new StringReader(pageResponse));
    HtmlNode root = htmlDoc.DocumentNode;
    List<string> anchorTags = new List<string>();
    foreach (HtmlNode div in root.SelectNodes("//div[@class='name' and @id]"))
    {
        HtmlAttribute att = div.Attributes["id"];
        Console.WriteLine(att.Value);
    }

The error I am getting is at the foreach line, stating: Object reference not set to an instance of an object.

Too nervous to install

- by The Prop

Yesterday I (a professional rugby prop of somewhat limited intellect) landed in http://htmlagilitypack.codeplex.com/ and found myself stranded in a town with no signposts. The locals don't need signposts - they know their way around - so who gives a hoot about visitors? Well I'm a visitor and I'm lost. Here's my plea to the good burgesses of Codeplex-sans-signs: HELP!!

Let me back-track and explain what landed me at the bottom of this tangled ruck. There's a "Download" button positioned near the top-right of the Codeplex web page, right? Like the Sword of Damocles, a down-arrow to the left of the button indicates, presumably, what a download would include:

    CURRENT 1.4.0 Stable
    DATE Fri May 7 2010 at 7:00 AM
    STATUS Stable

With a simple-minded confidence that has since deserted me (the confidence - not the simple-mindedness), I clicked "Download". This introduced 3 new files to my computer: HtmlAgilityPack.dll, HtmlAgilityPack.pdb, and HtmlAgilityPack.XML

This is when the first stab of doubt penetrated that globe between my cauliflower ears that I call a head. Where's the dot cs? Somewhere in Codeplex, I'd read advice to another lost soul to "download and build the HTMLAgilityPack solution". As I've done so many times as an All Black prop, I glared at the opposition front row - ah, I mean the 3 new files. Shouldn't one of them have a ".cs" on the back of his jersey - er, on the end of its name? Or is this just how they play the game in Codeplex-sans-signs?

Undaunted (props have more courage than sense) I packed into my first C# scrum. The half-back feeds the ball in, and the front rows collapse - er, the debugging stops at this line of my code:

    "HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();"

Then the Referee blows his whistle and announces one of those verdicts that's utterly indecipherable to your average loose-head prop:

    Locating source for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'. Checksum: MD5 {62 bc f3 7e 9a 92 a6 32 7 d6 5b f8 76 59 7b 5b}
    The file 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs' does not exist.
    Looking in script documents for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'...
    Looking in the projects for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'.
    The file was not found in a project.
    Looking in directory 'C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\vc7\atlmfc'...
    Looking in directory 'C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\vc7\crt'...
    The debugger will ask the user to find the file: C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs.
    The user pressed Cancel [a brain-stemmer from the prop] in the Find Source dialog. The debug source files settings for the active solution have been modified so that the debugger will not ask the user to find the file: C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs.
    The debugger could not locate the source file 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'.

Even if it had been the first 50 stanzas of "Eskimo Nell", I couldn't have been more shocked. I'm so shocked, my jaws clamp shut around the opposition hooker's ear. He thumbs me in the iris. With a cornea-torn eye I peer at the Codeplex site. My brain stem sparks and I punch the "View all downloads" link. It sparks four more times on each download link, and.. lo! FOUR files this time: HAPExplorer.zip, HtmlAgilityPack.1.4.0.Source.zip, HtmlAgilityPack.1.4.0.zip, HtmlAgilityPack.Documentation.chm

But... is this not the same place arrived at recently by my flat-mate Chaz, journalist extraordinaire? (Chaz, if you're reading this, I'm not plugging for nothing - just write kindly about me in your next report, okay?) Didn't these same four files flummox Chaz The Great? He told me about it. Chaz left a message with Codeplex and then solved the problem by just walking away. Typical journalist, huh.

But I'm not like that. I don't walk away. I'm made of the sort of stubborn stuff that becomes an All Black prop. Hence this impassioned plea: GOOD TOWNSFOLK OF CODEPLEX-SANS-SIGNS, WHAT SHOULD I DO NEXT? Can somebody point me to Main Street? How does a simpleton install 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'? I'm willing to prostrate myself and grovel to the first kind face that passes in front of my rapidly clouding sight. So help me, I'd even tug my forelock if I had one!

Should I hold forth my rod over the wilderness, and create a folder called 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\' or some such? If so, what files should I move into it? ANYTHING else a dum-ass should know about? - and I mean ANYTHING - you just don't know how witless a punch-drunk prop can be.. %(

Whenever I've installed other programs they've given me an ".exe" or ".msi" that I can click on and it's all done for me like magic. HEY... there's nothing of that nature here, is there? Am I missing something? Something for dummies to click?

(From the waiting rooms of Dr I. Sight Phixes)
(signed) The Prop

Screen scraping in C# using HtmlAgilityPack.

In my example, you can scrape a complete page or just part of a page.

C#: HtmlAgilityPack extract inner text

- by Yang

I am using HtmlAgilityPack. Is there a one-line way to get all the inner text of an HTML document, i.e., with all HTML tags and scripts removed?

HTMLAgilitypack getting <P> and <STRONG> text

- by StealthRT

Hey all, I am looking for a way to parse this HTML code:

    <DIV class=channel_row><SPAN class=channel>
    <DIV class=logo><IMG src='/images/channel_logos/WGNAMER.png'></DIV>
    <P><STRONG>2</STRONG><BR>WGNAMER </P></SPAN>

using the HtmlAgilityPack. I have been trying this:

    With channel
        info!Logo = .SelectSingleNode(".//img").Attributes("src").Value
        info!Channel = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(0).InnerText
        info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(2).InnerText
    End With

I can get the Logo, but it comes up with a blank string for the Channel, and for the Station it says: Index was out of range. Must be non-negative and less than the size of the collection.

I've tried all types of combinations:

    info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(1).InnerText
    info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(3).InnerText
    info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(1).InnerText
    info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(2).InnerText
    info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(3).InnerText

What do I need to do in order to correct this?

Using HTMLAgility Pack to Extract Links

- by Soham

Hi Folks, Consider this simplest piece of code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;

    namespace WebScraper
    {
        class Program
        {
            static void Main(string[] args)
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://www.google.com");
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                }
            }
        }
    }

This effectively doesn't do anything at all, and is copied/inspired from various other StackOverflow questions like this. It compiles, but when I run it there is a runtime error which says "Object reference not set to an instance of an object.", highlighting the foreach line. I can't understand why the environment has become irritable at this humble, innocent and useless piece of code. I would also like to know: does HTMLAgilityPack accept HTML classes as nodes?
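A likely cause: LoadHtml expects the markup itself, so the URL string is parsed as a one-line document with no links in it, and SelectNodes then returns null, which the foreach trips over. A sketch of the usual alternative, fetching the page with HtmlWeb.Load and guarding against the null result (it needs network access; the URL is the one from the question):

```csharp
using System;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            // HtmlWeb downloads the page; LoadHtml would treat the URL as literal markup.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.google.com");

            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
            {
                // SelectNodes can return null instead of an empty collection.
                Console.WriteLine("No links found.");
                return;
            }

            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```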

HtmlAgilityPack SelectNodes expression to ignore an element with a certain attribute

- by thaky

I am trying to select all nodes except script nodes and a ul that has a class called 'relativeNav'. Can someone please point me down the right path? I have been searching for this for a week and I can't find it anywhere. Currently I have the code below, but it is obviously selecting the //ul[@class='relativeNav'] as well. Is there any way to put a NOT expression into it so that SelectNodes will ignore that one?

    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//body//*[not(self::script)]/text()"))
    {
        Console.WriteLine("Node: " + node);
        singleString += node.InnerText.Trim() + "\n";
    }
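One way this could be expressed, as a sketch rather than a tested answer: select the text nodes directly and filter on their ancestors, so that anything nested inside a script or inside the relativeNav list is skipped at any depth. The markup below is illustrative, and the sketch assumes HtmlAgilityPack's XPath support covers the ancestor axis.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class VisibleText
{
    static void Main()
    {
        // Illustrative markup only.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body><p>Keep me</p>" +
            "<ul class='relativeNav'><li>Skip me</li></ul>" +
            "<script>var skip = 1;</script></body></html>");

        // Text nodes under body that have no script ancestor and no ul.relativeNav ancestor.
        const string xpath =
            "//body//text()[not(ancestor::script) and " +
            "not(ancestor::ul[@class='relativeNav'])]";

        var singleString = new StringBuilder();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null) // null when nothing matches
        {
            foreach (HtmlNode node in nodes)
            {
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                    singleString.Append(text).Append('\n');
            }
        }

        Console.Write(singleString.ToString());
    }
}
```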

Split a html string in N parts

- by Matt Brailsford

Hi Guys, Does anybody have an example of splitting an HTML string (coming from a TinyMCE editor) into N parts using C#? I need to split the string evenly without splitting words. I was thinking of just splitting the HTML and using the HtmlAgilityPack to try to fix the broken tags. Though I'm not sure how to find the split point, as ideally it should be based purely on the text rather than the HTML as well. Anybody got any ideas on how to go about this? Many thanks, Matt

XPath query to get node after some other node

- by czesio

I am using "HtmlAgilityPack" to parse HTML content. My target is to get a number value.

    <div>
      some content 1 <br>
      some <b>content</b> 2 <br>
      <b>NUMBER:</b> 9788492688647 <br>
      some content 3 <br>
      some content 4
    </div>

Aim: get "9788492688647".

Can anybody tell me how to get the value between /div/b[2] and <br>?
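A possible XPath for this, sketched under the assumption that the label text is exactly "NUMBER:": find the b element by its text and take the first text node that follows it as a sibling. The markup is the snippet from the question.

```csharp
using System;
using HtmlAgilityPack;

class NumberAfterLabel
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div> some content 1 <br> some <b>content</b> 2 <br> " +
            "<b>NUMBER:</b> 9788492688647 <br> some content 3 <br> some content 4 </div>");

        // The <b> whose text is "NUMBER:", then the first text node after it at the same level.
        HtmlNode textNode = doc.DocumentNode.SelectSingleNode(
            "//b[normalize-space()='NUMBER:']/following-sibling::text()[1]");

        if (textNode == null)
        {
            Console.WriteLine("Label not found.");
            return;
        }

        // Trim the surrounding whitespace left over from the markup.
        Console.WriteLine(textNode.InnerText.Trim());
    }
}
```

Matching on the label rather than on b[2] should keep the query working if other b elements are added or removed earlier in the div.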

How could I parse this HTML file?

- by Sergio Tapia

    <div id="main">
      <style type="text/css"> </style>
      <script language="JavaScript"> </script>
      <p style="margin: 0pt 0pt 0.5em;"><b>Media from <a onclick="(new Image()).src='/rg/find-media-title/media_strip/images/b.gif?link=/title/tt0087538/';" href="/title/tt0087538/">The Karate Kid</a> (1984)</b></p>
      <style type="text/css"> </style>
      <table style="border-collapse: collapse;"> </table>
    </div>

I need to somehow extract the href value of the (new Image()). How exactly would I accomplish this with HtmlAgilityPack? I'm new to it, and so far I haven't found a useful tutorial on how to effectively use it for parsing. Thanks for the help!

Parsing tables, cells with Html agility in C#

- by Kaeso

I need to parse HTML code. More specifically, I need to parse each cell of every row in all tables. Each row represents a single object and each cell represents a different property. I want to parse these so that I can write an XML file containing all the data (without the useless HTML code). This is the way I thought it out initially, but I ran out of ideas:

HTML:

    <tr>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF"> 1 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="left"> <a href="/ice/player.htm?id=8471675">Sidney Crosby</a> </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center"> PIT </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center"> C </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 39 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 32 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 33 </td>
      <td class="statBox sorted" style="border-width:0px 1px 1px 0px; background-color: #E0E0E0" align="right"> <font color="#000000"> 65 </font> </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 20 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 29 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 10 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 1 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 3 </td>
      <td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right"> </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 0 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 154 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 20.8 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 21:54 </td>
      <td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right"> 22.6 </td>
      <td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right"> 55.7 </td>
    </tr>

C#:

    using HtmlAgilityPack;
    using System.Data;

    namespace Stats
    {
        class StatsParser
        {
            private string htmlCode;
            private static string fileName = "[" + DateTime.Now.ToShortDateString() + " NHL Stats].xml";

            public StatsParser(string htmlCode)
            {
                this.htmlCode = htmlCode;
                this.ParseHtml();
            }

            public DataTable ParseHtml()
            {
                var result = new DataTable();
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(htmlCode);
                HtmlNode row = doc.DocumentNode.SelectNodes("//tr");
                foreach (var statBox in row.SelectNodes("//td[@class='statBox']"))
                {
                    System.Windows.MessageBox.Show(statBox.InnerText);
                }
            }
        }
    }
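A minimal sketch of one way to walk the rows and cells, not a drop-in replacement for the class above. It assumes every data cell carries the statBox class, loops over each tr, and uses a path relative to the row so that cells from other rows are not picked up. The ParseRows helper and the shortened sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class StatsRows
{
    // Hypothetical helper: one list per <tr>, one entry per statBox cell.
    static List<List<string>> ParseRows(string html)
    {
        var result = new List<List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return result; // no rows at all

        foreach (HtmlNode row in rows)
        {
            // "./td" is relative to this row. contains() is a substring test,
            // so it also matches the "statBox sorted" cell.
            HtmlNodeCollection cells = row.SelectNodes("./td[contains(@class, 'statBox')]");
            if (cells == null) continue; // row without stat cells

            var values = new List<string>();
            foreach (HtmlNode cell in cells)
                values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            result.Add(values);
        }
        return result;
    }

    static void Main()
    {
        // Shortened, illustrative version of the row from the question.
        string html =
            "<table><tr>" +
            "<td class='statBox'> 1 </td>" +
            "<td class='statBox'> <a href='/ice/player.htm?id=8471675'>Sidney Crosby</a> </td>" +
            "<td class='statBox sorted'> <font color='#000000'> 65 </font> </td>" +
            "</tr></table>";

        foreach (List<string> row in ParseRows(html))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

From the per-row lists it is a short step to a DataTable or to XML output, since each list position corresponds to one column of the stats table.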

HTML Agility Pack - ReplaceNode doesn't change the InnerHTML of the Body

- by morsanu

Hi there, I have this.

The body:

    <body><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p></body>

The code:

    HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
    Dictionary<HtmlNode, HtmlNode> toReplace = new Dictionary<HtmlNode, HtmlNode>();
    // I do some logic here adding nodes to the toReplace dictionary.
    foreach (HtmlNode replaceNode in toReplace.Keys)
    {
        replaceNode.ParentNode.ReplaceChild(toReplace[replaceNode], replaceNode);
    }

After I do this, the InnerHtml of the body node remains the same as at the beginning, although the OuterHtml and the InnerText show the correct result. Is there something wrong with my code?

The result:

    // body.InnerHtml
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p>
    // body.OuterHtml
    <body><p>Lorem ipsum dolor sit amet...</p></body>

“html agility pack” like solutions for C/Objective-c/iPhone

- by mxg

Hi everyone! I need a powerful HTML parser and manipulator for Objective-C/C, like HTML Agility Pack. Can anyone suggest a good solution? One option is libxml2, but it seems not to be the best. Thanks in advance!

HTML Agility Pack Screen Scraping XPATH isn't returning data

- by Matthias Welsh

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPATH that I'm seeing in Chrome Devtools as well as Firebug on Firefox and what my C# program is seeing. The code I'm currently using is pretty quick and dirty...

    //This function retrieves data from the digikey
    private static List<string> ExtractProductInfo(HtmlDocument doc)
    {
        List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
        List<string> m_unparsedProductInfo = new List<string>();

        //Base Node for part info
        string m_baseNode = @"//html[1]/body[1]/div[2]";

        //Write part info to list
        m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
        //More lines of similar form will go here for more info
        //this retrieves digikey PN

        foreach(HtmlNode node in m_unparsedProductInfoNodes)
        {
            m_unparsedProductInfo.Add(node.InnerText);
        }
        return m_unparsedProductInfo;
    }

Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes". Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode it only returns a div... not sure what that indicates, but it doesn't seem right.

How to extract innermost table from html file with the help of the html agility pack ?

- by Harikrishna

I am parsing tabular information from an HTML file with the help of the HTML Agility Pack, and in general it works. The problem is when the table I want to extract is the innermost one, or when I don't know at which position it sits among nested tables. There can be any number of nested tables, and from them I want to extract the information of the table which has the column names NAME, ADDRESS. Example:

    <table>
    <tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
    <tr><td>
    <table>
    <tr><td>AMOUNT</td></tr>
    <tr><td>50000</td></tr>
    <tr><td>80000</td></tr>
    </table>
    </td></tr>
    <tr><td>
    <table>
    <tr><td>
    <table>
    <tr><td>
    <table>
    <tr><td> NAME </td><td>ADDRESS</td>
    <tr><td> ABC </td><td> kfks </td>
    <tr><td> BCD </td><td> fdsa </td>
    </table>
    </tr></td>
    </table>
    </td></tr>
    </table>
    </td></tr>
    </table>

There are many tables, but I want to extract the one which has the column names NAME, ADDRESS. So what should I do?

Can not parse tabular information from html document.

- by Harikrishna

I am parsing many HTML documents with the HTML Agility Pack, and I want to extract tabular information from each one. A document may contain any number of tables, but I want only one table per document: the one whose column headers are NAME, PHONE NO, ADDRESS. That table can be anywhere in the document. For example, a document may have ten tables, one of which contains many nested tables, and the table I want may be one of those nested tables. So I want to find the table by its column header names and then extract the information from it.

I can already find a table with the column headers NAME, PHONE NO, ADDRESS and extract its information. First I find all the tables in the document:

    foreach (var table in doc.DocumentNode.Descendants("table"))

Then for each table I get its rows:

    var rows = table.Descendants("tr");

Then for each row I check whether it contains the header names NAME, ADDRESS, PHONE NO. If it does, I skip that row and extract all the information after it:

    foreach (var row in rows.Skip(rowNo))
    {
        var data = new List<string>();
        foreach (var column in row.Descendants("td"))
        {
            data.Add(properText);
        }
    }

This way I can extract the information from almost all documents. The problem is that for some documents I cannot parse the information, for example a document with 10 tables where one of them contains many nested tables. Among these nested tables I want to find the one with the column headers NAME, ADDRESS, PHONE NO. So the table should be found through its column header names wherever it is in the document, even inside nested tables, so that I can parse the information from that table and skip the information of the tables surrounding it.

