Search Results

Search found 58 results on 3 pages for 'htmldocument'.


Changing content of HTMLDocument displayed in a JTextPane

- by kayahr

I'm displaying some tables as HTML code (rendered by a Freemarker template) with a JTextPane. I also have some HTML links in this HTML output which can be used to interact with the values of the table (For example "Delete a row" and stuff like this). Currently I always recreate the whole HTML output on each change and replace the whole document in the JTextPane with the new one. This is getting really slow and the JTextPane always changes the scroll position when I do this. Is it possible to change only parts of the HTML code during runtime without replacing the whole document? In a real browser I would use the DOM and do something like document.getElementById("price").innerHTML = "$123.44". Is something like that possible with the HTMLDocument in a JTextPane?
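One possible direction, sketched here under assumptions rather than as a confirmed answer: Swing's HTMLDocument offers getElement(String id) and setInnerHTML(Element, String), which together come close to the DOM call in the question. The class name UpdateById, the helper names and the sample markup are hypothetical; in a real application the document would come from textPane.getDocument() and the change would be made on the event dispatch thread.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class; only the javax.swing.text.html calls are real API.
final class UpdateById {

    // Parses an HTML string into an HTMLDocument (a stand-in for textPane.getDocument()).
    static HTMLDocument parse(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    // Replaces the children of the block element whose id attribute equals `id`.
    static void setInnerHtml(HTMLDocument doc, String id, String html) throws Exception {
        Element target = doc.getElement(id);
        if (target == null) {
            throw new IllegalArgumentException("No element with id: " + id);
        }
        doc.setInnerHTML(target, html);
    }

    public static void main(String[] args) throws Exception {
        // Sample markup is an assumption made for this sketch.
        HTMLDocument doc = parse(
            "<html><body><table><tr><td id=\"price\">$100.00</td></tr></table></body></html>");
        setInnerHtml(doc, "price", "$123.44");
        System.out.println(doc.getText(0, doc.getLength()));
    }
}
```

Because only one element is rewritten, the rest of the document is left alone, which may also help with the scroll-position jump described above. Note that setInnerHTML needs a block element such as td or div; it rejects leaf elements.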

Read the article
Element to string in HTMLDocument

- by kalpesh

I have an Element object from an HTMLDocument and I want the string value of this element. The result I want is: Christina Toth, Pharm. D. Please see the code below.

    public static void main(String args[]) throws Exception {
        InputStream is = Nullsoft.getInputStream();
        InputStreamReader isr = new InputStreamReader(is);
        BufferedReader br = new BufferedReader(isr);
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(br, callback, true); // Parse
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            AttributeSet attributes = element.getAttributes();
            Object name = attributes.getAttribute(StyleConstants.NameAttribute);
            if ((name instanceof HTML.Tag) && ((name == HTML.Tag.DIV) || (name == HTML.Tag.H2) || (name == HTML.Tag.H3))) {
                StringBuffer text = new StringBuffer();
                int count = element.getElementCount();
                for (int i = 0; i < count; i++) {
                    Element child = element.getElement(i);
                    AttributeSet childAttributes = child.getAttributes();
                    // if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT) {
                    int startOffset = child.getStartOffset();
                    int endOffset = child.getEndOffset();
                    int length = endOffset - startOffset;
                    text.append(htmlDoc.getText(startOffset, length));
                    // }
                }
                System.out.println(name + ": " + text.toString());
            }
        }
        System.exit(0);
    }

    public static InputStream getInputStream() {
        String text = "<html>\n"
            + "<head>\n"
            + "<title>pg_0001</title>\n"
            + "\n"
            + "<style type=\"text/css\">\n"
            + ".ft3{font-style:normal;font-weight:bold;font-size:11px;font-family:Helvetica;color:#000000;}\n"
            + "</style>\n"
            + "</head>\n"
            + "<body vlink=\"#FFFFFF\" link=\"#FFFFFF\" bgcolor=\"#ffffff\">\n"
            + "\n"
            + "\n"
            + "<div style=\"position:absolute;top:597;left:252\"><nobr><span class=\"ft3\">Christina Toth, Pharm. D.</span></nobr></div>\n"
            + "\n"
            + "\n"
            + "</body>\n"
            + "</html>";
        InputStream is = null;
        try {
            is = new ByteArrayInputStream(text.getBytes("UTF-8"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return is;
    }
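A minimal sketch of one way to get an element's text, offered as an assumption rather than a confirmed fix: instead of walking the direct children, read the whole character range the element covers, which also picks up text nested inside inline tags. The class name ElementText and its helper methods are hypothetical; the Swing calls themselves are standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for illustration.
final class ElementText {

    // Parses an HTML string into an HTMLDocument.
    static HTMLDocument parse(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    // Returns the trimmed text covered by every element with the given tag.
    static List<String> textOf(HTMLDocument doc, HTML.Tag tag) throws BadLocationException {
        List<String> result = new ArrayList<>();
        ElementIterator iterator = new ElementIterator(doc);
        Element element;
        while ((element = iterator.next()) != null) {
            Object name = element.getAttributes().getAttribute(StyleConstants.NameAttribute);
            if (name == tag) {
                int start = element.getStartOffset();
                // The last element can end one past the document length, so clamp it.
                int end = Math.min(element.getEndOffset(), doc.getLength());
                result.add(doc.getText(start, end - start).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        HTMLDocument doc = parse(
            "<html><body><div><span class=\"ft3\">Christina Toth, Pharm. D.</span></div></body></html>");
        for (String text : textOf(doc, HTML.Tag.DIV)) {
            System.out.println(text);
        }
    }
}
```

Reading by offsets avoids depending on how deeply the parser nests implied paragraphs under the div, which is the part the child loop in the question has to guess at.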

Read the article
Html Agility Pack: Setting an HtmlNode's Attribute Value isn't reflected in the HtmlDocument.

- by Avi

In Html Agility Pack, when I set an attribute of an HtmlNode, should I see this in the HtmlDocument from which the node was selected? Let's say that htmlDocument is an HtmlDocument. The simplified code looks like this:

    HtmlNode documentNode = htmlDocument.DocumentNode;
    HtmlNodeCollection nodeCollection = documentNode.SelectNodes(someXPath);
    foreach (var node in nodeCollection)
        if (SomeCondition(node))
            node.SetAttributeValue("class", "something");

Now, I see the class attribute of node change, but I don't see this change reflected in the htmlDocument's HTML.

Read the article
WebBrowser control HTMLDocument automate selecting option drop-down

- by CWinKY

I'm trying to automate a WinForm that uses a WebBrowser control to navigate and pull report info from a website. You can enter values in textboxes and invoke the click events for buttons and links, but I have not figured out how to select an option in a drop-down in an automated way. Can anybody recommend how to select an item from a drop-down, given this HTML example:

    <SELECT id="term_id" size="1" name="p_term_in"><option value="">Select Another Term<option value="201050">Summer 2010<option value="201010">Spring 2010<option value="200980">Fall 2009</SELECT>

For others who want to learn how to enter values in textboxes and invoke click events, here's how you do it:

    webBrowser1.Document.GetElementById("<HTML ELEMENT NAME>").SetAttribute("value", "THE NAME");

Invoke a button or hyperlink click:

    webBrowser1.Document.GetElementById("<BUTTON>").InvokeMember("click");

So I've solved entering values and invoking click, but I have not solved selecting a drop-down value.

Read the article
TypeError: Object #<HTMLDocument> has no method 'observe' issue with the plugin chosen

- by NewBoy

Hi, I have tried to install the jQuery plugin Chosen, which lets me customise my <select> tag in all browsers. I have integrated this plugin into my site and I have come across the following error message in my element inspector: "TypeError: Object #<HTMLDocument> has no method 'observe'", from the following code:

    <script type="text/javascript">
    document.observe('dom:loaded', function(evt) {
        var select, selects, _i, _len, _results;
        if (Prototype.Browser.IE && (Prototype.BrowserFeatures['Version'] === 6 || Prototype.BrowserFeatures['Version'] === 7)) {
            return;
        }
        selects = $$(".chzn-select");
        _results = [];
        for (_i = 0, _len = selects.length; _i < _len; _i++) {
            select = selects[_i];
            _results.push(new Chosen(select));
        }
        deselects = $$(".chzn-select-deselect");
        for (_i = 0, _len = deselects.length; _i < _len; _i++) {
            select = deselects[_i];
            _results.push(new Chosen(select,{allow_single_deselect:true}));
        }
        return _results;
    });
    </script>

Does anyone know how I can solve this problem?

Read the article
Should I use HtmlDocument even if I want to parse an HTML string using HtmlAgilityPack?

- by skhan

Hi everyone, I'm working in C#. I'm trying to extract the first instance of an img tag from an HTML string (which is actually post data). This is my code:

    private string GrabImage(string htmlContent)
    {
        String firstImage;
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(htmlContent);
        HtmlAgilityPack.HtmlNode imageNode = htmlDoc.DocumentNode.SelectSingleNode("//img");
        if (imageNode != null)
        {
            return firstImage = imageNode.ToString();
        }
        else
            return firstImage = " ";
    }

But I get null in htmlDoc. Should I use the HtmlDocument type even if I'm trying to parse the HTML from a string? P.S. By the way, is this the correct way of grabbing the first instance of an image tag from my HTML string?

Read the article
Iteration through the HtmlDocument.All collection stops at the referenced stylesheet?

- by Jonas

Since "bug in .NET" is often not the real cause of a problem, I wonder if I'm missing something here. What I'm doing feels pretty simple. I'm iterating through the elements in an HtmlDocument called doc like this:

    System.Diagnostics.Debug.WriteLine("*** " + doc.Url + " ***");
    foreach (HtmlElement field in doc.All)
        System.Diagnostics.Debug.WriteLine(string.Format("Tag = {0}, ID = {1} ", field.TagName, field.Id));

I then discovered the debug window output was this:

    Tag = !, ID =
    Tag = HTML, ID =
    Tag = HEAD, ID =
    Tag = TITLE, ID =
    Tag = LINK, ID =

... when the actual HTML document looks like this:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
    <title>Protocol</title>
    <link rel="Stylesheet" type="text/css" media="all" href="ProtocolStyle.css">
    </head>
    <body onselectstart="return false">
    <table> </table>
    </body>
    </html>

Commenting out the LINK tag solves the issue for me, and the document is completely parsed. The ProtocolStyle.css file exists on disk and is loaded properly, if that matters. Is this a bug in .NET 3.5 SP1, or what? For such a web-oriented framework, I find it hard to believe there would be such a major bug in it.

Read the article
Too nervous to install

- by The Prop

Yesterday I (a professional rugby prop of somewhat limited intellect) landed in http://htmlagilitypack.codeplex.com/ and found myself stranded in a town with no signposts. The locals don't need signposts - they know their way around - so who gives a hoot about visitors? Well I'm a visitor and I'm lost. Here's my plea to the good burgesses of Codeplex-sans-signs: HELP!! Let me back-track and explain what landed me at the bottom of this tangled ruck. There's a "Download" button positioned near the top-right of the Codeplex web page, right? Like the Sword of Damocles, a down-arrow to the left of the button indicates, presumably, what a download would include: CURRENT 1.4.0 Stable DATE Fri May 7 2010 at 7:00 AM STATUS Stable With a simple-minded confidence that has since deserted me (the confidence - not the simple-mindedness), I clicked "Download". This introduced 3 new files to my computer: HtmlAgilityPack.dll, HtmlAgilityPack.pdb, and HtmlAgilityPack.XML This is when the first stab of doubt penetrated that globe between my cauliflower ears that I call a head. Where's the dot cs? Somewhere in Codeplex, I'd read advice to another lost soul to "download and build the HTMLAgilityPack solution". As I've done so many times as an All Black prop, I glared at the opposition front row - ah, I mean the 3 new files. Shouldn't one of them have a ".cs" on the back of his jersey - er, on the end of its name? Or is this just how they play the game in Codeplex-sans-signs? Undaunted (props have more courage than sense) I packed into my first C# scrum. The half-back feeds the ball in, and the front rows collapse - er, the debugging stops at this line of my code: "HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();" Then the Referee blows his whistle and announces one of those verdicts that's utterly indecipherable to your average loose-head prop: Locating source for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'. 
Checksum: MD5 {62 bc f3 7e 9a 92 a6 32 7 d6 5b f8 76 59 7b 5b} The file 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs' does not exist. Looking in script documents for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'... Looking in the projects for 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'. The file was not found in a project. Looking in directory 'C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\vc7\atlmfc'... Looking in directory 'C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\vc7\crt'... The debugger will ask the user to find the file: C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs. The user pressed Cancel [a brain-stemmer from the prop] in the Find Source dialog. The debug source files settings for the active solution have been modified so that the debugger will not ask the user to find the file: C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs. The debugger could not locate the source file 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'. Even if it had been the first 50 stanzas of "Eskimo Nell", I couldn't have been more shocked. I'm so shocked, my jaws clamp shut around the opposition hooker's ear. He thumbs me in the iris. With a cornea-torn eye I peer at the Codeplex site. My brain stem sparks and I punch the "View all downloads" link. It sparks four more times on each download link, and.. lo! FOUR files this time: HAPExplorer.zip, HtmlAgilityPack.1.4.0.Source.zip, HtmlAgilityPack.1.4.0.zip, HtmlAgilityPack.Documentation.chm But... is this not the same place arrived at recently by my flat-mate Chaz, journalist extraordinaire? (Chaz, if you're reading this, I'm not plugging for nothing - just write kindly about me in your next report, okay?) Didn't these same four files flummox Chaz The Great? He told me about it. Chaz left a message with Codeplex and then solved the problem by just walking away. Typical journalist, huh. 
But I'm not like that. I don't walk away. I'm made of the sort of stubborn stuff that becomes an All Black prop. Hence this impassioned plea: GOOD TOWNSFOLK OF CODEPLEX-SANS-SIGNS, WHAT SHOULD I DO NEXT? Can somebody point me to Main Street? How does a simpleton install 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs'? I'm willing to prostrate myself and grovel to the first kind face that passes in front of my rapidly clouding sight. So help me, I'd even tug my forelock if I had one! Should I hold forth my rod over the wilderness, and create a folder called 'C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\' or some such? If so, what files should I move into it? ANYTHING else a dum-ass should know about? - and I mean ANYTHING - you just don't know how witless a punch-drunk prop can be.. %( Whenever I've installed other programs they've given me an ".exe" or ".msi" that I can click on and it's all done for me like magic. HEY... there's nothing of that nature here, is there? Am I missing something? Something for dummies to click? (From the waiting rooms of Dr I. Sight Phixes) (signed) The Prop

Read the article
Serialize HTMLDocument and then rendering it in the server?

- by Yang Bo

Hi, after some Google searching, I did not find anything that fills my need. I want to save the current web page just as it is. I mean, many web pages have JavaScript executed and CSS changed, so after some user interaction the web page may be different from the one that was first loaded into the browser. I want to save the current web page state to the server and render it on the server. Is there any JavaScript library for this task? Thanks!

Read the article
Check on null reference

- by Neir0

Hello, I have the following code:

    searchResults.SearchResultCollection.Add(
        new SearchResult()
        {
            Header = HttpUtility.HtmlDecode(
                htmlDocument.DocumentNode
                    .SelectSingleNode(initialXPath + "h3")
                    .InnerText),
            Link = HttpUtility.HtmlDecode(
                htmlDocument.DocumentNode
                    .SelectSingleNode(initialXPath + "div/cite")
                    .InnerText)
        }
    );

Sometimes htmlDocument.DocumentNode.SelectSingleNode(....) returns null and my application crashes with a NullReferenceException. Of course I can write code that checks the returned value for null, but then the code becomes overly verbose. What is a graceful way to do that?

Read the article
Html Agility Pack: DescendantsOrSelf() not returning HTML element

- by Program.X

I have some HTML, e.g.:

    <%@ Page Title="About Us" Language="C#" MasterPageFile="~/Site.master" AutoEventWireup="true" CodeBehind="ContentManagedTargetPage.aspx.cs" Inherits="xxx.ContentManagedTargetPage" %>
    <%@ Register TagPrefix="CxCMS" Namespace="xxx.ContentManagement.ASPNET.UI" Assembly="xxx.ContentManagement.ASPNET" %>
    <asp:Content ID="HeaderContent" runat="server" ContentPlaceHolderID="HeadContent">
    </asp:Content>
    <asp:Content ID="BodyContent" runat="server" ContentPlaceHolderID="MainContent">
        <h2> Content Managed </h2>
        <p> Put content here. [<CxCMS:ContentManagedPlaceHolder Key="keyThingy" runat="server" />] </p>
    </asp:Content>

And I want to find all the instances of the CxCMS:ContentManagedPlaceHolder element. I'm using HTML Agility Pack, which seems the best fit. However, despite looking at the [meagre] documentation, I can't get my code to work. I would expect the following to work:

    string searchForElement = "CxCMS:ContentManagedPlaceHolder";
    IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.Descendants(searchForElement);
    int count = contentPlaceHolderHtmlNodes.Count();

But I get nothing back. If I change to DescendantsOrSelf, I get the document node back, "#document", which is incorrect:

    string searchForElement = "CxCMS:ContentManagedPlaceHolder";
    IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.DescendantsOrSelf(searchForElement);
    int count = contentPlaceHolderHtmlNodes.Count();

I also tried using LINQ:

    string searchForElement = "CxCMS:ContentManagedPlaceHolder";
    IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.DescendantsOrSelf().Where(q=>q.Name==searchForElement);
    int count = contentPlaceHolderHtmlNodes.Count();

As none of these methods works, I moved on to using SelectNodes instead:

    string searchForElement = "CxCMS:ContentManagedPlaceHolder";
    string xPath = "//" + searchForElement; // "//CxCMS:ContentManagedPlaceHolder"
    var nodes = HtmlDocument.DocumentNode.SelectNodes(xPath);

This just throws the exception: "Namespace Manager or XsltContext needed. This query has a prefix, variable, or user-defined function.". I can't find any way of adding namespace management to the HtmlDocument object. What am I missing here? The DescendantsOrSelf() method works if using a "standard" HTML tag, such as "p", but not the one I have. Surely it should work? (It needs to!)

Read the article
How can I place zeroes to the left of a given number to a maximum of 6 digits including the given nu

- by Sergio Tapia

I have this method that receives an ID number and downloads an HTML website according to that ID. Typically, an IMDB link is like this:

    http://www.imdb.com/title/tt0892791/
    http://www.imdb.com/title/tt1226229/
    http://www.imdb.com/title/tt0000429/

They all follow the 'tt' then 7 digits, with lack of digits turning into zeroes to fill out the left spaces. How can I accomplish this using C#? I'm kind of stumped. Here's my method:

    /// <summary>
    /// Find a movie page using its precise IMDB id.
    /// </summary>
    /// <param name="id">IMDB Movie ID</param>
    /// <returns>Returns an HtmlDocument with the source code included.</returns>
    public HtmlDocument ByID(string id)
    {
        string url = String.Format("http://www.imdb.com/title/tt{0}/", id);
        HtmlDocument page = downloader.Load(url);
        return page;
    }

Thank you very much for your time, and if you are interested in helping out, you can check out the complete source code of TheFreeIMDB here: http://thefreeimdb.codeplex.com/
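The padding rule described above (always 7 digits, zero-filled on the left) can be sketched as follows. The question asks for C#, where String.PadLeft would be the natural tool; this sketch uses Java to stay consistent with the Swing examples elsewhere on this page, and the class name ImdbUrl and method titleUrl are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of TheFreeIMDB.
final class ImdbUrl {

    // Builds a title URL, left-padding the numeric id with zeroes to 7 digits.
    static String titleUrl(String id) {
        int number = Integer.parseInt(id.trim());
        if (number < 0 || number > 9_999_999) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        // %07d pads with leading zeroes to a width of 7; Locale.ROOT keeps ASCII digits.
        return String.format(Locale.ROOT, "http://www.imdb.com/title/tt%07d/", number);
    }

    public static void main(String[] args) {
        System.out.println(titleUrl("429"));    // http://www.imdb.com/title/tt0000429/
        System.out.println(titleUrl("892791")); // http://www.imdb.com/title/tt0892791/
    }
}
```

Parsing the id to a number first also rejects non-numeric input early, which a plain string pad would let through.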

Read the article
HTML file: add annotations through IHTMLDocument

- by peterchen

I need to add "annotations" to existing HTML documents, ideally in the form of string property values I can read and write by name. Apparently (to me), meta elements in the header seem to be the common way, i.e. adding/modifying elements like

    <head>
      <meta name="unique-id_property-name" content="property-value"/>
      ...
    </head>

Question 1: Is that "acceptable" / OK, or is there a better way to add metadata?

I have a little previous experience with getting/mut(il)ating HTML contents through the document in a web browser control. For this task, I've already loaded the HTML document into an HTMLDocument object, but I'm not sure how to go on:

    // what I have
    IHTMLDocument2Ptr doc;
    doc.CreateInstance(__uuidof(HTMLDocument));
    IPersistFile pf = doc;
    pf->Load(fileName, STGM_READ);
    // everything ok until here

Question 2: Should I be using anything other than HTMLDocument?

Questions 3..N: How do I get the head element? How do I get the value of a meta element with a given name? How do I set the value of a meta element (adding the item if and only if it doesn't exist yet)? doc->all returns a collection of all tags, which I can enumerate even though count returns 0. I could scan that for head, then scan that for all meta where the name starts with a certain string, etc., but this feels very clumsy.

Read the article
Using HTMLAgility Pack to Extract Links

- by Soham

Hi folks, consider this simplest piece of code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;

    namespace WebScraper
    {
        class Program
        {
            static void Main(string[] args)
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://www.google.com");
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                }
            }
        }
    }

This effectively doesn't do anything at all, and is copied/inspired from various other StackOverflow questions like this. When running this, there is a runtime error which says "Object reference not set to an instance of an object.", highlighting the foreach line. I can't understand why the environment has become irritable with this humble, innocent and useless piece of code. I would also like to know: does HTMLAgilityPack accept HTML classes as nodes?

Read the article
Setting HTML Text Element value

- by Gpx

Hi, in my C# WPF program I am trying to set the value of an HTML text element which is defined like:

    <input name="tbBName" type="text" id="tbBName" tabindex="1" />

What I found about it and tried is:

    mshtml.HTMLDocument doc = (mshtml.HTMLDocument)webBrowser1.Document;
    mshtml.HTMLInputTextElement tbName = (mshtml.HTMLInputTextElement)doc.getElementsByName("tbBName");
    tbName.value = "Test";

But I got the exception: Unable to cast COM object of type 'System.__ComObject' to interface type 'mshtml.HTMLInputTextElement'. This operation failed because the QueryInterface call on the COM component for the interface with IID '{3050F520-98B5-11CF-BB82-00AA00BDCE0B}' failed due to the following error: No such interface supported (Exception from HRESULT: 0x80004002 (E_NOINTERFACE)). I know what it says but I don't know which object I can use to access the textbox. Thanks for any answers.

Read the article
How can I create two constructors that act differently but receive the same data type?

- by Sergio Tapia

    public class Parser
    {
        Downloader download = new Downloader();
        HtmlDocument Page;

        public Parser(string MovieTitle)
        {
            Page = download.FindMovie(MovieTitle);
        }

        public Parser(string ActorName)
        {
            Page = download.FindActor(ActorName);
        }
    }

I want to create a constructor that will allow other developers who use this library to easily create a Parser object with the relevant HtmlDocument already loaded as soon as it's done creating it. The problem lies in that a constructor cannot exist twice with the same type of parameters. Sure, I can tell the logical difference between the two parameters, but the computer can't. Any suggestions on how to handle this? Thank you!

Read the article
Kind of stumped with some basic C# constructors.

- by Sergio Tapia

    public class Parser
    {
        Downloader download = new Downloader();
        HtmlDocument Page;

        public Parser(string MovieTitle)
        {
            Page = download.FindMovie(MovieTitle);
        }

        public Parser(string ActorName)
        {
            Page = download.FindActor(ActorName);
        }
    }

I want to create a constructor that will allow other developers who use this library to easily create a Parser object with the relevant HtmlDocument already loaded as soon as it's done creating it. The problem lies in that a constructor cannot exist twice with the same type of parameters. Sure, I can tell the logical difference between the two parameters, but the computer can't. Any suggestions on how to handle this? Thank you!

Read the article
How would I use HTMLAgilityPack to extract the value I want

- by Nai

For the given HTML I want the value of id:

    <div class="name" id="john-5745844">
    <div class="name" id="james-6940673">

UPDATE: This is what I have at the moment:

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.Load(new StringReader(pageResponse));
    HtmlNode root = htmlDoc.DocumentNode;
    List<string> anchorTags = new List<string>();
    foreach (HtmlNode div in root.SelectNodes("//div[@class='name' and @id]"))
    {
        HtmlAttribute att = div.Attributes["id"];
        Console.WriteLine(att.Value);
    }

The error I am getting is at the foreach line, stating: Object reference not set to an instance of an object.

Read the article
Avoiding a NullReferenceException

- by Nikhil K

I have used this code for extracting URLs from a web page, but on the 'foreach' line it throws an "Object reference not set to an instance of an object" exception. What is the problem? How can I correct it?

    WebClient client = new WebClient();
    string url = "http://www.google.co.in/search?hl=en&q=java&start=10&sa=N";
    string source = client.DownloadString(url);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(source);
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']"))
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }

Read the article
Parsing HTML tags to find a specific Table Row

- by moutonc

Hello everyone, I was set a challenge where I must parse an HTML page to find the end date of all the classes. I am using the HTMLAgilityPack, but this is the first time I have used it. Also, whoever set up the webpage used no classes or IDs, and the end dates are stored in a tr h4 tag. I am not sure how to parse through any hits. My code:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(txtURL.Text);
    sw.WriteLine("GET /academics/academic-calendar/ HTTP/1.1");
    sw.WriteLine();
    String response = sr.ReadToEnd();
    txtHTML.Text = response;

Read the article
Removing exception

- by Nikhil K

I have used this code for extracting URLs from a web page, but on the 'foreach' line it throws an "Object reference not set to an instance of an object" exception. What is the problem? How can I correct it?

    WebClient client = new WebClient();
    string url = "http://www.google.co.in/search?hl=en&q=java&start=10&sa=N";
    string source = client.DownloadString(url);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(source);
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']"))
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }

Read the article
HtmlAgilityPack giving problems with malformed html

- by Kapil

I want to extract meaningful text out of an HTML document and I am using html-agility-pack for this. Here is my code:

    string convertedContent = HttpUtility.HtmlDecode(ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)));

ConvertHtml:

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

ConvertTo:

    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }

Now in some cases when the HTML pages are malformed (for example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag like <meta content="text/html; charset=uft-8" http-equiv="Content-Type">; note "uft" instead of "utf"), my code is puking at the time I am trying to load the HTML document. Can someone suggest how I can overcome these malformed HTML pages and still extract relevant text out of an HTML document? Thanks, Kapil

Read the article
C# WebBrowser control not applying css

- by JamesL

I have a project that I am working on in VS2005. I have added a WebBrowser control. I add a basic empty page to the control:

    private const string _basicHtmlForm = "<html> "
        + "<head> "
        + "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/> "
        + "<title>Test document</title> "
        + "<script type='text/javascript'> "
        + "function ShowAlert(message) { "
        + " alert(message); "
        + "} "
        + "</script> "
        + "</head> "
        + "<body><div id='mainDiv'> "
        + "</div></body> "
        + "</html> ";
    private string _defaultFont = "font-family: Arial; font-size:10pt;";

    private void LoadWebForm()
    {
        try
        {
            _webBrowser.DocumentText = _basicHtmlForm;
        }
        catch(Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
    }

and then add various elements via the DOM (using _webBrowser.Document.CreateElement). I am also loading a CSS file:

    private void AddStyles()
    {
        try
        {
            mshtml.HTMLDocument currentDocument = (mshtml.HTMLDocument) _webBrowser.Document.DomDocument;
            mshtml.IHTMLStyleSheet styleSheet = currentDocument.createStyleSheet("", 0);
            TextReader reader = new StreamReader(Path.Combine(Path.GetDirectoryName(Application.ExecutablePath),"basic.css"));
            string style = reader.ReadToEnd();
            styleSheet.cssText = style;
        }
        catch(Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
    }

Here are the contents of the CSS file:

    body { background-color: #DDDDDD; }
    .categoryDiv { background-color: #999999; }
    .categoryTable { width:599px; background-color:#BBBBBB; }
    #mainDiv { overflow:auto; width:600px; }

The style sheet is loading successfully, but the only elements on the page that are being affected are the ones that are initially in the page (body and mainDiv). I have also tried including the CSS in a <style> element in the header section, but it still only affects the elements that are there when the page is created. So my question is, does anyone have any idea why the CSS is not being applied to elements that are created after the page is loaded? I have also tried not applying the CSS until after all of my elements are added, but the results don't change.

Read the article
How can I get a websites favicon when using the WebBrowser control?

- by Xaero

I have a Windows Forms WebBrowser control embedded in my application. Is there any way to get a web pages favicon using either the WebBrowser or HtmlDocument API? Even obtaining it from the local file system would suffice. Downloading the icon as a separate operation would be a last resort... Thanks.

Read the article
Why is this javascript getting permission denied?

- by poo

Permission denied for <http://xxx.com> to get property HTMLDocument.body from http://www.xxx.com. var c = parseFloat(cf.contentDocument.body.offsetHeight) + 30;

Read the article
