Search Results

Search found 32731 results on 1310 pages for 'regex for html'.

Page 4/1310 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • php regex to remove HTML

    - by Me1000
    Before we start, strip_tags() doesn't work. now, I've got some data that needs to be parsed, the problem is, I need to get rid of all the HTML that has been formated very strangely. the tags look like this: (notice the spaces) < p > blah blah blah < / p > < a href= " link.html " > blah blah blah < /a > All the regexs I've been trying aren't working, and I don't know enough about regex formating to make them work. I don't care about preserving anything inside of the tags, and would prefer to get rid of the text inside a link if I could. Anyone have any idea? (I really need to just sit down and learn regular expressions one day)

    Read the article

  • Perl Regex Multiple Items in Single String

    - by Sho Minamimoto
    I'm trying to parse a single string and get multiple chunks of data out from the same string with the same regex conditions. I'm parsing a single HTML doc that is static (For an undisclosed reason, I can't use an HTML parser to do the job.) I have an expression that looks like $string =~ /\<img\ssrc\="(.*)"/; and I want to get the value of $1. However, in the one string, there are many img tags like this, so I need something like an array returned (@1?) is this possible?

    Read the article

  • Split string into smaller part with constrain [PHP RegEx HTML]

    - by Sadi
    Hello, I need to split long string into a array with following constrains: Each part will have a limited number of character (e.g. not more than 8000 character) Each part can contain multiple sentences (delimited by . [full stop]) but never a partial sentences. Except if the last part of the string (as last part may not have any full stop. The string may contain HTML tags. But the tag can not be divided as ( to ). That means HTML tag should be intact. But starting tag and ending tag can be stay on different segment/chunk. I think regular expression with preg_split can do it. Would please help me with the proper RegEx. Thank you Sadi

    Read the article

  • Using regular expression to trim html

    - by Tim
    Been trying to solve this for a while now. I need a regex to strip the newlines, tabs and spaces between the html tags demonstrated in the example below: Source: <html> <head> <title> Some title </title> </head> </html> Wanted result: <html><head><title>Some title</title></head></html> The trimming of the whitespaces before the "Some title" is optional. I'd be grateful for any help

    Read the article

  • Parse html and find data in the html

    - by Dan.StackOverflow
    Hi all. I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table: <html> <table> <tr><td>Header</td></tr> <tr><td>Want This</td></tr> </table> </html> so lets try it: >>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml') >>> doc <lxml.etree._ElementTree object at 0x1a1c290> that looks good, lets see what else we have: >>> root = doc.getroot() >>> print(lxml.etree.tostring(root)) <html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html> LOL WUT? seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

    Read the article

  • RegEx to extract all HTML tag attributes including inline JavaScript

    - by Mike
    I found this useful regex code here while looking to parse HTML tag attributes: (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? It works great, but it's missing one key element that I need. Some attributes are event triggers that have inline Javascript code in them like this: onclick="doSomething(this, 'foo', 'bar');return false;" Or: onclick='doSomething(this, "foo", "bar");return false;' I can't figure out how to get the original expression to not count the quotes from the JS (single or double) while it's nested inside the set of quotes that contain the attribute's value.

    Read the article

  • RegEx - Match optional groups

    - by Maurizio
    I know RE is not the best way to scrape HTMLs, but this is it... I have some something like: <td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a> Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td> And I need to match the Writing and Art parts. But it is not said they're there and there could be other parts like Ink and Pencils... How do i do this ? I need to use pure Regex, no additional Python libs... Thanks !

    Read the article

  • JavaScript lazy regex for matching HTML tags

    - by Grnbeagle
    Hi, I'm having a problem writing a regular expression for matching HTML tags. I found a similar entry here, but this didn't quite work in my case. Here's my test string: <div id="div0" class="myclass">here's some text that may include whitespace</div><div id="div1" class="myclass"> and some more here </div> And here's my regex based on the aforementioned entry: <div[^>]*class="myclass">[^~]*?<\/div> Note that I need to match the first instance of <div /> with class of "myclass." The content may have carriage returns. These <div> tags won't be nested. Here's a rubular page for testing: http://rubular.com/r/vlfcikKMXk

    Read the article

  • RegEx - Remove HTML hyperlinks based on the link text

    - by Rob
    Hi, I have some text that has HTML hyper-links in it. I want to remove the hyperlinks, but only specific ones. e.g. I start with this: This is text <a href="link/to/somewhere">Link to Remove</a> and more text with another link <a href="/link/to/somewhere/else">Keep this link</a> I want to have: This is text and more text with another link <a href="/link/to/somewhere/else">Keep this link</a> I have this RegEx expression, <a\s[^>]*>.*?</a> ... but it matches ALL of the links. What do I need to add to that expression to match only the links with the link-text 'Remove' (for example) in it? thanks in advance.

    Read the article

  • How to regex match text with different endings?

    - by Mint
    This is what I have at the moment. <h2>header</h2>\n +<p>(.*)<br />|</p> ^ that is a tab space, didn't know if there was a better way to represent one or more (it seems to work) Im trying to match the 'bla bla.' text, but my current regex doesn't quite work, it will match most of the line, but I want it to match the first <h2>Information</h2> <p>bla bla.<br /><br /><a href="http://www.google.com">google</a><br /> or <h2>Information</h2> <p>bla bla.</p> other code... Oh and my php code: preg_match('#h2>header</h2>\n +<p>(.*)<br />|</p>#', $result, $postMessage);

    Read the article

  • Managing JS and CSS for a static HTML web application

    - by Josh Kelley
    I'm working on a smallish web application that uses a little bit of static HTML and relies on JavaScript to load the application data as JSON and dynamically create the web page elements from that. First question: Is this a fundamentally bad idea? I'm unclear on how many web sites and web applications completely dispense with server-side generation of HTML. (There are obvious disadvantages of JS-only web apps in the areas of graceful degradation / progressive enhancement and being search engine friendly, but I don't believe that these are an issue for this particular app.) Second question: What's the best way to manage the static HTML, JS, and CSS? For my "development build," I'd like non-minified third-party code, multiple JS and CSS files for easier organization, etc. For the "release build," everything should be minified, concatenated together, etc. If I was doing server-side generation of HTML, it'd be easy to have my web framework generate different development versus release HTML that includes multiple verbose versus concatenated minified code. But given that I'm only doing any static HTML, what's the best way to manage this? (I realize I could hack something together with ERB or Perl, but I'm wondering if there are any standard solutions.) In particular, since I'm not doing any server-side HTML generation, is there an easy, semi-standard way of setting up my static HTML so that it contains code like <script src="js/vendors/jquery.js"></script> <script src="js/class_a.js"></script> <script src="js/class_b.js"></script> <script src="js/main.js"></script> at development time and <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js"></script> <script src="js/entire_app.min.js"></script> for release?

    Read the article

  • parsing simple html for iphone

    - by sitara
    I have a very simple html page to parse. The html page will remain simple always. as simple as this <html> <head><title>title</title></head> <body>some data here</body> </html> I have fetched the html content of such an html page and have it in an NSString. I want to get what ever data is there in the body of the html page. Please tell me how can this be done and let me know if there are more than one possible ways. I would prefer doing it using basic obj-c if it is possible. Thanks

    Read the article

  • javascript regex: replace url text link with image,but not in html tags

    Hi this is my pice of code: <div style="overflow: hidden; width: 445px;">[IMG]http://i29.tinypic.com/mydog.png[/IMG] tak si to http://i29.tinypic.com/mycat.png Lorem ipsum loremai <img width="15" border="0" align="middle" src="images/smejo.gif" valign="middle"/> <img src=http://www.example.com/index.png alt> <img src="http://www.example.com/index.png" alt>     <a href="#reakcia" title="reagovat na temu"><span class="poradna-tl-reaguj"><reaction> </span></a></div> </td> </tr><img src=http://www.example.com/index.png alt><img src="http://www.example.com/index.png" alt> and i need regex pattern to replace ONLY text image links with image without touch of inner url tags. But i can't use "Lookbehind" or possessive quantifiers because JS don't support them=/ So i want to catch only "http://i29.tinypic.com/mydog.png" and "http://i29.tinypic.com/mycat.png". I using array method to replacing (will be greasemonkey script.) Many Thanks

    Read the article

  • Regex: Strip HTML attributes except SRC

    - by Ian Silber
    Hi, I'm trying to write a regular expression that will strip all tag attributes except for the SRC attribute. For example: <p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p> Would be returned as: <p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p> I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far: <?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>'); Using PHP's preg_replace() for this. Thanks! Ian

    Read the article

  • Match HTML tags in two strings using regex in Python

    - by jack
    I want to verify that the HTML tags present in a source string are also present in a target string. For example: >> source = '<em>Hello</em><label>What's your name</label>' >> verify_target(’<em>Hi</em><label>My name is Jim</label>') True >> verify_target('<label>My name is Jim</label><em>Hi</em>') True >> verify_target('<em>Hi<label>My name is Jim</label></em>') False

    Read the article

  • Ajax Control Toolkit July 2011 Release and the New HTML Editor Extender

    - by Stephen Walther
    I’m happy to announce the July 2011 release of the Ajax Control Toolkit which includes important bug fixes and a completely new HTML Editor Extender control. You can download the July 2011 Release by visiting the Ajax Control Toolkit CodePlex site at: http://AjaxControlToolkit.CodePlex.com Using the New HTML Editor Extender Control You can use the new HTML Editor Extender to extend any standard ASP.NET TextBox control so that it supports rich formatting such as bold, italics, bulleted lists, numbered lists, typefaces and different foreground and background colors. The following code illustrates how you can extend a standard ASP.NET TextBox control with the HtmlEditorExtender: <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Simple.aspx.cs" Inherits="WebApplication1.Simple" %> <%@ Register TagPrefix="asp" Namespace="AjaxControlToolkit" Assembly="AjaxControlToolkit" %> <html xmlns="http://www.w3.org/1999/xhtml"> <head runat="server"> <title>Simple</title> </head> <body> <form id="form1" runat="server"> <asp:ToolkitScriptManager runat="Server" /> <asp:TextBox ID="txtComments" TextMode="MultiLine" Columns="60" Rows="8" runat="server" /> <asp:HtmlEditorExtender TargetControlID="txtComments" runat="server" /> </form> </body> </html> This page has the following three controls: ToolkitScriptManager – The ToolkitScriptManager renders all of the scripts required by the Ajax Control Toolkit. TextBox – The TextBox control is a standard ASP.NET TextBox which is set to display multiple lines (a TextArea instead of an Input element). HtmlEditorExtender – The HtmlEditorExtender is set to extend the TextBox control. You can use the standard TextBox Text property to read the rich text entered into the TextBox control on the server. Lightweight and HTML5 The HTML Editor Extender works on all modern browsers including the most recent versions of Mozilla Firefox (Firefox 5), Google Chrome (Chrome 12), and Apple Safari (Safari 5). Furthermore, the HTML Editor Extender is compatible with Microsoft Internet Explorer 6 and newer. The HTML Editor Extender is very lightweight. It takes advantage of the HTML5 ContentEditable attribute so it does not require an iframe or complex browser workarounds. If you select View Source in your browser while using the HTML Editor Extender, we hope that you will be pleasantly surprised by how little markup and script is generated by the HTML Editor Extender. Customizable Toolbar Buttons Depending on the web application that you are building, you will want to display different toolbar buttons with the HTML Editor Extender. One of the design goals of the HTML Editor Extender was to make it very easy for you to customize the toolbar buttons. Imagine, for example, that you want to use the HTML Editor Extender when accepting comments on blog posts. In that case, you might want to restrict the type of formatting that a user can display. You might want to enable a user to format text as bold or italic but you do not want the user to make any other formatting changes. The following page illustrates how you can customize the HTML Editor Extender toolbar: <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="CustomToolbar.aspx.cs" Inherits="WebApplication1.CustomToolbar" %> <%@ Register TagPrefix="asp" Namespace="AjaxControlToolkit" Assembly="AjaxControlToolkit" %> <html> <head runat="server"> <title>Custom Toolbar</title> </head> <body> <form id="form1" runat="server"> <asp:ToolkitScriptManager Runat="server" /> <asp:TextBox ID="txtComments" TextMode="MultiLine" Columns="50" Rows="10" Text="Hello <b>world!</b>" Runat="server" /> <asp:HtmlEditorExtender TargetControlID="txtComments" runat="server"> <Toolbar> <asp:Bold /> <asp:Italic /> </Toolbar> </asp:HtmlEditorExtender> </form> </body> </html> Notice that the HTML Editor Extender in the page above has a Toolbar subtag. You can list the toolbar buttons which you want to appear within the subtag. In the case above, only Bold and Italic buttons are displayed. Here is a complete list of the Toolbar buttons currently supported by the HTML Editor Extender: Undo Redo Bold Italic Underline StrikeThrough Subscript Superscript JustifyLeft JustifyCenter JustifyRight JustifyFull InsertOrderedList InsertUnorderedList CreateLink UnLink RemoveFormat SelectAll UnSelect Delete Cut Copy Paste BackgroundColorSelector ForeColorSelector FontNameSelector FontSizeSelector Indent Outdent InsertHorizontalRule HorizontalSeparator Of course the HTML Editor Extender was designed to be extensible. You can create your own buttons and add them to the control. Compatible with the AntiXSS Library When using the HTML Editor Extender on a public facing website, we strongly recommend that you use the HTML Editor Extender with the AntiXSS Library. If you allow users to submit arbitrary HTML, and you don’t take any action to strip out malicious markup, then you are opening your website to Cross-Site Scripting Attacks (XSS attacks). The HTML Editor Extender uses the Provider Model to support different Sanitizer Providers. The July 2011 release of the Ajax Control Toolkit ships with a single Sanitizer Provider which uses the AntiXSS library (see http://AntiXss.CodePlex.com ). A Sanitizer Provider is responsible for sanitizing HTML markup by removing any malicious elements, attributes, and attribute values. For example, the AntiXss Sanitizer Provider will take the following block of HTML: <b><a href=""javascript:doEvil()"">Visit Grandma</a></b> <script>doEvil()</script> And return the following sanitized block of HTML: <b><a href="">Visit Grandma</a></b> Notice that the JavaScript href and <SCRIPT> tag are both stripped out. Be aware that there are a depressingly large number of ways to sneak evil markup into your HTML. You definitely want a Sanitizer as a safety net. Before you can use the AntiXSS Sanitizer Provider, you must add three assemblies to your web application: AntiXSSLibrary.dll, HtmlSanitizationLibrary.dll, and SanitizerProviders.dll. All three assemblies are included with the CodePlex download of the Ajax Control Toolkit in the SanitizerProviders folder. Here’s how you modify your web.config file to use the AntiXSS Sanitizer Provider: <configuration> <configSections> <sectionGroup name="system.web"> <section name="sanitizer" requirePermission="false" type="AjaxControlToolkit.Sanitizer.ProviderSanitizerSection, AjaxControlToolkit"/> </sectionGroup> </configSections> <system.web> <compilation targetFramework="4.0" debug="true"/> <sanitizer defaultProvider="AntiXssSanitizerProvider"> <providers> <add name="AntiXssSanitizerProvider" type="AjaxControlToolkit.Sanitizer.AntiXssSanitizerProvider"></add> </providers> </sanitizer> </system.web> </configuration> You can detect whether the HTML Editor Extender is using the AntiXSS Sanitizer Provider by checking the HtmlEditorExtender SanitizerProvider property like this: if (MyHtmlEditorExtender.SanitizerProvider == null) { throw new Exception("Please enable the AntiXss Sanitizer!"); } When the SanitizerProvider property has the value null, you know that a Sanitizer Provider has not been configured in the web.config file. Because the AntiXSS library requires Full Trust, you cannot use the AntiXSS Sanitizer Provider with most shared website hosting providers. Because most shared hosting providers only support Medium Trust and not Full Trust, we do not recommend using the HTML Editor Extender with a public website hosted with a shared hosting provider. Why a New HTML Editor Control? The Ajax Control Toolkit now includes two HTML Editor controls. Why did we introduce a new HTML Editor control when there was already an existing HTML Editor? We think you will like the new HTML Editor much more than the previous one. We had several goals with the new HTML Editor Extender: Lightweight – We wanted to leverage HTML5 to create a lightweight HTML Editor. The new HTML Editor generates much less markup and script than the previous HTML Editor. Secure – We wanted to make it easy to integrate the AntiXSS library with the HTML Editor. If you are creating a public facing website, we strongly recommend that you use the AntiXSS Provider. Customizable – We wanted to make it easy for users to customize the toolbar buttons displayed by the HTML Editor. Compatibility – We wanted to ensure that the HTML Editor will work with the latest versions of the most popular browsers (including Internet Explorer 6 and higher). The old HTML Editor control is still included in the Ajax Control Toolkit and continues to live in the AjaxControlToolkit.HTMLEditor namespace. We have not modified the control and you can continue to use the control in the same way as you have used it in the past. However, we hope that you will consider migrating to the new HTML Editor Extender for the reasons listed above. Summary We’ve introduced a new Ajax Control Toolkit control with this release. I want to thank the developers and testers on the Superexpert team for the huge amount of work which they put into this control. It was a non-trivial task to build an entirely new control which has the complexity of the HTML Editor in less than 6 weeks. Please let us know what you think! We want to hear your feedback. If you discover issues with the new HTML Editor Extender control, or you have questions about the control, or you have ideas for how it can be improved, then please post them to this blog. Tomorrow starts a new sprint

    Read the article

  • JavaScript Regex Problem

    - by oraclee
    Hi all; Csharp Regex Pattern: Regex rg = new Regex("(?i)(?<=>)[^<]+(?=</TD>)"); JavaScript Regex Pattern: var pattern = (?i)(?<=>)[^<]+(?=</TD>); var result = str.match(pattern); Csharp Regex pattern work, but javascript regex pattern not work pls help ?

    Read the article

  • c# Truncate HTML safely for article summary

    - by WickedW
    Hi All, Does anyone have a c# variation of this? This is so I can take some html and display it without breaking as a summary lead in to an article? http://stackoverflow.com/questions/1193500/php-truncate-html-ignoring-tags Save me from reinventing the wheel! Thank you very much ---------- edit ------------------ Sorry, new here, and your right, should have phrased the question better, heres a bit more info I wish to take a html string and truncate it to a set number of words (or even char length) so I can then show the start of it as a summary (which then leads to the main article). I wish to preserve the html so I can show the links etc in preview. The main issue I have to solve is the fact that we may well end up with unclosed html tags if we truncate in the middle of 1 or more tags! The idea I have for solution is to a) truncate the html to N words (words better but chars ok) first (be sure not to stop in the middle of a tag and truncate a require attribute) b) work through the opened html tags in this truncated string (maybe stick them on stack as I go?) c) then work through the closing tags and ensure they match the ones on stack as I pop them off? d) if any open tags left on stack after this, then write them to end of truncated string and html should be good to go!!!! -- edit 12112009 Here is what I have bumbled together so far as a unittest file in VS2008, this 'may' help someone in future My hack attempts based on Jan code are at top for char version + word version (DISCLAIMER: this is dirty rough code!! on my part) I assume working with 'well-formed' HTML in all cases (but not necessarily a full document with a root node as per XML version) Abels XML version is at bottom, but not yet got round to fully getting tests to run on this yet (plus need to understand the code) ... I will update when I get chance to refine having trouble with posting code? is there no upload facility on stack? Thanks for all comments :) using System; using System.Collections.Generic; using System.Text.RegularExpressions; using System.Xml; using System.Xml.XPath; using Microsoft.VisualStudio.TestTools.UnitTesting; namespace PINET40TestProject { [TestClass] public class UtilityUnitTest { public static string TruncateHTMLSafeishChar(string text, int charCount) { bool inTag = false; int cntr = 0; int cntrContent = 0; // loop through html, counting only viewable content foreach (Char c in text) { if (cntrContent == charCount) break; cntr++; if (c == '<') { inTag = true; continue; } if (c == '>') { inTag = false; continue; } if (!inTag) cntrContent++; } string substr = text.Substring(0, cntr); //search for nonclosed tags MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr); MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr); // create stack Stack<string> opentagsStack = new Stack<string>(); Stack<string> closedtagsStack = new Stack<string>(); // to be honest, this seemed like a good idea then I got lost along the way // so logic is probably hanging by a thread!! foreach (Match tag in openedTags) { string openedtag = tag.Value.Substring(1, tag.Value.Length - 2); // strip any attributes, sure we can use regex for this! if (openedtag.IndexOf(" ") >= 0) { openedtag = openedtag.Substring(0, openedtag.IndexOf(" ")); } // ignore brs as self-closed if (openedtag.Trim() != "br") { opentagsStack.Push(openedtag); } } foreach (Match tag in closedTags) { string closedtag = tag.Value.Substring(2, tag.Value.Length - 3); closedtagsStack.Push(closedtag); } if (closedtagsStack.Count < opentagsStack.Count) { while (opentagsStack.Count > 0) { string tagstr = opentagsStack.Pop(); if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek()) { substr += "</" + tagstr + ">"; } else { closedtagsStack.Pop(); } } } return substr; } public static string TruncateHTMLSafeishWord(string text, int wordCount) { bool inTag = false; int cntr = 0; int cntrWords = 0; Char lastc = ' '; // loop through html, counting only viewable content foreach (Char c in text) { if (cntrWords == wordCount) break; cntr++; if (c == '<') { inTag = true; continue; } if (c == '>') { inTag = false; continue; } if (!inTag) { // do not count double spaces, and a space not in a tag counts as a word if (c == 32 && lastc != 32) cntrWords++; } } string substr = text.Substring(0, cntr) + " ..."; //search for nonclosed tags MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr); MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr); // create stack Stack<string> opentagsStack = new Stack<string>(); Stack<string> closedtagsStack = new Stack<string>(); foreach (Match tag in openedTags) { string openedtag = tag.Value.Substring(1, tag.Value.Length - 2); // strip any attributes, sure we can use regex for this! if (openedtag.IndexOf(" ") >= 0) { openedtag = openedtag.Substring(0, openedtag.IndexOf(" ")); } // ignore brs as self-closed if (openedtag.Trim() != "br") { opentagsStack.Push(openedtag); } } foreach (Match tag in closedTags) { string closedtag = tag.Value.Substring(2, tag.Value.Length - 3); closedtagsStack.Push(closedtag); } if (closedtagsStack.Count < opentagsStack.Count) { while (opentagsStack.Count > 0) { string tagstr = opentagsStack.Pop(); if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek()) { substr += "</" + tagstr + ">"; } else { closedtagsStack.Pop(); } } } return substr; } public static string TruncateHTMLSafeishCharXML(string text, int charCount) { // your data, probably comes from somewhere, or as params to a methodint XmlDocument xml = new XmlDocument(); xml.LoadXml(text); // create a navigator, this is our primary tool XPathNavigator navigator = xml.CreateNavigator(); XPathNavigator breakPoint = null; // find the text node we need: while (navigator.MoveToFollowing(XPathNodeType.Text)) { string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length)); charCount -= navigator.Value.Length; if (charCount <= 0) { // truncate the last text. Here goes your "search word boundary" code: navigator.SetValue(lastText); breakPoint = navigator.Clone(); break; } } // first remove text nodes, because Microsoft unfortunately merges them without asking while (navigator.MoveToFollowing(XPathNodeType.Text)) { if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After) { navigator.DeleteSelf(); } } // moves to parent, then move the rest navigator.MoveTo(breakPoint); while (navigator.MoveToFollowing(XPathNodeType.Element)) { if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After) { navigator.DeleteSelf(); } } // moves to parent // then remove *all* empty nodes to clean up (not necessary): // TODO, add empty elements like <br />, <img /> as exclusion navigator.MoveToRoot(); while (navigator.MoveToFollowing(XPathNodeType.Element)) { while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "") { navigator.DeleteSelf(); } } // moves to parent navigator.MoveToRoot(); return navigator.InnerXml; } [TestMethod] public void TestTruncateHTMLSafeish() { // Case where we just make it to start of HREF (so effectively an empty link) // 'simple' nested none attributed tags Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>", TruncateHTMLSafeishChar( @"<h1>1234</h1><b><i>56789</i>012345</b>", 12)); // In middle of a! Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>", TruncateHTMLSafeishChar( @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>", 7)); // more Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>", TruncateHTMLSafeishChar( @"<div><b><i><strong>12</strong></i></b></div>", 1)); // br Assert.AreEqual(@"<h1>1 3 5</h1><br />6", TruncateHTMLSafeishChar( @"<h1>1 3 5</h1><br />678<br />", 6)); } [TestMethod] public void TestTruncateHTMLSafeishWord() { // zero case Assert.AreEqual(@" ...", TruncateHTMLSafeishWord( @"", 5)); // 'simple' nested none attributed tags Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three ...</i></b>", TruncateHTMLSafeishWord( @"<h1>one two <br /></h1><b><i>three </i>four</b>", 3), "we have added ' ...' to end of summary"); // In middle of a! Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four ...</b></a>", TruncateHTMLSafeishWord( @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>", 4)); // start of h1 Assert.AreEqual(@"<h1>one two three ...</h1>", TruncateHTMLSafeishWord( @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>", 3)); // more than words available Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...", TruncateHTMLSafeishWord( @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>", 99)); } [TestMethod] public void TestTruncateHTMLSafeishWordXML() { // zero case Assert.AreEqual(@" ...", TruncateHTMLSafeishWord( @"", 5)); // 'simple' nested none attributed tags string output = TruncateHTMLSafeishCharXML( @"<body><h1>one two </h1><b><i>three </i>four</b></body>", 13); Assert.AreEqual(@"<body>\r\n <h1>one two </h1>\r\n <b>\r\n <i>three</i>\r\n </b>\r\n</body>", output, "XML version, no ... yet and addeds '\r\n + spaces?' to format document"); // In middle of a! Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four ...</b></a>", TruncateHTMLSafeishCharXML( @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>", 4)); // start of h1 Assert.AreEqual(@"<h1>one two three ...</h1>", TruncateHTMLSafeishCharXML( @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>", 3)); // more than words available Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...", TruncateHTMLSafeishCharXML( @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>", 99)); } } }

    Read the article

  • Repairing malformatted html attributes using c#

    - by jhoefnagels
    I have a web application with an upload functionality for HTML files generated by chess software to be able to include a javascript player that reproduces a chess game. I do not like to load the uploaded files in a frame so I reconstruct the HTML and javascript generated by the software by parsing the dynamic parts of the file. The problem with the HTML is that all attributes values are surrounded with an apostrophe instead of a quotation mark. I am looking for a way to fix this using a library or a regex replace using c#. The html looks like this: <DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD> and I would transform it into: <DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>

    Read the article

  • Regex to parse a multiline HTML

    - by dreamer
    am trying to parse a multi-line html file using regex. HTML code: < td>Details< /td> < /tr> < tr class=d1> < td>uss_vod_translator< /td> Regex Expression: if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/) { print "$1"; } I am using /s* (space) for multi-line, but it is not working. I searched about it, even used /\? for multi-line but that too did not work. Can any one please suggest me how to parse a multiline HTML?

    Read the article

  • If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

    - by Andy E
    I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing for me, I always thought that in general, the best way to parse any complicated string is to use a regular expression. So how does a HTML parser work? Doesn't it use regular expressions to parse. One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert a HTML string to DOM nodes. Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.

    Read the article

  • 301 redirect from "/index.html" to root if index.html not exist

    - by Andrij Muzychka
    Can I create 301 redirect from "index.html" to root directory if file "index.html" not exist? For example: link "http://example.com/index.html" show "404 Error" page. I need 301 redirect to root directory: "http://example.com/" in .htaccess I add rule: Options +FollowSymLinks RewriteCond %{THE_REQUEST} ^.*/index.html RewriteRule ^(.*)index.html$ http://example.com/$1 [R=301,L] but it doesn't work. Can you help me solve this problem?

    Read the article

  • C# RegEx - find html tags (div and anchor)

    - by czesio
    Hi I have to retrieve several div section (of specific class name "row ") with it's content, and additionally find all anchor tags (link urls) (with class "underline red bold"). Shortly speaing : get section of: ... (divs, tags ...) and collections of urls string[] urls = {"/searchClickThru? pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"} the entire page looks like that: <html> ... a lot of stuff <div class="row "> <div class="photo"> <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"> <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f0607827.jpg"> </a> </div> <div class="desc"> <div class="l1"> <div class="icons"> </div> <table cellspacing="0" cellpadding="0" border="0"> <tbody> <tr> <td> <div class="fleft"> <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"> Culture And Gender <br>Intimate Relation</a> </div> <div class="fleft"> </div> </td> </tr> </tbody> </table> </div> <div class="l2"> <div> </div> <div> <div class="but"> </div> </div> </div> <div class="l3"> Long description <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"> more<img alt="" src="/b/img/arr_red_sm.gif"> </a> </div> </div> </div> <div class="omit"></div> <div class="row "> <div class="photo"> <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534899,p"> <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f06078222.jpg"> </a> </div> <div class="desc"> <div class="l1"> <div class="icons"> </div> <table cellspacing="0" cellpadding="0" border="0"> <tbody> <tr> <td> <div class="fleft"> <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod5653489225,p"> Culture And Gender <br>Intimate Relation</a> </div> <div class="fleft"> </div> </td> </tr> </tbody> </table> </div> <div class="l2"> <div> </div> <div> <div class="but"> </div> </div> </div> <div class="l3"> Long description <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"> more<img alt="" src="/b/img/arr_red_sm.gif"> </a> </div> </div> </div> Can anybody help me to create suitable reg ex?

    Read the article

  • Trying to do a batch rename, can't figure out the proper RegEx

    - by trezy
    I'm trying to rename my movie collection. All of the files are currently named using dots instead of spaces, i.e. Men.in.Black.avi. I want to replace all of the dots with spaces which isn't terribly difficult, but I need to preserve the last dot for the file extension, i.e. .avi, .mp4, .ogg, etc. My Googling has provided no solutions. I'm also a Javascript developer and could see some snazzy applications for it. So, any suggestions?

    Read the article

  • RegEx match open tags except XHTML self-contained tags

    - by Jeff
    I need to match all of these opening tags: <p> <a href="foo"> But not these: <br /> <hr class="foo" /> I came up with this and wanted to make sure I've got it right. I am only capturing the a-z. <([a-z]+) *[^/]*?> I believe it says: Find a less-than, then Find (and capture) a-z one or more times, then Find zero or more spaces, then Find any character zero or more times, greedy, except /, then Find a greater-than Do I have that right? And more importantly, what do you think? =) EDIT: Hmm, which answer to mark as correct? For the record, ALL the answers are appreciated. Many thanks!

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >