I think this is sufficiently different from similar questions to warrant a new one.
I have the following regex to match opening hyperlink tags in HTML. It requires the http(s):// part so that mailto: links are skipped:
<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>
When I run this through Nregex (with escaping removed) it matches correctly for the following test cases:
<a href="http://www.bbc.co.uk">
<a href="http://bbc.co.uk">
<a href="https://www.bbc.co.uk">
<a href="mailto:[email protected]">
However, when I run this in my C# code, it fails. Here is the matching code:
    public static IEnumerable<string> GetUrls(this string input, string matchPattern)
    {
        var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            // Return only the named "href" group, i.e. the URL itself.
            yield return match.Groups["href"].Value;
        }
    }
And my tests:
    @"<a href=""https://www.bbc.co.uk"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(1);
    @"<a href=""mailto:[email protected]"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(0);
The problem seems to be in the \\b(https?):// part, which I added. Removing it makes the normal URL test pass, but then the mailto: test fails.
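To narrow this down, here is a minimal repro sketch. It rests on an assumption, since the declaration of HtmlUrlRegexPattern is not shown above: that the pattern is stored in a verbatim (@"...") string, which the doubled quotes suggest. In a verbatim string, backslashes are not escape characters, so \\b would reach the regex engine as a literal backslash followed by b, not as a word boundary. The class name Repro, the helper CountUrls, and the address someone@example.com are placeholders invented for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Repro
{
    // Assumption: the pattern lives in a verbatim string. Here \\b stays as two
    // backslash characters, so the regex engine looks for a literal '\' then 'b'.
    public const string DoubleBackslash =
        @"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Same pattern with a single backslash: the engine sees \b, a word boundary.
    public const string SingleBackslash =
        @"<a[^>]*?href=[""'](?<href>\b(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Hypothetical helper: counts how many times the pattern matches the input.
    public static int CountUrls(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        const string link = @"<a href=""https://www.bbc.co.uk"">bbc</a>";
        // Placeholder address, not the one from the original test.
        const string mail = @"<a href=""mailto:someone@example.com"">bbc</a>";

        Console.WriteLine(CountUrls(link, DoubleBackslash)); // 0: no literal backslash in the input
        Console.WriteLine(CountUrls(link, SingleBackslash)); // 1
        Console.WriteLine(CountUrls(mail, SingleBackslash)); // 0
    }
}
```

If the pattern is instead declared in a regular (non-verbatim) string, \\b is already the correct way to write a word boundary and the doubled quotes would need to be \" instead, so this sketch would not explain the failure.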
Can anyone shed any light on this?