I'd like to "grab" a few hundred urls from a few hundred html pages.
Pattern:
<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
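One possible way to pull the links out, sketched in Python under the assumption that every link follows the `<h2><a href="...">` shape shown above. The sample `html` string and the helper name `grab_urls` are made up for illustration.

```python
import re

# Hypothetical sample page; real input would be the text of each HTML page.
html = '''<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
<h2><a href="http://example.com/urls.asp?urlid=2" target="_blank">Another Site</a></h2>'''

# Capture the href value of an <a> tag that directly follows an <h2> tag.
H2_LINK = re.compile(r'<h2>\s*<a\s+href="([^"]+)"', re.IGNORECASE)

def grab_urls(page):
    """Return every URL that appears in the <h2><a href="..."> pattern."""
    return H2_LINK.findall(page)

urls = grab_urls(html)
```

For pages whose markup varies more than this, an HTML parser would probably be more robust than a regular expression.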
Hi All,
I am working on a regular expression to validate a string literal in C. Here is how far I have gotten:
'^"[A-Za-z0-9]*[\t\n]*"$'
for rules
- A string should begin with double quotes
- May not contain a newline character
However, I am not able to capture the rule for allowing '\' or '"' in a string if preceded with '\'. Here is what I tried:
'^"[A-Za-z0-9]*[\t\n]*[\\\|\\"]?"$'
But this doesn't seem to work. What might be wrong with the regular expression here?
Regards,
darkie15
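A minimal sketch of the usual "ordinary character or escaped character" approach, written with Python's `re` syntax rather than the POSIX syntax the question seems to use. It assumes any character is allowed inside the quotes except a bare quote, a bare backslash or a newline; the name `is_c_string` is illustrative only.

```python
import re

# Opening quote, then any number of either:
#   - an ordinary character (anything except quote, backslash or newline), or
#   - a backslash followed by any single non-newline character,
# then the closing quote.
C_STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

def is_c_string(text):
    """Return True if the whole text is one double-quoted string."""
    return C_STRING.fullmatch(text) is not None
```

The attempt with `[\\\|\\"]?` places the escape handling only at the very end of the string and allows it at most once, whereas the alternation inside the repeated group lets an escape appear anywhere.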
I'd like to be able to parse out the city, state or zip from a string in python. So, if I entered
Boulder, Co
80303
Boulder, Colorado
Boulder, Co 80303
...
For any variation of these, it should return the city, state or zip.
This is all user-entered data, typed into a single text field.
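A rough sketch of one way to approach this, assuming the input is always "City, State", a five-digit zip, or both in that order. The pattern and the name `parse_location` are illustrative, and real user input will likely need more cases than this handles.

```python
import re

# Optional "city, state" part followed by an optional 5-digit zip code.
LOCATION = re.compile(
    r"^\s*(?:(?P<city>[^,\d]+?)\s*,\s*(?P<state>[A-Za-z]+(?: [A-Za-z]+)*))?"
    r"\s*(?P<zip>\d{5})?\s*$"
)

def parse_location(text):
    """Return a dict with city, state and zip (None when a part is absent)."""
    match = LOCATION.match(text)
    if match is None:
        return None
    return match.groupdict()
```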
I have been trying several regular expressions
$str =~ s/^0+(.)/$1/;
converts 0000 to 0 and 0001 to 1
$str =~ s/^0+./$1/;
converts 0000 to empty string, 000100 to 00, 0001100 to 100.
What difference do the parentheses make?
I have this situation(Java code):
1) a string such as : "A wild adventure" should match.
2) a string with adjacent repeated words: "A wild wild adventure" shouldn't match.
With this regular expression: .* \b(\w+)\b\s*\1\b.* I can match strings containing adjacent repeated words.
How do I reverse the situation, i.e. how do I match strings which do not contain adjacent repeated words?
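One common way to invert a pattern is to wrap it in a negative lookahead anchored at the start of the string. The sketch below uses Python for illustration (the lookahead and backreference syntax is also accepted by `java.util.regex`); the name `has_no_repeat` is made up.

```python
import re

# Succeeds only if no word is immediately followed by the same word.
NO_REPEAT = re.compile(r"^(?!.*\b(\w+)\b\s+\1\b).*$")

def has_no_repeat(text):
    """Return True when the text contains no adjacent repeated word."""
    return NO_REPEAT.match(text) is not None
```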
I just discovered http://code.google.com/p/re2, a promising library that uses a long-neglected way (Thompson NFA) to implement a regular expression engine that can be orders of magnitude faster than the available engines of awk, Perl, or Python.
So I downloaded the code and did the usual sudo make install thing. However, that seemingly did little more than add /usr/local/include/re2/re2.h to my system. There seemed to be some *.a file in addition, but then what is it with this *.a extension?
I would like to use re2 from Python (preferably Python 3.1) and was excited to see files like make_unicode_groups.py in the distro (maybe just used during the build process?). Those, however, were not deployed on my machine.
How can I use re2 from Python?
Hi
I have a list of calling codes for all countries (the phone number prefixes). I would like to split them up into the
country name and the actual code so I can put them into an XML file.
I have tried back and forth but cannot get a regexp going that takes all cases into account.
I think it is fairly simple for someone with a bit of experience.
The codes have these formats:
Afghanistan 93
Anguilla 1 264
Antarctica 6721
Antigua and Barbuda 1 268
Bosnia and Herzegovina 387
Canada 1
Congo, Republic of the 242
Cote d'Ivoire 225
Ireland (Eire) 353
United States of America 1
There are around 235 of them in total, but these are the regulars and the exceptions.
^[a-zA-Z\s,'()] for between 1 and X words and then it is [0-9\s]{1,5}$ for the numbers:
X
XX
XXX
XXXX
X XXX
So if I should express it as a sentence it would be: "from beginning of a line, take all characters (1) including space,'() until you encounter digits, then take all of these including space(2) until you encounter a line break."
I am using TextMate, and the docs say:
TextMate uses the Oniguruma regular
expression library by K. Kosako.
I would appreciate any help given:)
Thank you.
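A small sketch of the sentence above as a pattern: a lazy country name, then whitespace, then a run of digits and spaces up to the end of the line. It is written in Python for illustration, although the same pattern should behave the same way in Oniguruma; `split_entry` is a made-up helper name.

```python
import re

# Lazy name first, so the trailing "digits and spaces" part gets the whole code.
ENTRY = re.compile(r"^(?P<country>.+?)\s+(?P<code>\d[\d ]*)$")

def split_entry(line):
    """Split 'Country name 1 264' into ('Country name', '1 264')."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    return match.group("country"), match.group("code")
```

In a find-and-replace dialog the two groups could then be referenced in the replacement to build the XML elements.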
I have a long regexp with two complicated subpatterns inside. How can I match those subpatterns in any order?
Simplified example:
/(apple)?\s?(banana)?\s?(orange)?\s?(kiwi)?/
and I want to match both of
apple banana orange kiwi
apple orange banana kiwi
It is a very simplified example. In my case banana and orange are long, complicated subpatterns and I don't want to do something like
/(apple)?\s?((banana)?\s?(orange)?|(orange)?\s?(banana)?)\s?(kiwi)?/
Is it possible to group subpatterns like chars in character class?
UPD Real data as requested:
14:24 26,37 Mb
108.53 01:19:02 06.07
24.39 19:39
46:00
My strings are much longer, but this is the significant part. Here you can see the lines that I need to match.
The first has two values: length (14 min 24 sec) and size 26.37 Mb.
The second has three values, but in a different order: size 108.53 Mb, length 01 h 19 m 02 s and date June, 07.
The third has two values: size and length.
The fourth has only length.
There are a couple more variations and I need to parse all values.
I have a regexp that is pretty close, except I can't figure out how to match patterns in a different order without writing them twice.
(?<size>\d{1,3}\[.,]\d{1,2}\s+(?:Mb)?)?\s?
(?<length>(?:(?:01:)?\d{1,2}:\d{2}))?\s*
(?<date>\d{2}\.\d{2}))?
NOTE: that is only part of a big regexp that already works fine.
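Regular expressions have no "any order" group, but one way to avoid typing the long subpatterns twice is to keep them in variables and assemble the alternation programmatically. The sketch below does this in Python for the simplified fruit example; the variable names are illustrative and the short literals stand in for the real, long subpatterns.

```python
import re

# Stand-ins for the long, complicated subpatterns.
BANANA = r"banana"
ORANGE = r"orange"

# Each subpattern is written once and reused for both orders.
either_order = rf"(?:{BANANA}\s?{ORANGE}|{ORANGE}\s?{BANANA})"
pattern = re.compile(rf"apple\s?{either_order}\s?kiwi")

first = pattern.fullmatch("apple banana orange kiwi")
second = pattern.fullmatch("apple orange banana kiwi")
```

The compiled pattern still contains each subpattern twice, so this only saves the source from duplication, not the regex engine.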
I'm a newbie to regular expressions and I have the following string:
sequence = ["{\"First\":\"Belyuen,NT,0801\",\"Second\":\"Belyuen,NT,0801\"}","{\"First\":\"Larrakeyah,NT,0801\",\"Second\":\"Larrakeyah,NT,0801\"}"]
I am trying to extract the text Belyuen,NT,0801 and Larrakeyah,NT,0801 in python. I have the following code which is not working:
re.search('\:\\"...\\', ''.join(sequence))
I.e. I want to get the string between characters :\ and \.
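Since each element of `sequence` looks like a JSON object, one option is to skip regular expressions and let the `json` module decode it. This is a sketch that assumes every element is valid JSON with a "First" key, as in the sample.

```python
import json

sequence = ["{\"First\":\"Belyuen,NT,0801\",\"Second\":\"Belyuen,NT,0801\"}","{\"First\":\"Larrakeyah,NT,0801\",\"Second\":\"Larrakeyah,NT,0801\"}"]

# Decode each JSON string and pick out the value stored under "First".
firsts = [json.loads(item)["First"] for item in sequence]
```

Note that the backslashes in the source are Python escape sequences, so the actual strings contain plain `"` characters and no `\` at all, which is one reason the attempted pattern finds nothing.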
I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.
example:
Hi tom,
last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et
dolore magna aliqua.
list item 2
list item 3
list item 3
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit
in voluptate velit
regards, K.
---line-of-funny-characters-#######
example inc.
33 evil street, london
mobile: 00 234534/234345
Ideally the algorithm would match only the bold parts.
Is there any recommended approach, or are there even existing algorithms for that problem? Should I try approximate regular expressions or something more statistical based on the number of punctuation marks, length and so on?
I have to migrate a very large dataset from one system to another. One of the "source" columns contains a date but is really a string with no constraint, while the destination system mandates a date in the format yyyy-mm-dd.
Many, but not all, of the source dates are formatted as yyyymmdd. So to coerce them to the expected format, I do (in Perl):
return "$1-$2-$3" if ($val =~ /(\d{4})[-\/]*(\d{2})[-\/]*(\d{2})/);
The problem arises when the source dates move away from the "generic" yyyymmdd. The goal is to salvage as many dates as possible before giving up. Example source strings include:
21/3/1998,
March 2004,
2001,
3/4/97
I can try to match as many of the examples I can find with a succession of regular expressions such as the one above.
But is there something smarter to do? Am I not reinventing the wheel? Is there a library somewhere doing something similar? I couldn't find anything relevant googling "forgiving date parser". (any language is OK).
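A sketch of the "succession of formats" idea in Python, using `datetime.strptime` with a list of candidate formats tried in order. The format list, the day-before-month assumption for slash dates, and the choice to fill missing days and months with 01 are all assumptions for illustration; `salvage_date` is a made-up name.

```python
from datetime import datetime

# Candidate formats, most specific first. Slash dates are ASSUMED to be
# day/month/year; swap the entries if the data is month/day/year.
FORMATS = ["%Y%m%d", "%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y", "%d/%m/%y", "%B %Y", "%Y"]

def salvage_date(raw):
    """Return the date as yyyy-mm-dd, or None if no format fits."""
    text = raw.strip().strip(",").strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None
```

For something more forgiving, the third-party python-dateutil package provides `dateutil.parser.parse`, which guesses the format of many free-form date strings.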
Hi,
I'm trying to extract a number from a string.
And do something like this [0-9]+ on this string "aaaa12xxxx" and get "12".
I thought it would be something like:
> grep("[0-9]+","aaa12xxx", value=TRUE)
[1] "aaa12xxx"
And then I figured...
> sub("[0-9]+", "\\1", "aaa12xxxx")
[1] "aaa12xxx"
But I got some form of response doing:
> sub("[0-9]+", "ARGH!", "aaa12xxxx")
[1] "aaaARGH!xxx"
There's a small detail I'm missing. Please advise :-)
I'm using R version 2.10.1 (2009-12-14)
Thanks !
Comments on the solution
The best solution is to ignore the standard functions and install Hadley Wickham's stringr package to get something that actually makes sense.
Kudos to Marek for figuring out how the standard library worked.
I'm pretty experienced with Perl and Ruby but new to Python so I'm hoping someone can show me the Pythonic way to accomplish the following task. I want to compare several lines against multiple regular expressions and retrieve the matching group. In Ruby it would be something like this:
STDIN.each_line do |line|
  case line
    when /^A:(.*?)$/ then puts "FOO: #{$1}"
    when /^B:(.*?)$/ then puts "BAR: #{$1}"
    # when ...
    else puts "NO MATCH: #{line}"
  end
end
My attempts in Python are turning out pretty ugly because the matching group is returned from a call to match/search on a regular expression and Python has no assignment in conditionals or switch statements. What's the Pythonic way to do (or think!) about this problem?
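One pattern that often reads well in Python is a table of (compiled pattern, label) pairs walked in a loop, which plays the role of the Ruby `case`. The sketch below is only one way to do it; `RULES` and `classify` are illustrative names.

```python
import re

# Ordered table of rules; the first pattern that matches wins.
RULES = [
    (re.compile(r"^A:(.*)$"), "FOO"),
    (re.compile(r"^B:(.*)$"), "BAR"),
]

def classify(line):
    """Return the formatted result for the first rule that matches."""
    for pattern, label in RULES:
        match = pattern.match(line)
        if match:
            return f"{label}: {match.group(1)}"
    return f"NO MATCH: {line}"
```

Usage would be along the lines of `for line in sys.stdin: print(classify(line.rstrip("\n")))`. On Python 3.8 and later, an assignment expression such as `if (m := pattern.match(line)):` is another option.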
Hi Guys... Hope you help me with this one....
I have a main question, which is: how do I judge whether a regular expression will be accepted by an NFA and/or a DFA?
For example, my question asks which of these regular expressions are equivalent, and to explain:
1.(a+b)*b(a+b)*b(a+b)*
2.a*ba*ba*
3.a*ba*b(a+b)*
Do we have to draw the NFA and DFA and then compare them through the minimisation algorithm? If we do, how do we know which regular expression is accepted by an NFA/DFA so that we can begin with the answer? It's so confusing....
The second is a very similar one: the question asks me to show that the language {a^n b^n | n >= 1} is not accepted by a DFA...grrrrr...how do I know this? (BTW this is the set of all strings where a number of a's is followed by the same number of b's)....
I hope I explained it clearly....
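A quick way to build intuition before drawing automata is to enumerate all short strings over {a, b} and compare which ones each expression accepts. The sketch below does that in Python, writing the textbook union `(a+b)` as the character class `[ab]`. A finite check like this is only evidence, not a proof of equivalence, and the length bound of 6 is arbitrary.

```python
import re
from itertools import product

# The three expressions, with (a+b) translated to [ab].
EXPRESSIONS = {
    1: r"[ab]*b[ab]*b[ab]*",
    2: r"a*ba*ba*",
    3: r"a*ba*b[ab]*",
}

def language(pattern, max_len):
    """Return the set of strings over {a, b} up to max_len that fully match."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product("ab", repeat=n)
        if compiled.fullmatch("".join(chars))
    }

langs = {number: language(pattern, 6) for number, pattern in EXPRESSIONS.items()}
same_1_and_3 = langs[1] == langs[3]
same_1_and_2 = langs[1] == langs[2]
```

Expressions 1 and 3 agree on every string tried (both describe "at least two b's"), while expression 2 rejects strings such as `bbb` because it allows exactly two b's.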
Hi all,
If I have a string, e.g.
setting=value
How can I remove the '=' and turn that into two separate strings containing 'setting' and 'value' respectively?
Thanks very much!
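The question does not say which language is in use, so as an illustration only, here is how the split might look in Python, where no regular expression is needed.

```python
line = "setting=value"

# partition splits at the first '=' only, so values containing '=' stay intact.
setting, separator, value = line.partition("=")
```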
As you can read in the title, I need a regular expression that accepts any letter, symbol or number, with a length from 1 to 100 characters (any text possible). Can someone provide that for me, and maybe a good link to understand how it works?
Thank you.
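A minimal sketch in Python, assuming "any text" really means any character at all, newlines included: `.` matches one character and `{1,100}` repeats it between 1 and 100 times. The name `is_valid` is illustrative.

```python
import re

# DOTALL lets '.' match newlines too, so truly any character is accepted.
ANY_1_TO_100 = re.compile(r".{1,100}", re.DOTALL)

def is_valid(text):
    """Return True when text is between 1 and 100 characters long."""
    return ANY_1_TO_100.fullmatch(text) is not None
```

When nothing but the length matters, a plain length check does the same job without a regular expression.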
What C# regular expression would replace all of these:
<BR style=color:#93c47d>
<BR style=color:#fefefe>
<BR style="color:#93c47d">
<BR style="color:#93c47d ...">
<BR>
<BR/>
<br style=color:#93c47d>
<br style=color:#fefefe>
<br style="color:#93c47d">
<br style="color:#93c47d ...">
<br>
<br/>
with:
<br/>
basically "remove all attributes from any BR element and lowercase it".
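A sketch of the idea in Python rather than C#; the same pattern with a case-insensitive option should carry over to `Regex.Replace`. It assumes no attribute value contains a `>` character, and `normalize_br` is a made-up name.

```python
import re

# "<br", a word boundary so tags like <break> are skipped, then anything up to ">".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

def normalize_br(html):
    """Replace every BR tag, with or without attributes, by a plain <br/>."""
    return BR_TAG.sub("<br/>", html)
```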
Hi,
I need to check that a string contains only the characters ABCDEFG, in any sequence, and is exactly 7 characters long. Please let me know the correct way of using a regular expression for this.
Currently I am using
String abs = "ABPID";
if(!Pattern.matches("[[ABCDEFG]", abs))
System.out.println("Error");
The code above works when I use the string abcdefg, but it fails for other cases. Please help me out.
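A sketch of the check in Python, assuming letters may repeat and that only uppercase A to G counts. The same pattern, `[A-G]{7}`, should also work with Java's `Pattern.matches`, which likewise requires the whole string to match. The name `is_valid_code` is illustrative.

```python
import re

# Exactly seven characters, each one of A, B, C, D, E, F or G.
SEVEN_A_TO_G = re.compile(r"[A-G]{7}")

def is_valid_code(text):
    """Return True when text is exactly 7 characters drawn from A-G."""
    return SEVEN_A_TO_G.fullmatch(text) is not None
```

A character class without a quantifier matches a single character only, which is why a 5- or 7-character string can never match a bare `[ABCDEFG]`.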
I'm trying to count the matches of directory-up references ("..\") in a relative path.
So I have the following pattern: "(..\)", which works as expected for the path "..\..\a\b", where it gives me 2 successful groups ("..\"), but when I try the path "..\a\b" it also returns 2 when it should be 1. I tried this in a regex tool such as Expresso and it seems to work as expected there, but not in .NET. Any ideas?
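As an illustration in Python rather than .NET: escaping both dots and the backslash makes the pattern match only a literal `..\`, and counting the matches (rather than the groups of a single match) gives the number of up references. The helper name `count_up_refs` is made up.

```python
import re

# Two literal dots followed by a literal backslash.
UP_REF = re.compile(r"\.\.\\")

def count_up_refs(path):
    """Count how many '..\\' references the path contains."""
    return len(UP_REF.findall(path))
```

In many regex APIs the group collection of one match also includes the whole match as group 0, so counting groups instead of matches may be worth ruling out as the source of the extra 1.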
I have a string that looks like this:
var str = "Hello world, hello >world, hello world!";
... and I'd like to replace all the hellos with e.g. bye and world with earth, except the words that start with &nbsp; or >. Those should be ignored. So the result should be:
bye earth, hello >world, bye earth!
I tried to do this with
str.replace(/(?!\ )hello/gi,'bye'));
But it doesn't work.
Hi all,
I've got an e-shop on a virtual server that's been used as a subdirectory for the last few years, but now I'm finally giving the VS its own domain name. What I really need is for visitors to the old URL to be transparently (and with a 301) redirected to the new URL, with everything after /eshop/ maintained and appended to the new host.
I.e. http://www.example.com/eshop/page.php -> http://www.newdomain.com/page.php
Any help would be greatly appreciated.
I have a string like this:
<span style="font-weight: bold;">Foo</span>
I want to use PHP to make it
<strong>Foo</strong>
…without affecting other spans.
How can I do this?
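A sketch of the substitution in Python for illustration; the same pattern and a `$1` replacement should be usable with PHP's `preg_replace`. It assumes the style attribute is written exactly as shown and that bold spans do not contain nested spans. The name `bold_span_to_strong` is made up.

```python
import re

# Match only spans whose style is exactly "font-weight: bold;".
BOLD_SPAN = re.compile(r'<span style="font-weight: bold;">(.*?)</span>', re.DOTALL)

def bold_span_to_strong(html):
    """Rewrite bold spans as <strong> elements and leave other spans alone."""
    return BOLD_SPAN.sub(r"<strong>\1</strong>", html)
```

If spans can be nested, the lazy `.*?` will stop at the first `</span>`, so an HTML parser would be the safer tool.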
I am trying to extract a number from a string. The number might be zero. Numbers appear like this: '+123', '-8' or '0'.
alert( '+123'.match(/[-?|+?]\d+/) );
alerts +123
alert( '-8'.match(/[-?|+?]\d+/) );
alerts -8
alert( '0'.match(/[-?|+?]\d+/) );
alerts null // why oh why?
How do I get '0'.match(/[-?|+?]\d+/) to return 0 instead of null?
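Inside square brackets, `?` and `|` are literal characters, so `[-?|+?]` means "exactly one of `-`, `?`, `|` or `+`" and a sign-like character becomes mandatory. Moving the `?` outside the class makes the sign optional. The sketch below shows this in Python; the pattern `[-+]?\d+` should behave the same way in JavaScript.

```python
import re

# Optional sign, then one or more digits.
SIGNED_INT = re.compile(r"[-+]?\d+")

def extract_number(text):
    """Return the first signed integer found in text, or None."""
    match = SIGNED_INT.search(text)
    return match.group() if match else None
```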
Dear all,
I need to find matches between two tab-delimited files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
What I would like to do is retrieve the matching lines between the two files. What I would like to match is everything after the gene ID.
So far I have written this code, but unfortunately Perl keeps giving me the error:
"Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where I am going wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my @sample1 = <INA>;
my @sample2 = <INB>;
foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);
    my $chr = $tab[1];
    my $start = $tab[2];
    my $end = $tab[3];
    my $ref = $tab[4];
    my $alt = $tab[5];
    my $name = $tab[6];
    foreach my $item (@sample2){
        my @fields = split (/\t/,$item);
        if ($fields[1]=~ m/$chr(.*)/ && $fields[2]=~ m/$start(.*)/ && $fields[4]=~ m/$ref(.*)/ && $fields[5]=~ m/$alt(.*)/&& $fields[6]=~ m/$name(.*)/){
            print $line,"\n",$item;
        }
    }
}
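For comparison, here is a sketch of the same task in Python that avoids pattern matching altogether: it turns everything after the first column into a tuple and uses a set lookup. It splits on any whitespace (an assumption, since the samples above show no literal tabs) and skips blank lines, which in the Perl version are one likely source of undefined fields. The function names are made up.

```python
def key_after_id(line):
    """Return the fields after the first (ID) column as a tuple."""
    return tuple(line.split()[1:])

def matching_lines(lines_a, lines_b):
    """Return lines of lines_b whose fields after the ID also occur in lines_a."""
    keys_a = {key_after_id(line) for line in lines_a if line.strip()}
    return [line for line in lines_b
            if line.strip() and key_after_id(line) in keys_a]

# Sample data copied from the question; real code would read the two files.
file1 = [
    "ID1 1 65383896 65383896 G C PCNXL3",
    "ID1 2 56788990 55678900 T A ACT1",
    "ID1 1 56788990 55678900 T A PRO55",
    "",
]
file2 = [
    "ID2 34 65383896 65383896 G C MET5",
    "ID2 2 56788990 55678900 T A ACT1",
    "ID2 2 56788990 55678900 T A HLA",
]
matches = matching_lines(file1, file2)
```

Comparing whole fields for equality also avoids partial hits such as a chromosome value of 1 matching inside 34, which the `m/$chr(.*)/` style of test allows.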