I have the following string that I would like to parse into either a List or a String[].
(Test)(Testing (Value))
End result should be Test and Testing (Value)
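One way to approach the nesting is to track parenthesis depth instead of using a regex. Below is a minimal Python sketch of that idea (the question targets Java, so this only illustrates the logic); the function name `split_top_level` is made up for the example.

```python
def split_top_level(text, open_ch="(", close_ch=")"):
    """Return the contents of each top-level bracketed group in text."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i + 1  # content begins right after the outer opener
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at top level: everything since the outer opener is one group
                groups.append(text[start:i])
    return groups


print(split_top_level("(Test)(Testing (Value))"))
# ['Test', 'Testing (Value)']
```

Unbalanced closing parentheses are simply ignored here; stricter handling would be a design choice.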
I am trying to break down the http://stackoverflow.com/questions/2711961/decoding-algorithm-wanted question into smaller questions. This is Part I.
Question:
two strings: s1 and s2
part of s1 is identical to part of s2
space is separator
how to extract the identical part(s)?
example 1:
s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"
the identical parts are "2010", "-", "1" and "visitor"
example 2:
s1 = "Welcome, John!"
s2 = "Welcome, Peter!"
the identical parts are "Welcome," and "!"
Python and Ruby preferred. Thanks
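A possible starting point in Python is `difflib.SequenceMatcher`, which reports runs of characters shared by two strings. The sketch below is only an approximation of what is asked: it matches at character level and ignores the space separator, so it reproduces example 2 but is not guaranteed to give exactly the parts listed in example 1. The helper name `common_parts` is hypothetical.

```python
from difflib import SequenceMatcher


def common_parts(s1, s2):
    """Return the non-empty matching blocks of s1 and s2, whitespace-trimmed."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    parts = []
    for block in matcher.get_matching_blocks():
        piece = s1[block.a:block.a + block.size].strip()
        if piece:  # the final block always has size 0, so it is skipped here
            parts.append(piece)
    return parts


print(common_parts("Welcome, John!", "Welcome, Peter!"))
# ['Welcome,', '!']
```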
I'm parsing some big log files and have some very simple string matches, for example:
if(m/Some String Pattern/o){
#Do something
}
It seems simple enough, but in fact most of my matches could be anchored to the start of the line, although the pattern would then be "longer", for example:
if(m/^Initial static string that matches Some String Pattern/o){
#Do something
}
Obviously this is a longer regular expression and so more work to match. However I can use the start of line anchor which would allow an expression to be discarded as a failed match sooner.
It is my hunch that the latter would be more efficient. Can anyone back me up or shoot me down? :-)
I have a text that contains strings of the following structure:
text I do not care about, then a person's name followed by two IDs.
I know that:
a person's name is always preceded by the XYZ code and is always followed by
two space-separated numbers.
The name is not always just a last name and a first name. It can have multiple last or first names
(think Latin American names).
So, I am looking to extract the string that follows the constant XYZ code and is always terminated by two separate numbers.
You could say that my delimiters are XYZ and the two numbers, but the numbers need to be part of the extracted value as well.
From
blah, blah XYZ names, names 122322 344322 blah blah
I want to extract:
names, names 122322 344322
Would someone please advise on a regular expression for this that works with Python's re package?
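A minimal sketch of one pattern that fits the description, assuming the XYZ code is followed by whitespace and the two numbers are plain digit runs. The names `PATTERN` and `extract_after_code` are illustrative only.

```python
import re

# Lazy .+? grows until it is followed by two whitespace-separated digit runs.
PATTERN = re.compile(r"XYZ\s+(.+?\s+\d+\s+\d+)\b")


def extract_after_code(text):
    """Return the name plus the two IDs that follow the XYZ code, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(extract_after_code("blah, blah XYZ names, names 122322 344322 blah blah"))
# names, names 122322 344322
```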
I'm migrating wiki pages from the FlexWiki engine to the FOSwiki engine using Python regular expressions to handle the differences between the two engines' markup languages.
The FlexWiki markup and the FOSwiki markup, for reference.
Most of the conversion works very well, except when I try to convert the renamed links.
Both wikis support renamed links in their markup.
For example, FlexWiki uses:
"Link To Wikipedia":[http://www.wikipedia.org/]
FOSwiki uses:
[[http://www.wikipedia.org/][Link To Wikipedia]]
both of which produce something that looks like
I'm using the regular expression
renameLink = re.compile ("\"(?P<linkName>[^\"]+)\":\[(?P<linkTarget>[^\[\]]+)\]")
to parse out the link elements from the FlexWiki markup, which after running through something like
"Link Name":[LinkTarget]
is reliably producing groups
<linkName> = Link Name
<linkTarget> = LinkTarget
My issue occurs when I try to use re.sub to insert the parsed content into the FOSwiki markup.
My experience with regular expressions isn't anything to write home about, but I'm under the impression that, given the groups
<linkName> = Link Name
<linkTarget> = LinkTarget
a line like
line = renameLink.sub ( "[[\g<linkTarget>][\g<linkName>]]" , line )
should produce
[[LinkTarget][Link Name]]
However, in the output to the text files I'm getting
[[LinkTarget [[Link Name]]
which breaks the renamed links.
After a little bit of fiddling I managed a workaround, where
line = renameLink.sub ( "[[\g<linkTarget>][ [\g<linkName>]]" , line )
produces
[[LinkTarget][ [[Link Name]]
which, when displayed in FOSwiki looks like
<[[Link Name> <--- Which WORKS, but isn't very pretty.
I've also tried
line = renameLink.sub ( "[[\g<linkTarget>]" + "[\g<linkName>]]" , line )
which is producing
[[linkTarget [[linkName]]
There are probably thousands of instances of these renamed links in the pages I'm trying to convert, so fixing it by hand isn't any good.
For the record I've run the script under Python 2.5.4 and Python 2.7.3, and gotten the same results.
Am I missing something really obvious with the syntax? Or is there an easy workaround?
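For comparison, here is the same substitution in isolation, written with raw strings so the backslashes reach the regex engine untouched. Run on its own, it yields the expected FOSwiki form, which suggests (as an assumption, since the rest of the script is not shown) that the stray brackets may come from another conversion step applied to the same line. The function name `convert_renamed_links` is made up for this sketch.

```python
import re

# Same pattern as in the question, as a raw string with single-quote delimiters.
rename_link = re.compile(r'"(?P<linkName>[^"]+)":\[(?P<linkTarget>[^\[\]]+)\]')


def convert_renamed_links(line):
    """Rewrite FlexWiki renamed links in line to FOSwiki syntax."""
    return rename_link.sub(r"[[\g<linkTarget>][\g<linkName>]]", line)


print(convert_renamed_links('"Link To Wikipedia":[http://www.wikipedia.org/]'))
# [[http://www.wikipedia.org/][Link To Wikipedia]]
```

If this gives the right output in your environment too, it may be worth checking which other substitutions run on the line afterwards.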
Don't ask how this works, but currently it does ("^\|(.?)\|*$"), kind of. This removes all extra pipes, which is part one. I have searched all over and found no answer yet. I am using VB2011 beta in an ASP web form, with the code written in VB.
I want to capture the special character pipe (|), which is used to separate words, e.g. car|truck|van|cycle
The problem is that users lead with pipes, trail with pipes, use several in a row, and put spaces before and after them, e.g. |||car||truck | van || cycle.
Another example: george bush|micheal jordon|bill gates|steve jobs <-- this would be correct, but when I remove spaces it also takes out the spaces that belong there.
So I want to get rid of leading and trailing whitespace and any space before or after a |, and allow only one pipe (|) between alphanumeric words.
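The cleanup described above can be sketched in two steps: collapse every run of pipes and surrounding whitespace into one pipe, then drop any pipe left at either end. The example is in Python rather than VB, but the pattern uses only constructs that .NET regular expressions also support. The name `normalize_pipes` is hypothetical.

```python
import re


def normalize_pipes(raw):
    """Collapse pipe runs (with surrounding spaces) and trim the ends."""
    # One pipe plus any neighbouring whitespace or further pipes becomes a single "|".
    collapsed = re.sub(r"\s*\|[\s|]*", "|", raw.strip())
    # Remove a leading or trailing separator that the user typed.
    return collapsed.strip("|")


print(normalize_pipes("|||car||truck | van || cycle"))
# car|truck|van|cycle
print(normalize_pipes("george bush|micheal jordon|bill gates|steve jobs"))
# george bush|micheal jordon|bill gates|steve jobs
```

Spaces inside a word group are untouched because the pattern only matches whitespace that is adjacent to a pipe.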
I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious.
I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following:
findreplace = [
('term1', 'term2'),
]
inF = open(infile,'rb')
s=unicode(inF.read(),charenc)
inF.close()
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
outF = open(outFile,'wb')
outF.write(outtext.encode('utf-8'))
outF.close()
How would I go about having the script do a find and replace for regular expressions?
Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg:
Title: This is the title
Author: This is the author
Date: This is the date
and convert it into LaTeX format. Eg:
\title{This is the title}
\author{This is the author}
\date{This is the date}
Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know!
Thanks!
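A minimal sketch of the regex variant, assuming each metadata field sits on its own line and only the three keys shown are wanted. A function is used as the replacement so the LaTeX backslash does not have to be escaped in a replacement template. `FIELD` and `metadata_to_latex` are names invented for the example.

```python
import re

# One "Key: value" line per match; MULTILINE makes ^ and $ work per line.
FIELD = re.compile(r"^(Title|Author|Date):[ \t]*(.*)$", re.MULTILINE)


def metadata_to_latex(text):
    """Turn 'Title: x' style lines into '\\title{x}' style LaTeX commands."""
    return FIELD.sub(
        lambda m: "\\%s{%s}" % (m.group(1).lower(), m.group(2)),
        text,
    )


sample = "Title: This is the title\nAuthor: This is the author\nDate: This is the date\n"
print(metadata_to_latex(sample))
```

To fit the existing loop, the pairs in `findreplace` could hold a compiled pattern and a replacement, with `pattern.sub(replacement, s)` in place of `s.replace(...)`.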
Hi There,
Does anyone have a regular expression available that only accepts dates in the format dd/mm/yy and also checks strictly that the date is valid, including leap year support?
I am coding in VB.NET and am struggling to work this one out.
Many Thanks
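A common alternative to a single large pattern is to let a regex check only the shape and leave calendar rules to a date parser. The sketch below shows that split in Python (not VB.NET), so treat it as an illustration of the approach. Note the assumption about two-digit years: Python's `%y` maps 00-68 to 2000-2068 and 69-99 to 1969-1999. The name `is_valid_ddmmyy` is hypothetical.

```python
import re
from datetime import datetime


def is_valid_ddmmyy(text):
    """True if text is exactly dd/mm/yy and names a real calendar date."""
    # Shape check: two digits, slash, two digits, slash, two digits.
    if not re.fullmatch(r"[0-9]{2}/[0-9]{2}/[0-9]{2}", text):
        return False
    try:
        # The parser rejects impossible dates such as 31/04 or 29/02 in a non-leap year.
        datetime.strptime(text, "%d/%m/%y")
    except ValueError:
        return False
    return True


print(is_valid_ddmmyy("29/02/12"))  # True  (2012 is a leap year)
print(is_valid_ddmmyy("29/02/11"))  # False (2011 is not)
```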
Hi guys,
I need a guideline.
I am trying to write a personal blog.
What is the standard structure for the input of a post?
I am trying a format like:
This is the simple text
And I am [b] bold text[/b].
This is the code part:
[code lang=java]
public static void main (String args[]) {
System.out.println("Hello World!");
}
[/code]
Is this the right way to store a post in the database?
And what is the right method to parse this kind of post?
Should I use regular expressions to parse it, or is there another standard for this?
If the format above is not the right way to store it, then what would be?
Thanks
For example, if I'm doing some form input validation and I'm using the following code for the name field.
preg_match("/^[a-zA-Z .-]$/", $firstname);
If someone types in Mr. (Awkward) Double-Barrelled I want to be able to display a message saying Invalid character(s): (, )
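One way to get at the offending characters is to invert the character class and collect everything it matches. The sketch is in Python rather than PHP, and `invalid_characters` is a made-up name. As an aside, the pattern in the question has no quantifier, so as written it only accepts a one-character name; a `+` after the class is probably intended.

```python
import re


def invalid_characters(value, allowed=r"a-zA-Z .\-"):
    """Return the distinct characters of value that are outside the allowed class."""
    bad = re.findall("[^%s]" % allowed, value)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(bad))


print(invalid_characters("Mr. (Awkward) Double-Barrelled"))
# ['(', ')']
```

Joining the result with ", " gives the text for the message.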
I know the character-wise positions of matches, like 1 3 7 8. I need to know their corresponding line numbers.
Example: file.txt
Match: X
Matches: 1 3 7 8
Want: 1 2 4 4
$ cat file.txt
X2
X
4
56XX
[Added: this does not notice multiple matches on the same line; there is probably an easier way to do it with stacks]
$ java testt
1
2
4
$ cat testt.java
import java.io.*;
import java.util.*;
public class testt {
    public static String data ="X2\nX\n4\n56XX";
    public static String[] ar = data.split("\n");
    public static void main(String[] args){
        HashSet<Integer> hs = new HashSet<Integer>();
        Integer numb = 1;
        for(String s : ar){
            if(s.contains("X")){
                hs.add(numb);
                numb++;
            }else{
                numb++;
            }
        }
        for (Integer i : hs){
            System.out.println(i);
        }
    }
}
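A different way to map offsets to lines is to precompute where each line ends and binary-search that list. The Python sketch below assumes, as the example suggests, that positions are 1-based and that newline characters are not counted. `offsets_to_lines` is a hypothetical name.

```python
from bisect import bisect_left
from itertools import accumulate


def offsets_to_lines(text, offsets):
    """Map 1-based character offsets (newlines not counted) to 1-based line numbers."""
    # Running total of line lengths: the offset of the last character on each line.
    line_ends = list(accumulate(len(line) for line in text.split("\n")))
    # The first line whose end is at or after the offset contains that offset.
    return [bisect_left(line_ends, pos) + 1 for pos in offsets]


print(offsets_to_lines("X2\nX\n4\n56XX", [1, 3, 7, 8]))
# [1, 2, 4, 4]
```

Unlike the set-based version, this reports a line once per match, so two matches on line 4 give 4 twice.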
Hi,
I'm converting patch scripts using a command-line script. Within these scripts there is a combination of two lines like:
--- /dev/null
+++ filename.txt
which needs to be converted to:
--- filename.txt
+++ filename.txt
Initially I tried:
less file.diff | sed -e "s/---\/dev\null\n+++ \(.*\)/--- \1\n+++ \1/"
But I had to find out that multiline-handling is much more complex in sed :(
Any help is appreciated...
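If stepping outside sed is acceptable, the two-line rewrite is straightforward when the whole file is read as one string. Here is a minimal Python sketch under that assumption; `HEADER` and `fix_new_file_headers` are names invented for the example.

```python
import re

# Match the "--- /dev/null" line together with the "+++ name" line after it.
HEADER = re.compile(r"^--- /dev/null\n\+\+\+ (.*)$", re.MULTILINE)


def fix_new_file_headers(diff_text):
    """Replace '--- /dev/null' with '--- <name>' taken from the following '+++' line."""
    return HEADER.sub(r"--- \1\n+++ \1", diff_text)


sample = "--- /dev/null\n+++ filename.txt\n@@ -0,0 +1 @@\n+hello\n"
print(fix_new_file_headers(sample))
```

Header pairs that do not start from /dev/null are left as they are.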
Hi all, I'm basically trying to create my own tags - and replace them with the right HTML tags. So {B} {/B} would turn into <b> </b>
I have only got so far with this, here: http://www.nacremedia.com/text2.htm
Use the [B] button to bold stuff the current selection... it creates two bold tags and one closing for some reason.
I'm so close! But I just need a bit of direction to get the final bugs out - can anyone please help??
Also, if there is a better way of doing this altogether, then I am more than open to new ideas.
I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
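As a cross-check outside grep, the same search can be written as "any character that is not in the ASCII range". This Python sketch assumes the lines have already been decoded to text; `NON_ASCII` and `lines_with_non_ascii` are illustrative names.

```python
import re

# Anything outside U+0000..U+007F counts as non-ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def lines_with_non_ascii(lines):
    """Return (line_number, line) pairs for lines containing a non-ASCII character."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if NON_ASCII.search(line)
    ]


print(lines_with_non_ascii(["<a>plain</a>", "<b>caf\u00e9</b>"]))
# [(2, '<b>café</b>')]
```

For a large file, passing an open file object (for example `open("file.xml", encoding="utf-8")`, where the encoding is an assumption) keeps memory use low, since lines are read one at a time.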
I know this may be the simplest question ever asked on Stack Overflow, but what is the regular expression for a decimal with a precision of 2?
Valid examples:
123.12
2
56754
92929292929292.12
0.21
3.1
Invalid examples:
12.1232
2.23332
e666.76
Sorry for the lame question, but for the life of me I haven't been able to find anyone that can help!
The decimal places are optional, and integers may also be included.
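A minimal sketch of a pattern consistent with the examples above: one or more digits, optionally followed by a dot and one or two digits. It assumes that a leading sign and a bare ".5" should be rejected, which the examples do not settle. The names are made up for the example.

```python
import re

# Digits, then optionally a dot with one or two digits.
TWO_PLACES = re.compile(r"[0-9]+(?:\.[0-9]{1,2})?")


def is_two_place_decimal(text):
    """True if the whole string is a number with at most two decimal places."""
    return TWO_PLACES.fullmatch(text) is not None


for value in ["123.12", "2", "0.21", "3.1", "12.1232", "e666.76"]:
    print(value, is_two_place_decimal(value))
```

In flavors without a full-match call, the equivalent is to wrap the pattern in ^ and $.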
I am extracting the file names of HTML files using this line:
filename = File.basename(input_filename, ".*")
which currently prints the full file name, excluding the .html extension.
All files are stored in the form http^x.x.edu^1^2. All file names begin with http^ and contain edu^. What I want is to extract the 2 (which changes); it is always the second element after .edu. I have attempted a destructive gsub!, but I am weak with regular expressions.
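The extraction can be sketched as "find edu^, skip one caret-delimited element, capture the next". The example is in Python; the pattern contains nothing Python-specific, though it has not been tested in Ruby here. `element_after_edu` is a hypothetical name.

```python
import re


def element_after_edu(name):
    """Return the second caret-separated element after 'edu', or None."""
    # edu^ <first element> ^ <second element>; [^^] means "any character but a caret".
    match = re.search(r"edu\^[^^]*\^([^^]+)", name)
    return match.group(1) if match else None


print(element_after_edu("http^x.x.edu^1^2"))
# 2
```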
I need help with a regular expression for condition (4) below:
1. Begin with a-z
2. End with a-z0-9
3. Allow 3 special characters: ._-
4. The characters in (3) must be followed by alphanumeric characters, and cannot be followed by any of the characters in (3) themselves.
Not sure how to do this. Any help is appreciated, with the sample and some explanations.
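One way to express condition (4) is to allow a special character only as the start of a group that continues with at least one alphanumeric. A minimal Python sketch follows, assuming "alphanumeric" means lowercase a-z and 0-9 as in conditions (1) and (2); the names are invented for the example.

```python
import re

# A leading letter, optional alphanumerics, then zero or more groups of
# "one special character followed by one or more alphanumerics".
NAME = re.compile(r"[a-z][a-z0-9]*(?:[._-][a-z0-9]+)*")


def is_valid_name(text):
    """True if text satisfies the four conditions listed above."""
    return NAME.fullmatch(text) is not None


for value in ["abc", "a.b-c_9", "a..b", "a.-b", "a.", "9a"]:
    print(value, is_valid_name(value))
```

Because every special character must be followed by `[a-z0-9]+`, two special characters can never be adjacent and the string can never end in one.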
I have a file with the following format.
1234, 'US', 'IN',......
324, 'US', 'IN',......
...
...
53434, 'UK', 'XX', ....
...
...
253, 'IN', 'UP',....
253, 'IN', 'MH',....
Here I want to extract only those lines having 'IN' as the 2nd keyword, i.e.
253, 'IN', 'UP',....
253, 'IN', 'MH',....
Can anyone please tell me a command to grep it?
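The key is to anchor the pattern so that 'IN' can only match in the second comma-separated field. A short Python sketch of that anchoring (not a grep command, and the names are hypothetical):

```python
import re

# Start of line, a first field without commas, a comma, then 'IN' as the second field.
SECOND_IS_IN = re.compile(r"^[^,]*,\s*'IN'\s*,")


def lines_with_in_second(lines):
    """Keep only the lines whose second field is 'IN'."""
    return [line for line in lines if SECOND_IS_IN.match(line)]


rows = [
    "1234, 'US', 'IN',......",
    "53434, 'UK', 'XX', ....",
    "253, 'IN', 'UP',....",
    "253, 'IN', 'MH',....",
]
print(lines_with_in_second(rows))
```

The first row is excluded even though it contains 'IN', because there it is the third field.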
My primary concern is with the Java flavor, but I'd also appreciate information regarding others.
Let's say you have a subpattern like this:
(.*)(.*)
Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.
So both are greedy, in that they try to capture as much as possible, only taking less when they have to.
My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?
What about:
(.*)(.*)(.*)
Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?
Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?
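The behaviour can be probed with a small experiment. The sketch below uses Python's `re` module, which is a backtracking engine like `java.util.regex`, so similar results seem likely in Java, but that is an assumption this snippet does not verify. The variants with `.+` force the groups to compete for characters; the helper name `groups` is made up.

```python
import re


def groups(pattern, text):
    """Return the captured groups when pattern matches all of text."""
    return re.fullmatch(pattern, text).groups()


# All three greedy: the first group takes everything.
print(groups(r"(.*)(.*)(.*)", "abcd"))   # ('abcd', '', '')
# Group 2 needs at least one character: group 1 gives back exactly one.
print(groups(r"(.*)(.+)(.*)", "abcd"))   # ('abc', 'd', '')
# Groups 2 and 3 both need one: group 2 gives up its extra character to
# group 3 before group 1 is asked to give back any more.
print(groups(r"(.*)(.+)(.+)", "abcd"))   # ('ab', 'c', 'd')
```

In this engine the leftmost group gets first priority, and on failure the most recently tried quantifier backtracks first.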
I need a regular expression that matches braces correctly, i.e. one closing brace for every opening one.
From abc{abc{bc}xyz} I need to get all of {abc{bc}xyz}, not just {abc{bc}. I tried this:
({.*?})
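The lazy pattern stops at the first closing brace, which is why it returns {abc{bc}. Arbitrary nesting needs either a regex flavor with recursion or a small counter loop. Here is a minimal Python sketch of the counter approach; `outermost_braced` is a hypothetical name.

```python
def outermost_braced(text):
    """Return every outermost balanced {...} span, braces included."""
    spans, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i  # remember where the outermost group opens
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # The brace that brings depth back to zero closes the outer group.
                spans.append(text[start:i + 1])
    return spans


print(outermost_braced("abc{abc{bc}xyz}"))
# ['{abc{bc}xyz}']
```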
Hi, I don't know much about regular expressions. I asked here for one that
gets either anything up to the first parenthesis/colon, or the first word inside the first parenthesis. This was the answer:
preg_match('/(?:^[^(:]+|(?<=^\\()[^\\s)]+)/', $var, $match);
I need an improvement: I need to get either anything up to the first parenthesis/colon/quotation mark, or the first word inside the first parenthesis.
So if I have something like:
$var = 'story "The Town in Hell"s Backyard'; // I get this: $match = 'story';
$var = "screenplay (based on)"; // I get this: $match = 'screenplay';
$var = "(play)"; // I get this: $match = 'play';
$var = "original screen"; // I get this: $match = 'original screen';
Thanks!
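A small change that seems to cover the new case is to add the double quote to the first negated character class. The sketch below checks that idea in Python rather than PHP, and trims the result, because the first alternative also captures the space in front of the quote or parenthesis. `FIRST_PART` and `first_part` are illustrative names.

```python
import re

# Same pattern as the earlier answer, with '"' added to the first negated class.
FIRST_PART = re.compile(r'(?:^[^(:"]+|(?<=^\()[^\s)]+)')


def first_part(text):
    """Return the leading text before ( : or ", or the first word inside a leading (."""
    match = FIRST_PART.search(text)
    return match.group(0).strip() if match else None


for value in ['story "The Town in Hell"s Backyard', "screenplay (based on)",
              "(play)", "original screen"]:
    print(repr(first_part(value)))
```

In PHP the equivalent would be adding an escaped quote to the class and passing the match through trim().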
I have a string like this:
&lt;![CDATA[&lt;ClinicalDocument&gt;rest of CCD here&lt;/ClinicalDocument&gt;]]&gt;
I'd like to replace the escape sequences with their non-escaped characters, to end up with:
<![CDATA[<ClinicalDocument>rest of CCD here</ClinicalDocument>]]>
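If the escape sequences are standard character entities such as &lt; and &gt; (an assumption, since the question does not name the language or the exact escaping), a library routine is usually safer than hand-written replacements. A minimal Python sketch using the standard library; the wrapper name `unescape_entities` is hypothetical.

```python
from html import unescape


def unescape_entities(text):
    """Replace character entities such as &lt; and &gt; with the characters they stand for."""
    return unescape(text)


escaped = "&lt;![CDATA[&lt;ClinicalDocument&gt;rest of CCD here&lt;/ClinicalDocument&gt;]]&gt;"
print(unescape_entities(escaped))
# <![CDATA[<ClinicalDocument>rest of CCD here</ClinicalDocument>]]>
```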