How to remove lowercase sentence fragments from text?
Posted by Aaron on Stack Overflow
Published on 2010-03-13T20:48:09Z
Hello:
I'm trying to remove lowercase sentence fragments from standard text files using regular expressions or a simple Perl one-liner.
These are commonly referred to as speech or attribution tags, for example: he said, she said, etc.
This example shows before and after using manual deletion:
- Original:
"Ah, that's perfectly true!" exclaimed Alyosha.
"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!" cried the girl by the window, suddenly turning to her father with a disdainful and contemptuous air.
"Wait a little, Varvara!" cried her father, speaking peremptorily but looking at them quite approvingly. "That's her character," he said, addressing Alyosha again.
"Where have you been?" he asked him.
"I think," he said, "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."
He sat down. Father stood over him.
"You sit down, too," said he.
- All lowercase sentence fragments manually removed:
"Ah, that's perfectly true!"
"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!"
"Wait a little, Varvara!" "That's her character,"
"Where have you been?"
"I think," "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."
He sat down. Father stood over him.
"You sit down, too,"
I've changed the straight quotes (") to balanced (curly) quotes and tried: ” (...)+[.]
Of course, this removes some fragments, but it also deletes some text inside balanced quotes and text starting with uppercase letters. Using [^A-Z] in the above expression didn't work.
I realize that it may be impossible to achieve 100% accuracy, but any useful expression, Perl, or Python script would be deeply appreciated.
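To make the goal concrete, here is a rough Python sketch of the behavior I'm after. It is not a tested solution. It assumes straight double quotes, quotes that open and close on the same line, and one paragraph per line. The function name remove_attribution_tags is just a placeholder. The idea is to split each line into quoted and unquoted pieces, always keep the quoted pieces, and drop any unquoted piece whose first character is a lowercase letter.

```python
import re

# Matches one straight-quoted span; the capture group makes re.split keep it.
QUOTED = re.compile(r'("[^"]*")')

def remove_attribution_tags(line):
    """Drop unquoted pieces that begin with a lowercase letter.

    Assumes straight double quotes that are balanced within the line.
    """
    kept = []
    for piece in QUOTED.split(line):
        text = piece.strip()
        if not text:
            continue
        # Keep quoted speech, and narration that does not start lowercase.
        if text.startswith('"') or not text[0].islower():
            kept.append(text)
    return " ".join(kept)

sample = [
    '"Where have you been?" he asked him.',
    'He sat down. Father stood over him.',
    '"You sit down, too," said he.',
]
for line in sample:
    print(remove_attribution_tags(line))
```

A line with an unbalanced quote is treated as a single unquoted piece, so it would be dropped whole if it starts with a lowercase letter. That is one of the cases where I don't expect 100% accuracy.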
Cheers,
Aaron
© Stack Overflow or respective owner