Parsing SRT subtitles

Posted by Vojtech R. on Stack Overflow
Published on 2010-04-11T10:54:32Z

Hi,

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    tema pracovníma svetlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

I want to put every item into a structure. I tried these two regexes:

A:

    RE_ITEM = re.compile(r'''(?P<index>\d+).(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3}).(?P<text>.*?)''', re.DOTALL)

B:

    RE_ITEM = re.compile(r'''(?P<index>\d+).(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3}).(?P<text>.*)''', re.DOTALL)

And this code:

    for i in Subtitles.RE_ITEM.finditer(text):
        result.append((i.group('index'), i.group('start'),
                       i.group('end'), i.group('text')))

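With regex B I get only one item in the list, because the greedy .* swallows everything up to the end of the file. With regex A I get all the items, but 'text' is always empty, because the non-greedy .*? matches as little as possible and nothing follows it in the pattern.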
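To make the difference concrete, here is a minimal sketch that runs both patterns over the sample above. The names `SAMPLE`, `HEAD`, `RE_LAZY` and `RE_GREEDY` exist only for this illustration, and the sample is assumed to use plain `\n` line endings.

```python
import re

# The three subtitle items from the question, assuming "\n" line endings.
SAMPLE = (
    "1\n"
    "00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\n"
    "tema pracovníma svetlama?.\n"
    "\n"
    "2\n"
    "00:00:14,815 --> 00:00:16,498\n"
    "Trochu je zesilujeme.\n"
    "\n"
    "3\n"
    "00:00:16,934 --> 00:00:17,814\n"
    "Jo, sleduj.\n"
)

# Shared prefix of regex A and regex B (index line and timing line).
HEAD = (r'(?P<index>\d+).(?P<start>\d{2}:\d{2}:\d{2},\d{3})'
        r' --> (?P<end>\d{2}:\d{2}:\d{2},\d{3}).')

RE_LAZY = re.compile(HEAD + r'(?P<text>.*?)', re.DOTALL)    # regex A
RE_GREEDY = re.compile(HEAD + r'(?P<text>.*)', re.DOTALL)   # regex B

# Collect only the 'text' group of every match.
lazy = [m.group('text') for m in RE_LAZY.finditer(SAMPLE)]
greedy = [m.group('text') for m in RE_GREEDY.finditer(SAMPLE)]

print(len(lazy), lazy)   # one match per item, each with an empty text
print(len(greedy))       # a single match that runs to the end of the input
```

Regex A stops as soon as the pattern is satisfied, which is immediately, and regex B has no reason to stop before the end of the string.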

How can I fix this?
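One direction that might work, offered as a sketch and not a definitive answer: keep the lazy `.*?`, but follow it with a lookahead that stops at a blank line or at the end of the input. This assumes `\n` line endings and that the subtitle text itself never contains a blank line. `parse_srt` and `SAMPLE` are hypothetical names used only here.

```python
import re

# Lazy text group bounded by a lookahead: it ends before a blank line
# ("\n\n") or before trailing whitespace at the end of the input.
RE_ITEM = re.compile(
    r'(?P<index>\d+)\n'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\n'
    r'(?P<text>.*?)(?=\n\n|\s*\Z)',
    re.DOTALL,
)


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples, all as strings."""
    return [
        (m.group('index'), m.group('start'), m.group('end'), m.group('text'))
        for m in RE_ITEM.finditer(text)
    ]


# The sample from the question, assuming "\n" line endings.
SAMPLE = (
    "1\n"
    "00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\n"
    "tema pracovníma svetlama?.\n"
    "\n"
    "2\n"
    "00:00:14,815 --> 00:00:16,498\n"
    "Trochu je zesilujeme.\n"
    "\n"
    "3\n"
    "00:00:16,934 --> 00:00:17,814\n"
    "Jo, sleduj.\n"
)

for item in parse_srt(SAMPLE):
    print(item)
```

Files with Windows line endings would need the `\n` parts adjusted, for example by normalizing the text before matching.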

Thanks

© Stack Overflow or respective owner
