Detecting pauses in a spoken-word audio file using pymad, PCM, VAD, etc.

Posted by james on Stack Overflow
Published on 2010-04-13T00:51:31Z

First I am going to broadly state what I'm trying to do and ask for advice. Then I will explain my current approach and ask for answers to my current problems.


Problem

I have an MP3 file of a person speaking. I'd like to split it up into segments roughly corresponding to a sentence or phrase. (I'd do it manually, but we are talking hours of data.)

If you have advice on how to do this programmatically, or know of existing utilities that do it, I'd love to hear it. (I'm aware of voice activity detection and I've looked into it a bit, but I didn't find any freely available utilities.)


Current Approach

I thought the simplest thing would be to scan the MP3 at regular intervals and identify places where the average volume is below some threshold. Then I would use an existing utility to cut up the MP3 at those locations.
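To make the idea concrete, here is a minimal sketch of the scanning step. It assumes the audio has already been decoded to a list of mono 16-bit signed samples; the function name `find_pauses` and the window length, threshold, and minimum pause duration are illustrative guesses, not tuned values.

```python
import math

def find_pauses(samples, sample_rate, window_s=0.05,
                threshold_db=-40.0, min_pause_s=0.3):
    """Return (start_s, end_s) for each run of quiet windows.

    samples: mono 16-bit signed integers (assumption).
    threshold_db and min_pause_s are placeholder values to tune by ear.
    """
    window = max(1, round(sample_rate * window_s))
    n_windows = len(samples) // window
    pauses = []
    run_start = None  # start time of the current quiet run, if any

    for w in range(n_windows):
        chunk = samples[w * window:(w + 1) * window]
        mean_square = sum(s * s for s in chunk) / len(chunk)
        # Level in dB relative to full scale (32768 for 16-bit audio).
        if mean_square > 0:
            level = 10 * math.log10(mean_square / 32768.0 ** 2)
        else:
            level = float("-inf")

        quiet = level < threshold_db
        t = w * window / sample_rate
        if quiet and run_start is None:
            run_start = t
        elif not quiet and run_start is not None:
            if t - run_start >= min_pause_s:
                pauses.append((run_start, t))
            run_start = None

    # Close a quiet run that extends to the end of the audio.
    if run_start is not None:
        end = n_windows * window / sample_rate
        if end - run_start >= min_pause_s:
            pauses.append((run_start, end))
    return pauses
```

The midpoint of each returned pause would be a natural place to cut. Requiring a minimum pause length keeps short gaps between words from being treated as sentence breaks.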

I've been playing around with pymad, and I believe I've successfully extracted the PCM (pulse-code modulation) data for each frame of the MP3. Now I'm stuck because I can't work out how the PCM data translates to relative volume. I'm also aware of other complicating factors, such as multiple channels and big-endian vs. little-endian byte order.
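As a sketch of how the channel and byte-order factors could be handled, the following turns a raw PCM buffer into mono samples using only the standard `struct` module. It assumes 16-bit signed, interleaved samples; the helper name `to_mono` and its parameters are hypothetical, and the actual format of the decoder's output should be checked against its documentation.

```python
import struct

def to_mono(pcm_bytes, channels=2, big_endian=False):
    """Decode interleaved 16-bit signed PCM and average channels to mono.

    Assumes 2 bytes per sample; any trailing partial frame is ignored.
    """
    order = ">" if big_endian else "<"
    frame_size = 2 * channels
    frames = len(pcm_bytes) // frame_size
    samples = struct.unpack(
        f"{order}{frames * channels}h", pcm_bytes[:frames * frame_size]
    )
    # Each frame holds one sample per channel: L, R, L, R, ... for stereo.
    return [
        sum(samples[i * channels:(i + 1) * channels]) // channels
        for i in range(frames)
    ]
```

If the byte order is wrong, speech tends to decode as loud noise, so a quick sanity check is whether known-silent stretches come out near zero.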

Advice on how to map a group of PCM samples to relative volume would be key.
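One common way to do this mapping is the root mean square (RMS) of the samples, expressed in decibels relative to full scale (dBFS). The sketch below assumes 16-bit signed samples, so full scale is 32768; the function name `rms_dbfs` is made up for illustration.

```python
import math
import struct

def rms_dbfs(pcm_bytes, big_endian=False):
    """Return the RMS level of 16-bit signed PCM in dBFS.

    0 dBFS is roughly the loudest possible signal; silence is -inf.
    """
    order = ">" if big_endian else "<"
    count = len(pcm_bytes) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack(f"{order}{count}h", pcm_bytes[:count * 2])
    mean_square = sum(s * s for s in samples) / count
    if mean_square == 0:
        return float("-inf")
    rms = math.sqrt(mean_square)
    return 20 * math.log10(rms / 32768.0)
```

Because the scale is logarithmic, halving the amplitude lowers the result by about 6 dB, which makes a fixed threshold easier to reason about than raw sample values.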

Thanks!

© Stack Overflow or respective owner
