String chunking algorithm with natural language context

Posted by Chris Ballance on Stack Overflow
Published on 2010-03-22T18:37:50Z Indexed on 2010/03/22 18:41 UTC

I have an arbitrarily large string of text from the user that needs to be split into 10k chunks (the size may need to be adjustable) and sent off to another system for processing.

  • Chunks cannot be longer than 10k (or other arbitrary value)
  • Text should be broken with natural language context in mind
    • split on punctuation when possible
    • split on spaces if no punctuation exists
    • break a word as a last resort

I'm trying not to reinvent the wheel here. Any suggestions before I roll this from scratch?

Using C#.
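
For reference, here is a minimal sketch of the fallback order described above (punctuation first, then whitespace, then a hard break). It is only an illustration under a few assumptions, not an existing library API: the names `TextChunker`, `Chunk`, and `FindBreak` are hypothetical, "10k" is taken to mean 10,000 UTF-16 characters (`string.Length`), and the punctuation set is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of any library.
public static class TextChunker
{
    // Assumed set of "natural" break characters; adjust to taste.
    private const string BreakPunctuation = ".!?;:,";

    // Splits text into chunks of at most maxLength UTF-16 characters.
    // No characters are dropped, so concatenating the chunks restores the input.
    public static List<string> Chunk(string text, int maxLength = 10000)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxLength < 1) throw new ArgumentOutOfRangeException(nameof(maxLength));

        var chunks = new List<string>();
        int start = 0;
        while (text.Length - start > maxLength)
        {
            int end = FindBreak(text, start, maxLength);
            chunks.Add(text.Substring(start, end - start));
            start = end;
        }
        if (start < text.Length) chunks.Add(text.Substring(start));
        return chunks;
    }

    // Returns the exclusive end index of the next chunk, scanning backwards
    // from the size limit so each chunk is as long as the rules allow.
    private static int FindBreak(string text, int start, int maxLength)
    {
        int limit = start + maxLength; // caller guarantees limit < text.Length

        // 1. Prefer to break just after the last punctuation mark in the window.
        for (int i = limit - 1; i >= start; i--)
            if (BreakPunctuation.IndexOf(text[i]) >= 0) return i + 1;

        // 2. Otherwise break just after the last whitespace character.
        for (int i = limit - 1; i >= start; i--)
            if (char.IsWhiteSpace(text[i])) return i + 1;

        // 3. Last resort: break mid-word, but avoid splitting a surrogate pair.
        if (limit - 1 > start && char.IsHighSurrogate(text[limit - 1])) return limit - 1;
        return limit;
    }
}
```

Calling `TextChunker.Chunk(userText)` returns the pieces in order, and `TextChunker.Chunk(userText, 2000)` would use a smaller limit. Two caveats with this sketch: it breaks at the last punctuation mark anywhere in the window, so it treats the period in "3.14" as a break point, and it measures length in characters. If the receiving system's limit is in bytes, the length check would need to work on the encoded size instead.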
