String chunking algorithm with natural language context
Posted by Chris Ballance on Stack Overflow
Published on 2010-03-22T18:37:50Z
Indexed on 2010/03/22 18:41 UTC
I have an arbitrarily large string of text from the user that needs to be split into 10k chunks (the size may be adjustable) and sent off to another system for processing.
- Chunks cannot be longer than 10k (or another arbitrary value)
- Text should be broken with natural language context in mind:
  - split on punctuation when possible
  - split on spaces if no punctuation exists
  - break a word as a last resort
I'm trying not to reinvent the wheel with this. Any suggestions before I roll this from scratch?
Using C#.
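For reference, the three rules above could be sketched roughly as follows. This is only a minimal illustration, not an existing library routine: `TextChunker`, `Chunk`, and the punctuation set are hypothetical names and choices, and the limit is assumed to be counted in `char` values (UTF-16 code units) rather than bytes.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Hypothetical helper: yields pieces of at most maxLength chars, preferring
    // to cut after punctuation, then after whitespace, then mid-word.
    public static IEnumerable<string> Chunk(string text, int maxLength = 10000)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (maxLength <= 0) throw new ArgumentOutOfRangeException("maxLength");

        // Assumed punctuation set; adjust for the text being processed.
        char[] punctuation = { '.', '!', '?', ';', ':', ',' };
        int start = 0;

        while (text.Length - start > maxLength)
        {
            // Last index that may still belong to the current chunk.
            int last = start + maxLength - 1;

            // 1. Prefer the last punctuation mark inside the window.
            int cut = text.LastIndexOfAny(punctuation, last, maxLength);

            // 2. Otherwise fall back to the last whitespace character.
            if (cut < 0)
            {
                for (int i = last; i >= start; i--)
                {
                    if (char.IsWhiteSpace(text[i])) { cut = i; break; }
                }
            }

            // 3. Last resort: break the word at the size limit.
            int end = cut < 0 ? last + 1 : cut + 1;

            yield return text.Substring(start, end - start);
            start = end;
        }

        // Whatever remains already fits within the limit.
        if (start < text.Length) yield return text.Substring(start);
    }
}
```

With these assumptions, `TextChunker.Chunk("Hello world. This is a test", 15)` yields `"Hello world."` followed by `" This is a test"`. The separator stays at the end of the chunk it closes, so concatenating the chunks reproduces the input exactly. Known gaps in this sketch: a hard cut can split a surrogate pair, and a punctuation mark early in the window wins over a later space, which can produce short chunks.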
© Stack Overflow or respective owner