String chunking algorithm with natural language context
- by Chris Ballance
I have an arbitrarily large string of text from the user that needs to be split into 10k chunks (the size should be adjustable) and sent off to another system for processing.
Chunks cannot be longer than 10k (or other arbitrary value)
Text should be broken with natural language context in mind, in this order of preference:
split on punctuation when possible
split on spaces if no punctuation exists
break a word as a last resort
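To make the requirements concrete, here is a rough sketch of what I'd probably end up writing by hand. The names (`TextChunker`, `Chunk`) and the punctuation set are just my assumptions, and it measures length in UTF-16 `char`s, so it could split a surrogate pair in the last-resort case. It looks backwards from the end of each window for punctuation, then for a space, and only then cuts mid-word:

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Assumed punctuation set; adjust to taste.
    private static readonly char[] Punctuation = { '.', '!', '?', ';', ':', ',' };

    public static IEnumerable<string> Chunk(string text, int maxLength = 10000)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        int start = 0;
        while (start < text.Length)
        {
            // Whatever is left fits in one chunk.
            if (text.Length - start <= maxLength)
            {
                yield return text.Substring(start);
                yield break;
            }

            // Last index that may still belong to this chunk.
            int last = start + maxLength - 1;

            // 1. Prefer the last punctuation mark inside the window.
            int cut = text.LastIndexOfAny(Punctuation, last, maxLength);

            // 2. Otherwise fall back to the last space inside the window.
            if (cut < 0) cut = text.LastIndexOf(' ', last, maxLength);

            // 3. Last resort: break mid-word at the size limit.
            if (cut < 0) cut = last;

            // The delimiter stays at the end of the chunk, so concatenating
            // all chunks gives back the original text.
            yield return text.Substring(start, cut - start + 1);
            start = cut + 1;
        }
    }
}
```

Usage would be something like `foreach (var chunk in TextChunker.Chunk(userText, 10000)) { Send(chunk); }`, where `Send` stands in for whatever call hands the chunk to the other system.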
I'm trying not to reinvent the wheel here. Any suggestions before I roll this from scratch?
Using C#.