High-level strategy for distinguishing a regular string from invalid JSON (i.e. JSON-like string detection)

Posted by Jonline on Programmers
Published on 2014-05-30T17:26:57Z Indexed on 2014/05/30 21:57 UTC

Disclaimer On Absence of Code:

I have no code to post because I haven't started writing yet. I'm looking for theoretical guidance: I doubt I'll have trouble coding it, but I'm befuddled about which approach(es) would yield the best results. I'm not seeking code, either; just direction.

Dilemma

I'm toying with adding a "magic method"-style feature to a UI I'm building for a client. It would require intelligently detecting whether a string was meant to be JSON, as opposed to a simple string.

I had considered these general ideas:

  1. Look for an arbitrarily chosen acceptable ratio between the frequency of JSON-like syntax (e.g. a regex to find strings separated by colons, colons between curly braces, etc.) and the number of quote-encapsulated strings plus nulls, bools and ints/floats. The smaller the data set, though, the more fickle this would get.

  2. Look for key identifiers such as opening and closing curly braces. I'm not sure there even are other easy identifiers, and this doesn't appeal anyway because it's so prescriptive about the kinds of mistakes it could find.

  3. Try incrementally parsing chunks, such as those between curly braces, and see what proportion of these fragments turn out to be valid JSON. This seems like it would suffer less than (1) from small data sets, but it would probably be much more processing-intensive, and very susceptible to a missing or inverted brace.
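
For concreteness, here is a rough Python sketch of how the three ideas could be combined; it is an illustration under stated assumptions, not a recommended implementation. It tries a strict parse first, then falls back to the cheap structural hints from (1) and (2), and separately computes the "proportion of valid chunks" from (3). The names `classify`, `chunk_valid_ratio`, `PAIR_RE` and `INNER_OBJECT_RE` are made up for this example, and the rule that only objects and arrays count as "meant to be JSON" is an assumption.

```python
import json
import re

# Idea 1: a quoted string followed by a colon looks like a JSON key.
PAIR_RE = re.compile(r'"(?:[^"\\]|\\.)*"\s*:')
# Idea 3: innermost {...} chunks. Naive on purpose: braces inside quoted
# strings will confuse it.
INNER_OBJECT_RE = re.compile(r'\{[^{}]*\}')


def _parses(text):
    """Return True if text is strictly valid JSON of any type."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def _parses_as_container(text):
    """Return True only for valid JSON objects or arrays.

    Assumption: bare scalars such as 42 or "hi" are valid JSON but are
    treated as ordinary strings here.
    """
    try:
        value = json.loads(text)
    except ValueError:
        return False
    return isinstance(value, (dict, list))


def classify(text):
    """Label text as 'json', 'json-like' (probably broken JSON) or 'plain'."""
    stripped = text.strip()
    if not stripped:
        return "plain"
    if _parses_as_container(stripped):
        return "json"
    # Idea 2: a bracket at either end tolerates one missing brace.
    bracketed = stripped[0] in "{[" or stripped[-1] in "}]"
    # Idea 1, reduced to "at least one key-like pair".
    has_pairs = PAIR_RE.search(stripped) is not None
    return "json-like" if bracketed or has_pairs else "plain"


def chunk_valid_ratio(text):
    """Idea 3: fraction of innermost {...} chunks that parse on their own."""
    chunks = INNER_OBJECT_RE.findall(text)
    if not chunks:
        return 0.0
    return sum(1 for chunk in chunks if _parses(chunk)) / len(chunks)
```

Under these assumptions, `classify('{"a": 1, "b": 2')` returns `"json-like"` and `classify("Hello, world")` returns `"plain"`. The bracket check will still misfire on prose such as `[citation needed]`, so a real version would probably want a tunable score rather than a hard either/or rule.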

Just curious if the computational folks or algorithm pros out there had any approaches in mind that my semantics-oriented brain might have missed.

PS: It occurs to me that natural language processing, about which I am totally ignorant, might be a cool approach. But even if NLP is a good strategy here, it's a moot point: I have zero experience with it, I don't have time to learn and then implement it, and this feature isn't worth that much to the client.

© Programmers or respective owner
