I'm working with a binary structure whose goal is to index the significance of specific bits for any character encoding, so that we can trigger events while running specific checks against the profile.
Each character encoding scheme has an associated system record. The record's leading value is a C++ `unsigned long long` that gives the length, in bits, of encoded characters.
Following the length are three values, each a bit field of that length (see the sketch after this list):
offset_mask - defines the occurrence of non-printable characters within the min/max range of print_mask
range_mask - defines the occurrence of the most popular 50% of printable characters
print_mask - defines the occurrence of printable characters
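To make the layout concrete, here is a minimal sketch of how I currently picture the record. The struct and field names are my own placeholders, not a settled format, and it assumes character lengths of at most 64 bits:

```cpp
#include <cstdint>

// Hypothetical per-encoding system record. Assumes char_bits <= 64 so
// each mask fits in a single 64-bit word; wider encodings would need
// an array of words instead.
struct EncodingProfile {
    unsigned long long char_bits;   // length, in bits, of encoded characters

    std::uint64_t offset_mask;  // non-printables within print_mask's min/max range
    std::uint64_t range_mask;   // most popular 50% of printable characters
    std::uint64_t print_mask;   // printable characters
};
```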
The structure of the profiles has changed since the original version of this question. After reading more, I will most likely try to factor or compress these values in the long term rather than starting out with ranges.
I have to write some of the core functionality myself, for these main reasons:
It has to fit into a particular event architecture we are using.
I want a better understanding of character encoding, and I'm about to need it.
Integrating with a non-linear design rules out many libraries that lack special hooks.
I'm unsure whether a standard, cross-encoding mechanism for communicating such data already exists. I'm just starting to look into how chardet does its profiling, as suggested by @amon. The Unicode BOM would easily be enough (for my current project) if all encodings were Unicode.
Of course, ideally one would like to support all encodings, but I'm not asking about implementation, only about the general case.
How can these profiles be efficiently populated to produce a set of bitmasks that we can use to match strings containing characters common to multiple languages?
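To show what I mean by "populated", here is a rough single-pass sketch building on the EncodingProfile struct above. The printability test is a placeholder (classifying printability is encoding-specific, which is part of what I'm asking about), and the frequency pass needed for range_mask is omitted:

```cpp
#include <cstdint>
#include <vector>

// Placeholder printability test (ASCII-style); the real test is
// encoding-specific.
bool is_printable(std::uint64_t cu) { return cu >= 0x20 && cu < 0x7F; }

// Naive population pass over a sample corpus for one encoding: OR each
// observed code unit into the matching mask, so a set bit in a mask means
// "this bit position occurred in at least one character of that class".
// The print_mask min/max window for offset_mask is ignored for simplicity.
void populate(EncodingProfile& p, const std::vector<std::uint64_t>& corpus) {
    for (std::uint64_t cu : corpus) {
        if (is_printable(cu))
            p.print_mask |= cu;
        else
            p.offset_mask |= cu;
    }
    // range_mask (most popular 50% of printables) would need a second,
    // frequency-counting pass; omitted here.
}
```

Even this naive version makes the cost visible: population is linear in the corpus, so the interesting part is choosing the corpus and the per-encoding printability classification.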
If you have any editing suggestions, please feel free to make them; I'm a lightweight when it comes to localization, which is why I'm trying to reach out to the more experienced. Any caveats you can point out will be appreciated.