How can I avoid encoding mixups of strings in a C/C++ API?
- by Frerich Raabe
I'm working on implementing different APIs in C and C++ and am wondering what techniques are available to keep clients from getting the encoding wrong when receiving strings from the framework or passing them back. For instance, imagine a simple plugin API in C++ which customers can implement to influence translations. It might feature a function like this:
const char *getTranslatedWord( const char *englishWord );
Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:
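class Word {
public:
    static Word fromUtf8( const char *data ) { return Word( data ); }
    // const, so it can be called on a Word passed by const reference
    const char *toUtf8() const { return m_data; }
private:
    Word( const char *data ) : m_data( data ) { }
    const char *m_data; // not owned; the caller keeps the buffer alive
};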
I could now use this specialized type in the API:
Word getTranslatedWord( const Word &englishWord );
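To make the intent concrete, here is a minimal sketch of how a plugin and a call site might look with such a type. The hard-coded "hello"/"hallo" entry, the translateHello helper and the null check in fromUtf8 are made up for illustration, and toUtf8 is marked const so that it can be called through the const reference. The wrapper does not own the data; it only tags a pointer as UTF-8.

```cpp
#include <cassert>
#include <cstring>

// Thin, non-owning wrapper: it only labels a pointer as "UTF-8 encoded".
class Word {
public:
    static Word fromUtf8( const char *data )
    {
        assert( data != 0 ); // assumption for this sketch: null is not a valid word
        return Word( data );
    }
    const char *toUtf8() const { return m_data; }
private:
    // Private, so there is no conversion from a raw const char * outside the class.
    Word( const char *data ) : m_data( data ) { }
    const char *m_data; // not owned; the caller keeps the buffer alive
};

// Hypothetical plugin implementation with a single hard-coded entry.
Word getTranslatedWord( const Word &englishWord )
{
    if ( std::strcmp( englishWord.toUtf8(), "hello" ) == 0 )
        return Word::fromUtf8( "hallo" );
    return englishWord; // no translation known; hand the input back unchanged
}

// Hypothetical call site: the caller has to name the encoding explicitly.
const char *translateHello()
{
    // getTranslatedWord( "hello" ) would not compile here, because the
    // constructor taking a const char * is private.
    return getTranslatedWord( Word::fromUtf8( "hello" ) ).toUtf8();
}
```

Nothing here checks that the bytes really are valid UTF-8; the type only forces the caller to state which encoding is meant.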
Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators and so on, and I'd like to avoid unnecessary copying of data as much as possible. Also, I see the danger that Word gets extended with more and more utility functions (like length, fromLatin1 or substr), and I'd rather not write Yet Another String Class. I just want a small container that prevents accidental encoding mixups.
I wonder whether anybody else has some experience with this and can share some useful techniques.
EDIT: In my particular case, the API is used on Windows and Linux, with MSVC 6 through MSVC 10 on Windows and gcc 3 and 4 on Linux.