L10N: Trusted test data for Locale-Specific Sorting
- by Chris Betti
I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database is supposed to sort the data using a collation appropriate to the locale associated with the data the user is viewing.
I'm trying to find sorted lists of words that meet two criteria:
1. The sorted order follows the collation rules for the locale.
2. The words listed will allow me to exercise most or all of the specific collation rules for the locale.
I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, where can I find them?
For example, "words.en.txt" is a text file containing American English words in sorted order:
Andrew
Brian
Chris
Zachary
I plan to load the list of words into my database in randomized order, sort it there, and check that the sorted result matches the order in the original file.
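The round-trip check described above could look roughly like this. It is only a minimal Python sketch under stated assumptions: `check_sort_round_trip` and `sort_fn` are hypothetical names, and Python's built-in `sorted` stands in for whatever query the database would actually run to sort the rows.

```python
import random


def check_sort_round_trip(expected, sort_fn, seed=0):
    """Shuffle a trusted, pre-sorted list, re-sort it, and compare.

    expected: words in the order the locale's collation should produce.
    sort_fn:  hypothetical stand-in for the sort under test (in practice,
              a function that loads the words and queries the database).
    """
    shuffled = list(expected)
    # A fixed seed keeps a failing run reproducible.
    random.Random(seed).shuffle(shuffled)
    return list(sort_fn(shuffled)) == list(expected)


# Contents of "words.en.txt" from the post, in trusted order.
english_words = ["Andrew", "Brian", "Chris", "Zachary"]

# Assumption: sorted() replaces the real database sort for this sketch.
english_ok = check_sort_round_trip(english_words, sorted)
print(english_ok)  # True
```

The value of the check depends entirely on the trusted list: it can only catch collation rules that the words in the file actually exercise.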
Because I am not fluent in any language other than English, I do not know how to create sample datasets for other locales, such as this French one (call it "words.fr.txt"):
cote
côte
coté
côté
French collation compares diacritical marks from right to left, so the accent difference closest to the end of the word decides the order. If you sort that list in code-point order instead, it comes out like this, which is an incorrect collation:
cote
coté
côte
côté
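The difference between the two orderings can be reproduced with a small Python sketch. It is a simplified illustration, not the Unicode Collation Algorithm and not what any particular database does: `backward_accent_key` is a hypothetical helper that compares base letters first and then compares accents starting from the end of the word. A real test would rely on the database's own collation or on a full collation library.

```python
import unicodedata


def backward_accent_key(word):
    """Hypothetical two-level sort key: base letters, then accents right to left."""
    bases = []
    accents = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch  # attach the mark to the preceding base letter
        else:
            bases.append(ch.lower())
            accents.append("")
    # Reversing the accent list makes the last accent difference decide first.
    return ("".join(bases), tuple(reversed(accents)))


words = ["coté", "côté", "cote", "côte"]

code_point_order = sorted(words)
french_order = sorted(words, key=backward_accent_key)

print(code_point_order)  # ['cote', 'coté', 'côte', 'côté']
print(french_order)      # ['cote', 'côte', 'coté', 'côté']
```

This key ignores everything else a real collation handles (case levels, ligatures, punctuation, locale tailoring), so it is only useful for seeing why the two lists above differ.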
Thank you for the help,
Chris