I am writing a simple OCR solution for a finite set of characters. That is, I know the exact way all 26 letters in the alphabet will look like. I am using C# and am able to easily determine if a given pixel should be treated as black or white.
I am generating a matrix of black/white pixels for every single character. So for example, the letter I (capital i), might look like the following:
01110
00100
00100
00100
01110
Note: all points, which I use later in this post, assume that the top left pixel is (0, 0), bottom right pixel is (4, 4). 1's represent black pixels, and 0's represent white pixels.
I would create a corresponding matrix in C# like this:
CreateLetter("I", new List<List<bool>>() {
new List<bool>() { false, true, true, true, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, false, true, false, false },
new List<bool>() { false, true, true, true, false }
});
I know I could probably optimize this part by using a multi-dimensional array instead, but let's ignore that for now, this is for illustrative purposes. Every letter is exactly the same dimensions, 10px by 11px (10px by 11px is the actual dimensions of a character in my real program. I simplified this to 5px by 5px in this posting since it is much easier to "draw" the letters using 0's and 1's on a smaller image).
Now when I give it a 10px by 11px part of an image to analyze with OCR, it would need to run on every single letter (26) on every single pixel (10 * 11 = 110) which would mean 2,860 (26 * 110) iterations (in the worst case) for every single character.
I was thinking this could be optimized by defining the unique characteristics of every character. So, for example, let's assume that the set of characters only consists of 5 distinct letters: I, A, O, B, and L. These might look like the following:
01110 00100 00100 01100 01000
00100 01010 01010 01010 01000
00100 01110 01010 01100 01000
00100 01010 01010 01010 01000
01110 01010 00100 01100 01110
After analyzing the unique characteristics of every character, I can significantly reduce the number of tests that need to be performed to test for a character. For example, for the "I" character, I could define it's unique characteristics as having a black pixel in the coordinate (3, 0) since no other characters have that pixel as black. So instead of testing 110 pixels for a match on the "I" character, I reduced it to a 1 pixel test.
This is what it might look like for all these characters:
var LetterI = new OcrLetter() {
Name = "I",
BlackPixels = new List<Point>() { new Point (3, 0) }
}
var LetterA = new OcrLetter() {
Name = "A",
WhitePixels = new List<Point>() { new Point(2, 4) }
}
var LetterO = new OcrLetter() {
Name = "O",
BlackPixels = new List<Point>() { new Point(3, 2) },
WhitePixels = new List<Point>() { new Point(2, 2) }
}
var LetterB = new OcrLetter() {
Name = "B",
BlackPixels = new List<Point>() { new Point(3, 1) },
WhitePixels = new List<Point>() { new Point(3, 2) }
}
var LetterL = new OcrLetter() {
Name = "L",
BlackPixels = new List<Point>() { new Point(1, 1), new Point(3, 4) },
WhitePixels = new List<Point>() { new Point(2, 2) }
}
This is challenging to do manually for 5 characters and gets much harder the greater the amount of letters that are added. You also want to guarantee that you have the minimum set of unique characteristics of a letter since you want it to be optimized as much as possible.
I want to create an algorithm that will identify the unique characteristics of all the letters and would generate similar code to that above. I would then use this optimized black/white matrix to identify characters.
How do I take the 26 letters that have all their black/white pixels filled in (e.g. the CreateLetter code block) and convert them to an optimized set of unique characteristics that define a letter (e.g. the new OcrLetter() code block)? And how would I guarantee that it is the most efficient definition set of unique characteristics (e.g. instead of defining 6 points as the unique characteristics, there might be a way to do it with 1 or 2 points, as the letter "I" in my example was able to).
An alternative solution I've come up with is using a hash table, which will reduce it from 2,860 iterations to 110 iterations, a 26 time reduction. This is how it might work:
I would populate it with data similar to the following:
Letters["01110 00100 00100 00100 01110"] = "I";
Letters["00100 01010 01110 01010 01010"] = "A";
Letters["00100 01010 01010 01010 00100"] = "O";
Letters["01100 01010 01100 01010 01100"] = "B";
Now when I reach a location in the image to process, I convert it to a string such as: "01110 00100 00100 00100 01110" and simply find it in the hash table. This solution seems very simple, however, this still requires 110 iterations to generate this string for each letter.
In big O notation, the algorithm is the same since O(110N) = O(2860N) = O(N) for N letters to process on the page. However, it is still improved by a constant factor of 26, a significant improvement (e.g. instead of it taking 26 minutes, it would take 1 minute).
Update: Most of the solutions provided so far have not addressed the issue of identifying the unique characteristics of a character and rather provide alternative solutions. I am still looking for this solution which, as far as I can tell, is the only way to achieve the fastest OCR processing.
I just came up with a partial solution:
For each pixel, in the grid, store the letters that have it as a black pixel.
Using these letters:
I A O B L
01110 00100 00100 01100 01000
00100 01010 01010 01010 01000
00100 01110 01010 01100 01000
00100 01010 01010 01010 01000
01110 01010 00100 01100 01110
You would have something like this:
CreatePixel(new Point(0, 0), new List<Char>() { });
CreatePixel(new Point(1, 0), new List<Char>() { 'I', 'B', 'L' });
CreatePixel(new Point(2, 0), new List<Char>() { 'I', 'A', 'O', 'B' });
CreatePixel(new Point(3, 0), new List<Char>() { 'I' });
CreatePixel(new Point(4, 0), new List<Char>() { });
CreatePixel(new Point(0, 1), new List<Char>() { });
CreatePixel(new Point(1, 1), new List<Char>() { 'A', 'B', 'L' });
CreatePixel(new Point(2, 1), new List<Char>() { 'I' });
CreatePixel(new Point(3, 1), new List<Char>() { 'A', 'O', 'B' });
// ...
CreatePixel(new Point(2, 2), new List<Char>() { 'I', 'A', 'B' });
CreatePixel(new Point(3, 2), new List<Char>() { 'A', 'O' });
// ...
CreatePixel(new Point(2, 4), new List<Char>() { 'I', 'O', 'B', 'L' });
CreatePixel(new Point(3, 4), new List<Char>() { 'I', 'A', 'L' });
CreatePixel(new Point(4, 4), new List<Char>() { });
Now for every letter, in order to find the unique characteristics, you need to look at which buckets it belongs to, as well as the amount of other characters in the bucket. So let's take the example of "I". We go to all the buckets it belongs to (1,0; 2,0; 3,0; ...; 3,4) and see that the one with the least amount of other characters is (3,0). In fact, it only has 1 character, meaning it must be an "I" in this case, and we found our unique characteristic.
You can also do the same for pixels that would be white. Notice that bucket (2,0) contains all the letters except for "L", this means that it could be used as a white pixel test. Similarly, (2,4) doesn't contain an 'A'.
Buckets that either contain all the letters or none of the letters can be discarded immediately, since these pixels can't help define a unique characteristic (e.g. 1,1; 4,0; 0,1; 4,4).
It gets trickier when you don't have a 1 pixel test for a letter, for example in the case of 'O' and 'B'. Let's walk through the test for 'O'...
It's contained in the following buckets:
// Bucket Count Letters
// 2,0 4 I, A, O, B
// 3,1 3 A, O, B
// 3,2 2 A, O
// 2,4 4 I, O, B, L
Additionally, we also have a few white pixel tests that can help: (I only listed those that are missing at most 2). The Missing Count was calculated as (5 - Bucket.Count).
// Bucket Missing Count Missing Letters
// 1,0 2 A, O
// 1,1 2 I, O
// 2,2 2 O, L
// 3,4 2 O, B
So now we can take the shortest black pixel bucket (3,2) and see that when we test for (3,2) we know it is either an 'A' or an 'O'. So we need an easy way to tell the difference between an 'A' and an 'O'. We could either look for a black pixel bucket that contains 'O' but not 'A' (e.g. 2,4) or a white pixel bucket that contains an 'O' but not an 'A' (e.g. 1,1). Either of these could be used in combination with the (3,2) pixel to uniquely identify the letter 'O' with only 2 tests.
This seems like a simple algorithm when there are 5 characters, but how would I do this when there are 26 letters and a lot more pixels overlapping? For example, let's say that after the (3,2) pixel test, it found 10 different characters that contain the pixel (and this was the least from all the buckets). Now I need to find differences from 9 other characters instead of only 1 other character. How would I achieve my goal of getting the least amount of checks as possible, and ensure that I am not running extraneous tests?