Getting started with character and text processing (encoding, regular expressions)
- by TK
I'd like to learn foundations of encodings, characters and text. Understanding these is important for dealing with a large set of text whether that are log files or text source for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I don't say I need to learn about advanced topics right away. But I need to know:
Bit and bytes level knowledge of encodings.
Characters and alphabets not used in English.
Multi-byte encodings. (I understand some Chinese and Japanese. And parsing them is important.)
Regular expressions.
Algorithm for text processing.
Parsing natural languages.
I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time web) needs processing, parsing and analyzing large text.
I'm looking for some resources (maybe books?) that get me started with some of the bullets. (I find many helpful discussion on regular expressions here on Stack Overflow. So, you don't need to suggest resources on that topic.)