Storing millions of URLs in a database for fast pattern matching
- by Paras Chopra
I am developing a web analytics kind of system which needs to log referring URL, landing page URL and search keywords for every visitor on the website. What I want to do with this collected data is to allow end-user to query the data such as "Show me all visitors who came from Bing.com searching for phrase that contains 'red shoes'" or "Show me all visitors who landed on URL that contained 'campaign=twitter_ad'", etc.
Because this system will be used on many big websites, the amount of data that needs to log will grow really, really fast. So, my question: a) what would be the best strategy for logging so that scaling the system doesn't become a pain; b) how to use that architecture for rapid querying of arbitrary requests? Is there a special method of storing URLs so that querying them gets faster?
In addition to MySQL database that I use, I am exploring (and open to) other alternatives better suited for this task.