Storing millions of URLs in a database for fast pattern matching

Posted by Paras Chopra on Stack Overflow
Published on 2010-06-05T18:23:17Z; indexed on 2010/06/06 2:32 UTC

I am developing a web-analytics-style system that needs to log the referring URL, landing-page URL, and search keywords for every visitor to a website. With this collected data, I want to allow end-users to run queries such as "Show me all visitors who came from Bing.com searching for a phrase that contains 'red shoes'" or "Show me all visitors who landed on a URL that contained 'campaign=twitter_ad'", etc.
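To make such queries cheap, one idea I'm considering (a hedged sketch, not a settled design; all field names are my own) is to parse each URL into structured components at log time, so a filter like "campaign=twitter_ad" becomes a lookup on an indexed field rather than a substring scan over raw URL strings:

```python
# Sketch: decompose URLs into queryable parts at log time.
# The dict keys here are hypothetical column names, not an existing schema.
from urllib.parse import urlparse, parse_qs

def parse_log_entry(referrer_url, landing_url):
    ref = urlparse(referrer_url)
    land = urlparse(landing_url)
    return {
        "ref_host": ref.netloc.lower(),          # e.g. "www.bing.com"
        "ref_query": parse_qs(ref.query),        # search keywords usually live here
        "landing_path": land.path,
        "landing_params": parse_qs(land.query),  # e.g. {"campaign": ["twitter_ad"]}
    }

entry = parse_log_entry(
    "https://www.bing.com/search?q=red+shoes",
    "https://example.com/landing?campaign=twitter_ad",
)
# entry["ref_query"]["q"] -> ["red shoes"]
# entry["landing_params"]["campaign"] -> ["twitter_ad"]
```

Each of those parts could then be stored in its own indexed column, so both example queries above reduce to equality or prefix matches.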

Because this system will be used on many large websites, the amount of data that needs to be logged will grow really, really fast. So, my questions: a) what would be the best logging strategy so that scaling the system doesn't become a pain; and b) how can that architecture support rapid querying of arbitrary requests? Is there a special method of storing URLs that makes querying them faster?
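On the "special method of storing URLs" point, one trick I've seen suggested (my assumption that it applies here) is storing the host name with its labels reversed, so that "Bing and all its subdomains" becomes an index-friendly prefix match instead of a leading-wildcard LIKE that forces a full scan. A minimal sketch using sqlite3 just to illustrate the idea:

```python
# Sketch: reversed-host storage for index-friendly domain queries.
# Table and column names are hypothetical.
import sqlite3

def reverse_host(host):
    # "www.bing.com" -> "com.bing.www"
    return ".".join(reversed(host.split(".")))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (id INTEGER PRIMARY KEY, ref_host_rev TEXT)")
db.execute("CREATE INDEX idx_host ON visits (ref_host_rev)")
for host in ["www.bing.com", "bing.com", "google.com"]:
    db.execute("INSERT INTO visits (ref_host_rev) VALUES (?)",
               (reverse_host(host),))

# A prefix LIKE on the reversed host can use the index;
# 'com.bing%' matches bing.com and its subdomains here.
count = db.execute(
    "SELECT COUNT(*) FROM visits WHERE ref_host_rev LIKE 'com.bing%'"
).fetchone()[0]
```

The same reversal idea should carry over to MySQL with a B-tree index on the reversed-host column; a stricter pattern (e.g. matching `com.bing` exactly or `com.bing.` as a prefix) would avoid accidental matches on hosts that merely start with "bing".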

In addition to the MySQL database I currently use, I am exploring (and open to) other alternatives better suited to this task.

© Stack Overflow or respective owner
