Indexing data from multiple tables with Oracle Text
- by Roger Ford
It's well known that Oracle Text indexes perform best when all the data to be indexed is combined into a single index.
The query
select * from mytable
where contains (title, 'dog') 0
or contains (body, 'cat') 0
will tend to perform much worse than
select * from mytable
where contains (text, 'dog WITHIN title OR cat WITHIN body') 0
For this reason, Oracle Text provides the MULTI_COLUMN_DATASTORE which will combine data from multiple columns into a single index. Effectively, it constructs a "virtual document" at indexing time, which might look something like:
<title>the big dog</title>
<body>the ginger cat smiles</body>
This virtual document can be indexed using either AUTO_SECTION_GROUP, or by explicitly defining sections for title and body, allowing the query as expressed above. Note that we've used a column called "text" - this might have been a dummy column added to the table simply to allow us to create an index on it - or we could created the index on either of the "real" columns - title or body.
It should be noted that MULTI_COLUMN_DATASTORE doesn't automatically handle updates to columns used by it - if you create the index on the column text, but specify that columns title and body are to be indexed, you will need to arrange triggers such that the text column is updated whenever title or body are altered.
That works fine for single tables. But what if we actually want to combine data from multiple tables?
In that case there are two approaches which work well:
Create a real table which contains a summary of the information, and create the index on that using the MULTI_COLUMN_DATASTORE. This is simple, and effective, but it does use a lot of disk space as the information to be indexed has to be duplicated.
Create our own "virtual" documents using the USER_DATASTORE. The user datastore allows us to specify a PL/SQL procedure which will be used to fetch the data to be indexed, returned in a CLOB, or occasionally in a BLOB or VARCHAR2. This PL/SQL procedure is called once for each row in the table to be indexed, and is passed the ROWID value of the current row being indexed. The actual contents of the procedure is entirely up to the owner, but it is normal to fetch data from one or more columns from database tables.
In both cases, we still need to take care of updates - making sure that we have all the triggers necessary to update the indexed column (and, in case 1, the summary table) whenever any of the data to be indexed gets changed.
I've written full examples of both these techniques, as SQL scripts to be run in the SQL*Plus tool. You will need to run them as a user who has CTXAPP role and CREATE DIRECTORY privilege.
Part of the data to be indexed is a Microsoft Word file called "1.doc". You should create this file in Word, preferably containing the single line of text: "test document". This file can be saved anywhere, but the SQL scripts need to be changed so that the "create or replace directory" command refers to the right location. In the example, I've used C:\doc.
multi_table_indexing_1.sql : creates a summary table containing all the data, and uses multi_column_datastore Download link / View in browser
multi_table_indexing_2.sql : creates "virtual" documents using a procedure as a user_datastore Download link / View in browser