Oracle Text query parser

Posted by Roger Ford on Oracle Blogs See other posts from Oracle Blogs or by Roger Ford
Published on Wed, 11 Jul 2012 12:55:29 +0000 Indexed on 2012/07/11 15:20 UTC
Read the original article Hit count: 557

Filed under:
Oracle Text provides a rich query syntax which enables powerful text searches.

However, this syntax isn't intended for use by inexperienced end-users.  If you provide a simple search box in your application, you probably want users to be able to type "Google-like" searches into the box, and have your application convert that into something that Oracle Text understands.

For example if your user types "windows nt networking" then you probably want to convert this into something like
"windows ACCUM nt ACCUM networking".  But beware - "NT" is a reserved word, and needs to be escaped.  So let's escape all words:
"{windows} ACCUM {nt} ACCUM {networking}".  That's fine - until you start introducing wild cards. Then you must escape only non-wildcarded searches:
"win% ACCUM {nt} ACCUM {networking}".  There are quite a few other "gotchas" that you might encounter along the way.

Then there's the issue of scoring.  Given a query for "oracle text query syntax", it would be nice if we could score a full phrase match higher than a hit where all four words are present but not in a phrase.  And then perhaps lower than that would be a document where three of the four terms are present.  Progressive relaxation helps you with this, but you need to code the "progression" yourself in most cases.

To help with this, I've developed a query parser which will take queries in Google-like syntax, and convert them into Oracle Text queries. It's designed to be as flexible as possible, and will generate either simple queries or progressive relaxation queries. The input string will typically just be a string of words, such as "oracle text query syntax" but the grammar does allow for more complex expressions:

  word : score will be improved if word exists
  +word : word must exist
  -word : word CANNOT exist
  "phrase words" : words treated as phrase (may be preceded by + or -)
  field:(expression) : find expression (which allows +,- and phrase as above) within "field".

So for example if I searched for
  +"oracle text" query +syntax -ctxcat
Then the results would have to contain the phrase "oracle text" and the word syntax. Any documents mentioning ctxcat would be excluded from the results.

All the instructions are in the top of the file (see "Downloads" at the bottom of this blog entry).  Please download the file, read the instructions, then try it out by running "parser.pls" in either SQL*Plus or SQL Developer.

I am also uploading a test file "test.sql". You can run this and/or modify it to run your own tests or run against your own text index. test.sql is designed to be run from SQL*Plus and may not produce useful output in SQL Developer (or it may, I haven't tried it).

I'm putting the code up here for testing and comments. I don't consider it "production ready" at this point, but would welcome feedback.  I'm particularly interested in comments such as
  • "The instructions are unclear - I couldn't figure out how to do XXX"
  • "It didn't work in my environment" (please provide as many details as possible)
  • "We can't use it in our application" (why not?)
  • "It needs to support XXX feature"
  • "It produced an invalid query output when I fed in XXXX"

Downloads:

parser.pls
test.sql


© Oracle Blogs or respective owner

Related posts about /Oracle