Performance Enhancement in Full-Text Search Query
Posted
by Calvin Sun
on Oracle Blogs
See other posts from Oracle Blogs
or by Calvin Sun
Published on Sat, 29 Sep 2012 16:00:00 +0000
Indexed on
2012/09/30
15:44 UTC
Read the original article
Hit count: 444
/Performance
Ever since its first release, we are continuing consolidating and developing InnoDB Full-Text Search feature. There is one recent improvement that worth blogging about. It is an effort with MySQL Optimizer team that simplifies some common queries’ Query Plans and dramatically shorted the query time. I will describe the issue, our solution and the end result by some performance numbers to demonstrate our efforts in continuing enhancement the Full-Text Search capability.
The Issue:
As we
had discussed in previous Blogs, InnoDB implements Full-Text index as reversed
auxiliary tables. The query once parsed will be reinterpreted into several
queries into related auxiliary tables and then results are merged and consolidated
to come up with the final result. So at the end of the query, we’ll have all matching records on
hand, sorted by their ranking or by their Doc IDs.
Unfortunately,
MySQL’s optimizer and query processing
had been initially designed for MyISAM Full-Text index, and sometimes did not
fully utilize the complete result package from InnoDB.
Here
are a couple examples:
Case 1:
Query result ordered by Rank with only top N results:
mysql> SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')
AS SCORE FROM articles ORDER BY score DESC
LIMIT 1;
In
this query, user tries to retrieve a single record with highest ranking. It
should have a quick answer once we have all the matching documents on hand,
especially if there are ranked. However, before this change, MySQL would almost
retrieve rankings for almost every row in the table, sort them and them come
with the top rank result. This whole retrieve and sort is quite unnecessary
given the InnoDB already have the answer.
In
a real life case, user could have millions of rows, so in the old scheme,
it would retrieve millions of rows'
ranking and sort them, even if our FTS already found there are two 3 matched rows. Apparently, the million
ranking retrieve is done in vain. In above case, it should just ask for 3
matched rows' ranking, all other rows' ranking are 0. If it want the top
ranking, then it can just get the first record from our already sorted result.
Case 2: Select Count(*) on matching records:
mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body)
AGAINST
('database' IN NATURAL LANGUAGE MODE);
In this case, InnoDB search can find matching rows quickly and will have all
matching rows. However, before our change, in the old scheme, every row in the
table was requested by MySQL one by one, just to check whether its ranking is
larger than 0, and later comes up a count.
In
fact, there is no need for MySQL to fetch all rows, instead InnoDB already had
all the matching records. The only thing need is to call an InnoDB API to retrieve
the count
The
difference can be huge. Following query output shows how big the difference can
be:
mysql>
select count(*) from searchindex_inno where match(si_title, si_text) against
('people')
+----------+
| count(*) |
+----------+
| 666877 |
+----------+
1 row in set (16 min 17.37 sec)
So the query took almost 16 minutes.
Let’s see how
long the InnoDB can come up the result. In InnoDB, you can obtain extra
diagnostic printout by turning on “innodb_ft_enable_diag_print”, this will
print out extra query info:
Error log:
keynr=2,
'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 2 secs: row(s)
666877: error: 10
ft_init()
ft_init_ext()
keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 3 secs: row(s)
666877: error: 10
Output
shows it only took InnoDB only 3 seconds to get the result, while the whole
query took 16 minutes to finish. So large amount of time has been wasted on the
un-needed row fetching.
The
Solution:
The
solution is obvious. MySQL can skip some of its steps, optimize its plan and
obtain useful information directly from InnoDB. Some of savings from doing this
include:
1) Avoid redundant sorting. Since InnoDB already
sorted the result according to ranking. MySQL Query Processing layer does not
need to sort to get top matching results.
2) Avoid row by row fetching to get the matching
count. InnoDB provides all the matching records. All those not in the result
list should all have ranking of 0, and no need to be retrieved. And InnoDB has
a count of total matching records on hand. No need to recount.
3) Covered index scan. InnoDB results always contains
the matching records' Document ID and their ranking. So if only the Document ID
and ranking is needed, there is no need to go to user table to fetch the record
itself.
4) Narrow the search result early, reduce the user
table access. If the user wants to get top N matching records, we do not need to
fetch all matching records from user table. We should be able to first select
TOP N matching DOC IDs, and then only fetch corresponding records with these
Doc IDs.
Performance
Results and comparison with MyISAM
The
result by this change is very obvious. I includes six testing result performed
by Alexander Rubin just to demonstrate how fast the InnoDB query now becomes
when comparing MyISAM Full-Text Search.
These
tests are base on the English Wikipedia data of 5.4 Million rows and
approximately 16G table. The test was performed on a machine with 1 CPU Dual
Core, SSD drive, 8G of RAM and InnoDB_buffer_pool is set to 8 GB.
Table 1: SELECT with LIMIT CLAUSE
mysql> SELECT si_title, match(si_title, si_text) against('family') as rel FROM si WHERE match(si_title, si_text) against('family') ORDER BY rel desc LIMIT 10;
|
InnoDB |
MyISAM |
Times Faster |
Time for the query |
1.63 sec |
3 min 26.31 sec |
127 |
You can
see for this particular query (retrieve top 10 records), InnoDB Full-Text Search
is now approximately 127 times faster than MyISAM.
Table 2: SELECT COUNT QUERY
mysql>select count(*) from si where match(si_title, si_text) against('family‘);
+----------+
| count(*) |
+----------+
| 293955 |
+----------+
|
InnoDB |
MyISAM |
Times Faster |
Time for the query |
1.35 sec |
28 min 59.59 sec |
1289 |
In this
particular case, where there are 293k matching results, InnoDB took only 1.35
second to get all of them, while take MyISAM almost half an hour, that is about
1289 times faster!.
Table 3: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms
mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;
Term |
InnoDB (time to execute) |
MyISAM(time to execute) |
Times Faster |
family |
0.5 sec |
5.05 sec |
10.1 |
family film |
0.95 sec |
25.39 sec |
26.7 |
Pizza restaurant orange county California |
0.93 sec |
32.03 sec |
34.4 |
President united states of America |
2.5 sec |
36.98 sec |
14.8 |
Table 4: SELECT title and text with ORDER BY and LIMIT CLAUSE for selected terms
mysql> SELECT <ID>, si_title, si_text, ... as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;
Term |
InnoDB (time to execute) |
MyISAM(time to execute) |
Times Faster |
family |
0.61 sec |
41.65 sec |
68.3 |
family film |
1.15 sec |
47.17 sec |
41.0 |
Pizza restaurant orange county california |
1.03 sec |
48.2 sec |
46.8 |
President united states of america |
2.49 sec |
44.61 sec |
17.9 |
Table 5: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms
mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;
Term |
InnoDB (time to execute) |
MyISAM(time to execute) |
Times Faster |
family |
0.5 sec |
5.05 sec |
10.1 |
family film |
0.95 sec |
25.39 sec |
26.7 |
Pizza restaurant orange county califormia |
0.93 sec |
32.03 sec |
34.4 |
President united states of america |
2.5 sec |
36.98 sec |
14.8 |
Table 6: SELECT COUNT(*)
mysql> SELECT count(*) FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) LIMIT 10;
Term |
InnoDB (time to execute) |
MyISAM(time to execute) |
Times Faster |
family |
0.47 sec |
82 sec |
174.5 |
family film |
0.83 sec |
131 sec |
157.8 |
Pizza restaurant orange county califormia |
0.74 sec |
106 sec |
143.2 |
President united states of america |
1.96 sec |
220 sec |
112.2 |
Again,
table 3 to table 6 all showing InnoDB consistently outperform MyISAM in these
queries by a large margin. It becomes obvious the InnoDB has great advantage
over MyISAM in handling large data search.
Summary:
These results demonstrate the great performance we could achieve by making MySQL optimizer and InnoDB Full-Text Search more tightly coupled. I think there are still many cases that InnoDB’s result info have not been fully taken advantage of, which means we still have great room to improve. And we will continuously explore the area, and get more dramatic results for InnoDB full-text searches.
Jimmy Yang, September 29, 2012
© Oracle Blogs or respective owner