Performance Enhancement in Full-Text Search Query
- by Calvin Sun
Ever
since its first release, we are continuing consolidating and developing InnoDB Full-Text
Search feature. There is one recent improvement that worth blogging about. It
is an effort with MySQL Optimizer team that simplifies some common queries’ Query Plans and dramatically
shorted the query time. I will describe the issue, our solution and the end
result by some performance numbers to demonstrate our efforts in continuing
enhancement the Full-Text Search capability.
The Issue:
As we
had discussed in previous Blogs, InnoDB implements Full-Text index as reversed
auxiliary tables. The query once parsed will be reinterpreted into several
queries into related auxiliary tables and then results are merged and consolidated
to come up with the final result. So at the end of the query, we’ll have all matching records on
hand, sorted by their ranking or by their Doc IDs.
Unfortunately,
MySQL’s optimizer and query processing
had been initially designed for MyISAM Full-Text index, and sometimes did not
fully utilize the complete result package from InnoDB.
Here
are a couple examples:
Case 1:
Query result ordered by Rank with only top N results:
mysql>
SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')
AS SCORE FROM articles ORDER BY score DESC
LIMIT 1;
In
this query, user tries to retrieve a single record with highest ranking. It
should have a quick answer once we have all the matching documents on hand,
especially if there are ranked. However, before this change, MySQL would almost
retrieve rankings for almost every row in the table, sort them and them come
with the top rank result. This whole retrieve and sort is quite unnecessary
given the InnoDB already have the answer.
In
a real life case, user could have millions of rows, so in the old scheme,
it would retrieve millions of rows'
ranking and sort them, even if our FTS already found there are two 3 matched rows. Apparently, the million
ranking retrieve is done in vain. In above case, it should just ask for 3
matched rows' ranking, all other rows' ranking are 0. If it want the top
ranking, then it can just get the first record from our already sorted result.
Case
2: Select Count(*) on matching records:
mysql>
SELECT COUNT(*) FROM articles WHERE MATCH (title,body)
AGAINST
('database' IN NATURAL LANGUAGE MODE);
In this case, InnoDB search can find matching rows quickly and will have all
matching rows. However, before our change, in the old scheme, every row in the
table was requested by MySQL one by one, just to check whether its ranking is
larger than 0, and later comes up a count.
In
fact, there is no need for MySQL to fetch all rows, instead InnoDB already had
all the matching records. The only thing need is to call an InnoDB API to retrieve
the count
The
difference can be huge. Following query output shows how big the difference can
be:
mysql>
select count(*) from searchindex_inno where match(si_title, si_text) against
('people') +----------+
| count(*) | +----------+ | 666877 | +----------+ 1 row in set (16 min 17.37 sec)
So the query took almost 16 minutes.
Let’s see how
long the InnoDB can come up the result. In InnoDB, you can obtain extra
diagnostic printout by turning on “innodb_ft_enable_diag_print”, this will
print out extra query info:
Error log: keynr=2,
'people' NL search Total docs: 10954826 Total words: 0 UNION: Searching: 'people' Processing time: 2 secs: row(s)
666877: error: 10 ft_init() ft_init_ext() keynr=2, 'people' NL search Total docs: 10954826 Total words: 0 UNION: Searching: 'people' Processing time: 3 secs: row(s)
666877: error: 10
Output
shows it only took InnoDB only 3 seconds to get the result, while the whole
query took 16 minutes to finish. So large amount of time has been wasted on the
un-needed row fetching.
The
Solution:
The
solution is obvious. MySQL can skip some of its steps, optimize its plan and
obtain useful information directly from InnoDB. Some of savings from doing this
include:
1) Avoid redundant sorting. Since InnoDB already
sorted the result according to ranking. MySQL Query Processing layer does not
need to sort to get top matching results.
2) Avoid row by row fetching to get the matching
count. InnoDB provides all the matching records. All those not in the result
list should all have ranking of 0, and no need to be retrieved. And InnoDB has
a count of total matching records on hand. No need to recount.
3) Covered index scan. InnoDB results always contains
the matching records' Document ID and their ranking. So if only the Document ID
and ranking is needed, there is no need to go to user table to fetch the record
itself.
4) Narrow the search result early, reduce the user
table access. If the user wants to get top N matching records, we do not need to
fetch all matching records from user table. We should be able to first select
TOP N matching DOC IDs, and then only fetch corresponding records with these
Doc IDs.
Performance
Results and comparison with MyISAM
The
result by this change is very obvious. I includes six testing result performed
by Alexander Rubin just to demonstrate how fast the InnoDB query now becomes
when comparing MyISAM Full-Text Search.
These
tests are base on the English Wikipedia data of 5.4 Million rows and
approximately 16G table. The test was performed on a machine with 1 CPU Dual
Core, SSD drive, 8G of RAM and InnoDB_buffer_pool is set to 8 GB.
Table 1: SELECT with LIMIT CLAUSE
mysql>
SELECT si_title, match(si_title, si_text) against('family') as rel FROM si WHERE
match(si_title, si_text) against('family') ORDER BY rel desc LIMIT 10;
InnoDB
MyISAM
Times Faster
Time for the query
1.63 sec
3 min 26.31 sec
127
You can
see for this particular query (retrieve top 10 records), InnoDB Full-Text Search
is now approximately 127 times faster than MyISAM.
Table 2: SELECT COUNT QUERY
mysql>select count(*) from si
where match(si_title, si_text) against('family‘);
+----------+
| count(*) |
+----------+
| 293955 |
+----------+
InnoDB
MyISAM
Times Faster
Time
for the query
1.35
sec
28
min 59.59 sec
1289
In this
particular case, where there are 293k matching results, InnoDB took only 1.35
second to get all of them, while take MyISAM almost half an hour, that is about
1289 times faster!.
Table 3: SELECT ID with ORDER BY
and LIMIT CLAUSE for selected terms
mysql>
SELECT <ID>, match(si_title, si_text) against(<TERM>)
as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>)
ORDER BY rel desc LIMIT 10;
Term
InnoDB (time to execute)
MyISAM(time to execute)
Times Faster
family
0.5 sec
5.05 sec
10.1
family film
0.95 sec
25.39 sec
26.7
Pizza restaurant orange county
California
0.93 sec
32.03 sec
34.4
President united states of
America
2.5 sec
36.98 sec
14.8
Table 4: SELECT title and text
with ORDER BY and LIMIT CLAUSE for selected terms
mysql>
SELECT <ID>, si_title, si_text, ... as rel FROM si_<TB> WHERE
match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;
Term
InnoDB (time to execute)
MyISAM(time to execute)
Times Faster
family
0.61 sec
41.65 sec
68.3
family film
1.15 sec
47.17 sec
41.0
Pizza restaurant orange county
california
1.03 sec
48.2 sec
46.8
President united states of
america
2.49 sec
44.61 sec
17.9
Table 5: SELECT ID with ORDER BY
and LIMIT CLAUSE for selected terms
mysql>
SELECT <ID>, match(si_title, si_text) against(<TERM>)
as rel FROM si_<TB> WHERE match(si_title, si_text) against
(<TERM>) ORDER BY rel desc LIMIT 10;
Term
InnoDB (time to execute)
MyISAM(time to execute)
Times Faster
family
0.5 sec
5.05 sec
10.1
family film
0.95 sec
25.39 sec
26.7
Pizza restaurant orange county
califormia
0.93 sec
32.03 sec
34.4
President united states of
america
2.5 sec
36.98 sec
14.8
Table 6: SELECT COUNT(*)
mysql>
SELECT
count(*) FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>)
LIMIT 10;
Term
InnoDB (time to execute)
MyISAM(time to execute)
Times Faster
family
0.47 sec
82 sec
174.5
family film
0.83 sec
131 sec
157.8
Pizza restaurant orange county
califormia
0.74 sec
106 sec
143.2
President united states of
america
1.96 sec
220 sec
112.2
Again,
table 3 to table 6 all showing InnoDB consistently outperform MyISAM in these
queries by a large margin. It becomes obvious the InnoDB has great advantage
over MyISAM in handling large data search.
Summary:
These
results demonstrate the great performance we could achieve by making MySQL
optimizer and InnoDB Full-Text Search more tightly coupled. I think there are
still many cases that InnoDB’s result info have not been fully taken advantage
of, which means we still have great room to improve. And we will continuously
explore the area, and get more dramatic results for InnoDB full-text
searches.
Jimmy Yang, September 29, 2012