MySQL – optimize your query to be more scalable (Part 2/2)
The next experiment is to discover whether a composite (multi-column) index can speed up the query. The following indexes are created:
CREATE INDEX Id on T1 (Id);
CREATE INDEX Time_Id on T2 (Time, Id);
Note: The order of the columns in a multi-column index affects how the query can use it. See the MySQL documentation for more details.
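The reason column order matters is the leftmost-prefix rule: MySQL can use a multi-column index only for conditions on a leading prefix of its columns. A quick sketch (the literal values here are placeholders, not values from our benchmark):

```sql
-- Given CREATE INDEX Time_Id ON T2 (Time, Id), MySQL can use the index for:
SELECT * FROM T2 WHERE Time = '2011-01-01';               -- leftmost column
SELECT * FROM T2 WHERE Time = '2011-01-01' AND Id = 42;   -- both columns

-- ...but generally NOT for a condition on Id alone,
-- because it skips the leading column:
SELECT * FROM T2 WHERE Id = 42;                           -- index not usable
```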
We name this index scheme E3 and use FORCE INDEX on Time_Id. As before, the query is run 3 times, with FLUSH TABLES between runs.
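For reference, the benchmarked query (described in Part 1) has roughly the following shape; the selected columns and the literal range bounds are placeholders, not the exact values used in the experiments:

```sql
-- Hypothetical shape of the benchmark query:
SELECT T1.A, T1.B, T2.C
FROM T2 FORCE INDEX (Time_Id)        -- pin the optimizer to the new index
JOIN T1 ON T1.Id = T2.Id
WHERE T2.Time BETWEEN '2011-01-01' AND '2011-06-30';
```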
| Run | E3 |
|---|---|
| Test 1 | 29 secs |
| Test 2 | 31 secs |
| Test 3 | 32 secs |
In fact, this is fractionally slower. The likely reason is that the index on the T2 table becomes slightly larger, so it takes slightly longer for MySQL to load it.
Another interesting question is how E3-Full (the E3 index scheme with the range condition covering the full T2 table) compares to E2-Full:
| Run | E2-Full | E3-Full |
|---|---|---|
| Test 1 | 2 mins 3 secs | 1 min 24 secs |
| Test 2 | 1 min 59 secs | 1 min 44 secs |
| Test 3 | 1 min 59 secs | 1 min 36 secs |
| Average | 2 mins 0.33 secs | 1 min 34.67 secs |
Interestingly, the multi-column index does further improve performance on larger-scale queries.
The last experiment uses full covering indexes on both tables and observes whether there is any significant improvement.
CREATE INDEX Id_A_B on T1 (Id, A, B);
CREATE INDEX Time_Id_C on T2 (Time, Id, C);
We name this index scheme E4 and use FORCE INDEX on Time_Id_C. As before, the query is run 3 times, with FLUSH TABLES between runs.
| Run | E4 |
|---|---|
| Test 1 | 27 secs |
| Test 2 | 26 secs |
| Test 3 | 26 secs |
Having covering indexes on both sides seems to have little impact on the performance of the sub-range query. On the other hand, there is a noticeable improvement with the full range query, E4-Full. This may indicate that the sub-range used in the experiments (covering 3.9 million rows, 59% of the T2 table) wasn't substantial enough to show the benefit of multi-column indexing. Nonetheless, there seems to be potential for a much larger database table. Perhaps this could be our next investigation in a future blog post, i.e. varying the ratio of the sub-range with covering indexes on a much larger table.
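A quick way to confirm a covering index is actually being used is EXPLAIN: when the Extra column shows "Using index", MySQL answers the query from the index alone, without touching the table rows. A sketch, with placeholder range bounds:

```sql
EXPLAIN
SELECT Id, C
FROM T2 FORCE INDEX (Time_Id_C)
WHERE Time BETWEEN '2011-01-01' AND '2011-06-30';
-- Look for "Using index" in the Extra column of the EXPLAIN output.
```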
| Run | E2-Full | E3-Full | E4-Full |
|---|---|---|---|
| Test 1 | 2 mins 3 secs | 1 min 24 secs | 43 secs |
| Test 2 | 1 min 59 secs | 1 min 44 secs | 44 secs |
| Test 3 | 1 min 59 secs | 1 min 36 secs | 44 secs |
| Average | 2 mins 0.33 secs | 1 min 34.67 secs | 43.67 secs |
Why is range data important for a large table?
Having some sort of range data is particularly useful for queries on large tables, because it:
- breaks the query up so it can be resolved in smaller chunks. Articles on the internet show that overall performance is better when a portion (say, 10%) of a large table is processed step by step, instead of processing the whole table in one go.
- gives better responsiveness between multiple client connections accessing the same table. For example, one client requests a read lock while another client runs a query over the whole giant table.
- avoids the deadlock of the “Copying to tmp table” bug. The bug was first reported in Oct 2005 and still hasn’t been fixed! See bug 14070. We came across this issue on a large-table query with GROUP BY. However, when the same query was broken into portions by adding a range condition, the deadlock was averted.
- integrates better with a progress bar, since it can tell users how far the long query has progressed.
- uses much less memory and less overall CPU.
- takes full advantage of multi-core processors. You can multi-thread your application and run each thread with its own range query.
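The chunking idea from the bullets above can be sketched directly in SQL: instead of one query over the whole table, issue a series of queries over consecutive sub-ranges. The month-sized chunks and column names here are purely illustrative:

```sql
-- One big query over the whole table:
--   SELECT T1.A, T2.C FROM T2 JOIN T1 ON T1.Id = T2.Id;

-- The same work split into range chunks, run one after another
-- (or one per thread on a multi-core machine):
SELECT T1.A, T2.C FROM T2 JOIN T1 ON T1.Id = T2.Id
 WHERE T2.Time >= '2011-01-01' AND T2.Time < '2011-02-01';

SELECT T1.A, T2.C FROM T2 JOIN T1 ON T1.Id = T2.Id
 WHERE T2.Time >= '2011-02-01' AND T2.Time < '2011-03-01';
-- ...and so on, until the whole Time range is covered.
```

Each chunk holds its locks and temporary tables only briefly, which is what drives the responsiveness and memory benefits listed above.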
What if the table doesn’t have the range data?
Easy: make one. It is pretty common for MySQL tables to have an Id column with AUTO_INCREMENT. The range can then be obtained from MIN(Id) and MAX(Id).
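A sketch of deriving range chunks from an AUTO_INCREMENT Id (the chunk size of 100000 is an arbitrary choice for illustration):

```sql
-- Find the bounds of the synthetic range:
SELECT MIN(Id), MAX(Id) FROM T1;

-- Then iterate over fixed-size Id chunks, e.g. 100000 rows at a time:
SELECT A, B FROM T1 WHERE Id >= 1      AND Id < 100001;
SELECT A, B FROM T1 WHERE Id >= 100001 AND Id < 200001;
-- ...continue from the application until MAX(Id) is reached.
```

Note that if rows have been deleted, some chunks will return fewer rows than the chunk size; the bounds still partition the table correctly.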
In this article, we showed that we can’t always assume MySQL resolves a query with the optimal approach, especially for large tables. Even for a query with a range condition, it is important to check with EXPLAIN first and to run benchmarks. A significant performance improvement may be achieved by slightly changing the query.
Update: added two more bullet points (5 and 6) under ‘Why is range data important for a large table?’.