
January 23, 2009


MySQL – optimize your query to be more scalable (Part 2/2)

by Joe Kuan

Experiment 3

The next experiment is to discover whether a composite (multiple-column) index can speed up the query. The following indexes are created:

CREATE INDEX Id on T1 (Id);
CREATE INDEX Time_Id on T2 (Time, Id);

Note: The order of the columns in a multiple-column index affects how the query can use it. See the MySQL documentation for more details.
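
As a rough illustration (the WHERE conditions and values below are made up, and the Time column is assumed to be a date/time type), the Time_Id index can serve conditions on Time, or on Time and Id together, but not on Id alone:

SELECT C FROM T2 WHERE Time >= '2008-06-01';               -- can use Time_Id
SELECT C FROM T2 WHERE Time >= '2008-06-01' AND Id = 42;   -- can use Time_Id
SELECT C FROM T2 WHERE Id = 42;                            -- cannot use Time_Id, as Time is the leading column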

We name this index scheme E3 and use FORCE INDEX on Time_Id. As before, the query is run 3 times with FLUSH TABLES.
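
For reference, the benchmark query is the one defined in Part 1; with FORCE INDEX it takes roughly the following shape (the column list, the join condition and the Time boundaries here are assumptions for illustration only):

SELECT T1.A, T1.B, T2.C
FROM T2 FORCE INDEX (Time_Id)
JOIN T1 ON T1.Id = T2.Id
WHERE T2.Time BETWEEN '2008-01-01' AND '2008-06-30';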

  E3
Test 1 29 secs
Test 2 31 secs
Test 3 32 secs
Average 30.66 secs

In fact, this is fractionally slower. The reason may be that the index on the T2 table becomes slightly larger, so it takes slightly longer for MySQL to load the index.

Another interesting question is how E3-Full (the E3 index scheme with the range query covering the full T2 table) compares with E2-Full.

          E2-Full            E3-Full
Test 1    2 mins 3 secs      1 min 24 secs
Test 2    1 min 59 secs      1 min 44 secs
Test 3    1 min 59 secs      1 min 36 secs
Average   2 mins 0.33 secs   1 min 34 secs

Interestingly, the multiple-column index does further improve performance when the query covers a larger range.

Experiment 4

The last experiment uses full covering indexes on both tables and observes whether there is any significant improvement.

CREATE INDEX Id_A_B on T1 (Id, A, B);
CREATE INDEX Time_Id_C on T2 (Time, Id, C);

We name this index scheme E4 and use FORCE INDEX on Time_Id_C. As before, the query is run 3 times with FLUSH TABLES.

  E4
Test 1 27 secs
Test 2 26 secs
Test 3 26 secs
Average 26.33 secs

Having covering indexes on both sides seems to have only a small impact on the sub-range query. On the other hand, there is a noticeable improvement with the full-range query, E4-Full. This may indicate that the sub-range used for the experiments (3.9 million rows, 59% of the T2 table) is not substantial enough to show the benefit of multiple-column indexing. Nonetheless, there seems to be potential for a much larger table. Perhaps this will be our next investigation in a future blog, i.e. varying the ratio of the sub-range with covering indexes on a much larger table.

          E2-Full            E3-Full          E4-Full
Test 1    2 mins 3 secs      1 min 24 secs    43 secs
Test 2    1 min 59 secs      1 min 44 secs    44 secs
Test 3    1 min 59 secs      1 min 36 secs    44 secs
Average   2 mins 0.33 secs   1 min 34 secs    43.66 secs

Why is range data important for a large table?

Having some sort of range data is particularly useful for queries on large tables, because it:

  1. breaks the query up so it can be resolved in smaller chunks (see the sketch after this list). There are articles on the internet showing that overall performance is better when a large table is processed a portion at a time (say 10% of the data) instead of the whole table in one go.
  2. gives better responsiveness when multiple client connections access the same table, for example when one client requests a read lock while another runs a query over the whole giant table.
  3. avoids the deadlock of the “Copying to tmp table” bug. The bug was first reported in Oct 2005 and it still hasn’t been fixed! See bug 14070. We came across this issue on a large-table query with GROUP BY; however, when the same query is broken down into portions by adding a range condition, the deadlock is averted.
  4. integrates better with a progress bar, since it can tell users how far a long query has progressed.
  5. uses much less memory and less CPU overall.
  6. takes full advantage of multi-core processors: you can multi-thread your application and run each thread with its own range query.
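
As a rough sketch of point 1 (the slice boundaries, the step size and the column list are made up; the real query and schema are those from Part 1), the idea is simply to run the same query once per slice of the range:

-- Process T2 one slice of the Time range at a time instead of the whole range in one go.
SELECT T1.A, T1.B, T2.C
FROM T2 JOIN T1 ON T1.Id = T2.Id
WHERE T2.Time >= '2008-01-01' AND T2.Time < '2008-02-01';

-- ... then repeat with the next slice:
SELECT T1.A, T1.B, T2.C
FROM T2 JOIN T1 ON T1.Id = T2.Id
WHERE T2.Time >= '2008-02-01' AND T2.Time < '2008-03-01';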

What if the table doesn’t have the range data?

Easy, make one. It is pretty common to have an Id column with AUTO_INCREMENT in MySQL tables. The range can then be acquired by MIN(Id) and MAX(Id).
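
For instance (a sketch only; the step size and the selected columns are arbitrary):

-- Find the Id range of the table.
SELECT MIN(Id), MAX(Id) FROM T1;

-- Then let the application walk that range in fixed-size steps, e.g. 100,000 Ids per query:
SELECT A, B FROM T1 WHERE Id >= 1 AND Id < 100001;
SELECT A, B FROM T1 WHERE Id >= 100001 AND Id < 200001;
-- ... and so on until MAX(Id) is covered.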

Conclusion

In this article, we showed that we can’t always assume MySQL resolves a query in the optimal way, especially for large tables. Even with a range condition in the query, it is important to check with EXPLAIN first and run benchmark experiments. A significant performance improvement may be achieved by slightly changing the query.
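
For example, a quick check along these lines (the query itself is the same hypothetical sketch as above) shows which index MySQL picks and whether a covering index is being used:

EXPLAIN
SELECT T1.A, T1.B, T2.C
FROM T2 FORCE INDEX (Time_Id_C)
JOIN T1 ON T1.Id = T2.Id
WHERE T2.Time BETWEEN '2008-01-01' AND '2008-06-30';
-- Check the key and rows columns, and look for "Using index" in Extra,
-- which indicates the query is answered from the covering index alone.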

I work for iTrinegy and here are my other technical blogs

UPDATED 14/Mar/2009:

    Added two more bullet points (5 and 6) on ‘Why range data is …’

4 Comments
  1. Sunil
    Mar 4 2009

    Thanks for this very important comment. The question is, though, what if one of the tables has a BLOB column storing 20 to 25 KB of data per row?
  2. Joe Kuan
    Mar 7 2009

    I haven’t really used BLOB objects before, so I can’t comment on their performance. I guess it depends on what you are doing with the BLOB objects. Certainly, you don’t want to compare them directly. If you need to compare BLOBs in the query, you should hash the BLOB and index the hash – that keeps the index file a lot smaller to load. If not, then they shouldn’t affect performance, because they are stored in the data pages instead and are only loaded once the query is resolved. So if the result set is big, a faster disk might help a little bit, but not a lot.
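
    A minimal sketch of that idea, assuming a hypothetical table T3 with a BLOB column Data (none of this appears in the tables above):

    ALTER TABLE T3 ADD COLUMN DataHash CHAR(32);
    UPDATE T3 SET DataHash = MD5(Data);
    CREATE INDEX DataHash_idx ON T3 (DataHash);
    -- Compare or join on DataHash instead of the BLOB itself;
    -- only the 32-character hash goes into the index, keeping it small.
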
    Hope this helps

  3. lokesh
    Mar 12 2009

    It helped me a lot… Thanks

  4. Nithin
    Nov 12 2009

    Real nice article and very well explained

