Skip to content

January 23, 2009

7

MySQL – optimize your query to be more scalable (Part 1/2)

by Joe Kuan

databaseIntroduction

When we design and develop a database application, sometimes scalability is not considered or not a priority. When the size of data grows significantly, we often increase the hardware as the defacto solution. By understanding your MySQL queries, there may be easy steps to free up some processing power. In this article, I am going to discuss step by step, how to improve the performance of a simple MySQL query with a semi-large data set. The example shows a speedup from over 8 mins to just less than 30 secs.

Query & data

The query involves two relational tables. One table (T1) contains around 4.4 million rows and the other table (T2) has 6.6 million entries. Both tables are MyISAM and the file sizes are 886MB and 705MB respectively. Basically, T1 holds the key reference by T2 and T2 contains multiple entries for T1 with the same key representing the time base entities. For some reason, the following query takes a long time to perform:

SELECT T1.A, T1.B, SUM(T2.C) FROM T1, T2 WHERE (T1.Id = T2.Id) AND (T2.Time >= X AND T2.Time < Y) GROUP BY A, B 
As you can see the above query involves joining the two tables within a certain time range and finally groups the results together.In each test, the query is run for 3 times and recorded each time taken. The timing starts from the second run of the query, not the first run. The reason is that there are other factors can affect the timing (ie takes slightly longer) such as kernel disk read cache, hardware cache, etc. which are beyond our control. In other words, we run the query first time to get the MySQL index & data into the system cache, then start the experiment. 

Experiment 1
Our first choices of indices for T1 and T2 are as follows:

CREATE INDEX Id ON T1 (Id)
CREATE INDEX Id ON T2 (Id)
CREATE INDEX Time ON T2 (Time)

We name this scheme of indices as E1 for performance comparison. The query is running on an idle machine 3 times with FLUSH TABLES.

  E1
Test 1 7 mins 35 secs
Test 2 6 mins 14 secs
Test 3 10 mins 56 secs
Average 8 mins 15 secs

To understand what takes the query so long, EXPLAIN gives the following explanation:

test1

The query was actually performed without using any indices. The (ALL in type column) whole T1 table was scanned and then joined with table T2.Id (ref column) and finally only filtered with the time range condition. This seems a bit strange as I would expect MySQL to resolve the query by filtering T2.Time first as that would remove many entries before the join is started.

After doing some studies on MySQL, I realised that MySQL is preferrable to scan the whole table because in general this gives a better performance. However, in this case the query is binding 4 million rows to another 6 million rows. This can take a lot of processing.

Experiment 2

The next experiment is to make sure MySQL resolves the query using T2.Time first and with the same index scheme as in experiment 1, E1. There are three ways to enforce that:

  1. Covering index – index all the T2 fields appearing in the SELECT statement
  2. LIMIT – query ends with LIMIT
  3. FORCE or USE – force or hint the MySQL to use a particular index

For method 1, in order to use the T2.Time first, we have to create an index (Time, Id, C). There are a few drawbacks:

  • Adding overhead to the INSERT statement
  • Impractical when columns are changed and maintaining the index
  • Impractical if the table has many fields

For method 2, it is out of the question as we need to retrieve all the results

In our case, method 3 compromises our product the least. It only requires the index name within the FROM clause, as long as the conditions inside the WHERE clause are not likely to be changed.

Using EXPLAIN on

SELECT T1.A, T1.B, SUM(T2.C) FROM T1, T2 FORCE INDEX (Time) WHERE (T1.Id = T2.Id) AND (T2.Time >= X AND T2.Time < Y) GROUP BY A, B

it shows

test2

As you can see the FORCE INDEX changes the order of how MySQL resolves the query. This should actually speed up the query significantly in a number of ways

  • Time is indexed in ascending order using B-Tree, so MySQL should quickly locate the entries within the given range
  • MySQL only needs to perform binding with T1 and T2 on a smaller set of entries
  • this also results in MySQL spending less time looking up the data pages (where non-indexed data is stored) once the WHERE is evaluated

We named this scheme as E2. The query runs 3 times starting with FLUSH TABLES each time.

  E1 E2
Test 1 7 mins 35 secs 28 secs
Test 2 6 mins 14 secs 27 secs
Test 3 10 mins 56 secs 27 secs
Average 8 mins 15 secs 27.33 secs

As you can see, this is a huge improvement on performance. However the speedup is entirely reliant on the number of entries covered by the given range. So, the next interesting question will be if I enlarge the range query to cover the entire table in T2 (we name this result as E2-Full), would it take as long as E1?

  E1 E2-Full
Test 1 7 mins 35 secs 2 mins 3 secs
Test 2 6 mins 14 secs 1 min 59 secs
Test 3 10 mins 56 secs 1 min 59 secs
Average 8 mins 15 secs 2 mins .33 secs

The answer is no. Although on the surface both queries seem to process all the entries, the order in which MySQL processes the data makes a significant difference.

In E2-Full, since it processes all the queries first on the Time index (ascending order), the time range condition is resolved instantly. Then all the rows in T2 are joined with T1.Id which is indexed. In contrast, with E1 the whole T1 table is loaded up first, then joins both tables based on Id value. The extra overhead comes from processing the range comparison for each of the joined entry.

In the next part, we will investigate whether we can speedup the query furthermore.

I work for iTrinegy and here are my other technical blogs

Advertisements
7 Comments Post a comment
  1. cevarief
    Jan 23 2009

    Never knew if there’s FORCE INDEX before. Thanks. This article is nice.
    Looking forward the next one :D

    Reply
  2. Inpuplegeapew
    Feb 10 2009

    Hi, cool site, good writing ;)

    Reply
  3. Sunil
    Mar 4 2009

    This is an excellent comment. Extreamly useful for huge databases. Thanks a lot

    Reply
  4. Very good article, thank you.

    Reply
  5. Eva
    Jan 6 2010

    “Covering index – index all the T2 fields appearing in the SELECT statement”

    why is that? you just need to index all the T2 fields in WHERE clause.

    Reply
    • Joe Kuan
      Jan 11 2010

      This has be explored in the Part 2/2 – Experiments 3 and 4. Having covering index of all fields in SELECT instead of just WHERE clause gives marginal better performance. This is because the fields are stored in the index file, not data record file which means no data page look up is necessary.

      Of course, this will make the INSERT statement slightly longer.

      Reply

Trackbacks & Pingbacks

  1. MySQL – optimize your query to be more scalable | silviasoft

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Note: HTML is allowed. Your email address will never be published.

Subscribe to comments

%d bloggers like this: