MySQL – optimize your query to be more scalable (Part 1/2)
When we design and develop a database application, sometimes scalability is not considered or not a priority. When the size of data grows significantly, we often increase the hardware as the defacto solution. By understanding your MySQL queries, there may be easy steps to free up some processing power. In this article, I am going to discuss step by step, how to improve the performance of a simple MySQL query with a semi-large data set. The example shows a speedup from over 8 mins to just less than 30 secs.
Query & data
The query involves two relational tables. One table (T1) contains around 4.4 million rows and the other table (T2) has 6.6 million entries. Both tables are MyISAM and the file sizes are 886MB and 705MB respectively. Basically, T1 holds the key reference by T2 and T2 contains multiple entries for T1 with the same key representing the time base entities. For some reason, the following query takes a long time to perform:
SELECT T1.A, T1.B, SUM(T2.C) FROM T1, T2 WHERE (T1.Id = T2.Id) AND (T2.Time >= X AND T2.Time < Y) GROUP BY A, B
As you can see the above query involves joining the two tables within a certain time range and finally groups the results together.In each test, the query is run for 3 times and recorded each time taken. The timing starts from the second run of the query, not the first run. The reason is that there are other factors can affect the timing (ie takes slightly longer) such as kernel disk read cache, hardware cache, etc. which are beyond our control. In other words, we run the query first time to get the MySQL index & data into the system cache, then start the experiment.
Our first choices of indices for T1 and T2 are as follows:
CREATE INDEX Id ON T1 (Id)
CREATE INDEX Id ON T2 (Id)
CREATE INDEX Time ON T2 (Time)
We name this scheme of indices as E1 for performance comparison. The query is running on an idle machine 3 times with FLUSH TABLES.
|Test 1||7 mins 35 secs|
|Test 2||6 mins 14 secs|
|Test 3||10 mins 56 secs|
|Average||8 mins 15 secs|
To understand what takes the query so long, EXPLAIN gives the following explanation:
The query was actually performed without using any indices. The (ALL in type column) whole T1 table was scanned and then joined with table T2.Id (ref column) and finally only filtered with the time range condition. This seems a bit strange as I would expect MySQL to resolve the query by filtering T2.Time first as that would remove many entries before the join is started.
After doing some studies on MySQL, I realised that MySQL is preferrable to scan the whole table because in general this gives a better performance. However, in this case the query is binding 4 million rows to another 6 million rows. This can take a lot of processing.
The next experiment is to make sure MySQL resolves the query using T2.Time first and with the same index scheme as in experiment 1, E1. There are three ways to enforce that:
- Covering index – index all the T2 fields appearing in the SELECT statement
- LIMIT – query ends with LIMIT
- FORCE or USE – force or hint the MySQL to use a particular index
For method 1, in order to use the T2.Time first, we have to create an index (Time, Id, C). There are a few drawbacks:
- Adding overhead to the INSERT statement
- Impractical when columns are changed and maintaining the index
- Impractical if the table has many fields
For method 2, it is out of the question as we need to retrieve all the results
In our case, method 3 compromises our product the least. It only requires the index name within the FROM clause, as long as the conditions inside the WHERE clause are not likely to be changed.
Using EXPLAIN on
SELECT T1.A, T1.B, SUM(T2.C) FROM T1, T2 FORCE INDEX (Time) WHERE (T1.Id = T2.Id) AND (T2.Time >= X AND T2.Time < Y) GROUP BY A, B
As you can see the FORCE INDEX changes the order of how MySQL resolves the query. This should actually speed up the query significantly in a number of ways
- Time is indexed in ascending order using B-Tree, so MySQL should quickly locate the entries within the given range
- MySQL only needs to perform binding with T1 and T2 on a smaller set of entries
- this also results in MySQL spending less time looking up the data pages (where non-indexed data is stored) once the WHERE is evaluated
We named this scheme as E2. The query runs 3 times starting with FLUSH TABLES each time.
|Test 1||7 mins 35 secs||28 secs|
|Test 2||6 mins 14 secs||27 secs|
|Test 3||10 mins 56 secs||27 secs|
|Average||8 mins 15 secs||27.33 secs|
As you can see, this is a huge improvement on performance. However the speedup is entirely reliant on the number of entries covered by the given range. So, the next interesting question will be if I enlarge the range query to cover the entire table in T2 (we name this result as E2-Full), would it take as long as E1?
|Test 1||7 mins 35 secs||2 mins 3 secs|
|Test 2||6 mins 14 secs||1 min 59 secs|
|Test 3||10 mins 56 secs||1 min 59 secs|
|Average||8 mins 15 secs||2 mins .33 secs|
The answer is no. Although on the surface both queries seem to process all the entries, the order in which MySQL processes the data makes a significant difference.
In E2-Full, since it processes all the queries first on the Time index (ascending order), the time range condition is resolved instantly. Then all the rows in T2 are joined with T1.Id which is indexed. In contrast, with E1 the whole T1 table is loaded up first, then joins both tables based on Id value. The extra overhead comes from processing the range comparison for each of the joined entry.
In the next part, we will investigate whether we can speedup the query furthermore.