MySQL – Steps to break down large time series data & queries (Part 2/2)
1) Appends date time to the table name
For example, a table name is Data_Connections_History. Then the application appends the date at the end of table name, such as Data_Connections_History_2009_3_14. The application should be able to create these tables automatically and insert the new entries.
The idea is to allow database tables with content unchanged as time passes by and to pre-filter time data on a daily basis. In order to reuse query cache, two things must be unchanged; the query strings and the content of the database tables.
If your data can grow significantly on a daily basis, perhaps break down the table names even more, ie hourly with _2009_3_14_[1-24] or quarterly (every 6 hours) with _2009_3_4_[1-4].
2) Break the long query into a sequence of queries with a smaller time range
If the data are arranged into daily tables, then the time range condition can be refined into an hourly basis. For example, you have a long range query with a condition such as:
time > “2009-03-10 01:23:45” and time < “2009-03-14 12:34:56”
Your application should break the query into small queries with conditions such as
time > “2009-03-10 01:23:45” and time < “2009-03-10 02:00:00”
time > “2009-03-10 02:00:00” and time < “2009-03-10 03:00:00”
time > “2009-03-14 12:00:00” and time < “2009-03-14 12:34:56”
and launch these queries with appropriate tables within the time range. As you can see the time range conditions in the query become fixated (except the boundary queries) which helps to resolve the query via query cache.
If the table is organised in some hourly fashion rather than the daily approach, then the original long range query may cover the whole time span of most tables. Hence, the refined queries do not need to use the time range conditions at all (except boundary case).
3) Disable query cache at the boundary queries (Optional)
Chances are the boundary time queries maybe always different because the end users always query different time ranges. There is little point in searching through query cache and saving the result set into the query cache.
You can simply do SQL_NO_CACHE to stop the result populating into the cache and switch off query_cache_type to stop the query searching through cache.
SET SESSION query_cache_type = OFF;
SELECT SQL_NO_CACHE <boundary query>
SET SESSION query_cache_type = ON;
This may give negligible improvement on the performance but this depends on the user queries. Is it lots of long range or short range queries?
The downside of this approach is that it will result in a large number of tables. However, a lot of CPU processing time and memory usage are saved because
- data are already subdivided into tables with a smaller time frame that requires less processing time
- smaller volume of data inside the table requires less memory to load up the index to resolve the query
- if the data is stored in an hourly table, a long range query will entirely cover most of the tables, i.e. no need to specify time range conditions in the queries.
Quite possibly this scheme requires slightly more disk usage as more table files are created but the increase is very minimal and the tradeoff is well worth it. Another major advantage is the possibility of using multi-thread on multiple range queries, taking the benefit of multi-cores processors.
Note that if your application requires regular modification of past tables, this scheme may not work efficiently but it is still better than one big table.