之前的篇章我们讨论的都是基于单列的分区表,那有无必要建立基于多列的分区表?这种分区表数据分布是否均匀?有无特殊的应用场景?有无特殊的优化策略?本篇基于这些问题来进行重点解读。
MySQL 不仅支持基于单列分区,也支持基于多列分区。比如基于字段(f1,f2,f3)来建立分区表,使用方法和使用场景都有些类似于联合索引。比如下面查询语句,同时对列(f1,f2,f3) 进行过滤。
select * from p1 where f1 = 2 and f2 = 2 and f3 = 2;
多列分区表的前提是参与分区的列检索频率均等,如果不均等,就没有必要使用多列分区。
我们还是以具体实例来验证下多列分区的优缺点以及适用场景,这样理解起来更加透彻。
建立一张表p1,字段r1,r2,r3分别取值为1-8,1-5,1-5.
create table p1(r1 int,r2 int,r3 int,log_date datetime);
按照字段(r1,r2,r3) 的分布范围,我来写个存储过程处理下表p1,变为分区表。存储过程代码如下:
DELIMITER $$ USE `ytt_new`$$ DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_new_p1`$$ CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_new_p1`() BEGIN DECLARE i,j,k INT UNSIGNED DEFAULT 1; SET @stmt = ''; SET @stmt_begin = 'ALTER TABLE p1 PARTITION BY RANGE COLUMNS (r1,r2,r3)('; WHILE i <= 8 DO set j = 1; while j <= 5 do set k = 1; while k <= 5 do SET @stmt = CONCAT(@stmt,' PARTITION p',i,j,k,' VALUES LESS THAN (',i,',',j,',',k,'),'); set k = k + 1; end while; set j = j + 1; end while; SET i = i + 1; END WHILE; SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue,maxvalue,maxvalue))'; SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end); PREPARE s1 FROM @stmt; EXECUTE s1; DROP PREPARE s1; SET @stmt = NULL; SET @stmt_begin = NULL; SET @stmt_end = NULL; END$$ DELIMITER ;
调用存储过程,变更表p1为多列分区表,此时表p1有201个分区,记录数为500W条。
mysql> call sp_add_partition_ytt_new_p1; Query OK, 0 rows affected (14.89 sec) mysql> select count(partition_name) as partition_count from information_schema.partitions where table_schema = 'ytt_new' and table_name ='p1'; +-----------------+ | partition_count | +-----------------+ | 201 | +-----------------+ 1 row in set (0.00 sec) mysql> select count(*) from p1; +----------+ | count(*) | +----------+ | 5000000 | +----------+ 1 row in set (12.01 sec)
用同样的方法建立一张分区表p2,来对单列分区表与多列分区表在一些场景下的性能做下对比:
分区表p2按照字段r1分区,仅仅分了9个。
mysql> CREATE TABLE `p2` ( `r1` int DEFAULT NULL, `r2` int DEFAULT NULL, `r3` int DEFAULT NULL, `log_date` datetime DEFAULT NULL ) ENGINE=InnoDB PARTITION BY RANGE COLUMNS(r1) (PARTITION p1 VALUES LESS THAN (1) , PARTITION p2 VALUES LESS THAN (2) , PARTITION p3 VALUES LESS THAN (3) , PARTITION p4 VALUES LESS THAN (4) , PARTITION p5 VALUES LESS THAN (5) , PARTITION p6 VALUES LESS THAN (6) , PARTITION p7 VALUES LESS THAN (7) , PARTITION p8 VALUES LESS THAN (8) , PARTITION p_max VALUES LESS THAN (MAXVALUE) ) 1 row in set (0.00 sec) mysql> insert into p2 select * from p1; Query OK, 5000000 rows affected (1 min 37.92 sec) Records: 5000000 Duplicates: 0 Warnings: 0
多个字段等值过滤的性能对比:同样的查询条件,表p1(执行时间0.02秒)比p2(执行时间0.49秒)要快几十倍。
mysql> select count(*) from p1 where r1 = 2 and r2 = 2 and r3 = 2; +----------+ | count(*) | +----------+ | 24992 | +----------+ 1 row in set (0.02 sec) mysql> select count(*) from p2 where r1 = 2 and r2 = 2 and r3 = 2; +----------+ | count(*) | +----------+ | 24992 | +----------+ 1 row in set (0.49 sec)
查看两者执行计划对比: 同样的查询,表p1扫描行数只有2W多,而表p2扫描行数有62W行,相差巨大。
mysql> explain select count(*) from p1 where r1 = 2 and r2 = 2 and r3 = 2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: p1 partitions: p223 type: ALL ... rows: 24711 filtered: 0.10 Extra: Using where 1 row in set, 1 warning (0.00 sec) mysql> explain select count(*) from p2 where r1 = 2 and r2 = 2 and r3 = 2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: p2 partitions: p3 type: ALL ... rows: 623239 filtered: 0.10 Extra: Using where 1 row in set, 1 warning (0.00 sec)
如果过滤字段不完整呢?比如不检索最后一列,再次做下对比:同样表p1(0.1秒)比表p2(0.52秒)执行时间要少几倍。
mysql> select count(*) from p1 where r1 = 2 and r2 = 2; +----------+ | count(*) | +----------+ | 124649 | +----------+ 1 row in set (0.10 sec) mysql> select count(*) from p2 where r1 = 2 and r2 = 2; +----------+ | count(*) | +----------+ | 124649 | +----------+ 1 row in set (0.52 sec)
那只检索第一列呢:这次表p1和p2执行时间上差不多,p2稍微占优势。
mysql> select count(*) from p1 where r1 = 2 ; +----------+ | count(*) | +----------+ | 624599 | +----------+ 1 row in set (0.56 sec) mysql> select count(*) from p2 where r1 = 2 ; +----------+ | count(*) | +----------+ | 624599 | +----------+ 1 row in set (0.45 sec)
看下执行计划对比:表p1扫描的分区数为26个,表p2仅扫描1个分区,分区数量上表p2相对少很多。
mysql> explain select count(*) from p1 where r1 = 2 \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: p1 partitions: p211,p212,p213,p214,p215,p221,p222,p223,p224,p225,p231,p232,p233,p234,p235,p241,p242,p243,p244,p245,p251,p252,p253,p254,p255,p311 type: ALL ... rows: 648074 filtered: 10.00 Extra: Using where 1 row in set, 1 warning (0.00 sec) mysql> explain select count(*) from p2 where r1 = 2 \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: p2 partitions: p3 type: ALL ... rows: 623239 filtered: 10.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)
如果把字段r1拿掉呢?执行时间也相差无几,表p1和表p2都会扫描所有分区。
mysql> select count(*) from p1 where r2 = 2; +----------+ | count(*) | +----------+ | 998700 | +----------+ 1 row in set (3.87 sec) mysql> select count(*) from p2 where r2 = 2; +----------+ | count(*) | +----------+ | 998700 | +----------+ 1 row in set (3.75 sec)
那鉴于此,再来探讨一个问题:对于多列分区,字段的排列顺序是否重要?
关于这个顺序要和我们查询语句对应的过滤条件来一一说明。 类似下面两类 SQL :
SQL 1: select * from p1 where r1 = 2 and r2 = 2 and r3 = 2;
对于SQL 1,顺序无关紧要,因为三个列在查询时都已包含;
SQL 2: select * from p1 where r1 = 2 and r2 = 2;
对于SQL 2 , (r1,r2,r3) 和 (r2,r1,r3) 都可以满足。
SQL 3: select * from p1 where r2 = 2 and r3 = 2;
对于SQL 3, (r2,r3,r1) 和 (r3,r2,r1) 也都可以满足。
用同样的方法来建立分区表p3,分区字段顺序为(r2,r3,r1):
mysql> show create table p3\G *************************** 1. row *************************** Table: p3 Create Table: CREATE TABLE `p3` ( `r1` int DEFAULT NULL, `r2` int DEFAULT NULL, `r3` int DEFAULT NULL, `log_date` datetime DEFAULT NULL ) ENGINE=InnoDB /*!50500 PARTITION BY RANGE COLUMNS(r2,r3,r1) (PARTITION p111 VALUES LESS THAN (1,1,1) ENGINE = InnoDB, ...
对于表p3来讲:下面这条 SQL 执行时间比表p1要快几十倍,由于分区字段顺序不同,表p1要扫描所有分区才能出结果。
mysql> select count(*) from p3 where r2 = 1 and r3 = 4 ; +----------+ | count(*) | +----------+ | 199648 | +----------+ 1 row in set (0.22 sec) mysql> select count(*) from p1 where r2 = 1 and r3 = 4 ; +----------+ | count(*) | +----------+ | 199648 | +----------+ 1 row in set (5.05 sec)
所以对于多列分区表,正如开头讲的一样,它和联合索引的使用方法、注意事项、使用场景也都很类似。对于某些特定的场景,使用多列分区能显著加快查询性能。