第42期：MySQL 是否有必要多列分区

之前的篇章我们讨论的都是基于单列的分区表，那有无必要建立基于多列的分区表？这种分区表数据分布是否均匀？有无特殊的应用场景？有无特殊的优化策略？本篇基于这些问题来进行重点解读。

MySQL 不仅支持基于单列分区，也支持基于多列分区。比如基于字段（f1,f2,f3)来建立分区表，使用方法和使用场景都有些类似于联合索引。比如下面查询语句，同时对列(f1,f2,f3) 进行过滤。

select * from p1 where f1 = 2 and f2 = 2 and f3 = 2;

多列分区表的前提是参与分区的列检索频率均等，如果不均等，就没有必要使用多列分区。

我们还是以具体实例来验证下多列分区的优缺点以及适用场景，这样理解起来更加透彻。

建立一张表p1，字段r1,r2,r3分别取值为1-8，1-5，1-5.

create table p1(r1 int,r2 int,r3 int,log_date datetime);

按照字段(r1,r2,r3) 的分布范围，我来写个存储过程处理下表p1，变为分区表。存储过程代码如下：

DELIMITER $$

USE `ytt_new`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_new_p1`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_new_p1`()
BEGIN
	DECLARE i,j,k INT UNSIGNED DEFAULT 1;
	SET @stmt = '';
	SET @stmt_begin = 'ALTER TABLE p1 PARTITION BY RANGE COLUMNS (r1,r2,r3)(';
        WHILE i <= 8 DO
	   set j = 1;
	   while j <= 5 do
	     set k = 1;
	     while k <= 5 do
               SET @stmt = CONCAT(@stmt,' PARTITION p',i,j,k,' VALUES LESS THAN (',i,',',j,',',k,'),');
               set k = k + 1;
	     end while;
	     set j = j + 1;
	   end while;
	   SET i = i + 1;        
        END WHILE;	
	SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue,maxvalue,maxvalue))';
        SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
        PREPARE s1 FROM @stmt;
        EXECUTE s1;
        DROP PREPARE s1;
        SET @stmt = NULL;
        SET @stmt_begin = NULL;
        SET @stmt_end = NULL;	
	END$$

DELIMITER ;

调用存储过程，变更表p1为多列分区表，此时表p1有201个分区，记录数为500W条。

mysql> call sp_add_partition_ytt_new_p1;
Query OK, 0 rows affected (14.89 sec)

mysql> select count(partition_name) as partition_count  from information_schema.partitions where table_schema = 'ytt_new' and table_name ='p1';
+-----------------+
| partition_count |
+-----------------+
|             201 |
+-----------------+
1 row in set (0.00 sec)

mysql> select count(*) from p1;
+----------+
| count(*) |
+----------+
|  5000000 |
+----------+
1 row in set (12.01 sec)

用同样的方法建立一张分区表p2，来对单列分区表与多列分区表在一些场景下的性能做下对比：

分区表p2按照字段r1分区，仅仅分了9个。

mysql> CREATE TABLE `p2` (
  `r1` int DEFAULT NULL,
  `r2` int DEFAULT NULL,
  `r3` int DEFAULT NULL,
  `log_date` datetime DEFAULT NULL
) ENGINE=InnoDB
PARTITION BY RANGE  COLUMNS(r1)
(PARTITION p1 VALUES LESS THAN (1) ,
 PARTITION p2 VALUES LESS THAN (2) ,
 PARTITION p3 VALUES LESS THAN (3) ,
 PARTITION p4 VALUES LESS THAN (4) ,
 PARTITION p5 VALUES LESS THAN (5) ,
 PARTITION p6 VALUES LESS THAN (6) ,
 PARTITION p7 VALUES LESS THAN (7) ,
 PARTITION p8 VALUES LESS THAN (8) ,
 PARTITION p_max VALUES LESS THAN (MAXVALUE) 
)
1 row in set (0.00 sec)

mysql> insert into p2 select * from p1;
Query OK, 5000000 rows affected (1 min 37.92 sec)
Records: 5000000  Duplicates: 0  Warnings: 0

多个字段等值过滤的性能对比：同样的查询条件，表p1（执行时间0.02秒）比p2（执行时间0.49秒）要快几十倍。

mysql> select count(*) from p1 where r1 = 2 and r2 = 2 and r3 = 2;
+----------+
| count(*) |
+----------+
|    24992 |
+----------+
1 row in set (0.02 sec)

mysql> select count(*) from p2 where r1 = 2 and r2 = 2 and r3 = 2;
+----------+
| count(*) |
+----------+
|    24992 |
+----------+
1 row in set (0.49 sec)

查看两者执行计划对比：同样的查询，表p1扫描行数只有2W多，而表p2扫描行数有62W行，相差巨大。

mysql> explain select count(*) from p1 where r1 = 2 and r2 = 2 and r3 = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: p1
   partitions: p223
         type: ALL
...
         rows: 24711
     filtered: 0.10
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql> explain select count(*) from p2 where r1 = 2 and r2 = 2 and r3 = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: p2
   partitions: p3
         type: ALL
...
         rows: 623239
     filtered: 0.10
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

如果过滤字段不完整呢？比如不检索最后一列，再次做下对比：同样表p1（0.1秒）比表p2（0.52秒）执行时间要少几倍。

mysql> select count(*) from p1 where r1 = 2 and r2 = 2;
+----------+
| count(*) |
+----------+
|   124649 |
+----------+
1 row in set (0.10 sec)

mysql> select count(*) from p2 where r1 = 2 and r2 = 2;
+----------+
| count(*) |
+----------+
|   124649 |
+----------+
1 row in set (0.52 sec)

那只检索第一列呢：这次表p1和p2执行时间上差不多，p2稍微占优势。

mysql> select count(*) from p1 where r1 = 2 ;
+----------+
| count(*) |
+----------+
|   624599 |
+----------+
1 row in set (0.56 sec)

mysql> select count(*) from p2 where r1 = 2 ;
+----------+
| count(*) |
+----------+
|   624599 |
+----------+
1 row in set (0.45 sec)

看下执行计划对比：表p1扫描的分区数为26个，表p2仅扫描1个分区，分区数量上表p2相对少很多。

mysql> explain select count(*) from p1 where r1 = 2 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: p1
   partitions: p211,p212,p213,p214,p215,p221,p222,p223,p224,p225,p231,p232,p233,p234,p235,p241,p242,p243,p244,p245,p251,p252,p253,p254,p255,p311
         type: ALL
...
         rows: 648074
     filtered: 10.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql> explain select count(*) from p2 where r1 = 2 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: p2
   partitions: p3
         type: ALL
...
         rows: 623239
     filtered: 10.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

如果把字段r1拿掉呢？执行时间也相差无几，表p1和表p2都会扫描所有分区。

mysql> select count(*) from p1 where  r2 = 2;
+----------+
| count(*) |
+----------+
|   998700 |
+----------+
1 row in set (3.87 sec)

mysql> select count(*) from p2 where  r2 = 2;
+----------+
| count(*) |
+----------+
|   998700 |
+----------+
1 row in set (3.75 sec)

那鉴于此，再来探讨一个问题：对于多列分区，字段的排列顺序是否重要？

关于这个顺序要和我们查询语句对应的过滤条件来一一说明。类似下面两类 SQL ：

SQL 1： select * from p1 where r1 = 2 and r2 = 2 and r3 = 2;

对于SQL 1，顺序无关紧要，因为三个列在查询时都已包含；

SQL 2: select * from p1 where r1 = 2 and r2 = 2;

对于SQL 2 ， (r1,r2,r3) 和 (r2,r1,r3) 都可以满足。

SQL 3： select * from p1 where r2 = 2 and r3 = 2;

对于SQL 3， (r2,r3,r1) 和 (r3,r2,r1) 也都可以满足。

用同样的方法来建立分区表p3，分区字段顺序为(r2,r3,r1):

mysql> show create table p3\G
*************************** 1. row ***************************
       Table: p3
Create Table: CREATE TABLE `p3` (
  `r1` int DEFAULT NULL,
  `r2` int DEFAULT NULL,
  `r3` int DEFAULT NULL,
  `log_date` datetime DEFAULT NULL
) ENGINE=InnoDB 
/*!50500 PARTITION BY RANGE  COLUMNS(r2,r3,r1)
(PARTITION p111 VALUES LESS THAN (1,1,1) ENGINE = InnoDB,
...

对于表p3来讲：下面这条 SQL 执行时间比表p1要快几十倍，由于分区字段顺序不同，表p1要扫描所有分区才能出结果。

mysql> select count(*) from p3 where r2 = 1 and r3 = 4 ;
+----------+
| count(*) |
+----------+
|   199648 |
+----------+
1 row in set (0.22 sec)

mysql> select count(*) from p1 where r2 = 1 and r3 = 4 ;
+----------+
| count(*) |
+----------+
|   199648 |
+----------+
1 row in set (5.05 sec)

所以对于多列分区表，正如开头讲的一样，它和联合索引的使用方法、注意事项、使用场景也都很类似。对于某些特定的场景，使用多列分区能显著加快查询性能。

第42期：MySQL 是否有必要多列分区

于6月 29, 20226月 29, 2022由杨涛涛发布

MySQL 调优

第47期：EXPLAIN TYPE 列的 JOIN 常见场景详解（上）

MySQL 调优

第46期：EXPLAIN：解说一条简单 SQL 语句的执行计划

MySQL 调优

第45期：一条 SQL 语句优化的基本思路

第42期：MySQL 是否有必要多列分区

于6月 29, 20226月 29, 2022由杨涛涛发布

相关文章

MySQL 调优

第47期：EXPLAIN TYPE 列的 JOIN 常见场景详解（上）

MySQL 调优

第46期：EXPLAIN：解说一条简单 SQL 语句的执行计划

MySQL 调优

第45期：一条 SQL 语句优化的基本思路