티스토리 뷰

MySQL 8(8.0.14 이후)에서 인라인 뷰로 JOIN을 할때 LATERAL 연산자를 활용하면 JOIN을 최적화 할 수 있습니다.
사용법은 JOIN 키워드 옆에 LATERAL을 적어주면 됩니다.
MySQL :: MySQL 8.0 Reference Manual :: 15.2.15.9 Lateral Derived Tables

 

MySQL :: MySQL 8.0 Reference Manual :: 15.2.15.9 Lateral Derived Tables

15.2.15.9 Lateral Derived Tables A derived table cannot normally refer to (depend on) columns of preceding tables in the same FROM clause. As of MySQL 8.0.14, a derived table may be defined as a lateral derived table to specify that such references are pe

dev.mysql.com

 

테스트 환경 : AWS Aurora MySQL 3.04.2 (MySQL 8.0.28 호환)
서버 스펙 : t4g.medium (2 vCPU, 4GiB RAM)

 

테스트 스키마 및 데이터는 MySQL 샘플을 사용하였습니다.
MySQL :: Employees Sample Database :: 3 Installation

 

MySQL :: Employees Sample Database :: 3 Installation

The Employees database is available from Employees DB on GitHub. You can download a prepackaged archive of the data, or access the information through Git. To use the Zip archive package, download the archive and unpack it using WinZip or another tool that

dev.mysql.com

 

테스트를 위해 employees 테이블에 있는 hire_date 컬럼에 인덱스를 추가 합니다.
CREATE INDEX IX_hire_date ON employees (hire_date);

 

employees : 사원 테이블
salaries : 급여 테이블

 

테스트 시나리오는 아래와 같습니다.
1990년 1월에 채용된 사원들의 급여 변경 횟수를 emp_no 순으로 정렬하여 추출하는 쿼리를 작성합니다.

기존 방식으로 쿼리 실행을 해보겠습니다.

-- 기존 방식
    SELECT  T1.emp_no
        ,   T2.CNT_SALARY
    FROM    employees   AS T1
    INNER JOIN
    (
        SELECT  emp_no
            ,   COUNT(*) AS CNT_SALARY
        FROM    salaries
        GROUP BY 
                emp_no
    )    AS  T2
        ON  T2.emp_no = T1.emp_no
    WHERE   T1.hire_date BETWEEN '1990-01-01' AND '1990-01-31'
    ORDER BY
            T1.emp_no;

실행 시간은 약 1.4초가 소요됩니다.

 

EXPLAIN으로 실행계획을 한번 확인해 보겠습니다.

+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+
| id | select_type | table      | partitions | type  | possible_keys        | key          | key_len | ref                 | rows    | filtered | Extra                                    |
+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+
|  1 | PRIMARY     | T1         | NULL       | range | PRIMARY,IX_hire_date | IX_hire_date | 3       | NULL                |    2244 |      100 | Using where; Using index; Using filesort |
|  1 | PRIMARY     | <derived2> | NULL       | ref   | <auto_key0>          | <auto_key0>  | 4       | employees.T1.emp_no |      10 |      100 | NULL                                     |
|  2 | DERIVED     | salaries   | NULL       | index | PRIMARY              | PRIMARY      | 7       | NULL                | 2838426 |      100 | Using index                              |
+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+

salaries 테이블을 풀 스캔해서 전제 row를 읽습니다.

EXPLAIN ANALYZE 결과 입니다.

| -> Nested loop inner join  (cost=636948856.13 rows=6369427944) (actual time=1407.409..1413.005 rows=2244 loops=1)
    -> Sort: t1.emp_no  (cost=451.72 rows=2244) (actual time=2.428..2.668 rows=2244 loops=1)
        -> Filter: (t1.hire_date between '1990-01-01' and '1990-01-31')  (cost=451.72 rows=2244) (actual time=0.019..1.909 rows=2244 loops=1)
            -> Index range scan on T1 using IX_hire_date over ('1990-01-01' <= hire_date <= '1990-01-31')  (cost=451.72 rows=2244) (actual time=0.017..0.812 rows=2244 loops=1)
    -> Index lookup on T2 using <auto_key0> (emp_no=t1.emp_no)  (actual time=0.002..0.002 rows=1 loops=2244)
        -> Materialize  (cost=853057.80..853057.80 rows=2838426) (actual time=1409.351..1409.869 rows=300024 loops=1)
            -> Group aggregate: count(0)  (cost=569215.20 rows=2838426) (actual time=0.029..956.258 rows=300024 loops=1)
                -> Covering index scan on salaries using PRIMARY  (cost=285372.60 rows=2838426) (actual time=0.022..724.305 rows=2844047 loops=1)
 |

 

employees(T1) 테이블의 emp_no가 인라인 뷰 안에 전달되면 좋을 듯 하여 쿼리를 변경해 봅니다.

SELECT  T1.emp_no
        ,   T2.CNT_SALARY
    FROM    employees   AS T1
    INNER JOIN
    (
        SELECT  emp_no
            ,   COUNT(*) AS CNT_SALARY
        FROM    salaries
        WHERE   emp_no = T1.emp_no
        GROUP BY 
                emp_no
    )    AS  T2        
    WHERE   T1.hire_date BETWEEN '1990-01-01' AND '1990-01-31'
    ORDER BY
            T1.emp_no;

 위와 같이 실행을 하면 오류가 발생합니다.

ERROR: 1054 (42S22): Unknown column 'T1.emp_no' in 'where clause'

 

이번에는 LATERAL로 작성하여 쿼리를 실행해 보겠습니다.

-- LATERAL 방식
    SELECT  T1.emp_no
        ,   T2.CNT_SALARY
    FROM    employees   AS T1
    INNER JOIN LATERAL
    (
        SELECT  emp_no
            ,   COUNT(*) AS CNT_SALARY
        FROM    salaries
        WHERE   emp_no = T1.emp_no
        GROUP BY 
                emp_no
    )   AS  T2    
    WHERE   T1.hire_date BETWEEN '1990-01-01' AND '1990-01-31'
    ORDER BY
            T1.emp_no;

실행시간은 약 0.02초가 소요됩니다.

 

EXPLAIN으로 실행계획을 한번 확인해 보겠습니다.

+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+
| id | select_type       | table      | partitions | type  | possible_keys | key          | key_len | ref                 | rows | filtered | Extra                                                                |
+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+
|  1 | PRIMARY           | T1         | NULL       | range | IX_hire_date  | IX_hire_date | 3       | NULL                | 2244 |      100 | Using where; Using index; Using filesort; Rematerialize (<derived2>) |
|  1 | PRIMARY           | <derived2> | NULL       | ALL   | NULL          | NULL         | NULL    | NULL                |    9 |      100 | NULL                                                                 |
|  2 | DEPENDENT DERIVED | salaries   | NULL       | ref   | PRIMARY       | PRIMARY      | 4       | employees.T1.emp_no |    9 |      100 | Using index                                                          |
+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+

salaries 테이블에 rows가 확연히 줄었습니다.

 

EXPLAIN ANALYZE 결과 입니다.

| -> Nested loop inner join  (cost=8479.23 rows=21629) (actual time=2.502..22.899 rows=2244 loops=1)
    -> Invalidate materialized tables (row from T1)  (cost=451.72 rows=2244) (actual time=2.464..2.848 rows=2244 loops=1)
        -> Sort: t1.emp_no  (cost=451.72 rows=2244) (actual time=2.463..2.690 rows=2244 loops=1)
            -> Filter: (t1.hire_date between '1990-01-01' and '1990-01-31')  (cost=451.72 rows=2244) (actual time=0.019..1.945 rows=2244 loops=1)
                -> Index range scan on T1 using IX_hire_date over ('1990-01-01' <= hire_date <= '1990-01-31')  (cost=451.72 rows=2244) (actual time=0.017..0.841 rows=2244 loops=1)
    -> Table scan on T2  (cost=0.27..2.61 rows=10) (actual time=0.000..0.000 rows=1 loops=2244)
        -> Materialize (invalidate on row from T1)  (cost=3.42..5.76 rows=10) (actual time=0.009..0.009 rows=1 loops=2244)
            -> Group aggregate: count(0)  (cost=2.18 rows=10) (actual time=0.008..0.008 rows=1 loops=2244)
                -> Covering index lookup on salaries using PRIMARY (emp_no=t1.emp_no)  (cost=1.22 rows=10) (actual time=0.005..0.007 rows=9 loops=2244)
 |

employees(T1) 테이블의 emp_no가 인라인 뷰 안에 전달되어 필요한 만큼만 읽습니다.

 

 비교하면 아래와 같습니다.

방식 실행시간
일반 1.4초
LATERAL 0.02초

 

이번에는 1990년 1월에 채용된 사원들의 급여 중 가장 높은 금액 2건을 emp_no 순으로 정렬하여 추출하는 쿼리를 작성합니다.

 

8 버전에서는 Windows Function을 지원하기 때문에 ROW_NUMBER() 함수를 활용하여 작성이 가능합니다.

-- 기존 방식
    SELECT  T1.emp_no
        ,   T2.salary
    FROM    employees   AS T1
    INNER JOIN
    (
        SELECT  emp_no
            ,   salary
            ,   ROW_NUMBER() OVER (PARTITION BY emp_no ORDER BY salary DESC) AS RN
        FROM    salaries    
    )    AS  T2
        ON  T2.emp_no = T1.emp_no
    WHERE   T1.hire_date BETWEEN '1990-01-01' AND '1990-01-31'
    AND     T2.RN <= 2
    ORDER BY
            T1.emp_no
    ;

EXPLAIN 실행 계획 입니다.

+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+
| id | select_type | table      | partitions | type  | possible_keys        | key          | key_len | ref                 | rows    | filtered | Extra                                    |
+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+
|  1 | PRIMARY     | T1         | NULL       | range | PRIMARY,IX_hire_date | IX_hire_date | 3       | NULL                |    2244 |      100 | Using where; Using index; Using filesort |
|  1 | PRIMARY     | <derived2> | NULL       | ref   | <auto_key1>          | <auto_key1>  | 4       | employees.T1.emp_no |      16 |    33.33 | Using where                              |
|  2 | DERIVED     | salaries   | NULL       | ALL   | NULL                 | NULL         | NULL    | NULL                | 2838426 |      100 | Using filesort                           |
+----+-------------+------------+------------+-------+----------------------+--------------+---------+---------------------+---------+----------+------------------------------------------+

salaries 테이블을 풀 스캔 합니다.

 

 EXPLAIN ANALYZE 결과 입니다.

| -> Nested loop inner join  (cost=13541.72 rows=12465) (actual time=7330.786..7345.554 rows=4427 loops=1)
    -> Sort: t1.emp_no  (cost=451.70 rows=2244) (actual time=2.514..2.777 rows=2244 loops=1)
        -> Filter: (t1.hire_date between '1990-01-01' and '1990-01-31')  (cost=451.70 rows=2244) (actual time=0.020..1.989 rows=2244 loops=1)
            -> Index range scan on T1 using IX_hire_date over ('1990-01-01' <= hire_date <= '1990-01-31')  (cost=451.70 rows=2244) (actual time=0.017..0.860 rows=2244 loops=1)
    -> Filter: (t2.RN <= 2)  (cost=0.75..4.17 rows=6) (actual time=3.269..3.272 rows=2 loops=2244)
        -> Index lookup on T2 using <auto_key1> (emp_no=t1.emp_no)  (actual time=0.003..0.005 rows=9 loops=2244)
            -> Materialize  (cost=0.00..0.00 rows=0) (actual time=7334.979..7340.475 rows=2844047 loops=1)
                -> Window aggregate: row_number() OVER (PARTITION BY salaries.emp_no ORDER BY salaries.salary desc )   (actual time=1946.235..2695.254 rows=2844047 loops=1)
                    -> Sort: salaries.emp_no, salaries.salary DESC  (cost=285372.60 rows=2838426) (actual time=1946.229..2173.861 rows=2844047 loops=1)
                        -> Table scan on salaries  (cost=285372.60 rows=2838426) (actual time=0.030..871.627 rows=2844047 loops=1)
 |

 

LATERAL로 쿼리 실행

-- LATERAL 방식
    SELECT  T1.emp_no
        ,   T2.salary
    FROM    employees   AS T1
    INNER JOIN LATERAL
    (
        SELECT  emp_no
            , salary
        FROM    salaries
        WHERE   emp_no = T1.emp_no
        ORDER BY
                salary DESC          
        LIMIT 2
    )    AS T2    
    WHERE   T1.hire_date BETWEEN '1990-01-01' AND '1990-01-31'
    ORDER BY
            T1.emp_no
    ;

EXPLAIN 실행 계획 입니다.

+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+
| id | select_type       | table      | partitions | type  | possible_keys | key          | key_len | ref                 | rows | filtered | Extra                                                                |
+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+
|  1 | PRIMARY           | T1         | NULL       | range | IX_hire_date  | IX_hire_date | 3       | NULL                | 2244 |      100 | Using where; Using index; Using filesort; Rematerialize (<derived2>) |
|  1 | PRIMARY           | <derived2> | NULL       | ALL   | NULL          | NULL         | NULL    | NULL                |    2 |      100 | NULL                                                                 |
|  2 | DEPENDENT DERIVED | salaries   | NULL       | ref   | PRIMARY       | PRIMARY      | 4       | employees.T1.emp_no |    9 |      100 | Using filesort                                                       |
+----+-------------------+------------+------------+-------+---------------+--------------+---------+---------------------+------+----------+----------------------------------------------------------------------+

 EXPLAIN ANALYZE 결과 입니다.

| -> Nested loop inner join  (cost=6568.01 rows=4488) (actual time=2.508..29.529 rows=4427 loops=1)
    -> Invalidate materialized tables (row from T1)  (cost=451.70 rows=2244) (actual time=2.466..2.840 rows=2244 loops=1)
        -> Sort: t1.emp_no  (cost=451.70 rows=2244) (actual time=2.465..2.688 rows=2244 loops=1)
            -> Filter: (t1.hire_date between '1990-01-01' and '1990-01-31')  (cost=451.70 rows=2244) (actual time=0.021..1.935 rows=2244 loops=1)
                -> Index range scan on T1 using IX_hire_date over ('1990-01-01' <= hire_date <= '1990-01-31')  (cost=451.70 rows=2244) (actual time=0.018..0.894 rows=2244 loops=1)
    -> Table scan on T2  (cost=1.26..2.52 rows=2) (actual time=0.000..0.000 rows=2 loops=2244)
        -> Materialize (invalidate on row from T1)  (cost=2.68..3.94 rows=2) (actual time=0.011..0.012 rows=2 loops=2244)
            -> Limit: 2 row(s)  (cost=1.22 rows=2) (actual time=0.010..0.011 rows=2 loops=2244)
                -> Sort: salaries.salary DESC, limit input to 2 row(s) per chunk  (cost=1.22 rows=10) (actual time=0.010..0.010 rows=2 loops=2244)
                    -> Index lookup on salaries using PRIMARY (emp_no=t1.emp_no)  (actual time=0.005..0.008 rows=9 loops=2244)
 |

employees(T1) 테이블의 emp_no가 인라인뷰 안에 전달되어 필요한 만큼만 읽습니다.

방식 실행시간
일반 7.3초
LATERAL 0.03초

 

optimizer가 판단하에 자동으로 조건을 넣어주면 좋을 것 같은데 아직은 그렇게 작동하지는 않는것 같습니다.

인라인 뷰로 JOIN을 하면서 집계 처리나 건수 제한을 하는 경우에 LATERAL을 활용하면 최적화가 가능할 수 있습니다.

 

참고 : SQL Server에서는 APPLY 라는 비슷한 기능을 하는 연산자가 존재합니다.