Personal Programming Notes

To err is human; to debug, divine.

Analytic Functions in MySQL

MySQL has traditionally lagged behind in support for the SQL standard. Unfortunately, from my experience, MySQL is often used as the sandbox for SQL code challenges and interviews. If you are used to work with Vertica SQL, writing SQL statements in MySQL can be challenging exercises, NOT necessarily in a good way, because many useful features are not supported.

WITH clause

As discussed in this blog post, WITH clause syntax, also known as Common Table Expressions (CTE), is thankfully supported in Vertica. In summary, WITH clause allows us to arrange sub-queries, usually intermediate steps, in a complex SQL query in sequential, logical order. This will make the complex queries easier to compose and read: we can write steps by steps of the query from top to bottom like a story (i.e., literate programming). Unfortunately, WITH clause is not supported by MySQL although this feature has been requested since 2006. There are work-arounds for MySQL’s lack of CTE, but the easiest way is probably to revert back to using nested subqueries.

Personally, lack of WITH clause support in MySQL is my greatest hindrance as I often ended up writing queries using WITH clauses as first draft before rewriting those queries using nested subqueries. This might appear clumsy in SQL interviews even though writing SQL codes with CTE instead of subqueries is the recommended practice for maintainable code.

Analytic functions

Another regrettable hindrance when working in MySQL is its lack of analytic functions such as ROW_NUMBER, RANK and DENSE_RANK. Those analytic functions are supported in Vertica. The difference between these three functions can be a bit subtle, and would be best described in the following example:

Example of ROW_NUMBER, RANK, and DENSE_RANK functions
1
2
3
4
5
6
7
SELECT customer_name, SUM(annual_income),
ROW_NUMBER () OVER (ORDER BY TO_CHAR(SUM(annual_income),'100000') DESC) row_number,
RANK () OVER (ORDER BY TO_CHAR(SUM(annual_income),'100000') DESC) rank,
DENSE_RANK () OVER (ORDER BY TO_CHAR(SUM(annual_income),'100000') DESC) dense_rank
FROM customer_dimension
GROUP BY customer_name
LIMIT 15;

The outputs of these functions are only different if there are duplicates in SUM(annual_income) value, as seen in rows 75-81 in the example output below:

customer_nameSUMrow_numberrankdense_rank
Theodore R. King97444717171
Laura Y. Pavlov97417727272
Carla . Garcia97371737373
Jack Z. Miller97356747474
Steve W. Williams97343757575
Lauren Y. Rodriguez97343767575
Lucas . Webber97318777776
Sarah N. Moore97243787877
Lucas O. Li97184797978
Doug K. Reyes97166808079
Michael . Weaver97162818180


Sadly, these useful analytic functions are not supported in MySQL. Fortunately, MySQL supports user variables in SQL queries and we can reproduce those functionalities in MySQL using variables and subqueries as follows:

ROW_NUMBER, RANK, and DENSE_RANK functions in MySQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- In Vertica
SELECT
ROW_NUMBER () OVER (PARTITION BY col_1, col_2 ORDER BY col_3 DESC) AS row_number,
RANK () OVER (PARTITION BY col_1, col_2 ORDER BY col_3 DESC) AS rank,
DENSE_RANK () OVER (PARTITION BY col_1, col_2 ORDER BY col_3 DESC) AS dense_rank,
t.*
FROM table_1 t

-- In MySQL
SELECT
@row_num:=IF(@prev_col_1=t.col_1 AND @prev_col_2=t.col_2, @row_num+1, 1) AS row_number,
@rank:=IF(@prev_col_1=t.col_1 AND @prev_col_2=t.col_2 AND @prev_col_3=col_3, @rank, @row_num) AS rank,
@dense:=IF(@prev_col_1=t.col_1 AND @prev_col_2=t.col_2, IF(@prev_col_3=col_3, @dense, @dense+1), 1) AS dense_rank,
@prev_col_1 = t.col_1,
@prev_col_2 = t.col_2,
@prev_col_3 = t.col_3,
t.*
FROM (SELECT * FROM table_1 ORDER BY col_1, col_2, col_3 DESC) t,
     (SELECT @row_num:=1, @dense:=1, @rank:=1, @prev_col_1:=NULL, @prev_col_2:=NULL, @prev_col_3:=NULL) var

The MySQL work-around is intentionally generic so that I can adapt it to any use case. In addition, it intentionally has a single pass (no SET statements, temporary table) since most SQL code challenges expect a single query. Finally, note that the above MySQL solution is intentionally incomplete to make it less convoluted. You need to put that solution in a subquery and SELECT only relevant columns from it.

As an example, the above code template is used to solve this Rank Scores problem. In summary, the question asks for DENSE_RANK functionality to be applied on Score column.

Input table
1
2
3
4
5
6
7
8
9
10
+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+
Expected output
1
2
3
4
5
6
7
8
9
10
+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+

The solution in Vertica SQL would be straight-forward as follows:

Solution in Vertica SQL
1
2
3
select Score,
DENSE_RANK() OVER (ORDER BY Score DESC) AS Rank
FROM Scores;

In MySQL, apply the above code template and note that there is no partition clause to arrive at the following solution:

Solution in MySQL
1
2
3
4
5
6
SELECT Score, Rank FROM
( SELECT t.Score,
@dense:=IF(@prev_col2=t.Score, @dense, @dense+1) AS Rank,
@prev_col2:=t.Score
FROM (SELECT Score FROM Scores ORDER BY Score DESC) t,
(SELECT @dense:=0, @prev_col2:=NULL) var ) x

Note that the outer SELECT is used to only expose only columns of interest while the main SQL code is enclosed in a subquery.

Reference