NOT IN better than
When we want to insert a row into a dimension table AND check for duplicates at the same time, we usually do this in DML scripts:
However, for all such inserts, we were recently informed that it is better in Vertica to do
NOT IN instead of
So, for example above:
LEFT JOIN to check existence
Let’s say we have an ETL that regularly inserts new data into an existing dimension table.
1 2 3 4 5 6 7 8 9 10 11 12
We are sometimes doing
LEFT JOIN like this only to determine whether or not an entry already exists in the table.
It would be faster to use a
WHERE clause instead to perform that existence check.
Although it might sound counter-intuitive, but reducing
JOIN operations like this has been regularly recommended.
1 2 3 4 5 6 7 8 9 10 11
Avoid function calls in
For this performance tip, we make a slight change to the example ETL in the last section above where
country_id column is removed. In this case, we can use a normalized
country_name as the ID to check for existing entries in the table:
1 2 3 4 5 6 7 8 9
In this example, we normalize
country_name to lower case. Note that
WHERE clause should be used instead of
LEFT JOIN as discussed above.
1 2 3 4 5 6 7 8
However, such change still has bad performance because, in general, function calls in
JOIN clauses should be avoided in Vertica.
In both examples above, calling functions like
JOIN clauses will affect the performance of the ETLs.
The solution for this scenario is that, since we control what goes into dimension tables, we can ensure that columns like
country_name are always stored in lower-case.
Then, we can do the same when creating the temporary table such as
staging_table that we are comparing to for checking existence.
Make sure to run
ANALYZE_STATISTICS after all data loads.
Using this function, tables are analyzed for best performance in subsequent queries ran against it.
Without information from
ANALYZE_STATISTICS, the query optimizer assumes uniform distribution of data values and equal storage usage for all projections.
ANALYZE_STATISTICS is only supported on local temporary tables, but not on global temporary tables.
In addition, when we add
ANALYZE_STATISTICS function calls into our ETL scripts, errors might be thrown when a second
ANALYZE_STATISTICS call is made while the first is still running.
Those errors can be ignored but they must be caught accordingly to separate with other Vertica error messages.
Avoid creating temporary tables using
Instead of creating temporary tables using
SELECT, it is recommended to:
- Create the temporary table first without a projection.
- Create a super projection with the correct column encodings and
- Populate it using
INSERT /*+ direct */ INTO. Note the
/*+ direct */hint to write data directly to disk, bypassing memory.
ANALYZE_STATISTICS. See the last section.
For example, in a Vertica ETL script that runs daily, we usually create a temporary table to retrieve the latest records from a source table like this:
1 2 3 4 5 6 7 8 9
In this example,
last_modify_date is the CDC column and
customer_id is the primary key column.
Although this SQL statement is simple and easy to understand, it is really slow for a large and growing
stg_customer table that contains updates to all customers on multiple dates, with millions of new customer entries each day.
Instead, the recommended coding pattern is to create a temporary table first without a projection:
1 2 3 4 5 6 7
It is also recommended that the column names are explicitly specified, so that only required columns are created in the temporary table.
LOCAL temporary table is created, instead of
GLOBAL, so that we can use
ANALYZE_STATISTICS functions as discussed above.
Next, create a super projection with the correct column encodings and
ORDER BY clause:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Finally, insert “directly” into the temporary table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
WITH clause is just a more readable way to write the sub-query in the original SQL statement (see WITH clause).
In addition, the wildcard
* in the original SQL query is also avoided, in case the table
stg_customer is a very wide table.