Vertica: Performance Optimization Notes

In this post, most of optimization notes for Vertica performance are from our team’s interaction with Nexius consultants. Also see Vertica Best Practices.

NOT IN better than NOT EXISTS

When we want to insert a row into a dimension table AND check for duplicates at the same time, we usually do this in DML scripts:

However, for all such inserts, we were recently informed that it is better in Vertica to do NOT IN instead of NOT EXISTS. So, for example above:

Avoid using LEFT JOIN to check existence

Let’s say we have an ETL that regularly inserts new data into an existing dimension table.

We are sometimes doing LEFT JOIN like this only to determine whether or not an entry already exists in the table. It would be faster to use a WHERE clause instead to perform that existence check. Although it might sound counter-intuitive, but reducing JOIN operations like this has been regularly recommended.

Avoid function calls in WHERE and JOIN clauses

For this performance tip, we make a slight change to the example ETL in the last section above where country_id column is removed. In this case, we can use a normalized country_name as the ID to check for existing entries in the table:

In this example, we normalize country_name to lower case. Note that WHERE clause should be used instead of LEFT JOIN as discussed above.

However, such change still has bad performance because, in general, function calls in WHERE and JOIN clauses should be avoided in Vertica. In both examples above, calling functions like LOWER in WHERE and JOIN clauses will affect the performance of the ETLs.

The solution for this scenario is that, since we control what goes into dimension tables, we can ensure that columns like country_name are always stored in lower-case. Then, we can do the same when creating the temporary table such as staging_table that we are comparing to for checking existence.

Use ANALYZE_STATISTICS

Make sure to run ANALYZE_STATISTICS after all data loads. Using this function, tables are analyzed for best performance in subsequent queries ran against it. Without information from ANALYZE_STATISTICS, the query optimizer assumes uniform distribution of data values and equal storage usage for all projections.

Note that ANALYZE_STATISTICS is only supported on local temporary tables, but not on global temporary tables. In addition, when we add ANALYZE_STATISTICS function calls into our ETL scripts, errors might be thrown when a second ANALYZE_STATISTICS call is made while the first is still running. Those errors can be ignored but they must be caught accordingly to separate with other Vertica error messages.

Avoid creating temporary tables using SELECT

Instead of creating temporary tables using SELECT, it is recommended to:

1. Create the temporary table first without a projection.
2. Create a super projection with the correct column encodings and ORDER BY clause
3. Populate it using INSERT /*+ direct */ INTO. Note the /*+ direct */ hint to write data directly to disk, bypassing memory.
4. Run ANALYZE_STATISTICS. See the last section.

For example, in a Vertica ETL script that runs daily, we usually create a temporary table to retrieve the latest records from a source table like this:

In this example, last_modify_date is the CDC column and customer_id is the primary key column. Although this SQL statement is simple and easy to understand, it is really slow for a large and growing stg_customer table that contains updates to all customers on multiple dates, with millions of new customer entries each day. Instead, the recommended coding pattern is to create a temporary table first without a projection:

It is also recommended that the column names are explicitly specified, so that only required columns are created in the temporary table. A LOCAL temporary table is created, instead of GLOBAL, so that we can use ANALYZE_STATISTICS functions as discussed above. Next, create a super projection with the correct column encodings and ORDER BY clause:

Finally, insert “directly” into the temporary table:

The WITH clause is just a more readable way to write the sub-query in the original SQL statement (see WITH clause). In addition, the wildcard * in the original SQL query is also avoided, in case the table stg_customer is a very wide table.