This post lists some tips and tricks that I learnt when working with Vertica database.
General Tips and Tricks
If you want to write data directly to disk and bypass memory, then you should include
/*+ direct */as a “hint” in your
INSERTstatement. This is especially helpful when you are loading data from big files into Vertica. If you don’t use
/*+ direct */, then
INSERTstatement first uses memory, which may be more useful when you want to optimally do inserts and run queries.
COMMITin your SQL statements when you are creating or updating Vertica schemas, because there is NO auto commit in Vertica.
If you are copying a table, DO NOT use
CREATE TABLE copy AS SELECT * FROM source. This will give you a copy table with default projections and storage policy. Instead, you should use
CREATE TABLEstatement with the
LIKE existing_tableclause and use
INSERT /*+ direct */statement. Creating a table with
LIKEoption replicates the table definition and storage policy associated with the source table, which can make a significant difference in data loading performance. Note that the
LIKEclause does not work if the existing source table is a temporary table.
- Before making a copy of a table, be sure to consider alternatives in order to execute optimal queries: create views, rewrite queries, use sub-queries, limit queries to only a subset of data for analysis.
Avoid joining large tables (e.g., > 50M records). Run a
count(*)on tables before joining and use
MERGE JOINto optimally join tables. When you use smaller subsets of data, the Vertica Optimizer will pick the
MERGE JOINalgorithm instead of the
HASH JOINone, which is less optimal.
When an approximate value will be enough, Vertica offers an alternative to
APPROXIMATE_COUNT_DISTINCT. This function is recommended when you have a large data set and you do not require an exact count of distinct values: e.g., sanity checks that verify the tables are populated. According to this documentation, you can get much better performance than
COUNT(DISTINCT). Here is an example of the
APPROXIMATE_COUNT_DISTINCTsyntax that you should use.
UPDATE & DELETE
Deletes and updates take exclusive locks on the table. Hence, only one
UPDATEtransaction on that table can be in progress at a time and only when no
INSERTsare in progress. Deletes and updates on different tables can be run concurrently.
Try to avoid
UPDATEas much as you can, especially on shared Vertica databases. Instead, it may work better to move the data you want to update to a new temporary table, work on that copy, drop the original table, and rename the temporary table with the original table name. For example:
1 2 3 4
- Delete from tables marks rows with delete vectors and stores them so data can be rolled back to a previous epoch. The data must be eventually purged before the database can reclaim disk space.
A query plan is a sequence of step-like paths that the HP Vertica cost-based query optimizer selects to access or alter information in your HP Vertica database. You can get information about query plans by prefixing the SQL query with the
1 2 3
The output from a query plan is presented in a tree-like structure, where each step path represents a single operation in the database that the optimizer uses for its execution strategy. The following example output is based on the previous query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
If you want to understand the details of the query plan, observe the real-time flow of data through the plan to identify possible query bottlenecks, you can:
- query the V_MONITOR.QUERY_PLAN_PROFILES system table.
- review Profiling Query Plans.
- use PROFILE statement to view further detailed analysis of your query.