Table statistics in PostgreSQL. Learn to review them

Hello again. Today we wanted to tell you how to check the statistics of a table in PostgreSQL .

To generate a good plan for the queries, the query planner relies on the available statistical information about the contents of the tables. Therefore, it is essential to ensure that the statistics are accurate and up-to-date. Out-of-date statistics will generate a poor query plan, further degrading database performance. To see how the statistics of each table are stored, we are going to create an example table and fill it with a million rows and two columns.

Setting the stage to analyze statistics

In the first column we have a sequential number from 1 to 1,000,000 and in the second a fixed value, ‘hello’ in this case:

 test = # CREATE TABLE t_test AS SELECT *, 'hello' :: text AS name
 FROM generate_series (1, 1000000) AS id;
 SELECT 1000000 

To have greater reliability, 100% of the records in the table can be analyzed, in this case we have only put 10%, which should be enough:

 test = # ALTER TABLE t_test
 test = # ANALYZE;

If we show the execution plan for the values that are less than 150000 it will show us the following:

 test = # explain SELECT * FROM t_test WHERE id <150000; 
 Seq Scan on t_test (cost = 0.00..17906.00 rows = 145969 width = 9)
 Filter: (id <150000)
 (2 rows) 

What we see here is that PostgreSQL expects sequential scan would return 145,000 rows. This information is very important because if the system knows what to expect, you can adjust your strategy accordingly (index, no index, etc.).

But how can we see this information?

Understanding pg_stats

The answer is sight pg_stat . The content of the view is:

 test = # \ d pg_stats
 View "pg_catalog.pg_stats"
 Column | Type | Collation | Nullable | Default
 schemaname | name | | |
 tablename | name | | |
 attname | name | | |
 inherited | boolean | | |
 null_frac | real | | |
 avg_width | integer | | |
 n_distinct | real | | |
 most_common_vals | anyarray | | |
 most_common_freqs | real [] | | |
 histogram_bounds | anyarray | | |
 correlation | real | | |
 most_common_elems | anyarray | | |
 most_common_elem_freqs | real [] | | |
 elem_count_histogram | real [] | | | 

Let’s go through the output and discuss what kind of data the scheduler can use:

  • schemaname + tablename + attname : For each column of each table in each schema, PostgreSQL will store one line of data.
  • inherited : Are we seeing a legacy / partitioned table or not?
  • null_frac : What percentage of the column contains NULL values? This is important if we use queries like “WHERE col IS NULL” or “WHERE col IS NOT NULL”
  • avg_width : What is the expected average width of the column?
  • n_distinct : expected number of different entries in the column
  • most_common_vals : We have more precise information for the values that occur more frequently. This is especially important if the table entries are not evenly distributed.
  • most_common_freqs : What is the frequency of these most common values? PostgreSQL stores a percentage value here. For example: “Male” is a frequent entry and 54.32% of the entries are male.
  • histogram_bounds : PostgreSQL uses a histogram to store the data distribution. In case the statistics target is 100, the database will store 101 entries to indicate the limits within the data.
  • correlation : The optimizer also wants to know something about the physical order of the data on the disk. It makes a difference whether the data is stored in order (1, 2, 3, 4, 5, 6,…) or randomly (6, 1, 2, 3, 5, 4,…). In case we are looking for ranges, fewer blocks are needed to read ordered data.

Statistics analysis of our table with pg_stats

If we look for our table in the view, what we get is:

PostgreSQL statistics analysis with pg_stats

In this listing, you can see what PostgreSQL knows about our table.

In the column “id” , the part of the histogram is the most important: “ {47,102906,205351,301006,402747,503156,603102,700866,802387,901069,999982} “. PostgreSQL thinks the smallest value is 47. 10% is less than 102906, 20% is expected to be less than 205351, and so on. What is also interesting here is n_distinct: -1 basically means that all values are different . This is important if we are using GROUP BY. In the case of GROUP BY, the optimizer wants to know how many groups to expect. n_distinct is used in many cases to provide us with that estimate.

In the case of the column “name” we can see that “hello” is the most frequent value (100%). So we don’t get a histogram.

Of course, there is a lot more to say about how the PostgreSQL optimizer works. However, to begin with, it is very helpful to have a basic understanding of how PostgreSQL uses statistics.

We hope you like this post and in subsequent posts we will delve into the way this wonderful manager works. Don’t miss the next one, subscribing to our newsletter . With a single email a month you will be informed of all our entries.

Still do not know Query Performance ? Find out how it can help you in your Oracle environment. More information on their page LinkedIn .

Follow GPS on LinkedIn

Regards and until next time.

Leave a Reply

Your email address will not be published. Required fields are marked *