Redshift - Why and When to Vacuum
Updated: Sep 16, 2022
This blog will show you how Redshift writes data in a columnar format to disk, and you will soon see when and why there is a need to Vacuum.
The real surprise is coming in the next couple of pages, but I want you to understand a couple of concepts. First, remember that Redshift places each column in a separate data block. A new block appends for additional data when the data block reaches 1 MB on a slice. So, if you have a sortkey in the table CREATE statement and load all the data in one load, it is sorted ideally. In our example below, we have sorted the data by EmpNo.
Quiz – Place the New Batch of Rows Inside the Table
Your mission is to place the new rows into the diagram where Redshift will place them. You can move the rows around any way you think it works, but ensure you only have two entries in each column before appending a new data block. Remember that you will get closer to 250,000 entries in each block in reality, but this is merely to explain a concept.
Answer – Place the New Batch of Rows Inside the Table
Amazon Redshift appends data to blocks. The data orders perfectly when you initially load data and have a sortkey in the table definition. However, the overall sorting is no longer perfect when you load more data the following day. So the DBA or yourself will do a Vacuum to get back to perfect ordering. We will see the results of the Vaccum in a little while.
The Same is True If There Was Only One Data Block
Amazon Redshift appends data to blocks. When you initially load data and have a sortkey in the table definition, the data orders perfectly. However, when you load more data the following day, the overall sorting is no longer perfect.
Quiz - What Happens When Redshift Updates a Row?
Your mission is to decide how the blocks above will change when the update statement happens in the example above. You better call Sal before he becomes Hal.
Answer - What Happens When Redshift Updates a Row?
Amazon Redshift updates rows by marking them as deleted, although they still physically reside in their block. Redshift inserts a new row appended at the end of the last block, or in this case, the current block, because there is only one block per column.
Quiz - What Happens When Redshift Deletes a Row?
What happens above to the row when the DELETE statement performs?
Answer - What Happens When Redshift Deletes a Row?
Amazon Redshift deletes rows by marking them as deleted. Redshift assigns a DELETE XID for the row that we will cover later, but for now, know that all rows deleted are marked as deleted, but they still physically reside in their block. Deletes take up space in the block and can hurt performance. A Vacuum will fix the issue.
Quiz - What Happens When a Table Gets a Vacuum?
How will the data look after the Emp_Tbl above goes through a Vacuum?
Answer - What Happens When a Table Gets a Vacuum?
After a vacuum on a table, all rows marked deleted are physically deleted. The table is sorted perfectly from the first row to the last (if the table definition has a sortkey).
What is a Vacuum?
Amazon Redshift doesn't automatically reclaim and reuse space when you delete or update rows. These rows are logically deleted but not physically deleted (until you run a vacuum). To perform an update, Amazon Redshift deletes the original row and appends the updated row, so every update is effectively a delete followed by an insert. When you delete, the rows are marked for deletion but not removed. A vacuum physically deletes deleted rows.
When is a Good Time to Vacuum?
A vacuum can be time-consuming and intensive. That is why the above advice is needed. Vacuum wisely. You can run the vacuum command to eliminate the logically deleted rows and resort to the table 100% perfectly. When about 10% of the table has changed over time, running both the Vacuum and Analyze commands is a good practice. Like Groucho Marx has stated, "If data processing slows down and users get Groucho, hit your marks and make it fly after a vacuum."
The VACUUM Command Grooms a Table
When tables with a sortkey are loaded initially, the rows are in perfect sorted order. However, additional rows in subsequent loads append, so the sort is no longer perfect. Finally, a vacuum grooms a table, which means it sorts it perfectly again. Unfortunately, a vacuum will permanently delete rows with a logical delete marking. Therefore, maintaining proper vacuum schedules is essential, and you dictate these schedules by how often a table is loaded or rows are updated or deleted.
If you want Tom Coffing, better known as Tera-Tom because of the thousand classes he has taught around the world, to teach your team an Amazon Redshift Architecture and SQL class, please contact Tom at Tom.Coffing@CoffingDW.com. Check out the website to see big data classes and their agenda (customizable by the customer). https://www.nexusdataserver.com/services-9.
If you need to migrate data to Amazon Redshift, the Nexus Server is the perfect tool to automate the process. You can automatically migrate from any database platform to Amazon Redshift, and Nexus does it for you. Check out this YouTube video to see the Nexus Server migrate data. https://www.youtube.com/watch?v=_9ABivdwJPk.
Look no further if you are looking for the best query tool for Amazon Redshift. The Nexus Query Chameleon is the most amazing tool on the market. And the best part is that Nexus works on all databases and even joins data across different systems with its Federated Super Join Builder, which writes the SQL for you. Check out this video of Nexus and prepare yourself to be amazed. https://www.youtube.com/watch?v=o7YGXfxrJOM&t=24s.