The RedTastic Outer Joins of Amazon Redshift
Below is a video of Nexus in action that proves it is best friends with Amazon Redshift. Watch how Nexus queries, migrates, and federates with Redshift and all other systems in your enterprise.
This blog will teach you everything you need to know about outer joins. We will start with the fundamental concepts of a two-table inner join and then move to the two-table left, right, and full outer joins. We will then take it to four table joins, where you will learn exactly how outer joins process, so there will be no doubt in your mind of how outer joins work. You will also learn the outer join terminology to make you the ultimate expert on outer joins.
We will begin with a quiz. See the example below, where you see two queries against the EMPLOYEE_TABLE and the DEPARTMENT_TABLE and their respective answer sets. If you decide to accept it, your mission is to figure out which three rows will not return on an inner join between the two tables based on the DEPT_NO column.
The three rows that will not return on an inner join will be Richard Smythe and Squiggy Jones from the EMPLOYEE_TABLE and Human Resources from the DEPARTMENT_TABLE. These rows are eliminated from an inner join because they do not have matching department numbers from both tables. For example, Richard Smythe is in department 10, but there is no department 10 in the DEPARTMENT_TABLE. Squiggy Jones has a null value for this department, so he is eliminated immediately. And finally, although we have a department 500 with the department name of Human Resources, we don't have anyone in the EMPLOYEE_TABLE working in that department.
I will now begin teaching you the fundamentals of outer joins, including the key terms to aid your understanding and expertise. In our examples below, we have an inner join query on the left query window, and on the right query window, we have a left outer join.
Notice that the inner join brings back seven rows, each having a matching department number. They match because the answer set shows the e_dept and d_dept columns side by side.
In the query on the right, we are doing an outer join, which can be left, right, or full outer. In our example, we are doing a left outer join. One of the most important things to get into your brain is the first table after the FROM clause is always the left table.
Therefore, since we are doing a left outer join and the first table is the EMPLOYEE_TABLE, we will return all rows from the EMPLOYEE_TABLE, including those that do not have a match. Please think of the EMPLOYEE_TABLE as the table we want to focus on, also referred to as the outer table. In a left outer join, the left table is the outer table. Therefore, the left table will return all rows in a left outer join.
Here is a better way to think of the processing of the join. The inner join happens to return seven rows, but then the system discovers there were two rows from the EMPLOYEE_TABLE that did not have a matching or valid department number. So the system places the columns in the SELECT list from the EMPLOYEE_TABLE into the answer set. But since there are no matching rows in the DEPARTMENT_TABLE, the only solution is to place null values in all columns in the SELECT list coming from the DEPARTMENT_TABLE.
Hey, Squiggy Jones and Richard Smythe appear in the answer set along with their salary and department numbers, but we have to put nulls in their d_dept and department_name columns because there is no match in those columns coming from the DEPARTMENT_TABLE. Notice the arrows I have placed to show the null values for columns coming from the DEPARTMENT_TABLE.
Our next example below shows a right outer join. Remember, the first table mentioned (EMPLOYEE_TABLE) is always the left table. The DEPARTMENT_TABLE is the right table, and since we are doing a right outer join, all rows will return from the DEPARTMENT_TABLE.
Please think of the DEPARTMENT_TABLE as the table we want to focus on, also referred to as the outer table. In a right outer join, the right table is the outer table. Therefore, the right table will return all rows in the right outer join.
Another way to think of this is to look at the keywords RIGHT OUTER JOIN. In the front or to the left of those words are the EMPLOYEE_TABLE. Following those keywords or to the right of them is the DEPARTMENT_TABLE.
Richard Smythe and Squiggy Jones don't return in the answer set, but we get the seven rows from the inner join and department 500, which has the name Human Resources. Notice now that we put null values in all SELECT list columns coming from the EMPLOYEE_TABLE.
Our next example presents the full outer join. A full outer join returns all rows from both tables, which include the inner join matches and any rows from either table without a matching department number. Both tables are considered outer tables.
Notice we have the same seven rows from our inner join example, plus we see Richard Smythe and Squiggy Jones, and department 500 and Human Resources.
You can see from the arrows in the example rows where null values were placed to show they have no matching department numbers.
You don't have to study the SQL in our next example yet, but visualize outer joins on four tables. Please notice that we are doing left outer joins only. Then move on to the next example and take the quiz.
Now that you have the fundamentals for left, right, and full outer joins, we will take you to the top of the outer join mountain. First, please tell me which tables are left and right, and then list the outer table(s). The rest of this lesson will be a pivotal moment in your career.
Please listen to me carefully and make a mental note. The first table after the FROM clause is the left table, and all other tables are right. There is only one left table in an outer join, and it is always the first table.
Secondly, since we start with a left outer join, and subscribers is the left table, that makes subscribers the outer table.
Most of the time, people feel comfortable will all left tables, but they don't understand exactly what is going on with multiple outer joins, but I will break it down for you.
Remember, the system will only join two entities at a time when you have a four-table join. So, we start by joining subscribers with claims and deliver the inner join plus any non-matching subscribers table rows into a temporary answer set. The temporary answer set will be the left table for the next join. So, we take the temporary answer set from the joining of subscribers and claims and do another left outer join to addresses.
In other words, we still want every subscriber to return even if the temporary answer set doesn't match with a row in the addresses table. Therefore, we have joined the temporary answer set with the addresses table to produce another temporary answer set. The new temporary answer set is the left table and joins to the table named providers.
Our query joins four tables, but we have guaranteed that all rows from subscribers return.
The superstars in this industry will understand the next example: your chance to become a superstar is coming soon. Notice we start with a right outer join, which I have highlighted and placed an arrow. Please tell me which tables are left or right and which table(s) is the outer table.
Yes, the subscribers table is the first table after the FROM clause, so it is the left table, but since we are doing a right outer join, the claims table is our outer table. So we want to see every row from the claims table.
The beginning joins don't always have to be left outer joins. However, once you see your first outer join, you will almost always see nothing but left outer joins to preserve the outer table.
However, sometimes you will see inner joins, right joins, or full outer joins in the beginning. In the case below, we begin joining with a right outer join, thus determining that our outer table named claims is the one we want to ensure delivers all rows.
The subscribers and claims tables join to produce a temporary answer set. The temporary answer set becomes the left table. We then join the temporary answer set to the addresses table, but we preserve all rows in the temporary answer set. We continue joining the temporary answer set with each subsequent table. In the end, we have guaranteed that all rows from claims are in the final answer set.
Have you ever heard of a federated query? A federated query is a single query involving tables that reside on different systems. Our example below joins Redshift, Vertica, Yellowbrick, DB2, and Azure Synapse tables in a single query.
Notice on the upper right side of the example; we have the word Hub System and Vertica, which is the chosen database. Nexus will convert and move the Yellowbrick, DB2, Redshift, and Synapse tables to the Vertica system, where the join will process.
If the user changes the hub to Redshift, for example, then Nexus will convert and move the Vertica, Yellowbrick, DB2, and Synapse tables to Redshift, where the join will process. Nexus even allows you to choose your PC as the hub, where it will simultaneously query all four tables, bring them back to your PC, and join them inside your memory.
Below you can see the federated query running on the Vertica system and the answer set. The future of data is federated queries joining tables from multiple systems. We have worked on Nexus for 18 years and know how to make them work.
Tom Coffing has also led a team of developers for almost 20 years to create the Nexus, which queries, migrates, and joins data across all systems. Amazingly, Nexus migrates between all the databases below with the click of the mouse.
• Microsoft Access
• SQL Server
• SAP HANA
The picture below shows Nexus migrating from Teradata to Redshift. Some customers migrate thousands of tables to the Amazon cloud and Redshift in a single job.
At what age should you teach your kids about data? Check out this family's answer.