PROVE IT !!If you know it then PROVE IT !! Skill Proficiency Test

“Join” ? – What happen’s inside Teradata explored !!

Need data from 1st table –          Good
Need data from 2nd table –          Good
Need data from both the table – JOIN

Below  is a simple diagram to illustrate how “JOIN” works.

There are multiple types of join,
· Left join
· Right join
· Cross join

Firstly we create a Cartesian product, then select based on conditions. In left join, we need to finally add back those records in the left table, whose ID is not found in the final table, and assign null values in fields from right table. Here is a figure to illustrate that. Right join is conducted in same we as left join, hence, I didn’t show it.

2

Ideally the Joining process is very clear but the way Teradata (MPP) system handles the join is normally not very clear for us. To get some basic idea about that will help the business user a lot in day to day activities.

A very important concept for Teradata is call “Access Module Processors (amps)”. Teradata distributes large size of data into hundreds or thousands of amps. Data are processed in individual amp, then merged back as a table. We can imagine one amp as an individual less powerful computer with its own CPU and disk. Each amp has its limitation of computing power and disk size. Hence if we dump a huge size of data into only one amp, it will take long time to execute and maybe run out of space for that amp. Since Teradata utilize this distributed structure to process data, even out of space in one amp will cause the entire process to be failed.
Teradata assign data to each amp based on primary index. Hence a well setup index is very important. Suppose you use gender as the primary index for a large table with equal number of male and female, when Teradata process this table, it will allocate all males to one amp and all females to the other amp. In this case, we only leverage 2 amps, and remain thousands of other amps free. The query may easily fail because those 2 amps cannot process such big volume of data.

In the “join” process, the amps allocation and primary index is also very important. Let’s consider three cases.

Case 1: Both table A and table B have same primary index (acct_id)

1
2
3
4
5
6
7
8
9
10
11
12
13
create multiset volatile table a as
(select acct_id, stmt_dt, cl_amt from source_table)
with data primary index (acct_id) on commit preserve rows;
 
create multiset volatile table b as
(select acct_id, stmt_dt, acct_age from source_table)
with data primary index (acct_id) on commit preserve rows;
 
create multiset volatile table c as
(select a.acct_id, a.stmt_dt, cl_amt, acct_age
from a join b
on a.acct_id=b.acct_id)
with data primary index (acct_id) on commit preserve rows;

Here is the actual process in Teradata for this join:

join1

From this figure, we can see in Teradata, if same primary index is specified, the join process is conducted in amp level. No additional space is needed.

Case 2: Table A has primary index(acct_id, and stmt_dt), while Table B has primary index(acct_id). We still join on a.acct_id=b.acct_id

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
create multiset volatile table a as
(select acct_id, stmt_dt, cl_amt from source_table)
with data primary index (acct_id, stmt_dt)
on commit preserve rows;
 
create multiset volatile table b as
(select acct_id, stmt_dt, acct_age from source_table)
with data primary index (acct_id)
on commit preserve rows;
 
create multiset volatile table c as
(select a.acct_id, a.stmt_dt, cl_amt, acct_age
from a join b
on a.acct_id=b.acct_id)
with data primary index (acct_id) on commit preserve rows;

Here is the actual process in Teradata:

join2

In the above figure, we can see the two table cannot join at amp level initially. Hence additional spool space and CPU is need to re-distributed table A.

Case 3: Table A has primary index(acct_id, and stmt_dt), while Table B has primary index(acct_id, stmt_dt). We still join on a.acct_id=b.acct_id

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
create multiset volatile table a as
(select acct_id, stmt_dt, cl_amt from source_table)
with data primary index (acct_id, stmt_dt)
on commit preserve rows;
 
create multiset volatile table b as
(select acct_id, stmt_dt, acct_age from source_table)
with data primary index (acct_id,stmt_dt)
on commit preserve rows;
 
create multiset volatile table c as
(select a.acct_id, a.stmt_dt, cl_amt, acct_age
from a join b
on a.acct_id=b.acct_id)
with data primary index (acct_id) on commit preserve rows;

Here is the actual process in Teradata:

join3

Here we can find that both Table A and Table B need re-distribution so that thy can join at amp level.

Further more, we can use Teradata diagnostics query to check the performance of those three cases:

table

We can see case 1 use least amps while case 3 use most amps. That is because by using acct_id and stmt_dt as primary index, the duplicated records are fewer. Hence more amps can be leverage. We can find out that case 1 use least CPU, time and Spool space because no re-distribution efforts are needed.

Add a Comment

Your email address will not be published. Required fields are marked *