Skewness Unleashed !!
|SKEWNESS – IN A NUTSHELL
Teradata is a massive parallel system, where uniform units (AMPs) do the same tasks on that data parcel they are responsible for. In an ideal world all AMPs share the work equally, no one must work more than the average. The reality is far more cold, it is a rare situation when this equality (called "even distribution") exists.
It is obvious that uneven distribution will cause wrong efficiency of using the parallel infrastructure.
But how bad is the situation? Exactly that is what Skewness characterizes.
Definitions
Let "RESOURCE" mean the amount of resource (CPU, I/O, PERM space) consumed by an AMP.
Let AMPno is the number of AMPs in the Teradata system.
Skew factor := 100 – ( AVG ( "RESOURCE" ) / NULLIFZERO ( MAX ("RESOURCE") ) * 100 )
Total[Resource] := SUM("RESOURCE")
Impact[Resource] := MAX("RESOURCE") * AMPno
Parallel Efficiency := Total[Resource] / Impact[Resource] * 100 or with some transformation:
Parallel Efficiency := 100 – Skew factor
ANALYSIS
Codomain
0 <= "Skew factor" < 100
"Total[Resource]” <= "Impact[Resource]"
0<"Parallel Efficiency"<=100
Meaning
Skew factor : This percent of the consumed real resources are wasted
Eg. an 1Gbytes table with skew factor of 75 will allocate 4Gbytes*
Total[Resource] :Virtual resource consumption, single sum of individual resource consumptions , measured on AMPs as independent systems
Impact[Resource] :Real resource consumption impacted on the parallel infrastructure
Parallel Efficiency : As it says. Eg. Skew=80: 20%
* Theoretically if there is/are complementary characteristics resource allocation (consumes that less resources on that AMP where my load has excess) that can compensate the parallel inefficiency from system point of view, but the probability of it tends to zero.
The "Average" level indicates the mathematical sum of AMP level resource consumptions (Total[Resource]), while “Peak” is the real consumption from “parallel system view” (Impact[Resource])
REASONS OF SKEWNESS
- Skewed tables: Bad choice of PI, Skewed data
- Bad execution plans (typically skewed redistributions)
- Bad data model (normalization,data types,PI, etc.)
- Missing or stale statistics
- Too many joins (break up the query!)
- Hash collision (load time problem)
TERADATA SYSTEM LEVEL SKEWNESS
We've found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?
The answer is: yes, we can estimate quite closely. The exact value we cannot calculate because DBQL does not log AMP information for the query execution, but the most important metrics.
We can not calculate that situation when more skewed queries run the same time, but have peaks on different AMPs. This reduces the system level resource wasting, but is hard to calculate with, however its probability and effect is negligible now.
select
sum(AMPCPUTime) AMPCPUTimeSum
, sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
, sum(TotalIOCount) TotalIOCountSum
, sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
, cast(100 – (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
, cast(100 – (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
dbql_arch.dbqlogtbl_hst where logdate = '2013-12-18' (date)
and (ampcputime>0 or TotalIOCount > 0)
*/
/* For online DBQL*/
dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and (ampcputime>0 or TotalIOCount > 0)
Look at the last two columns. That percent of your CPU and I/O goes to the sink…
Top bad guys
OK, let's check how many queries accumulate 5%,10%,25%,50%,75%,90% of this loss?
Here you are (CPU version, transform for I/O implicitly):
select 'How many queries?' as "_",min(limit5) "TOP5%Loss",min(limit10) "TOP10%Loss",min(limit25) "TOP25%Loss",min(limit50) "TOP50%Loss",min(limit75) "TOP75%Loss",min(limit90) "TOP90%Loss", max(rnk) TotalQueries, sum(ResourceTotal) "TotalResource", sum(ResourceImpact) "ImpactResource"
from
(
select
case when ResRatio < 5.00 then null else rnk end limit5
,case when ResRatio < 10.00 then null else rnk end limit10
,case when ResRatio < 25.00 then null else rnk end limit25
,case when ResRatio < 50.00 then null else rnk end limit50
,case when ResRatio < 75.00 then null else rnk end limit75
,case when ResRatio < 90.00 then null else rnk end limit90
,rnk
, ResourceTotal
, ResourceImpact
from
(
select
sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
, sum(ResourceLoss) over (order by ResourceLoss desc rows unbounded preceding) subtotalRes
, subtotalRes *100.00 / totalRes Resratio
, sum(1) over (order by ResourceLoss desc rows unbounded preceding) rnk
, ResourceTotal
, ResourceImpact
from
(
select
AMPCPUTime ResourceTotal
, (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
, ResourceImpact – ResourceTotal ResourceLoss
/* For archived DBQL
from dbql_arch.dbqlogtbl_hst where logdate=1131207
and ampcputime>0
*/
/* For online DBQL*/
from dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and ampcputime>0
) x
) y
) z
group by 1