PROVE IT !!If you know it then PROVE IT !! Skill Proficiency Test

Hive Query Optimization

DISTRIBUTE BY…SORT BY v. ORDER BY

In Hive, ORDER BY is not a very fast operation because it forces all the data to go into the same reducer node. By doing this, Hive ensures that the entire dataset is totally ordered.

However, sometimes we do not require total ordering. For example, suppose you have a table called user_action_table where each row has user_idaction, and time. Your goal is to order them by time per user_id.

If you are doing this with ORDER BY, you would run

SELECT time, user_id, action FROM user_action_table
ORDER BY user_id, time

However, you can achieve the same result with

SELECT time, user_id, action FROM user_action_table
DISTRIBUTE BY user_id SORT BY user_id, time

This is because all the rows belonging to the same user_id go to the same reducer (“DISTRIBUTE BY user_id”) and in each reducer, rows are sorted by time (“SORT BY time”). This is faster than the other query because it uses multiple reducers as opposed to a single reducer.

Avoid “SELECT count(DISTINCT field) FROM tbl”

This query looks familier to SQL users, but this query is very slow because only one reducer is used to process the request.

SELECT count(DISTINCT field) FROM tbl

So please rewrite the query like below to leverage multiple reducers.

SELECT
  count(1)
FROM (
  SELECT DISTINCT field FROM tbl
) t

Considering the Cardinality within GROUP BY

There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.

good: SELECT GROUP BY uid, gender
bad: SELECT GROUP BY gender, uid

Efficient Top-k Query Processing using each_top_k

Efficient processing of Top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. Our Hive extension each_top_k helps running Top-k processing efficiently.

  • Suppose the following table as the input
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
  • Then, list top-2 students for each class
student class score rank
3 a 90 1
2 a 80 2
1 b 70 1
6 b 60 2

The standard way using SQL window function would be as follows:

 SELECT student, class, score, rank FROM ( SELECT student, class, score, rank() over (PARTITION BY class ORDER BY score DESC) as rank FROM table ) t WHRE rank <= 2

An alternative and efficient way to compute top-k items using each_top_k is as follows:

 SELECT each_top_k( 2, class, score, class, student — output columns other in addition to rank and score ) as (rank, score, class, student) FROM ( SELECT * FROM table CLUSTER BY class — Mandatory for `each_top_k` ) t
`CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required when using `each_top_k`.
`each_top_k` is benefical where the number of grouping keys are large. If the number of grouping keys are not so large (e.g., less than 100), consider using `rank() over` instead.

The function signature of each_top_k is follows:

 each_top_k(int k, ANY group, double value, arg1, arg2, …, argN) returns a relation (int rank, double value, arg1, arg2, .., argN).

Any number types or timestamp are accepted for the t

Add a Comment

Your email address will not be published. Required fields are marked *