
Hive Performance Tips

Hive is one of the biggest gifts of the Hadoop ecosystem in recent times, and one that can make every RDBMS user smile. That said, working with Hive can be a frustrating experience if you're used to traditional RDBMS systems, because the statistics those systems rely on for query optimization are often not available.

The most important thing to do before working with a Hadoop system is to understand your data. If you know your data and what you want to access, you can leverage the full set of Hadoop performance optimization parameters described below to get better results.

Important links for reference:

  • Configuration Parameters
  • Various Hive Parameters
  • Hive Indexes
  • Hive Windowing and Analytics Functions
  • Analytical Queries in Hive
  • HiveServer2 Clients

Important Tips:

  • set -v : Show all settings.
Skew Joins

  • set hive.optimize.skewjoin = true;
  • set hive.skewjoin.key = <skew_key_threshold>;

Why: The join is bottlenecked on the reducer that receives the skewed key. Keys whose row count exceeds the threshold are handled in a separate map-side join instead.
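
As a minimal sketch (the sales and customers tables are hypothetical; 100000 is the Hive default threshold):

  set hive.optimize.skewjoin = true;
  set hive.skewjoin.key = 100000;  -- keys with more rows than this are treated as skewed

  -- Rows with heavily skewed customer_id values are set aside at runtime
  -- and joined in a follow-up map join instead of overloading one reducer.
  SELECT s.order_id, c.segment
  FROM sales s
  JOIN customers c ON (s.customer_id = c.customer_id);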

Sort Merge Bucket Map Join
  • set hive.optimize.bucketmapjoin = true;
  • set hive.optimize.bucketmapjoin.sortedmerge = true;
  • set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Why: No limit on file/partition/table size.

  1. Works together with bucket map join (see the sketch below).
  2. Bucket columns == join columns == sort columns.
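
A minimal sketch, assuming two hypothetical tables bucketed and sorted on the join column with the same bucket count:

  set hive.optimize.bucketmapjoin = true;
  set hive.optimize.bucketmapjoin.sortedmerge = true;
  set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

  CREATE TABLE clicks (user_id INT, url STRING)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

  CREATE TABLE profiles (user_id INT, country STRING)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

  -- Bucket column == join column == sort column, so matching buckets
  -- can be merge-joined directly in the mappers.
  SELECT c.url, p.country
  FROM clicks c JOIN profiles p ON (c.user_id = p.user_id);
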
Bucket Map Join

set hive.optimize.bucketmapjoin = true;

Why: Total table/partition size is big, not good for mapjoin.

  1. Works together with map join (see the sketch below).
  2. All join tables are bucketized, and the bucket counts of the tables are multiples of each other (typically the big table's bucket count is a multiple of each small table's).
  3. Bucket columns == join columns.
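
A minimal sketch, assuming a hypothetical large fact table and small dimension table; the big table's bucket count (32) is a multiple of the small table's (8):

  set hive.optimize.bucketmapjoin = true;

  CREATE TABLE fact_sales (item_id INT, amount DOUBLE)
  CLUSTERED BY (item_id) INTO 32 BUCKETS;

  CREATE TABLE dim_items (item_id INT, item_name STRING)
  CLUSTERED BY (item_id) INTO 8 BUCKETS;

  -- Each mapper loads only the matching dim_items buckets into memory.
  SELECT /*+ MAPJOIN(d) */ f.amount, d.item_name
  FROM fact_sales f JOIN dim_items d ON (f.item_id = d.item_id);
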
Prevent MapJoins for Large Tables

  • set hive.auto.convert.join=false;

Star Join Optimization

A simple schema for decision support systems or data warehouses is the star schema, where events are collected in large fact tables, while smaller supporting tables (dimensions) are used to describe the data.
Controlling the CombineHiveInputFormat Split Size

  • set mapred.max.split.size=268435456;
  • set mapred.min.split.size=<bytes>;
  • set mapreduce.input.fileinputformat.split.maxsize=<bytes>;
  • set mapreduce.input.fileinputformat.split.minsize=<bytes>;
  • set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
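
For example (268435456 bytes is 256 MB; the 128 MB minimum below is illustrative), small files are combined into splits within this range:

  set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  set mapreduce.input.fileinputformat.split.maxsize = 268435456;  -- 256 MB
  set mapreduce.input.fileinputformat.split.minsize = 134217728;  -- 128 MB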

Dynamic Partition Creation

  • set hive.exec.dynamic.partition.mode=nonstrict;
  • set hive.exec.max.dynamic.partitions=10000;
  • set hive.exec.max.dynamic.partitions.pernode=500;
  • set hive.auto.convert.join.noconditionaltask = true;
  • set hive.auto.convert.join.noconditionaltask.size = 10000;

Note: the last two settings govern automatic map-join conversion rather than partition creation, and the noconditionaltask size is specified in bytes.
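
A minimal sketch of a dynamic-partition insert, reusing the store_sales table from the Hive Stats section below (the staging table and column names are assumptions):

  set hive.exec.dynamic.partition = true;
  set hive.exec.dynamic.partition.mode = nonstrict;

  -- The dynamic partition column must come last in the SELECT list.
  INSERT OVERWRITE TABLE store_sales PARTITION (ss_sold_date)
  SELECT ss_item_sk, ss_net_paid, ss_sold_date
  FROM staging_store_sales;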

Control the Output Compression

  • set hive.exec.compress.output=true
  • set hive.exec.compress.intermediate=true
  • set io.sort.mb=400

The total amount of buffer memory to use while sorting files, in megabytes.

By default, gives each merge stream 1MB, which should minimize seeks.
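
A minimal sketch; the Snappy codec choice is an assumption, set via the Hadoop-side property:

  set hive.exec.compress.output = true;
  set hive.exec.compress.intermediate = true;
  set mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
  set io.sort.mb = 400;  -- sort buffer, in MB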

  • set hive.limit.pushdown.memory.usage=0.1f

This is used in ORDER BY ... LIMIT queries to push the LIMIT clause down into the sort.

select * from table order by key1 limit 10; would use 10% of memory to store a Top-K buffer, meaning the implementation keeps an in-memory ordered list of 10 rows and replaces or discards rows that fall outside the top 10. Top-K is worst case O(log(K)*n), while the unoptimized sort-plus-LIMIT is O(log(n)*n), a substantial performance gain when K is around 100 and n is a million or more.

  • set hive.optimize.correlation=true;

Enables the correlation optimizer, which merges correlated jobs (for example, a join and a group-by on the same key) into a single job.

Hive Stats

The items below require statistics for each table. Enable stats collection and ANALYZE the tables/partitions as shown.

  • set hive.stats.autogather=true;
  • set hive.stats.dbclass=fs;

ANALYZE TABLE store_sales PARTITION(ss_sold_date) COMPUTE STATISTICS PARTIALSCAN;

  1. While data is inserted:  set hive.stats.autogather = [true, **false**]
  2. This optimizes "select count(1) from foo;" to run in ~1 second:  set hive.compute.query.using.stats=true;
  3. This optimizes "select x from foo limit 10;" to run in under a second:  set hive.fetch.task.conversion=more;
  4. This optimizes "select x from foo where y = 10;" on ORC tables:  set hive.optimize.index.filter=true;
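
A minimal sketch of gathering these statistics, including the column statistics the CBO section below depends on (partition-level column stats require Hive 0.14+):

  set hive.stats.autogather = true;
  set hive.stats.dbclass = fs;

  ANALYZE TABLE store_sales PARTITION (ss_sold_date) COMPUTE STATISTICS;
  ANALYZE TABLE store_sales PARTITION (ss_sold_date) COMPUTE STATISTICS FOR COLUMNS;
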
Hive CBO

The cost-based optimizer relies on the table and column statistics gathered above, so run the ANALYZE commands first.

  • hive.compute.query.using.stats = [true, **false**];
  • hive.stats.fetch.column.stats = [true, **false**];
  • hive.stats.fetch.partition.stats = [true, **false**];
  • hive.cbo.enable = [true, **false**];

Hive Tuning
  • hive.optimize.sort.dynamic.partition = [ **true**, false ]

When enabled, rows in a dynamic-partition insert are sorted by the partition columns, so each reducer keeps only one file writer open at a time, reducing memory pressure.

Hive Server 2
  • hive.execution.engine = [tez, spark, mr]

Determines whether Hive queries are executed with Tez, Spark, or MapReduce.

  • hive.tez.container.size

The memory (in MB) to be used for Tez tasks. If this is not specified (-1), the memory setting from the MapReduce configuration (mapreduce.map.memory.mb) is used by default for map tasks.

  • hive.tez.java.opts

Java command-line options for Tez. If this is not specified, the MapReduce Java opts setting (mapreduce.map.java.opts) is used by default for map tasks.

  • hive.server2.tez.default.queues : A comma-separated list of YARN queues configured for the cluster.
  • hive.server2.tez.sessions.per.default.queue : The number of sessions for each queue named in hive.server2.tez.default.queues.
  • hive.server2.tez.initialize.default.sessions : Enables a user to use HiveServer2 without enabling Tez for HiveServer2. Users may want to run queries with Tez without a pool of sessions.
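
A minimal sketch for a Tez session (the sizes are illustrative; setting the heap to roughly 80% of the container is a common rule of thumb, not a value from this document):

  set hive.execution.engine = tez;
  set hive.tez.container.size = 4096;    -- MB
  set hive.tez.java.opts = -Xmx3276m;    -- ~80% of the container size
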
ORC File Tuning

  • hive.exec.orc.memory.pool

The maximum fraction of heap that can be used by ORC file writers. This can affect how stripes are written and the resulting stripe size.

set hive.exec.orc.write.format="0.11";

Set the specific ORC file version to write.
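
A minimal sketch of table-level ORC tuning (the table is hypothetical and the values illustrative; orc.compress and orc.stripe.size are standard ORC table properties):

  CREATE TABLE store_sales_orc
  STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY",
                 "orc.stripe.size" = "67108864")  -- 64 MB stripes
  AS SELECT * FROM store_sales;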
