Automated Provisioning of CDH in the Cloud with Cloudera Director and Ansible

This is a guest blog post from Jasper Pult, Technology Consultant at Lufthansa Industry Solutions, an international IT consultancy covering all aspects of Big Data, IoT and Cloud. The work below was implemented using Director’s API v9, and certain API details might change in future versions.

Cloud computing is quickly replacing traditional on-premises solutions in all kinds of industries. With Apache Hadoop workloads often varying in resource requirements over time, it’s no surprise that big data engineers have started leveraging the benefits of dynamic scalability in the cloud as well.

Ensuring that a suitably sized cluster is always available when needed, while incurring no costs when it is not, usually requires either regularly scaling clusters out and down or provisioning and tearing down entire clusters. At Lufthansa Industry Solutions, we have automated these operations using Cloudera Director and its REST API, with Ansible as our Infrastructure as Code tool of choice. We are using the same approach for Lufthansa Technik’s AVIATAR, the new innovative platform for the aviation industry.

Cloudera Director

Cloudera Director is a lifecycle management tool for Cloudera clusters in the cloud. It is cloud service agnostic, so clusters can be spun up on Amazon Web Services, Microsoft Azure, or Google Cloud Platform all from the same Cloudera Director instance. The browser UI allows us to provision, monitor, alter, and delete clusters based on custom templates.

While Cloudera Altus is a platform-as-a-service offering streamlined for certain use cases, Cloudera Director gives us complete control over our cluster configuration. The trade-off is that Cloudera Director requires more manual input to manage and operate than Cloudera Altus does. You can learn more about Cloudera Altus on the Cloudera web page. For the remainder of this blog post, we will focus on provisioning Cloudera clusters using Cloudera Director.

Cloudera Director UI

The Director UI is useful for exploration and development, but in production users typically don’t want to click through it every time a cluster needs to be provisioned or scaled down, and therefore prefer the REST API.

The REST API comes with Swagger documentation, which can be found at http://director-server-hostname:7189/api-console. It allows operations such as creating, altering, and deleting clusters, storing instance templates, and adding database connections.

Cloudera Director API


Instance and cluster templates are submitted in JSON format. A few useful reference templates are provided on GitHub. One of these templates can be a good starting point and can be customized as required.

For more advanced configurations, a good trick is to make the desired adjustments manually on a running cluster. Afterwards, the cluster template can be exported through the API to see what certain parameters translate to in the JSON templates.


Ansible

Ansible is an Infrastructure as Code tool, similar to Chef, Puppet or Terraform. One of its benefits is its agentless architecture: it does not require its own daemons to be running on the machines it connects to and only needs SSH and Python to be installed. This imposes minimal dependencies on the environment and reduces network overhead, because configuration information is pushed to the nodes rather than having the nodes constantly poll for updates.

Ansible can be executed from the command line on any client machine with sufficient connectivity and permissions to reach the hosts that need to be configured by the playbook. If you want to further automate your setup and make it easier to trigger by end users, you might consider Red Hat’s Ansible Tower or open source alternatives like Semaphore, Tensor or others. Most of them offer features like a REST API, browser GUI or scheduling of playbooks.

The automation we are discussing in this post could also be achieved with other automation tools, depending on what you feel most comfortable with or what fits your use case best.

Letting Ansible Play With Cloudera Director

Ansible and Cloudera Director can work together as illustrated in the figure below. Ansible playbooks assemble JSON templates and submit them to Cloudera Director via its REST API. Cloudera Director then provisions clusters in the cloud.


For the purpose of this example we will assume that there is a running Cloudera Director instance. Based on its address, we will have the following endpoints:

We’ll also assume we already have an environment named lhind-environment set up on our Director instance.
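As a rough sketch, endpoint URLs of this kind could be derived from a base address as shown below. The host and port are taken from the API console URL mentioned earlier. The path layout (`/api/v9/environments/...`) and both helper function names are assumptions made for illustration, so please verify the actual paths in the Swagger console of your own Director instance.

```python
from urllib.parse import quote

# Assumed values: host and port mirror the API console URL mentioned earlier.
# The path layout below is an assumption, not taken from the original post.
DIRECTOR_BASE_URL = "http://director-server-hostname:7189"
API_VERSION = "v9"


def deployments_endpoint(environment: str) -> str:
    """Return the (assumed) URL for the deployments of one environment."""
    env = quote(environment, safe="")
    return f"{DIRECTOR_BASE_URL}/api/{API_VERSION}/environments/{env}/deployments"


def clusters_endpoint(environment: str, deployment: str) -> str:
    """Return the (assumed) URL for the clusters inside one deployment."""
    dep = quote(deployment, safe="")
    return f"{deployments_endpoint(environment)}/{dep}/clusters"


print(deployments_endpoint("lhind-environment"))
```

Quoting the path segments keeps the URLs valid even if an environment or deployment name contains spaces or slashes.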

1. Creating a Deployment (aka Cloudera Manager Instance)

Before we can provision a cluster, we need a Cloudera Manager instance to deploy it into, just like in an on-premises environment. In Cloudera Director, Cloudera Manager instances are called deployments. To create a deployment, we issue a POST request to the endpoint, passing it a JSON model like the following:


A few things to look out for:

  • managerVirtualInstance: This block describes the virtual machine as requested from the respective cloud provider. This example is valid for an Azure VM – if we were using AWS or GCP, it would look slightly different.
  • UUID: The virtual instance’s ID needs to be random and unique for every deployment instance that’s provisioned through Director. We’ll therefore generate it automatically and insert it into the template with Ansible before submitting it.
  • Scripts: Both the bootstrap script and the post create scripts are useful if there’s any custom configuration you’d like to perform on your Cloudera Manager VM before or after deploying the Cloudera Manager software itself. They’re redacted here for better readability, but default bootstrap scripts can be found on GitHub.

[Playbook] Iteration 1: Submitting a JSON Template

The template above is all we need to submit to Cloudera Director to set up a deployment instance. To automate this with Ansible, we’ll make use of the uri module. The first iteration of our playbook looks like the following:
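The Python sketch below is not the playbook itself. It only illustrates, under stated assumptions, the kind of HTTP request that such a uri task would send. The URL path, the use of HTTP basic authentication, the credentials, and the one-field template are all hypothetical placeholders. The request is built but deliberately not sent.

```python
import base64
import json
import urllib.request


def build_deployment_request(url, template, user, password):
    """Build (but do not send) a POST request carrying a JSON template."""
    body = json.dumps(template).encode("utf-8")
    # Basic authentication is an assumption; check how your Director is secured.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical values for illustration only.
request = build_deployment_request(
    "http://director-server-hostname:7189/api/v9/environments/lhind-environment/deployments",
    {"name": "lhind-deployment"},
    "admin",
    "not-a-real-password",
)
print(request.get_method(), request.full_url)
# Actually submitting it would be: urllib.request.urlopen(request)
```

Keeping the request construction separate from the sending step makes it easy to inspect or log the request (minus credentials) before anything is provisioned.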


We could now run the playbook with ansible-playbook create_deployment.yml, which would trigger the provisioning of our deployment instance.

[Playbook] Iteration 2: Extracting Variables

Next, to make things more generic, we’ll extract all parameters to variables and store them in a separate vars file.


The playbook now looks like this:


[Playbook] Iteration 3: Encrypting Sensitive Information With ansible-vault

Of course, we don’t want to store things like the Cloudera Director password or other sensitive information in plain text in a vars file. This is where the Ansible Vault comes into play. We’ll move directorPassword to a second vars file:


To encrypt this file we’ll use the ansible-vault command: ansible-vault encrypt vars/vault.yml, which will prompt us for a password that can be used to decrypt it in the future. The vault file has to be included in our playbook:

Now when running the playbook we’ll need the --ask-vault-pass flag, which will prompt us for the password we set above:

[Playbook] Iteration 4: Assembling the Template at Run Time

I mentioned above that every deployment instance needs a UUID. If we want to be able to automatically generate one and insert it into the template every time we run the playbook, we’ll have to find a way to dynamically assemble the JSON template at playbook run time. There are multiple ways of achieving this, but one easy way is to insert keyword strings into the template file, which can be replaced with a find/replace in the playbook. Doing this will also allow us to extract other variables from the template and make it more generic. To give you an example, we’ll extract a few sensitive parameters like the Kerberos admin password as well as the UUID from the template:


Now all we have to do in our playbook is replace the keywords ***VIRTUAL-INSTANCE-ID***, ***KERBEROS-ADMIN*** and ***KERBEROS-PASSWORD*** with their respective values.

We’ll encrypt the Kerberos password in our Ansible Vault and use uuidgen to generate a random ID. We’ll also set no_log to true, so that the sensitive parameters don’t get logged when they’re inserted.
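To make the find/replace idea concrete, here is a minimal Python sketch of the same technique. It is an illustration, not the playbook from this post. The three keywords and `managerVirtualInstance` come from the text above, while the other field names, the sample values, and the `render_template` helper are hypothetical. Python's `uuid.uuid4()` stands in for `uuidgen`.

```python
import json
import re
import uuid

# Hypothetical, heavily shortened template. Only the ***KEYWORD*** markers and
# managerVirtualInstance come from the post; other field names are illustrative.
TEMPLATE_TEXT = """
{
  "managerVirtualInstance": {"id": "***VIRTUAL-INSTANCE-ID***"},
  "krbAdminUsername": "***KERBEROS-ADMIN***",
  "krbAdminPassword": "***KERBEROS-PASSWORD***"
}
"""

PLACEHOLDER = re.compile(r"\*\*\*([A-Z-]+)\*\*\*")


def render_template(text, values):
    """Replace every ***KEYWORD*** in one pass and parse the result as JSON."""
    def substitute(match):
        # A KeyError here means the template uses a keyword we have no value for.
        value = values[match.group(1)]
        # json.dumps escapes quotes and backslashes; the outer quotes are
        # stripped because the keyword already sits inside a JSON string.
        return json.dumps(value)[1:-1]

    return json.loads(PLACEHOLDER.sub(substitute, text))


deployment = render_template(TEMPLATE_TEXT, {
    "VIRTUAL-INSTANCE-ID": str(uuid.uuid4()),
    "KERBEROS-ADMIN": "admin/admin@EXAMPLE.COM",
    "KERBEROS-PASSWORD": 'pa"ss\\word',  # made-up secret with awkward characters
})
# Print only the keys, never the secret values (the analogue of no_log).
print(sorted(deployment))
```

Escaping the inserted values matters: a password containing a quote or backslash would otherwise produce invalid JSON after a plain string replacement.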



The same pattern can, of course, be applied to many more parameters in the template. This was just an example to give you an idea of how it can work.

[Playbook] Further Steps:

Some further optimizations you might want to take into account:

  • Use Ansible Roles to split the work into reusable units and give your playbook some structure, especially as it grows larger.
  • Make the template and playbook more generic. For example, allow for different templates to be run from the same playbook or make the basic JSON structure configurable entirely through variables.

2. Provisioning a Cluster Within the Deployment

Now that we have a deployment, we’re ready to spin up a cluster on top of it. The JSON template for this is slightly more complex than for a deployment, but we can make use of the same mechanisms to assemble and submit it with Ansible as we did before.

Skipping a few of the iterations we did above, the template might look like the following:


The main difference compared to the deployment template is the set of virtual instance groups, one per cluster node type, with the respective service roles assigned to each type.

What’s missing here are the virtualInstances lists describing the VM specifications for each group. The API expects one virtualInstances object for each node in a group. So if, for example, we want three data nodes, we’ll have to submit a list of three identical instance objects. Each instance will again need a UUID. So to avoid code duplication and generate a UUID for each instance, we’ll assemble the instance list using a simple shell script, taking a single instance template and the number of replications as inputs:
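The same idea can be sketched in Python instead of shell. This is an illustrative stand-in rather than the script used in this post: the `replicate_instances` helper and the fields of the sample instance template are hypothetical, and `uuid.uuid4()` plays the role of a UUID generator.

```python
import copy
import json
import uuid


def replicate_instances(instance_template, count):
    """Return `count` deep copies of the template, each with its own UUID."""
    instances = []
    for _ in range(count):
        # Deep copy so that nested settings are not shared between instances.
        instance = copy.deepcopy(instance_template)
        instance["id"] = str(uuid.uuid4())
        instances.append(instance)
    return instances


# Hypothetical single-instance template for a data node.
data_node = {"id": None, "template": {"name": "datanode-template"}}

virtual_instances = replicate_instances(data_node, 3)
print(json.dumps(virtual_instances, indent=2))
```

The input template is left untouched, so the same file can be reused for any number of groups or runs.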

A single instance template is stored in a separate file (this example is for a data node; slightly different versions will be needed for the edge and master node templates):


Finally we can put everything together in a playbook:
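As a Python illustration of the assembly step only (not the playbook itself), the sketch below fills each instance group of a cluster template with the requested number of instances before the template would be submitted. The key `virtualInstances` comes from the text above. The `virtualInstanceGroups` key, the group names, the sample templates, and the `fill_instance_groups` helper are assumptions for illustration.

```python
import copy
import uuid


def fill_instance_groups(cluster_template, instance_templates, node_counts):
    """Return a copy of the cluster template with virtualInstances per group."""
    cluster = copy.deepcopy(cluster_template)
    for group_name, group in cluster["virtualInstanceGroups"].items():
        group["virtualInstances"] = [
            # One fresh copy and one fresh UUID per node in the group.
            {**copy.deepcopy(instance_templates[group_name]), "id": str(uuid.uuid4())}
            for _ in range(node_counts[group_name])
        ]
    return cluster


# Hypothetical inputs: one group per node type, with made-up template names.
cluster_template = {
    "name": "lhind-cluster",
    "virtualInstanceGroups": {
        "masters": {"name": "masters"},
        "workers": {"name": "workers"},
    },
}
instance_templates = {
    "masters": {"template": {"name": "master-template"}},
    "workers": {"template": {"name": "datanode-template"}},
}

cluster = fill_instance_groups(
    cluster_template, instance_templates, {"masters": 1, "workers": 3}
)
for name, group in cluster["virtualInstanceGroups"].items():
    print(name, len(group["virtualInstances"]))
```

The resulting dictionary could then be serialized to JSON and posted to the clusters endpoint in the same way as the deployment template.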


Again, these examples are fairly hard-coded to get the overall idea across. There are endless possibilities to extract variables and make them more generic. For more advanced setups like a high availability configuration, see GitHub.

Next Steps

Once we have a running cluster, we’ll soon want to scale it down or out depending on the resources required by our dynamic workload. This is also supported by Cloudera Director and can therefore be automated. Hopefully, this post gives you a good idea of how to get started with automating your CDH cloud setup. We look forward to discussing optimizations or hearing about your own approaches, so feel free to get in touch in the comments below.

Useful Resources

Note: Learn more about Ansible here. Cloudera does not distribute or support Ansible.

