Introduction
Although Mahout's MapReduce support on Hadoop has been phased out since 2014, it is still sometimes needed for big data work, research, or study. Ansible is a scripting tool that can automate the setup of a reproducible environment for such a workload reliably and efficiently. This post serves two objectives: to give you a basic walkthrough of using Ansible, and to provide you with my downloadable tool for setting up the software environment mentioned above for big data analysis.
Background
Ansible is a scripting tool for automating IT tasks. It is popular for reproducing a particular environment for programs to run because of its idempotency property: no matter how many times you execute an Ansible script, you end up with the same result. This also makes a script safe to re-run from the start if it fails or errors out partway through. Mahout is a big data analysis tool usually executed on top of Hadoop, a framework for distributed storage (HDFS) and distributed processing (MapReduce). In this post, I use an example to illustrate using Ansible to create an environment for running a Hadoop & Mahout MapReduce workload on a pseudo-distributed cluster. The key reason for this choice is that, because Mahout's MapReduce paradigm has been phased out since 2014 (superseded by other approaches), the information on how to set up such an environment is sparse, scattered across the Internet, and difficult to consolidate into workable steps.
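To make the idempotency idea concrete, here is a minimal, hypothetical Python sketch (not Ansible code and not part of the repository) in the same spirit as Ansible's file module: it acts only when the system is not already in the desired state, so running it any number of times produces the same result.
import os

def ensure_directory(path):
    """Create the directory only if it is missing; report whether anything changed."""
    if os.path.isdir(path):
        return False  # already in the desired state, so do nothing
    os.makedirs(path)
    return True       # the state was changed on this run

# Safe to run repeatedly: the first call reports True, every later call False.
print(ensure_directory('/tmp/idempotency-demo'))
print(ensure_directory('/tmp/idempotency-demo'))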
The “pseudo-distributed cluster” aspect is a feature of Hadoop for simulating a “cluster” on a single computer for testing. Since Hadoop MapReduce keeps its working data on disk rather than in memory, to save time I recommend running this on a PC or Mac with a solid-state drive (SSD) rather than a hard disk drive (HDD).
The code for this article is publicly available on this repository under my GitHub account, licensed under BSD-3.
Setting up Ansible
Hadoop and Mahout work under Linux, so I will focus solely on setting them up in Linux. If you use Windows, you can install Linux using the Windows Subsystem for Linux (WSL) by following the instructions on Microsoft’s website. With Linux ready, the next step is to set up Ansible. However, there are other dependencies before and after this step, so the easiest way is to run the bash shell script “setup_ansible.sh” after downloading it from the GitHub repository mentioned earlier. The script is shown in full below.
#!/bin/bash
sudo apt update
sudo apt upgrade
sudo apt-get install -y python3-pip python3-dev
sudo pip3 install lxml
sudo pip3 install ansible
ansible-galaxy collection install community.general
sudo apt-get install -y whois
In order to run the script, enter the following in your Linux terminal:
$ chmod +x setup_ansible.sh
$ bash setup_ansible.sh
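To confirm that the script installed everything, you can run a quick sanity check such as the hypothetical Python snippet below (not part of the repository), which reports whether each required tool is on your PATH:
# check_setup.py -- hypothetical sanity check for the tools installed above
import shutil

for tool in ('ansible', 'ansible-playbook', 'ansible-galaxy', 'mkpasswd'):
    location = shutil.which(tool)
    print(f"{tool}: {location if location else 'NOT FOUND'}")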
Script walkthrough
Ansible scripts are written in YAML, which is reader-friendly. Let us start with some basic housekeeping items:
- hosts: 127.0.0.1
connection: local
become: true
As you may have guessed, the “connection: local” part indicates that we are running the script locally. Ansible can also manage machines over a network, but I will not cover that in this post.
Next, we indicate to Ansible that some custom variables are stored in the file “vars.yml”.
vars_files:
- vars.yml
The contents of “vars.yml” are listed below. We are creating two login IDs because the highest version of Mahout supporting the MapReduce paradigm (and all of its algorithms) is 0.9*, which requires Hadoop 1.2.1. The extra ID is entirely optional and only for installing Hadoop 3.3.1 alongside Hadoop 1.2.1, so that you can explore a newer version of Hadoop without interfering with the older version 1.2.1.
* Version 0.10.0 shifted away from MapReduce to the Samsara domain-specific language (DSL).
# Usernames and passwords
# Key is username and item is password hash from mkpasswd --method=sha-512
# First one must be for Hadoop 3.3.1, while second one must be for Mahout 0.9, which requires Hadoop 1.2.1
big_data_users:
hadoop331:
user_name: hadoop331
password_hash: $6$nGk0qT2ONarx.il$fM57NUaAQSojPYm43aap1EfXd/uxUp5dJGvnJYoLkByPoxOx93pcAs0V2ZnobQtpUZD8RAyphKQIR44SegB7p0
mahout09:
user_name: mahout09
password_hash: $6$nGk0qT2ONarx.il$fM57NUaAQSojPYm43aap1EfXd/uxUp5dJGvnJYoLkByPoxOx93pcAs0V2ZnobQtpUZD8RAyphKQIR44SegB7p0
# The temporary directory into which Hadoop and Mahout will be downloaded for setup.
download_dir: /tmp
Modify the “password_hash” values based on your chosen passwords. To get the hash for a chosen password, enter the following in your Linux terminal; you will be prompted to key in the password.
$ mkpasswd --method=sha-512
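If mkpasswd is unavailable, the same SHA-512 crypt hash can be generated with Python's standard library, as in the hypothetical snippet below (the crypt module is available up to Python 3.12; it was removed in 3.13):
# make_hash.py -- optional alternative to mkpasswd (Python 3.12 or earlier)
import crypt
import getpass

password = getpass.getpass('Password for the new user: ')
# METHOD_SHA512 produces a $6$... hash, the same format used in vars.yml.
print(crypt.crypt(password, crypt.mksalt(crypt.METHOD_SHA512)))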
The subsequent pre-tasks refresh the apt package cache if it is stale and make sure the sudo group exists:
pre_tasks:
- name: Update apt cache if needed
apt:
update_cache: true
cache_valid_time: 3600
- name: Make sure sudo group exists
group:
name: sudo
state: present
Next, the script installs the necessary software packages and sets up the two login IDs described earlier. Note that the package “openssh-server” is uninstalled and then reinstalled; this works around a bug in Windows WSL version 2 and is harmless (if unnecessary) elsewhere.
tasks:
- name: Remove the package "openssh-server" because reinstall is needed for Windows WSL2
apt:
name: openssh-server
state: absent
purge: yes
- name: (Re-)Install the package "openssh-server"
apt:
name: openssh-server
state: present
- name: Install Java OpenJDK 11
apt:
name: openjdk-11-jdk
state: present
- name: Install the package unzip
apt:
name: unzip
state: present
- name: Create a login ID for each of our big data users
user:
name: "{{ item.value.user_name }}"
password: "{{ item.value.password_hash }}" # Hash of your password from mkpasswd --method=sha-512
state: present
groups: sudo
shell: /bin/bash
system: no
createhome: yes
home: "/home/{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Ensure .ssh folder exists for each of our big data users
file:
path: "/home/{{ user_name }}/.ssh"
owner: "{{ user_name }}"
state: directory
vars:
user_name: "{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Generate an OpenSSH keypair with ecdsa 521-bit for each of our big data users
openssh_keypair:
path: "/home/{{ user_name }}/.ssh/id_ecdsa"
owner: "{{ user_name }}"
size: 521
type: ecdsa
vars:
user_name: "{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Set authorized key taken from file for each of our big data users
authorized_key:
user: "{{ user_name }}"
state: present
key: "{{ lookup('file', file_name) }}"
vars:
user_name: "{{ item.value.user_name }}"
file_name: "/home/{{ user_name }}/.ssh/id_ecdsa.pub"
with_dict: "{{ big_data_users }}"
# Download Hadoop 1.2.1, Mahout 0.9 and Hadoop 3.3.1
- name: Download Hadoop 1.2.1
get_url:
url: https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
dest: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
checksum: sha512:5793DBB7410E479253AD412F855F531AD7E029937A764B41DFE1E339D6EA014F75AD08B8497FDA30D6AB623C83DBE87826750BE18BB2B96216A83B36F5095F1E
- name: Expand Hadoop 1.2.1 for mahout09 user
unarchive:
src: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Remove Hadoop 1.2.1 temporary files.
file:
path: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
state: absent
- name: Download Mahout 0.9
get_url:
url: http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz
dest: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
checksum: sha1:b0d192a33dcc3f00439bf2ffbc313c6ef47510c3
- name: Expand Mahout 0.9 for mahout09 user.
unarchive:
src: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Remove Mahout 0.9 temporary files.
file:
path: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
state: absent
- name: Download Hadoop 3.3.1
get_url:
url: https://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
dest: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
checksum: sha512:2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b72e4a834f131a99f2814b030fbd043df66
- name: Expand Hadoop 3.3.1 for hadoop331 user
unarchive:
src: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Remove Hadoop 3.3.1 temporary files.
file:
path: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
state: absent
The final part is the most challenging: configuring everything correctly so that all the pieces work together:
- creating the proper folder names
- setting up the right environment variables in the bash files
- inserting the correct details in the XML configuration files
All of the above is the result of much time-consuming scouring of the Internet. The details are peripheral to the discussion here; still, if you are interested, you can find the related information by searching the Internet for the various phrases that appear in the portion of the code below.
# Update bash files
- name: Insert lines into mahout09 user .bashrc
blockinfile:
path: "/home/{{ big_data_users['mahout09'].user_name }}/.bashrc"
insertafter: "EOF"
block: |
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_PREFIX=~/hadoop-1.2.1
export HADOOP_INSTALL=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export MAHOUT_HOME=~/mahout-distribution-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
- name: Update line in Hadoop 1.2.1 hadoop-env.sh
lineinfile:
path: "/home/{{ big_data_users['mahout09'].user_name }}/hadoop-1.2.1/conf/hadoop-env.sh"
search_string: "export JAVA_HOME=/usr/lib"
state: present
line: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
- name: Insert lines into hadoop331 user .bashrc
blockinfile:
path: "/home/{{ big_data_users['hadoop331'].user_name }}/.bashrc"
insertafter: "EOF"
block: |
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_HOME=~/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- name: Update line in Hadoop 3.3.1 hadoop-env.sh
lineinfile:
path: "/home/{{ big_data_users['hadoop331'].user_name }}/hadoop-3.3.1/etc/hadoop/hadoop-env.sh"
search_string: "export JAVA_HOME=/usr/lib"
state: present
line: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
# Create Hadoop folders
- name: Create folders for Hadoop 1.2.1 namenode, datanode and temp
file:
path: "/home/{{ user_name }}/{{ item }}"
state: directory
owner: "{{ user_name }}"
mode: u+rw,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
with_items:
- hadoopdata/hdfs/namenode
- hadoopdata/hdfs/datanode
- hadoopdata/hdfs/tmpdata
- name: Create folders for Hadoop 3.3.1 namenode, datanode and temp
file:
path: "/home/{{ user_name }}/{{ item }}"
state: directory
owner: "{{ user_name }}"
mode: u+rw,o-rwx
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
with_items:
- hadoopdata/hdfs/namenode
- hadoopdata/hdfs/datanode
- hadoopdata/hdfs/tmpdata
# Hadoop 1.2.1 XML configurations
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/core-site.xml"
xpath: "/configuration/property[name[text()='hadoop.tmp.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/tmpdata"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/core-site.xml"
xpath: "/configuration/property[name[text()='fs.default.name']]/value"
value: "hdfs://127.0.0.1:9000"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.namenode.name.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/namenode"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.datanode.data.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/datanode"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.replication']]/value"
value: "1"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration mapred-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/mapred-site.xml"
xpath: "/configuration/property[name[text()='mapred.job.tracker']]/value"
value: "localhost:9001"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
# Hadoop 3.3.1 XML configurations
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/core-site.xml"
xpath: "/configuration/property[name[text()='hadoop.tmp.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/tmpdata"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/core-site.xml"
xpath: "/configuration/property[name[text()='fs.default.name']]/value"
value: "hdfs://127.0.0.1:9000"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.name.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/namenode"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.data.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/datanode"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.replication']]/value"
value: "1"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration mapred-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/mapred-site.xml"
xpath: "/configuration/property[name[text()='mapreduce.framework.name']]/value"
value: "yarn"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.aux-services']]/value"
value: "mapreduce_shuffle"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.aux-services.mapreduce.shuffle.class']]/value"
value: "org.apache.hadoop.mapred.ShuffleHandler"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.resourcemanager.hostname']]/value"
value: "127.0.0.1"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.acl.enable']]/value"
value: "0"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.env-whitelist']]/value"
value: "JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PERPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
Finally, to run the playbook, save the script above as “big_data.yml” in the same folder as the file “vars.yml”, and then enter the following in your Linux terminal:
$ sudo ansible-playbook big_data.yml
Manual steps after the script
After running the script, you need to format your Hadoop distributed file system (HDFS) and then create your folders on it. These steps are left manual rather than scripted so that you get hands-on experience with Hadoop.
First, switch to user mahout09, and start the SSH service:
$ su - mahout09
$ sudo service ssh start
Then, connect over SSH; answer “yes” when asked whether you are sure you want to continue connecting:
$ ssh localhost
Next, format HDFS; this only needs to be done once:
$ hadoop namenode -format
Then run the following to start the Hadoop services:
$ start-all.sh
With the cluster running, create your folders on HDFS:
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/mahout09
Finally, run the following to check that you have successfully created your folders on HDFS for the current user:
$ hadoop fs -ls /
Run the following to stop the Hadoop service before you log off the current user:
$ stop-all.sh
Repeat all the steps above for your hadoop331 user, except for the folder-creation step, which becomes:
$ hdfs dfs -mkdir /user/hadoop331
instead of:
$ hadoop fs -mkdir /user/mahout09
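These steps are deliberately manual, but once you are comfortable with them you can script them as well. Below is a minimal, hypothetical Python sketch (not part of the repository) that runs the same commands for the mahout09 user via the subprocess module, assuming the SSH service is already running and the PATH exported in ~/.bashrc is in effect:
# prepare_hdfs.py -- hypothetical automation of the manual steps above
import subprocess

commands = [
    'hadoop namenode -format',      # format HDFS (prompts if already formatted)
    'start-all.sh',                 # start the Hadoop daemons
    'hadoop fs -mkdir /user',
    'hadoop fs -mkdir /user/mahout09',
    'hadoop fs -ls /',              # verify that the folders were created
    'stop-all.sh',                  # stop the daemons again
]

for command in commands:
    print(f'Running: {command}')
    result = subprocess.run(command.split())
    print(f'Return code: {result.returncode}')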
Using Hadoop and Mahout MapReduce
Note that when running Mahout 0.9 with Hadoop 1.2.1, the following warning message will keep appearing:
localhost: WARNING: An illegal reflective access operation has occurred
According to this discussion on StackOverflow, the warning comes from Hadoop 1.2.1’s older code running on a newer Java version and can be safely ignored.
Upon logging on, always run:
$ start-all.sh
And before logging off, always run:
$ stop-all.sh
Notice that for Hadoop 1.2.1, the file system commands take the form:
$ hadoop fs -linux_command
whereas for Hadoop 3.3.1, you can use either of the following:
$ hdfs dfs -linux_command
$ hadoop fs -linux_command
MapReduce example in Python
Since there are various resources on the Internet on how to run Hadoop & Mahout commands in the terminal, I will not cover that here. However, running many commands manually and repeatedly can be tedious and unproductive. Fortunately, this can be automated using Python. Refer to the file “cluster_analysis.py” (in the GitHub repository mentioned earlier) for an example of performing cluster analysis (used in recommendation systems, such as movie recommendations). The file uses the subprocess module, which runs a shell command, captures its output, and returns it to Python; see below for an excerpt from the file.
# initialize with Canopy
# (distance_measure is set earlier in the file to a Mahout distance class
#  name, e.g. 'EuclideanDistanceMeasure')
import subprocess

canopy_str = (
    'mahout canopy -i docs-vectors/tfidf-vectors'
    + ' -ow -o docs-canopy-centroids'
    + ' -dm org.apache.mahout.common.distance.'
    + f'{distance_measure} -t1 0.5 -t2 0.3')  # the space before -t1 keeps it a separate token
# Run the command, capture stdout, and split it into lines for later parsing.
canopy_run_output_lines = (
    subprocess
    .run(canopy_str.split(' '), stdout=subprocess.PIPE)
    .stdout.decode('utf-8').splitlines())
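The same pattern extends to any Hadoop or Mahout command. As a small hypothetical follow-up (not taken from the file), the snippet below lists the mahout09 user's HDFS home folder and raises an error if the command fails, rather than silently parsing empty output:
import subprocess

# List the HDFS home folder and fail loudly on a non-zero return code.
result = subprocess.run(
    ['hadoop', 'fs', '-ls', '/user/mahout09'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if result.returncode != 0:
    raise RuntimeError(result.stderr.decode('utf-8'))
for line in result.stdout.decode('utf-8').splitlines():
    print(line)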