Introduction
Although Mahout's MapReduce support on Hadoop has been phased out since 2014, it is still sometimes needed for big data work, research, or study. Ansible is a scripting tool that can automate the setup of a reproducible environment for such a workload reliably and efficiently. This post serves two objectives: to give you a basic walkthrough of using Ansible, and to provide you with my downloadable tool for setting up the software environment mentioned above for big data analysis.
Background
Ansible is a scripting tool for automating IT tasks. It is popular for reproducing a particular environment for programs to run because of its idempotency property: no matter how many times you execute an Ansible script, you end up with the same result. This also makes a script safe to re-run from the start if it fails or errors out partway through. Mahout is a big data analysis tool usually executed on top of Hadoop, a framework for distributed storage (HDFS) and distributed processing (MapReduce). In this post, I use an example to illustrate using Ansible to create an environment for running a Hadoop & Mahout MapReduce workload on a pseudo-distributed cluster. The key reason for this choice is that, because Mahout's MapReduce paradigm has been phased out since 2014 (superseded by other approaches), the information on how to set up such an environment is sparse, scattered across the Internet, and difficult to consolidate into workable steps.
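To make the idempotency idea concrete, here is a minimal, hypothetical Python sketch (not Ansible code and not part of the repository) in the same spirit as Ansible's file module: it acts only when the system is not already in the desired state, so running it any number of times produces the same result.
import os

def ensure_directory(path):
    """Create the directory only if it is missing; report whether anything changed."""
    if os.path.isdir(path):
        return False  # already in the desired state, so do nothing
    os.makedirs(path)
    return True       # the state was changed on this run

# Safe to run repeatedly: the first call reports True, every later call False.
print(ensure_directory('/tmp/idempotency-demo'))
print(ensure_directory('/tmp/idempotency-demo'))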
The “pseudo-distributed cluster” aspect is a feature of Hadoop for simulating a “cluster” on a single computer for testing. Since Hadoop MapReduce keeps its working data on disk rather than in memory, to save time I recommend running this on a PC or Mac with a solid-state drive (SSD) rather than a hard disk drive (HDD).
The code for this article is publicly available on this repository under my GitHub account, licensed under BSD-3.
Setting up Ansible
Hadoop and Mahout work under Linux, so I will focus solely on setting them up in Linux. If you use Windows, you can install Linux using the Windows Subsystem for Linux (WSL) by following the instructions on Microsoft’s website. With Linux ready, the next step is to set up Ansible. However, there are other dependencies before and after this step, so the easiest way is to run the bash shell script “setup_ansible.sh” after downloading it from the GitHub repository mentioned earlier. The script is shown in full below.
#!/bin/bash
sudo apt update
sudo apt upgrade
sudo apt-get install -y python3-pip python3-dev
sudo pip3 install lxml
sudo pip3 install ansible
ansible-galaxy collection install community.general
sudo apt-get install -y whois
In order to run the script, enter the following in your Linux terminal:
$ chmod +x setup_ansible.sh
$ bash setup_ansible.sh
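To confirm that the script installed everything, you can run a quick sanity check such as the hypothetical Python snippet below (not part of the repository), which reports whether each required tool is on your PATH:
# check_setup.py -- hypothetical sanity check for the tools installed above
import shutil

for tool in ('ansible', 'ansible-playbook', 'ansible-galaxy', 'mkpasswd'):
    location = shutil.which(tool)
    print(f"{tool}: {location if location else 'NOT FOUND'}")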
Script walkthrough
Ansible scripts are written in YAML, which is reader-friendly. Let us start with some basic housekeeping items:
- hosts: 127.0.0.1
connection: local
become: true
As you may have guessed, the “connection: local” part indicates that we are running the script locally. Ansible can also manage machines over a network, but I will not cover that in this post.
Next, we indicate to Ansible that some custom variables are stored in the file “vars.yml”.
vars_files:
- vars.yml
The contents of “vars.yml” are listed below. We are creating two login IDs because the highest version of Mahout supporting the MapReduce paradigm (and all of its algorithms) is 0.9*, which requires Hadoop 1.2.1. The extra ID is entirely optional and only for installing Hadoop 3.3.1 alongside Hadoop 1.2.1, so that you can explore a newer version of Hadoop without interfering with the older version 1.2.1.
* Version 0.10.0 shifted away from MapReduce to the Samsara domain-specific language (DSL).
# Usernames and passwords
# Key is username and item is password hash from mkpasswd --method=sha-512
# First one must be for Hadoop 3.3.1, while second one must be for Mahout 0.9, which requires Hadoop 1.2.1
big_data_users:
hadoop331:
user_name: hadoop331
password_hash: $6$nGk0qT2ONarx.il$fM57NUaAQSojPYm43aap1EfXd/uxUp5dJGvnJYoLkByPoxOx93pcAs0V2ZnobQtpUZD8RAyphKQIR44SegB7p0
mahout09:
user_name: mahout09
password_hash: $6$nGk0qT2ONarx.il$fM57NUaAQSojPYm43aap1EfXd/uxUp5dJGvnJYoLkByPoxOx93pcAs0V2ZnobQtpUZD8RAyphKQIR44SegB7p0
# The temporary directory into which Hadoop and Mahout will be downloaded for setup.
download_dir: /tmp
Modify the “password_hash” values based on your chosen passwords. To get the hash for a chosen password, enter the following in your Linux terminal; you will be prompted to key in the password.
$ mkpasswd --method=sha-512
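If mkpasswd is unavailable, the same SHA-512 crypt hash can be generated with Python's standard library, as in the hypothetical snippet below (the crypt module is available up to Python 3.12; it was removed in 3.13):
# make_hash.py -- optional alternative to mkpasswd (Python 3.12 or earlier)
import crypt
import getpass

password = getpass.getpass('Password for the new user: ')
# METHOD_SHA512 produces a $6$... hash, the same format used in vars.yml.
print(crypt.crypt(password, crypt.mksalt(crypt.METHOD_SHA512)))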
The subsequent pre-tasks refresh the apt package cache if it is stale and make sure the sudo group exists:
pre_tasks:
- name: Update apt cache if needed
apt:
update_cache: true
cache_valid_time: 3600
- name: Make sure sudo group exists
group:
name: sudo
state: present
Next, the script installs the necessary software packages and sets up the two login IDs described earlier. Note that the package “openssh-server” is uninstalled and then reinstalled; this works around a bug in Windows WSL version 2 and is harmless (if unnecessary) elsewhere.
tasks:
- name: Remove the package "openssh-server" because reinstall is needed for Windows WSL2
apt:
name: openssh-server
state: absent
purge: yes
- name: (Re-)Install the package "openssh-server"
apt:
name: openssh-server
state: present
- name: Install Java OpenJDK 11
apt:
name: openjdk-11-jdk
state: present
- name: Install the package unzip
apt:
name: unzip
state: present
- name: Create a login ID for each of our big data users
user:
name: "{{ item.value.user_name }}"
password: "{{ item.value.password_hash }}" # Hash of your password from mkpasswd --method=sha-512
state: present
groups: sudo
shell: /bin/bash
system: no
createhome: yes
home: "/home/{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Ensure .ssh folder exists for each of our big data users
file:
path: "/home/{{ user_name }}/.ssh"
owner: "{{ user_name }}"
state: directory
vars:
user_name: "{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Generate an OpenSSH keypair with ecdsa 521-bit for each of our big data users
openssh_keypair:
path: "/home/{{ user_name }}/.ssh/id_ecdsa"
owner: "{{ user_name }}"
size: 521
type: ecdsa
vars:
user_name: "{{ item.value.user_name }}"
with_dict: "{{ big_data_users }}"
- name: Set authorized key taken from file for each of our big data users
authorized_key:
user: "{{ user_name }}"
state: present
key: "{{ lookup('file', file_name) }}"
vars:
user_name: "{{ item.value.user_name }}"
file_name: "/home/{{ user_name }}/.ssh/id_ecdsa.pub"
with_dict: "{{ big_data_users }}"
# Download Hadoop 1.2.1, Mahout 0.9 and Hadoop 3.3.1
- name: Download Hadoop 1.2.1
get_url:
url: https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
dest: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
checksum: sha512:5793DBB7410E479253AD412F855F531AD7E029937A764B41DFE1E339D6EA014F75AD08B8497FDA30D6AB623C83DBE87826750BE18BB2B96216A83B36F5095F1E
- name: Expand Hadoop 1.2.1 for mahout09 user
unarchive:
src: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Remove Hadoop 1.2.1 temporary files.
file:
path: "{{ download_dir }}/hadoop-1.2.1-bin.tar.gz"
state: absent
- name: Download Mahout 0.9
get_url:
url: http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz
dest: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
checksum: sha1:b0d192a33dcc3f00439bf2ffbc313c6ef47510c3
- name: Expand Mahout 0.9 for mahout09 user.
unarchive:
src: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Remove Mahout 0.9 temporary files.
file:
path: "{{ download_dir }}/mahout-distribution-0.9.tar.gz"
state: absent
- name: Download Hadoop 3.3.1
get_url:
url: https://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
dest: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
checksum: sha512:2fd0bf74852c797dc864f373ec82ffaa1e98706b309b30d1effa91ac399b477e1accc1ee74d4ccbb1db7da1c5c541b72e4a834f131a99f2814b030fbd043df66
- name: Expand Hadoop 3.3.1 for hadoop331 user
unarchive:
src: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
dest: "/home/{{ user_name }}"
remote_src: true
owner: "{{ user_name }}"
mode: u+rwx,o-rwx
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Remove Hadoop 3.3.1 temporary files.
file:
path: "{{ download_dir }}/hadoop-3.3.1.tar.gz"
state: absent
The final part is the most challenging: configuring everything correctly so that all the pieces work together:
- creating the proper folder names
- setting up the right environment variables in the bash files
- inserting the correct details in the XML configuration files
All of the above is the result of much time-consuming scouring of the Internet. The details are peripheral to the discussion here; still, if you are interested, you can find the related information by searching the Internet for the various phrases that appear in the portion of the code below.
# Update bash files
- name: Insert lines into mahout09 user .bashrc
blockinfile:
path: "/home/{{ big_data_users['mahout09'].user_name }}/.bashrc"
insertafter: "EOF"
block: |
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_PREFIX=~/hadoop-1.2.1
export HADOOP_INSTALL=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export MAHOUT_HOME=~/mahout-distribution-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
- name: Update line in Hadoop 1.2.1 hadoop-env.sh
lineinfile:
path: "/home/{{ big_data_users['mahout09'].user_name }}/hadoop-1.2.1/conf/hadoop-env.sh"
search_string: "export JAVA_HOME=/usr/lib"
state: present
line: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
- name: Insert lines into hadoop331 user .bashrc
blockinfile:
path: "/home/{{ big_data_users['hadoop331'].user_name }}/.bashrc"
insertafter: "EOF"
block: |
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_HOME=~/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- name: Update line in Hadoop 3.3.1 hadoop-env.sh
lineinfile:
path: "/home/{{ big_data_users['hadoop331'].user_name }}/hadoop-3.3.1/etc/hadoop/hadoop-env.sh"
search_string: "export JAVA_HOME=/usr/lib"
state: present
line: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
# Create Hadoop folders
- name: Create folders for Hadoop 1.2.1 namenode, datanode and temp
file:
path: "/home/{{ user_name }}/{{ item }}"
state: directory
owner: "{{ user_name }}"
mode: u+rw,o-rwx
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
with_items:
- hadoopdata/hdfs/namenode
- hadoopdata/hdfs/datanode
- hadoopdata/hdfs/tmpdata
- name: Create folders for Hadoop 3.3.1 namenode, datanode and temp
file:
path: "/home/{{ user_name }}/{{ item }}"
state: directory
owner: "{{ user_name }}"
mode: u+rw,o-rwx
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
with_items:
- hadoopdata/hdfs/namenode
- hadoopdata/hdfs/datanode
- hadoopdata/hdfs/tmpdata
# Hadoop 1.2.1 XML configurations
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/core-site.xml"
xpath: "/configuration/property[name[text()='hadoop.tmp.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/tmpdata"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/core-site.xml"
xpath: "/configuration/property[name[text()='fs.default.name']]/value"
value: "hdfs://127.0.0.1:9000"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.namenode.name.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/namenode"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.datanode.data.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/datanode"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.replication']]/value"
value: "1"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
- name: Insert lines into mahout09 user Hadoop 1.2.1 configuration mapred-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-1.2.1/conf/mapred-site.xml"
xpath: "/configuration/property[name[text()='mapred.job.tracker']]/value"
value: "localhost:9001"
vars:
user_name: "{{ big_data_users['mahout09'].user_name }}"
# Hadoop 3.3.1 XML configurations
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/core-site.xml"
xpath: "/configuration/property[name[text()='hadoop.tmp.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/tmpdata"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration core-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/core-site.xml"
xpath: "/configuration/property[name[text()='fs.default.name']]/value"
value: "hdfs://127.0.0.1:9000"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.name.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/namenode"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.data.dir']]/value"
value: "/home/{{ user_name }}/hadoopdata/hdfs/datanode"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration hdfs-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/hdfs-site.xml"
xpath: "/configuration/property[name[text()='dfs.replication']]/value"
value: "1"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration mapred-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/mapred-site.xml"
xpath: "/configuration/property[name[text()='mapreduce.framework.name']]/value"
value: "yarn"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.aux-services']]/value"
value: "mapreduce_shuffle"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.aux-services.mapreduce.shuffle.class']]/value"
value: "org.apache.hadoop.mapred.ShuffleHandler"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.resourcemanager.hostname']]/value"
value: "127.0.0.1"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.acl.enable']]/value"
value: "0"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
- name: Insert lines into hadoop331 user Hadoop 3.3.1 configuration yarn-site.xml
xml:
path: "/home/{{ user_name }}/hadoop-3.3.1/etc/hadoop/yarn-site.xml"
xpath: "/configuration/property[name[text()='yarn.nodemanager.env-whitelist']]/value"
value: "JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PERPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME"
vars:
user_name: "{{ big_data_users['hadoop331'].user_name }}"
Finally, to run the playbook, save the script above as “big_data.yml” in the same folder as the file “vars.yml”, and then enter the following in your Linux terminal:
$ sudo ansible-playbook big_data.yml
Manual steps after the script
After running the script, you need to format your Hadoop distributed file system (HDFS) and then create your folders on it. These steps are left manual rather than scripted so that you get hands-on experience with Hadoop.
First, switch to user mahout09, and start the SSH service:
$ su - mahout09
$ sudo service ssh start
Then, connect over SSH; answer “yes” when asked whether you are sure you want to continue connecting:
$ ssh localhost
Next, format HDFS; this only needs to be done once:
$ hadoop namenode -format
Then run the following to start the Hadoop services:
$ start-all.sh
With the cluster running, create your folders on HDFS:
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/mahout09
Finally, run the following to check that you have successfully created your folders on HDFS for the current user:
$ hadoop fs -ls /
Run the following to stop the Hadoop service before you log off the current user:
$ stop-all.sh
Repeat all the steps above for your hadoop331 user, except for the folder-creation step, which becomes:
$ hdfs dfs -mkdir /user/hadoop331
instead of:
$ hadoop fs -mkdir /user/mahout09
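These steps are deliberately manual, but once you are comfortable with them you can script them as well. Below is a minimal, hypothetical Python sketch (not part of the repository) that runs the same commands for the mahout09 user via the subprocess module, assuming the SSH service is already running and the PATH exported in ~/.bashrc is in effect:
# prepare_hdfs.py -- hypothetical automation of the manual steps above
import subprocess

commands = [
    'hadoop namenode -format',      # format HDFS (prompts if already formatted)
    'start-all.sh',                 # start the Hadoop daemons
    'hadoop fs -mkdir /user',
    'hadoop fs -mkdir /user/mahout09',
    'hadoop fs -ls /',              # verify that the folders were created
    'stop-all.sh',                  # stop the daemons again
]

for command in commands:
    print(f'Running: {command}')
    result = subprocess.run(command.split())
    print(f'Return code: {result.returncode}')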
Using Hadoop and Mahout MapReduce
Note that when running Mahout 0.9 with Hadoop 1.2.1, the following warning message will keep appearing:
localhost: WARNING: An illegal reflective access operation has occurred
According to this discussion on StackOverflow, the warning comes from Hadoop 1.2.1’s older code running on a newer Java version and can be safely ignored.
Upon logging on, always run:
$ start-all.sh
And before logging off, always run:
$ stop-all.sh
Notice that for Hadoop 1.2.1, the file system commands take the form:
$ hadoop fs -linux_command
whereas for Hadoop 3.3.1, you can use either of the following:
$ hdfs dfs -linux_command
$ hadoop fs -linux_command
MapReduce example in Python
Since there are various resources on the Internet on how to run Hadoop & Mahout commands in the terminal, I will not cover that here. However, running many commands manually and repeatedly can be tedious and unproductive. Fortunately, this can be automated using Python. Refer to the file “cluster_analysis.py” (in the GitHub repository mentioned earlier) for an example of performing cluster analysis (used in recommendation systems, such as movie recommendations). The file uses the subprocess module, which runs a shell command, captures its output, and returns it to Python; see below for an excerpt from the file.
# initialize with Canopy
# (distance_measure is set earlier in the file to a Mahout distance class
#  name, e.g. 'EuclideanDistanceMeasure')
import subprocess

canopy_str = (
    'mahout canopy -i docs-vectors/tfidf-vectors'
    + ' -ow -o docs-canopy-centroids'
    + ' -dm org.apache.mahout.common.distance.'
    + f'{distance_measure} -t1 0.5 -t2 0.3')  # the space before -t1 keeps it a separate token
# Run the command, capture stdout, and split it into lines for later parsing.
canopy_run_output_lines = (
    subprocess
    .run(canopy_str.split(' '), stdout=subprocess.PIPE)
    .stdout.decode('utf-8').splitlines())
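The same pattern extends to any Hadoop or Mahout command. As a small hypothetical follow-up (not taken from the file), the snippet below lists the mahout09 user's HDFS home folder and raises an error if the command fails, rather than silently parsing empty output:
import subprocess

# List the HDFS home folder and fail loudly on a non-zero return code.
result = subprocess.run(
    ['hadoop', 'fs', '-ls', '/user/mahout09'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if result.returncode != 0:
    raise RuntimeError(result.stderr.decode('utf-8'))
for line in result.stdout.decode('utf-8').splitlines():
    print(line)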