ElastiCluster-ClusterJob Computing Model

Introduction

This is a supplementary page to the paper “Ambitious Data Science Can Be Painless” by Monajemi et al. Here you will learn how to conduct massive computational experiments in the cloud using an approach developed by Hatef Monajemi and Riccardo Murri during the first iteration of the Stats285 course at Stanford University in Fall 2017. You will use ElastiCluster to build your own personal cluster in the cloud and then use ClusterJob (CJ) to run jobs on this cluster. Both of these steps are straightforward and painless.
This document details how to set up your (GPU-accelerated) cluster and how to test that it works properly with CJ. Once these steps are complete, you can enjoy conducting your own massive data science experiments in the cloud.

Questions and Comments

Please send your questions to CJ’s Google group. For other inquiries, please email hatefmonajemi@gmail.com or riccardo.murri@gmail.com.

Building your cluster in the cloud

To create your own cluster in the cloud, take the following four steps:

  1. Install ClusterJob
  2. Set up your cloud account
  3. Create your cluster using the 0-install ElastiCluster script
  4. Test that your cluster works with CJ

Part-1: Install ClusterJob

Part-2: Set Up Your Cloud Account

Part-3: Create Your Cluster Using ElastiCluster

  1. Get the ElastiCluster 0-install script from GitHub or download it from this website:
    curl -O https://raw.githubusercontent.com/gc3-uzh-ch/elasticluster/master/elasticluster.sh
    chmod +x elasticluster.sh
    
  2. Provide your desired configuration:
     ./elasticluster.sh list-templates
     vim ~/.elasticluster/config
    

    You may use the default ElastiCluster config template or the elasticluster_config file provided with this companion page. You can get elasticluster_config via the following commands:

     curl -O https://monajemi.github.io/datascience/assets/files/elasticluster_config
     mkdir -p ~/.elasticluster
     cp elasticluster_config ~/.elasticluster/config
    

    You must edit the ElastiCluster config file ~/.elasticluster/config to reflect your own credentials and choice of resources. For example, on Google Cloud, retrieve your project_id, client_id, and client_secret from the Credentials page and enter them in ~/.elasticluster/config.

    See the config example on GitHub.

    gcloud provides useful commands for exploring the available options. For example,
    gcloud compute machine-types list --zones us-west1-a
    lists all the machine types available in zone us-west1-a.
    This information is also available online from Google.
    Likewise, gcloud compute images list lists all the available images.
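The exploration commands above can be bundled into one helper that prints everything you might plug into ~/.elasticluster/config for a given zone, including the GPU (accelerator) types relevant to a GPU-accelerated cluster. This is a minimal sketch assuming the gcloud CLI is installed and authenticated; the function name explore_zone is hypothetical. It selects the zone with gcloud's general-purpose --filter flag rather than --zones.

```shell
#!/bin/sh
# Hypothetical helper: print the machine types, GPU (accelerator) types and
# images that could go into ~/.elasticluster/config for one zone.
explore_zone() {
  zone="${1:-us-west1-a}"   # zone to inspect; us-west1-a is only a default

  echo "== machine types in $zone =="
  gcloud compute machine-types list --filter="zone:$zone"

  echo "== GPU (accelerator) types in $zone =="
  gcloud compute accelerator-types list --filter="zone:$zone"

  # Images are global rather than zonal, so no zone filter is applied here
  echo "== available images =="
  gcloud compute images list
}
```

For example, `explore_zone us-west1-a` shows what that zone offers before you pick a flavor and image for your cluster.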

Advanced Tips
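Before spinning up the cluster, it helps to check that the credentials in ~/.elasticluster/config were actually filled in, and to avoid silently overwriting an existing config when copying in elasticluster_config. This is a minimal sketch with hypothetical function names. It assumes the Google Cloud key names gce_project_id, gce_client_id and gce_client_secret (as in ElastiCluster's Google examples) written as key=value lines, and it treats any value containing "REPLACE" as an unfilled placeholder. That heuristic is an assumption, not an ElastiCluster rule.

```shell
#!/bin/sh
# Hypothetical helpers for ~/.elasticluster/config (names are illustrative).

# Copy a config into place, keeping a timestamped backup of any existing file.
install_config() {
  src="$1"
  dest="${2:-$HOME/.elasticluster/config}"
  mkdir -p "$(dirname "$dest")"
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak.$(date +%Y%m%d%H%M%S)"
  fi
  cp "$src" "$dest"
}

# Report credential keys that are missing, empty, or still placeholders.
# Assumed key names: gce_project_id, gce_client_id, gce_client_secret.
check_credentials() {
  cfg="${1:-$HOME/.elasticluster/config}"
  status=0
  for key in gce_project_id gce_client_id gce_client_secret; do
    # Value after the first '=' on the first "key=..." line, whitespace removed
    value="$(grep -E "^[[:space:]]*$key[[:space:]]*=" "$cfg" 2>/dev/null \
      | head -n 1 | cut -d= -f2- | tr -d '[:space:]')"
    if [ -z "$value" ]; then
      echo "missing or empty: $key" >&2
      status=1
    elif printf '%s' "$value" | grep -q 'REPLACE'; then
      echo "still a placeholder: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

A typical sequence would be `install_config elasticluster_config`, then editing the file, then `check_credentials && ./elasticluster.sh -vvvv start gce`.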

  1. Spin up your cluster
    ./elasticluster.sh -vvvv start gce
    

    Note that the cluster named gce is fully defined in your ~/.elasticluster/config. For convenience, you may add an alias to your ~/.bashrc or ~/.bash_profile:
    alias elasticluster='/PATH/TO/SCRIPT/elasticluster.sh'

    If you run into an error and are asked to run the setup again, do so using:

     elasticluster -vvvv setup gce
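The alias and the start/setup sequence above can be scripted. This is a minimal sketch with hypothetical function names. The alias is appended only if that exact line is not already present. Shell aliases are not expanded inside non-interactive scripts, so start_cluster reads the script path from an assumed ELASTICLUSTER variable. It also re-runs setup after *any* failed start. That is a simplification: the text above only advises re-running setup when ElastiCluster asks for it.

```shell
#!/bin/sh
# Hypothetical helpers around the ElastiCluster 0-install script.

# Append the alias to a startup file unless that exact line is already there.
add_elasticluster_alias() {
  script_path="$1"                 # e.g. /PATH/TO/SCRIPT/elasticluster.sh
  rc_file="${2:-$HOME/.bashrc}"
  line="alias elasticluster='$script_path'"
  grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Start a cluster; if start fails, re-run the setup step once.
# ELASTICLUSTER (assumed variable) overrides the default script path.
start_cluster() {
  cluster="${1:-gce}"
  es="${ELASTICLUSTER:-./elasticluster.sh}"
  if ! "$es" -vvvv start "$cluster"; then
    echo "start failed; re-running setup for $cluster" >&2
    "$es" -vvvv setup "$cluster"
  fi
}
```

For example: `add_elasticluster_alias /PATH/TO/SCRIPT/elasticluster.sh`, then `start_cluster gce`.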
    
  • You can also monitor the progress in your cloud console (e.g., Google Cloud Console, EC2 Console).

    If everything goes well, you will see that your cluster is ready! This is the moment to shout Yay! and congratulate yourself: you now have your own cluster!

  • Get the IP address of the frontend node using:
      elasticluster list-nodes gce
    

    example: 35.199.171.137

  • Log in to your cluster to test it:
      ssh <USERNAME>@<FRONTEND_IP>
    

    example: ssh hatefmonajemi@35.199.171.137 (on GCP, your Gmail ID is your username)

  • To destroy your cluster:
      elasticluster -vvvv stop gce
    

    Note that this command destroys your cluster, and all the data on it will be lost. Make sure you copy your data to safe storage before you destroy your cluster.
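Because stop destroys the cluster and its data, one way to enforce "copy first" is to make the teardown conditional on a successful copy. This is a minimal sketch with a hypothetical function name and arguments. It assumes rsync is installed locally and on the frontend, that your results sit in one directory on the frontend, and that ELASTICLUSTER (an assumed variable) points to the ElastiCluster script.

```shell
#!/bin/sh
# Hypothetical helper: copy results off the frontend node with rsync, and
# tear the cluster down only if that copy succeeded.
backup_then_stop() {
  user="$1"          # login name on the frontend
  ip="$2"            # frontend IP, e.g. from `elasticluster list-nodes gce`
  remote_dir="$3"    # directory on the frontend to save (assumed to exist)
  local_dir="$4"     # local destination directory
  cluster="${5:-gce}"
  es="${ELASTICLUSTER:-./elasticluster.sh}"

  mkdir -p "$local_dir"
  if rsync -az "$user@$ip:$remote_dir/" "$local_dir/"; then
    "$es" -vvvv stop "$cluster"
  else
    echo "backup failed; NOT stopping $cluster" >&2
    return 1
  fi
}
```

For example, `backup_then_stop hatefmonajemi 35.199.171.137 results ./results-backup` copies `~/results` (relative to the remote home directory) before destroying the gce cluster.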

Part-4: Test That Your Cluster Works with CJ

After you have launched your cluster successfully, it is time to test it by running a sample job on it using ClusterJob. Follow the instructions on Test Your Cluster With CJ.