Machine Learning Workflow

Goal: For a given text message, identify whether it is spam.

  1. Extract data
  2. Transform & tokenize messages
  3. Build Spark’s TF-IDF model and expand messages to feature vectors
  4. Create and evaluate H2O’s Deep Learning model
  5. Use the models to detect spam messages

Prepare environment

  1. Run Sparkling shell with an embedded Spark cluster:

Note: To avoid flooding the output with Spark INFO messages, I recommend editing the logging configuration in $SPARK_HOME/conf/ and setting the log level to WARN.

  2. Open the Spark UI: go to http://localhost:4040/ to see the Spark status.

  3. Prepare the environment:

  4. Define the representation of the training message:
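As a rough illustration of this step, a training message can be modeled as a label plus a numeric feature vector. The sketch below is plain Scala that can be pasted into the shell; the name `LabeledMessage`, its fields, and the "ham"/"spam" label values are assumptions made for this example, not the types used by Spark or H2O.

```scala
// Hypothetical representation of one training message (names are illustrative).
// label:    the class of the message, assumed to be "ham" or "spam"
// features: the numeric feature vector derived from the message text
final case class LabeledMessage(label: String, features: Vector[Double]) {
  def isSpam: Boolean = label == "spam"
}

// A hand-written example instance.
val example = LabeledMessage("spam", Vector(0.0, 1.2, 0.0, 0.7))
```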

  5. Define the data loader and parser:
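A minimal sketch of a parser, assuming each input line has the form "label, tab, message text". That file layout, and the helper names `parseLine` and `load`, are assumptions for illustration. The example works on an in-memory sequence of lines; in the shell, the same per-line function could be applied to the lines of a distributed dataset.

```scala
// Parse one line of the assumed format "<label>\t<message text>".
// Returns None for lines that do not contain both fields.
def parseLine(line: String): Option[(String, String)] =
  line.split("\t", 2) match {
    case Array(label, text) if label.trim.nonEmpty && text.trim.nonEmpty =>
      Some((label.trim, text.trim))
    case _ => None
  }

// Keep only the well-formed lines.
def load(lines: Seq[String]): Seq[(String, String)] =
  lines.flatMap(line => parseLine(line).toList)

val sample = Seq(
  "ham\tSee you at lunch?",
  "spam\tWINNER!! Claim your free prize now",
  "malformed line without a tab"
)
val parsed = load(sample)
```

Returning `Option` keeps malformed lines from aborting the whole load; they are silently dropped here, which may or may not be what you want for real data.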

  6. Define the input messages tokenizer:
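One possible tokenizer, sketched in plain Scala: it lowercases the message, replaces punctuation with spaces, splits on whitespace, and drops stop words and very short tokens. The stop-word list, the ignored characters, and the minimum token length of three are all assumptions chosen for this example.

```scala
// Words and characters to drop; both sets are illustrative assumptions.
val ignoredWords = Set("the", "a", "in", "on", "at", "as", "not", "for")
val ignoredChars = Set(',', ':', ';', '/', '<', '>', '"', '.', '(', ')', '?', '-', '\'', '!')

// Lowercase the message, replace ignored characters with spaces,
// split on whitespace, and drop stop words and tokens shorter than 3 characters.
def tokenize(message: String): Seq[String] =
  message.toLowerCase
    .map(c => if (ignoredChars.contains(c)) ' ' else c)
    .split("\\s+")
    .filter(w => w.length > 2 && !ignoredWords.contains(w))
    .toSeq
```

For example, `tokenize("On the bus")` keeps only `bus`: "on" is too short and "the" is a stop word.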

  7. Configure Spark’s TF-IDF model builder:
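To make the idea behind this step concrete, here is a plain-Scala stand-in for TF-IDF with feature hashing. It does not call Spark's API; it only mirrors the technique. The bucket count of 1024 and all function names are assumptions, and the IDF weight uses the smoothed form log((m + 1) / (df + 1)) that Spark MLlib documents for its IDF.

```scala
// Number of hash buckets, i.e. the length of every feature vector (an assumed value).
val numFeatures = 1024

// Map a term to a bucket index in [0, numFeatures).
def bucket(term: String): Int = {
  val raw = term.hashCode % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

// Term frequencies: count how many tokens fall into each bucket.
def termFrequencies(tokens: Seq[String]): Array[Double] = {
  val tf = Array.fill(numFeatures)(0.0)
  tokens.foreach(t => tf(bucket(t)) += 1.0)
  tf
}

// Inverse document frequencies over a corpus of TF vectors,
// using the smoothed form log((m + 1) / (df + 1)) for m documents.
def inverseDocFrequencies(corpus: Seq[Array[Double]]): Array[Double] = {
  val m = corpus.size.toDouble
  Array.tabulate(numFeatures) { i =>
    val df = corpus.count(v => v(i) > 0.0).toDouble
    math.log((m + 1.0) / (df + 1.0))
  }
}

// TF-IDF: scale each term frequency by the IDF weight of its bucket.
def tfIdf(tf: Array[Double], idf: Array[Double]): Array[Double] =
  tf.zip(idf).map { case (f, w) => f * w }

val docs = Seq(Seq("free", "prize", "free"), Seq("lunch", "today"), Seq("free", "lunch"))
val tfs = docs.map(termFrequencies)
val idf = inverseDocFrequencies(tfs)
val vectors = tfs.map(tf => tfIdf(tf, idf))
```

Hashing avoids building a vocabulary up front, at the cost of occasional collisions where two different terms share a bucket.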

  8. Initialize H2OContext and start H2O services on top of Spark:

  9. Open the H2O UI and verify that H2O is running:

At this point, you can use the H2O UI and see the status of the H2O cloud by typing getCloud.

  10. Build the final workflow from all the building blocks:

  11. Evaluate the model’s quality:
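One simple way to think about model quality is a confusion matrix over predicted and actual labels. The sketch below is plain Scala and is not the metric H2O reports; the names `Confusion` and `evaluate`, the "ham"/"spam" labels, and the sample data are all assumptions made for illustration.

```scala
// Counts of a binary confusion matrix, with "spam" as the positive class.
final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
  def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
  def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
  def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
}

// Compare predicted labels with actual labels, pairwise.
def evaluate(actual: Seq[String], predicted: Seq[String]): Confusion = {
  require(actual.length == predicted.length, "label sequences must have equal length")
  val pairs = actual.zip(predicted)
  Confusion(
    tp = pairs.count { case (a, p) => a == "spam" && p == "spam" },
    fp = pairs.count { case (a, p) => a == "ham" && p == "spam" },
    tn = pairs.count { case (a, p) => a == "ham" && p == "ham" },
    fn = pairs.count { case (a, p) => a == "spam" && p == "ham" }
  )
}

// Made-up labels, used only to exercise the functions above.
val actual = Seq("spam", "ham", "ham", "spam", "ham")
val predicted = Seq("spam", "ham", "spam", "ham", "ham")
val result = evaluate(actual, predicted)
```

For a spam filter, precision matters a lot in practice: a false positive means a legitimate message gets flagged.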

You can also open the H2O UI and type getPredictions to visualize the model’s performance or type getModels to see model output.

  12. Create a spam detector:
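A detector can be sketched as a threshold on the model's predicted probability that a message is legitimate ("ham"). In the plain-Scala example below, `hamProbability` is a hypothetical stand-in for the full pipeline (tokenize, build the feature vector, score with the trained model), and the default threshold of 0.5 is an assumption.

```scala
// Hypothetical detector: `hamProbability` stands in for the trained model's
// scoring function. A message is flagged as spam when the predicted
// probability of "ham" falls below `hamThreshold`.
def isSpam(message: String,
           hamProbability: String => Double,
           hamThreshold: Double = 0.5): Boolean =
  hamProbability(message) < hamThreshold

// Toy scorer used only to exercise the detector; it is not a real model.
val toyScorer: String => Double =
  msg => if (msg.toLowerCase.contains("free prize")) 0.1 else 0.9

val flagged = isSpam("Claim your FREE PRIZE today", toyScorer)
val passed = !isSpam("See you at lunch", toyScorer)
```

Raising `hamThreshold` makes the detector stricter: more messages are flagged as spam, including some legitimate ones.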

  13. Try to detect spam:

At this point, you have finished your first Sparkling Water machine learning application.