Steps To Train A Machine Learning Model With Amazon SageMaker: First Look

SageMaker is a machine learning service managed by Amazon. It essentially combines EC2, ECR, and S3 into a single service, allowing you to train complex machine learning models quickly and easily, and then deploy them into a production-ready hosted environment. It provides many best-in-class built-in algorithms, such as Factorization Machines and XGBoost. It also allows you to train models using various machine learning frameworks, such as Apache MXNet, TensorFlow, and Scikit-learn.

A straightforward way to interact with SageMaker is through a notebook instance, a process that Amazon describes in detail. We use SageMaker in a slightly different way: we only want it for the model training part, so that we can train a complex model on a large dataset without worrying about the messy infrastructure details. The rest of the process (e.g., data preparation, making predictions) runs locally. So, in our use case, we want to:

1. Interact with SageMaker jobs from a local machine, without using a SageMaker notebook instance. Why? Well, there are a few advantages:

  • a notebook instance takes a few minutes to start, which is slow
  • unless you manually stop the instance, you are charged for as long as it runs, whether or not you are actively using it. If you submit the training job from your local machine instead, you are only charged for the model training part
  • if the code sits locally, you can use your IDE to debug, and use GitHub for version control

2. Access the trained model locally, so that we can

  • look at the details of the model, instead of using the model as a black box
  • make predictions locally, and use the model in our own way

The rest of this post will cover how we did that in 5 steps:

  1. Set up your local machine so that you can interact with SageMaker jobs locally
  2. Prepare your data
  3. Submit the training job
  4. Download the trained model
  5. Make predictions locally

At the end, we’ll also briefly show you how to use SageMaker’s hyperparameter tuner, which helps you tune the machine learning model.

SET UP YOUR LOCAL MACHINE

To interact with SageMaker jobs programmatically from your local machine, you need to install the SageMaker Python SDK (sagemaker) and the AWS SDK for Python (boto3). You can install both by running pip install sagemaker boto3

The easiest way to test whether your local environment is ready is to run through a sample notebook, for example, An Introduction to Factorization Machines with MNIST. Run this sample notebook, and check whether you need to install additional packages or whether any AWS credential information is missing.
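Before opening the sample notebook, a quick local check can catch the two most common problems mentioned above: a missing package or missing credentials. The snippet below is only a sketch; the helper names missing_packages and has_aws_credentials are our own, and the credential check only tells you that boto3 found some credentials, not that they have the right permissions.

```python
import importlib.util

# The two packages this post relies on.
REQUIRED_PACKAGES = ("sagemaker", "boto3")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def has_aws_credentials():
    """Return True if boto3 can locate credentials (environment variables, ~/.aws, ...)."""
    import boto3  # imported lazily so that missing_packages() works without boto3

    return boto3.Session().get_credentials() is not None


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("install first: pip install " + " ".join(missing))
    elif not has_aws_credentials():
        print("no AWS credentials found; try running `aws configure`")
    else:
        print("local environment looks ready")
```

Neither check contacts AWS, so the script is safe to run offline; a real permission problem will still only show up when you submit a job.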

Once you are sure that your local machine is properly set up to interact with SageMaker, you can bring your own data, train a Factorization Machine classification model using SageMaker, download the model, and make predictions. To start, let’s look at how to prepare your data for training.

PREPARE YOUR DATA

Before you can train a model, the data needs to be uploaded to S3. The format of the input data depends on the algorithm you choose; for SageMaker’s Factorization Machine algorithm, protobuf is typically used.

To begin, preprocess your data (cleaning, one-hot encoding, etc.), and split both the features (X) and the labels (y) into train and test sets. Sometimes, you may also want to set a validation set aside.
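As a rough illustration of this step, here is a minimal NumPy-only sketch of one-hot encoding a single categorical column and splitting X and y into train and test sets. The function names, the 80/20 split, and the fixed seed are our own choices for illustration; in practice you may prefer a library such as Scikit-learn. The float32 dtype reflects our understanding that the built-in algorithm expects float32 tensors.

```python
import numpy as np


def one_hot(column):
    """One-hot encode a 1-D array of categories; returns (matrix, categories)."""
    categories, codes = np.unique(np.asarray(column).reshape(-1), return_inverse=True)
    encoded = np.eye(len(categories), dtype="float32")[codes.reshape(-1)]
    return encoded, categories


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows once, then hold out `test_fraction` of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


if __name__ == "__main__":
    colors, _ = one_hot(np.array(["red", "blue", "red", "green", "blue"]))
    labels = np.array([1, 0, 1, 0, 0], dtype="float32")
    X_train, X_test, y_train, y_test = train_test_split(colors, labels)
    print(X_train.shape, X_test.shape)
```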

After you have obtained the features (X) and labels (y), use the following Python code to transform them into protobuf and upload them to your S3 bucket. Run this for both the train and test sets.


import io

import boto3
import sagemaker.amazon.common as smac

# after lots of data cleaning, preprocessing, feature engineering, splitting into train and test sets, etc.
feature = your_features
label = your_labels

# define the S3 path to store the data; it is uploaded to s3://{bucket}/{prefix}/{key}, where key is the file name
bucket = your_S3_bucket_name
prefix = your_prefix_name
key = 'train.protobuf'  # or 'test.protobuf'

# transform the feature and label into protobuf
buf = io.BytesIO()
# if the feature is a numpy array, use smac.write_numpy_to_dense_tensor(buf, feature, label);
# if the feature is a sparse matrix, use smac.write_spmatrix_to_sparse_tensor(buf, feature, label)
smac.write_numpy_to_dense_tensor(buf, feature, label)
buf.seek(0)

# upload the protobuf to S3 (S3 keys always use forward slashes, so build the key directly instead of with os.path.join)
boto3.resource('s3').Bucket(bucket).Object(f'{prefix}/{key}').upload_fileobj(buf)

path_to_train_data = f's3://{bucket}/{prefix}/{key}'
print(f'uploaded training data location: {path_to_train_data}')

At this point, you have uploaded your train and test data to S3. You can go to the AWS console, select S3, and check the protobuf files you just uploaded.
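If you would rather check the upload from code than from the console, a small helper along these lines should work. It is a sketch only: list_uploaded is a name we made up, it takes the S3 client as an argument so that it can be tried without touching AWS, and it reads a single page of results from list_objects_v2.

```python
def list_uploaded(s3_client, bucket, prefix):
    """Return (key, size_in_bytes) pairs for the objects stored under s3://bucket/prefix."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent from the response when nothing matches the prefix.
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]


# Example usage against real S3 (bucket and prefix are the variables defined above):
#   import boto3
#   for key, size in list_uploaded(boto3.client("s3"), bucket, prefix):
#       print(f"s3://{bucket}/{key}  {size} bytes")
```

You should see both train.protobuf and test.protobuf listed, each with a non-zero size.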

SUBMIT THE TRAINING JOB

Once you have the data ready, you can then define your estimator and submit a training job. The code below defines a factorization machine estimator, and fits data to it:


import boto3
import sagemaker
# this is the SageMaker Python SDK v1 API; SDK v2 replaces get_image_uri with sagemaker.image_uris.retrieve
# and renames train_instance_count / train_instance_type to instance_count / instance_type
from sagemaker.amazon.amazon_estimator import get_image_uri

output_prefix = your_prefix_name_for_output_model
role = your_full_IAM_role_arn_string  # the "Set up your local machine" section describes how to get this string
path_to_train_data = your_path_to_train_data  # from the "Prepare your data" step above
path_to_test_data = your_path_to_test_data  # same step, run with key = 'test.protobuf'
job_name = None  # you can name your job; otherwise, sagemaker will auto-assign a job name

# bucket is the S3 bucket name defined in the "Prepare your data" step
output_path = 's3://{}/{}/factorization_machine_output'.format(bucket, output_prefix)
container = get_image_uri(boto3.Session(region_name='us-east-1').region_name, 'factorization-machines')

estimator = sagemaker.estimator.Estimator(container, role, train_instance_count=1, train_instance_type='ml.c4.xlarge', output_path=output_path, sagemaker_session=sagemaker.Session())
estimator.set_hyperparameters(feature_dim=feature.shape[1], predictor_type='binary_classifier', num_factors=100)

# run training job
estimator.fit({'train': path_to_train_data, 'test': path_to_test_data}, wait=False, job_name=job_name)
training_job_name = estimator.latest_training_job.job_name

Model hyperparameters can be changed by calling the set_hyperparameters method. If you are not sure what the optimal values are, you can try the Hyperparameter Tuner described later in this post.

The estimator’s fit method has a parameter wait, which is set to True by default. That means no code below this line will run until the fitting process (i.e., model training) has finished. We find this very inconvenient, especially if you want to submit multiple training jobs at the same time. Therefore we set wait=False, and you can check the job status either by looking at the AWS console (select SageMaker -> Training -> Training jobs), or by running the following code:


def get_training_job_status(training_job_name: str):
    # ask SageMaker for the current description of the training job
    job_info = boto3.client('sagemaker').describe_training_job(TrainingJobName=training_job_name)
    job_status = job_info['TrainingJobStatus']
    if job_status == 'Failed':
        message = job_info['FailureReason']
        print(f'Training failed with the following error: {message}')
    return job_status, job_info


job_status, job_info = get_training_job_status(training_job_name)
if job_status != 'Completed':
    print(f'Reminder: Training job {training_job_name} has not been completed. Cannot get the model or evaluate it.')
else:
    s3_model_artifact_path = job_info['ModelArtifacts']['S3ModelArtifacts']
    print('path to the output model artifacts:', s3_model_artifact_path)
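Because the job was submitted with wait=False, you may want a small polling loop that blocks only when you actually need the result. The sketch below is one way to do that; wait_for_training_job, its defaults, and the injected describe and sleep arguments are our own choices, made so the loop can be exercised without calling AWS. The status strings are the values that describe_training_job reports in TrainingJobStatus.

```python
import time

# Statuses after which a training job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, training_job_name, poll_seconds=60,
                          max_polls=120, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status.

    `describe` is any callable with the same keyword interface as
    boto3.client('sagemaker').describe_training_job.
    """
    for attempt in range(max_polls):
        job_info = describe(TrainingJobName=training_job_name)
        status = job_info["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status, job_info
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # still running, wait before asking again
    raise TimeoutError(f"{training_job_name} did not finish after {max_polls} polls")


# Example usage against real SageMaker:
#   import boto3
#   describe = boto3.client("sagemaker").describe_training_job
#   job_status, job_info = wait_for_training_job(describe, training_job_name)
```

Since each job is polled independently, you can submit several jobs first and then wait for them one by one.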

DOWNLOAD THE TRAINED MODEL

After SageMaker trains the model, a model artifact is stored in S3. You can download it and access the model coefficients locally.

The way to access the model differs from algorithm to algorithm. Here we only show how to access the model coefficients for SageMaker’s factorization machine model. Please note that this may NOT apply to other algorithms.

First, download the model artifact output from the training job:


local_name = 'model_fm.tar.gz'
# split s3://{bucket}/{key} into its bucket and key parts
bucket = s3_model_artifact_path.split('s3://')[1].split('/')[0]
key = s3_model_artifact_path.split(bucket + '/')[1]
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, local_name)
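The string splitting above works for typical paths, but it can pick the wrong key if the bucket name happens to appear again inside the key. As an alternative sketch, the standard library’s URL parser can do the split; split_s3_uri is a helper name we introduce here, not part of boto3 or sagemaker.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket; the path keeps a leading slash that is not part of the key
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    print(split_s3_uri("s3://my-bucket/my-prefix/output/model.tar.gz"))
```

With it, the download becomes bucket, key = split_s3_uri(s3_model_artifact_path) followed by the same download_file call.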

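Next, extract the model files from the artifact (run these commands in a notebook cell, or drop the leading ! to run them in a shell):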


!tar -zxvf model_fm.tar.gz
!unzip -o model_algo-1
!mv params model_fm-0000.params
!mv symbol.json model_fm-symbol.json
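If you are working outside a notebook, the same extraction can be done in plain Python. The sketch below assumes the artifact layout implied by the commands above (a tar.gz holding a zip file named model_algo-1, which in turn holds params and symbol.json); extract_fm_artifact is our own helper name.

```python
import io
import tarfile
import zipfile
from pathlib import Path


def extract_fm_artifact(tar_path, out_dir=".", prefix="model_fm"):
    """Unpack a factorization machine artifact into the file names MXNet expects."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read the inner zip straight out of the tarball, without writing it to disk.
    with tarfile.open(tar_path, "r:gz") as tar:
        member = tar.extractfile("model_algo-1")
        if member is None:
            raise ValueError("model_algo-1 is not a regular file in the artifact")
        inner_zip = io.BytesIO(member.read())

    # Same renaming as the mv commands: params -> <prefix>-0000.params, etc.
    targets = {"params": f"{prefix}-0000.params", "symbol.json": f"{prefix}-symbol.json"}
    written = []
    with zipfile.ZipFile(inner_zip) as archive:
        for name, new_name in targets.items():
            destination = out_dir / new_name
            destination.write_bytes(archive.read(name))
            written.append(destination)
    return written


# Example usage: extract_fm_artifact("model_fm.tar.gz")
```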

Finally, load the model into an mx.module.Module object, so that you can read the information stored inside the model.


import mxnet as mx

# load the files named model_fm-symbol.json and model_fm-0000.params (epoch 0)
mx_model = mx.module.Module.load("./model_fm", 0, False, label_names=['out_label'])

For a Factorization Machine model, mx_model._arg_params has three keys: w0_weight (the bias), w1_weight (the weights for the linear terms), and v (the weights for the reduced-dimension factorization space). You can look at their values to understand more about your model.
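To see how these three parameters fit together, here is a NumPy sketch of the standard factorization machine scoring formula: the bias, plus the linear terms, plus the pairwise interactions computed through the factor matrix. This is the textbook formula, not code taken from SageMaker. The array shapes (w1_weight as feature_dim x 1, v as feature_dim x num_factors) and the logistic link for the binary classifier are assumptions you should check against your own model.

```python
import numpy as np


def fm_score(X, w0, w1, V):
    """Raw factorization machine score for each row of X.

    score = w0 + X.w1 + 0.5 * sum_f [ (X V)_f^2 - (X^2 V^2)_f ]
    The last term equals the sum over all feature pairs i < j of <V_i, V_j> * x_i * x_j.
    """
    X = np.asarray(X, dtype=np.float64)
    w1 = np.asarray(w1, dtype=np.float64).reshape(-1)
    V = np.asarray(V, dtype=np.float64)
    bias = float(np.asarray(w0).reshape(-1)[0])
    linear = X @ w1
    xv = X @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return bias + linear + pairwise


def fm_probability(X, w0, w1, V):
    """Assumed logistic link: squash the raw score into a probability."""
    return 1.0 / (1.0 + np.exp(-fm_score(X, w0, w1, V)))


# Example usage with the loaded model (assumed shapes, see above):
#   params = mx_model._arg_params
#   scores = fm_score(x_array, params['w0_weight'].asnumpy(),
#                     params['w1_weight'].asnumpy(), params['v'].asnumpy())
```

Comparing these scores with the output of the MXNet module on a few rows is a quick way to confirm that you are reading the coefficients correctly.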

MAKE PREDICTIONS LOCALLY

After you have loaded the model locally, you can apply it to your test data and make predictions without paying AWS.

If you have a small amount of data, you can make predictions by running the make_prediction_dense function below.


import numpy as np


def make_prediction_dense(model: mx.module.Module, x_array: np.ndarray, batch_size: int = 100):
    # wrap the dense array in an iterator that feeds the model batch by batch
    data_iter = mx.io.NDArrayIter(data=x_array, batch_size=batch_size)
    model.bind(data_shapes=data_iter.provide_data)
    prediction = model.predict(data_iter).asnumpy().flatten()
    return model, prediction

If you have a large amount of data, make_prediction_dense can take a long time to finish. In that case, we suggest converting your input data x_array to a SciPy sparse matrix before running the prediction.

Now you have trained a factorization machine model with SageMaker and can make predictions with it! The model was trained with one specific set of hyperparameter values. You may ask: how do I know which hyperparameter values are optimal? SageMaker’s Hyperparameter Tuner can help you find the answer.

HYPERPARAMETER TUNER

In many cases, you do not know the optimal values for the model hyperparameters, so you would like to tune the model. SageMaker’s hyperparameter tuner uses Bayesian optimization to search for them.

Unless you use CategoricalParameter to define the hyperparameter range, the Hyperparameter Tuner does not explore every possible value within the defined range; instead, it focuses its training effort on the most promising regions. At each iteration, the value to test is chosen based on everything the tuner has learned about the problem so far. This process is stochastic, which makes it very helpful for tuning complex models, where it is impossible to explore all the possible combinations. On the other hand, because it is stochastic, the tuner may fail to converge on the best answer, even if the specified ranges are correct.

We suggest you take some time to explore the hyperparameter ranges, and gradually shrink them so that the hyperparameter tuner is more likely to converge around the best answer faster. In addition, there is a small trade-off between max_parallel_jobs and the quality of the final model. A larger max_parallel_jobs shortens the overall tuning time, but a smaller max_parallel_jobs will probably produce a slightly better result, since each new job can draw on the results of more completed jobs.

If you’d like to dig further, you can use a sample notebook to visualize how the objective metric and the hyperparameter values change over time. It helps you understand whether the hyperparameter tuner has converged. With this information, you can adjust your hyperparameter ranges and max_jobs accordingly.
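For reference, setting up a tuning job from your local machine can look roughly like the sketch below. It uses the HyperparameterTuner, ContinuousParameter, and IntegerParameter classes from sagemaker.tuner. The search space, the job counts, and the objective metric name are illustrative assumptions rather than recommendations, so check which metrics and tunable hyperparameters the factorization machines algorithm supports before relying on them. build_tuner is our own helper name.

```python
# Hypothetical search space: name -> (kind, low, high). Adjust to your problem.
SEARCH_SPACE = {
    "factors_lr": ("continuous", 1e-4, 1e-1),
    "mini_batch_size": ("integer", 100, 1000),
}


def build_tuner(estimator, max_jobs=20, max_parallel_jobs=2):
    """Wrap an already configured estimator in a Bayesian hyperparameter tuner."""
    # Imported lazily so that this module loads even where sagemaker is not installed.
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

    makers = {"continuous": ContinuousParameter, "integer": IntegerParameter}
    ranges = {
        name: makers[kind](low, high)
        for name, (kind, low, high) in SEARCH_SPACE.items()
    }
    return HyperparameterTuner(
        estimator,
        objective_metric_name="test:binary_classification_accuracy",  # assumed metric name
        hyperparameter_ranges=ranges,
        objective_type="Maximize",
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )


# Example usage, reusing the estimator and data paths from the training step:
#   tuner = build_tuner(estimator)
#   tuner.fit({'train': path_to_train_data, 'test': path_to_test_data})
```

Hyperparameters that are not listed in the search space keep the values you set with set_hyperparameters on the estimator.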

CONCLUSION

Overall, SageMaker is a very powerful machine learning service. It allows you to train a complex model on a large dataset, and deploy the model without worrying about the messy infrastructure details. SageMaker provides lots of best-in-class built-in algorithms, and allows you to bring your own model. Besides, you can use machine learning frameworks such as Scikit-learn and TensorFlow with SageMaker. There are many sample notebooks, so you can learn by doing.

That being said, we think there is still room to improve:

1) Difficult to troubleshoot. Because SageMaker is relatively new, you can hardly find solutions to your questions on places like Stack Overflow. In our experience, these are the best resources for troubleshooting:

  • the Python SDK repo, where you can read the source code to find information that is not described in the SageMaker documentation
  • the SageMaker forum, where you might find answers to your questions; you can also post your own questions, and AWS people will typically respond within a day or two
  • the AWS support center, where you can create a ticket, and the support team will answer your question

2) Incomplete documentation. For example, when we were using SageMaker, the documentation did not cover how to extract the model coefficients, or how to set up the hyperparameter values for tuning. We found the answers by looking into the sample notebooks, the AWS blog, and the SageMaker forum.

3) Not flexible enough. For example, when using SageMaker’s factorization machines with hyperparameter tuning, there are only a few objective metrics to choose from. It is also still unclear how to run cross-validation with SageMaker’s built-in algorithms.

SageMaker has many functionalities, and this post is based on initial experimentation only. We plan to continue exploring other areas of SageMaker, such as how to bring our own model, and how to use Scikit-learn and Spark in SageMaker.
