Amazon SageMaker Processing Resources and Notes

A collection of useful links and notes about Amazon SageMaker Processing, gathered as I learn. Contents are subject to change.

Resources

References

AWS Blog Posts

SageMaker Developer Guide

AWS Samples GitHub repo

YouTube

Amazon SageMaker Processing Notes

Parallel Processing in SageMaker

To process data in parallel using [a container] on Amazon SageMaker Processing, you can shard input objects by S3 key by setting s3_data_distribution_type='ShardedByS3Key' inside a ProcessingInput so that each instance receives about the same number of input objects.
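As a minimal sketch of what that looks like with the SageMaker Python SDK: the image URI, role ARN, bucket names, and script name below are all placeholders, and the processor class/instance settings are illustrative assumptions, not a prescription.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# Placeholder values -- substitute your own image URI, role ARN, and S3 paths.
processor = ScriptProcessor(
    image_uri="<your-processing-image-uri>",
    command=["python3"],
    role="<your-sagemaker-execution-role-arn>",
    instance_count=4,                      # input objects are sharded across 4 instances
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="process.py",                     # hypothetical processing script
    inputs=[
        ProcessingInput(
            source="s3://<your-bucket>/input/",
            destination="/opt/ml/processing/input",
            # Each instance receives roughly 1/4 of the input objects by S3 key.
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://<your-bucket>/output/",
        )
    ],
)
```

Since this is a configuration sketch that launches a real job against AWS, it is not runnable as-is; swap in real values before use.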

From the S3DataSource API reference, S3DataDistributionType parameter:

If you want Amazon SageMaker to replicate the entire dataset on each ML compute instance that is launched for model training, specify FullyReplicated.

If you want Amazon SageMaker to replicate a subset of data on each ML compute instance that is launched for model training, specify ShardedByS3Key. If there are n ML compute instances launched for a training job, each instance gets approximately 1/n of the number of S3 objects. In this case, model training on each machine uses only the subset of training data.
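The "approximately 1/n" behavior can be illustrated with a small stand-in: SageMaker's actual key-assignment scheme is internal, so the round-robin split below is only an assumption used to show the resulting shard sizes, not the service's real algorithm.

```python
def shard_by_s3_key(keys, num_instances):
    """Deal a list of S3 object keys across instances, round-robin.

    Stand-in for ShardedByS3Key: each instance ends up with
    approximately 1/num_instances of the objects.
    """
    shards = [[] for _ in range(num_instances)]
    for i, key in enumerate(sorted(keys)):
        shards[i % num_instances].append(key)
    return shards

# 10 hypothetical input objects split across 3 instances.
keys = [f"data/part-{i:04d}.csv" for i in range(10)]
shards = shard_by_s3_key(keys, 3)
for i, shard in enumerate(shards):
    print(f"instance {i}: {len(shard)} objects")
# Shard sizes are 4, 3, and 3 -- each roughly 1/3 of the 10 objects.
```

Note what happens if `num_instances` exceeds the number of keys: some shards come back empty, which mirrors the warning below about idle nodes.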

Don't launch more ML compute instances than there are S3 objects. If you do, some nodes won't receive any data, and you will pay for nodes that do no work. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.