Amazon SageMaker Processing Resources and Notes

A collection of useful links and notes about Amazon SageMaker Processing, gathered as I learn. Contents are subject to change.

Resources

References

AWS Blog Posts

SageMaker Developer Guide

AWS Samples GitHub repo

YouTube

Amazon SageMaker Processing Notes

Parallel Processing in SageMaker

To process data in parallel using [a container] on Amazon SageMaker Processing, you can shard input objects by S3 key by setting s3_data_distribution_type='ShardedByS3Key' inside a ProcessingInput so that each instance receives about the same number of input objects.
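As a minimal sketch of what that looks like with the SageMaker Python SDK: the image URI, role ARN, bucket names, and script name below are all placeholders, and the processor class/instance settings are illustrative assumptions, not a prescription.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# Placeholder values -- substitute your own image URI, role ARN, and S3 paths.
processor = ScriptProcessor(
    image_uri="<your-processing-image-uri>",
    command=["python3"],
    role="<your-sagemaker-execution-role-arn>",
    instance_count=4,                      # input objects are sharded across 4 instances
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="process.py",                     # hypothetical processing script
    inputs=[
        ProcessingInput(
            source="s3://<your-bucket>/input/",
            destination="/opt/ml/processing/input",
            # Each instance receives roughly 1/4 of the input objects by S3 key.
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://<your-bucket>/output/",
        )
    ],
)
```

Since this is a configuration sketch that launches a real job against AWS, it is not runnable as-is; swap in real values before use.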

From the S3DataSource API reference, S3DataDistributionType parameter:

If you want Amazon SageMaker to replicate the entire dataset on each ML compute instance that is launched for model training, specify FullyReplicated.

If you want Amazon SageMaker to replicate a subset of data on each ML compute instance that is launched for model training, specify ShardedByS3Key. If there are n ML compute instances launched for a training job, each instance gets approximately 1/n of the number of S3 objects. In this case, model training on each machine uses only the subset of training data.
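The "approximately 1/n" behavior can be illustrated with a small stand-in: SageMaker's actual key-assignment scheme is internal, so the round-robin split below is only an assumption used to show the resulting shard sizes, not the service's real algorithm.

```python
def shard_by_s3_key(keys, num_instances):
    """Deal a list of S3 object keys across instances, round-robin.

    Stand-in for ShardedByS3Key: each instance ends up with
    approximately 1/num_instances of the objects.
    """
    shards = [[] for _ in range(num_instances)]
    for i, key in enumerate(sorted(keys)):
        shards[i % num_instances].append(key)
    return shards

# 10 hypothetical input objects split across 3 instances.
keys = [f"data/part-{i:04d}.csv" for i in range(10)]
shards = shard_by_s3_key(keys, 3)
for i, shard in enumerate(shards):
    print(f"instance {i}: {len(shard)} objects")
# Shard sizes are 4, 3, and 3 -- each roughly 1/3 of the 10 objects.
```

Note what happens if `num_instances` exceeds the number of keys: some shards come back empty, which mirrors the warning below about idle nodes.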

Don't launch more ML compute instances than there are S3 objects. If you do, some nodes won't receive any data, and you will pay for nodes that do no work. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.