Amazon SageMaker Processing Resources and Notes
Resources
References
AWS Blog Posts
- Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks
- Bringing your own R environment to Amazon SageMaker Studio
- Performing simulations at scale with Amazon SageMaker Processing and R on RStudio
  - This post is especially interesting because it includes a section on adapting an R script for SageMaker Processing.
SageMaker Developer Guide
AWS Samples GitHub repo
YouTube
Amazon SageMaker Processing Notes
Parallel Processing in SageMaker
To process data in parallel using [a container] on Amazon SageMaker Processing, you can shard the input objects by S3 key by setting `s3_data_distribution_type='ShardedByS3Key'` inside a `ProcessingInput`, so that each instance receives approximately the same number of input objects.
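As a minimal sketch of what that looks like with the SageMaker Python SDK: the image URI, IAM role, S3 paths, and script name below are placeholders, and the only detail taken from the note above is `s3_data_distribution_type='ShardedByS3Key'`.

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Hypothetical values: substitute your own image, role, and S3 locations.
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-processing-image:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/MySageMakerRole",
    instance_count=4,                 # input objects are split across these instances
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="process.py",                # hypothetical processing script
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/input/",
            destination="/opt/ml/processing/input",
            # Shard input objects by S3 key so each of the 4 instances
            # receives roughly 1/4 of the objects instead of a full copy.
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/output/",
        )
    ],
)
```

Each instance then sees only its shard under `/opt/ml/processing/input`, so the processing script itself needs no sharding logic.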
From the `S3DataSource` API reference, on the `S3DataDistributionType` parameter:
If you want Amazon SageMaker to replicate the entire dataset on each ML compute instance that is launched for model training, specify `FullyReplicated`.

If you want Amazon SageMaker to replicate a subset of data on each ML compute instance that is launched for model training, specify `ShardedByS3Key`. If there are n ML compute instances launched for a training job, each instance gets approximately 1/n of the number of S3 objects. In this case, model training on each machine uses only the subset of training data.

Don't choose more ML compute instances for training than available S3 objects. If you do, some nodes won't get any data and you will pay for nodes that aren't getting any training data. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.
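To make the 1/n arithmetic concrete, here is a plain-Python illustration. The keys are hypothetical, and the round-robin split is only an approximation for counting purposes; the actual key-to-instance assignment is internal to SageMaker.

```python
# Hypothetical S3 keys, for illustration only.
keys = [f"s3://my-bucket/input/part-{i:05d}.csv" for i in range(10)]
instance_count = 3

# Approximate the shard sizes: 10 objects over 3 instances -> 4, 3, 3.
shards = [keys[i::instance_count] for i in range(instance_count)]
for i, shard in enumerate(shards):
    print(f"instance {i}: {len(shard)} objects")

# With instance_count > len(keys), some instances would receive no
# objects at all, which is the situation the docs warn against.
```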