Using boto3 and Keras to checkpoint deep learning models on AWS S3


For starters, let me explain why I’m writing this post. Although the boto3 library is extremely powerful, it’s also one of those packages I can’t quite wrap my head around: I keep googling for ways to do simple stuff ALL THE TIME.

I think one of the reasons is that boto3 offers multiple ways to perform basically the same task, which makes it relatively hard to use.

But enough complaining. Let’s write a simple wrapper around boto3 to make common S3 operations easier and learn to use the library more effectively.

To apply it in a real-world scenario, we will use the wrapper to create a custom Keras callback whose task is to upload model checkpoints to S3 every time the model improves.

Common S3 operations

The code above should be pretty self-explanatory, and hopefully useful. Let’s get to a more interesting topic – checkpointing deep learning models on S3.

There are two main reasons why checkpointing is extremely important:

  • You want to evaluate/use your model at different learning stages
  • You don’t want to lose a model that was trained for hours, days, or weeks

One might ask: why S3? When working on a remote machine, let’s say a GPU spot instance, we can attach an Amazon EBS volume and persist everything there. And that’s right (and we should).
But there are also scenarios where it’s handy to have the model on S3, for example when we need to download it to our local machine on a regular basis.

We can implement checkpointing to S3 in two ways. One would be to edit/override the default ModelCheckpoint and add the S3 functionality there. Or we can use a hackier and more fun way – implement a new callback and chain it together with the original ModelCheckpoint.

Let’s have a look at how this can be achieved:

As you can see, we only need to implement one trigger (on_epoch_end), where we check whether the file’s timestamp has changed (to avoid re-uploading the same version when the model didn’t improve).
Since we inherit from our S3Wrap class, we can use its handy upload function to send the model to S3.

Let’s put things together into one working example, where we train a model and, every time the ‘acc’ metric improves, store it on the host AND upload the model to a given S3 bucket.

Of course, this is just a crude implementation, and there are scenarios where using a solution like this would be a really bad idea. That’s when:
– You train big models (in the gigabyte range) on a non-AWS machine, where the upload could take too much time.
– You train models that train very quickly, so uploading every epoch would lead to a significant slowdown.

To combat the above issues, you might consider running the upload in a separate thread.
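One minimal way to sketch that, assuming a boto3-style client with an upload_file method, is a fire-and-forget helper:

```python
import threading


def upload_async(s3_client, local_path, bucket, key):
    """Start an S3 upload in a background thread so training isn't blocked.

    `s3_client` is anything exposing upload_file(path, bucket, key),
    e.g. boto3.client("s3").
    """
    thread = threading.Thread(
        target=s3_client.upload_file,
        args=(local_path, bucket, key),
        daemon=True,  # don't keep the process alive just for an upload
    )
    thread.start()
    return thread
```

Inside on_epoch_end you would call upload_async instead of the blocking upload, keep a reference to the returned thread, and join() it at the end of training if you want to be sure the last checkpoint actually made it to S3.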

Posted by jakub.cieslik
