Indexing images using h5py for machine learning purposes

Dealing with image datasets can get a little tricky: given their size, a dataset too big to fit into memory is a common sight. One way to make working with them more pleasant is to index them in an HDF5 file, which gives us a number of advantages compared to dealing with each file one by one. To name a few:

  • Reading from HDF5 is extremely fast.
  • We can treat the data much like we would treat a numpy ndarray.
  • It is stored entirely on a hard drive, which means you are not restricted by system memory.
  • Sharing, uploading, and moving the data is easier, since a full dataset fits in just one file.

The h5py Python package provides a nice API for working with HDF5. We will use it to create a simple image indexer class that optimizes a few things not implemented in h5py.

One important concept when working with HDF5 and indexing images is buffering. Writing to an HDF5 file is much faster using big batches of images than writing them one by one.
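To illustrate the difference, here is a toy sketch (the file name and shapes are made up for the example, not taken from the original post):

```python
import h5py
import numpy as np

images = np.random.randint(0, 256, size=(256, 64, 64, 3), dtype=np.uint8)

with h5py.File('buffer_demo.h5', 'w') as f:
    dset = f.create_dataset('images', shape=images.shape, dtype=np.uint8)

    # Slow: one HDF5 write call per image.
    for i, img in enumerate(images):
        dset[i] = img

    # Fast: a single write call for the whole batch.
    dset[:] = images
```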

h5py quick-starter
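The original snippet is not reproduced here, so below is a minimal sketch of the same idea; the file name, dataset name, and image shape are illustrative assumptions:

```python
import h5py
import numpy as np

# Create an HDF5 file with one image dataset that can grow along axis 0 and axis 3.
with h5py.File('images.h5', 'w') as f:
    dset = f.create_dataset(
        'images',
        shape=(10, 224, 224, 3),          # start with room for 10 RGB images
        maxshape=(None, 224, 224, None),  # unlimited along axis 0 and axis 3
        dtype=np.uint8,
    )
    print(dset.shape)
    print(dset.maxshape)
```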

The above code will print:
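```
(10, 224, 224, 3)
(None, 224, 224, None)
```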

The most basic structure in an h5py file is a dataset, and we need to specify its shape when we create it. This step should be pretty self-explanatory. There is a small catch though: if you don’t pass the maxshape argument, you will not be able to resize the dataset in the future. Because of some performance optimizations, this is fine if you are 100% sure you will never need to extend it.

In our case, we set maxshape to None for axis=0 and axis=3, which means we will be able to resize the dataset without any limit along those axes: in other words, we can add more images and/or add more channels to each image.
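Continuing the sketch above, growing the dataset along axis 0 could look like this:

```python
import h5py

# Reopen the file from the sketch above and append room for 5 more images.
with h5py.File('images.h5', 'a') as f:
    dset = f['images']
    dset.resize(dset.shape[0] + 5, axis=0)
    print(dset.shape)  # now (15, 224, 224, 3)
```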

Let’s create a simple class that implements the indexer with buffering:
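The original listing is not shown here; the following is a sketch of how such an indexer could look, assuming a fixed image shape and a simple in-memory buffer (the class and parameter names are illustrative):

```python
import h5py
import numpy as np

class Hdf5ImageIndexer:
    def __init__(self, path, image_shape=(224, 224, 3), buffer_size=128):
        self.path = path
        self.image_shape = image_shape
        self.buffer_size = buffer_size
        self._buffer = []

    def __enter__(self):
        self._file = h5py.File(self.path, 'w')
        # Start empty along axis 0 and keep it unlimited so we can append later.
        self._dataset = self._file.create_dataset(
            'images',
            shape=(0, *self.image_shape),
            maxshape=(None, *self.image_shape),
            dtype=np.uint8,
        )
        return self

    def add(self, image):
        # Accumulate images in memory and flush them in one batch,
        # which is much faster than writing each image individually.
        self._buffer.append(image)
        if len(self._buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        if not self._buffer:
            return
        batch = np.stack(self._buffer)
        start = self._dataset.shape[0]
        self._dataset.resize(start + len(batch), axis=0)
        self._dataset[start:] = batch
        self._buffer = []

    def __exit__(self, exc_type, exc_value, traceback):
        # Write whatever is left in the buffer before closing the file.
        self._flush()
        self._file.close()
```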

Usage

Since we implemented the __enter__ and __exit__ methods, we can use this class as a context manager. This comes in handy for closing the file and writing the last buffers at context exit.

Example usage:
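A hypothetical usage sketch, matching the indexer class sketched above:

```python
import numpy as np

# Fill the index with randomly generated images (stand-ins for real data).
with Hdf5ImageIndexer('dataset.h5', image_shape=(224, 224, 3), buffer_size=64) as indexer:
    for _ in range(1000):
        image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
        indexer.add(image)
# On exit, the remaining buffered images are flushed and the file is closed.
```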

In the upcoming post we will see how to use HDF5 datasets with Keras efficiently.

Posted by jakub.cieslik
