Jean de Dieu Nyandwi

Handy Notes on the Art of Data Augmentation!

Let's say you want to build a real-world image classifier, but you can only get 20 images. You figure you can collect more and reach 100 in total. With more time you could get 150, 200, or 250, but the process is slow and daunting. What can you do to get enough training images for your classifier?

As ML succeeds in solving complex problems, the need for enough data grows, and it is rare to have it from the start. The good news is that, with current ML techniques and best practices, we can synthesize artificial data that can potentially boost accuracy or other desired metrics. The technique of creating new data from existing data is called data augmentation.

In this post, I will talk about data augmentation, specifically the following points:

  • When to use data augmentation?
  • Augmenting images
  • Augmenting texts
  • Augmenting sounds and videos
  • Augmenting structured data
  • Conclusion

When to Use Data Augmentation?

Even though data augmentation can increase the amount of data and improve its quality, it is important to underscore that it is not guaranteed to work.

Here are two scenarios in which you can expect it to help:

  • When you have performed error analysis and found that the model is not doing well because of insufficient data, and there is clear evidence that adding more data will help.
  • When the model is not doing well on one specific class because of data imbalance, and adding more data to that class can help the model do well on all classes.

As you can see, more data helps only where it is needed. Take the second scenario as an example. Suppose you are building a car model classifier that does well on car model A because it dominates the dataset, and poorly on car model B because it has few images in the dataset. Adding more data for car model A will not improve the model. Instead, you should create more images of car model B.
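To make the car example concrete, here is a minimal sketch that counts how many augmented images each class would need to catch up with the largest class. The class names and the 90/10 split are made-up numbers for illustration, and matching the largest class exactly is just one possible target.

```python
from collections import Counter


def augmentation_plan(labels):
    """Return how many extra samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical dataset: 90 images of car model A and 10 of car model B.
labels = ["model_A"] * 90 + ["model_B"] * 10

plan = augmentation_plan(labels)
print(plan)  # {'model_A': 0, 'model_B': 80}
```

The plan says to leave model A alone and generate images only for model B, which matches the reasoning above.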

It is also important to create realistic data, realistic enough that you yourself can recognize it without difficulty. Take a cat/dog classifier as an example. If you create an image like the one below, it still looks like a cat, but your classifier is unlikely to be tested on cats in such positions, so adding it to the training set may confuse the model and make it miss the real cat scenarios. This will not necessarily happen, but the model can struggle to classify cats in these positions, especially if they are not well represented in the training images.

Cat.png A real cat in an uncommon position

Augmenting Images

Computer vision is probably one of the areas in which data augmentation works best. This is for two reasons. First, most real-world projects do not have enough images, and collecting them can be expensive. Second, it has become the norm that artificially creating new images from existing training images can drastically boost accuracy.

There are several well-established techniques for creating new images. These include flipping, cropping, and rotating the images, changing contrast and colour, and adding noise.

The image below summarizes these image augmentation techniques.
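Here is a rough sketch of a few of these transformations. To keep it runnable without any dependencies, it assumes a tiny grayscale image stored as a nested list of pixel values between 0 and 255. In a real project you would more likely use an image library or your framework's augmentation utilities. The function names are my own choices for this illustration.

```python
import random


def _clip(value):
    """Round to an integer and keep the pixel value inside 0..255."""
    return max(0, min(255, int(round(value))))


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def crop(img, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def adjust_contrast(img, factor, midpoint=128):
    """Push pixels away from (factor > 1) or toward (factor < 1) the midpoint."""
    return [[_clip(midpoint + factor * (p - midpoint)) for p in row] for row in img]


def add_noise(img, sigma, rng):
    """Add Gaussian noise with standard deviation sigma to every pixel."""
    return [[_clip(p + rng.gauss(0, sigma)) for p in row] for row in img]


rng = random.Random(0)  # seeded so the noise is reproducible
image = [[0, 50, 100],
         [150, 200, 250]]

flipped = flip_horizontal(image)      # [[100, 50, 0], [250, 200, 150]]
rotated = rotate_90_clockwise(image)  # [[150, 0], [200, 50], [250, 100]]
cropped = crop(image, top=0, left=1, height=2, width=2)  # [[50, 100], [200, 250]]
contrasted = adjust_contrast(image, factor=1.5)
noisy = add_noise(image, sigma=10, rng=rng)
```

Each function returns a new image and leaves the original untouched, so one training image can produce several augmented copies.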

MLE.png Image augmentation, credit: MLE & Alfonso Escalante

Augmenting Texts

When augmenting text, there are several options. The first is to replace some randomly chosen words with their synonyms. For example, the sentence "A dog is sitting in the yard." is similar in meaning to:

  • A dog is sitting in the garden.
  • A dog is sitting in the backyard.
  • A canine is sitting in the backyard.

Another option is to use words that have a broader meaning (a.k.a. hypernyms). For example, a dog is an animal or a pet. Back to our example, the following sentences are close in meaning to the original sentence:

  • An animal is sitting in the garden.
  • A pet is sitting in the yard.
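Both word-replacement options can be sketched with the same small function. The two dictionaries below are hand-made for this one sentence and are only an assumption for illustration. A real project would take synonyms and hypernyms from a proper lexical resource. The article fix is also a crude heuristic that only looks at the first letter of the next word.

```python
import random

# Hypothetical, hand-made lexicons covering only the example sentence.
SYNONYMS = {"dog": ["canine"], "yard": ["garden", "backyard"]}
HYPERNYMS = {"dog": ["animal", "pet"]}


def fix_articles(words):
    """Crude heuristic: use 'an' before a word starting with a vowel, else 'a'."""
    fixed = list(words)
    for i in range(len(fixed) - 1):
        if fixed[i].lower() in ("a", "an"):
            article = "an" if fixed[i + 1][0].lower() in "aeiou" else "a"
            fixed[i] = article.capitalize() if fixed[i][0].isupper() else article
    return fixed


def replace_words(sentence, lexicon, rng, prob=1.0):
    """Replace each word found in the lexicon with a random alternative."""
    words = sentence.rstrip(".").split()
    replaced = []
    for word in words:
        options = lexicon.get(word.lower())
        if options and rng.random() < prob:
            replaced.append(rng.choice(options))
        else:
            replaced.append(word)
    return " ".join(fix_articles(replaced)) + "."


rng = random.Random(42)  # seeded so runs are reproducible
sentence = "A dog is sitting in the yard."

synonym_version = replace_words(sentence, SYNONYMS, rng)
hypernym_version = replace_words(sentence, HYPERNYMS, rng)
print(synonym_version)
print(hypernym_version)
```

Lowering `prob` replaces only some of the matching words, which gives more varied sentences from the same lexicon.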

The last option is to use back translation. When translating from one language to another, the context of the sentence is given higher priority than the meaning of individual words. This means that if you translate a sentence into another language and then translate it back into the first language, there is a chance that the words will be different. If they are, you have a new sentence with a similar meaning. Give it the same label as the original sentence.
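The back translation loop itself is short. The sketch below assumes you have two translation functions available, one for each direction. Here they are stand-ins backed by tiny lookup tables, not a real translation system, and the label "dog" is also made up for the example.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Translate into a pivot language and back into the original language."""
    return from_pivot(to_pivot(sentence))


def augment_with_back_translation(examples, to_pivot, from_pivot):
    """Return new (text, label) pairs whose text differs from the original."""
    augmented = []
    for text, label in examples:
        paraphrase = back_translate(text, to_pivot, from_pivot)
        if paraphrase != text:  # keep only sentences that actually changed
            augmented.append((paraphrase, label))
    return augmented


# Stand-in translators (assumption): lookup tables instead of a real model.
EN_TO_FR = {"A dog is sitting in the yard.": "Un chien est assis dans le jardin."}
FR_TO_EN = {"Un chien est assis dans le jardin.": "A dog is sitting in the garden."}


def english_to_french(sentence):
    return EN_TO_FR.get(sentence, sentence)


def french_to_english(sentence):
    return FR_TO_EN.get(sentence, sentence)


examples = [("A dog is sitting in the yard.", "dog")]
new_examples = augment_with_back_translation(
    examples, english_to_french, french_to_english
)
print(new_examples)  # [('A dog is sitting in the garden.', 'dog')]
```

Passing the translators in as functions means the same loop works with whatever translation tool you end up using.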

The ideas discussed in this section are borrowed from the book Machine Learning Engineering by Andriy Burkov.

Augmenting Sounds and Videos

When working with sounds, you can take the existing sounds and add noise, slow them down, speed them up to the point where the message can still be heard, shift them in time, narrow or widen the voice, and so on.

Sound.png Adding noise to the sound to produce a new sound
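As a rough sketch of three of these ideas, the code below treats a sound as a plain list of samples. The 440 Hz test tone, the 8000 Hz sample rate, and the function names are assumptions for illustration. A real project would load audio files and probably use an audio library instead.

```python
import math
import random


def add_noise(samples, sigma, rng):
    """Add Gaussian noise with standard deviation sigma to every sample."""
    return [s + rng.gauss(0, sigma) for s in samples]


def time_shift(samples, shift):
    """Delay the sound by `shift` samples, padding the start with silence."""
    if shift <= 0:
        return list(samples)
    shift = min(shift, len(samples))
    return [0.0] * shift + samples[:len(samples) - shift]


def change_speed(samples, rate):
    """Resample with linear interpolation.

    rate > 1 makes the sound faster (shorter), rate < 1 makes it slower.
    Note that this naive resampling also changes the pitch.
    """
    n_out = int(len(samples) / rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


rng = random.Random(0)  # seeded so the noise is reproducible
sample_rate = 8000
# A 0.1 second, 440 Hz test tone (800 samples).
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]

noisy = add_noise(tone, sigma=0.05, rng=rng)
shifted = time_shift(tone, shift=100)
fast = change_speed(tone, rate=2.0)
slow = change_speed(tone, rate=0.5)
print(len(tone), len(fast), len(slow))  # 800 400 1600
```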

Videos can be augmented in much the same way as sounds. You can slow them down or speed them up, shift them in time, or split the video and apply transformations to individual clips. For example, some clips might have reduced brightness, different colours, etc.

There is no limit to the types of transformations you can try, but the result must be realistic, as mentioned at the beginning of this article.
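The same ideas carry over to video if you think of it as a list of frames. The sketch below uses tiny 2x2 grayscale frames stored as nested lists, which is an assumption to keep the example dependency-free. The function names are made up for this illustration.

```python
def speed_up(frames, step):
    """Keep every `step`-th frame, so the video plays `step` times faster."""
    return frames[::step]


def split_into_clips(frames, clip_length):
    """Cut the video into consecutive clips of at most clip_length frames."""
    return [frames[i:i + clip_length] for i in range(0, len(frames), clip_length)]


def scale_brightness(frame, factor):
    """Multiply every pixel by factor and keep the result inside 0..255."""
    return [[max(0, min(255, int(p * factor))) for p in row] for row in frame]


# Hypothetical video: 6 frames of 2x2 pixels, each frame brighter than the last.
video = [[[40 * i, 40 * i], [40 * i, 40 * i]] for i in range(6)]

faster = speed_up(video, step=2)                 # frames 0, 2 and 4
clips = split_into_clips(video, clip_length=3)   # two clips of 3 frames

# Reduce the brightness of the second clip only, then join the clips again.
darker_clip = [scale_brightness(frame, 0.5) for frame in clips[1]]
augmented_video = clips[0] + darker_clip
print(len(faster), len(clips), len(augmented_video))  # 3 2 6
```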

Augmenting Structured Data

Structured data are ordinary data in tabular form, such as customer information (names, age, location, etc.). Unlike other types of data, structured data are hard to augment. Take customer data as an example: you would not want to create fake customers. You can, but it will not be a good move!

Instead of creating new data points in this case, we can make the most of conventional feature engineering. Feature engineering is a creative task that requires some domain knowledge, but if there is an obvious way to create new features, that can be great.

Let's take a simple example: a date feature in day-month-year format can give three individual features: day, month, and year. You can also combine features, for example by multiplying or dividing two of them.

When creating more features, the goal is to create features that have high predictive power. Often you will not know before training and evaluating the model, but sometimes it is obvious. For example, if you are predicting whether someone will want to upgrade their old smartphone to the newest model before showing them ads, it is obvious that their height will not matter in that decision.
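Here is a small sketch of both ideas on a single record. The column names (`signup_date`, `total_spent`, `num_orders`) and the values are hypothetical. With a real table you would apply the same steps to every row, typically with a dataframe library.

```python
from datetime import datetime


def expand_date(row, column="signup_date", fmt="%d-%m-%Y"):
    """Split a day-month-year string into separate day, month and year features."""
    parsed = datetime.strptime(row[column], fmt)
    new_row = dict(row)
    new_row[column + "_day"] = parsed.day
    new_row[column + "_month"] = parsed.month
    new_row[column + "_year"] = parsed.year
    return new_row


def add_ratio(row, numerator, denominator, name):
    """Create a new feature by dividing one feature by another."""
    new_row = dict(row)
    denom = row[denominator]
    # Avoid dividing by zero; mark the feature as missing instead.
    new_row[name] = row[numerator] / denom if denom else None
    return new_row


# Hypothetical customer record.
customer = {"signup_date": "25-12-2021", "total_spent": 300.0, "num_orders": 4}

features = add_ratio(
    expand_date(customer), "total_spent", "num_orders", "avg_order_value"
)
print(features["signup_date_day"], features["signup_date_month"],
      features["signup_date_year"], features["avg_order_value"])
# 25 12 2021 75.0
```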


Conclusion

In many cases, data augmentation improves data quality and model performance when done well.

Further Learning and References

I actively share content around ML ideas, tips and best practices on Twitter. If you would like to connect, you're welcome. Every day, I share one or two things that you will find insightful.
