An article published in MIT (Massachusetts Institute of Technology) News on the 15th of March 2022 explains how MIT researchers showed that synthetic data can sometimes be more effective than real-world data for training machine-learning models for image classification. In this blog post we won’t explain what machine learning is, but you can read all about it in the previous article Introduction to Automated Machine Learning – Sparkd.AI.
A cheaper and more effective solution to train self-driving cars.
While this discovery may seem relevant mainly to the academic world, it could have a dramatic impact on everyday road travel as we know it. A particular machine-learning technique called deep learning is at the core of the engineering behind the self-driving cars currently being developed by many big companies in the automotive and technology industries, including Google, Volvo, Tesla and Audi. Self-driving vehicles are already being tested, but they are not yet deemed suitable for sale to the public because of safety concerns. The main challenge is that autonomous cars have to be trained to react to numerous circumstances that are difficult to foresee, and therefore difficult to capture, in the real world. In addition, even when collecting real-world data on such circumstances is possible, it is usually very expensive. This is where synthetic data could make a difference in training self-driving cars.
What is the difference between synthetic data and real-world data?
Synthetic data is manufactured digitally by applying rules, statistical models, simulations or other techniques. Using this kind of data has many advantages, particularly in computer vision. Synthetic data can be more effective and efficient than real-world data (images and videos) because real-world data is often difficult and expensive to collect and may include personal information that cannot be used (read more about How to Keep Your Machine Learning Data Confidential – Sparkd.AI). Machine-learning models that generate synthetic data can predict situations and generate images that are rare, if not impossible, to find and foresee in the real world. These models are called generative models.
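To make the idea of "applying a statistical model" concrete, here is a minimal Python sketch. It is far simpler than the generative models discussed in the MIT article and is not the researchers' method: it fits a basic Gaussian distribution to a handful of measurements and then draws new, synthetic values from it. The numbers in `real_speeds` are invented for illustration, and the function names `fit_gaussian` and `generate_synthetic` are hypothetical.

```python
import random
import statistics

# Hypothetical "real-world" measurements (invented for illustration),
# e.g. vehicle speeds in km/h recorded on one road segment.
real_speeds = [48.0, 52.5, 50.1, 47.3, 55.2, 49.8, 51.0, 53.4]


def fit_gaussian(samples):
    """Estimate the mean and sample standard deviation of the data."""
    return statistics.mean(samples), statistics.stdev(samples)


def generate_synthetic(samples, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the samples."""
    mu, sigma = fit_gaussian(samples)
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Turn 8 real measurements into 1,000 synthetic ones.
synthetic_speeds = generate_synthetic(real_speeds, n=1000)
print(len(synthetic_speeds))
print(round(statistics.mean(real_speeds), 1),
      round(statistics.mean(synthetic_speeds), 1))
```

The synthetic values follow the same overall distribution as the real ones without copying any individual measurement. Generative models for images rest on the same principle, but they learn a much richer model of the data.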
Why does synthetic data work better than “the real thing”?
Research scientist Ali Jahanian, lead author of the paper referred to in the MIT News article, says that the researchers “were especially pleased when [they] showed that this method sometimes does even better than the real thing”. At the same time, he warns about the current risks linked to the use of synthetic data, including privacy concerns, biased data and disclosure of source data. According to Jahanian, the breakthrough of these generative models is that, once trained, they can predict circumstances they were not exposed to during training. This ability would allow machines trained in this way to react correctly even in situations that the engineers who trained them had not foreseen or collected in the real world.
Ali Jahanian and colleagues will present their paper at the 10th International Conference on Learning Representations, taking place online from the 25th to the 29th of April 2022.
Author Manuela Armini