It will force tech companies to depend more heavily on data generated by AI, known as “synthetic data”. And this, in turn, could make the AI systems currently used by hundreds of millions of people less accurate and reliable, and therefore less useful.
But this isn’t an inevitable outcome. In fact, if used and managed carefully, synthetic data could improve AI models.
The problems with real data
Tech companies depend on data – real or synthetic – to build, train and refine generative AI models such as ChatGPT. The quality of this data is crucial. Poor data leads to poor outputs, in the same way using low-quality ingredients in cooking can produce low-quality meals.
Real data refers to text, video and images created by humans. Companies collect it through methods such as surveys, experiments, observations or mining of websites and social media.
Real data is generally considered valuable because it includes true events and captures a wide range of scenarios and contexts. However, it isn’t perfect.
It can contain spelling errors and inconsistent or irrelevant content. It can also be heavily biased, which can lead, for instance, to generative AI models creating images that show only men or white people in certain jobs.
This kind of data also requires a lot of time and effort to prepare. First, people collect datasets and label them so they are meaningful to an AI model. They then review and clean the data to resolve any inconsistencies, before computers filter, organise and validate it.
This process can take up to 80% of the total time invested in developing an AI system.
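To make the cleaning and filtering steps more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any company's actual pipeline: the `Record` type, the `clean` function and the two-label set are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical label set, assumed for this illustration only.
ALLOWED_LABELS = {"positive", "negative"}


@dataclass(frozen=True)
class Record:
    text: str
    label: str


def clean(records):
    """Normalise text and labels, then drop empty, mislabelled and duplicate records."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.text.split())  # collapse stray whitespace
        label = record.label.strip().lower()  # make label spelling consistent
        if not text or label not in ALLOWED_LABELS:
            continue  # filter out empty text or an unknown label
        key = (text.lower(), label)
        if key in seen:
            continue  # filter out a duplicate of an earlier record
        seen.add(key)
        cleaned.append(Record(text, label))
    return cleaned


raw = [
    Record("  Great   product ", "Positive"),
    Record("great product", "positive"),
    Record("", "negative"),
    Record("Arrived broken", "negative"),
    Record("Not sure", "maybe"),
]
cleaned = clean(raw)
```

Even in this toy case, only two of the five raw records survive. Real datasets need far more elaborate checks, which is part of why preparation takes so long.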
But as stated above, real data is also in increasingly short supply because humans can’t produce it quickly enough to feed burgeoning AI demand.
The rise of synthetic data
Synthetic data is artificially created or generated by algorithms, such as text generated by ChatGPT or an image generated by DALL-E.
In theory, synthetic data offers a cost-effective and faster solution for training AI models.
It can also help address privacy and ethical concerns, particularly where sensitive personal information such as health data is involved.
Importantly, unlike real data, it isn’t in short supply. In fact, it’s unlimited.
For these reasons, tech companies are increasingly turning to synthetic data to train their AI systems. Research firm Gartner estimates that by 2030, synthetic data will become the main form of data used in AI.
The challenges of synthetic data