Tuesday, December 05, 2023

Generating synthetic data

Faker is an excellent tool for generating mock data for your application. But a complex application typically has dozens of tables with relationships between them. How can we use Faker to populate all of these tables while keeping those relationships valid?

We can follow two approaches here:

Option 1: Populate the primary (parent) table first and then the dependent tables. When populating a dependent table, set each foreign key to a randomly chosen primary key from the parent table, so every reference points to a row that actually exists. A good article summarizing this is here: https://khofifah.medium.com/how-to-generate-fake-relational-data-in-python-and-getting-insight-using-sql-in-bigquery-985c5adc87d3

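Code snippet:
import random

# Assign each transaction a random user id taken from the account table.
# tolist() makes random.choices index a plain list rather than the DataFrame column.
trans['user_id'] = random.choices(account['id'].tolist(), k=len(trans))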
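
To make the ordering in Option 1 concrete, here is a minimal, self-contained sketch that uses only the Python standard library. The table names, column names and row counts are assumptions chosen for illustration, and the generated values are plain stand-ins; in a real script a Faker call such as fake.name() could produce them instead.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Parent table: every account gets a unique primary key.
# The name is a stand-in value; a Faker provider could generate it instead.
accounts = [{"id": i, "name": f"user_{i}"} for i in range(1, 6)]

# Child table: build the rows first, without the foreign key.
transactions = [
    {"id": i, "amount": round(random.uniform(1, 500), 2)}
    for i in range(1, 21)
]

# Draw one existing account id per transaction (sampling with replacement),
# so every foreign key points at a parent row that really exists.
account_ids = [account["id"] for account in accounts]
chosen_ids = random.choices(account_ids, k=len(transactions))
for txn, user_id in zip(transactions, chosen_ids):
    txn["user_id"] = user_id
```

Because the ids are drawn with replacement, some accounts may end up with many transactions and others with none, which is usually acceptable for mock data.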

Option 2: Use an ORM framework to insert data into the database. An ORM framework makes it easy to establish relationships between different tables, because the relationships are declared once in the model classes. A good article on this approach is here: https://medium.com/@pushpam.ankit/generating-mock-data-for-complex-relational-tables-with-python-and-sqlite-2950ab7700f2
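
As a rough illustration of Option 2, the sketch below uses SQLAlchemy (one possible ORM, not necessarily the one used in the linked article) with an in-memory SQLite database. The Account and Transaction models, their columns and the row counts are all hypothetical, and the values are stand-ins where Faker calls could go.

```python
import random

from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Account(Base):
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    transactions = relationship("Transaction", back_populates="account")


class Transaction(Base):
    __tablename__ = "transactions"
    id = Column(Integer, primary_key=True)
    amount = Column(Float, nullable=False)
    account_id = Column(Integer, ForeignKey("accounts.id"), nullable=False)
    account = relationship("Account", back_populates="transactions")


# In-memory SQLite database, so the sketch leaves nothing behind on disk.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

random.seed(0)
with Session(engine) as session:
    for i in range(3):
        account = Account(name=f"user_{i}")  # stand-in for a Faker-generated name
        # Appending to the relationship lets the ORM set account_id on flush.
        for _ in range(random.randint(1, 4)):
            account.transactions.append(
                Transaction(amount=round(random.uniform(1, 500), 2))
            )
        session.add(account)  # related transactions are saved along with it
    session.commit()

    account_count = session.query(Account).count()
    orphan_count = (
        session.query(Transaction)
        .filter(Transaction.account_id.is_(None))
        .count()
    )
```

Note that no foreign key value is assigned by hand here: the child rows are attached to the parent object and the ORM fills in account_id when the session is flushed.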

Another interesting open-source tool is "Synthetic Data Vault" (SDV): https://sdv.dev/
With tools like this, we first train a model on real data and then use that model to generate new synthetic data. Many vendors distinguish between "mock" and "synthetic" data on this basis.