Generating Test Data With ChatGPT
I attended a presentation by Paul Turley on Microsoft Fabric this week and was fascinated when he told me he’d created a ton of test data through ChatGPT. This is another great use of AI to allow us technologists to do more with less. Gone are the olden days of sample databases with sample datasets or us having to create our own, we can have ChatGPT just generate it for us. To see how easy it is, I thought I’d show you an example and tell you how I did it, as it was done in reverse of what it’s presented here for the execution of the build in the technical environment.
I’m going to break it up into two prompts, as I found the results are more accurate, receiving lesser hallucination rate. Rarely are you going to get a good result if you ask it to do both Python create objects and then data creation. This was done with GPT-4.
ChatGPT Prompt:
First we’ll ask ChatGPT for the SQL Script to create the tables:Create Oracle SQL script to create five tables, ‘HR_Employees’, with columns (last_name,first_name, email) ‘Managers’, with columns (employee_id,manager_name) ‘Titles’, with columns (employee_id, title),’Organizations, (employee_id, organization) and’Salary_Level’ with columns (employee_id,salary)
Output
Here are the SQL scripts you requested. Please check the columns’ data types and adjust as necessary:
CREATE TABLE HR_Employees ( employee_id NUMBER GENERATED ALWAYS AS IDENTITY, last_name VARCHAR2(50) NOT NULL, first_name VARCHAR2(50) NOT NULL, email VARCHAR2(100) UNIQUE NOT NULL ); CREATE TABLE Managers ( manager_id NUMBER GENERATED ALWAYS AS IDENTITY, employee_id NUMBER NOT NULL, manager_name VARCHAR2(100) NOT NULL, FOREIGN KEY (employee_id) REFERENCES HR_Employees(employee_id) ); CREATE TABLE Titles ( title_id NUMBER GENERATED ALWAYS AS IDENTITY, employee_id NUMBER NOT NULL, title VARCHAR2(100) NOT NULL, FOREIGN KEY (employee_id) REFERENCES HR_Employees(employee_id) ); CREATE TABLE Organizations ( organization_id NUMBER GENERATED ALWAYS AS IDENTITY, employee_id NUMBER NOT NULL, organization VARCHAR2(100) NOT NULL, FOREIGN KEY (employee_id) REFERENCES HR_Employees(employee_id) ); CREATE TABLE Salary_Level ( salary_id NUMBER GENERATED ALWAYS AS IDENTITY, employee_id NUMBER NOT NULL, salary NUMBER NOT NULL, FOREIGN KEY (employee_id) REFERENCES HR_Employees(employee_id) );
This SQL script will create five tables. Each table has an `employee_id` column which is a foreign key that references the `employee_id` column in the `HR_Employees` table. This ensures data integrity as you can’t enter a record in ‘Managers’, ‘Titles’, ‘Organizations’, or ‘Salary_Level’ with an `employee_id` that does not exist in the `HR_Employees` table.
Please note, I’ve included `IDENTITY` columns for each table to serve as primary keys. The `NOT NULL` constraint ensures that a column cannot have a NULL value. The `UNIQUE` constraint ensures that all values in a column are different.
Remember that you may need to modify the `VARCHAR2` and `NUMBER` data types to match the kind of data you will store in each column, the example provided are common use cases but may not fit your specific needs.
ChatGPT Prompt:
Create a python script to generate test data for an Oracle database with the base tables of HR employee list, managers, titles, organization and salary level for a company of 10000
Output
Creating test data for an Oracle database involves generating fake data and then inserting that data into the database. It’s crucial to note that interacting with Oracle databases using Python requires `cx_Oracle`, a Python library that allows the use of SQL queries to interact with Oracle databases.
Here’s a simple script that generates and inserts data into a `HR_Employees`, `Managers`, `Titles`, `Organizations`, and `Salary_Level` tables using the `faker` library to generate realistic data. However, the assumption here is that you have already created these tables in your database:
import cx_Oracle import faker import random # Initialize faker fake = faker.Faker() # Define Oracle Database connection dsn_tns = cx_Oracle.makedsn('host', 'port', service_name='service_name') # replace with your actual details conn = cx_Oracle.connect(user='username', password='password', dsn=dsn_tns) # Get a cursor object cursor = conn.cursor() # Generate and insert data for _ in range(10000): # Generate fake data name = fake.name() email = fake.email() job_title = fake.job() org = fake.company() salary_level = random.randint(1, 10) # Insert into HR_Employees table cursor.execute(f"INSERT INTO HR_Employees(name, email, job_title) VALUES('{name}', '{email}', '{job_title}')") # Retrieve employee_id of inserted employee cursor.execute("SELECT id FROM HR_Employees WHERE email = :email", {'email': email}) employee_id = cursor.fetchone()[0] # Insert into other tables cursor.execute(f"INSERT INTO Managers(employee_id) VALUES('{employee_id}')") cursor.execute(f"INSERT INTO Titles(employee_id, title) VALUES('{employee_id}', '{job_title}')") cursor.execute(f"INSERT INTO Organizations(employee_id, organization) VALUES('{employee_id}', '{org}')") cursor.execute(f"INSERT INTO Salary_Level(employee_id, level) VALUES('{employee_id}', '{salary_level}')") # Commit the changes and close the connection conn.commit() cursor.close() conn.close()
The above script will insert 10,000 records of employees, each with associated records in the `Managers`, `Titles`, `Organizations`, and `Salary_Level` tables. Note that this script is very basic and assumes simplistic table structures. Your actual database schema might require adjustments to this script.
Please replace ‘host’, ‘port’, ‘service_name’, ‘username’, and ‘password’ with your Oracle DB credentials. If your tables have different structure or naming, you need to adjust the INSERT INTO commands accordingly.
This code also does not handle exceptions that might occur during the data insertion, for simplicity. In a real-world scenario, you should include error handling to account for any issues that might occur during this process.
What this does
If you have a simple Oracle database, the first script will create the tables and then ChatGPT does 10K inserts of data into them. I did this in reverse now- I let ChatGPT build out the data inserts from the faker library, then used that to write my prompt for the SQL statement you see above it, but you get the idea— Let ChatGPT build out the best data set for your purpose, then tell it what you’ll need to hold the data, but it does work if you need quick test data.
You could easily replicate this for any database platform, including datalakes or NoSQL databases, even.
Pingback: Dew Drop – June 26, 2023 (#3971) – Morning Dew by Alvin Ashcraft