Safeguarding the sensitive information of your customers is crucial. Personally Identifiable Information (PII), such as names, addresses, and phone numbers, is heavily regulated under data privacy laws like GDPR, CCPA, and HIPAA. However, developers, data analysts, and QA teams still need to work with realistic data for testing, development, or demonstrations.
For example, I’ve been developing an OpenINSIGHTS demo account to output campaigns with our retail AI customer predictions. I want to provide evidence that the data is actionable, meaning I can display actual customer records with the output. However, I don’t want to display accounts, actual names, addresses, email addresses, phone numbers, or any other PII in customer data.
Why Build Fake PII Data?
- Compliance with Privacy Laws: Using real PII in testing or development environments can breach regulations, resulting in hefty fines.
- Risk Mitigation: Real PII in lower environments increases the risk of accidental leaks.
- Accurate Testing: Fake but realistic data allows you to effectively test data pipelines, reports, and UIs.
- Demo Environments: Fake data provides a professional appearance during demos without risking user privacy.
Should Fake Data Look Realistic?
Not necessarily. While fake data must work within your systems, it should still look intentionally fake to anyone observing a demo or testing environment. This helps maintain transparency and avoid confusion.
- System Integrity: Internal processes, such as filtering by city, state, or zip code, often rely on valid formats and patterns. Fake data should meet these requirements to ensure the system functions correctly without errors. For example, ZIP codes should align with expected formats, while other fields like names or street addresses can be replaced with clearly fake placeholders.
- Functional Testing: Maintaining integrity in key fields like city, state, and zip ensures that internal filters, workflows, and validations continue to operate as expected. Meanwhile, the fake data for other fields (like names and street addresses) can help test edge cases or performance without impacting functionality.
- Presentation Clarity: During demos, fake data that is clearly fake (e.g., Zxy Test St.) avoids confusion while still showcasing system features. This strikes a balance between professional presentation and maintaining transparency about data usage.
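One way to keep that balance honest is to check the format-sensitive fields before a demo. The following is a minimal Python sketch, assuming each record is available as a dictionary with hypothetical `zip` and `state` keys (your column names will likely differ). It checks only the shape of the values, not whether the ZIP code or state actually exists.

```python
import re
from typing import Dict, List

# Shape checks only: these do not confirm that a ZIP code or state exists.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")  # US ZIP or ZIP+4
STATE_RE = re.compile(r"[A-Z]{2}")      # two-letter abbreviation

def format_problems(row: Dict[str, str]) -> List[str]:
    """Return the format-sensitive fields in a record that look malformed.

    The 'zip' and 'state' keys are hypothetical column names.
    """
    problems = []
    if not ZIP_RE.fullmatch(row.get("zip", "")):
        problems.append("zip")
    if not STATE_RE.fullmatch(row.get("state", "")):
        problems.append("state")
    return problems
```

Fields such as names and street addresses are deliberately left unchecked, since those are the ones meant to look artificial.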
You can ensure secure, functional, and clear testing or demonstrations by intentionally designing fake data to look artificial while preserving system integrity in essential fields. Here’s a snapshot of what I was able to build:
SQL to Generate Fake PII Data
Here is the SQL to generate fake yet well-formed PII in your database for first_name, last_name, customer_address_line1, customer_address_line2, email_address, and phone_number, plus a combined customer_name field. I've also added logic so that only about 20% of households get a second address line.
-- Statement 1: generate names, addresses, and phone numbers.
UPDATE `project.demo.pii`
SET
  -- Consonant + vowel + consonant, plus an optional trailing vowel.
  first_name = CONCAT(
    UPPER(SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1)),
    SUBSTR('aeiou', CAST(FLOOR(RAND() * 5) + 1 AS INT64), 1),
    SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1),
    IF(RAND() > 0.5, SUBSTR('aeiou', CAST(FLOOR(RAND() * 5) + 1 AS INT64), 1), '')
  ),
  last_name = CONCAT(
    UPPER(SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1)),
    SUBSTR('aeiou', CAST(FLOOR(RAND() * 5) + 1 AS INT64), 1),
    SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1),
    IF(RAND() > 0.5, SUBSTR('aeiou', CAST(FLOOR(RAND() * 5) + 1 AS INT64), 1), '')
  ),
  -- House number (1 to 99,999), three-letter street name, random street type.
  customer_address_line1 = CONCAT(
    CAST(CAST(FLOOR(RAND() * 99999 + 1) AS INT64) AS STRING), ' ',
    UPPER(SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1)),
    SUBSTR('aeiou', CAST(FLOOR(RAND() * 5) + 1 AS INT64), 1),
    SUBSTR('bcdfghjklmnpqrstvwxyz', CAST(FLOOR(RAND() * 21) + 1 AS INT64), 1),
    ' ',
    -- 20 street types, so the multiplier is 20 (values 0 to 19; 19 falls to ELSE).
    CASE CAST(FLOOR(RAND() * 20) AS INT64)
      WHEN 0 THEN 'St'
      WHEN 1 THEN 'Ave'
      WHEN 2 THEN 'Blvd'
      WHEN 3 THEN 'Dr'
      WHEN 4 THEN 'Ln'
      WHEN 5 THEN 'Rd'
      WHEN 6 THEN 'Cir'
      WHEN 7 THEN 'Ct'
      WHEN 8 THEN 'Pl'
      WHEN 9 THEN 'Pkwy'
      WHEN 10 THEN 'Ter'
      WHEN 11 THEN 'Way'
      WHEN 12 THEN 'Sq'
      WHEN 13 THEN 'Loop'
      WHEN 14 THEN 'Trail'
      WHEN 15 THEN 'Hwy'
      WHEN 16 THEN 'Row'
      WHEN 17 THEN 'Path'
      WHEN 18 THEN 'Alley'
      ELSE 'Pass'
    END
  ),
  -- About 20% of rows get a second address line.
  customer_address_line2 = CASE
    WHEN RAND() <= 0.2 THEN CONCAT(
      CASE CAST(FLOOR(RAND() * 3) AS INT64)
        WHEN 0 THEN 'Apt '
        WHEN 1 THEN 'Suite '
        ELSE 'Unit '
      END,
      -- FLOOR returns FLOAT64; cast to INT64 first so the string is '123', not '123.0'.
      CASE CAST(FLOOR(RAND() * 2) AS INT64)
        WHEN 0 THEN CONCAT(
          SUBSTR('ABCDEF', CAST(FLOOR(RAND() * 6) + 1 AS INT64), 1),
          CAST(CAST(FLOOR(RAND() * 999 + 1) AS INT64) AS STRING)
        )
        ELSE CAST(CAST(FLOOR(RAND() * 1000 + 1) AS INT64) AS STRING)
      END
    )
    ELSE NULL  -- clear any existing value so no real unit numbers survive
  END,
  -- (XXX) XXX-XXXX with an area code from 200 to 999.
  phone_number = CONCAT(
    '(', CAST(CAST(FLOOR(RAND() * 800 + 200) AS INT64) AS STRING), ') ',
    CAST(CAST(FLOOR(RAND() * 900 + 100) AS INT64) AS STRING), '-',
    CAST(CAST(FLOOR(RAND() * 9000 + 1000) AS INT64) AS STRING)
  )
WHERE TRUE;

-- Statement 2: derive email_address and customer_name from the new names.
-- This runs separately because column references on the right-hand side of SET
-- read the row's pre-update values. In a single UPDATE, these two fields would
-- be built from the original (real) names.
UPDATE `project.demo.pii`
SET
  email_address = CONCAT(
    LOWER(first_name), '.', LOWER(last_name),
    CASE CAST(FLOOR(RAND() * 3) AS INT64)
      WHEN 0 THEN '@example.com'
      WHEN 1 THEN '@testmail.com'
      ELSE '@fakemail.org'
    END
  ),
  customer_name = CONCAT(
    UPPER(SUBSTR(first_name, 1, 1)), SUBSTR(first_name, 2), ' ',
    UPPER(SUBSTR(last_name, 1, 1)), SUBSTR(last_name, 2)
  )
WHERE TRUE;
Breaking Down the Code
1. Generating first_name and last_name
- The query randomly generates fake first and last names using a combination of consonants and vowels.
- Logic:
  - Picks a random consonant and capitalizes it → UPPER(SUBSTR('bcdfghjklmnpqrstvwxyz', ...))
  - Adds a vowel → SUBSTR('aeiou', ...)
  - Adds a second consonant and, about half the time, a trailing vowel, forming a short, readable name of three or four letters.
- Names are random, so they are not tied to any real customer, even though a few will happen to resemble real names (e.g., "Bob").
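To make the pattern easier to see outside BigQuery, here is a minimal Python sketch of the same consonant-vowel-consonant idea. The function name `fake_name` and the seeded `random.Random` argument are illustrative assumptions, not part of the original query.

```python
import random

CONSONANTS = "bcdfghjklmnpqrstvwxyz"  # the same 21 letters used in the SQL
VOWELS = "aeiou"

def fake_name(rng: random.Random) -> str:
    """Capitalized consonant + vowel + consonant, plus an optional trailing vowel."""
    name = rng.choice(CONSONANTS).upper() + rng.choice(VOWELS) + rng.choice(CONSONANTS)
    if rng.random() > 0.5:  # roughly half of the names get a fourth letter
        name += rng.choice(VOWELS)
    return name

# Passing a seeded generator makes a demo data set repeatable.
sample = [fake_name(random.Random(seed)) for seed in range(3)]
```

Seeding is a design choice for repeatable demos. The SQL version calls RAND() and produces different names on every run.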
2. Creating customer_address_line1
- Combines a random house number with a randomly generated street name and type (e.g., St, Ave, Blvd).
- Logic:
- Randomly selects a house number from 1 to 99,999.
- Constructs a street name using consonants and vowels.
- Appends a random street type from a list (e.g., “Ln”, “Way”, “Trail”).
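The three steps above can be sketched in Python as follows. This is an illustration under assumptions: `fake_address_line1` is a hypothetical helper, and `STREET_TYPES` lists only a subset of the street types used in the SQL.

```python
import random

CONSONANTS = "bcdfghjklmnpqrstvwxyz"
VOWELS = "aeiou"
# A subset of the street types from the SQL CASE expression.
STREET_TYPES = ["St", "Ave", "Blvd", "Dr", "Ln", "Rd", "Ct", "Pl", "Way", "Trail"]

def fake_address_line1(rng: random.Random) -> str:
    """House number, three-letter street name, and street type."""
    number = rng.randint(1, 99999)  # inclusive on both ends
    street = rng.choice(CONSONANTS).upper() + rng.choice(VOWELS) + rng.choice(CONSONANTS)
    return f"{number} {street} {rng.choice(STREET_TYPES)}"
```

The result has the shape of a street address (number, name, type) while the three-letter street name keeps it visibly artificial.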
3. Handling customer_address_line2
- Adds apartment, suite, or unit details with a 20% probability.
- Logic:
- Randomly picks “Apt”, “Suite”, or “Unit”.
- Adds a number or alphanumeric identifier.
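As a rough Python sketch of that 20% branch: the helper name `fake_address_line2` and its `probability` parameter are assumptions made for illustration. Returning `None` for the remaining rows is also an assumption; what those rows hold in the database depends on the ELSE branch of the SQL.

```python
import random
from typing import Optional

def fake_address_line2(rng: random.Random, probability: float = 0.2) -> Optional[str]:
    """Return a unit designator for about `probability` of calls, else None."""
    if rng.random() > probability:
        return None  # assumption: the remaining rows get no second line
    prefix = rng.choice(["Apt", "Suite", "Unit"])
    if rng.random() < 0.5:
        # Letter plus number, e.g. a letter from A-F followed by 1-999.
        identifier = rng.choice("ABCDEF") + str(rng.randint(1, 999))
    else:
        identifier = str(rng.randint(1, 1000))
    return f"{prefix} {identifier}"
```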
4. Creating email_address
- Combines first_name and last_name in lowercase.
- Appends one of the fake domains (example.com, testmail.com, or fakemail.org).
- Ensures the format looks like an email but is clearly fake.
- Pattern: first_name.last_name@domain, all lowercase.
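The email construction can be sketched in Python as below. The helper name `fake_email` is hypothetical; the domain list is taken from the SQL.

```python
import random

FAKE_DOMAINS = ["example.com", "testmail.com", "fakemail.org"]  # from the SQL

def fake_email(first_name: str, last_name: str, rng: random.Random) -> str:
    """Lowercased first.last at one of the placeholder domains."""
    return f"{first_name.lower()}.{last_name.lower()}@{rng.choice(FAKE_DOMAINS)}"
```

One caveat worth weighing: of these three domains, only example.com is reserved for documentation and testing (RFC 2606). The other two may be registered to real owners, so if there is any chance a lower environment sends mail, restricting the list to example.com, example.net, and example.org is the safer choice.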
5. Creating phone_number
- Generates a 10-digit number formatted as (XXX) XXX-XXXX.
- Area code (XXX) is between 200–999 (valid area codes start with 2–9).
- Ensures realistic phone formatting but with fake values.
- Example: (425) 678-1234
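A minimal Python sketch of the same phone format, using the number ranges from the SQL. The function name `fake_phone` is an assumption for illustration.

```python
import random

def fake_phone(rng: random.Random) -> str:
    """Format three random parts as (XXX) XXX-XXXX."""
    area = rng.randint(200, 999)      # area codes start with 2-9
    exchange = rng.randint(100, 999)  # always three digits
    line = rng.randint(1000, 9999)    # always four digits
    return f"({area}) {exchange}-{line}"
```

Because each part is drawn from a range with a fixed number of digits, no zero-padding is needed. Working with integers here also mirrors why the SQL needs an INT64 cast before converting to a string: a floating-point value would carry a trailing ".0" into the output.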
6. Combining customer_name
- Formats the fake first and last names to title case (e.g., “John Smith”).
Final Notes
This query allows you to:
- Generate secure, fake PII for testing.
- Avoid compliance risks with real data.
- Maintain data realism, ensuring effective system testing and demos.