Every application needs data to develop against. The temptation is to copy a production database dump into staging — real names, real emails, real purchase histories. That shortcut creates legal liability, security risk, and brittle tests. Professional teams use fake data: synthetically generated records that look realistic but contain no real personal information.
Fake data is not about cutting corners. It is about building software safely, reproducibly, and in compliance with privacy regulations that apply in the EU, UK, US, and most other jurisdictions. This article explains why fake data matters, what good fake data looks like, and how to generate it for your project.
The problem with using real data in development
Production databases contain personally identifiable information (PII): names, email addresses, phone numbers, payment details, health records, and location data. Copying that data to development or staging environments is likely to breach the data minimisation principle in GDPR Article 5(1)(c) and the security-of-processing requirement in Article 32. Even with employee access controls, staging environments typically have weaker security: fewer audits, more people with access, and sometimes publicly reachable URLs.
A data breach in staging is still a data breach. Under GDPR, fines for the most serious infringements can reach €20 million or 4% of global annual turnover, whichever is higher. Beyond regulation, using real data in development erodes user trust if discovered, creates onboarding friction (new developers see sensitive customer data on day one), and makes demos risky: you never want a screen-share to reveal a real customer's medical history or financial details.
Three main use cases
Unit and integration testing: Tests need predictable, repeatable inputs. Fake data lets you assert that a function handles edge cases — empty strings, Unicode characters, 500-character names, null values — without polluting a shared database. Each test run starts from a known state.
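One way to get predictable, repeatable inputs is to drive the generator from a fixed seed. The sketch below is a minimal illustration, not a replacement for a generator library: the helper names `seededRandom` and `generateUsers` and the name pools are assumptions made for this example.

```javascript
// Small seeded PRNG (mulberry32-style): the same seed always yields the same sequence.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Illustrative name pools; a real project would normally use a generator library.
const FIRST_NAMES = ['Ada', 'Bo', 'Chidi', 'Dara', 'Elif'];
const LAST_NAMES = ['Ng', 'Okafor', 'Silva', 'Tanaka', 'Weiss'];

// Build `count` fake users from a seed. Same seed in, same users out.
function generateUsers(seed, count) {
  const rand = seededRandom(seed);
  const pick = (list) => list[Math.floor(rand() * list.length)];
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    name: `${pick(FIRST_NAMES)} ${pick(LAST_NAMES)}`,
  }));
}

const runOne = generateUsers(42, 5);
const runTwo = generateUsers(42, 5);
console.log(JSON.stringify(runOne) === JSON.stringify(runTwo)); // true
```

Because each test builds its own records from the seed, it does not depend on rows left behind by another test or another developer.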
Database seeding: When a new developer clones the repo or a CI pipeline spins up a fresh environment, seed scripts populate tables with realistic volumes. A user table with 10,000 fake rows reveals pagination bugs that 10 rows would hide. An orders table with varied statuses tests dashboard filters and reporting queries.
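The volume point can be illustrated with a short sketch. It builds orders in memory rather than in a real database, and the names `makeOrders`, `paginate`, and `pageCount`, the status list, and the page size of 25 are all assumptions for the example.

```javascript
const STATUSES = ['pending', 'paid', 'shipped', 'refunded', 'cancelled'];

// Generate `count` orders with statuses and totals that vary by row.
function makeOrders(count) {
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    userId: (i % 100) + 1, // spread orders across 100 users
    status: STATUSES[i % STATUSES.length],
    totalCents: (i * 37) % 50000,
  }));
}

// 1-based page lookup, the kind of logic a list endpoint would use.
function paginate(rows, page, pageSize = 25) {
  const start = (page - 1) * pageSize;
  return rows.slice(start, start + pageSize);
}

function pageCount(rows, pageSize = 25) {
  return Math.ceil(rows.length / pageSize);
}

const small = makeOrders(10);
const large = makeOrders(10000);

console.log(pageCount(small)); // 1: page boundaries are never exercised
console.log(pageCount(large)); // 400
console.log(paginate(large, 2)[0].id); // 26: first row of the second page
```

With ten rows, everything fits on one page, so an off-by-one error in `start` would go unnoticed. With 10,000 rows, the second and last pages can be checked directly.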
UI mockups and prototypes: Designers and frontend developers need realistic content to evaluate layouts. Lorem ipsum reveals nothing about truncation, line wrapping, or avatar alignment. Fake names, avatars, and addresses make prototypes credible for stakeholder reviews without exposing real users.
What makes good fake data
Realistic formats: Email addresses should look like emails (a local part, an @ sign, and a plausible domain), phone numbers should match locale formats, dates should be valid, and postal codes should follow regional patterns. Garbage strings like 'asdf1234' do not catch validation bugs.
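To see why garbage strings hide bugs, consider a hypothetical signup handler. The names `looksLikeEmail` and `registerUser` are invented for this sketch, and the regular expression is a deliberately simple shape check rather than a full email validator.

```javascript
// Deliberately simple shape check, not a full RFC 5322 validator.
function looksLikeEmail(value) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

// Hypothetical signup handler: validation runs first, then normalisation.
function registerUser(email) {
  if (!looksLikeEmail(email)) return { ok: false, reason: 'invalid email' };
  return { ok: true, email: email.toLowerCase() };
}

const garbage = registerUser('asdf1234');
const realistic = registerUser('Sarah.Chen@Example.com');

console.log(garbage.ok); // false: rejected before normalisation runs
console.log(realistic.email); // sarah.chen@example.com
```

The garbage input only ever exercises the rejection branch. Any bug in the code that runs after validation stays invisible until the test data has a realistic shape.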
Consistent relationships: A fake user named 'Sarah Chen' should have an email like sarah.chen@example.com, not one built from an unrelated name. Order records should reference valid user IDs. Foreign key integrity matters even in test data, because broken references cause misleading test failures.
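A minimal sketch of both kinds of consistency follows. It derives each email from the user's name and checks that every order points at an existing user. The helpers `emailFor` and `orphanedOrders` and the `example.com` domain are assumptions for illustration.

```javascript
// Derive the email from the name so the two fields agree.
function emailFor(fullName, domain = 'example.com') {
  const local = fullName
    .normalize('NFD') // split accented letters into base letter + mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining marks
    .toLowerCase()
    .replace(/['’]/g, '') // remove apostrophes
    .replace(/[^a-z0-9]+/g, '.') // spaces and hyphens become dots
    .replace(/^\.+|\.+$/g, ''); // trim leading and trailing dots
  return `${local}@${domain}`;
}

const names = ['Sarah Chen', "Bob O'Connor", 'María García'];
const users = names.map((name, i) => ({ id: i + 1, name, email: emailFor(name) }));

// Every order takes its userId from an existing user record.
const orders = Array.from({ length: 6 }, (_, i) => ({
  id: i + 1,
  userId: users[i % users.length].id,
}));

// Integrity check: return the orders whose userId matches no user.
function orphanedOrders(orderRows, userRows) {
  const ids = new Set(userRows.map((u) => u.id));
  return orderRows.filter((o) => !ids.has(o.userId));
}

console.log(users[0].email); // sarah.chen@example.com
console.log(orphanedOrders(orders, users).length); // 0
```

Running a check like `orphanedOrders` at the end of a seed script can catch broken references before they show up as confusing test failures.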
Edge cases: Good generators produce long names, single-character names, names with apostrophes and hyphens, emoji in display names, zero-amount transactions, and future dates. These expose UI overflow, encoding bugs, and validation gaps that average data hides.
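Edge cases are often easiest to maintain as a small hand-written fixture list that is appended to the generated rows. The fixture values, the `note` field, and the helper names below are illustrative assumptions.

```javascript
// Hand-picked fixtures that random generation rarely produces on its own.
const edgeCaseUsers = [
  { name: 'A'.repeat(300), note: 'very long name' },
  { name: 'Q', note: 'single character' },
  { name: "D'Arcy O'Neil-Smith", note: 'apostrophes and hyphen' },
  { name: 'Sam 🚀', note: 'emoji in display name' },
];

const edgeCaseTransactions = [
  { amountCents: 0, date: '2020-01-01', note: 'zero amount' },
  { amountCents: 1999, date: '2999-12-31', note: 'future date' },
];

// Length in code points; `.length` would count the emoji as two UTF-16 units.
function codePointLength(text) {
  return Array.from(text).length;
}

// Which fixtures would overflow a column or UI field of `maxLength` characters?
function overflowing(rows, maxLength) {
  return rows.filter((r) => codePointLength(r.name) > maxLength).map((r) => r.note);
}

console.log(overflowing(edgeCaseUsers, 50)); // [ 'very long name' ]
console.log('🚀'.length, codePointLength('🚀')); // 2 1
```

The last line shows the kind of encoding gap these fixtures are meant to surface: a length limit enforced with `.length` disagrees with one enforced per character.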
How to seed a database with fake data
SQL INSERT example:
INSERT INTO users (id, name, email, created_at) VALUES
  (1, 'Alice Johnson', 'alice.johnson@example.com', '2025-01-15'),
  (2, 'Bob O''Connor', 'bob.oconnor@example.com', '2025-02-20'),
  (3, 'María García', 'maria.garcia@example.com', '2025-03-10');
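Node.js seeding script:
import { faker } from '@faker-js/faker';
// `db` is assumed to be your project's ORM or database client, imported from wherever it is defined.
// Top-level await requires an ES module (.mjs or "type": "module").
faker.seed(42); // fixed seed: the same names and emails on every run (for a given faker version)
for (let i = 0; i < 1000; i++) {
  const firstName = faker.person.firstName();
  const lastName = faker.person.lastName();
  await db.users.create({
    name: `${firstName} ${lastName}`,
    email: faker.internet.email({ firstName, lastName }), // email matches the name
    createdAt: faker.date.past(),
  });
}
Run seed scripts as part of your development setup (npm run seed, make seed, rails db:seed). Keep seeds idempotent where possible: truncate tables before inserting, or use upsert logic, so that re-running does not duplicate records.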
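The upsert approach to idempotent seeding can be sketched without a database. In the example below a `Map` stands in for a users table keyed by email. The names `upsertUser` and `seed` and the fixture rows are assumptions for illustration.

```javascript
// In-memory stand-in for a users table, keyed by email (the natural key).
const usersTable = new Map();

// Insert the row if the key is new, otherwise update the existing row in place.
function upsertUser(table, user) {
  const existing = table.get(user.email);
  table.set(user.email, { ...existing, ...user });
}

// Seed fixtures; safe to call any number of times.
function seed(table) {
  const fixtures = [
    { email: 'alice.johnson@example.com', name: 'Alice Johnson' },
    { email: 'bob.oconnor@example.com', name: "Bob O'Connor" },
    { email: 'maria.garcia@example.com', name: 'María García' },
  ];
  for (const user of fixtures) upsertUser(table, user);
  return table.size;
}

const firstRun = seed(usersTable);
const secondRun = seed(usersTable);
console.log(firstRun, secondRun); // 3 3
```

Against a real database the same idea is usually expressed in SQL, for example with PostgreSQL's `INSERT ... ON CONFLICT ... DO UPDATE`. Either way, the important part is choosing a stable key to match on.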
When NOT to use fake data
Never use fake data in production — obviously. But also avoid it in user acceptance testing (UAT) when stakeholders need to verify real business workflows with actual account structures, pricing tiers, and integration credentials. UAT should use anonymised production snapshots or dedicated test accounts created through the normal signup flow, not randomly generated names.
Performance testing is another exception: synthetic data may not match production distributions (index selectivity, table sizes, query plans). For load testing, use anonymised production-scale datasets or statistically matched synthetic generators.
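A statistically matched generator can start from something as simple as weighted sampling. The sketch below assumes a hypothetical target distribution of order statuses (70% delivered, 20% pending, 8% cancelled, 2% refunded). In practice those weights would be measured from production aggregates, and `u` would come from a random source. Here it is stepped evenly through [0, 1) so that the result is exactly reproducible.

```javascript
// Hypothetical target distribution, as integer weights out of 100.
const STATUS_WEIGHTS = [
  ['delivered', 70],
  ['pending', 20],
  ['cancelled', 8],
  ['refunded', 2],
];

// Build a picker that maps u in [0, 1) to a value, in proportion to the weights.
function weightedPicker(weights) {
  const total = weights.reduce((sum, [, w]) => sum + w, 0);
  return function pick(u) {
    let threshold = 0;
    for (const [value, w] of weights) {
      threshold += w;
      if (u * total < threshold) return value;
    }
    return weights[weights.length - 1][0]; // guard against rounding at u close to 1
  };
}

const pickStatus = weightedPicker(STATUS_WEIGHTS);
const rowCount = 10000;
const counts = {};
for (let i = 0; i < rowCount; i++) {
  const status = pickStatus((i + 0.5) / rowCount); // evenly spaced stand-in for random()
  counts[status] = (counts[status] || 0) + 1;
}

console.log(counts); // { delivered: 7000, pending: 2000, cancelled: 800, refunded: 200 }
```

A uniform generator would give each status about 25% of the rows, which makes an index on `status` behave very differently from one over data where a single value dominates.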
FAQ
Is using fake data GDPR compliant? Yes, provided the data is genuinely synthetic and not derived from real records. Fake data relates to no identifiable person, so it is not personal data and GDPR does not apply to it. The obligation is to avoid processing real PII in non-production environments without a lawful basis and appropriate safeguards.
What is data anonymisation vs pseudonymisation? Anonymisation irreversibly removes the ability to identify individuals — the data is no longer personal data under GDPR. Pseudonymisation replaces identifiers with tokens that can be reversed with a separate key — it is still personal data and still regulated. Fake data generation is neither; it creates entirely new, fictional records.
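The reversibility that keeps pseudonymised data within scope can be shown in a few lines. This is a sketch under simple assumptions: the token format, the function names `pseudonymise` and `reidentify`, and the use of an in-memory lookup table as the "separate key" are all invented for the example.

```javascript
// Replace each email with a token and record the mapping in a separate key table.
function pseudonymise(rows) {
  const keyTable = new Map(); // email -> token; must be stored apart from the data
  const out = rows.map((row) => {
    if (!keyTable.has(row.email)) keyTable.set(row.email, `user-${keyTable.size + 1}`);
    return { ...row, email: keyTable.get(row.email) };
  });
  return { rows: out, keyTable };
}

// Anyone holding the key table can reverse the substitution.
function reidentify(token, keyTable) {
  for (const [email, t] of keyTable) {
    if (t === token) return email;
  }
  return null;
}

const source = [
  { email: 'sarah.chen@example.com', plan: 'pro' },
  { email: 'sarah.chen@example.com', plan: 'free' },
  { email: 'bob.oconnor@example.com', plan: 'pro' },
];

const { rows: masked, keyTable } = pseudonymise(source);
console.log(masked.map((r) => r.email)); // [ 'user-1', 'user-1', 'user-2' ]
console.log(reidentify('user-2', keyTable)); // bob.oconnor@example.com
```

As long as the key table exists, the masked rows can be linked back to a person. Generated fake records have no such table, because there was never a real person behind them.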
Can I use real data if I anonymise it? Sometimes, but anonymisation is harder than it sounds. Removing names while keeping birth date + postcode + gender can still allow re-identification. Consult your DPO before using anonymised production data in staging. Fake data is simpler and safer.
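The re-identification risk can be made concrete by counting how many rows share each combination of the remaining fields. The rows below are invented, the postcodes are placeholders, and `uniquelyIdentifiable` is a hypothetical helper. It is a rough screening check, not a formal anonymisation test.

```javascript
// Hypothetical "anonymised" rows: names removed, quasi-identifiers kept.
const rows = [
  { birthDate: '1990-04-02', postcode: 'ZZ1 1AA', gender: 'F' },
  { birthDate: '1990-04-02', postcode: 'ZZ1 1AA', gender: 'F' },
  { birthDate: '1985-11-23', postcode: 'ZZ1 1AA', gender: 'M' },
  { birthDate: '1972-06-30', postcode: 'ZZ9 9ZZ', gender: 'F' },
];

// Count how many rows share each combination of the given fields.
function groupSizes(data, keys) {
  const sizes = new Map();
  for (const row of data) {
    const signature = keys.map((k) => row[k]).join('|');
    sizes.set(signature, (sizes.get(signature) || 0) + 1);
  }
  return sizes;
}

// Rows whose combination is unique can be singled out by those fields alone.
function uniquelyIdentifiable(data, keys) {
  const sizes = groupSizes(data, keys);
  return data.filter((row) => sizes.get(keys.map((k) => row[k]).join('|')) === 1);
}

const atRisk = uniquelyIdentifiable(rows, ['birthDate', 'postcode', 'gender']);
console.log(atRisk.length); // 2
```

Two of the four rows are unique on birth date, postcode, and gender, so anyone who knows those three facts about a person could pick out that person's row even though no name is stored.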
How many rows should I generate for testing? Enough to expose pagination, search, and performance issues — typically 1,000 to 100,000 rows depending on table complexity. UI testing may need only 20–50 varied records. Load testing needs production-scale volumes. Start small and increase when tests pass too easily.
