Data Anonymisation

Joe Leach
Data architect
Enterprise architecture team

Is using production data safe for testing?

… a case dealt with by the Information Commissioner in which a pupil was away from home at boarding school. Her parents received a letter from the local hospital informing them that their daughter had been involved in a road accident. In fact, there had been no accident, but the hospital had been using live patient data to test a system for sending out letters to patients.

(see “System Testing with Live Data May Breach Data Protection Act — Pinsentmasons.com”)

How is production data currently managed during testing?

The production database is cloned to a test environment for a period of two weeks
The proposed upgrade undergoes User Acceptance Testing (UAT) against the cloned data in the test environment
Test outcomes are evaluated:
- pass
  1. The test database is removed
  2. The software update goes live
- fail
  1. The test database is removed
  2. The upgrade is postponed

Benefits of testing with production data

Improved accuracy: Production data provides a more accurate representation of real-world scenarios than synthetic data. This allows testers to identify bugs and edge cases that might not be apparent in simulations.

In tests, known use cases are examples of real-world scenarios that make the results more accurate.

Benefits of testing with production data

Realistic testing environment: By using production data, testers can create a testing environment that closely resembles the actual production environment. This helps ensure that the system behaves as expected under realistic conditions.

In tests, the upgraded software is tested against production data, giving confidence that the upgrade will succeed in production.

Benefits of testing with production data

Cost-effective: Using production data for testing can be more cost-effective than creating artificial data. It eliminates the need to generate large amounts of data manually, which can be time-consuming and expensive.

In tests, using production data is straightforward and inexpensive.

Benefits of testing with production data

Faster testing: Production data can help speed up the testing process, and reduce data friction, by providing a pre-existing dataset that testers can use immediately. This can reduce the time and effort required to set up a testing environment.

In tests, testing is accelerated by using production data.

Benefits of testing with production data

Valuable insights: Production data can provide valuable insights into how users interact with the system in the real world. This information can be used to improve the user experience and identify areas for optimisation.

In tests, using real data gives the closest approximation of the user experience expected in production.

Benefits of testing with production data

Overall: Using production data for testing provides an accurate, realistic, and cost-effective way to test software systems. In the next two sections, we’ll explore some of the risks associated with using production data and how to mitigate them.

Risks of testing with production data

Data privacy: Using production data for testing can expose sensitive user information, such as personally identifiable information (PII).

This risk applies to system testing. Exposure is mitigated by conducting tests within a prescribed time window, after which the database clone is removed.
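The time-window mitigation can be checked mechanically. The sketch below is a minimal illustration, assuming the two-week window described earlier in this deck; the function name `clone_is_overdue` and the way clone dates are recorded are hypothetical, not part of any existing system.

```python
from datetime import date, timedelta

# Assumption: the two-week retention window described for test clones.
RETENTION = timedelta(weeks=2)


def clone_is_overdue(cloned_on: date, today: date,
                     retention: timedelta = RETENTION) -> bool:
    """Return True when a test clone has outlived its retention window."""
    return today - cloned_on > retention


if __name__ == "__main__":
    # Hypothetical clone dates, for illustration only.
    print(clone_is_overdue(date(2024, 1, 1), date(2024, 1, 10)))
    print(clone_is_overdue(date(2024, 1, 1), date(2024, 1, 20)))
```

A check like this could run on a schedule and flag clones that should already have been removed.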

Risks of testing with production data

Security breaches: Production data is more valuable than simulated data, making it a potential cybercrime target. Using production data for testing can increase the risk of a security breach, data loss, and theft.

This risk is mitigated by the production data only being in test systems for a prescribed time-window.

Risks of testing with production data

Data quality: Production data may contain inaccuracies, errors, or inconsistencies that can affect the testing results. This can lead to false positives or false negatives.

This is mitigated by the error-correction transformations (e.g. checking email addresses) already used in everyday production.
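As an illustration of the kind of check meant here, the sketch below separates records with plausible email addresses from those without. It is an assumption about what such a transformation might look like, not the production rule: the pattern is deliberately loose, and the `email` field name is hypothetical.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, and a dot in the domain.
# It is a plausibility check, not full address validation.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def split_by_email_validity(records):
    """Split records into (valid, invalid) lists based on the 'email' field."""
    valid, invalid = [], []
    for record in records:
        email = (record.get("email") or "").strip()
        if EMAIL_PATTERN.match(email):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid
```

Records in the invalid list can be corrected or excluded before tests run, so that failures reflect the software rather than the data.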

Risks of testing with production data

Regulatory compliance: Using production data for testing may be challenged on data protection grounds.

You may use production data to test your products if a Data Protection Impact Assessment finds usage to be compliant with data protection law. (see “Using Live Data for Testing Purposes - Security Guidance — Security-Guidance.service.justice.gov.uk”)

Risks of testing with production data

Overall: The risk-based approach shows that using production data outside of production does pose risks. In the next section, we’ll explore some of the mitigations that can be used to make production data safe to use outside its home environment.

Risk mitigation: Anonymisation

The most commonly used way of working with live data is through anonymisation.

Anonymisation: a process by which PII is irreversibly altered in such a way that a PII principal can no longer be identified directly or indirectly, either by the PII controller alone or in collaboration with any other party.

(see “Iso.org”)

Risk mitigation: Anonymisation

De-identification: Fields containing PII within records are “de-identified” using an encryption key and cipher to pseudonymise PII. The encrypted values of records important to certain tests can be noted separately for use during testing.
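A minimal sketch of keyed pseudonymisation follows. Note the swap: the text above describes an encryption key and cipher, whereas this example uses a keyed hash (HMAC-SHA256) from the Python standard library, which is one-way rather than reversible. The function name and the normalisation step are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Return a stable pseudonym for a PII value using a keyed hash.

    The same value and key always give the same pseudonym, so known test
    records can still be found; without the key the value cannot be
    recomputed by a third party.
    """
    normalised = value.strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the output is deterministic, the pseudonym of a record needed for a particular test can be computed in advance and noted separately, as described above. The key itself must be stored away from the test environment.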

Risk mitigation: Anonymisation

Perturbation: Adding random noise to the data, such as adding or subtracting a small amount from each data point. This can preserve the statistical properties of the data, but may affect its usefulness for analysis.
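The idea can be sketched in a few lines. This example assumes uniform noise within a fixed bound; the function name, the bound, and the choice of distribution are illustrative assumptions rather than a recommended scheme.

```python
import random


def perturb(values, max_noise, seed=None):
    """Add uniform random noise in [-max_noise, +max_noise] to each value."""
    rng = random.Random(seed)  # a seed makes the result reproducible in tests
    return [v + rng.uniform(-max_noise, max_noise) for v in values]
```

Each value moves by at most `max_noise`, so totals and averages over many records stay close to the originals while individual values are no longer exact.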

Risk mitigation: Anonymisation

Masking: Replacing sensitive data with a generic value, such as “****” or “XXX-XX-XXXX”. This technique can preserve the structure of the data, but may limit its usefulness, especially when it comes to distinguishing one value from another.
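A structure-preserving mask can be sketched as follows. The function name and the choice to keep punctuation are assumptions for illustration.

```python
def mask_preserving_format(value: str, mask_char: str = "X") -> str:
    """Replace every letter and digit with mask_char, keeping punctuation."""
    return "".join(mask_char if ch.isalnum() else ch for ch in value)


if __name__ == "__main__":
    print(mask_preserving_format("123-45-6789"))  # prints XXX-XX-XXXX
```

Two different inputs with the same layout produce the same masked output, which is the loss of distinguishability noted above.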

Masking code from the enterprise data repository

A view that masks fields extracted from a reporting table into enterprise data

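-- View that masks personal fields for people flagged as restricted.
CREATE VIEW [rpt].[dimPerson] AS SELECT
    [Title],
    CASE WHEN [Restricted] = 'Y' THEN '********'
      ELSE [Forename]
      END AS [Forename],
    CASE WHEN [Restricted] = 'Y' THEN '********'
      ELSE [Surname]
      END AS [Surname],
    -- Assumes [DateOfBirth] is a date column: convert it to text (yyyy-mm-dd)
    -- so that both CASE branches share a character type.
    CASE WHEN [Restricted] = 'Y' THEN '********'
      ELSE CONVERT(varchar(10), [DateOfBirth], 23)
      END AS [DateOfBirth],
    CASE WHEN [Restricted] = 'Y' THEN '********'
      ELSE [EmailAddress]
      END AS [EmailAddress]
    FROM [prs].[dimPerson];
GO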

Risk mitigation: Anonymisation

Generalisation: Removing specific details from the data to make it less specific, such as removing the exact date of birth and only keeping the year. This technique can preserve the overall trends in the data but may affect its usefulness for detailed analysis.
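The date-of-birth example can be sketched as below. The year-only function follows the example in the text; the age-band helper is an additional, hypothetical illustration of the same idea, and both names are assumptions.

```python
from datetime import date


def generalise_dob(dob: date) -> int:
    """Keep only the year of a date of birth."""
    return dob.year


def age_band(age: int, width: int = 10) -> str:
    """Place an age into a band of the given width, e.g. 37 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"
```

Wider bands give stronger protection but coarser data, so the width is a trade-off to agree with the people who will use the test results.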

Risk mitigation: Anonymisation

Aggregation: Combining data from multiple individuals into a single record to make it impossible to identify any individual. This can preserve the overall statistics of the data but may not be useful for detailed analysis of individual records.
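A minimal sketch of aggregation, assuming records are held as dictionaries; the function name and the field names passed to it are hypothetical.

```python
from collections import defaultdict


def aggregate_by(records, group_key, value_key):
    """Collapse individual records into one summary (count, mean) per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "mean": sum(values) / len(values)}
        for group, values in groups.items()
    }
```

Small groups can still be disclosive after aggregation, which is why aggregated outputs are often combined with a disclosure control rule.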

Risk mitigation: Anonymisation

Statistical disclosure control with geospatial data

The ONS use a combination of the above methods to prevent PII disclosure in their published data (see “Protecting Personal Data in Census 2021 Results - Office for National Statistics — Ons.gov.uk”)

Swapping records between areas
Applying a cell key method to each table
Applying disclosure rules to prevent sparse (disclosive) tables from being published.

Risk mitigation: Anonymisation

Statistical disclosure control of aggregated datasets with the 5/10 rule: This approach suppresses any counts below 10 and rounds counts of 10 and above to the nearest 5. This prevents sparse, disclosive data from being shared (see Lloyd et al. 2023).
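The rule as stated can be sketched directly. The function name and the use of `None` to represent a suppressed count are assumptions for illustration.

```python
def apply_5_10_rule(count: int):
    """Suppress counts below 10; round counts of 10 and above to the nearest 5.

    Returns None for a suppressed count.
    """
    if count < 10:
        return None
    # Integer arithmetic: adding 2 before flooring rounds to the nearest 5.
    return ((count + 2) // 5) * 5
```

For example, a count of 7 is suppressed, 12 becomes 10, and 13 becomes 15.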

Risk mitigation: other methods

Subsets: Using a subset of production data, rather than the entire dataset, can help reduce the risk of exposing sensitive information. Organisations should carefully consider which data is necessary for testing purposes and use only that data.
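Subsetting can reduce both the number of records and the number of fields. The sketch below assumes records are dictionaries; the function name, the sampling approach, and the field names are hypothetical.

```python
import random


def take_subset(records, fields, sample_size, seed=None):
    """Return a random sample of records, keeping only the named fields."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    chosen = rng.sample(records, min(sample_size, len(records)))
    return [{field: record[field] for field in fields} for record in chosen]
```

A purely random sample may miss the known records a test depends on, so those may need to be added to the subset explicitly.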

Risk mitigation: other methods

Virtualisation: Tools can create a virtual layer between the application and the data source, allowing testers to create virtual clones of their production data in real-time.

Risk mitigation: other methods

Strict access controls: Limiting access to production data to only those who need it can help prevent unauthorised access or data breaches. Organisations should implement strict access controls, such as role-based access or multi-factor authentication, to ensure that only authorised users can access the data.
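Role-based access can be reduced to a lookup from role to permitted actions. The roles and permission names below are hypothetical and exist only to show the shape of the check; in practice this would be enforced by the database or identity platform rather than application code.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "tester": {"read_masked"},
    "dba": {"read_masked", "read_raw"},
}


def can_access(role: str, permission: str) -> bool:
    """Return True only if the role has been granted the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles get no permissions by default, so access has to be granted explicitly.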

Risk mitigation: other methods

Monitoring data usage: Organisations should monitor how production data is being used for testing purposes to ensure that it is being used appropriately and responsibly. Regular audits can help identify any potential risks or compliance issues.

Risk mitigation: other methods

Obtaining user consent: In some cases, organisations may need to obtain user consent before using their production data for testing purposes. This is particularly important when dealing with sensitive data or data subject to regulatory requirements. Whether consent is needed can be assessed with a DPIA.

Role of data architect

“Support product teams managing the full data management lifecycle” (see GDS 2023)

Research / Enterprise data differences

Research data: cannot contain PII, so individuals are de-identified in a way that still enables their records to be linked across datasets.
Enterprise data: can contain PII if the terms of service allow it (e.g. for case management), but de-identified data is easier to re-use across the enterprise.

Some options for data systems

Note: Adopting anonymisation will require the supplier to transform the data during pre-test cloning
Note: New procedures to test known records would require development.

References

GDS. 2023. “Government Digital and Data Profession Capability Framework — ddat-Capability-Framework.service.gov.uk.” https://ddat-capability-framework.service.gov.uk/data-architect.html.

“Iso.org.” https://www.iso.org/obp/ui/en/#iso:std:iso-iec:29100:ed-1:v1:en.

Lloyd, Christopher D, Gemma Catney, Richard Wright, Mark Ellis, Nissa Finney, Stephen Jivraj, David Manley, and Sarah Wood. 2023. “An Ethnic Group Specific Deprivation Index for Measuring Neighbourhood Inequalities in England and Wales.” The Geographical Journal.

“Protecting Personal Data in Census 2021 Results - Office for National Statistics — Ons.gov.uk.” https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/methodologies/protectingpersonaldataincensus2021results#main-points.

“System Testing with Live Data May Breach Data Protection Act — Pinsentmasons.com.” https://www.pinsentmasons.com/out-law/news/system-testing-with-live-data-may-breach-data-protection-act.

“Using Live Data for Testing Purposes - Security Guidance — Security-Guidance.service.justice.gov.uk.” https://security-guidance.service.justice.gov.uk/using-live-data-for-testing-purposes/#who-is-this-for.