Platform.sh User Documentation

Sanitizing databases: PostgreSQL and Django

Sign up for Upsun

Get your free trial by clicking the link below.

Get your Upsun free trial

Databases of live websites often contain personally identifiable information (PII) such as full names, mailing addresses, and phone numbers. To ensure people reviewing code changes can’t access information they shouldn’t, sanitize your databases of any PII that they may contain.

This example goes through the process for a PostgreSQL database using Django.

Before you begin Anchor to this heading

You need:

  • A project with a PostgreSQL database.
  • A command interface installed:
    • If doing it manually, the Platform CLI.
    • Otherwise, make sure pqsl is installed in your environment.

This guide is about sanitizing PostgreSQL databases.

This guide doesn’t address:

  • Sanitizing NoSQL Databases (such as MongoDB)
  • Input validation and input sanitization, which both help prevent security vulnerabilities

Sanitize the database Anchor to this heading

Make sure that you only sanitize preview environments and never the production environment. Otherwise you may lose most or even all of the relevant data stored in your database.

First, take a database dump of your preview environment. This is just a safety precaution. Production data isn’t altered. To get a database dump, run the following command: platform db:dump -e DEVELOPMENT_ENVIRONMENT_NAME .

Assumptions:

  • users is the table where all of your PII is stored in the staging development database.
  • staging is an exact copy of your production database.
  1. Connect to the staging database by running platform sql -e staging.

  2. Display all fields from your users table, to select which ones need to be redacted. Run the following query:

    main=> SELECT * FROM users;

    You see output like the following:

    id   |                user_email               |     display_name
    -----+-----------------------------------------+-----------------------
    3501 | daniel02@yourcompany.com                | Jason Brown
    3502 | ismith@kim.com                          | Sandra Griffin
    3503 | olee@coleman-rodriguez.com              | Miss Christine Morgan
  3. Change the fields where PII is contained with the UPDATE statement. For example, to change the display name of users with an email address not in your company’s domain to a random value, run the following query:

    UPDATE users
    SET display_name==substring(md5(display_name||'$PLATFORM_PROJECT_ENTROPY') for 8);
    WHERE email NOT LIKE '%@yourcompany%'

    Adapt and run that query for all fields that you need to sanitize. If you modify fields that you shouldn’t alter, you can restore them from the dump you took in step 1.

    You can create a script to automate the sanitization process to be run automatically on each new deployment. Once you have a working script, add your script to sanitize the database to a deploy hook:

    .platform.app.yaml
    hooks:
        deploy: |
    
            # ...
    
            cd /app/public
            if [ "$PLATFORM_ENVIRONMENT_TYPE" = production ]; then
                # Do whatever you want on the production site.
            else
                # The sanitization of the database should happen here (since it's non-production)
                sanitize_the_database.sh
            fi

Assumptions:

  • users is the table where all of your PII is stored in the staging development database.
  • database is the relationship name for the PostgreSQL service.

Set up a script by following these steps:

  1. Retrieve service credentials from the PLATFORM_RELATIONSHIPS environment variable to use the psql command interface. Export these values to a .environment file or include them directly in the sanitization script.

    .environment
    # Pull credentials from the PLATFORM_RELATIONSHIPS environment variable.
    DB_USER=$(echo $PLATFORM_RELATIONSHIPS | base64 --decode | jq -r '.database[0].username')
    DB_HOST=$(echo $PLATFORM_RELATIONSHIPS | base64 --decode | jq -r '.database[0].host')
    DB_PORT=$(echo $PLATFORM_RELATIONSHIPS | base64 --decode | jq -r '.database[0].port')
    DB_PASS=$(echo $PLATFORM_RELATIONSHIPS | base64 --decode | jq -r '.database[0].password')
  2. Create an executable sanitizing script by running the following command:

    touch sanitize.sh && chmod +x sanitize.sh
  3. Make the script sanitize environments with an environment type other than production.

    The following example runs only in preview environments and sanitizes the display_name and email columns of the users table. Adjust the details to fit your data.

    sanitize.sh
    #!/usr/bin/env bash
    
    if [ "$PLATFORM_ENVIRONMENT_TYPE" != production ]; then
        # Sanitize data
        PGPASSWORD=$DB_PASS psql -c "UPDATE users SET display_name=substring(md5(display_name||'$PLATFORM_PROJECT_ENTROPY') for 8);" -U $DB_USER -h $DB_HOST -p $DB_PORT
        PGPASSWORD=$DB_PASS psql -c "UPDATE users SET email=substring(md5(email||'$PLATFORM_PROJECT_ENTROPY') for 8);" -U $DB_USER -h $DB_HOST -p $DB_PORT
    fi

    To sanitize only on the initial deploy and not all future deploys, on sanitization create a file on a mount. Then add a check for the file as in the following example:

    sanitize.sh
    #!/usr/bin/env bash
    
    if [ "$PLATFORM_ENVIRONMENT_TYPE" != production ] && [ ! -f MOUNT_PATH/is_sanitized ]; then
        # Sanitize data
        touch MOUNT_PATH/is_sanitized
    fi
  4. Update the deploy hook to run your script on each deploy.

    .platform.app.yaml
    hooks:
        build: ...
        deploy: |
            python manage.py migrate
            bash sanitize.sh        
  5. Commit your changes by running the following command:

    git add .environment sanitize.sh .platform.app.yaml&& git commit -m "Add sanitization."

    Push the changes to staging and verify that environment’s database was sanitized. Once merged to production, all data from future preview environments are sanitized on environment creation.

What’s next Anchor to this heading

You learned how to remove sensitive data from a database.

To replace sensitive data that with other meaningful data, you can add a faker to the process. A faker is a program that generates fake data that looks real. Having meaningful PII-free data allows you to keep your current Q&A, external reviews, and other processes. To add a faker, adapt your sanitizing queries to replace each value that contains PII with a new value generated by the faker.

You might also want to make sure that you implement input validation.

If your database contains a lot of data, consider using the REINDEX statement to help improve performance.

Is this page helpful?