Survey Cleaner

Welcome to survey_cleaner

survey_cleaner streamlines the cleaning of survey data by automating common cleaning tasks. Designed to generalize across survey topics, it provides functions to remove duplicate responses, remove unnecessary whitespace, normalize responses to a binary format, and convert ordinal responses to numeric data. The package sets up a standardized cleaning framework that can be carried across multiple projects, helping users reduce manual preprocessing time and minimize errors.

Functions

  • remove_duplicates: keeps only the latest survey response from each individual.
  • handle_emptyStrings: cleans whitespace in text responses by collapsing all runs of whitespace into single spaces and stripping leading/trailing whitespace. It handles None and raises a TypeError for non-string inputs.
  • normalize_binary: converts binary responses such as True and False, T and F, or Yes and No to a binary format (0 and 1).
  • word_to_ordinal: assigns ranking words such as Best, Better, Good, Bad, Worst a numerical rating so that responses can be ordered by their numerical values. Likert scales are provided as default rankings, but users can also supply their own.

Python Ecosystem

A number of text cleaning packages are available on PyPI, such as clean-text, which preprocesses raw text data from the web. However, no package is specifically dedicated to cleaning survey response data, and this is the gap the survey_cleaner package addresses.

Installation

Clone the repository to your local machine:

$ git clone https://github.com/UBC-MDS/DSCI_524_group35_survey_cleaner.git

$ cd DSCI_524_group35_survey_cleaner

It is recommended but not required to use the environment file to create a conda environment:

$ conda env create -f environment.yml

$ conda activate survey_cleaner

You can install this package into your preferred Python environment using pip:

$ pip install survey_cleaner

Usage Examples:

Clean Whitespace

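import pandas as pd
from survey_cleaner import handle_emptyStrings

# Example data with stray whitespace
df = pd.DataFrame({'comments': ['  Great   service ', 'Too  slow']})

# Removes leading/trailing whitespace and collapses multiple spaces
df['comments'] = handle_emptyStrings(df['comments'])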
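
For readers who want to see the cleaning rule itself, here is a minimal stand-alone sketch of the behaviour described under Functions: collapse runs of whitespace, strip both ends, and reject non-string input. The name clean_whitespace and the choice to pass None through unchanged are assumptions made for illustration. This is not the package's implementation.

```python
import re


def clean_whitespace(text):
    """Hypothetical helper illustrating the whitespace rule; not part of survey_cleaner."""
    if text is None:
        # Assumption: a missing response is passed through unchanged.
        return None
    if not isinstance(text, str):
        raise TypeError(f"expected str or None, got {type(text).__name__}")
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space,
    # then strip leading and trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_whitespace("  Great   service\t overall \n"))  # Great service overall
```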

Normalize Binary Responses

from survey_cleaner import normalize_binary
import pandas as pd

# Converts Yes/No, True/False, T/F to 1/0
df = pd.DataFrame({'response': ['Yes', 'No', 'Yes']})
df['response'] = df['response'].apply(normalize_binary)
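
To illustrate the kind of mapping involved, the sketch below converts common yes/no style labels to 1 and 0 in plain Python. The name to_binary, the lookup table, the case-insensitive matching, and the ValueError for unknown labels are all assumptions for illustration; the labels and error handling in normalize_binary may differ.

```python
# Hypothetical lookup table; the labels survey_cleaner accepts may differ.
BINARY_LABELS = {
    "yes": 1, "true": 1, "t": 1,
    "no": 0, "false": 0, "f": 0,
}


def to_binary(response):
    """Hypothetical helper: map a yes/no style response to 1 or 0."""
    if isinstance(response, bool):
        return int(response)
    # Assumption: matching ignores case and surrounding whitespace.
    key = str(response).strip().lower()
    if key not in BINARY_LABELS:
        raise ValueError(f"unrecognized binary response: {response!r}")
    return BINARY_LABELS[key]


print([to_binary(r) for r in ["Yes", "No", "T", False]])  # [1, 0, 1, 0]
```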

Convert Ordinal Responses to Numeric

from survey_cleaner import word_to_ordinal

# `feedback` holds the ranking-word survey responses to convert

# Customized mapping
word_to_ordinal(feedback, mapping={"Good": 1, "Bad": 0})
# Using default Likert scale
word_to_ordinal(feedback, likert="agreement")
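
The idea of a default scale with an optional custom mapping can be sketched in plain Python as follows. The name rank_words, the five-point agreement scale, and the choice to return None for unknown words are assumptions for illustration, not the behaviour of word_to_ordinal.

```python
# Assumed five-point agreement scale, lowest to highest.
AGREEMENT_SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def rank_words(responses, mapping=None):
    """Hypothetical helper: replace ranking words with numbers."""
    if mapping is None:
        scale = AGREEMENT_SCALE
    else:
        # Lower-case the custom keys so lookups ignore case.
        scale = {word.lower(): value for word, value in mapping.items()}
    # Unknown words become None so they are easy to spot afterwards.
    return [scale.get(word.strip().lower()) for word in responses]


print(rank_words(["Agree", "Strongly disagree", "Neutral"]))  # [4, 1, 3]
print(rank_words(["Good", "Bad"], mapping={"Good": 1, "Bad": 0}))  # [1, 0]
```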

Remove Duplicate Responses

import pandas as pd
from survey_cleaner import remove_duplicates

responses = pd.DataFrame({
    'respondent_id': [1, 2, 1, 3],
    'completed_at': ['2024-01-01 10:00', '2024-01-01 11:00',
                     '2024-01-01 12:00', '2024-01-01 13:00'],
    'answer': ['Yes', 'No', 'Maybe', 'Yes']
})

# Keeps only the latest response from each respondent
clean_responses = remove_duplicates(responses, 'respondent_id', 'completed_at')
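
The "keep only the latest response per individual" rule can be sketched without pandas to show the logic. The name keep_latest, the list-of-dicts input, the timestamp format, and the output ordering are assumptions for illustration; remove_duplicates itself works on a DataFrame and may order or break ties differently.

```python
from datetime import datetime


def keep_latest(rows, id_key, time_key, fmt="%Y-%m-%d %H:%M"):
    """Hypothetical helper: keep each respondent's most recent row."""
    latest = {}
    for row in rows:
        stamp = datetime.strptime(row[time_key], fmt)
        current = latest.get(row[id_key])
        # Replace the stored row only when this one is at least as recent.
        if current is None or stamp >= current[0]:
            latest[row[id_key]] = (stamp, row)
    # Return the surviving rows ordered by timestamp.
    return [row for _, row in sorted(latest.values(), key=lambda pair: pair[0])]


rows = [
    {"respondent_id": 1, "completed_at": "2024-01-01 10:00", "answer": "Yes"},
    {"respondent_id": 2, "completed_at": "2024-01-01 11:00", "answer": "No"},
    {"respondent_id": 1, "completed_at": "2024-01-01 12:00", "answer": "Maybe"},
    {"respondent_id": 3, "completed_at": "2024-01-01 13:00", "answer": "Yes"},
]
kept = keep_latest(rows, "respondent_id", "completed_at")
print([row["answer"] for row in kept])  # ['No', 'Maybe', 'Yes']
```

Respondent 1 answered twice, so only the later 12:00 response survives.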

Developer Setup

Clone the repository to your local machine:

$ git clone https://github.com/UBC-MDS/DSCI_524_group35_survey_cleaner.git

$ cd DSCI_524_group35_survey_cleaner

It is recommended but not required to use the environment file to create a conda environment:

$ conda env create -f environment.yml

$ conda activate survey_cleaner

Install the package in development (editable) mode, including the documentation dependencies:

$ pip install -e ".[docs]"

Run the test suite:

$ pytest tests/

Build Documentation

Building Documentation Locally

Install documentation dependencies:

$ pip install -e ".[docs]"

Generate API documentation using quartodoc:

$ quartodoc build

Preview the documentation:

$ quarto preview

Render the final HTML output:

$ quarto render

Deploy Documentation

Workflow Overview

Workflow               Trigger                          Purpose
build.yml              Push/PR to main                  Runs tests and builds package
deploy.yml             Push to main (after tests pass)  Deploys package to TestPyPI
quartodoc.yml          Push/PR to main                  Builds API documentation with quartodoc
quartodoc-publish.yml  Push to main                     Publishes documentation to GitHub Pages

Documentation Build Workflow

The quartodoc.yml workflow automatically:

  1. Checks out the repository

  2. Sets up the Python environment

  3. Installs the package with documentation dependencies

  4. Runs quartodoc build to generate API docs

  5. Validates the documentation build

Documentation Publish Workflow

The quartodoc-publish.yml workflow automatically:

  1. Builds the documentation using Quarto

  2. Deploys to GitHub Pages when changes are pushed to main

  3. Makes the documentation available at the GitHub Pages URL

Viewing Published Documentation

Once deployed, the documentation is available on GitHub Pages: https://ubc-mds.github.io/DSCI_524_group35_survey_cleaner/

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributors

Natalie Truesdell, Amanpreet Binepal, Jay Li, Junli