Data Cleaning & Preparation
Learn how to use aidnn to clean messy data, handle missing values, and prepare datasets for analysis.Common Data Issues
aidnn helps you fix:- Missing or null values
- Duplicate records
- Inconsistent formatting
- Outliers and anomalies
- Wrong data types
- Messy column names
- Extra whitespace
Quick Start
Step 1: Assess Data Quality
Check for issues:“Check this data for quality issues”aidnn will identify:
- Missing values in columns
- Duplicate rows
- Outliers
- Formatting inconsistencies
- Data type problems
Step 2: Fix Common Issues
Remove duplicates:“Remove duplicate rows from this data”Handle missing values:
“Remove rows with any missing values”
“Fill missing values in the price column with the average”Fix formatting:
“Trim whitespace from all text columns”
“Convert the date column to proper date format”Standardize:
“Convert all state names to two-letter abbreviations”
“Make all email addresses lowercase”
Real-World Example
Scenario: Cleaning Customer Data
You have: A customer database exported from various sources with quality issues. Your Goal: Clean it for analysis and import into a CRM. Conversation: 1. Upload Data“Upload customer_export.csv”2. Initial Assessment
“Show me a summary and check for data quality issues”aidnn Reports:
“Remove the duplicate customer records, keeping the most recent entry for each”aidnn:
“For rows missing email addresses, can we fill them in from other data?”aidnn: “I can’t infer missing emails reliably. Would you like to:
- Remove rows without emails
- Keep them but flag as incomplete
- Export list for manual follow-up”
“Keep them but add a column marking incomplete records”aidnn adds flag column:
“Standardize all phone numbers to format (###) ###-####”aidnn:
“Convert all state names to two-letter abbreviations”Examples:
“Convert zip codes to text and add leading zeros where needed”aidnn restores:
“Trim extra spaces from first and last name columns”Removes:
“Show me a summary of the cleaned data”aidnn confirms:
“Export the cleaned data to CSV”Ready for CRM import!
Common Cleaning Tasks
Handling Missing Data
Remove rows with missing values:“Remove rows where revenue is missing”Fill with default:
“Fill missing status values with ‘Unknown’”Fill with calculation:
“Fill missing prices with the category average”Flag for review:
“Add a column indicating which rows have missing data”
Removing Duplicates
Remove all duplicates:“Remove duplicate rows”Keep specific version:
“Remove duplicates by customer ID, keeping the most recent”Check for duplicates:
“Show me potential duplicate customers based on name and address”
Fixing Data Types
Convert types:“Convert the amount column to number type”
“Convert order_date to date format”Handle errors:
“Convert revenue to numbers, replacing invalid values with 0”
Standardizing Text
Case normalization:“Convert all product names to title case”
“Make all email addresses lowercase”Trim whitespace:
“Remove leading and trailing spaces from all text columns”Find and replace:
“Replace all instances of ‘N/A’ with null”
Date Handling
Parse dates:“Convert the text date column to proper date format”Extract components:
“Create separate columns for year, month, and day from the date”Handle invalid dates:
“Identify and flag rows with invalid dates”
Outlier Management
Identify outliers:“Find outliers in the price column”Remove outliers:
“Remove rows where revenue is more than 3 standard deviations from mean”Cap outliers:
“Cap prices at the 95th percentile value”
Data Validation
Checking Ranges
Validate values:“Check if all ages are between 0 and 120”
“Find rows where quantity is negative”
“Show me any prices below 10,000”
Format Validation
Email addresses:“Check if all email addresses are valid format”Phone numbers:
“Identify invalid phone numbers”Zip codes:
“Find zip codes that don’t match US format”
Referential Integrity
Check relationships:“Find orders with customer IDs that don’t exist in the customer table”
“Identify products referenced in sales but not in the product catalog”
Best Practices
Before Cleaning
1. Make a Copy Always work with a copy of your data, not the original. 2. Understand the Data Review structure and contents before making changes. 3. Document Issues Keep track of what problems you find.During Cleaning
1. Go Step by Step Don’t try to fix everything at once. 2. Verify Each Change Check results after each cleaning operation. 3. Keep Track Note what transformations you apply.After Cleaning
1. Validate Results Check that cleaned data meets requirements. 2. Compare Before/After Verify you didn’t lose important information. 3. Document Process Record what cleaning steps you took.Advanced Cleaning
Complex Replacements
Pattern-based:“Replace all product codes matching pattern ‘OLD-XXX’ with ‘NEW-XXX’”Conditional:
“For rows where region is ‘West’, update the territory code to ‘W‘“
Data Enrichment
Adding calculations:“Add a column calculating profit margin from revenue and cost”Categorization:
“Create a customer segment column based on purchase history”
Multi-File Cleaning
Apply same cleaning:“Apply the same cleaning steps to all files in my folder”Standardize across sources:
“Make the column names consistent across my sales and orders files”
Troubleshooting
Too Much Data Removed
If you accidentally removed too much:- Start over with original file
- Be more specific in filters
- Review step by step
Transformations Not Working
Check:- Data types are correct
- No hidden characters
- Format is what you expect
Can’t Fix an Issue
Try:- Breaking it into smaller steps
- Showing aidnn examples
- Explaining what the clean version should look like
Quick Reference
| Task | Example Command |
|---|---|
| Find missing values | ”Check for missing values” |
| Remove duplicates | ”Remove duplicate rows” |
| Fill nulls | ”Fill missing prices with average” |
| Trim spaces | ”Remove extra spaces” |
| Fix data types | ”Convert amount to number” |
| Standardize format | ”Format phone numbers as (###) ###-####“ |
| Remove outliers | ”Remove price outliers” |
| Validate format | ”Check if emails are valid” |
Next Steps
- Financial Reports - Analyze financial data
- Sales Analysis - Analyze sales performance
- Best Practices - Data quality tips