Skip to main content

Data Cleaning & Preparation

Learn how to use aidnn to clean messy data, handle missing values, and prepare datasets for analysis.

Common Data Issues

aidnn helps you fix:
  • Missing or null values
  • Duplicate records
  • Inconsistent formatting
  • Outliers and anomalies
  • Wrong data types
  • Messy column names
  • Extra whitespace

Quick Start

Step 1: Assess Data Quality

Check for issues:
“Check this data for quality issues”
aidnn will identify:
  • Missing values in columns
  • Duplicate rows
  • Outliers
  • Formatting inconsistencies
  • Data type problems

Step 2: Fix Common Issues

Remove duplicates:
“Remove duplicate rows from this data”
Handle missing values:
“Remove rows with any missing values”
“Fill missing values in the price column with the average”
Fix formatting:
“Trim whitespace from all text columns”
“Convert the date column to proper date format”
Standardize:
“Convert all state names to two-letter abbreviations”
“Make all email addresses lowercase”

Real-World Example

Scenario: Cleaning Customer Data

You have: A customer database exported from various sources with quality issues. Your Goal: Clean it for analysis and import into a CRM. Conversation: 1. Upload Data
“Upload customer_export.csv”
2. Initial Assessment
“Show me a summary and check for data quality issues”
aidnn Reports:
Data Quality Issues Found:

Missing Values:
- Email: 45 rows (4.5%)
- Phone: 128 rows (12.8%)
- State: 12 rows (1.2%)

Duplicates:
- 23 duplicate customer records found

Formatting Issues:
- Phone numbers in multiple formats
- State names inconsistent (full names vs abbreviations)
- Extra spaces in name fields

Data Type Issues:
- ZipCode stored as number (leading zeros lost)
3. Remove Duplicates
“Remove the duplicate customer records, keeping the most recent entry for each”
aidnn:
Plan: Remove Duplicates
1. Identify duplicate records by email address
2. For each duplicate set, keep row with most recent date
3. Remove older duplicate entries
4. Verify 23 rows removed
Approve → 23 duplicates removed 4. Fix Missing Values
“For rows missing email addresses, can we fill them in from other data?”
aidnn: “I can’t infer missing emails reliably. Would you like to:
  • Remove rows without emails
  • Keep them but flag as incomplete
  • Export list for manual follow-up”
“Keep them but add a column marking incomplete records”
aidnn adds flag column:
New column: data_complete
- TRUE for complete records
- FALSE for those missing critical fields
5. Standardize Phone Numbers
“Standardize all phone numbers to format (###) ###-####”
aidnn:
Before:
555-123-4567
5551234567
(555) 123 4567

After:
(555) 123-4567
(555) 123-4567
(555) 123-4567
6. Fix State Names
“Convert all state names to two-letter abbreviations”
Examples:
California → CA
New York → NY
Texas → TX
7. Fix Zip Codes
“Convert zip codes to text and add leading zeros where needed”
aidnn restores:
1234 → 01234
345 → 00345
12345 → 12345
8. Clean Names
“Trim extra spaces from first and last name columns”
Removes:
" John " → "John"
"Smith  " → "Smith"
"  Mary" → "Mary"
9. Validate Results
“Show me a summary of the cleaned data”
aidnn confirms:
Cleaning Complete:
✅ 977 total records (from 1,000)
✅ 23 duplicates removed
✅ Phone numbers standardized
✅ State codes standardized
✅ Zip codes fixed
✅ Names trimmed
✅ Incomplete records flagged
10. Export Clean Data
“Export the cleaned data to CSV”
Ready for CRM import!

Common Cleaning Tasks

Handling Missing Data

Remove rows with missing values:
“Remove rows where revenue is missing”
Fill with default:
“Fill missing status values with ‘Unknown’”
Fill with calculation:
“Fill missing prices with the category average”
Flag for review:
“Add a column indicating which rows have missing data”

Removing Duplicates

Remove all duplicates:
“Remove duplicate rows”
Keep specific version:
“Remove duplicates by customer ID, keeping the most recent”
Check for duplicates:
“Show me potential duplicate customers based on name and address”

Fixing Data Types

Convert types:
“Convert the amount column to number type”
“Convert order_date to date format”
Handle errors:
“Convert revenue to numbers, replacing invalid values with 0”

Standardizing Text

Case normalization:
“Convert all product names to title case”
“Make all email addresses lowercase”
Trim whitespace:
“Remove leading and trailing spaces from all text columns”
Find and replace:
“Replace all instances of ‘N/A’ with null”

Date Handling

Parse dates:
“Convert the text date column to proper date format”
Extract components:
“Create separate columns for year, month, and day from the date”
Handle invalid dates:
“Identify and flag rows with invalid dates”

Outlier Management

Identify outliers:
“Find outliers in the price column”
Remove outliers:
“Remove rows where revenue is more than 3 standard deviations from mean”
Cap outliers:
“Cap prices at the 95th percentile value”

Data Validation

Checking Ranges

Validate values:
“Check if all ages are between 0 and 120”
“Find rows where quantity is negative”
“Show me any prices below 0orabove0 or above 10,000”

Format Validation

Email addresses:
“Check if all email addresses are valid format”
Phone numbers:
“Identify invalid phone numbers”
Zip codes:
“Find zip codes that don’t match US format”

Referential Integrity

Check relationships:
“Find orders with customer IDs that don’t exist in the customer table”
“Identify products referenced in sales but not in the product catalog”

Best Practices

Before Cleaning

1. Make a Copy Always work with a copy of your data, not the original. 2. Understand the Data Review structure and contents before making changes. 3. Document Issues Keep track of what problems you find.

During Cleaning

1. Go Step by Step Don’t try to fix everything at once. 2. Verify Each Change Check results after each cleaning operation. 3. Keep Track Note what transformations you apply.

After Cleaning

1. Validate Results Check that cleaned data meets requirements. 2. Compare Before/After Verify you didn’t lose important information. 3. Document Process Record what cleaning steps you took.

Advanced Cleaning

Complex Replacements

Pattern-based:
“Replace all product codes matching pattern ‘OLD-XXX’ with ‘NEW-XXX’”
Conditional:
“For rows where region is ‘West’, update the territory code to ‘W‘“

Data Enrichment

Adding calculations:
“Add a column calculating profit margin from revenue and cost”
Categorization:
“Create a customer segment column based on purchase history”

Multi-File Cleaning

Apply same cleaning:
“Apply the same cleaning steps to all files in my folder”
Standardize across sources:
“Make the column names consistent across my sales and orders files”

Troubleshooting

Too Much Data Removed

If you accidentally removed too much:
  • Start over with original file
  • Be more specific in filters
  • Review step by step

Transformations Not Working

Check:
  • Data types are correct
  • No hidden characters
  • Format is what you expect

Can’t Fix an Issue

Try:
  • Breaking it into smaller steps
  • Showing aidnn examples
  • Explaining what the clean version should look like

Quick Reference

TaskExample Command
Find missing values”Check for missing values”
Remove duplicates”Remove duplicate rows”
Fill nulls”Fill missing prices with average”
Trim spaces”Remove extra spaces”
Fix data types”Convert amount to number”
Standardize format”Format phone numbers as (###) ###-####“
Remove outliers”Remove price outliers”
Validate format”Check if emails are valid”

Next Steps

Questions? Check our FAQ or contact support@isotopes.ai.