# Data Cleaning Tips and Best Practices ## Overview Data cleaning is crucial for accurate analysis, but it doesn't need to be overwhelming. Querri's AI automatically detects and fixes many common issues, but understanding these patterns helps you get better results faster. ## Understanding "Clean" Data **Important principle**: Data is "clean" when it's ready for your specific analysis task, not when it meets some abstract standard of perfection. ### What Makes Data Analysis-Ready - **Correct structure**: Headers in the right place, data in consistent format - **Proper data types**: Numbers stored as numbers, dates as dates - **Consistent values**: Standardized categories and formats - **Relevant completeness**: Missing values handled appropriately for your analysis ## Most Common Data Issues (And How Querri Fixes Them) ### 1. Buried Headers (Very Common!) **The Problem:** ``` Row 1: "Monthly Sales Report - Q4 2024" Row 2: "Generated on: 2024-12-15" Row 3: [blank] Row 4: [blank] Row 5: [blank] Row 6: "Order_ID", "Product", "Amount", "Date" ← Real headers here! Row 7: "10001", "Widget A", "$1,500", "2024-01-15" ``` **How Querri Helps:** - Automatically identifies the real header row (Row 6 in this example) - Removes title and metadata rows above - Sets proper column headers - Converts data types appropriately **User Tip:** If headers look wrong, ask: "The headers are in the wrong row, can you fix this?" ### 2. Data Types Stored as Text **The Problem:** - Numbers with dollar signs: "$1,500" instead of 1500 - Dates as text: "Jan 15, 2024" instead of date format - Mixed formats: "1,500" and "1500" in same column **How Querri Helps:** - Detects numeric data stored as text - Removes formatting characters ($, commas) - Converts to proper numeric types - Standardizes date formats **User Tip:** Ask: "Convert the Amount column to numbers" or "Fix the date formats" ### 3. Inconsistent Categories **The Problem:** ``` Product_Category column contains: - "Electronics" - "electronics" - "ELECTRONICS" - "Electronic" - "Elec" ``` **How Querri Helps:** - Uses AI to recognize similar categories - Suggests standardized naming - Can create consistent categories across your dataset - Fixes common misspellings **User Tip:** Ask: "Standardize the product categories" or "Clean up the region names" ## Data Quality Assessment Checklist ### Quick Visual Inspection When you first upload data, look for: **✅ Good Signs:** - Column headers are descriptive and in the first row - Data types look consistent within columns - No obvious formatting issues - Reasonable number of missing values **⚠️ Warning Signs:** - Generic headers (Column1, Column2, etc.) - Numbers mixed with text in the same column - Lots of blank rows or cells - Strange characters or encoding issues ### Questions to Ask Querri 1. **"What data quality issues do you see?"** 2. **"Are there any missing values I should know about?"** 3. **"Do the data types look correct?"** 4. **"Are there any duplicates in this data?"** ## Handling Missing Values Strategically ### Context-Dependent Approach **For Revenue Analysis:** - Missing sales amounts → Remove or estimate - Missing customer names → Usually OK, focus on amounts **For Customer Analysis:** - Missing contact info → May need to clean or remove - Missing purchase dates → Could impair trend analysis **For Operational Data:** - Missing timestamps → Critical for process analysis - Missing categories → May need standardization ### Smart Missing Value Strategies **Ask Querri to:** - "Handle missing values appropriately for sales analysis" - "Fill missing customer regions with 'Unknown' for now" - "Remove rows with missing order amounts" - "Estimate missing dates based on patterns" ## Advanced Cleaning Techniques ### 1. Using the Researcher Tool For complex text cleaning and categorization: **Example:** "Use the researcher tool to loop through each product description and create a standardized category. Categories should be: Electronics, Clothing, Home & Garden, Books, Sports. Base the category on the product name and description." ### 2. Multi-Dataset Cleaning When working with multiple files: - Clean the main dataset first - Ensure joining columns (like customer_id) are consistent across files - Address inconsistencies in shared categories ### 3. Date and Time Cleaning Common date issues Querri can fix: - Mixed formats: "1/15/24" and "January 15, 2024" - Text dates: "Q1 2024" → Convert to specific dates - Time zones: Standardize to consistent timezone ## Industry-Specific Cleaning Tips ### Sales Data **Common Issues:** - Product codes with inconsistent formatting - Customer names with variations - Currency symbols in numeric fields - Date ranges vs. specific dates **Querri Solutions:** ``` "Standardize all product codes to uppercase" "Clean customer names and remove duplicates" "Convert all currency fields to numbers" "Standardize date formats for time series analysis" ``` ### HR/Employee Data **Common Issues:** - Job titles with variations - Department names inconsistent - Salary ranges vs. specific amounts - Employment dates in various formats **Querri Solutions:** ``` "Standardize job titles across the organization" "Create consistent department categories" "Convert salary ranges to midpoint values" "Ensure all dates use the same format" ``` ### Survey Data **Common Issues:** - Scale responses (1-5, 1-10, etc.) stored as text - Multiple choice with "Other" text responses - Likert scales with missing responses - Free text that needs categorization **Querri Solutions:** ``` "Convert rating scales to numeric values" "Categorize open-text responses using the researcher tool" "Handle missing survey responses appropriately" "Standardize multiple choice options" ``` ## When NOT to Clean Data ### Preserve Important Context - **Missing values that are meaningful**: A missing ship date for unshipped orders - **Outliers that are real**: Legitimate high-value transactions - **Variations that matter**: Regional spelling differences that affect analysis ### Ask Before Major Changes - "Should I remove outliers or keep them for analysis?" - "Are these missing values meaningful or should I fill them?" - "Do these categories need to be combined or kept separate?" ## Quality Control After Cleaning ### Verify Your Results 1. **Check row counts**: "How many rows before and after cleaning?" 2. **Verify key totals**: "Do the revenue totals still match?" 3. **Sample check**: "Show me a few examples of the cleaned data" 4. **Data type verification**: "Are all the data types correct now?" ### Common Post-Cleaning Questions - "What percentage of the data was affected by cleaning?" - "Are there any remaining quality issues?" - "Can you summarize what was cleaned?" - "Is this data ready for analysis now?" ## Best Practices Summary ### Let Querri Guide You 1. **Start with automatic detection**: Let Querri identify issues first 2. **Ask specific questions**: "Fix the header issues" rather than generic "clean this" 3. **Verify changes**: Always check results after cleaning 4. **Document changes**: Ask Querri to summarize what was cleaned ### Work Collaboratively 1. **Provide context**: "This is sales data from our CRM system" 2. **Set priorities**: "Revenue accuracy is most important" 3. **Confirm changes**: "Yes, combine those similar categories" 4. **Ask for alternatives**: "What other cleaning options do I have?" ### Think About Your End Goal - **For visualization**: Focus on consistent categories and proper data types - **For statistical analysis**: Address missing values and outliers carefully - **For reporting**: Ensure totals and counts are accurate - **For machine learning**: May need more extensive cleaning and feature engineering ## Troubleshooting Cleaning Issues ### When Cleaning Goes Wrong **Problem**: "The cleaning removed too much data" **Solution**: Ask Querri to undo and try a different approach **Problem**: "The data types still don't look right" **Solution**: Be more specific: "Convert the Amount column to currency format" **Problem**: "Categories weren't standardized properly" **Solution**: Provide examples: "Electronics and Electronic should be the same category" ### Getting Better Results - **Be specific about your needs**: "I need this for monthly trend analysis" - **Provide business context**: "These are product categories from our inventory system" - **Ask for explanations**: "Why did you choose to handle missing values this way?" - **Request alternatives**: "What other ways could we clean this data?" Remember: Good data cleaning is about making your data work for your specific analysis needs. Querri's AI makes this process much easier, but your domain knowledge and feedback make it perfect for your use case. ## Next Steps - **[Data Transformation Guide](../advanced-features/Data-Transformation.md)**: Learn about reshaping and enhancing data - **[Analysis Strategies](./Analysis-Strategies.md)**: Best practices for different types of analysis - **[Troubleshooting Data Issues](../troubleshooting/Data-Type-Errors.md)**: Solve specific data problems