Exploratory data analysis (EDA) uses descriptive statistics and visualizations to learn about a dataset. The main motivations for EDA are:
- detecting data errors
- testing assumptions about the data
- selecting appropriate models
- determining relationships among the explanatory variables
- assessing the direction and magnitude of relationships between explanatory and outcome variables.
However, mistakes and faulty assumptions are often difficult to detect because many statistical and numerical analyses (e.g., means and standard deviations) will produce an answer regardless of whether that answer makes sense within the overall dataset.
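As a small illustration of that point, the sketch below computes a mean and standard deviation over a made-up list of ages. The values are hypothetical, and the entries -1 and 999 are assumed stand-ins for data-entry errors or placeholder codes. Nothing in the calculation complains about them.

```python
from statistics import mean, stdev

# Hypothetical ages; -1 and 999 are assumed placeholder or data-entry errors.
ages = [34, 29, 41, -1, 38, 999, 27]

# Both functions return a number without questioning the inputs.
avg_age = mean(ages)
spread = stdev(ages)

print(f"mean age: {avg_age:.1f}, standard deviation: {spread:.1f}")
```

The code runs without error and reports a mean age above 160, which no real person could have. The summary statistic alone gives no hint about which records are responsible.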
What to do?
If the dataset deals with people, one approach is to humanize the data by generating personas – EDA personas.
Generating an EDA persona gives the context for detecting mistakes and incorrect assumptions in the dataset.
For example, in a workshop we teach, we use a Kaggle marketing dataset and run standard EDA to gain insight into its variables. The numerical results from EDA seem reasonable.
However, when we then apply personas as an EDA approach, several anomalies stand out, such as highly skilled individuals with no professional education, twenty-year-olds with years of professional experience, and product satisfaction ratings by individuals who have never purchased the product.
Using EDA personas provides the context for evaluating people datasets to detect mistakes and anomalies that might not be detected using numbers alone.
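One way to try this out is sketched below, using only the Python standard library. It is an illustration and not the workshop's actual code: the records, the field names (`education`, `skill_level`, `years_experience`, `purchases`, `satisfaction`), and the minimum working age of 18 are all assumptions and do not reflect the Kaggle dataset's real schema. Each record is rendered as a short persona sentence and then checked against the three kinds of anomalies mentioned above.

```python
# Hypothetical records; field names are illustrative, not the Kaggle schema.
records = [
    {"id": 1, "age": 45, "education": "masters", "skill_level": "high",
     "years_experience": 20, "purchases": 3, "satisfaction": 4},
    {"id": 2, "age": 20, "education": "bachelors", "skill_level": "medium",
     "years_experience": 12, "purchases": 1, "satisfaction": 5},
    {"id": 3, "age": 38, "education": "none", "skill_level": "high",
     "years_experience": 15, "purchases": 0, "satisfaction": 2},
]


def describe_persona(r):
    """Render one record as a sentence a reader can sanity-check."""
    return (
        f"Person {r['id']} is {r['age']} years old, has education "
        f"'{r['education']}', a {r['skill_level']} skill level, "
        f"{r['years_experience']} years of experience, "
        f"{r['purchases']} purchases, and rated the product "
        f"{r['satisfaction']}."
    )


def persona_flags(r, min_working_age=18):
    """Return a list of reasons this persona looks implausible."""
    flags = []
    if r["skill_level"] == "high" and r["education"] == "none":
        flags.append("highly skilled with no professional education")
    # Experience that implies a start date before the assumed working age.
    if r["age"] - r["years_experience"] < min_working_age:
        flags.append("more experience than their age plausibly allows")
    if r["purchases"] == 0 and r["satisfaction"] is not None:
        flags.append("satisfaction rating without any purchase")
    return flags


for r in records:
    print(describe_persona(r))
    for flag in persona_flags(r):
        print(f"  check: {flag}")
```

Reading the sentence for person 2, a twenty-year-old with twelve years of experience, makes the problem obvious in a way that a column average would not. The flags are prompts for a closer look, since some flagged records may turn out to be legitimate.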