Exploratory Data Analysis (EDA) is an approach to analyzing data that emphasizes exploring datasets for patterns and insights without any predetermined hypotheses.
The goal is to let the data “speak for themselves” and guide analysis, rather than imposing rigid structures or theories.
Goals
Exploratory data analysis investigates datasets to discover patterns, spot anomalies, formulate hypotheses, and check assumptions.
EDA has several key goals:
- Quickly summarize and describe the characteristics of a dataset. By visualizing data distributions and calculating descriptive statistics, we can get an overview of the salient properties of variables in the dataset.
- Check the quality of the data and identify any issues. Data visualization and summary statistics readily reveal missing values, errors, and outliers that may need mitigation before proceeding with analysis.
- Formulate hypotheses and derive insights by exploring interesting aspects of the data. Patterns may suggest causal hypotheses to test via statistical modeling. Outliers often contain useful domain insights.
- Understand relations between variables. Visualizations can uncover the nature of bivariate relationships – shape, direction, form, outliers, etc. This guides correlation and regression modeling choices.
- Test assumptions of the statistical models you intend to apply later. Histogram shapes indicate whether distributional assumptions (such as normality) are plausible, and scatterplots check linearity assumptions. Identifying where assumptions break down guides the choice of data transformations or alternative models.
In essence, EDA entails actively investigating what our data contain before formal modeling, to guide modeling choices, reveal issues needing resolution, and extract as much value as possible from the data.
The flexibility and lack of stringent assumptions make EDA invaluable for open-ended understanding.
Techniques
Exploratory Data Analysis (EDA) emphasizes flexibility and exploring different approaches to let key aspects of datasets emerge, rather than rigidly testing hypotheses from the start.
It is an iterative cycle where we analyze, visualize, and transform data to extract meaning.
EDA principles underlie “data science” and complement traditional statistical inference. Smart use of EDA provides a rich understanding of phenomena that can guide the construction of causal theories and models.
EDA involves trying various statistical and machine learning techniques to understand different facets of a dataset.
Rather than sticking to a predetermined analysis plan, the focus is on using diverse tools suited to the particular dataset.
Techniques could include clustering algorithms, decision trees, linear regression, ANOVA, etc., based on the data characteristics and research goals.
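As a minimal sketch of this "try several lenses" idea (the simulated variables x, y, and group below are assumptions invented for the example, not from any real dataset), the same data can be examined with both a simple regression and a one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = 2.0 * x + rng.normal(0, 0.5, 200)
group = rng.choice(["A", "B", "C"], size=200)

# One lens: simple linear regression to quantify the x-y relationship
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.2f}")

# Another lens: one-way ANOVA to compare y across the three groups
samples = [y[group == g] for g in ["A", "B", "C"]]
print(stats.f_oneway(*samples))
```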
Graphical techniques
EDA emphasizes using graphical techniques to reveal patterns, relationships, and anomalies within data.
Graphical techniques in EDA are not merely supplemental tools but are crucial for gaining a deeper understanding of the data, which quantitative summaries alone cannot achieve.
Graphical methods provide unparalleled power to explore data because they engage the analyst's natural pattern-recognition abilities.
Pictures allow our powerful visual perception to notice things that numerical summaries may miss.
This means creating graphs, charts, and plots to visually inspect data distributions, relationships between variables, outliers, and more.
Examples of graphical techniques include:
- Histograms: To visualize the distribution of a single variable, including its shape, center, spread, and any outliers or multiple modes (see the sketch after this list).
- Scatterplots: To visualize the relationship between two variables and assess their correlation.
- Box-and-Whisker Plots: To visualize the distribution of a single variable, identify outliers, and compare variables across different groups.
- Probability Plots: To assess whether a data set follows a particular theoretical distribution, such as the normal distribution.
- Residual Plots: To assess the validity of a fitted model by analyzing the patterns in the residuals.
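A minimal sketch of the first four plot types, using matplotlib and scipy on simulated data (the variables income, hours, and spend are assumptions made up for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=500)   # skewed "amount" variable
hours = rng.normal(40, 5, size=500)                    # roughly symmetric variable
spend = 0.1 * income + rng.normal(0, 500, size=500)    # variable related to income

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of a single variable
axes[0, 0].hist(income, bins=30)
axes[0, 0].set_title("Histogram of income")

# Scatterplot: relationship between two variables
axes[0, 1].scatter(income, spend, s=10)
axes[0, 1].set_title("Income vs. spending")

# Box-and-whisker plot: spread and outliers
axes[1, 0].boxplot([income, spend])
axes[1, 0].set_title("Boxplots of income and spending")

# Probability (Q-Q) plot: compare hours against a normal distribution
stats.probplot(hours, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Normal probability plot of hours")

plt.tight_layout()
plt.show()
```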
Summarizing
Summarizing means describing key statistics of a dataset to understand central tendency (mean, median), spread (variance, percentiles), shape (skewness), outliers, and so on.
These numerical summaries complement visual inspection.
Central Tendency: This refers to the “typical” or “middle” value of a dataset. Commonly used measures of central tendency include:
- Mean: The sum of all values divided by the number of values. It’s the most commonly reported measure of location but is sensitive to extreme values (outliers).
- Median: The middle value when data is arranged in order. It’s considered more robust than the mean, as it’s less affected by extreme values.
- Mode: The most frequent value in a dataset. It’s less commonly used than the mean and median, and in multimodal distributions (those with multiple peaks), it might not be a good representative of the central tendency.
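A quick numeric sketch (the toy values are invented for illustration) shows how the mean is pulled by a single extreme value while the median and mode are not:

```python
import statistics
import numpy as np

values = [2, 3, 3, 4, 5, 6, 100]  # one extreme value

print(np.mean(values))          # ~17.57: the mean is pulled up by the outlier
print(np.median(values))        # 4.0   : the median is unaffected
print(statistics.mode(values))  # 3     : the most frequent value
```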
Spread: Also known as variability or dispersion, it describes how spread out the data values are. Key measures of spread include:
- Variance: Measures the average squared deviation from the mean. A larger variance indicates greater spread.
- Standard Deviation: The square root of the variance. It’s expressed in the same units as the original data, making it more interpretable than variance. For approximately normally distributed data, about 95% of the values fall within 2 standard deviations of the mean.
- Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR is a robust measure of spread, less affected by outliers than the variance or standard deviation.
- Range: The difference between the maximum and minimum values. It’s sensitive to outliers and not as robust as other measures of spread.
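Continuing with the same invented values, the spread measures can be computed with numpy; note how the IQR barely reacts to the outlier while the variance and range are dominated by it:

```python
import numpy as np

values = np.array([2, 3, 3, 4, 5, 6, 100])

print(np.var(values, ddof=1))       # sample variance (~1323), inflated by the outlier
print(np.std(values, ddof=1))       # sample standard deviation (~36.4)

q1, q3 = np.percentile(values, [25, 75])
print(q3 - q1)                      # interquartile range: 2.5, robust to the outlier
print(values.max() - values.min())  # range: 98, dominated by the outlier
```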
Shape: Describes the symmetry and peakedness of a distribution. Two important measures of shape are:
- Skewness: Measures the asymmetry of a distribution. A positive skew indicates a longer tail on the right side, while a negative skew indicates a longer tail on the left.
- Kurtosis: Measures the heaviness of a distribution's tails and the sharpness of its peak relative to a normal distribution. Positive excess kurtosis (leptokurtic) suggests a sharper peak and heavier tails, while negative excess kurtosis (platykurtic) suggests a flatter peak and thinner tails.
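A short sketch of the shape measures, using scipy.stats on a simulated right-skewed sample (an assumption for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)  # right-skewed data

print(stats.skew(sample))      # positive: long right tail
print(stats.kurtosis(sample))  # excess kurtosis (0 for a normal distribution);
                               # positive here, indicating heavier tails
```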
Tabulation is a fundamental Exploratory Data Analysis (EDA) technique used to summarize data, particularly categorical data, by presenting the frequency of each category in a table format.
Tabulation helps to simplify complex datasets and identify dominant patterns. The insights gained from tabulation can then inform the selection of appropriate statistical methods for further analysis.
Here’s how tabulation summarizes data:
- Frequency Counts: Tabulation involves creating a table that lists each distinct category of a categorical variable and the number of times that category appears in the dataset. This provides a clear picture of the distribution of data across different categories. For instance, if you have data on student majors, a tabulation table would show each major (e.g., “Computer Science”, “Biology”) and the number of students enrolled in each.
- Relative Frequencies (Percentages): In addition to raw counts, tabulation often includes the percentage or proportion of each category relative to the total number of observations. This allows for easy comparison of the representation of different categories within the dataset.
- Basis for Cross-tabulation: Tabulation serves as a foundation for a more advanced technique known as cross-tabulation, which is used to analyze the relationship between two or more categorical variables. Cross-tabulation creates a two-way table, where rows and columns represent different categories of two variables, and the cell values represent the count or percentage of observations that fall into the intersection of those categories.
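A minimal tabulation sketch with pandas; the student-major data and column names are invented for the example:

```python
import pandas as pd

# Hypothetical student data for illustration
df = pd.DataFrame({
    "major":  ["CS", "Biology", "CS", "History", "Biology", "CS"],
    "degree": ["BS", "BS", "MS", "BA", "MS", "BS"],
})

# Frequency counts and relative frequencies (percentages) for one variable
counts = df["major"].value_counts()
percents = df["major"].value_counts(normalize=True) * 100
print(pd.DataFrame({"count": counts, "percent": percents.round(1)}))

# Cross-tabulation of two categorical variables
print(pd.crosstab(df["major"], df["degree"], margins=True))
```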
Data transformations
When data deviates from a normal distribution, many statistical techniques, which are grounded in the assumption of normality, may yield misleading or incorrect results.
Transformations provide a means to reshape the data, bringing it closer to a normal distribution and thus enhancing the applicability of these statistical techniques.
Data transformations involve using mathematical functions to modify the structure of datasets to uncover patterns more easily.
For instance, if a variable like income is highly skewed, applying a logarithm transformation can normalize its distribution, addressing potential issues with specific statistical tests.
Beyond improving normality, transformations in EDA can also enhance the clarity of data patterns, linearize trends, and stabilize variance.
They play a vital role in ensuring that the chosen statistical analysis aligns with the characteristics of the data, ultimately leading to more accurate and reliable conclusions.
Here are some benefits of re-expressing data:
- Improved symmetry and normality of data and residuals: A more symmetrical distribution often allows the mean to be a more accurate measure of central tendency. Additionally, many statistical tests, such as t-tests and linear models like regression and ANOVA, assume normality.
- More comparable variances among groups: When comparing multiple groups, ANOVA and ANCOVA models often assume that these groups have similar variances. Re-expressing the data can help achieve this.
- More linear relationships between variables: Simplifying the relationship between variables can make it easier to analyze and interpret the data.
- More stable variation around a regression line: This homoscedasticity is a common assumption in many statistical models.
- Enhanced suitability for additive models: Re-expression can improve the fit of an additive model, which relates a response to two or more factors. This can minimize the need for complex interaction terms.
Logarithms are recommended as a starting point for amounts, such as times and rates, which are nonnegative real values. For counts, which are nonnegative integers, square roots and logarithms are good initial transformations.
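A brief sketch of these re-expressions on simulated data (the income and visits variables are assumptions for the example), comparing skewness before and after transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.0, size=1000)  # skewed amounts
visits = rng.poisson(lam=3, size=1000)                 # counts

log_income = np.log(income)    # log re-expression for nonnegative amounts
sqrt_visits = np.sqrt(visits)  # square-root re-expression for counts

# Skewness before and after: values closer to 0 indicate a more symmetric distribution
print(stats.skew(income), stats.skew(log_income))
print(stats.skew(visits), stats.skew(sqrt_visits))
```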
When analyzing data with skewness, researchers often aim to normalize it. However, it’s essential to recognize that skewness can sometimes reflect a genuine nonlinear relationship between variables, not just a statistical anomaly.
For example, a quadratic relationship might exist between drug dosage and the number of symptoms, where a higher dosage initially reduces symptoms, but extremely high doses lead to their resurgence.
Outlier detection
Outlier detection means identifying anomalies that distort overall patterns, then either correcting erroneous values or analyzing the outliers in their own right, since they often reveal useful insights about the phenomenon under study.
Outlier identification is not about simply discarding inconvenient data points. It’s about critically examining the data, understanding the stories behind extreme values, and making informed decisions about how to handle them to ensure accurate and meaningful conclusions.
Why is outlier identification important?
- Data Integrity and Validity: Outliers might represent errors in data collection, coding, or experimental procedures. Identifying these outliers allows for investigation and potential correction or removal, ensuring the data accurately reflects the phenomenon under study.
- Model Accuracy and Interpretation: Outliers can disproportionately influence statistical models, leading to a poor fit with the majority of the data. This can result in misleading conclusions about relationships between variables. Removing or adjusting for outliers can improve model accuracy and lead to more reliable interpretations.
- Generalizability: Outliers can limit the generalizability of findings. Identifying outliers helps define the scope of the study and determine the population to which results can be reliably generalized.
Outliers can be identified using:
- Fence Methods: Often employed in box plots, fence methods define outlier boundaries using the interquartile range (IQR). Points beyond the whiskers, which extend 1.5 times the IQR from the hinges (roughly the 25th and 75th percentiles), are flagged as potential outliers (see the sketch after this list).
- Z-Scores: Measure how many standard deviations a data point is from the mean. Points with large absolute z-scores might be outliers.
- Studentized residuals: Used in regression analysis. Unlike raw residuals, which can be misleading due to varying leverage, studentized residuals are standardized and follow a t-distribution. This standardization facilitates the identification of underlying patterns and potential outliers.
- Cook's Distance: Also used in regression analysis, Cook's distance quantifies the impact of deleting a particular data point on the regression model's coefficients. Cases with large Cook's distances exert substantial influence on the model and warrant closer examination as potential outliers.
- Statistical Tests: Specialized tests such as Grubbs' test can formally test whether an extreme value is an outlier.
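A minimal sketch of the fence and z-score methods on simulated data with two planted outliers (the data and the cutoff of 3 standard deviations are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, size=200), [95.0, 4.0]])  # two planted outliers

# Fence method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
fence_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Z-scores: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std(ddof=1)
z_outliers = x[np.abs(z) > 3]

print(fence_outliers)
print(z_outliers)
```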
Choosing the Right Tool: Matching Techniques to Data and Objectives
- Objective: Getting an idea of the distribution of a variable
- Technique: Histogram
Histograms are a powerful tool for understanding the distribution of a variable. They provide insights into central tendency, spread, modality, and outliers.
By displaying the frequency or proportion of cases within specified ranges (bins), histograms offer a visual representation of the data distribution.
The shape of the histogram can reveal whether the data is symmetrical, skewed, unimodal, bimodal, or has other distinct patterns.
Trying different bin sizes can be helpful to reveal more or less detail in the shape of the distribution.
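A short sketch of the effect of bin size, using a simulated bimodal sample (an assumption for the example); with too few bins the two peaks can disappear:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Bimodal sample: coarse binning can hide the two peaks
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 20, 60]):
    ax.hist(x, bins=bins)
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()
```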
Descriptive vs. Explorative Analysis
Descriptive analysis focuses on summarizing what the data shows on the surface. Exploratory analysis digs deeper to uncover subtle patterns and non-obvious trends in the data.
Descriptive analysis might report a dataset's mean, median, and standard deviation. Exploratory analysis uses visualizations, transformations, and a variety of techniques to interrogate the data and model relationships between variables beyond summary statistics.
So descriptive analysis describes what the data shows, while exploratory analysis explores nuances in the data to extract deeper meaning.
But good data analysis uses both: summary statistics complement the graphs and visuals that reveal relationships.
Descriptive Analysis
- Summarizes and presents the data without making inferences or models
- Uses simple graphics like histograms, bar charts, summary statistics
- Goal is to describe patterns in the data
Exploratory Analysis
- Makes tentative inferences about patterns, relationships, and effects in the data
- Relies heavily on graphics and visualization
- Transforms/manipulates the data to extract meaning
- Iterative cycle to understand the data
- Goal is to extract deeper insights from the data