Simpson's Paradox

A statistical phenomenon in which a trend that appears within groups of data reverses or disappears when the groups are combined.

The figure below shows generated data in which the relationship is positive within each group, yet the trend reverses when the groups are pooled. This is Simpson's paradox: the trend in the whole dataset differs from the trend in the same data split into subgroups.

Figure 1: Simpson's paradox in the generated data.

Source: Towards Data Science
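
The pattern described for Figure 1 can be reproduced with a small sketch. This is not the code behind the original figure: the group offsets, sample sizes, and noise level are assumptions chosen so that each group trends upward while the group centres fall along a downward line. The helper `slope` is a hypothetical name for an ordinary least-squares slope.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed (x offset, y offset) per group: each later group sits further
# right and lower, so the group centres fall on a downward line.
group_offsets = [(0.0, 20.0), (5.0, 10.0), (10.0, 0.0)]

groups = []
for x0, y0 in group_offsets:
    xs = [x0 + random.uniform(0.0, 4.0) for _ in range(50)]
    # Within a group, y rises with x (slope 1) plus a little noise.
    ys = [y0 + (x - x0) + random.gauss(0.0, 0.3) for x in xs]
    groups.append((xs, ys))

group_slopes = [slope(xs, ys) for xs, ys in groups]
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
pooled_slope = slope(all_x, all_y)

print("within-group slopes:", [round(s, 2) for s in group_slopes])
print("pooled slope:", round(pooled_slope, 2))
```

Each within-group slope comes out positive while the pooled slope is negative, because the differences between the groups dominate the trend once the group labels are ignored.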

UC Berkeley Dataset

Graduate admissions at UC Berkeley in 1973 are one of the most commonly cited real-world examples of Simpson's paradox. The dataset covers applicants to UC Berkeley for the Fall semester of 1973. Table 1 gives an overview of the data's variables.

| Dataset Variable | Explanation |
| --- | --- |
| Year | The application year (always 1973 in this data). |
| Major | An anonymized major code (either A, B, C, D, E, F, or Other). The specific majors are unknown, except that A-F are the six majors with the most applicants in Fall 1973. |
| Gender | Applicant's self-reported gender (either M or F). |
| Admission | Admission decision (either Rejected or Accepted). |

Table 1: Description of dataset variables.

To demonstrate the effect of Simpson's paradox, we will analyze the data with a bar graph for a clearer picture. First, we look at the overall statistics of the dataset, summarized in Table 2. Taken altogether, male applicants were admitted to UC Berkeley at a higher rate than female applicants.

| | Male: # Applicants | Male: % Admitted | Female: # Applicants | Female: % Admitted |
| --- | --- | --- | --- | --- |
| Total | 2,890 | 44% | 1,835 | 34.5% |

Table 2: Overall statistics of graduate admission at UC Berkeley in 1973.

On the other hand, if we look at the departments individually, women were more likely than men to be admitted in several of them, as seen in Figure 2. One reason is that acceptance rates differ between departments. Each department is harder or easier to get into, and women tended to apply to the more competitive departments, where acceptance rates are low for everyone. For instance, in department A, more than 80 percent of female applicants were admitted, while only 60 percent of male applicants were admitted to the same department.

Figure 2: Numbers of male & female applicants admitted to UC Berkeley.
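
The mechanism described above (a higher admission rate for women in each department, but a lower rate overall) can be illustrated with a toy calculation. The counts below are hypothetical and are not the Berkeley figures; the department names `easy` and `hard` and the helper functions are assumptions made for this sketch. It also rebuilds the pooled rate as a weighted average of department rates, which shows where the reversal comes from.

```python
# Hypothetical counts for illustration only; NOT the Berkeley figures.
# department -> gender -> (applicants, admitted)
counts = {
    "easy": {"M": (80, 48), "F": (20, 16)},
    "hard": {"M": (20, 2), "F": (80, 16)},
}

def dept_rate(dept, gender):
    """Admission rate for one gender within one department."""
    applicants, admitted = counts[dept][gender]
    return admitted / applicants

def pooled_rate(gender):
    """Admission rate for one gender with all departments combined."""
    applicants = sum(counts[d][gender][0] for d in counts)
    admitted = sum(counts[d][gender][1] for d in counts)
    return admitted / applicants

def weighted_rate(gender):
    """Pooled rate rebuilt as sum of (share of applications) x (department rate)."""
    total = sum(counts[d][gender][0] for d in counts)
    return sum(
        counts[d][gender][0] / total * dept_rate(d, gender) for d in counts
    )

for dept in counts:
    print(dept, "M:", dept_rate(dept, "M"), "F:", dept_rate(dept, "F"))
print("pooled", "M:", pooled_rate("M"), "F:", pooled_rate("F"))
```

In this toy data women have the higher rate in both departments, but 80 percent of their applications go to the hard department, so its low rate carries most of the weight in their pooled figure.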

How can Simpson's Paradox affect Data Interpretation?

Simpson's paradox can lead to a wrong interpretation of the data, as in our example, where the overall figures alone raised concerns about gender bias in the university's admissions. Both the overall and the disaggregated views can be valid; which one to use depends on the question being asked. In other words, we need to know what we are looking for in the data so that we can choose a fair interpretation. Moreover, it is worthwhile to double-check whether a hidden variable exists that might affect the data, such as

  1. confounding variables,
  2. lurking variables, or
  3. spurious correlation in the dataset.
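
One way to double-check for such a hidden variable is to split the data by the candidate variable and see whether the comparison changes direction. The sketch below assumes row-level records using the Major, Gender, and Admission fields from Table 1. The rows are hypothetical, not the real 1973 records, and the function names are made up for this example.

```python
from collections import defaultdict

def admission_rates(rows, keys):
    """Share of rows with Admission == 'Accepted', grouped by the given columns."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        total[group] += 1
        accepted[group] += row["Admission"] == "Accepted"
    return {g: accepted[g] / total[g] for g in total}

def majors_disagreeing_with_pooled(rows):
    """Majors where the F-vs-M gap points the opposite way to the pooled gap."""
    pooled = admission_rates(rows, ["Gender"])
    pooled_gap = pooled[("F",)] - pooled[("M",)]
    by_major = admission_rates(rows, ["Major", "Gender"])
    flipped = []
    for major in sorted({row["Major"] for row in rows}):
        if (major, "F") in by_major and (major, "M") in by_major:
            gap = by_major[(major, "F")] - by_major[(major, "M")]
            if gap * pooled_gap < 0:
                flipped.append(major)
    return flipped

def make_rows(major, gender, applicants, admitted):
    """Expand hypothetical counts into one row per applicant."""
    return [
        {"Major": major, "Gender": gender,
         "Admission": "Accepted" if i < admitted else "Rejected"}
        for i in range(applicants)
    ]

# Hypothetical rows for illustration only, not the real 1973 records.
rows = (
    make_rows("A", "M", 80, 48) + make_rows("A", "F", 20, 16)
    + make_rows("B", "M", 20, 2) + make_rows("B", "F", 80, 16)
)

print(majors_disagreeing_with_pooled(rows))
```

If the returned list is non-empty, the grouping variable is worth a closer look before drawing conclusions from the pooled numbers. A check like this only flags a disagreement; it does not say which view answers the question at hand.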

