Enhancing Correlation Matrix Heatmap Plots with P-values in Python

Tosin Harold Akingbemisilu
11 min readFeb 10, 2024

--

Correlation matrix plots are a valuable tool in data analysis, helping us visualize the relationships between variables in a dataset. They allow us to identify patterns and dependencies among different features, which can be crucial for making data-driven decisions. However, a correlation value alone doesn’t always tell the whole story. That’s where p-values come in. After trying to look for solutions since the annotation on the plot only includes the coefficients and I could not get anything tangible online on the inclusion of the p-value, I put together the code and decided to put this post together and the code to do this after getting some headway, hoping it saves someone the trouble.

Level up and master tech skills, fast-track projects, and innovate smarter with Pluralsight. Start your free trial now and save up to 50%!

Why is the Inclusion of P-value in a Correlation Plot Important?

P-values provide a measure of the statistical significance of the observed correlations, helping us distinguish between meaningful relationships and random fluctuations. Including p-values in a correlation matrix plot adds valuable information to any data analysis.

  1. Statistical Significance: P-values indicate whether the observed correlations are statistically significant. A low p-value (typically < 0.05) suggests a strong and meaningful relationship between variables, while a high p-value implies a weak or random correlation.
  2. Confidence in Interpretation: By knowing the statistical significance of correlations, you can have more confidence in your interpretations. You can distinguish between strong, meaningful correlations and those that might be spurious.
  3. Hypothesis Testing: P-values are essential for hypothesis testing. They help you decide whether to accept or reject the null hypothesis that there is no correlation between variables.
  4. Identifying Meaningful Patterns: In complex datasets, including p-values can help you focus on the most relevant relationships, filtering out noise and random associations.

Understanding The Code Snippet

Let’s break down the provided Python code and understand each step.

Import the relevant libraries:

Achieving our objective relies on several Python libraries, each serving a specific purpose in the process of creating a correlation matrix plot with p-values.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import textwrap
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau

Here’s an explanation of the relevant libraries and how they are applied to achieving our aim:

  1. NumPy: NumPy is used for numerical computations and array operations, helping with tasks like creating masks and working with arrays.
  2. Pandas: Pandas is essential for data manipulation, providing DataFrames and Series to handle structured data, select columns, and store correlations and p-values.
  3. Seaborn: Seaborn is used for data visualization and is responsible for creating the main correlation matrix heatmap with annotations and customized styling.
  4. Matplotlib: Matplotlib is used to create the figure and axis for the heatmap, even though Seaborn is the primary library for visualization.
  5. textwrap: The textwrap module is used to wrap long axis labels for improved readability in the heatmap.
  6. SciPy Stats Functions: SciPy provides statistical functions like pearsonr, spearmanr, and kendalltau that are crucial for calculating correlation coefficients and their associated p-values in the code. This is important depending on the method you use in your pandas `corr()` function. I will explain how this is applied later.

Data Preparation and Correlation Calculation

For the sake of this post, I will be using a dataframe from kaggle and you can find it here. We will be selecting a subset of columns.

df = pd.read_csv('/content/SolarPrediction.csv', usecols = ['Radiation', 'Temperature', 'Humidity', 'WindDirection(Degrees)', 'Speed'])

Next, we calculate the correlation matrix using the pandas corr() function as it computes pairwise correlation of columns, excluding NA/null values. Here we will use Kendall’s rank correlation coefficient method. I am doing this so I can explain further why this method is relevant when you are calculating the p-value later. There are 3 methods when calling the corr() function, and they are; ‘pearson’, ‘spearman’ and ‘kendall’. We will not go into details on the differences between these three, but you can get more information online. However, if you do not include method when calling the corr() function, it uses ‘pearson’ as default. You can get more information on the corr() function here.

corr = df_agroecological.corr(method='kendall')

Creating a Mask for a Diagonal Matrix

Personally, I prefer a diagonal matrix since the upper triangle contains duplicate information, so I often remove it for clarity, which I believe is the best practice anyway, as I have always wondered why anyone would want to leave it. A mask is therefore created to hide the upper triangle of the correlation matrix, as shown below, and we will use this later when creating the heatmap.

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
np.fill_diagonal(mask, False)

Creating the Correlation Heatmap

Now, we generate a heatmap using the Seaborn library, with the custom mask applied. This heatmap visually represents the correlations between variables.

fig, ax = plt.subplots(figsize=(10,5))
plt.title(title, fontsize=14)

# Generate the heatmap including the mask
heatmap = sns.heatmap(corr,
annot=True,
annot_kws={"fontsize": 10},
fmt='.2f',
linewidths=0.5,
cmap='RdBu',
mask=mask, # the mask has been included here
ax=ax)

# Display our plot
plt.show()

In the code above, ‘annot’ is set to True, which is what would include the correlation coefficient values on the heatmap. Below is what our plot looks like without the p-values at the moment.

Image by Author

It kind of looks good, but like I said earlier, for the sake of making inferences, this is not enough, and this is where the p-value will help us understand if these relationships are statistically significant or not. Let’s delve into the next part, where we include the p-values.

Level up and master tech skills, fast-track projects, and innovate smarter with Pluralsight. Start your free trial now and save up to 50%!

Calculating and Formatting P-values

The code below calculates and formats p-values for pairs of variables in our dataset. Here we will use the chosen correlation method, ‘Kendall’s tau’ based on the method we have selected earlier. If you have used a different method when calling the corr() function earlier, you can change kendalltau to pearsonr for pearson correlation and spearmanr for spearman correlation. If you did not include the method when calling the corr() function, you should replace kendalltau to pearsonr in the code below, since corr() uses pearson by default. We also organised the resulting p-values in a DataFrame that corresponds to the original correlation matrix, providing a statistical measure of the significance of the observed correlations between variables.

# Generate the correlation matrix afresh
corr = df.corr(numeric_only=True)

# mask the correlation matrix to diagonal
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
np.fill_diagonal(mask, False)

fix,ax = plt.subplots(figsize=(10,5))
plt.title("Correlation map with P-value", fontsize=14)

# Generate heatmap
heatmap = sns.heatmap(corr,
annot= True,
annot_kws={"fontsize": 10},
fmt='.2f',
linewidths=0.5,
cmap='RdBu',
mask=mask,
ax=ax)

# calculate and format p-values
p_values = np.full((corr.shape[0], corr.shape[1]), np.nan)
for i in range(corr.shape[0]):
for j in range(i+1, corr.shape[1]):
x = df.iloc[:, i]
y = df.iloc[:, j]
mask = ~np.logical_or(np.isnan(x), np.isnan(y))
if np.sum(mask) > 0:
p_values[i, j] = kendalltau(x[mask], y[mask])[1] #change to pearsonr or spearmanr

# Create a dataframe object for p_values
p_values = pd.DataFrame(p_values, columns=corr.columns, index=corr.index)

# Mask the p values
mask_pvalues = np.triu(np.ones_like(p_values), k=1)

# Generate maximum and minimum correlation coefficients for p-value annotation color
max_corr = np.max(corr.max())
min_corr = np.min(corr.min())

# Assign p-value annotations, include asterisks for significance
for i in range (p_values.shape[0]):
for j in range(p_values.shape[1]):
if mask_pvalues[i, j]:
p_value = p_values.iloc[i, j]
if not np.isnan(p_value):
correlation_value = corr.iloc[i, j]
text_color = 'white' if correlation_value >= (max_corr - 0.4) or correlation_value <= (min_corr + 0.4) else 'black'
if p_value <= 0.01:
#include double asterisks for p-value <= 0.01
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})**',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)
elif p_value <= 0.05:
#include single asterisk for p-value <= 0.05
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})*',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)
else:
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)

# Customize x-axis labels
x_labels = [textwrap.fill(label.get_text(), 13) for label in ax.get_xticklabels()]
ax.set_xticklabels(x_labels, rotation=0, ha="center")

# Customize y-axis labels
y_labels = [textwrap.fill(label.get_text(), 13) for label in ax.get_yticklabels()]
ax.set_yticklabels(y_labels, rotation=0, ha="right")

# Display the plot
plt.show()

Okay, I think it is fair if I explain the code above. In the code:

  1. We first initialize an empty array for P-values by creating a NumPy array named p_values with the same shape as the correlation matrix (corr). Initially, all elements in this array are set to NaN (Not-a-Number).
  2. We then went ahead to iterate through the rows and columns of the correlation matrix (corr) using nested loops, excluding the diagonal elements. Also, for each pair of variables (indexed by i and j), it extracts the corresponding columns x and y from the df DataFrame.
  3. We then created a mask to identify valid data points by checking for non-NaN values in both x and y. This helps handle missing data.
  4. If there are valid data points (i.e., the sum of the mask is greater than 0), it calculates the p-value using Kendall’s tau rank correlation coefficient (kendalltau) for the selected variables x and y. The p-value is extracted from the result and stored in the p_values array at the corresponding position (p_values[i, j]).
  5. After completing the nested loops, the code converts the p_values NumPy array into a Pandas DataFrame. The DataFrame is given column and index labels taken from the original correlation matrix (corr), making it easier to work with and visualize.
  6. We then created a mask for displaying only the lower triangle of the p-values DataFrame. This mask is used to annotate the correlation heatmap with p-values.
  7. In the code, we calculated and stored the highest and lowest correlation coefficients in the correlation matrix and then annotated the lower triangle of the heatmap with p-values. The text colour of the p-value annotations is adjusted based on the magnitude of the correlation value, making it easier to identify significant correlations in the plot.
  8. We then iterated through the elements of the p_values DataFrame, which stores the calculated p-values for each pair of variables.
    For each pair of variables (indexed by i and j), it checks whether the corresponding entry in the mask_pvalues array is True. This mask ensures that p-values are only displayed in the lower triangle of the heatmap (to avoid duplication).
  9. If the mask condition is met and the p-value is not a NaN (indicating a valid p-value), it calculates the correlation coefficient value for the same pair of variables from the corr DataFrame and assigns it to correlation_value. It determines the text colour for the annotation based on the magnitude of the correlation value. If the correlation is very high (greater than or equal to max_corr — 0.4) or very low (less than or equal to min_corr + 0.4), the text colour is set to white; otherwise, it’s set to black. We also annotated the heatmap at the position (i + 0.5, j + 0.8) with the p-value in the format (p = {p_value:.2f}) to ensure it is rightly below the coefficient value. The annotation is centered, and the ‘fontsize’ is set to 8. The text colour is determined based on the correlation value. edited: Included asterisks in the code to show significance (double asterisks for p-value ≤0.01 and single for p-value ≤ 0.05)
  10. One last thing, which may not be necessary in your case, is that since we have pretty long names, we tried to customize them by wrapping them. We did this for both the x and y axis. For the x-axis we also made the rotation 0 since we are wrapping anyway. I personally do not like tilting my head to read things 😊. So, let’s do this.

Level up and master tech skills, fast-track projects, and innovate smarter with Pluralsight. Start your free trial now and save up to 50%!

Now. Let’s see what our new correlation plot looks like.

Image by Author

Voila! With a few lines of code (actually not so few), we have been able to achieve our objective, which looks better to me. Now, it is easier to make some inferences and know if the relationship is significant based on a p-value < 0.05 in a pairwise fashion.

Nonetheless, considering this is a long line of code, you can go ahead and wrap it in a function, and I have automated some processes for the method as well. This way, you just run a line of code, and you have your correlation matrix plot generated for you. Below is the function:

def corr_heatmap_with_pval(df, method = 'pearson', figsize=(20, 10), title=None):
"""
df: dataframe to be used. Ensured the dataframe has been sliced to contain only the column you need. It accepts only numerical columns
method: default uses the pearson method. It overall permits 3 methods; 'pearson', 'spearman' and 'kendall'
figsize: default is (20, 10) but you can change it based on your preference
title: Specify the title for your chart, default is None
"""
# Make a copy of the df
data = df.copy()
# Check features correlation
corr = data.corr(method = method)

# Create a mask to hide the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set the diagonal elements of the mask to False to display self-correlation
np.fill_diagonal(mask, False)

fig, ax = plt.subplots(figsize=figsize)
plt.title(title, fontsize=14)

# Create the heatmap with the custom mask
heatmap = sns.heatmap(corr,
annot=True,
annot_kws={"fontsize": 10}, # Adjust annotation font size
fmt='.2f',
linewidths=0.5,
cmap='RdBu',
mask=mask,
ax=ax)

# Create a function to calculate and format p-values
p_values = np.full((corr.shape[0], corr.shape[1]), np.nan)
for i in range(corr.shape[0]):
for j in range(i+1, corr.shape[1]):
x = data.iloc[:, i]
y = data.iloc[:, j]
mask = ~np.logical_or(np.isnan(x), np.isnan(y))
if np.sum(mask) > 0:
if method == 'pearson':
p_values[i, j] = pearsonr(x[mask], y[mask])[1] #Changes based on the method chosen in the function
elif method == 'kendall':
p_values[i, j] = kendalltau(x[mask], y[mask])[1]
elif method == 'spearman':
p_values[i, j] = spearmanr(x[mask], y[mask])[1]

p_values = pd.DataFrame(p_values, columns=corr.columns, index=corr.index)

# Create a mask for the p-values heatmap
mask_pvalues = np.triu(np.ones_like(p_values), k=1)

# Calculate the highest and lowest correlation coefficients
max_corr = np.max(corr.max())
min_corr = np.min(corr.min())

# Annotate the heatmap with p-values and change text color based on correlation value
for i in range(p_values.shape[0]):
for j in range(p_values.shape[1]):
if mask_pvalues[i, j]:
p_value = p_values.iloc[i, j]
if not np.isnan(p_value):
correlation_value = corr.iloc[i, j]
text_color = 'white' if correlation_value >= (max_corr - 0.4) or correlation_value <= (min_corr + 0.4) else 'black'
if p_value <= 0.01:
#include double asterisks for p-value <= 0.01
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})**',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)
elif p_value <= 0.05:
#include single asterisks for p-value <= 0.05
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})*',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)
else:
ax.text(i + 0.5, j + 0.8, f'(p = {p_value:.2f})',
horizontalalignment='center',
verticalalignment='center',
fontsize=8,
color=text_color)

# Customize x-axis labels
x_labels = [textwrap.fill(label.get_text(), 12) for label in ax.get_xticklabels()]
ax.set_xticklabels(x_labels, rotation=0, ha="center")

# Customize y-axis labels
y_labels = [textwrap.fill(label.get_text(), 15) for label in ax.get_yticklabels()]
ax.set_yticklabels(y_labels, rotation=0, ha="right")
ax.grid(False)
plt.show()

With that, you can just easily run one single line of code to generate your plot like this:

# Plot correlation matrix with pvalue in one line of code using kendall method
corr_matrix_with_pval(df, method = 'kendall', figsize=(10,5), title="Correlation Matrix with P-Value using Function and Kendall Method")

It would be nice to have your feedback and to know if this helped you in any way. I wish you the best of luck as you keep on with your data exploration.

This is also available on my YouTube channel, so you can watch and code along with me.

You can also find the code repository here on GitHub.

Let’s connect via GitHub | LinkedIn | Twitter | YouTube

MY RECOMMENDED SITES TO LEVEL UP YOUR DATA SCIENCE SKILLS — REGISTER AND START LEARNING TODAY

Pluralsight

DataCamp

--

--

Tosin Harold Akingbemisilu
Tosin Harold Akingbemisilu

Written by Tosin Harold Akingbemisilu

Feasting on data | Super Curious | Writing stories on life, work and the little fun I manage to have | All contents are mine… www.tosinharold.com