Seaborn Basics (Python)


Generate a volcano plot using Python!

Introduction

This document demonstrates how to generate a volcano plot in Python from a CSV file of gene expression data. The dataset must include at least the following three columns:

  • log2FC (Log2 Fold Change)
  • p_value (P-value for statistical significance)
  • Gene_symbol (or Gene EntrezID or Gene ENSEMBL ID)

Each step is explained in detail, with code chunks for clarity.
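
Before going further, it can help to confirm that a loaded table really has the three mandatory columns. The following is a minimal sketch, not part of the original workflow: the helper name find_missing_columns and the one-row toy frame are hypothetical, and the gene identifier column is assumed to be called Gene_symbol.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original post):
# report which mandatory columns are absent from a DataFrame.
REQUIRED_COLUMNS = ("Gene_symbol", "log2FC", "p_value")


def find_missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are not present in df."""
    return [col for col in required if col not in df.columns]


# Toy frame for illustration only; it deliberately lacks p_value.
example = pd.DataFrame({"Gene_symbol": ["GeneA"], "log2FC": [1.5]})
missing = find_missing_columns(example)
print(missing)  # ['p_value']
```

An empty list means the table is ready for the steps that follow.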

Installing Required Libraries

import subprocess
import sys

required_packages = ['numpy', 'pandas', 'matplotlib', 'seaborn']

# Check for missing packages
for pkg in required_packages:
    try:
        globals()[pkg] = __import__(pkg)
        print(f"{pkg} version: {globals()[pkg].__version__}")
    except ImportError:
        print(f"Error: {pkg} is not installed.")
        print(f"Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
        globals()[pkg] = __import__(pkg)  # Import the package after installation
        print(f"{pkg} has been installed.")
        print(f"{pkg} version: {globals()[pkg].__version__}")

Import libraries:

import os
import numpy as np  # needed later for np.log10 and np.select
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Setting Up Project Directories

project_dir = "/Users/debojyoti/Projects/seaborn_basics"
data_dir = os.path.join(project_dir, "input_data")
result_dir = os.path.join(project_dir, "results")

os.makedirs(result_dir, exist_ok=True)

Reading the Input Data Using pandas

input_file = os.path.join(data_dir, "test_input_file.csv")
data = pd.read_csv(input_file)
print(data.head())

#    Gene_symbol     log2FC  neg_log10pval  log2FC_sq   p_value
# 0        Gene7   2.267283       1.725017   5.140572  0.018836
# 1        Gene9   3.027636       3.804565   9.166577  0.000157
# 2       Gene11   1.957304       0.261662   3.831041  0.547442
# 3       Gene12   3.429968       3.787973  11.764681  0.000163
# 4       Gene13  -2.083291       3.899395   4.340102  0.000126
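
If the input file is not at hand, a synthetic table with the same three mandatory columns is enough to follow the remaining steps. This is only a sketch under stated assumptions: the gene names, the normal and uniform distributions, the seed, and the name toy are all made up for illustration and do not reproduce the values shown above.

```python
import numpy as np
import pandas as pd

# Fixed seed so the toy data is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(seed=0)
n_genes = 200

# Synthetic stand-in for the CSV contents; the values are random, not real data.
toy = pd.DataFrame({
    "Gene_symbol": [f"Gene{i}" for i in range(1, n_genes + 1)],
    "log2FC": rng.normal(loc=0.0, scale=2.0, size=n_genes),
    "p_value": rng.uniform(low=1e-6, high=1.0, size=n_genes),
})

print(toy.head())
```

Assigning toy to data would let the later code run unchanged, since it only relies on these three columns.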

Transforming Data for Visualization

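# Significance thresholds
pval_cutoff = 0.05
log2fc_cutoff = 1

# y-axis: -log10(p-value); x-axis: negated log2 fold change
data["logP"] = -np.log10(data["p_value"])
data["negLog2FC"] = -data["log2FC"]

# Classification uses the negated fold change, so genes with
# log2FC < -log2fc_cutoff are labeled "Upregulated" and genes with
# log2FC > log2fc_cutoff are labeled "Downregulated".
conditions = [
    (data["p_value"] < pval_cutoff) & (data["negLog2FC"] > log2fc_cutoff),
    (data["p_value"] < pval_cutoff) & (data["negLog2FC"] < -log2fc_cutoff),
]
choices = ["Upregulated", "Downregulated"]
data["regulation"] = np.select(conditions, choices, default="Non-significant")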

data.head()

#    Gene_symbol     log2FC  neg_log10pval  log2FC_sq   p_value      logP  negLog2FC       regulation
# 0        Gene7   2.267283       1.725017   5.140572  0.018836  1.725017  -2.267283    Downregulated
# 1        Gene9   3.027636       3.804565   9.166577  0.000157  3.804565  -3.027636    Downregulated
# 2       Gene11   1.957304       0.261662   3.831041  0.547442  0.261662  -1.957304  Non-significant
# 3       Gene12   3.429968       3.787973  11.764681  0.000163  3.787973  -3.429968    Downregulated
# 4       Gene13  -2.083291       3.899395   4.340102  0.000126  3.899395   2.083291      Upregulated
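
To see how the two conditions and np.select interact, the same rule can be applied to a handful of hand-picked genes. This is an illustrative sketch: the four genes and their values are invented so that each branch of the classification is exercised once.

```python
import numpy as np
import pandas as pd

pval_cutoff = 0.05
log2fc_cutoff = 1

# Four made-up genes, chosen to hit every branch of the classification.
toy = pd.DataFrame({
    "Gene_symbol": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "log2FC": [2.5, -2.5, 2.5, 0.3],
    "p_value": [0.001, 0.001, 0.5, 0.001],
})
toy["negLog2FC"] = -toy["log2FC"]

# Same rule as in the main code: the first matching condition wins,
# and rows matching neither condition receive the default label.
conditions = [
    (toy["p_value"] < pval_cutoff) & (toy["negLog2FC"] > log2fc_cutoff),
    (toy["p_value"] < pval_cutoff) & (toy["negLog2FC"] < -log2fc_cutoff),
]
choices = ["Upregulated", "Downregulated"]
toy["regulation"] = np.select(conditions, choices, default="Non-significant")

print(toy[["Gene_symbol", "log2FC", "p_value", "regulation"]])
```

GeneA (log2FC 2.5) comes out as Downregulated and GeneB (log2FC -2.5) as Upregulated, because the test is applied to the negated fold change. GeneC fails the p-value cutoff and GeneD fails the fold-change cutoff, so both are Non-significant.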

Selecting Top Genes for Labeling

top_n = 5
top_up = data[data["regulation"] == "Upregulated"].nsmallest(top_n, "log2FC")
top_down = data[data["regulation"] == "Downregulated"].nlargest(top_n, "log2FC")
top_genes = pd.concat([top_up, top_down])

# Widen the display to 120 characters so all columns print on one line
pd.set_option('display.width', 120)

top_genes.head()
#       Gene_symbol     log2FC  neg_log10pval  log2FC_sq   p_value      logP  negLog2FC   regulation
# 2440     Gene6948  -6.064915       2.122147  36.783188  0.007548  2.122147   6.064915  Upregulated
# 1724     Gene4951  -4.996103       4.636459  24.961045  0.000023  4.636459   4.996103  Upregulated
#  900     Gene2644  -4.799338       2.081537  23.033642  0.008288  2.081537   4.799338  Upregulated
# 1002     Gene2924  -4.771924       3.505065  22.771257  0.000313  3.505065   4.771924  Upregulated
# 1326     Gene3821  -4.708490       2.552914  22.169875  0.002800  2.552914   4.708490  Upregulated
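
Note that head() returns only the first five rows, so with top_n = 5 the preview above shows the upregulated group and none of the downregulated genes that follow it. The selection logic itself can be checked on a small made-up table; the six genes below are assumptions used purely for illustration.

```python
import pandas as pd

# Toy table with three genes per group (values invented for illustration).
toy = pd.DataFrame({
    "Gene_symbol": ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"],
    "log2FC": [-3.0, -1.5, -2.0, 4.0, 1.2, 2.2],
    "regulation": ["Upregulated"] * 3 + ["Downregulated"] * 3,
})

top_n = 2
# Most negative log2FC among "Upregulated", most positive among "Downregulated".
top_up = toy[toy["regulation"] == "Upregulated"].nsmallest(top_n, "log2FC")
top_down = toy[toy["regulation"] == "Downregulated"].nlargest(top_n, "log2FC")
top_genes = pd.concat([top_up, top_down])

# Print the whole frame (not just head()) to see both groups.
print(top_genes["Gene_symbol"].tolist())  # ['GeneA', 'GeneC', 'GeneD', 'GeneF']
```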

Creating the Volcano Plot

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=data, x="negLog2FC", y="logP", hue="regulation",
    palette={"Upregulated": "red", "Downregulated": "blue", "Non-significant": "black"},
    alpha=0.7
)

for _, row in top_genes.iterrows():
    plt.text(row["negLog2FC"], row["logP"], row["Gene_symbol"], fontsize=8, ha='right')

plt.title("Volcano Plot")
plt.xlabel("-Log2 Fold Change")
plt.ylabel("-Log10 P-value")
plt.ylim(0, 6)
plt.legend(title="regulation")
plt.tight_layout()
plt.show()
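
Volcano plots often mark the significance thresholds with dashed guide lines. The original plot does not draw them, so the following is an optional sketch: it uses matplotlib's axhline and axvline on a few made-up points, with the same cutoff values as above. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

pval_cutoff = 0.05
log2fc_cutoff = 1

fig, ax = plt.subplots(figsize=(8, 6))
# Made-up points standing in for data["negLog2FC"] and data["logP"].
ax.scatter([-2.5, 0.2, 3.0], [3.0, 0.5, 4.0], alpha=0.7)

# Horizontal line at the p-value cutoff, expressed on the -log10 scale.
p_line = ax.axhline(-np.log10(pval_cutoff), color="grey", linestyle="--", linewidth=1)
# Vertical lines at the negative and positive fold-change cutoffs.
fc_lines = [
    ax.axvline(x, color="grey", linestyle="--", linewidth=1)
    for x in (-log2fc_cutoff, log2fc_cutoff)
]

ax.set_xlabel("-Log2 Fold Change")
ax.set_ylabel("-Log10 P-value")
```

In the main plot the same calls could be made through plt.axhline and plt.axvline before plt.show().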

The legend position is controlled by the loc argument of plt.legend(). The available locations inside the plot are:

  • 'best': automatically chooses the best location (default)
  • 'upper left': legend in the upper-left corner
  • 'upper right': legend in the upper-right corner
  • 'lower left': legend in the lower-left corner
  • 'lower right': legend in the lower-right corner
  • 'center left': legend in the center of the left side
  • 'center right': legend in the center of the right side
  • 'lower center': legend in the center of the bottom side
  • 'upper center': legend in the center of the top side
  • 'center': legend in the center of the plot
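
The legend can also be placed outside the plotting area by combining loc with bbox_to_anchor. The sketch below is an illustration with made-up points; the anchor coordinates (1.02, 1) are an arbitrary choice, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
# Made-up points, one labeled series per class.
ax.scatter([1, 2], [1, 2], color="red", label="Upregulated")
ax.scatter([-1, -2], [1, 2], color="blue", label="Downregulated")

# With bbox_to_anchor, loc names the corner of the legend box that is pinned
# to the anchor point (in axes coordinates). (1.02, 1) lies just to the right
# of the axes' top-right corner, so the legend sits outside the plot.
legend = ax.legend(title="regulation", loc="upper left", bbox_to_anchor=(1.02, 1))
fig.tight_layout()
```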

Saving the Plot

output_file = os.path.join(result_dir, "volcano_plot.png")
# Run this in the same cell as the plotting code, before plt.show():
# once the figure has been shown, plt.savefig() writes an empty figure.
plt.savefig(output_file)
print(f"Plot saved to: {output_file}")
# Plot saved to: /Users/debojyoti/Projects/seaborn_basics/results/volcano_plot.png
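
One way to avoid saving an empty figure is to keep a reference to the Figure object and save through it. The following sketch assumes made-up points and a temporary directory in place of result_dir; the dpi and bbox_inches values are illustrative choices, not settings from the original post.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
# Made-up points standing in for the volcano plot.
ax.scatter([-2.5, 0.2, 3.0], [3.0, 0.5, 4.0])
ax.set_title("Volcano Plot")

# A temporary directory stands in for result_dir in this sketch.
result_dir = tempfile.mkdtemp()
output_file = os.path.join(result_dir, "volcano_plot.png")

# Saving through the Figure object does not depend on which figure is
# "current", so it keeps working regardless of where plt.show() is called.
fig.savefig(output_file, dpi=300, bbox_inches="tight")
plt.close(fig)
print(f"Plot saved to: {output_file}")
```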

Conclusion

This notebook demonstrated how to load, process, and visualize gene expression data as a volcano plot in Python. We classified genes by p-value and fold change, labeled the top-ranked genes, and drew and saved the plot with seaborn and matplotlib.