Analyse Binette results
Let’s visualize the results from Binette and compare them to the initial bin sets used as input.
To explore these results interactively, you can open the Jupyter notebook via Binder by following this link:
Import Necessary Libraries
First, we’ll need to import the necessary libraries for our analysis and plotting:
[1]:
import pandas as pd
from pathlib import Path
import plotly.express as px
# The following two lines are needed to properly display Plotly graphs in the documentation
# However you may need to remove these lines and restart the kernel to visualise the graph in another context
import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
Load Binette Results
Now, let’s load the final Binette quality report into a Pandas DataFrame:
[2]:
binette_result_file = "./binette_output/final_bins_quality_reports.tsv"
df_binette = pd.read_csv(binette_result_file, sep="\t")
df_binette["tool"] = "binette" # Add a column to label the tool
df_binette["index"] = df_binette.index # Add an index column
df_binette
[2]:
| | name | origin | is_original | original_name | completeness | contamination | score | checkm2_model | size | N50 | coding_density | contig_count | tool | index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | binette_bin1 | binette | False | binette_bin1 | 100.00 | 0.10 | 99.80 | Neural Network (Specific Model) | 4658605 | 82084 | 0.8803 | 91 | binette | 0 |
| 1 | binette_bin2 | binette | False | binette_bin2 | 99.94 | 0.23 | 99.48 | Neural Network (Specific Model) | 2796059 | 41151 | 0.8882 | 98 | binette | 1 |
| 2 | binette_bin3 | binette | False | binette_bin3 | 96.10 | 0.27 | 95.56 | Gradient Boost (General Model) | 2559714 | 11656 | 0.8990 | 315 | binette | 2 |
| 3 | binette_bin4 | binette | False | binette_bin4 | 93.43 | 0.12 | 93.19 | Neural Network (Specific Model) | 4229623 | 40395 | 0.9031 | 148 | binette | 3 |
| 4 | binette_bin5 | binette | False | binette_bin5 | 95.15 | 2.36 | 90.43 | Gradient Boost (General Model) | 1843697 | 10106 | 0.8835 | 266 | binette | 4 |
| 5 | binette_bin6 | binette | False | binette_bin6 | 91.50 | 2.21 | 87.08 | Gradient Boost (General Model) | 3543663 | 5964 | 0.8542 | 786 | binette | 5 |
| 6 | binette_bin7 | semibin2_output/output_bins | True | SemiBin_23.fa | 84.06 | 1.66 | 80.74 | Neural Network (Specific Model) | 1689331 | 8389 | 0.8678 | 246 | binette | 6 |
| 7 | binette_bin8 | binette | False | binette_bin8 | 74.32 | 2.17 | 69.98 | Gradient Boost (General Model) | 1257085 | 5017 | 0.8946 | 257 | binette | 7 |
| 8 | binette_bin9 | binette | False | binette_bin9 | 74.08 | 3.82 | 66.44 | Neural Network (Specific Model) | 3492747 | 3005 | 0.9218 | 1308 | binette | 8 |
| 9 | binette_bin10 | binette | False | binette_bin10 | 64.49 | 1.79 | 60.91 | Gradient Boost (General Model) | 1266713 | 3796 | 0.9064 | 415 | binette | 9 |
| 10 | binette_bin11 | binette | False | binette_bin11 | 60.27 | 1.85 | 56.57 | Neural Network (Specific Model) | 2080860 | 4612 | 0.9044 | 519 | binette | 10 |
| 11 | binette_bin12 | binette | False | binette_bin12 | 52.00 | 1.07 | 49.86 | Neural Network (Specific Model) | 2516999 | 5503 | 0.9092 | 482 | binette | 11 |
| 12 | binette_bin13 | binette | False | binette_bin13 | 48.86 | 4.50 | 39.86 | Gradient Boost (General Model) | 1119471 | 1517 | 0.8945 | 729 | binette | 12 |
| 13 | binette_bin14 | binette | False | binette_bin14 | 43.66 | 5.11 | 33.44 | Neural Network (Specific Model) | 2087483 | 4593 | 0.9248 | 476 | binette | 13 |
| 14 | binette_bin15 | binette | False | binette_bin15 | 43.93 | 9.52 | 24.89 | Neural Network (Specific Model) | 2451217 | 1480 | 0.8544 | 1627 | binette | 14 |
Load and Combine Input Bin Quality Reports
Next, we will load the quality reports of the input bin sets, computed by various tools and saved by Binette. We’ll combine these into a single DataFrame and add a column to indicate high-quality bins. We define a high-quality bin as one with contamination ≤ 5% and completeness ≥ 90%.
[3]:
from pathlib import Path
input_bins_quality_reports_dir = Path("binette_output/input_bins_quality_reports/")
# Initialize the list with Binette results
df_input_bin_list = [df_binette]
# Load each input bin quality report
for input_bin_metric_file in input_bins_quality_reports_dir.glob("*tsv"):
    # Extract tool name from file name
    tool = input_bin_metric_file.name.split(".")[1].split("_")[0]
    df_input = pd.read_csv(input_bin_metric_file, sep="\t")
    df_input["index"] = df_input.index
    df_input["tool"] = tool
    df_input_bin_list.append(df_input)
# Combine all DataFrames into one
df_bins = pd.concat(df_input_bin_list)
# Add a column to indicate high-quality bins
df_bins["High quality bin"] = (df_bins["completeness"] >= 90) & (
    df_bins["contamination"] <= 5
)
# Display relevant columns
df_bins[["tool", "completeness", "contamination", "size", "N50", "contig_count"]]
[3]:
| | tool | completeness | contamination | size | N50 | contig_count |
|---|---|---|---|---|---|---|
| 0 | binette | 100.00 | 0.10 | 4658605 | 82084 | 91 |
| 1 | binette | 99.94 | 0.23 | 2796059 | 41151 | 98 |
| 2 | binette | 96.10 | 0.27 | 2559714 | 11656 | 315 |
| 3 | binette | 93.43 | 0.12 | 4229623 | 40395 | 148 |
| 4 | binette | 95.15 | 2.36 | 1843697 | 10106 | 266 |
| ... | ... | ... | ... | ... | ... | ... |
| 9 | metabat2 | 44.85 | 0.79 | 987990 | 4743 | 220 |
| 10 | metabat2 | 44.38 | 0.58 | 1745116 | 4265 | 420 |
| 11 | metabat2 | 25.47 | 0.03 | 1077467 | 91995 | 14 |
| 12 | metabat2 | 94.21 | 37.06 | 8631886 | 4347 | 1994 |
| 13 | metabat2 | 7.06 | 0.03 | 252404 | 64012 | 6 |
139 rows × 6 columns
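The tool label in the loop above comes from splitting each report's file name. As a minimal sketch of that rule, the snippet below applies it to a made-up file name. The name `input_bins_1.semibin2_output_output_bins.tsv` and the helper `tool_from_report_name` are assumptions for illustration only; check the actual file names in your `input_bins_quality_reports` directory.

```python
from pathlib import Path


def tool_from_report_name(report_path):
    """Return the tool label using the same splitting rule as the loop above."""
    # Take the second dot-separated field, then the part before the first underscore
    return Path(report_path).name.split(".")[1].split("_")[0]


# Hypothetical file name, chosen only to illustrate the parsing rule
example = "input_bins_1.semibin2_output_output_bins.tsv"
print(tool_from_report_name(example))  # prints "semibin2"
```

If your input bin directories are named differently, the extracted label will differ too, so it may be worth printing `df_bins["tool"].unique()` before plotting.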
Plot bin completeness and contamination
With the DataFrame containing both Binette’s final bins and the input bins, we can now create a scatter plot to visualize the results:
[4]:
import plotly.express as px
# Create a scatter plot to visualize completeness and contamination
fig = px.scatter(
    df_bins,
    x="completeness",
    y="contamination",
    color="High quality bin",
    size="size",
    facet_row="tool",
    title="Bin Quality Comparison",
)
# Update layout for better visibility
fig.update_layout(
    width=600,
    height=800,
    legend_title="High Quality Bin",
    title="Comparison of Bin Quality Metrics",
)
# Show the plot
fig.show()
We can see that Binette yields more high-quality bins (completeness ≥ 90% and contamination ≤ 5%) than any of the input bin sets.
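To back the visual impression with numbers, the high-quality flag can be counted per tool. The sketch below uses a small toy DataFrame (the values are invented, not taken from the Binette reports) with the same column names and thresholds as above.

```python
import pandas as pd

# Toy data for illustration only; these are not values from the Binette reports
toy_bins = pd.DataFrame(
    {
        "tool": ["binette", "binette", "metabat2", "metabat2"],
        "completeness": [98.0, 92.0, 94.0, 91.0],
        "contamination": [1.0, 3.0, 37.0, 4.0],
    }
)

# Same thresholds as above: completeness >= 90 and contamination <= 5
toy_bins["High quality bin"] = (toy_bins["completeness"] >= 90) & (
    toy_bins["contamination"] <= 5
)

# Summing a boolean column counts the True values within each tool
high_quality_counts = toy_bins.groupby("tool")["High quality bin"].sum()
print(high_quality_counts)
```

Applied to the real data, the same `groupby` call on `df_bins` should give one count per bin set.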
Comparing Binning Tools Using Bin Score Curves
A common way to compare bin sets is by sorting the bins based on their scores and plotting them against their index.
Here’s how we can create such a plot:
[5]:
# Calculate the score for each bin
df_bins["completeness - 2*contamination"] = (
    df_bins["completeness"] - 2 * df_bins["contamination"]
)
# Plot the score against the bin index
fig = px.line(
    df_bins, x="index", y="completeness - 2*contamination", color="tool", markers=True
)
fig.update_layout(width=600, height=500)
fig.show()
From the plot, you might notice that Concoct has a lot of bins with lower quality scores. Let’s zoom in to get a better look:
[6]:
# Adjust the plot view to zoom in
fig.update_layout(
    xaxis_range=[-1, 20],  # Zoom on x-axis
    yaxis_range=[0, 100],  # Zoom on y-axis
    width=600,
    height=500,
)
fig.show()
The Binette line consistently appears above those of the other binning tools, which indicates that Binette produces higher-quality bins than the initial bin sets.
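The curves above use each report's row order as the x-axis, which only gives a meaningful ranking if the rows are already sorted by score. As a hedged sketch, assuming the reports might not be sorted, the ranking can be made explicit by sorting on the score and numbering bins within each tool. The data, the tool names `tool_a` and `tool_b`, and the `score` and `rank` columns are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; rows are deliberately unsorted
toy_bins = pd.DataFrame(
    {
        "tool": ["tool_a", "tool_a", "tool_a", "tool_b", "tool_b"],
        "completeness": [60.0, 95.0, 80.0, 90.0, 40.0],
        "contamination": [5.0, 1.0, 2.0, 10.0, 0.0],
    }
)

# Same score as above: completeness minus twice the contamination
toy_bins["score"] = toy_bins["completeness"] - 2 * toy_bins["contamination"]

# Sort best-first, then number the bins within each tool (0 = best bin)
toy_bins = toy_bins.sort_values("score", ascending=False)
toy_bins["rank"] = toy_bins.groupby("tool").cumcount()
print(toy_bins[["tool", "score", "rank"]])
```

Plotting `rank` instead of `index` on the x-axis would then give the same kind of curve regardless of the order in which the rows were stored.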
Plot Number of High-Quality Bins per Bin Set
Let’s plot the number of bins falling into different quality categories. We’ll focus on bins with a maximum of 10% contamination and classify them into three completeness categories:
``> 50% and ≤ 70%``
``> 70% and ≤ 90%``
``> 90%``
First, let’s group and count the bins in each category:
[7]:
# Define the contamination cutoff
contamination_cutoff = 10
# Create filters for completeness categories
low_contamination_filt = df_bins["contamination"] <= contamination_cutoff
high_completeness_filt = df_bins["completeness"] > 90
medium_completeness_filt = df_bins["completeness"] > 70
low_completeness_filt = df_bins["completeness"] > 50
# Define quality categories
quality = f"Contamination ≤ {contamination_cutoff} and<br>Completeness"
df_bins.loc[low_contamination_filt & low_completeness_filt, quality] = "> 50% and ≤ 70%"
df_bins.loc[low_contamination_filt & medium_completeness_filt, quality] = (
    "> 70% and ≤ 90%"
)
df_bins.loc[low_contamination_filt & high_completeness_filt, quality] = "> 90%"
# Group and count bins by quality category and tool
df_bins_quality_grouped = (
    df_bins.groupby([quality, "tool"]).agg(bin_count=("index", "count")).reset_index()
)
df_bins_quality_grouped
[7]:
| | Contamination ≤ 10 and<br>Completeness | tool | bin_count |
|---|---|---|---|
| 0 | > 50% and ≤ 70% | binette | 3 |
| 1 | > 50% and ≤ 70% | maxbin2 | 2 |
| 2 | > 50% and ≤ 70% | metabat2 | 1 |
| 3 | > 50% and ≤ 70% | semibin2 | 2 |
| 4 | > 70% and ≤ 90% | binette | 3 |
| 5 | > 70% and ≤ 90% | concoct | 2 |
| 6 | > 70% and ≤ 90% | metabat2 | 5 |
| 7 | > 70% and ≤ 90% | semibin2 | 4 |
| 8 | > 90% | binette | 6 |
| 9 | > 90% | concoct | 4 |
| 10 | > 90% | maxbin2 | 2 |
| 11 | > 90% | metabat2 | 3 |
| 12 | > 90% | semibin2 | 4 |
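The categories above are assigned by overwriting labels from the loosest filter to the strictest. An alternative sketch, not the approach used in this notebook, is to bin completeness in one step with `pd.cut`, whose intervals are right-closed by default and so match the `>` and `≤` bounds of the labels. The toy values and the `category` column name are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are not values from the Binette reports
toy_bins = pd.DataFrame(
    {
        "completeness": [45.0, 55.0, 70.0, 85.0, 90.0, 99.0],
        "contamination": [1.0, 2.0, 3.0, 12.0, 4.0, 0.5],
    }
)

# Right-closed intervals: (50, 70], (70, 90], (90, 100]
toy_bins["category"] = pd.cut(
    toy_bins["completeness"],
    bins=[50, 70, 90, 100],
    labels=["> 50% and ≤ 70%", "> 70% and ≤ 90%", "> 90%"],
)

# Keep the label only for bins at or below the contamination cutoff
contamination_cutoff = 10
toy_bins["category"] = toy_bins["category"].where(
    toy_bins["contamination"] <= contamination_cutoff
)
print(toy_bins)
```

Bins with completeness of 50% or less, or with contamination above the cutoff, end up with a missing category and are dropped by a later `groupby`, as in the cell above.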
Now, let’s create a bar plot to visualize the number of bins in each quality category for each bin set:
[8]:
# Define colors for each completeness category
color_discrete_map = {
    "> 90%": px.colors.qualitative.Prism[4],
    "> 70% and ≤ 90%": px.colors.qualitative.Prism[2],
    "> 50% and ≤ 70%": px.colors.qualitative.Prism[6],
}
# Create the bar plot
fig = px.bar(
    df_bins_quality_grouped,
    x="tool",
    y="bin_count",
    color=quality,
    barmode="stack",
    color_discrete_map=color_discrete_map,
    text="bin_count",
    category_orders={"tool": ["binette", "semibin2", "concoct", "metabat2", "maxbin2"]},
    opacity=0.9,
)
# Update layout for better appearance
fig.update_layout(width=600, height=500, legend=dict(traceorder="reversed"))
fig.show()
From the plot, you can see that Binette produces more high-quality bins than any of the initial bin sets: 6 bins with completeness > 90% and contamination ≤ 10%, against at most 4 for the input tools. 🎉