Analyse Binette results

Let’s visualize the results from Binette and compare them to the initial bin sets used as input.

To explore these results interactively, you can open the Jupyter notebook via Binder by following this link:

Import Necessary Libraries

First, we’ll need to import the necessary libraries for our analysis and plotting:

[1]:

import pandas as pd
from pathlib import Path
import plotly.express as px

# The next two lines make Plotly graphs render properly in the documentation.
# In another context (e.g. a local notebook), you may need to remove them and restart the kernel to display the graphs.
import plotly.io as pio

pio.renderers.default = "sphinx_gallery"

Load Binette Results

Now, let’s load the final Binette quality report into a Pandas DataFrame:

[2]:

binette_result_file = "./binette_output/final_bins_quality_reports.tsv"
df_binette = pd.read_csv(binette_result_file, sep="\t")
df_binette["tool"] = "binette"  # Add a column to label the tool
df_binette["index"] = df_binette.index  # Add an index column
df_binette

[2]:

	name	origin	is_original	original_name	completeness	contamination	score	checkm2_model	size	N50	coding_density	contig_count	tool	index
0	binette_bin1	binette	False	binette_bin1	100.00	0.10	99.80	Neural Network (Specific Model)	4658605	82084	0.8803	91	binette	0
1	binette_bin2	binette	False	binette_bin2	99.94	0.23	99.48	Neural Network (Specific Model)	2796059	41151	0.8882	98	binette	1
2	binette_bin3	binette	False	binette_bin3	96.10	0.27	95.56	Gradient Boost (General Model)	2559714	11656	0.8990	315	binette	2
3	binette_bin4	binette	False	binette_bin4	93.43	0.12	93.19	Neural Network (Specific Model)	4229623	40395	0.9031	148	binette	3
4	binette_bin5	binette	False	binette_bin5	95.15	2.36	90.43	Gradient Boost (General Model)	1843697	10106	0.8835	266	binette	4
5	binette_bin6	binette	False	binette_bin6	91.50	2.21	87.08	Gradient Boost (General Model)	3543663	5964	0.8542	786	binette	5
6	binette_bin7	semibin2_output/output_bins	True	SemiBin_23.fa	84.06	1.66	80.74	Neural Network (Specific Model)	1689331	8389	0.8678	246	binette	6
7	binette_bin8	binette	False	binette_bin8	74.32	2.17	69.98	Gradient Boost (General Model)	1257085	5017	0.8946	257	binette	7
8	binette_bin9	binette	False	binette_bin9	74.08	3.82	66.44	Neural Network (Specific Model)	3492747	3005	0.9218	1308	binette	8
9	binette_bin10	binette	False	binette_bin10	64.49	1.79	60.91	Gradient Boost (General Model)	1266713	3796	0.9064	415	binette	9
10	binette_bin11	binette	False	binette_bin11	60.27	1.85	56.57	Neural Network (Specific Model)	2080860	4612	0.9044	519	binette	10
11	binette_bin12	binette	False	binette_bin12	52.00	1.07	49.86	Neural Network (Specific Model)	2516999	5503	0.9092	482	binette	11
12	binette_bin13	binette	False	binette_bin13	48.86	4.50	39.86	Gradient Boost (General Model)	1119471	1517	0.8945	729	binette	12
13	binette_bin14	binette	False	binette_bin14	43.66	5.11	33.44	Neural Network (Specific Model)	2087483	4593	0.9248	476	binette	13
14	binette_bin15	binette	False	binette_bin15	43.93	9.52	24.89	Neural Network (Specific Model)	2451217	1480	0.8544	1627	binette	14
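The `score` column in the table above appears to equal completeness minus twice the contamination (for example, 100.00 - 2 × 0.10 = 99.80 for binette_bin1). The short sketch below checks this on three rows copied from the table. The tolerance of 0.05 is an assumption, chosen to absorb rounding in the displayed values.

```python
import pandas as pd

# Three rows copied from the Binette report shown above
rows = pd.DataFrame(
    {
        "name": ["binette_bin1", "binette_bin2", "binette_bin5"],
        "completeness": [100.00, 99.94, 95.15],
        "contamination": [0.10, 0.23, 2.36],
        "score": [99.80, 99.48, 90.43],
    }
)

# Recompute the score and compare it with the reported value
recomputed = rows["completeness"] - 2 * rows["contamination"]
matches = (recomputed - rows["score"]).abs() < 0.05  # assumed rounding tolerance

print(rows.assign(recomputed=recomputed, matches=matches))
```

If the check fails on your own report, the score was probably computed with a different contamination weight.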

Load and Combine Input Bin Quality Reports

Next, we will load the quality reports of the input bin sets, computed by various tools and saved by Binette. We’ll combine these into a single DataFrame and add a column to indicate high-quality bins. We define a high-quality bin as one with contamination ≤ 5% and completeness ≥ 90%.

[3]:

from pathlib import Path

input_bins_quality_reports_dir = Path("binette_output/input_bins_quality_reports/")

# Initialize the list with Binette results
df_input_bin_list = [df_binette]

# Load each input bin quality report
for input_bin_metric_file in input_bins_quality_reports_dir.glob("*tsv"):
    # Extract the tool name from the file name
    tool = input_bin_metric_file.name.split(".")[1].split("_")[0]
    df_input = pd.read_csv(input_bin_metric_file, sep="\t")
    df_input["index"] = df_input.index
    df_input["tool"] = tool
    df_input_bin_list.append(df_input)

# Combine all DataFrames into one
df_bins = pd.concat(df_input_bin_list)

# Add a column to indicate high-quality bins
df_bins["High quality bin"] = (df_bins["completeness"] >= 90) & (
    df_bins["contamination"] <= 5
)

# Display relevant columns
df_bins[["tool", "completeness", "contamination", "size", "N50", "contig_count"]]

[3]:

	tool	completeness	contamination	size	N50	contig_count
0	binette	100.00	0.10	4658605	82084	91
1	binette	99.94	0.23	2796059	41151	98
2	binette	96.10	0.27	2559714	11656	315
3	binette	93.43	0.12	4229623	40395	148
4	binette	95.15	2.36	1843697	10106	266
...	...	...	...	...	...	...
9	metabat2	44.85	0.79	987990	4743	220
10	metabat2	44.38	0.58	1745116	4265	420
11	metabat2	25.47	0.03	1077467	91995	14
12	metabat2	94.21	37.06	8631886	4347	1994
13	metabat2	7.06	0.03	252404	64012	6

139 rows × 6 columns
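The loop above takes the tool name from the second dot-separated field of each report file name. Here is a minimal sketch of that parsing as a standalone helper. The function name `tool_from_report_name` and the example file name are hypothetical, so check them against the files actually present in your `input_bins_quality_reports` directory.

```python
from pathlib import Path


def tool_from_report_name(report_path):
    """Return the tool name encoded in a report file name.

    Assumes names shaped like '<prefix>.<tool>_<suffix>.tsv' (an assumption,
    not a documented guarantee).
    """
    parts = Path(report_path).name.split(".")
    if len(parts) < 3:
        # Without this guard, a name like 'report.tsv' would yield 'tsv' as the tool
        raise ValueError(f"Unexpected report file name: {report_path}")
    return parts[1].split("_")[0]


# Hypothetical file name, for illustration only
print(tool_from_report_name("input_bins_1.concoct_bins.tsv"))
```

Raising an error on unexpected names avoids silently labelling bins with a wrong tool.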

Plot bin completeness and contamination

With the DataFrame containing both Binette’s final bins and the input bins, we can now create a scatter plot to visualize the results:

[4]:

import plotly.express as px

# Create a scatter plot to visualize completeness and contamination
fig = px.scatter(
    df_bins,
    x="completeness",
    y="contamination",
    color="High quality bin",
    size="size",
    facet_row="tool",
)

# Update layout for better visibility
fig.update_layout(
    width=600,
    height=800,
    legend_title="High Quality Bin",
    title="Comparison of Bin Quality Metrics",
)

# Show the plot
fig.show()

We can see that the Binette bin set contains the most high-quality bins (completeness ≥ 90% and contamination ≤ 5%).
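To back the visual impression with numbers, we can count the high-quality bins per tool. The sketch below uses a small toy DataFrame with made-up values so that it runs on its own. With the real data, you would apply the same `groupby` to `df_bins`.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame(
    {
        "tool": ["binette", "binette", "binette", "concoct", "concoct"],
        "completeness": [100.0, 93.4, 74.3, 95.0, 94.2],
        "contamination": [0.1, 0.1, 2.2, 1.0, 37.1],
    }
)

# Same high-quality definition as above: completeness >= 90 and contamination <= 5
toy["High quality bin"] = (toy["completeness"] >= 90) & (toy["contamination"] <= 5)

# Summing booleans counts the True values per tool
hq_counts = toy.groupby("tool")["High quality bin"].sum().sort_values(ascending=False)
print(hq_counts)
```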

Comparing Binning Tools Using Bin Score Curves

A common way to compare bin sets is to sort the bins of each set by score and plot each score against the bin's rank (its index in the sorted set).

Here’s how we can create such a plot:

[5]:

# Calculate the score for each bin
df_bins["completeness - 2*contamination"] = (
    df_bins["completeness"] - 2 * df_bins["contamination"]
)

# Plot the score against the bin index
fig = px.line(
    df_bins, x="index", y="completeness - 2*contamination", color="tool", markers=True
)
fig.update_layout(width=600, height=500)
fig.show()

From the plot, you might notice that Concoct has many low-scoring bins. Let’s zoom in to get a better look:

[6]:

# Adjust the plot view to zoom in
fig.update_layout(
    xaxis_range=[-1, 20],  # Zoom on x-axis
    yaxis_range=[0, 100],  # Zoom on y-axis
    width=600,
    height=500,
)
fig.show()

The Binette line consistently appears above those of the other binning tools. This indicates that Binette produces higher-quality bins than the initial bin sets.
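The plot above uses each report's row order as the bin index, which matches the score ranking only if every report is already sorted by score. If you are unsure about the order, here is a minimal sketch that computes the rank explicitly. It uses toy data with made-up values, and the column names `score` and `rank` are choices made for this example.

```python
import pandas as pd

# Toy data with made-up values, deliberately unsorted
toy = pd.DataFrame(
    {
        "tool": ["a", "a", "a", "b", "b"],
        "completeness": [60.0, 95.0, 80.0, 90.0, 99.0],
        "contamination": [2.0, 1.0, 5.0, 10.0, 0.5],
    }
)

# Same score definition as in the cell above
toy["score"] = toy["completeness"] - 2 * toy["contamination"]

# Sort by descending score within each tool, then number the bins per tool
toy = toy.sort_values(["tool", "score"], ascending=[True, False])
toy["rank"] = toy.groupby("tool").cumcount()

print(toy[["tool", "score", "rank"]])
```

Plotting `score` against `rank` instead of the row index then gives the same kind of curve regardless of the order of the rows in the input files.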

Plot Number of High-Quality Bins per Bin Set

Let’s plot the number of bins falling into different quality categories. We’ll focus on bins with a maximum of 10% contamination and classify them into three completeness categories:

``> 50% and ≤ 70%``
``> 70% and ≤ 90%``
``> 90%``

First, let’s group and count the bins in each category:

[7]:

# Define the contamination cutoff
contamination_cutoff = 10

# Create filters for completeness categories
low_contamination_filt = df_bins["contamination"] <= contamination_cutoff
high_completeness_filt = df_bins["completeness"] > 90
medium_completeness_filt = df_bins["completeness"] > 70
low_completeness_filt = df_bins["completeness"] > 50

# Define quality categories
quality = f"Contamination ≤ {contamination_cutoff} and<br>Completeness"
df_bins.loc[low_contamination_filt & low_completeness_filt, quality] = "> 50% and ≤ 70%"
df_bins.loc[low_contamination_filt & medium_completeness_filt, quality] = (
    "> 70% and ≤ 90%"
)
df_bins.loc[low_contamination_filt & high_completeness_filt, quality] = "> 90%"

# Group and count bins by quality category and tool
df_bins_quality_grouped = (
    df_bins.groupby([quality, "tool"]).agg(bin_count=("index", "count")).reset_index()
)
df_bins_quality_grouped

[7]:

	Contamination ≤ 10 and<br>Completeness	tool	bin_count
0	> 50% and ≤ 70%	binette	3
1	> 50% and ≤ 70%	maxbin2	2
2	> 50% and ≤ 70%	metabat2	1
3	> 50% and ≤ 70%	semibin2	2
4	> 70% and ≤ 90%	binette	3
5	> 70% and ≤ 90%	concoct	2
6	> 70% and ≤ 90%	metabat2	5
7	> 70% and ≤ 90%	semibin2	4
8	> 90%	binette	6
9	> 90%	concoct	4
10	> 90%	maxbin2	2
11	> 90%	metabat2	3
12	> 90%	semibin2	4
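The three successive `.loc` assignments above work because each one overwrites the previous label for bins that also pass a stricter threshold. As an alternative, here is a sketch that assigns the same categories in one step with `pd.cut`, whose intervals are right-closed by default. It runs on toy data with made-up values, and the column name `category` is chosen for this example.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame(
    {
        "tool": ["binette", "binette", "concoct", "concoct", "maxbin2", "maxbin2"],
        "completeness": [95.0, 90.0, 70.0, 55.0, 45.0, 99.0],
        "contamination": [1.0, 2.0, 3.0, 4.0, 1.0, 12.0],
    }
)

labels = ["> 50% and ≤ 70%", "> 70% and ≤ 90%", "> 90%"]

# Keep only bins at or below the contamination cutoff
kept = toy[toy["contamination"] <= 10].copy()

# Intervals are (50, 70], (70, 90] and (90, 100]; values of 50 or less become NaN
kept["category"] = pd.cut(kept["completeness"], bins=[50, 70, 90, 100], labels=labels)

# Count bins per category and tool; rows with a NaN category are dropped
counts = (
    kept.groupby(["category", "tool"], observed=True)
    .size()
    .reset_index(name="bin_count")
)
print(counts)
```

Stating the interval edges once in `bins` keeps the thresholds and their labels in a single place, which makes them easier to change consistently.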

Now, let’s create a bar plot to visualize the number of bins in each quality category for each bin set:

[8]:

# Define colors for each completeness category
color_discrete_map = {
    "> 90%": px.colors.qualitative.Prism[4],
    "> 70% and ≤ 90%": px.colors.qualitative.Prism[2],
    "> 50% and ≤ 70%": px.colors.qualitative.Prism[6],
}

# Create the bar plot
fig = px.bar(
    df_bins_quality_grouped,
    x="tool",
    y="bin_count",
    color=quality,
    barmode="stack",
    color_discrete_map=color_discrete_map,
    text="bin_count",
    category_orders={"tool": ["binette", "semibin2", "concoct", "metabat2", "maxbin2"]},
    opacity=0.9,
)

# Update layout for better appearance
fig.update_layout(width=600, height=500, legend=dict(traceorder="reversed"))

fig.show()

From the plot, you can see that Binette produces more high-quality bins than any of the initial bin sets! 🎉