Analyse Binette results

Let’s visualize the results from Binette and compare them to the initial bin sets used as input.

To explore these results interactively, you can open the Jupyter notebook via Binder by following this link: Binder

Import Necessary Libraries

First, we’ll need to import the necessary libraries for our analysis and plotting:

[1]:
import pandas as pd
from pathlib import Path
import plotly.express as px

# The following two lines are needed to properly display Plotly graphs in the documentation
# However, you may need to remove these lines and restart the kernel to display the graphs in another context
import plotly.io as pio

pio.renderers.default = "sphinx_gallery"

Load Binette Results

Now, let’s load the final Binette quality report into a Pandas DataFrame:

[2]:
binette_result_file = "./binette_output/final_bins_quality_reports.tsv"
df_binette = pd.read_csv(binette_result_file, sep="\t")
df_binette["tool"] = "binette"  # Add a column to label the tool
df_binette["index"] = df_binette.index  # Add an index column
df_binette
[2]:
name origin is_original original_name completeness contamination score checkm2_model size N50 coding_density contig_count tool index
0 binette_bin1 binette False binette_bin1 100.00 0.10 99.80 Neural Network (Specific Model) 4658605 82084 0.8803 91 binette 0
1 binette_bin2 binette False binette_bin2 99.94 0.23 99.48 Neural Network (Specific Model) 2796059 41151 0.8882 98 binette 1
2 binette_bin3 binette False binette_bin3 96.10 0.27 95.56 Gradient Boost (General Model) 2559714 11656 0.8990 315 binette 2
3 binette_bin4 binette False binette_bin4 93.43 0.12 93.19 Neural Network (Specific Model) 4229623 40395 0.9031 148 binette 3
4 binette_bin5 binette False binette_bin5 95.15 2.36 90.43 Gradient Boost (General Model) 1843697 10106 0.8835 266 binette 4
5 binette_bin6 binette False binette_bin6 91.50 2.21 87.08 Gradient Boost (General Model) 3543663 5964 0.8542 786 binette 5
6 binette_bin7 semibin2_output/output_bins True SemiBin_23.fa 84.06 1.66 80.74 Neural Network (Specific Model) 1689331 8389 0.8678 246 binette 6
7 binette_bin8 binette False binette_bin8 74.32 2.17 69.98 Gradient Boost (General Model) 1257085 5017 0.8946 257 binette 7
8 binette_bin9 binette False binette_bin9 74.08 3.82 66.44 Neural Network (Specific Model) 3492747 3005 0.9218 1308 binette 8
9 binette_bin10 binette False binette_bin10 64.49 1.79 60.91 Gradient Boost (General Model) 1266713 3796 0.9064 415 binette 9
10 binette_bin11 binette False binette_bin11 60.27 1.85 56.57 Neural Network (Specific Model) 2080860 4612 0.9044 519 binette 10
11 binette_bin12 binette False binette_bin12 52.00 1.07 49.86 Neural Network (Specific Model) 2516999 5503 0.9092 482 binette 11
12 binette_bin13 binette False binette_bin13 48.86 4.50 39.86 Gradient Boost (General Model) 1119471 1517 0.8945 729 binette 12
13 binette_bin14 binette False binette_bin14 43.66 5.11 33.44 Neural Network (Specific Model) 2087483 4593 0.9248 476 binette 13
14 binette_bin15 binette False binette_bin15 43.93 9.52 24.89 Neural Network (Specific Model) 2451217 1480 0.8544 1627 binette 14

Load and Combine Input Bin Quality Reports

Next, we load the quality reports that Binette saved for the input bin sets, each of which was produced by a different binning tool. We combine them with the Binette results into a single DataFrame and add a column that flags high-quality bins. We define a high-quality bin as one with contamination ≤ 5% and completeness ≥ 90%.

[3]:
from pathlib import Path

input_bins_quality_reports_dir = Path("binette_output/input_bins_quality_reports/")

# Initialize the list with Binette results
df_input_bin_list = [df_binette]

# Load each input bin quality report
for input_bin_metric_file in input_bins_quality_reports_dir.glob("*.tsv"):
    # Extract the tool name from the file name
    tool = input_bin_metric_file.name.split(".")[1].split("_")[0]
    df_input = pd.read_csv(input_bin_metric_file, sep="\t")
    df_input["index"] = df_input.index
    df_input["tool"] = tool
    df_input_bin_list.append(df_input)

# Combine all DataFrames into one
df_bins = pd.concat(df_input_bin_list)

# Add a column to indicate high-quality bins
df_bins["High quality bin"] = (df_bins["completeness"] >= 90) & (
    df_bins["contamination"] <= 5
)

# Display relevant columns
df_bins[["tool", "completeness", "contamination", "size", "N50", "contig_count"]]
[3]:
tool completeness contamination size N50 contig_count
0 binette 100.00 0.10 4658605 82084 91
1 binette 99.94 0.23 2796059 41151 98
2 binette 96.10 0.27 2559714 11656 315
3 binette 93.43 0.12 4229623 40395 148
4 binette 95.15 2.36 1843697 10106 266
... ... ... ... ... ... ...
9 metabat2 44.85 0.79 987990 4743 220
10 metabat2 44.38 0.58 1745116 4265 420
11 metabat2 25.47 0.03 1077467 91995 14
12 metabat2 94.21 37.06 8631886 4347 1994
13 metabat2 7.06 0.03 252404 64012 6

139 rows × 6 columns
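The tool label in the loop above comes from the report file name: the second dot-separated field, truncated at its first underscore. Here is a minimal sketch of that parsing. The helper name and the file name are hypothetical and only illustrate the pattern; the actual report names depend on how the input bin sets were passed to Binette.

```python
from pathlib import Path


def tool_from_report_name(report_path):
    """Return the tool name encoded in a report file name.

    Assumes names shaped like '<prefix>.<tool>_<suffix>.tsv' (an assumption).
    """
    # Take the second dot-separated field, then the part before the first underscore
    return Path(report_path).name.split(".")[1].split("_")[0]


# Hypothetical file name, used only to illustrate the parsing
example_name = "input_bins_quality_reports/input_bins_1.metabat2_bins.tsv"
print(tool_from_report_name(example_name))
```

If your file names follow a different pattern, this is the step to adapt, because a wrong label would silently merge or mislabel bin sets in the plots.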

Plot Bin Completeness and Contamination

With the DataFrame containing both Binette’s final bins and the input bins, we can now create a scatter plot to visualize the results:

[4]:
import plotly.express as px

# Create a scatter plot to visualize completeness and contamination
fig = px.scatter(
    df_bins,
    x="completeness",
    y="contamination",
    color="High quality bin",
    size="size",
    facet_row="tool",
)

# Update layout for better visibility
fig.update_layout(
    width=600,
    height=800,
    legend_title="High Quality Bin",
    title="Comparison of Bin Quality Metrics",
)

# Show the plot
fig.show()

We can see that the Binette bin set contains the most high-quality bins (completeness ≥ 90% and contamination ≤ 5%).
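To back the visual impression with numbers, the high-quality flag can be summed per tool. The sketch below uses only four rows copied from the table displayed earlier, so the counts it prints are not the full result; the variable names are chosen for illustration.

```python
import pandas as pd

# Four rows copied from the combined table shown above (not the full data set)
sample = pd.DataFrame(
    {
        "tool": ["binette", "binette", "metabat2", "metabat2"],
        "completeness": [100.00, 95.15, 94.21, 44.85],
        "contamination": [0.10, 2.36, 37.06, 0.79],
    }
)

# Same definition as in the notebook: completeness >= 90 and contamination <= 5
sample["High quality bin"] = (sample["completeness"] >= 90) & (
    sample["contamination"] <= 5
)

# Summing a boolean column counts the True values for each tool
hq_counts = sample.groupby("tool")["High quality bin"].sum()
print(hq_counts)
```

Note that the metabat2 bin with 94.21% completeness is not counted, because its contamination (37.06%) exceeds the 5% limit. On the full data, the same call would be `df_bins.groupby("tool")["High quality bin"].sum()`.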

Comparing Binning Tools Using Bin Score Curves

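A common way to compare bin sets is to sort the bins of each set by score and plot each score against the bin's rank. Here the score is completeness − 2 × contamination. The quality reports loaded above are already ordered by decreasing score, so the row index stored in the index column serves as the rank.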

Here’s how we can create such a plot:

[5]:
# Calculate the score for each bin
df_bins["completeness - 2*contamination"] = (
    df_bins["completeness"] - 2 * df_bins["contamination"]
)

# Plot the score against the bin index
fig = px.line(
    df_bins, x="index", y="completeness - 2*contamination", color="tool", markers=True
)
fig.update_layout(width=600, height=500)
fig.show()

From the plot, you might notice that Concoct produces many bins with low scores. Let’s zoom in to get a better look:

[6]:
# Adjust the plot view to zoom in
fig.update_layout(
    xaxis_range=[-1, 20],  # Zoom on x-axis
    yaxis_range=[0, 100],  # Zoom on y-axis
    width=600,
    height=500,
)
fig.show()

The Binette line consistently sits above the lines of the other binning tools, which indicates that Binette produces higher-quality bins than the initial bin sets.
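The curves above use the row index of each report as the bin rank, which works when each report is already ordered by score. If that ordering is not guaranteed, the rank can be computed explicitly. Here is a minimal sketch on made-up values; the tool names "a" and "b" and all numbers are toy data, not Binette output.

```python
import pandas as pd

# Toy data, deliberately unsorted (values are invented for illustration)
toy = pd.DataFrame(
    {
        "tool": ["a", "a", "a", "b", "b"],
        "completeness": [60.0, 95.0, 80.0, 90.0, 70.0],
        "contamination": [1.0, 2.0, 5.0, 0.0, 10.0],
    }
)

# Same score as in the notebook
toy["score"] = toy["completeness"] - 2 * toy["contamination"]

# Sort by decreasing score within each tool, then number the bins per tool
ranked = toy.sort_values(["tool", "score"], ascending=[True, False]).reset_index(
    drop=True
)
ranked["rank"] = ranked.groupby("tool").cumcount()

print(ranked[["tool", "score", "rank"]])
```

Plotting `rank` on the x-axis instead of `index` would then give the same kind of curve regardless of the order of the rows in the input files.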

Plot Number of High-Quality Bins per Bin Set

Let’s plot the number of bins falling into different quality categories. We’ll focus on bins with a maximum of 10% contamination and classify them into three completeness categories:

  • > 50% and ≤ 70%

  • > 70% and ≤ 90%

  • > 90%

First, let’s group and count the bins in each category:

[7]:
# Define the contamination cutoff
contamination_cutoff = 10

# Create filters for completeness categories
low_contamination_filt = df_bins["contamination"] <= contamination_cutoff
high_completeness_filt = df_bins["completeness"] > 90
medium_completeness_filt = df_bins["completeness"] > 70
low_completeness_filt = df_bins["completeness"] > 50

# Define quality categories
quality = f"Contamination ≤ {contamination_cutoff} and<br>Completeness"
df_bins.loc[low_contamination_filt & low_completeness_filt, quality] = "> 50% and ≤ 70%"
df_bins.loc[low_contamination_filt & medium_completeness_filt, quality] = (
    "> 70% and ≤ 90%"
)
df_bins.loc[low_contamination_filt & high_completeness_filt, quality] = "> 90%"

# Group and count bins by quality category and tool
df_bins_quality_grouped = (
    df_bins.groupby([quality, "tool"]).agg(bin_count=("index", "count")).reset_index()
)
df_bins_quality_grouped
[7]:
Contamination ≤ 10 and<br>Completeness tool bin_count
0 > 50% and ≤ 70% binette 3
1 > 50% and ≤ 70% maxbin2 2
2 > 50% and ≤ 70% metabat2 1
3 > 50% and ≤ 70% semibin2 2
4 > 70% and ≤ 90% binette 3
5 > 70% and ≤ 90% concoct 2
6 > 70% and ≤ 90% metabat2 5
7 > 70% and ≤ 90% semibin2 4
8 > 90% binette 6
9 > 90% concoct 4
10 > 90% maxbin2 2
11 > 90% metabat2 3
12 > 90% semibin2 4
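The categories above are assigned with successive boolean masks, each overwriting the previous one. As an alternative, `pandas.cut` can assign the same right-closed intervals in one call. This is a sketch of that alternative, not the notebook's method; it uses five rows copied from the tables shown earlier, and the variable names are illustrative.

```python
import pandas as pd

# Five rows copied from the tables shown earlier in this page
bins_df = pd.DataFrame(
    {
        "completeness": [43.66, 52.00, 74.32, 96.10, 94.21],
        "contamination": [5.11, 1.07, 2.17, 0.27, 37.06],
    }
)

# Right-closed intervals: (50, 70], (70, 90], (90, 100]
category = pd.cut(
    bins_df["completeness"],
    bins=[50, 70, 90, 100],
    labels=["> 50% and ≤ 70%", "> 70% and ≤ 90%", "> 90%"],
)

# Keep the category only for bins with at most 10% contamination
category = category.where(bins_df["contamination"] <= 10)
print(category)
```

Bins with completeness ≤ 50% or contamination above 10% end up without a category (NaN), which matches the behaviour of the mask-based cell: such rows are dropped by the subsequent `groupby`.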

Now, let’s create a bar plot to visualize the number of bins in each quality category for each bin set:

[8]:
# Define colors for each completeness category
color_discrete_map = {
    "> 90%": px.colors.qualitative.Prism[4],
    "> 70% and ≤ 90%": px.colors.qualitative.Prism[2],
    "> 50% and ≤ 70%": px.colors.qualitative.Prism[6],
}

# Create the bar plot
fig = px.bar(
    df_bins_quality_grouped,
    x="tool",
    y="bin_count",
    color=quality,
    barmode="stack",
    color_discrete_map=color_discrete_map,
    text="bin_count",
    category_orders={"tool": ["binette", "semibin2", "concoct", "metabat2", "maxbin2"]},
    opacity=0.9,
)

# Update layout for better appearance
fig.update_layout(width=600, height=500, legend=dict(traceorder="reversed"))

fig.show()

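From the plot, you can see that Binette produces more high-quality bins than any of the initial bin sets: 6 bins with completeness > 90% and contamination ≤ 10%, against at most 4 for any single input tool. 🎉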
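The same counts can also be read as a table with one row per tool. The sketch below re-enters the grouped counts displayed earlier and pivots them into a wide layout; the column name `category` and the `total` column are additions made for illustration.

```python
import pandas as pd

low, mid, high = "> 50% and ≤ 70%", "> 70% and ≤ 90%", "> 90%"

# Counts copied from the grouped table displayed earlier
counts = pd.DataFrame(
    {
        "category": [low] * 4 + [mid] * 4 + [high] * 5,
        "tool": [
            "binette", "maxbin2", "metabat2", "semibin2",
            "binette", "concoct", "metabat2", "semibin2",
            "binette", "concoct", "maxbin2", "metabat2", "semibin2",
        ],
        "bin_count": [3, 2, 1, 2, 3, 2, 5, 4, 6, 4, 2, 3, 4],
    }
)

# One row per tool, one column per category; missing combinations become 0
wide = (
    counts.pivot(index="tool", columns="category", values="bin_count")
    .fillna(0)
    .astype(int)
)
wide["total"] = wide.sum(axis=1)
print(wide)
```

In this layout, each tool's row gives the height of each segment of its stacked bar, and `total` gives the full bar height.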