binette package

Submodules

binette.bin_manager module

class binette.bin_manager.Bin(contigs: BitMap, origin: set[str] | None = None, name: str | None = None, is_original: bool = False)

Bases: object

CHECKM2_MODELS = ('Neural Network (Specific Model)', 'Gradient Boost (General Model)')

add_N50(n50: int) None

Add the N50 attribute to the Bin object.

Parameters:

n50 – The N50 value to add.

Returns:

None

add_coding_density(contig_to_coding_length: dict[int, int]) float | None

Calculate the coding density of the bin.

Parameters:

contig_to_coding_length – A dictionary mapping contig IDs to their total coding lengths.

Returns:

The coding density of the bin, or None if the length is not set or is zero.
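
As an illustration of the description above, coding density can be read as the summed coding length of the bin's contigs divided by the bin length. The sketch below is a standalone stand-in, not Binette's implementation: the function name, the explicit bin_length argument, and the use of a plain set in place of a BitMap are all assumptions.

```python
def coding_density(contigs, contig_to_coding_length, bin_length):
    """Return coding bases / bin length, or None when the length is unusable.

    contigs is a plain set of contig IDs (assumed stand-in for a BitMap).
    """
    if not bin_length:  # covers both None and 0
        return None
    coding = sum(contig_to_coding_length.get(c, 0) for c in contigs)
    return coding / bin_length


# Contigs 0 and 2 contribute 900 + 300 coding bases over a 2000 bp bin.
print(coding_density({0, 2}, {0: 900, 1: 500, 2: 300}, 2000))
```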

add_length(length: int) None

Add the length attribute to the Bin object if the provided length is a positive integer.

Parameters:

length – The length value to add.

Returns:

None

add_model(model: str) None

Add a CheckM2 model to the bin.

Parameters:

model – The model name to add.

Raises:

ValueError – If the model name is not recognized.

add_quality(completeness: float, contamination: float, contamination_weight: float) None

Set the quality attributes of the bin.

Parameters:
  • completeness – The completeness value.

  • contamination – The contamination value.

  • contamination_weight – The weight assigned to contamination in the score calculation.
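
To show how contamination_weight could enter the score, here is a minimal sketch. It assumes the score is completeness minus weighted contamination; the formula and the function name quality_score are assumptions for illustration and are not taken from this page.

```python
def quality_score(completeness, contamination, contamination_weight):
    """Assumed scoring rule: penalise contamination by a configurable weight."""
    return completeness - contamination_weight * contamination


# A 90% complete bin with 5% contamination, contamination weighted twice.
print(quality_score(90.0, 5.0, 2.0))
```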

property checkm2_model

Get the CheckM2 model for the bin.

contig_difference(*others: Bin) BitMap

Compute the difference between the bin and other bins.

Parameters:

others – Other bins to compute the difference with.

contig_intersection(*others: Bin) BitMap

Compute the intersection of the bin with other bins.

Parameters:

others – Other bins to compute the intersection with.

contig_union(*others: Bin) BitMap

Compute the union of the bin with other bins.

Parameters:

others – Other bins to compute the union with.
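
The three contig set operations above accept any number of other bins. The sketch below mirrors that calling convention with built-in Python sets standing in for BitMap objects; the free functions are illustrative stand-ins, not the Bin methods themselves.

```python
def contig_union(first, *others):
    """Contigs present in the first bin or in any of the others."""
    return set(first).union(*others)


def contig_intersection(first, *others):
    """Contigs shared by the first bin and every other bin."""
    return set(first).intersection(*others)


def contig_difference(first, *others):
    """Contigs of the first bin that appear in none of the others."""
    return set(first).difference(*others)


a, b, c = {1, 2, 3}, {2, 3, 4}, {3, 5}
print(contig_union(a, b, c), contig_intersection(a, b, c), contig_difference(a, b, c))
```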

property contigs_key

Serialize the contigs for easier comparison.

is_high_quality(min_completeness: float, max_contamination: float) bool

Determine if a bin is considered high quality based on completeness and contamination thresholds.

Parameters:
  • min_completeness – The minimum completeness required for a bin to be considered high quality.

  • max_contamination – The maximum allowed contamination for a bin to be considered high quality.

Raises:

ValueError – If either completeness or contamination has not been set (is None).

Returns:

True if the bin meets the high quality criteria; False otherwise.
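
A minimal sketch of the threshold check described above, written as a free function. Whether the real comparison is inclusive at the thresholds is an assumption here, as is passing the bin's completeness and contamination as explicit arguments.

```python
def is_high_quality(completeness, contamination, min_completeness, max_contamination):
    """Check a bin's quality values against both thresholds."""
    if completeness is None or contamination is None:
        raise ValueError("completeness and contamination must be set first")
    # Inclusive bounds are an assumption of this sketch.
    return completeness >= min_completeness and contamination <= max_contamination


print(is_high_quality(92.5, 3.1, min_completeness=90, max_contamination=5))
```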

overlaps_with(other: Bin) set[str]

Find the contigs that overlap between this bin and another bin.

Parameters:

other – The other Bin object.

Returns:

A set of contig names that overlap between the bins.

binette.bin_manager.build_contig_index(bins_dict: dict[bytes, binette.bin_manager.Bin]) dict[int, set[bytes]]

Build an inverted index: contig_id -> set of contigs_key of bins containing it.

Parameters:

bins_dict – Mapping from contigs_key -> Bin.

Returns:

Inverted index (contig_id -> set of contigs_key).
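
The inverted index can be pictured with the following standalone sketch. It assumes each bin is represented only by its contig IDs (a plain set instead of a Bin holding a BitMap); the function name build_index is a hypothetical stand-in.

```python
from collections import defaultdict


def build_index(bins):
    """bins maps a bin key (bytes) to the set of contig IDs in that bin."""
    index = defaultdict(set)
    for key, contigs in bins.items():
        for contig_id in contigs:
            index[contig_id].add(key)
    return dict(index)


bins = {b"bin_a": {0, 1}, b"bin_b": {1, 2}}
print(build_index(bins))
```

With such an index, finding every bin that shares a contig with a given bin only requires looking up that bin's contigs rather than comparing it against all other bins.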

binette.bin_manager.create_intermediate_bins(contig_key_to_initial_bin: dict[bytes, binette.bin_manager.Bin], contig_lengths: ndarray, min_comp: float, max_conta: float, min_len: int, max_len: int, disable_progress_bar: bool = False) dict[bytes, binette.bin_manager.Bin]

Creates intermediate bins from a dictionary of initial bins.

Parameters:

contig_key_to_initial_bin – Mapping from contigs_key to the initial input bins.

Returns:

A dictionary mapping contigs_key to the intermediate bins created from intersections, differences, and unions.

binette.bin_manager.from_bins_to_bin_graph(bins: Iterable[Bin]) Graph

Creates a graph of overlapping bins from a set of bins.

Parameters:

bins – a set of bins

Returns:

A networkx Graph representing the bin graph of overlapping bins.
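
The idea of a bin overlap graph can be sketched without networkx: each bin is a node, and an edge joins two bins that share at least one contig. The adjacency-dictionary representation and the function name below are assumptions for illustration; the documented function returns a networkx Graph.

```python
from itertools import combinations


def build_overlap_graph(bins):
    """bins maps a bin name to its set of contig IDs (stand-in for Bin objects)."""
    graph = {name: set() for name in bins}
    for (name_a, contigs_a), (name_b, contigs_b) in combinations(bins.items(), 2):
        if contigs_a & contigs_b:  # at least one shared contig
            graph[name_a].add(name_b)
            graph[name_b].add(name_a)
    return graph


print(build_overlap_graph({"a": {1, 2}, "b": {2, 3}, "c": {9}}))
```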

binette.bin_manager.get_all_possible_combinations(clique: list) Iterable[tuple]

Generates all possible combinations of elements from a given clique.

Parameters:

clique – An iterable representing a clique.

Returns:

An iterable of tuples representing all possible combinations of elements from the clique.
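
One way to enumerate combinations of clique members is with itertools.combinations. The sketch below assumes combinations of size two and larger are wanted; the minimum size and the function name are assumptions, not details stated on this page.

```python
from itertools import combinations


def all_combinations(clique):
    """Yield every combination of at least two elements, smallest sizes first."""
    for size in range(2, len(clique) + 1):
        yield from combinations(clique, size)


print(list(all_combinations(["a", "b", "c"])))
```

The number of combinations grows exponentially with clique size, which is worth keeping in mind when many bin sets overlap heavily.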

binette.bin_manager.get_bins_from_contig2bin_table(contig2bin_table: Path, set_name: str) list[dict[str, Any]]

Retrieves bin information, as a list of dictionaries, from a contig-to-bin table.

Parameters:
  • contig2bin_table – The path to the contig-to-bin table.

  • set_name – The name of the set the bins belong to.

Returns:

A list of dictionaries holding bin information, created from the contig-to-bin table.

binette.bin_manager.get_bins_from_directory(bin_dir: Path, set_name: str, fasta_extensions: set[str]) list[binette.bin_manager.Bin]

Retrieves a list of Bin objects from a directory containing bin FASTA files.

Parameters:
  • bin_dir – The directory path containing bin FASTA files.

  • set_name – The name of the set the bins belong to.

  • fasta_extensions – Possible FASTA extensions to look for in the bin directory.

Returns:

A list of Bin objects created from the bin FASTA files.

binette.bin_manager.get_contigs_in_bin_sets(bin_set_name_to_bins: dict[str, set[binette.bin_manager.Bin]]) list[str]

Processes bin sets to check for duplicated contigs and logs detailed information about each bin set.

Parameters:

bin_set_name_to_bins – A dictionary where keys are bin set names and values are sets of Bin objects.

Returns:

A list of the contig names found in the bin sets.

binette.bin_manager.make_bins_from_bins_info(bin_set_name_to_bins_info: dict[str, list[dict[str, Any]]], contig_to_index: dict[str, int], are_original_bins: bool)

Create Bin objects from the provided bin information.

Parameters:
  • bin_set_name_to_bins_info – A dictionary mapping bin set names to their bin information.

  • contig_to_index – A mapping of contig names to their indices.

  • are_original_bins – A boolean indicating whether the bins are original.

Returns:

A dictionary mapping serialized contig bitmaps to their corresponding Bin objects.

binette.bin_manager.parse_bin_directories(bin_name_to_bin_dir: dict[str, pathlib.Path], fasta_extensions: set[str]) dict[str, list[dict[str, Any]]]

Parses multiple bin directories and returns a dictionary mapping bin names to lists of bin information dictionaries.

Parameters:
  • bin_name_to_bin_dir – A dictionary mapping bin names to their respective bin directories.

  • fasta_extensions – Possible FASTA extensions to look for in the bin directory.

Returns:

A dictionary mapping bin names to lists of dictionaries created from the bin directories.

binette.bin_manager.parse_contig2bin_tables(bin_name_to_bin_tables: dict[str, pathlib.Path]) dict[str, list[dict[str, Any]]]

Parses multiple contig-to-bin tables and returns a dictionary mapping bin names to lists of bin information dictionaries.

Logs a warning if duplicate bins are detected within a bin set.

Parameters:

bin_name_to_bin_tables – A dictionary where keys are bin set names and values are file paths to contig-to-bin tables. Each table is parsed to extract bin information.

Returns:

A dictionary where keys are bin set names and values are lists of bin information dictionaries. Duplicates are removed based on contig composition.

binette.bin_manager.remove_bins_from_index(bin_keys: set[bytes], bins_dict: dict[bytes, binette.bin_manager.Bin], contig_to_bins: dict[int, set[bytes]]) None

Remove a set of bins from the inverted index.

Parameters:
  • bin_keys – The contig_keys of bins to remove.

  • bins_dict – Mapping from contig_key -> Bin.

  • contig_to_bins – Inverted index (contig_id -> set of contig_keys).
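
Removing bins from the inverted index can be sketched as follows, with bins again represented by plain sets of contig IDs. Dropping index entries that become empty is an assumption of this sketch, and remove_from_index is a hypothetical name.

```python
def remove_from_index(bin_keys, bins, contig_to_bins):
    """Remove the given bin keys from the inverted index, in place."""
    for key in bin_keys:
        for contig_id in bins[key]:
            holders = contig_to_bins.get(contig_id)
            if holders is None:
                continue
            holders.discard(key)
            if not holders:  # assumed clean-up of empty entries
                del contig_to_bins[contig_id]


bins = {b"a": {0, 1}, b"b": {1, 2}}
index = {0: {b"a"}, 1: {b"a", b"b"}, 2: {b"b"}}
remove_from_index({b"a"}, bins, index)
print(index)
```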

binette.bin_manager.select_best_bins(bins_dict: dict[bytes, binette.bin_manager.Bin], min_completeness: float, max_contamination: float, prefix: str = 'binette') list[binette.bin_manager.Bin]

Select the best non-overlapping bins based on score, N50, and ID.

Parameters:
  • bins_dict – Mapping from contig_key -> Bin.

  • min_completeness – Minimum completeness threshold for a bin to be considered.

  • max_contamination – Maximum contamination threshold for a bin to be considered.

  • prefix – Prefix to use for naming selected bins.
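
A greedy reading of "best non-overlapping bins" is sketched below: rank the candidates, then keep each bin whose contigs have not already been claimed. The ranking order (higher score, then higher N50, then lower ID), the dictionary keys, and the omission of the completeness and contamination filters are all assumptions of this sketch rather than documented behaviour.

```python
def select_non_overlapping(bins):
    """bins is a list of dicts with hypothetical keys: id, score, n50, contigs."""
    ranked = sorted(bins, key=lambda b: (-b["score"], -b["n50"], b["id"]))
    selected, used = [], set()
    for candidate in ranked:
        if used.isdisjoint(candidate["contigs"]):
            selected.append(candidate)
            used |= candidate["contigs"]
    return selected


candidates = [
    {"id": 1, "score": 80.0, "n50": 5000, "contigs": {0, 1}},
    {"id": 2, "score": 95.0, "n50": 3000, "contigs": {1, 2}},
    {"id": 3, "score": 70.0, "n50": 9000, "contigs": {3}},
]
print([b["id"] for b in select_non_overlapping(candidates)])
```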

binette.bin_manager.sum_contig_lengths(bm_contigs: BitMap, contig_lengths: ndarray, cache: dict[bytes, int] | None = None, key: bytes | None = None)
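
This function has no description on this page. Judging only from its signature, it appears to sum the lengths of the contigs in a bitmap, optionally memoising the result under a key. The sketch below illustrates that reading with a plain set and a list in place of BitMap and ndarray; the caching behaviour shown is an assumption.

```python
def sum_lengths(contigs, contig_lengths, cache=None, key=None):
    """Sum contig_lengths[i] for each contig ID i, with optional memoisation."""
    use_cache = cache is not None and key is not None
    if use_cache and key in cache:
        return cache[key]
    total = sum(contig_lengths[i] for i in contigs)
    if use_cache:
        cache[key] = total
    return total


lengths = [1000, 2500, 400]
cache = {}
print(sum_lengths({0, 2}, lengths, cache=cache, key=b"k"))
```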

binette.bin_quality module

binette.cds module

binette.cds.filter_faa_file(contigs_to_keep: set[str], input_faa_file: Path, filtered_faa_file: Path)

Filters a FASTA file containing protein sequences to only include sequences from contigs present in the provided set of contigs (contigs_to_keep).

This function processes the input FASTA file, identifies protein sequences originating from contigs listed in contigs_to_keep, and writes the filtered sequences to a new FASTA file. The output file supports optional .gz compression.

Parameters:
  • contigs_to_keep – A set of contig names to retain in the output FASTA file.

  • input_faa_file – Path to the input FASTA file containing protein sequences.

  • filtered_faa_file – Path to the output FASTA file for filtered sequences. If the filename ends with .gz, the output will be compressed.

binette.cds.get_aa_composition(genes: list[str]) Counter

Calculate the amino acid composition of a list of protein sequences.

Parameters:

genes – A list of protein sequences.

Returns:

A Counter object representing the amino acid composition.
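
Counting amino acids across several sequences maps directly onto collections.Counter. The function below is a minimal sketch of that idea, not necessarily how the library implements it; the name aa_composition is a stand-in.

```python
from collections import Counter


def aa_composition(genes):
    """Count each amino acid letter across all protein sequences."""
    composition = Counter()
    for sequence in genes:
        composition.update(sequence)
    return composition


print(aa_composition(["MKV", "MA"]))
```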

binette.cds.get_contig_cds_metadata(contig_to_genes: dict[int, Any | list[Any]], threads: int) dict[str, dict]

Calculate metadata for contigs in parallel, including CDS count, amino acid composition, and total amino acid length.

Parameters:
  • contig_to_genes – A dictionary mapping contig names to lists of protein sequences.

  • threads – Number of CPU threads to use.

Returns:

A tuple containing dictionaries for CDS count, amino acid composition, and total amino acid length.

binette.cds.get_contig_cds_metadata_flat(contig_to_genes: dict[str, list[str]]) tuple[dict[str, int], dict[str, collections.Counter], dict[str, int]]

Calculate metadata for contigs, including CDS count, amino acid composition, and total amino acid length.

Parameters:

contig_to_genes – A dictionary mapping contig names to lists of protein sequences.

Returns:

A tuple containing dictionaries for CDS count, amino acid composition, and total amino acid length.

binette.cds.get_contig_coding_len(genes: list[pyrodigal.lib.Gene], contig_length: int) int | None

Compute the coding length of a contig. Use a mask to account for overlapping genes.

Parameters:
  • genes – A list of gene annotations for the contig.

  • contig_length – The length of the contig in base pairs.

Returns:

The coding length as an integer, or None if contig_length is zero.
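
The masking idea can be illustrated as follows: mark every position covered by a gene once, then count the marked positions, so bases in overlapping genes are not counted twice. The sketch assumes genes are given as 1-based inclusive (begin, end) tuples, a simplified stand-in for pyrodigal Gene objects.

```python
def coding_length(gene_spans, contig_length):
    """Count contig positions covered by at least one gene span."""
    if contig_length == 0:
        return None
    mask = bytearray(contig_length)  # one flag per base, initially 0
    for begin, end in gene_spans:
        start = max(begin, 1) - 1  # convert to 0-based, clamp to the contig
        stop = min(end, contig_length)
        if stop > start:
            mask[start:stop] = b"\x01" * (stop - start)
    return sum(mask)


# Two genes overlapping on positions 6-10 of a 20 bp contig.
print(coding_length([(1, 10), (6, 15)], 20))
```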

binette.cds.get_contig_from_cds_name(cds_name: str) str

Extract the contig name from a CDS name.

Parameters:

cds_name (str) – The name of the CDS.

Returns:

The name of the contig.

Return type:

str
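
Prodigal-style gene predictors typically name each CDS as the contig name followed by an underscore and a gene number. Assuming that convention (it is not stated on this page), the contig name can be recovered by dropping the last underscore-separated field:

```python
def contig_from_cds_name(cds_name):
    """Strip the trailing '_<gene number>' from a CDS name (assumed convention)."""
    return cds_name.rsplit("_", 1)[0]


print(contig_from_cds_name("contig_12_3"))
```

Splitting from the right keeps underscores that belong to the contig name itself.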

binette.cds.is_nucleic_acid(sequence: str) bool

Determines whether the given sequence is a DNA or RNA sequence.

Parameters:

sequence – The sequence to check.

Returns:

True if the sequence is a DNA or RNA sequence, False otherwise.
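
One simple heuristic for this check is to test whether every character belongs to a nucleotide alphabet. The alphabet used below (A, C, G, T, U, N) and the treatment of empty input are assumptions of this sketch; the library may use different rules.

```python
NUCLEOTIDES = set("ACGTUN")  # assumed alphabet for this sketch


def looks_like_nucleic_acid(sequence):
    """True when the sequence is non-empty and uses only nucleotide letters."""
    return bool(sequence) and set(sequence.upper()) <= NUCLEOTIDES


print(looks_like_nucleic_acid("ATGCNN"), looks_like_nucleic_acid("MKVLA"))
```

Because A, C, G, and T are also valid amino acid letters, a short protein made only of those residues would be misclassified by this heuristic.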

binette.cds.parse_faa_file(faa_file: str) dict[str, list[str]]

Parse a FASTA file containing protein sequences and organize them by contig.

Parameters:

faa_file – Path to the input FASTA file.

Returns:

A dictionary mapping contig names to lists of protein sequences.

Raises:

ValueError – If the file contains nucleotide sequences instead of protein sequences.

binette.cds.predict(contigs_iterator: Iterator, outfaa: str, threads: int = 1) tuple[dict[str, list[str]], dict[str, int | None]]

Predict open reading frames with Pyrodigal.

Parameters:
  • contigs_iterator – An iterator of contig sequences.

  • outfaa – The output file path for predicted protein sequences (in FASTA format).

  • threads – Number of CPU threads to use (default is 1).

Returns:

A dictionary mapping contig names to predicted genes and a dictionary mapping contig names to coding lengths.

binette.cds.predict_genes(find_genes, name, seq) tuple[str, pyrodigal.lib.Genes]

binette.cds.write_faa(outfaa: str, contig_to_genes: list[tuple[str, pyrodigal.lib.Genes]]) None

Write predicted protein sequences to a FASTA file.

Parameters:
  • outfaa – The output file path for predicted protein sequences (in FASTA format).

  • contig_to_genes – A list of (contig name, predicted genes) tuples.

binette.contig_manager module

binette.contig_manager.apply_contig_index(contig_to_index: dict[str, int], contig_to_info: dict[str, Any]) dict[int, Any | collections.abc.Iterable[Any]]

Apply the contig index mapping to the contig info dictionary.

Parameters:
  • contig_to_index – A dictionary mapping contig names to their corresponding index.

  • contig_to_info – A dictionary mapping contig names to their associated information.

Returns:

A dictionary mapping contig indices to their associated information.

binette.contig_manager.make_contig_index(contigs: set[str]) dict[str, int]

Create an index mapping for contigs.

Parameters:

contigs – A set of contig names.

Returns:

A dictionary mapping each contig name to its integer index (contig_to_index).
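
The two functions in this module can be pictured together: one assigns an integer to every contig name, the other re-keys name-based data by those integers. In the sketch below, sorting the names to get a reproducible order is an assumption, and both function names are stand-ins.

```python
def make_index(contigs):
    """Assign consecutive integers to contig names (sorted order is assumed)."""
    return {name: i for i, name in enumerate(sorted(contigs))}


def apply_index(contig_to_index, contig_to_info):
    """Re-key a name -> info mapping by contig index."""
    return {contig_to_index[name]: info for name, info in contig_to_info.items()}


index = make_index({"c2", "c1", "c3"})
print(apply_index(index, {"c1": 1200, "c3": 800}))
```

Integer IDs are what allow contigs to be stored in compact bitmaps instead of sets of strings.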

binette.contig_manager.parse_fasta_file(fasta_file: str, index_file: str) Fasta

Parse a FASTA file and return a pyfastx.Fasta object.

Parameters:
  • fasta_file – The path to the FASTA file.

  • index_file – The path to the index file used by pyfastx.

Returns:

A pyfastx.Fasta object representing the parsed FASTA file.

binette.diamond module

binette.io_manager module

binette.io_manager.check_contig_consistency(contigs_from_assembly: Iterable[str], contigs_from_elsewhere: Iterable[str], assembly_file: str, elsewhere_file: str)

Check the consistency of contig names between different sources.

Parameters:
  • contigs_from_assembly – List of contig names from the assembly file.

  • contigs_from_elsewhere – List of contig names from an external source.

  • assembly_file – Path to the assembly file.

  • elsewhere_file – Path to the file from an external source.

Raises:

AssertionError – If inconsistencies in contig names are found.

binette.io_manager.check_resume_file(faa_file: Path, diamond_result_file: Path) None

Check the existence of files required for resuming the process.

Parameters:
  • faa_file – Path to the protein file.

  • diamond_result_file – Path to the Diamond result file.

Raises:

FileNotFoundError – If the required files don’t exist for resuming.

binette.io_manager.get_paths_common_prefix_suffix(paths: list[pathlib.Path]) tuple[list[str], list[str], list[str]]

Determine the common prefix parts, suffix parts, and common extensions of the last part of a list of pathlib.Path objects.

Parameters:

paths – List of pathlib.Path objects.

Returns:

A tuple containing three lists:
  • The common prefix parts.

  • The common suffix parts.

  • The common extensions of the last part of the paths.

binette.io_manager.infer_bin_set_names_from_input_paths(input_bins: list[pathlib.Path]) dict[str, pathlib.Path]

Infer bin set names from a list of bin input directories or files.

Parameters:

input_bins – List of input bin directories or files.

Returns:

Dictionary mapping inferred bin names to their corresponding directories or files.

binette.io_manager.write_bin_info(bins: Iterable[Bin], output: Path, add_contigs: bool = False)

Write bin information to a TSV file.

Parameters:
  • bins – List of Bin objects.

  • output – Output file path for writing the TSV.

  • add_contigs – Flag indicating whether to include contig information.

binette.io_manager.write_bins_fasta(selected_bins: list[binette.bin_manager.Bin], contigs_fasta: Path, outdir: Path, contigs_names: list[str], max_buffer_size: int = 50000000)

Write selected bins’ contigs to separate FASTA files using pyfastx.Fastx (no index). Buffer entries by total character size, not just number of sequences.

Parameters:
  • selected_bins – List of Bin objects with .id and .contigs.

  • contigs_fasta – Path to the input FASTA file.

  • outdir – Directory to save bin FASTA files.

  • contigs_names – List of contig names where index corresponds to contig ID.

  • max_buffer_size – Maximum total character size to buffer before flushing.

binette.io_manager.write_contig2bin_table(selected_bins: list[binette.bin_manager.Bin], output_file: Path, contigs_names: list[str])

Write a simple TSV file mapping contig IDs to bin IDs.

Parameters:
  • selected_bins – List of selected Bin objects.

  • output_file – Path to the output TSV file.

  • contigs_names – List of contig names where index corresponds to contig ID.
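
The role of contigs_names as an index-to-name lookup is illustrated below. The sketch writes to any text handle and represents each selected bin as a hypothetical (bin_id, contig_ids) pair; the column order (contig first, bin second) follows the description above, while the absence of a header row is an assumption.

```python
import csv
import io


def write_contig2bin(handle, selected_bins, contigs_names):
    """Write one 'contig<TAB>bin' row per contig of each selected bin."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for bin_id, contig_ids in selected_bins:
        for contig_id in sorted(contig_ids):
            writer.writerow([contigs_names[contig_id], bin_id])


buffer = io.StringIO()
write_contig2bin(buffer, [("bin1", {0, 2}), ("bin2", {1})], ["c1", "c2", "c3"])
print(buffer.getvalue())
```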

binette.io_manager.write_original_bin_metrics(original_bins: list[binette.bin_manager.Bin], original_bin_report_dir: Path)

Write metrics of original input bins to a specified directory.

This function writes the metrics for each bin set to a TSV file in the specified directory. Each bin set will have its own TSV file named according to its set name.

Parameters:
  • original_bins – A list of the original input bins.

  • original_bin_report_dir – The directory path (Path) where the bin metrics will be saved.

binette.main module