binette package
Submodules
binette.bin_manager module
- class binette.bin_manager.Bin(contigs: BitMap, origin: set[str] | None = None, name: str | None = None, is_original: bool = False)
Bases: object
- CHECKM2_MODELS = ('Neural Network (Specific Model)', 'Gradient Boost (General Model)')
- add_N50(n50: int) None
Add the N50 attribute to the Bin object.
- Parameters:
n50 – The N50 value to add.
- Returns:
None
- add_coding_density(contig_to_coding_length: dict[int, int]) float | None
Calculate the coding density of the bin.
- Parameters:
contig_to_coding_length – A dictionary mapping contig IDs to their total coding lengths.
- Returns:
The coding density of the bin, or None if the length is not set or is zero.
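The calculation described above can be sketched as follows. This is a minimal illustration, assuming coding density is the summed coding length of the bin's contigs divided by the bin length; the standalone function `coding_density` and the plain set of contig IDs are hypothetical stand-ins for the Bin internals.

```python
def coding_density(contigs, contig_to_coding_length, length):
    """Return summed coding length / bin length, or None if length is unset or zero."""
    if not length:
        return None
    # Contigs missing from the mapping are assumed to contribute no coding bases.
    total_coding = sum(contig_to_coding_length.get(c, 0) for c in contigs)
    return total_coding / length


# Contigs 0 and 2 carry 900 + 600 coding bases in a 2000 bp bin.
density = coding_density({0, 2}, {0: 900, 1: 500, 2: 600}, 2000)
```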
- add_length(length: int) None
Add the length attribute to the Bin object if the provided length is a positive integer.
- Parameters:
length – The length value to add.
- Returns:
None
- add_model(model: str) None
Add a CheckM2 model to the bin.
- Parameters:
model – The model name to add.
- Raises:
ValueError – If the model name is not recognized.
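A minimal sketch of the validation described above, assuming the model name is checked for membership in CHECKM2_MODELS; the helper name `validate_model` and the error message are hypothetical.

```python
CHECKM2_MODELS = ("Neural Network (Specific Model)", "Gradient Boost (General Model)")


def validate_model(model):
    """Return the model name unchanged, or raise ValueError if it is not recognized."""
    if model not in CHECKM2_MODELS:
        raise ValueError(f"Unrecognized CheckM2 model: {model!r}")
    return model


accepted = validate_model("Gradient Boost (General Model)")
```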
- add_quality(completeness: float, contamination: float, contamination_weight: float) None
Set the quality attributes of the bin.
- Parameters:
completeness – The completeness value.
contamination – The contamination value.
contamination_weight – The weight assigned to contamination in the score calculation.
- property checkm2_model
Get the CheckM2 model for the bin.
- contig_difference(*others: Bin) BitMap
Compute the difference between the bin and other bins.
- Parameters:
others – Other bins to compute the difference with.
- contig_intersection(*others: Bin) BitMap
Compute the intersection of the bin with other bins.
- Parameters:
others – Other bins to compute the intersection with.
- contig_union(*others: Bin) BitMap
Compute the union of the bin with other bins.
- Parameters:
others – Other bins to compute the union with.
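The three set operations above can be illustrated with built-in Python sets standing in for BitMap objects. This sketch assumes BitMap behaves like a set of integer contig IDs; the standalone functions are hypothetical and take contig sets directly rather than Bin objects.

```python
def contig_difference(contigs, *others):
    """Contigs present in the first set and in none of the others."""
    return contigs.difference(*others)


def contig_intersection(contigs, *others):
    """Contigs shared by the first set and every other set."""
    return contigs.intersection(*others)


def contig_union(contigs, *others):
    """Contigs present in at least one of the sets."""
    return contigs.union(*others)


a, b, c = {1, 2, 3, 4}, {2, 3, 5}, {3, 4, 6}
diff = contig_difference(a, b, c)
inter = contig_intersection(a, b, c)
union = contig_union(a, b, c)
```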
- property contigs_key
Serialize the contigs for easier comparison.
- is_high_quality(min_completeness: float, max_contamination: float) bool
Determine if a bin is considered high quality based on completeness and contamination thresholds.
- Parameters:
min_completeness – The minimum completeness required for a bin to be considered high quality.
max_contamination – The maximum allowed contamination for a bin to be considered high quality.
- Raises:
ValueError – If either completeness or contamination has not been set (is None).
- Returns:
True if the bin meets the high quality criteria; False otherwise.
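The threshold check above can be sketched as a standalone function. The inclusive comparisons (>= and <=) are an assumption, since the description does not state whether the thresholds are strict; the error message is also hypothetical.

```python
def is_high_quality(completeness, contamination, min_completeness, max_contamination):
    """Apply completeness and contamination thresholds to a bin's quality values."""
    if completeness is None or contamination is None:
        raise ValueError("Completeness and contamination must be set first.")
    # Assumption: both thresholds are inclusive.
    return completeness >= min_completeness and contamination <= max_contamination


good = is_high_quality(95.0, 2.0, min_completeness=90.0, max_contamination=5.0)
bad = is_high_quality(95.0, 8.0, min_completeness=90.0, max_contamination=5.0)
```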
- binette.bin_manager.build_contig_index(bins_dict: dict[bytes, binette.bin_manager.Bin]) dict[int, set[bytes]]
Build an inverted index: contig_id -> set of contigs_key of bins containing it.
- Parameters:
bins_dict – Mapping from contigs_key -> Bin.
- Returns:
Inverted index (contig_id -> set of contigs_key).
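A minimal sketch of such an inverted index. For illustration, each bin is represented only by its set of contig IDs (a hypothetical simplification of the Bin object), and the function name `build_index` is not part of the package.

```python
from collections import defaultdict


def build_index(key_to_contigs):
    """Map each contig ID to the set of bin keys whose bins contain it."""
    index = defaultdict(set)
    for key, contigs in key_to_contigs.items():
        for contig_id in contigs:
            index[contig_id].add(key)
    return dict(index)


# Two toy bins sharing contig 1.
contig_index = build_index({b"A": {0, 1}, b"B": {1, 2}})
```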
- binette.bin_manager.create_intermediate_bins(contig_key_to_initial_bin: dict[bytes, binette.bin_manager.Bin], contig_lengths: ndarray, min_comp: float, max_conta: float, min_len: int, max_len: int, disable_progress_bar: bool = False) dict[bytes, binette.bin_manager.Bin]
Creates intermediate bins from a dictionary of initial bins.
- Parameters:
contig_key_to_initial_bin – Mapping from contigs_key to the initial Bin objects.
- Returns:
A dictionary mapping contigs_key to the intermediate bins created from intersections, differences, and unions.
- binette.bin_manager.from_bins_to_bin_graph(bins: Iterable[Bin]) Graph
Creates a graph of overlapping bins from a set of bins.
- Parameters:
bins – a set of bins
- Returns:
A networkx Graph representing the bin graph of overlapping bins.
- binette.bin_manager.get_all_possible_combinations(clique: list) Iterable[tuple]
Generates all possible combinations of elements from a given clique.
- Parameters:
clique – An iterable representing a clique.
- Returns:
An iterable of tuples representing all possible combinations of elements from the clique.
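One way to enumerate such combinations uses itertools. The sketch below assumes that combinations of every size from 2 up to the full clique are wanted; the description does not state the size range, so that bound and the function name `all_combinations` are assumptions.

```python
from itertools import chain, combinations


def all_combinations(clique):
    """Yield every combination of at least two elements, smallest sizes first."""
    sizes = range(2, len(clique) + 1)
    return chain.from_iterable(combinations(clique, r) for r in sizes)


combos = list(all_combinations(["a", "b", "c"]))
```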
- binette.bin_manager.get_bins_from_contig2bin_table(contig2bin_table: Path, set_name: str) list[dict[str, Any]]
Retrieves bin information from a contig-to-bin table.
- Parameters:
contig2bin_table – The path to the contig-to-bin table.
set_name – The name of the set the bins belong to.
- Returns:
A list of dictionaries, one per bin, created from the contig-to-bin table.
- binette.bin_manager.get_bins_from_directory(bin_dir: Path, set_name: str, fasta_extensions: set[str]) list[binette.bin_manager.Bin]
Retrieves a list of Bin objects from a directory containing bin FASTA files.
- Parameters:
bin_dir – The directory path containing bin FASTA files.
set_name – The name of the set the bins belong to.
fasta_extensions – Possible FASTA extensions to look for in the bin directory.
- Returns:
A list of Bin objects created from the bin FASTA files.
- binette.bin_manager.get_contigs_in_bin_sets(bin_set_name_to_bins: dict[str, set[binette.bin_manager.Bin]]) list[str]
Processes bin sets to check for duplicated contigs and logs detailed information about each bin set.
- Parameters:
bin_set_name_to_bins – A dictionary where keys are bin set names and values are sets of Bin objects.
- Returns:
The contig names found in the bin sets.
- binette.bin_manager.make_bins_from_bins_info(bin_set_name_to_bins_info: dict[str, list[dict[str, Any]]], contig_to_index: dict[str, int], are_original_bins: bool)
Create Bin objects from the provided bin information.
- Parameters:
bin_set_name_to_bins_info – A dictionary mapping bin set names to their bin information.
contig_to_index – A mapping of contig names to their indices.
are_original_bins – A boolean indicating whether the bins are original.
- Returns:
A dictionary mapping serialized contig bitmaps to their corresponding Bin objects.
- binette.bin_manager.parse_bin_directories(bin_name_to_bin_dir: dict[str, pathlib.Path], fasta_extensions: set[str]) dict[str, list[dict[str, Any]]]
Parses multiple bin directories and returns a dictionary mapping bin names to a list of bin information dictionaries.
- Parameters:
bin_name_to_bin_dir – A dictionary mapping bin names to their respective bin directories.
fasta_extensions – Possible FASTA extensions to look for in the bin directories.
- Returns:
A dictionary mapping bin names to a list of dictionaries created from the bin directories.
- binette.bin_manager.parse_contig2bin_tables(bin_name_to_bin_tables: dict[str, pathlib.Path]) dict[str, list[dict[str, Any]]]
Parses multiple contig-to-bin tables and returns a dictionary mapping bin names to a set of unique Bin objects.
Logs a warning if duplicate bins are detected within a bin set.
- Parameters:
bin_name_to_bin_tables – A dictionary where keys are bin set names and values are file paths or identifiers for contig-to-bin tables. Each table is parsed to extract Bin objects.
- Returns:
A dictionary where keys are bin set names and values are sets of Bin objects. Duplicates are removed based on contig composition.
- binette.bin_manager.remove_bins_from_index(bin_keys: set[bytes], bins_dict: dict[bytes, binette.bin_manager.Bin], contig_to_bins: dict[int, set[bytes]]) None
Remove a set of bins from the inverted index.
- Parameters:
bin_keys – The contig_keys of bins to remove.
bins_dict – Mapping from contig_key -> Bin.
contig_to_bins – Inverted index (contig_id -> set of contig_keys).
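The removal step can be sketched against the same kind of inverted index. Bins are again represented by plain sets of contig IDs (a hypothetical simplification), and dropping index entries that become empty is an assumption rather than documented behavior.

```python
def remove_from_index(bin_keys, key_to_contigs, contig_to_bins):
    """Remove the given bin keys from the inverted index, in place."""
    for key in bin_keys:
        for contig_id in key_to_contigs[key]:
            keys = contig_to_bins.get(contig_id)
            if keys is None:
                continue
            keys.discard(key)
            # Assumption: contigs with no remaining bins are dropped from the index.
            if not keys:
                del contig_to_bins[contig_id]


bins = {b"A": {0, 1}, b"B": {1, 2}}
index = {0: {b"A"}, 1: {b"A", b"B"}, 2: {b"B"}}
remove_from_index({b"A"}, bins, index)
```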
- binette.bin_manager.select_best_bins(bins_dict: dict[bytes, binette.bin_manager.Bin], min_completeness: float, max_contamination: float, prefix: str = 'binette') list[binette.bin_manager.Bin]
Select the best non-overlapping bins based on score, N50, and ID.
- Parameters:
bins_dict – Mapping from contig_key -> Bin.
min_completeness – Minimum completeness threshold for a bin to be considered.
max_contamination – Maximum contamination threshold for a bin to be considered.
prefix – Prefix to use for naming selected bins.
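A non-overlapping selection of this kind can be sketched as a greedy pass over ranked bins. Everything here is an assumption made for illustration: the `ToyBin` type, the ranking direction (higher score, then higher N50, then lower ID), and the greedy strategy itself. The quality thresholds and the naming prefix are omitted.

```python
from typing import NamedTuple


class ToyBin(NamedTuple):
    bin_id: int
    contigs: frozenset
    score: float
    n50: int


def select_non_overlapping(bins):
    """Greedily keep the best-ranked bins whose contigs are not already used."""
    ranked = sorted(bins, key=lambda b: (-b.score, -b.n50, b.bin_id))
    used, selected = set(), []
    for b in ranked:
        if used.isdisjoint(b.contigs):
            selected.append(b)
            used |= b.contigs
    return selected


toy_bins = [
    ToyBin(1, frozenset({1, 2}), 90.0, 5000),
    ToyBin(2, frozenset({2, 3}), 80.0, 9000),  # overlaps bin 1 on contig 2
    ToyBin(3, frozenset({4}), 70.0, 1000),
]
selected_ids = [b.bin_id for b in select_non_overlapping(toy_bins)]
```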
- binette.bin_manager.sum_contig_lengths(bm_contigs: BitMap, contig_lengths: ndarray, cache: dict[bytes, int] | None = None, key: bytes | None = None)
binette.bin_quality module
binette.cds module
- binette.cds.filter_faa_file(contigs_to_keep: set[str], input_faa_file: Path, filtered_faa_file: Path)
Filters a FASTA file containing protein sequences to only include sequences from contigs present in the provided set of contigs (contigs_to_keep).
This function processes the input FASTA file, identifies protein sequences originating from contigs listed in contigs_to_keep, and writes the filtered sequences to a new FASTA file. The output file supports optional .gz compression.
- Parameters:
contigs_to_keep – A set of contig names to retain in the output FASTA file.
input_faa_file – Path to the input FASTA file containing protein sequences.
filtered_faa_file – Path to the output FASTA file for filtered sequences. If the filename ends with .gz, the output will be compressed.
- binette.cds.get_aa_composition(genes: list[str]) Counter
Calculate the amino acid composition of a list of protein sequences.
- Parameters:
genes – A list of protein sequences.
- Returns:
A Counter object representing the amino acid composition.
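A minimal sketch of an amino acid composition count, assuming each protein sequence is a plain string of one-letter residue codes; the function name `aa_composition` is hypothetical.

```python
from collections import Counter


def aa_composition(genes):
    """Count every residue across all protein sequences."""
    composition = Counter()
    for sequence in genes:
        composition.update(sequence)
    return composition


composition = aa_composition(["MKK", "MA"])
```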
- binette.cds.get_contig_cds_metadata(contig_to_genes: dict[int, Any | list[Any]], threads: int) dict[str, dict]
Calculate metadata for contigs in parallel, including CDS count, amino acid composition, and total amino acid length.
- Parameters:
contig_to_genes – A dictionary mapping contig names to lists of protein sequences.
threads – Number of CPU threads to use.
- Returns:
A tuple containing dictionaries for CDS count, amino acid composition, and total amino acid length.
- binette.cds.get_contig_cds_metadata_flat(contig_to_genes: dict[str, list[str]]) tuple[dict[str, int], dict[str, collections.Counter], dict[str, int]]
Calculate metadata for contigs, including CDS count, amino acid composition, and total amino acid length.
- Parameters:
contig_to_genes – A dictionary mapping contig names to lists of protein sequences.
- Returns:
A tuple containing dictionaries for CDS count, amino acid composition, and total amino acid length.
- binette.cds.get_contig_coding_len(genes: list[pyrodigal.lib.Gene], contig_length: int) int | None
Compute the coding length of a contig. Use a mask to account for overlapping genes.
- Parameters:
genes – A list of gene annotations for the contig.
contig_length – The length of the contig in base pairs.
- Returns:
The coding length as an integer, or None if contig_length is zero.
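The masking idea can be sketched as follows: each position covered by at least one gene is marked once, so overlapping genes are not counted twice. Genes are represented here by hypothetical (begin, end) tuples with 1-based inclusive coordinates, which is an assumption for illustration rather than the actual gene objects.

```python
def coding_length(gene_intervals, contig_length):
    """Count positions covered by at least one gene, or None for an empty contig."""
    if contig_length == 0:
        return None
    mask = bytearray(contig_length)  # one flag per base, initially 0
    for begin, end in gene_intervals:
        start = max(begin - 1, 0)       # convert to 0-based, clamp to contig
        stop = min(end, contig_length)  # exclusive end, clamp to contig
        if stop > start:
            mask[start:stop] = b"\x01" * (stop - start)
    return sum(mask)


# Two genes overlapping on positions 51-100 of a 200 bp contig.
covered = coding_length([(1, 100), (51, 150)], 200)
```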
- binette.cds.get_contig_from_cds_name(cds_name: str) str
Extract the contig name from a CDS name.
- Parameters:
cds_name (str) – The name of the CDS.
- Returns:
The name of the contig.
- Return type:
str
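A minimal sketch, assuming CDS names follow the Prodigal-style convention of the contig name followed by an underscore and a gene number; that naming scheme and the function name `contig_from_cds_name` are assumptions.

```python
def contig_from_cds_name(cds_name):
    """Strip the trailing '_<gene number>' suffix from a CDS name."""
    return cds_name.rsplit("_", 1)[0]


contig = contig_from_cds_name("contig_12_3")
```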
- binette.cds.is_nucleic_acid(sequence: str) bool
Determines whether the given sequence is a DNA or RNA sequence.
- Parameters:
sequence – The sequence to check.
- Returns:
True if the sequence is a DNA or RNA sequence, False otherwise.
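One simple way to make this check is to test whether every character belongs to a nucleotide alphabet. The alphabet used below (A, C, G, T, U, N) and the function name `looks_like_nucleic_acid` are assumptions; the actual set of accepted characters is not documented here.

```python
NUCLEOTIDE_ALPHABET = set("ACGTUN")  # assumed alphabet, including the ambiguity code N


def looks_like_nucleic_acid(sequence):
    """Return True if every character is a nucleotide symbol."""
    return set(sequence.upper()) <= NUCLEOTIDE_ALPHABET


dna_result = looks_like_nucleic_acid("acgtn")
protein_result = looks_like_nucleic_acid("MKVLA")
```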
- binette.cds.parse_faa_file(faa_file: str) dict[str, list[str]]
Parse a FASTA file containing protein sequences and organize them by contig.
- Parameters:
faa_file – Path to the input FASTA file.
- Returns:
A dictionary mapping contig names to lists of protein sequences.
- Raises:
ValueError – If the file contains nucleotide sequences instead of protein sequences.
- binette.cds.predict(contigs_iterator: Iterator, outfaa: str, threads: int = 1) tuple[dict[str, list[str]], dict[str, int | None]]
Predict open reading frames with Pyrodigal.
- Parameters:
contigs_iterator – An iterator of contig sequences.
outfaa – The output file path for predicted protein sequences (in FASTA format).
threads – Number of CPU threads to use (default is 1).
- Returns:
A dictionary mapping contig names to predicted genes and a dictionary mapping contig names to coding lengths.
- binette.cds.predict_genes(find_genes, name, seq) tuple[str, pyrodigal.lib.Genes]
- binette.cds.write_faa(outfaa: str, contig_to_genes: list[tuple[str, pyrodigal.lib.Genes]]) None
Write predicted protein sequences to a FASTA file.
- Parameters:
outfaa – The output file path for predicted protein sequences (in FASTA format).
contig_to_genes – A list of (contig name, predicted genes) tuples.
binette.contig_manager module
- binette.contig_manager.apply_contig_index(contig_to_index: dict[str, int], contig_to_info: dict[str, Any]) dict[int, Any | collections.abc.Iterable[Any]]
Apply the contig index mapping to the contig info dictionary.
- Parameters:
contig_to_index – A dictionary mapping contig names to their corresponding index.
contig_to_info – A dictionary mapping contig names to their associated information.
- Returns:
A dictionary mapping contig indices to their associated information.
- binette.contig_manager.make_contig_index(contigs: set[str]) dict[str, int]
Create an index mapping for contigs.
- Parameters:
contigs – A set of contig names.
- Returns:
A dictionary mapping contig names to their indices.
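The two index helpers in this module can be sketched together. Sorting the names before numbering them is an assumption made so the example is deterministic; the function names `make_index` and `apply_index` are hypothetical.

```python
def make_index(contigs):
    """Assign a consecutive integer index to each contig name."""
    # Assumption: names are sorted so the numbering is reproducible.
    return {name: i for i, name in enumerate(sorted(contigs))}


def apply_index(contig_to_index, contig_to_info):
    """Re-key a name-based mapping by contig index."""
    return {contig_to_index[name]: info for name, info in contig_to_info.items()}


name_to_index = make_index({"c2", "c1"})
index_to_length = apply_index(name_to_index, {"c1": 1500, "c2": 800})
```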
- binette.contig_manager.parse_fasta_file(fasta_file: str, index_file: str) Fasta
Parse a FASTA file and return a pyfastx.Fasta object.
- Parameters:
fasta_file – The path to the FASTA file.
index_file – The path to the index file.
- Returns:
A pyfastx.Fasta object representing the parsed FASTA file.
binette.diamond module
binette.io_manager module
- binette.io_manager.check_contig_consistency(contigs_from_assembly: Iterable[str], contigs_from_elsewhere: Iterable[str], assembly_file: str, elsewhere_file: str)
Check the consistency of contig names between different sources.
- Parameters:
contigs_from_assembly – List of contig names from the assembly file.
contigs_from_elsewhere – List of contig names from an external source.
assembly_file – Path to the assembly file.
elsewhere_file – Path to the file from an external source.
- Raises:
AssertionError – If inconsistencies in contig names are found.
- binette.io_manager.check_resume_file(faa_file: Path, diamond_result_file: Path) None
Check the existence of files required for resuming the process.
- Parameters:
faa_file – Path to the protein file.
diamond_result_file – Path to the Diamond result file.
- Raises:
FileNotFoundError – If the required files don’t exist for resuming.
- binette.io_manager.get_paths_common_prefix_suffix(paths: list[pathlib.Path]) tuple[list[str], list[str], list[str]]
Determine the common prefix parts, suffix parts, and common extensions of the last part of a list of pathlib.Path objects.
- Parameters:
paths – List of pathlib.Path objects.
- Returns:
A tuple of three lists: the common prefix parts, the common suffix parts, and the common extensions of the last part of the paths.
- binette.io_manager.infer_bin_set_names_from_input_paths(input_bins: list[pathlib.Path]) dict[str, pathlib.Path]
Infer bin set names from a list of bin input directories or files.
- Parameters:
input_bins – List of input bin directories or files.
- Returns:
Dictionary mapping inferred bin names to their corresponding directories or files.
- binette.io_manager.write_bin_info(bins: Iterable[Bin], output: Path, add_contigs: bool = False)
Write bin information to a TSV file.
- Parameters:
bins – List of Bin objects.
output – Output file path for writing the TSV.
add_contigs – Flag indicating whether to include contig information.
- binette.io_manager.write_bins_fasta(selected_bins: list[binette.bin_manager.Bin], contigs_fasta: Path, outdir: Path, contigs_names: list[str], max_buffer_size: int = 50000000)
Write selected bins’ contigs to separate FASTA files using pyfastx.Fastx (no index). Buffer entries by total character size, not just number of sequences.
- Parameters:
selected_bins – List of Bin objects with .id and .contigs.
contigs_fasta – Path to the input FASTA file.
outdir – Directory to save bin FASTA files.
contigs_names – List of contig names where index corresponds to contig ID.
max_buffer_size – Maximum total character size to buffer before flushing.
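Buffering by total character size rather than by record count can be sketched with a generator that releases a batch once the accumulated size reaches the limit. The function name `buffered_batches`, the (name, sequence) record shape, and counting name plus sequence characters are assumptions for illustration.

```python
def buffered_batches(records, max_buffer_size):
    """Group (name, sequence) records into batches bounded by character count."""
    buffer, size = [], 0
    for name, sequence in records:
        buffer.append((name, sequence))
        size += len(name) + len(sequence)
        if size >= max_buffer_size:
            yield buffer  # flush once the size limit is reached
            buffer, size = [], 0
    if buffer:
        yield buffer  # flush whatever remains


records = [("a", "AAAA"), ("b", "CC"), ("c", "GGGGGG")]
batches = list(buffered_batches(records, max_buffer_size=6))
```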
- binette.io_manager.write_contig2bin_table(selected_bins: list[binette.bin_manager.Bin], output_file: Path, contigs_names: list[str])
Write a simple TSV file mapping contig IDs to bin IDs.
- Parameters:
selected_bins – List of selected Bin objects.
output_file – Path to the output TSV file.
contigs_names – List of contig names where index corresponds to contig ID.
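A minimal sketch of such a table, written to an in-memory buffer. Bins are represented by hypothetical (bin ID, contig ID set) pairs; the column order (contig, then bin) and the absence of a header row are assumptions.

```python
import csv
import io


def write_contig2bin(handle, selected_bins, contigs_names):
    """Write one 'contig<TAB>bin' row per contig of each selected bin."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for bin_id, contig_ids in selected_bins:
        for contig_id in sorted(contig_ids):
            # contigs_names is indexed by contig ID.
            writer.writerow([contigs_names[contig_id], bin_id])


buffer = io.StringIO()
write_contig2bin(buffer, [("bin_1", {0, 2}), ("bin_2", {1})], ["c0", "c1", "c2"])
table_text = buffer.getvalue()
```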
- binette.io_manager.write_original_bin_metrics(original_bins: list[binette.bin_manager.Bin], original_bin_report_dir: Path)
Write metrics of original input bins to a specified directory.
This function writes the metrics for each bin set to a TSV file in the specified directory. Each bin set will have its own TSV file named according to its set name.
- Parameters:
original_bins – A list of input bins.
original_bin_report_dir – The directory path (Path) where the bin metrics will be saved.