fiber_views package

Submodules

fiber_views.fiber_views module

Main module.

fiber_views.fiber_views.bed_to_anno_df(bed_df, entry_name_type='gene_id', aligned_pos='start')

Convert a data frame in BED format to another data frame with a different layout.

Parameters:

bed_df (pandas.DataFrame) – Data frame in BED format, with columns ‘chrom’, ‘start’, ‘end’, ‘strand’, ‘name’, and ‘score’.
entry_name_type (str, optional) – Column name for the unique identifier for each feature. The default is “gene_id”.
aligned_pos (str, optional) – The position in the bed entries to use as the ‘pos’ or aligned base. Can be ‘start’, ‘end’, or ‘center’ (default is ‘start’).

Returns:

anno_df – Data frame with columns ‘seqid’, ‘pos’, ‘strand’, [entry_name_type], and ‘score’.

Return type:

pandas.DataFrame
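The column mapping described above can be sketched as follows. This is not the package's implementation: `bed_to_anno_sketch` is a hypothetical name, the 'center' position is assumed to be the integer midpoint of start and end, and any strand-aware handling the real function may perform is ignored.

```python
import pandas as pd

def bed_to_anno_sketch(bed_df, entry_name_type="gene_id", aligned_pos="start"):
    """Hypothetical stand-in that mirrors the documented column layout."""
    if aligned_pos == "start":
        pos = bed_df["start"]
    elif aligned_pos == "end":
        pos = bed_df["end"]
    elif aligned_pos == "center":
        # Assumption: the center is the integer midpoint of the interval.
        pos = (bed_df["start"] + bed_df["end"]) // 2
    else:
        raise ValueError("aligned_pos must be 'start', 'end', or 'center'")
    return pd.DataFrame({
        "seqid": bed_df["chrom"],
        "pos": pos,
        "strand": bed_df["strand"],
        entry_name_type: bed_df["name"],  # 'name' becomes the identifier column
        "score": bed_df["score"],
    })

bed = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [100, 500],
    "end": [200, 901],
    "strand": ["+", "-"],
    "name": ["geneA", "geneB"],
    "score": [0, 0],
})
anno = bed_to_anno_sketch(bed, aligned_pos="center")
```

The resulting frame has the 'seqid', 'pos', and 'strand' columns that build_multi_fview expects in sites_df.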

fiber_views.fiber_views.build_multi_fview(bam_file, sites_df, mod_defs, region_defs, window=(-1000, 1000), fully_span=True, region_interval=30, filter_args={'cutoff': 2, 'dist': 3000}, tags=['np', 'ec', 'rq'], max_reads=300)

Build an AnnData object centered at multiple genomic sites.

Parameters:

bam_file (str) – Path to the BAM file containing Fiber-seq reads.
sites_df (pandas.DataFrame) – Genomic positions to center on, as a data frame with columns ‘seqid’, ‘pos’, and ‘strand’.
mod_defs (list of dict) – List of modification definitions, each dict describing a modification to extract.
region_defs (list of dict) – List of region definitions, each dict describing a region type to extract.
window (tuple of int, optional) – window (upstream, downstream) around the site to extract (default is (-1000, 1000)).
fully_span (bool, optional) – If True, only include reads fully spanning the window (default is True).
region_interval (int, optional) – Interval size used for region feature binning (default is 30).
filter_args (dict, optional) – Arguments for filtering reads by methylation endpoints, should include ‘dist’ and ‘cutoff’ (default is {‘dist’: 3000, ‘cutoff’: 2}).
tags (list of str, optional) – List of BAM tags to extract for annotation (default is [‘np’, ‘ec’, ‘rq’]).
max_reads (int, optional) – Maximum number of reads to extract (default is 300).

Returns:

Annotated data matrix containing read, sequence, modification, and region layers for the sites. Returns None if no reads pass filtering.

Return type:

AnnData or None

fiber_views.fiber_views.build_single_fview(bam_file, site_info, mod_defs='PacBio_Fiberseq', region_defs='FIRE', window=(-1000, 1000), fully_span=True, region_interval=30, filter_args={'cutoff': 2, 'dist': 3000}, tags=['np', 'ec', 'rq'], max_reads=300)

Build an AnnData object centered at a single genomic site.

Parameters:

bam_file (str) – Path to the BAM file containing Fiber-seq reads.
site_info (dict or pandas.Series) – Genomic position to center on, as a dict or series with keys ‘seqid’, ‘pos’, and ‘strand’.
mod_defs (str or list of dict) – Either a string (default is ‘PacBio_Fiberseq’) or a list of modification definitions, each dict describing a modification to extract.
region_defs (str or list of dict) – Either a string (default is ‘FIRE’) or a list of region definitions, each dict describing a region type to extract.
window (tuple of int, optional) – window (upstream, downstream) around the site to extract (default is (-1000, 1000)).
fully_span (bool, optional) – If True, only include reads fully spanning the window (default is True).
region_interval (int, optional) – Interval size used for region feature binning (default is 30).
filter_args (dict, optional) – Arguments for filtering reads by methylation endpoints, should include ‘dist’ and ‘cutoff’ (default is {‘dist’: 3000, ‘cutoff’: 2}).
tags (list of str, optional) – List of BAM tags to extract for annotation (default is [‘np’, ‘ec’, ‘rq’]).
max_reads (int, optional) – Maximum number of reads to extract (default is 300).

Returns:

Annotated data matrix containing read, sequence, modification, and region layers for the site. Returns None if no reads pass filtering.

Return type:

AnnData or None
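A call to build_single_fview might be wrapped as in the sketch below. The helper names `make_site_info` and `view_for_site` are hypothetical, and the window and max_reads values are arbitrary illustrations. Only the documented signature and the documented None return are relied on.

```python
def make_site_info(seqid, pos, strand="+"):
    """Build a mapping with the keys site_info is documented to require."""
    return {"seqid": seqid, "pos": int(pos), "strand": strand}

def view_for_site(bam_file, seqid, pos, strand="+"):
    """Hypothetical wrapper: build one fiber view or fail loudly."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from fiber_views.fiber_views import build_single_fview

    fview = build_single_fview(
        bam_file,
        make_site_info(seqid, pos, strand),
        window=(-500, 500),   # illustrative values, not recommendations
        max_reads=100,
    )
    if fview is None:
        # Documented behaviour: None is returned when no reads pass filtering.
        raise RuntimeError("no reads passed filtering at this site")
    return fview

site = make_site_info("chr3", 200000)
```

Checking for None before using the result avoids an AttributeError later in a plotting or aggregation step.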

fiber_views.fiber_views.read_bed(bed_file)

Read a BED file and return a pandas DataFrame.

Parameters:: bed_file (str) – The file path of the BED file to be read. The file should follow the BED standard and must not include a header row of column names.
Returns:: A DataFrame containing the data from the BED file.
Return type:: pandas.DataFrame
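Reading a headerless BED file with pandas can be sketched as below. This is an assumption about how such a reader could work, not the package's code: `read_bed_sketch` is a hypothetical name, the column names follow the standard BED6 field order, and columns beyond the sixth are given placeholder names.

```python
import io
import pandas as pd

# Standard BED6 field order; matches the column names bed_to_anno_df expects.
BED6_COLUMNS = ["chrom", "start", "end", "name", "score", "strand"]

def read_bed_sketch(path_or_handle):
    """Hypothetical reader for a tab-separated BED file without a header."""
    df = pd.read_csv(path_or_handle, sep="\t", header=None, comment="#")
    n = df.shape[1]
    # Name the first six columns; any further columns get placeholder names.
    df.columns = BED6_COLUMNS[:n] + [f"extra_{i}" for i in range(max(0, n - 6))]
    return df

bed_text = "chr1\t100\t200\tgeneA\t0\t+\nchr2\t500\t900\tgeneB\t0\t-\n"
bed_df = read_bed_sketch(io.StringIO(bed_text))
```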

fiber_views.plot module

Created on Wed Aug 28 13:33:42 2024

@author: morgan

fiber_views.plot.annotate_boundaries(fview)

Add start and end position columns to the observation metadata of a fiber view.

This function calculates the leftmost and rightmost positions of actual sequence data (excluding gaps) for each fiber and adds them as ‘s_pos’ and ‘e_pos’ columns to the obs dataframe.

Parameters:: fview (anndata.AnnData) – The fiber view object to annotate.
Returns:: The function modifies the fiber view object in place.
Return type:: None
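The boundary calculation can be illustrated with NumPy on a small byte array. Several details here are assumptions: gaps are taken to be stored as the byte `-`, the result is reported as column indices rather than genomic coordinates, and rows with no sequence get -1. `boundary_columns` is a hypothetical helper.

```python
import numpy as np

GAP = b"-"  # assumption: the byte used for positions without sequence

def boundary_columns(seq_array, gap=GAP):
    """Return the first and last non-gap column index of each row (-1 if none)."""
    present = seq_array != gap
    has_data = present.any(axis=1)
    n_cols = present.shape[1]
    first = np.where(has_data, present.argmax(axis=1), -1)
    # argmax on the reversed row finds the distance of the last base from the right.
    last = np.where(has_data, n_cols - 1 - present[:, ::-1].argmax(axis=1), -1)
    return first, last

seqs = np.array(
    [list("--ACGT--"), list("ACGTAC--"), list("--------")], dtype="S1"
)
first, last = boundary_columns(seqs)
```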

fiber_views.plot.draw_fiber_bars(fview, ax=None, color='#d0d0d0', width=0.8)

Draw fibers as horizontal bars on a matplotlib axis.

This function draws each fiber in the fiber view as a rectangular bar spanning from the start to the end of the sequence data. This visualization is suitable for fewer than ~150 fibers.

Parameters:

fview (anndata.AnnData) – The fiber view object containing sequence data.
ax (matplotlib.axes.Axes, optional) – An existing axis to draw on. If None, a new axis will be created. The default is None.
color (str, optional) – The color of the fiber bars. The default is “#d0d0d0”.
width (float, optional) – The width (height) of each bar in axis units. The default is DEFAULT_WIDTH (0.8).

Returns:

The axis with fibers drawn as horizontal bars.

Return type:

matplotlib.axes.Axes

fiber_views.plot.draw_fiber_lines(fview, ax=None, color='#606060')

Draw fibers as horizontal lines on a matplotlib axis.

This function draws each fiber in the fiber view as a horizontal line spanning from the start to the end of the sequence data. This visualization is suitable for fewer than ~150 fibers.

Parameters:

fview (anndata.AnnData) – The fiber view object containing sequence data.
ax (matplotlib.axes.Axes, optional) – An existing axis to draw on. If None, a new axis will be created. The default is None.
color (str, optional) – The color of the fiber lines. The default is “#606060”.

Returns:

The axis with fibers drawn as horizontal lines.

Return type:

matplotlib.axes.Axes

fiber_views.plot.draw_mods(fview, ax=None, mod='m6a', width=0.8, color='#000000')

Draw base modifications as vertical marks on a matplotlib axis.

This function visualizes base modifications (such as m6A or CpG methylation) as small vertical rectangles at each modified position along the fibers.

Parameters:

fview (anndata.AnnData) – The fiber view object containing modification data.
ax (matplotlib.axes.Axes, optional) – An existing axis to draw on. If None, a new axis will be created. The default is None.
mod (str, optional) – The name of the modification layer to draw (e.g., ‘m6a’, ‘cpg’). The default is ‘m6a’.
width (float, optional) – The width (height) of each modification mark in axis units. The default is DEFAULT_WIDTH (0.8).
color (str, optional) – The color of the modification marks. The default is ‘#000000’ (black).

Returns:

The axis with modifications drawn as vertical marks.

Return type:

matplotlib.axes.Axes

fiber_views.plot.draw_mods_offset(fview, ax=None, mod='m6a', width=0.8, color='#000000')

fiber_views.plot.draw_regions(fview, ax=None, base_name='msp', color='red', width=0.8)

Draw genomic regions as colored rectangles on a matplotlib axis.

This function visualizes regions (such as nucleosomes or MSPs) as colored rectangles overlaid on the fiber view. Each region is drawn at its corresponding position along the fiber.

Parameters:

fview (anndata.AnnData) – The fiber view object containing region data.
ax (matplotlib.axes.Axes, optional) – An existing axis to draw on. If None, a new axis will be created. The default is None.
base_name (str, optional) – The name of the region type to draw (e.g., ‘msp’, ‘nuc’, ‘fire’). The default is ‘msp’.
color (str, optional) – The color of the region rectangles. The default is “red”.
width (float, optional) – The width (height) of each rectangle in axis units. The default is DEFAULT_WIDTH (0.8).

Returns:

The axis with regions drawn as colored rectangles.

Return type:

matplotlib.axes.Axes

fiber_views.plot.draw_split_lines(fview, ax=None, split_var='site_name', color='black')

Draw horizontal lines separating groups in a fiber view.

This function draws horizontal dividing lines between groups of fibers based on a grouping variable in the observation metadata. This is useful for visually separating fibers from different sites or conditions.

Parameters:

fview (anndata.AnnData) – The fiber view object with grouped observations.
ax (matplotlib.axes.Axes, optional) – An existing axis to draw on. If None, a new axis will be created. The default is None.
split_var (str, optional) – The column name in obs to use for determining group boundaries. The default is “site_name”.
color (str, optional) – The color of the dividing lines. The default is “black”.

Returns:

The axis with horizontal dividing lines drawn between groups.

Return type:

matplotlib.axes.Axes
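Finding where the dividing lines belong amounts to locating the rows at which the grouping variable changes. A minimal sketch, assuming the rows are already sorted by group and using the hypothetical helper `group_boundaries`:

```python
def group_boundaries(labels):
    """Return row indices whose label differs from the previous row."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

site_names = ["siteA", "siteA", "siteB", "siteB", "siteB", "siteC"]
boundaries = group_boundaries(site_names)
# With one fiber per unit of y (an assumption about the axis layout), a
# dividing line for boundary i could be drawn at y = i - 0.5.
line_positions = [i - 0.5 for i in boundaries]
```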

fiber_views.plot.make_plot_ax(fview, ax=None)

Create or prepare a matplotlib axis for plotting a fiber view.

Parameters:

fview (anndata.AnnData) – The fiber view object to plot.
ax (matplotlib.axes.Axes, optional) – An existing axis to use. If None, a new figure and axis will be created. The default is None.

Returns:

The prepared axis with xlim and ylim set appropriately for the fiber view.

Return type:

matplotlib.axes.Axes

fiber_views.tools module

Created on Wed Sep 7 14:37:04 2022

@author: morgan

@description: A set of useful tools for working with fiber views

fiber_views.tools.agg_by_obs_and_bin(fview, obs_group_var='site_name', bin_width=10, obs_to_keep=['seqid', 'pos', 'strand', ''], fast=True, region_weights='ones')

Aggregate fiber view data by a group variable in the obs dataframe and bin by bin_width base pairs.

Parameters:

fview (anndata.AnnData) – The fiber view object containing the data to be aggregated.
obs_group_var (str, optional) – The name of the obs group variable to use for aggregation. The default value is ‘site_name’. If obs_group_var is set to None, the fiber view will not be aggregated by rows and the row ordering will be preserved.
bin_width (int, optional) – The width of each bin, in base pairs. The default value is 10. If bin_width is 1, the data will not be binned.
obs_to_keep (list of str, optional) – A list of observation metadata columns to keep in the aggregated data. The default value is [‘seqid’, ‘pos’, ‘strand’, ‘’].
fast (bool, optional) – If True, the modification matrices will be converted to dense matrices for faster calculations. The default value is True. This may use more memory for large fiber view objects.
region_weights (str, optional) – How to weight regions when aggregating. Must be one of ‘ones’, ‘length’, or ‘score’. The default value is ‘ones’.

Returns:

An aggregated version of the input fiber view object, with observations grouped and binned according to the specified parameters.

Return type:

anndata.AnnData
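The two steps, binning columns and aggregating rows by group, can be sketched with NumPy on a small dense matrix. Summation as the aggregate, dropping a trailing partial bin, and the helper names `bin_columns` and `sum_by_group` are all assumptions made for illustration. The package may compute something different.

```python
import numpy as np

def bin_columns(matrix, bin_width):
    """Sum adjacent columns into bins of bin_width (trailing partial bin dropped)."""
    n_rows, n_cols = matrix.shape
    n_bins = n_cols // bin_width
    trimmed = matrix[:, : n_bins * bin_width]
    return trimmed.reshape(n_rows, n_bins, bin_width).sum(axis=2)

def sum_by_group(matrix, groups):
    """Sum the rows that share a group label; labels are returned sorted."""
    groups = np.asarray(groups)
    labels = sorted(set(groups.tolist()))
    summed = np.vstack([matrix[groups == lab].sum(axis=0) for lab in labels])
    return labels, summed

m6a = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])
binned = bin_columns(m6a, bin_width=2)
labels, aggregated = sum_by_group(binned, ["a", "a", "b"])
```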

fiber_views.tools.bin_sparse_regions(fview, base_name='nuc', bin_width=10, interval=3)

Bin regions in a fiber view by averaging their length and score over a set of consecutive bins.

Parameters:

fview (anndata.AnnData) – The fiber view object containing the region data.
base_name (str, optional) – The name of the type of regions to bin. This should be one of ‘nuc’ (nucleosomes), or ‘msp’ (methylation sensitive patches). The default value is ‘nuc’.
bin_width (int, optional) – The width of each bin, in base pairs. The default value is 10.
interval (int, optional) – The interval between bins, in base pairs. The default value is 3.

Returns:

A tuple containing the position, length, and score data for the binned regions, stored as COOrdinate format sparse matrices.

Return type:

tuple of scipy.sparse.coo_matrix

fiber_views.tools.calc_kmer_dist(fview, metric='cityblock')

Calculate pairwise k-mer distances between fibers in a fiber view.

This function calculates pairwise distances between fibers in a fiber view based on the k-mer counts stored in the ‘kmers’ element of the obsm attribute. The distance metric can be specified using the metric parameter (default is ‘cityblock’). The resulting distance matrix is stored in the ‘kmer_dist’ element of the obsp attribute of the fiber view.

Parameters:

fview (anndata.AnnData) – Fiber view object containing k-mer count data in the ‘kmers’ element of the obsm attribute.
metric (str, optional) – Distance metric to use for calculating pairwise distances. The default is ‘cityblock’.

Return type:

None
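For reference, the 'cityblock' (Manhattan) metric sums the absolute differences between two count vectors. The pure-Python sketch below only illustrates the metric on toy k-mer counts. It says nothing about how the package computes or stores the matrix, and `cityblock` and `pairwise` are hypothetical helpers.

```python
def cityblock(u, v):
    """Manhattan distance between two equal-length count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pairwise(counts, metric=cityblock):
    """Full symmetric distance matrix as a list of lists."""
    n = len(counts)
    return [[metric(counts[i], counts[j]) for j in range(n)] for i in range(n)]

kmer_counts = [
    [2, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 3],
]
dist = pairwise(kmer_counts)
```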

fiber_views.tools.count_kmers(fview, k)

Count k-mers in each fiber in a fiber view.

This function counts the occurrences of k-mers in each fiber in a fiber view, and stores the resulting k-mer counts in the ‘kmers’ element of the obsm attribute of the fiber view. The length of the k-mers (k) and the mapping from k-mer strings to column indices in the k-mer count matrix are stored in the ‘kmer_len’ and ‘kmer_idx’ elements of the uns attribute, respectively.

Parameters:

fview (anndata.AnnData) – Fiber view object containing DNA sequence data in the ‘seq’ element of the layers attribute.
k (int) – Length of the k-mers to count.

Returns:

The function updates the fiber view object in place, adding a new observation matrix ‘kmers’ containing the counts of each k-mer for each fiber, and adds two new entries to the ‘uns’ dictionary: ‘kmer_len’ and ‘kmer_idx’. ‘kmer_len’ is the length of the k-mers that were counted, and ‘kmer_idx’ is a list of the k-mers that were counted, with each k-mer represented as a bytes object.

Return type:

None
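The counting step can be sketched in plain Python. The k-mer index is built as a list of bytes objects, as the description above states. The A/C/G/T alphabet, the lexicographic column order, skipping k-mers that contain other characters, and the name `count_kmers_sketch` are assumptions.

```python
from itertools import product

def kmer_index(k, alphabet=b"ACGT"):
    """All k-mers over the alphabet, as bytes, in lexicographic order."""
    return [bytes(p) for p in product(alphabet, repeat=k)]

def count_kmers_sketch(seqs, k):
    """Return (counts, idx): one count row per sequence, one column per k-mer."""
    idx = kmer_index(k)
    column = {kmer: j for j, kmer in enumerate(idx)}
    counts = [[0] * len(idx) for _ in seqs]
    for i, seq in enumerate(seqs):
        for start in range(len(seq) - k + 1):
            j = column.get(seq[start:start + k])
            if j is not None:  # k-mers with N or gap characters are skipped
                counts[i][j] += 1
    return counts, idx

counts, idx = count_kmers_sketch([b"ACGTAC", b"AANAA"], k=2)
```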

fiber_views.tools.filter_regions(fview, base_name='nuc', new_base_name=None, length_limits=(-inf, inf), score_limits=(-inf, inf), inplace=False)

Filter regions in a fiber view by length and score limits.

Parameters:

fview (anndata.AnnData) – The fiber view object containing the region data.
base_name (str, optional) – The name of the type of regions to filter. This should be one of ‘nuc’ (nucleosomes), or ‘msp’ (methylation sensitive patches). The default value is ‘nuc’.
new_base_name (str, optional) – If not None, the new region base name to save the filtered regions to (region information at base_name will not be modified). If new_base_name is None, the filtered regions will be saved to base_name.
length_limits (tuple of float, optional) – The lower and upper limits for the length of the regions. Regions with lengths outside of these limits will be filtered out. The default value is (-inf, inf), which includes all regions.
score_limits (tuple of float, optional) – The lower and upper limits for the score of the regions. Regions with scores outside of these limits will be filtered out. The default value is (-inf, inf), which includes all regions.
inplace (bool, optional) – If True, the function will filter the regions in place and return None. If False (default), the function will return a new fiber view.

Returns:

The function updates the fiber view object in place or returns a new fiber view with the selected region type filtered.

Return type:

None or anndata.AnnData
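The filtering rule can be sketched on a plain list of region records. Treating both limits as inclusive is an assumption, since the text above only says that values "outside of these limits" are removed. `filter_regions_sketch` is a hypothetical helper operating on dicts rather than on a fiber view.

```python
from math import inf

def filter_regions_sketch(regions, length_limits=(-inf, inf), score_limits=(-inf, inf)):
    """Keep regions whose length and score both fall within the given limits."""
    min_len, max_len = length_limits
    min_score, max_score = score_limits
    return [
        r for r in regions
        if min_len <= r["length"] <= max_len and min_score <= r["score"] <= max_score
    ]

regions = [
    {"row": 0, "start": 10, "length": 147, "score": 0.9},
    {"row": 0, "start": 300, "length": 60, "score": 0.4},
    {"row": 1, "start": 50, "length": 160, "score": 0.2},
]
nucleosome_sized = filter_regions_sketch(regions, length_limits=(100, 200))
confident = filter_regions_sketch(regions, score_limits=(0.5, inf))
```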

fiber_views.tools.get_seq_records(fview, id_col='read_name')

Convert fiber view sequences to BioPython SeqRecord objects.

Parameters:

fview (anndata.AnnData) – The fiber view object containing sequence data.
id_col (str, optional) – The column name in obs to use as the sequence ID. The default is “read_name”.

Returns:

A list of SeqRecord objects where each record contains the sequence from one row of the fiber view.

Return type:

list of Bio.SeqRecord.SeqRecord

fiber_views.tools.get_sequences(fview)

Returns a list of strings where each string is the sequence of one row of the fview object.

Parameters:: fview (AnnData object) – The fiber view object containing the sequence data.
Returns:: sequences – A list of strings where each string is the sequence of one row of the fview object.
Return type:: list

fiber_views.tools.make_dense_regions(fview, base_name='nuc', report='ones')

Create a dense matrix containing a representation of region information in a fiber view.

Parameters:

fview (anndata.AnnData) – The fiber view object containing the region data.
base_name (str, optional) – The name of the type of regions to convert. This should be one of ‘nuc’ (nucleosomes), or ‘msp’ (methylation sensitive patches). The default value is ‘nuc’.
report (str, optional) – The data to include in the dense matrix. This should be one of ‘ones’, ‘score’ or ‘length’. The default value is ‘ones’.

Returns:

A dense matrix of size (number of fibers, number of bases) containing the specified region data. Positions where no region is present are set to 0; positions where a region is present are set according to report, for example to the length or score of the region occupying that position.

Return type:

numpy.ndarray
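The dense representation can be sketched with NumPy from a list of (row, start, length, score) tuples. Clipping regions that extend past the matrix edges, writing 1 for report='ones', and the name `dense_regions_sketch` are assumptions.

```python
import numpy as np

def dense_regions_sketch(regions, shape, report="ones"):
    """Fill a zero matrix with 1, the length, or the score under each region."""
    dense = np.zeros(shape)
    n_cols = shape[1]
    for row, start, length, score in regions:
        value = {"ones": 1, "length": length, "score": score}[report]
        lo, hi = max(start, 0), min(start + length, n_cols)  # clip to the matrix
        dense[row, lo:hi] = value
    return dense

regions = [(0, 2, 3, 0.5), (1, -1, 3, 0.9)]
as_ones = dense_regions_sketch(regions, shape=(2, 6))
as_length = dense_regions_sketch(regions, shape=(2, 6), report="length")
```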

fiber_views.tools.make_region_df(fview, base_name='nuc', zero_pos='left')

Create a dataframe containing the positions and lengths of regions in a fiber view.

Parameters:

fview (anndata.AnnData) – The fiber view object containing the region data.
base_name (str, optional) – The name of the type of regions to tabulate. This should be one of ‘nuc’ (nucleosomes), or ‘msp’ (methylation sensitive patches). The default value is ‘nuc’.
zero_pos (str, optional) – The position to use as the zero point for the start positions of the regions. This should be one of ‘left’, ‘center’, or ‘right’. The default value is ‘left’.

Returns:

A dataframe with columns ‘row’ (the fiber index), ‘start’ (the start position of the region), ‘length’ (the length of the region), and ‘score’ (the score of the region).

Return type:

pandas.DataFrame

fiber_views.tools.mark_cpg_sites(fview, sparse=True)

Identify and mark CpG sites in a fiber view.

This function creates a new layer ‘cpg_sites’ in the fiber view with True values at positions that are CpG dinucleotides (C followed by G). The ‘cpg_sites’ layer is also added to the ‘mods’ list in uns.

Parameters:

fview (anndata.AnnData) – The fiber view object to mark CpG sites in.
sparse (bool, optional) – If True, store the CpG sites as a sparse matrix. If False, store as a dense array. The default is True.

Returns:

The function modifies the fiber view object in place, adding a ‘cpg_sites’ layer.

Return type:

None

Notes

Known issue: All Cs at the end of each sequence are marked as not CpGs.
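A vectorized way to find CpG sites is to compare each column with its right-hand neighbour. In the sketch below the C of each CG pair is marked, which is an assumption about which base carries the flag. The final column can never be marked because it has no neighbour to the right. `mark_cpg_sketch` is a hypothetical helper.

```python
import numpy as np

def mark_cpg_sketch(seq_array):
    """Boolean matrix that is True at every C directly followed by a G."""
    is_c = seq_array == b"C"
    is_g = seq_array == b"G"
    cpg = np.zeros(seq_array.shape, dtype=bool)
    # Column j is a CpG start when column j is C and column j + 1 is G.
    cpg[:, :-1] = is_c[:, :-1] & is_g[:, 1:]
    return cpg

seqs = np.array([list("ACGC"), list("CGCG")], dtype="S1")
cpg = mark_cpg_sketch(seqs)
```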

fiber_views.tools.split_fire(fview, input_region='msp', threshold=1, output_regions=['lnk', 'fire'])

Split methylation-sensitive patches (MSPs) into linker and FIRE regions based on score.

This function creates two new region types by filtering the input region type based on a score threshold. Regions with scores below the threshold are classified as one type (default: linker), and regions with scores above the threshold are classified as another type (default: FIRE).

Parameters:

fview (anndata.AnnData) – The fiber view object containing region data.
input_region (str, optional) – The name of the region type to split. The default is ‘msp’.
threshold (float, optional) – The score threshold for splitting regions. The default is 1.
output_regions (list of str, optional) – A list of two names for the output region types [low_score, high_score]. The default is [‘lnk’, ‘fire’].

Returns:

The function modifies the fiber view object in place, adding new region layers.

Return type:

None

fiber_views.utils module

Created on Tue Aug 30 16:05:34 2022

@author: morgan

class fiber_views.utils.ReadList(normal_list=[], strand='+')

Bases: list

A simple list of pysam.libcalignedsegment.PileupRead objects, plus methods useful for constructing anndata elements from the read objects. Also tracks the strand of the genomic query position.

Parameters:

normal_list (list, optional) – A list of pysam.libcalignedsegment.PileupRead objects. The default is [].
strand (str, optional) – The strand of the genomic query position. The default is “+”.

strand

The strand of the genomic query position.

Type:: str

build_anno_df(anno_series, tags=['np', 'ec', 'rq'])

Create a data frame with annotation data for the reads in the ReadList object.

Parameters:

anno_series (pandas Series) – A pandas Series containing annotation data for the genomic query position.
tags (list, optional) – A list of tags to include in the data frame. The default is [‘np’, ‘ec’, ‘rq’].

Returns:

df – A data frame with annotation data for the reads in the ReadList object.

Return type:

pandas DataFrame

build_mod_array(window, mod_type=[('A', 0, 'a'), ('T', 1, 'a')], strand=None, sparse=True, score_cutoff=200)

Create a base modification matrix for the reads in the ReadList object.

Parameters:

window (tuple) – A tuple of integers representing the window of +/- window_offset. The tuple should be of the form (window_start, window_end).
mod_type (list, optional) – A list of tuples representing the base modification type to consider. The default is M6A_MODS.
strand (str, optional) – The strand of the genomic query position. If not provided, the strand information of the ReadList object is used. The default is None.
sparse (bool, optional) – If True, the base modification matrix is returned in sparse format. If False, the matrix is returned in dense format. The default is True.
score_cutoff (int, optional) – The minimum score required for a base modification to be considered. The default is 200.

Returns:

mod_mtx – A base modification matrix for the reads in the ReadList object.

Return type:

numpy array or scipy sparse matrix

build_mod_array_from_def(window, mod_def, strand=None, sparse=True)

Create a base modification matrix for reads using a modification definition dictionary.

Parameters:

window (tuple) – A tuple of integers representing the window of +/- window_offset. The tuple should be of the form (window_start, window_end).
mod_def (dict) – A modification definition dictionary containing ‘mod_code’, ‘threshold’, and ‘rev_offset’ keys.
strand (str, optional) – The strand of the genomic query position. If not provided, the strand information of the ReadList object is used. The default is None.
sparse (bool, optional) – If True, the base modification matrix is returned in sparse format. If False, the matrix is returned in dense format. The default is True.

Returns:

A base modification matrix for the reads in the ReadList object, with modifications defined by mod_def.

Return type:

scipy.sparse.coo_matrix or numpy.ndarray

build_seq_array(window, strand=None)

Create a byte array of the sequences for the reads in the ReadList object.

Parameters:

window (tuple) – A tuple of integers representing the window of +/- window_offset. The tuple should be of the form (window_start, window_end).
strand (str, optional) – The strand of the genomic query position. If not provided, the strand information of the ReadList object is used. The default is None.

Returns:

char_array – A byte array of the sequences for the reads in the ReadList object.

Return type:

numpy array

Notes

Make sure to filter the reads in the ReadList object before using this method.

build_sparse_region_array(window, tags=('ns', 'nl'), interval=30, strand=None)

Create a sparse region matrix for the reads in the ReadList object.

Parameters:

window (tuple) – A tuple of integers representing the window of +/- window_offset. The tuple should be of the form (window_start, window_end).
tags (tuple, optional) – A tuple of tags to consider. The default is (‘ns’, ‘nl’).
interval (int, optional) – The interval at which to report region information. This determines the smallest window that the view can be subset to while still preserving region information. The default is 30.
strand (str, optional) – The strand of the genomic query position. If not provided, the strand information of the ReadList object is used. The default is None.

Returns:

region_mtx – A sparse region matrix for the reads in the ReadList object.

Return type:

scipy sparse matrix

filter_by_end_meth(dist=3000, cutoff=2, inplace=False)

Remove reads if they have fewer than [cutoff] m6A mods within [dist] base pairs of the read start and end.

Parameters:

dist (int, optional) – The distance in base pairs from the read start and end to consider. The default is 3000.
cutoff (int, optional) – The minimum number of m6A mods within [dist] base pairs of the read start and end required to keep the read. The default is 2.
inplace (bool, optional) – If True, the ReadList object is modified in place. If False, a new ReadList object is returned. The default is False.

Returns:

If inplace is True, returns None. If inplace is False, returns a new ReadList object with the filtered reads and strand information.

Return type:

None or ReadList
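The end-methylation check can be sketched per read as below. One reading of the rule is assumed here: each end of the read must independently carry at least cutoff modifications within dist base pairs. Positions are taken to be 0-based offsets into the read, and `passes_end_meth` is a hypothetical helper.

```python
def passes_end_meth(mod_positions, read_length, dist=3000, cutoff=2):
    """True if both read ends have at least `cutoff` mods within `dist` bp."""
    near_start = sum(1 for p in mod_positions if p < dist)
    near_end = sum(1 for p in mod_positions if p >= read_length - dist)
    return near_start >= cutoff and near_end >= cutoff

# A 10 kb read methylated near both ends passes the filter.
both_ends = passes_end_meth([10, 50, 9990, 9995], read_length=10000)
# A read with no modifications in its last 3 kb does not.
start_only = passes_end_meth([10, 50, 5000], read_length=10000)
```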

filter_by_window(window, inplace=False, strand=None)

Remove reads that do not fully span a given window of +/- window_offset.

Parameters:

window (tuple) – A tuple of integers representing the window of +/- window_offset. The tuple should be of the form (window_start, window_end).
inplace (bool, optional) – If True, the ReadList object is modified in place. If False, a new ReadList object is returned. The default is False.
strand (str, optional) – The strand of the genomic query position. If not provided, the strand information of the ReadList object is used. The default is None.

Returns:

If inplace is True, returns None. If inplace is False, returns a new ReadList object with the filtered reads and strand information.

Return type:

None or ReadList
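Whether a read fully spans the window can be sketched as an interval test. The coordinates are assumed to be half-open reference intervals, and the window is assumed to be mirrored around the query position for sites on the '-' strand. `spans_window` is a hypothetical helper, not a method of ReadList.

```python
def spans_window(read_start, read_end, pos, window, strand="+"):
    """True if [read_start, read_end) covers the window placed around pos."""
    upstream, downstream = window
    if strand == "+":
        left, right = pos + upstream, pos + downstream
    else:
        # Assumption: upstream and downstream swap sides on the minus strand.
        left, right = pos - downstream, pos - upstream
    return read_start <= left and read_end >= right

covers = spans_window(1000, 5000, pos=2500, window=(-1000, 1000))
too_short = spans_window(2000, 5000, pos=2500, window=(-1000, 1000))
minus_ok = spans_window(1900, 3500, pos=2500, window=(-1000, 500), strand="-")
plus_fail = spans_window(1900, 3500, pos=2500, window=(-1000, 500), strand="+")
```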

get_reads(alignment_file, ref_pos, max_reads)

Retrieve reads from a BAM file for a given reference position.

Parameters:

alignment_file (pysam.AlignmentFile) – A BAM file opened with pysam.
ref_pos (tuple) – A tuple containing the reference name, position, and strand of the genomic query position. The tuple should be of the form (reference_name, position, strand).
max_reads (int) – The maximum number of reads to load from the BAM file. Useful for speeding up processing when coverage is deep.

Returns:

self – The ReadList object with the reads and strand information.

Return type:

ReadList

Example

reads = ReadList().get_reads(bamfile, ('chr3', 200000, '+'), max_reads=300)

print_aligned_centers(offset=5)

Print the center positions of the reads in the ReadList object with a specified number of bases on either side. This is a test function for checking that reads are aligned correctly.

Parameters:: offset (int, optional) – The number of bases on either side of the center position to include in the output. The default is 5.
Return type:: None

fiber_views.utils.get_mod_pos_from_rec(rec, mods=[('A', 0, 'a'), ('T', 1, 'a')], score_cutoff=200)

Retrieve positions of modified bases in a record.

Parameters:

rec (pysam.libcalignedsegment.AlignedSegment) – A record containing modified bases.
mods (list, optional) – A list of modified bases to consider, in the form (base, index, code). The default is M6A_MODS.
score_cutoff (int, optional) – The minimum score required for a modified base to be included. The default is 200.

Returns:

mod_positions – An array of positions of modified bases.

Return type:

numpy.ndarray

Example

mod_positions = get_mod_pos_from_rec(read.alignment)

fiber_views.utils.get_strand_correct_mods(read, mod_type=[('A', 0, 'a'), ('T', 1, 'a')], centered=False, score_cutoff=200)

Retrieve modified bases in a read and correct their positions to match the forward genomic strand.

Parameters:

read (pysam.libcalignedsegment.AlignedSegment) – A read containing modified bases.
mod_type (list, optional) – A list of modified bases to consider, in the form (base, index, code). The default is M6A_MODS.
centered (bool, optional) – Whether to center the positions around the query position of the read. The default is False.
score_cutoff (int, optional) – The minimum score required for a modified base to be included. The default is 200.

Returns:

mods – An array of positions of modified bases, corrected for strand.

Return type:

numpy.ndarray

Example

mods = get_strand_correct_mods(read)

fiber_views.utils.get_strand_correct_mods_from_def(read, mod_def, centered=False)

Retrieve modified bases in a read using a modification definition and correct positions for strand.

This function extracts modification positions from a read using a custom modification definition dictionary and corrects the positions to match the forward genomic strand.

Parameters:

read (pysam.libcalignedsegment.AlignedSegment) – A read containing modified bases.
mod_def (dict) – A modification definition dictionary containing ‘mod_code’ (list of tuples), ‘threshold’ (int), and ‘rev_offset’ (int) keys.
centered (bool, optional) – Whether to center the positions around the query position of the read. The default is False.

Returns:

An array of positions of modified bases, corrected for strand. Returns None if no modifications are found.

Return type:

numpy.ndarray or None

Example

mods = get_strand_correct_mods_from_def(read, mod_def)

fiber_views.utils.get_strand_correct_regions(read, tags=('ns', 'nl'), centered=False)

Retrieve start positions, lengths, and scores of regions in a read and correct them to match the forward genomic strand.

Parameters:

read (pysam.libcalignedsegment.AlignedSegment) – A read containing regions.
tags (tuple, optional) – A tuple of tags containing the start positions, lengths, and scores of the regions. The default is (‘ns’, ‘nl’).
centered (bool, optional) – Whether to center the positions around the query position of the read. The default is False.

Returns:

starts (numpy.ndarray) – An array of start positions of regions, corrected for strand.
lengths (numpy.ndarray) – An array of lengths of regions.
scores (numpy.ndarray) – An array of scores of regions.

Example

starts, lengths, scores = get_strand_correct_regions(read)

fiber_views.utils.make_sparse_regions(region_df, shape, bin_width=1, interval=30)

Make a sparse matrix representing genomic regions.

This function takes a DataFrame containing region information, as well as the shape of the resulting matrix and other parameters, and returns three sparse matrices representing the positions, lengths, and scores of the regions within the matrix.

Parameters:

region_df (pandas.DataFrame) – A DataFrame containing the region information. The DataFrame should have columns for row, start, length, and score, representing the row index of the matrix, the starting position of the region (0-based), the length of the region, and the score associated with the region, respectively.
shape (tuple) – The shape of the resulting matrix before binning. The first element should be the number of rows in the matrix, and the second element should be the number of columns.
bin_width (int, optional) – The width of each bin in the resulting matrix, in base pairs. The default is 1.
interval (int, optional) – The interval at which to report region information. This determines the smallest window that the view can be subset to while still preserving region information. The default is 30. Note that interval is given in number of bins, not base pairs.

Returns:

A tuple containing three sparse matrices representing the positions, lengths, and scores of the regions within the matrix. The matrices are COO sparse matrices. Position values are still in base pairs after binning and may be negative for the first reported position of a region.

Return type:

tuple of scipy.sparse.coo_matrix

fiber_views.utils.print_mod_contexts(read, mod_positions, offset=5, use_strand=True)

Print the contexts surrounding modified bases in a read.

Parameters:

read (pysam.libcalignedsegment.AlignedSegment) – A read containing modified bases.
mod_positions (numpy.ndarray) – An array of positions of modified bases in the read.
offset (int, optional) – The number of bases on either side of the modified base to include in the context. The default is 5.
use_strand (bool, optional) – Whether to use the strand information in the read to determine the context. If True, the context will be reversed if the read is on the negative strand. The default is True.

Example

print_mod_contexts(read, mod_positions)

Module contents

Top-level package for fiber_views.