xenomapper2¶
xenomapper2.xenomapper2 module¶
-
class
xenomapper2.xenomapper2.AlignbatchFileReader(*args, **kwargs)¶ Bases:
pylazybam.bam.FileReaderA BAM file reader that iterates batches of reads with the same name
Parameters: ubam (BinaryIO) – An binary (bytes) file or stream containing a valid uncompressed bam file conforming to the specification. Yields: alignbatch (List[bytes]) – A byte string of a bam alignment entry in raw binary format -
refs¶ A dictionary of reference_name keys with reference_length
Type: Dict[str:int]
-
ref_to_index¶ A dictionary mapping reference names to the bam numeric identifier
Type: Dict[str:int]
-
index_to_ref¶ A dictionary mapping bam reference numeric identifiers to names
Type: Dict[int:str]
Notes
It is advisable not to call the private functions or operate directly on the underlying file object.
The detailed specification for the BAM format can be found at https://samtools.github.io/hts-specs/SAMv1.pdf
Example
This class requires an uncompressed bam file as input. For this example we will use gzip to decompress the test file included as a resource in the package
>>> import gzip >>> from pkg_resources import resource_stream >>> ubam = gzip.open('/tests/data/paired_end_testdata_human.bam'))
A bam filereader object can then be created and headers inspected
>>> mybam = bam.FileReader(ubam) >>> print(mybam.header)
The filereader object is an iterator and yields batches of alignments in raw BAM binary format
>>> align_batch = next(mybam) >>> for align in align_batch: >>> print(mybam.index_to_ref[get_ref_index(align)], >>> get_pos(align), >>> get_AS(align))
-
close()¶ Close input file
-
get_full_raw_header()¶ Get a complete BAM header with all elements suitable for writing to a bam.FileWriter
Returns: The complete raw BAM header including the BAM magic and reference block Return type: bytes
-
get_updated_header(id: str, program: str, version: str, command: str = None, description: str = None, raw_header=None)¶ Get a modified version of the header with additional program information in a @PG BAM header suitable for writing to a new output file.
Does not include the BAM magic or the BAM reference block.
Note that the header should be written after self.magic and before self.raw_refs
Parameters: - id (str) – a unique identifier of this action of the program on the BAM
- program (str) – the name of the program used to process the BAM
- version (str) – version number of the program used to process the BAM
- command (str) – the command and arguments to the program used to process the BAM
- description (str) – optional description
- raw_header (bytes) – a raw BAM formatted header without BAM magic or REF block used for recursive modification to add multiple
Notes
Parameters will be utf-8 encoded If there are existing @PG records the last record will be used as PP
See also
-
reset_alignments()¶ Reset the file pointer to the beginning of the alignment block
-
update_header(*args, **kwargs)¶ Use get_updated_header to update self.raw_header inplace
See get_updated_header for Parameters and documentation
-
update_header_length(raw_header=None)¶ Update the length of the SAM format text component of the header
Parameters: raw_header (bytes) – optional raw SAM text header as bytes to process Default self.raw_header Returns: returns the length corrected raw header if a raw header is provided Return type: bytes
-
-
class
xenomapper2.xenomapper2.DummyFile(*args, **kwargs)¶ Bases:
objectA dummy io class - looks like pylazybam.bam.FileWriter but does nothing
-
close()¶
-
get_updated_header(*args, **kwargs)¶
-
seekable()¶
-
update_header(*args, **kwargs)¶
-
update_header_length(*args, **kwargs)¶
-
writable()¶
-
write(data)¶
-
write_header(*args, **kwargs)¶
-
-
class
xenomapper2.xenomapper2.XenomapperOutputWriter(primary_raw_header: bytes, primary_raw_refs: bytes, secondary_raw_header: bytes, secondary_raw_refs: bytes, primary_specific: Union[str, pathlib.Path, BinaryIO, None] = None, primary_multi: Union[str, pathlib.Path, BinaryIO, None] = None, secondary_specific: Union[str, pathlib.Path, BinaryIO, None] = None, secondary_multi: Union[str, pathlib.Path, BinaryIO, None] = None, unresolved: Union[str, pathlib.Path, BinaryIO, None] = None, unassigned: Union[str, pathlib.Path, BinaryIO, None] = None, basename: str = None, cmdline: str = '', compresslevel: int = 6)¶ Bases:
objectContainer class grouping xenomapper output files & manipulating headers
Updated headers are written to these files on initialization. The primary BAM header is used for all files except secondary_specific and secondary_multi. A @PO line is added to the file indicating the id and program as xenomapper with the version, a description field indicating the type of xenomapper output in the file (eg primary_specific) and the command used to run it.
Parameters: - primary_raw_header (bytes) – the raw header from the primary species BAM file
- primary_raw_refs (bytes) – the raw reference sequence info from the primary species BAM file
- secondary_raw_header (bytes) – the raw header from the secondary species BAM file
- secondary_raw_refs (bytes) – the raw reference sequence info from the secondary species BAM file
- primary_specific (str or Path or BinaryIO, optional) – output file for uniquely mapping primary specific alignments [ Default : None ]
- primary_multi (str or Path or BinaryIO, optional) – output file for multimapping primary specific alignments [ Default : None ]
- secondary_specific (str or Path or BinaryIO, optional) – output file for uniquely mapping secondary specific alignments [ Default : None ]
- secondary_multi (str or Path or BinaryIO, optional) – output file for multimapping secondary specific alignments [ Default : None ]
- unresolved (str or Path or BinaryIO, optional) – output file for alignments where primary or secondary is ambiguous [ Default : None ]
- unassigned (str or Path or BinaryIO, optional) – output file for reads that do not map sufficiently well in either BAM [ Default : None ]
- basename (str, optional) – filename stem to derive all output filenames <basename>_<outfilename> (eg basename=’Foo’ -> Foo_primary_specific, … Foo_unassigned) [ Default : None ]
- cmdline (str, optional) – The commandline to include in the output BAM header
- compresslevel (int, optional) – gzip compression level for output file [ Default : 6 ]
Returns: Return type: A container class with __get_item__ , keys, close
Examples
>>> xow = XenomapperOutputWriter(phead,pref,shead,sref,basename='/foo') >>> for key in xow.keys(): >>> xow[key].write(data) >>> xow.close()
-
close()¶
-
keys()¶ return the keys for the file output objects :returns: :rtype: Iterable[str]
-
xenomapper2.xenomapper2.always_very_negative(*args, **kwargs) → int¶ A function that returns -2**31 for any combination of parameters :returns: always -2**31 == MIN32INT == -2147483648 :rtype: int
See also
-
xenomapper2.xenomapper2.always_zero(*args, **kwargs) → int¶ A function that returns zero for any combination of parameters :returns: always 0 :rtype: int
See also
-
xenomapper2.xenomapper2.calc_cigar_based_score(cigar_string: bytes, NM: int = None, mismatch: int = -6, gap_open: int = -5, gap_extend: int = -3, softclip: int = -2) → int¶ Calculate values for a score equivalent to a rescaled AS tag using the SAM formatted cigar line and NM tags from an alignment. Score is scaled such that a perfect match for the full length of the read achieves a score of 0. Standard penalties are: mismatch -6, gap open -5, gap extend -3, softclipped -2 (equiv to +2 match)
Parameters: - align (bytes) – the raw alignment entry from a BAM file
- mismatch (int) – score penalty for a mismatch [ Default : -6 ]
- gap_open (int) – score penalty for opening/initiating a gap [ Default : -5 ]
- gap_extend (int) – score penalty for extending an existing gap [ Default : -3 ]
- softclip (int) – score penalty for removing bases at the start or end of a read [ Default : -2 ]
Returns: The AS like score calculated from the cigar string 0 maximum and mismatch*seq_length theoretical minimum
Return type: See also
-
xenomapper2.xenomapper2.conservative_state_map(forward_state, reverse_state)¶ Conservative tate map for inferring fragment state from forward & reverse Reads where there is ambiguous evidence will be assigned the least specific state.
Parameters: - forward_state (str) – a xenomapper state for the forward read (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved)
- reverse_state (str or None) – a xenomapper state for the reverse read or None (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved) or None
Returns: a xenomapper state for the fragment (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved)
Return type: Notes
In future these will be of type PEP586 Literal
-
xenomapper2.xenomapper2.get_bamprimary_AS_XS(alignments: List[bytes], AS_function=<function get_AS>, XS_function=<function get_XS>) → Tuple[int, int]¶ Get the AS and XS of the primary alignment
Parameters: - alignments (List[bytes]) – a list of binary format BAM alignments from the same read. Should contain only forward or reverse reads and only one primary flag
- AS_function – a function that accepts a BAM alignment bytestring and returns the AS score (alignment score) as an integer eg pylazybam.bam.get_AS or
- XS_function – a function that accepts a BAM alignment bytestring and returns the XS score as an integer
Returns: a tuple of the alignment score AS and suboptimal alignment score XS
Return type: Raises: ValueError– raises value error if there is no primary or more than one primary flag
-
xenomapper2.xenomapper2.get_cigar_based_score(align: bytes, no_tag: int = -2147483648) → int¶ Returns a score equivalent to a rescaled AS tag from a raw BAM alignment Extracts and decodes the cigar string and NM tags and calls calc_cigar_based_score to calculate
Parameters: Returns: the AS like score with 0 maximum and -6*seq_length theoretical minimum
Return type: See also
-
xenomapper2.xenomapper2.get_mapping_state(AS1: int, XS1: int, AS2: int, XS2: int, min_score: int = -2147483648)¶ Determine the mapping state based on scores in each species. Scores can be negative but better matches must have higher scores
Parameters: - AS1 (int) – score of best match in primary species
- XS1 (int) – score of suboptimal match in primary species
- AS2 (int) – score of best match in secondary species
- XS2 (int) – score of suboptimal match in secondary species
- min_score (int [ Default : -2**31 ]) – the score that matches must exceed in order to be considered valid matches. Note scores equalling this value will also be considered not to match.
Returns: ‘primary_multi’, ‘secondary_multi’, ‘unresolved’, or ‘unassigned’ indicating match state.
Return type: state - a string of ‘primary_specific’, ‘secondary_specific’,
-
xenomapper2.xenomapper2.get_max_AS_XS(alignments: List[bytes], AS_function=<function get_AS>, XS_function=<function get_XS>, sort_key=None, default=-2147483648) → Tuple[int, int]¶ Get the maximum AS and XS from a group of alignments
Parameters: - alignments (List[bytes]) – a list of binary format BAM alignments.
- AS_function (Callable[[bytes], int]) – a function that accepts a BAM alignment bytestring and returns the AS score (alignment score) as an integer eg pylazybam.bam.get_AS or
- XS_function (Callable[[bytes], int]) – a function that accepts a BAM alignment bytestring and returns the XS score as an integer
- sort_key (Callable[[Tuple[int,int]], Any]) – a key function that converts tuples to items to be used by sorted()
Returns: a tuple of the alignment score AS and suboptimal alignment score XS
Return type:
-
xenomapper2.xenomapper2.output_summary(category_counts: collections.Counter, title='Read Count Category Summary\n', outfile=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>)¶ Print a summary table based on a Counter
Parameters: - category_counts (Counter) –
- title (str) –
- outfile (TextIO) –
-
xenomapper2.xenomapper2.split_forward_reverse(alignments: List[bytes]) → Tuple[List[bytes], List[bytes]]¶ Split a list of BAM alignments into forward and reverse
Parameters: alignments (List[bytes]) – a list of BAM alignments in raw binary format containing forward and optionally reverse alignments Returns: a tuple containing a list of forward and reverse BAM reads Return type: (List[bytes], List[bytes]) Raises: ValueError– if there are any reads that do not have the forward or reverse flag
-
xenomapper2.xenomapper2.state_map(forward_state: str, reverse_state: str) → str¶ State map for inferring the fragment state from forward & reverse Reads where there is stronger but not contradictory evidence will be assigned the more specific state.
Parameters: - forward_state (str) – a xenomapper state for the forward read (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved)
- reverse_state (str or None) – a xenomapper state for the reverse read or None (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved) or None
Returns: a xenomapper state for the fragment (primary_specific,primary_multi,secondary_specific,secondary_multi, unassigned, unresolved)
Return type: Notes
In future these will be of type PEP586 Literal
-
xenomapper2.xenomapper2.xenomap(primary_bam: xenomapper2.xenomapper2.AlignbatchFileReader, secondary_bam: xenomapper2.xenomapper2.AlignbatchFileReader, output_writer: xenomapper2.xenomapper2.XenomapperOutputWriter, score_function: Callable = <function get_bamprimary_AS_XS>, AS_function: Callable = <function get_AS>, XS_function: Callable = <function get_XS>, min_score: int = -2147483648, conservative: bool = False)¶ core method to coordinate the xenomapping of BAMS
Parameters: - primary_bam (AlignbatchFileReader) – An interable that yields iterables of primary BAM alignments
- secondary_bam (AlignbatchFileReader) – An interable that yields iterables of secondary BAM alignments
- output_writer (XenomapperOutputWriter) – output writer that holds output files of type bam.FileWriter
- score_function (Callable) – a xenomapper alignment batch calling function get_bamprimary_AS_XS or get_max_AS_XS
- AS_function (Callable[[bytes], int]) – a function that accepts a BAM alignment bytestring and returns the AS score (alignment score) as an integer get_AS or get_cigar_based_score
- XS_function (Callable[[bytes], int]) – a function that accepts a BAM alignment bytestring and returns the XS score as an integer get_XS, get_ZS, always_zero, or always_very_negative
- min_score (int) – the score that matches must exceed in order to be considered valid matches. Note scores equalling this value will also be considered not valid matches. [ Default : -2**31 ]
- conservative –
Returns: Return type: Tuple[Counter, Counter, XenomapperOutputWriter]
-
xenomapper2.xenomapper2.xenomap_states(primary_aligns: Iterable[bytes], secondary_aligns: Iterable[bytes], score_function: Callable = <function get_bamprimary_AS_XS>, AS_function: Callable = <function get_AS>, XS_function: Callable = <function get_XS>, min_score: int = -2147483648) → Tuple[int, int]¶ Get the xenomapping state for the forward and reverse reads
- primary_aligns : List[bytes]
- a list of binary format BAM alignments from the primary BAM.
- secondary_aligns : List[bytes]
- a list of binary format BAM alignments from the primary BAM.
- score_function : Callable
- a xenomapper alignment batch calling function get_bamprimary_AS_XS or get_max_AS_XS
- AS_function : Callable[[bytes], int]
- a function that accepts a BAM alignment bytestring and returns the AS score (alignment score) as an integer get_AS or get_cigar_based_score
- XS_function : Callable[[bytes], int]
- a function that accepts a BAM alignment bytestring and returns the XS score as an integer get_XS, get_ZS, always_zero, or always_very_negative
- min_score : int [ Default : -2**31 ]
- the score that matches must exceed in order to be considered valid matches. Note scores equalling this value will also be considered not valid matches.
Returns: the xenomapper state of the forward and reverse read Return type: Tuple[str,str]
xenomapper2.cli module¶
Format decoders for BAM alignment data types
-
xenomapper2.cli.main(arguments: str = None, output=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>) → Tuple[collections.Counter, collections.Counter]¶ main function for orchestrating a xenomapper2 run
Parameters: - arguments (str) – a string of unix command line arguments - used for testing
- output (TextIO) – destination to print progress and results to - used for testing
Returns: - Counter – a counter of pair states
- Counter – a counter of read states
pylazybam.tags module¶
Functions for extracting and decoding SAM tag data from BAM alignments
Extract the high scoring alignment score from an AS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: AS Tag Value – the integer value of the AS tag returns the value of no_tag if tag absent (default:MIN32INT)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchNotes
Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.
Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone
Extract the MD tag from a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: None)
Returns: MD Tag Value – an ASCII string representing the SAM format value of the MD tag returns the value of no_tag if tag absent (default: None)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchNotes
The MD field aims to achieve SNP/indel calling independent of the reference. Numbers represent matches, letters bases that differ from the reference and bases preceeded by ^ are deletions. A ^T0A indicates a base change to an A immediately following a deleted T.
Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.
Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone
Extract the suboptimal alignment score from an XS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: XS Tag Value – the integer value of the XS tag returns the value of no_tag if tag absent (default:MIN32INT)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchNotes
This function is for the genome aligner definition of XS where XS:i:<int> is the alignment score of the suboptimal alignment. This is not the same as the spliced aligner XS tag XS:C:<str> that represents the strand on which the intron occurs (equiv to TS:C:<str>)
Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.
Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone
Extract the suboptimal alignment score from the ZS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: ZS Tag Value – the integer value of the ZS tag returns the value of no_tag if tag absent (default:MIN32INT)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchNotes
ZS is the equivalent to XS:i:<int> tag in some spliced aligners including HISAT2.
Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.
Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone
Extract an integer format tag from a raw BAM alignment bytestring
Parameters: Returns: the integer value of the tag returns the value of no_tag if tag absent (default:MIN32INT)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchPotential values for the tag parameter include:
- AM:i:score The smallest template-independent mapping quality of any
- segment in the same template as this read. (See also SM.)
AS:i:score Alignment score generated by aligner.
CP:i:pos Leftmost coordinate of the next hit.
FI:i:int The index of segment in the template.
H0:i:count Number of perfect hits.
H1:i:count Number of 1-difference hits (see also NM).
H2:i:count Number of 2-difference hits.
- HI:i:i Query hit index, indicating the alignment record is the
- i-th one stored in SAM.
- IH:i:count Number of alignments stored in the file that contain the
- query in the current record.
MQ:i:score Mapping quality of the mate/next segment.
- NH:i:count Number of reported alignments that contain the query in
- the current record.
- NM:i:count Number of differences (mismatches plus inserted and deleted
- bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch.
- PQ:i:score Phred likelihood of the template, conditional on the mapping
- locations of both/all segments being correct.
- SM:i:score Template-independent mapping quality, i.e., the mapping
- quality if the read were mapped as a single read rather than as part of a read pair or template.
TC:i: The number of segments in the template.
- UQ:i: Phred likelihood of the segment,
- conditional on the mapping being correct.
Extract an integer format tag from a raw BAM alignment bytestring
Parameters: Returns: MD Tag Value – an ASCII string representing the SAM format value of the MD tag returns the value of no_tag if tag absent (default: None)
Return type: Raises: ValueError– raises a ValueError if more than one tag matchNotes
Potential values for the tag parameter include:
- BQ:Z:qualities Offset to base alignment quality (BAQ), of the same
- length as the read sequence. At the i-th read base, BAQi = Qi − (BQi − 64) where Qi is the i-th base quality
CC:Z:rname Reference name of the next hit; ‘=’ for same chromosome.
- E2:Z:bases The 2nd most likely base calls.
- Same encoding and same length as SEQ.
FS:Z:str Segment suffix.
MC:Z:cigar CIGAR string for mate/next segment.
MD:Z: String for mismatching positions.
- Q2:Z:qualities Phred quality of the mate/next segment sequence in the
- R2 tag. Same encoding as QUAL.
R2:Z:bases Sequence of the mate/next segment in the template.
- SA:Z: (rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+
- Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list.
- U2:Z: Phred probability of the 2nd call being wrong
- conditional on the best being wrong.
RG:Z:readgroup The read group to which the read belongs.
LB:Z:library The library from which the read has been sequenced.
PG:Z:program id Program. Value matches the header PG-ID tag
PU:Z:platformunit The platform unit in which the read was sequenced.
CO:Z:text Free-text comments.
- BC:Z:sequence Barcode sequence (Identifying the sample/library),
- with any quality scores (optionally) stored in QT tag. The BC tag should match the QT tag in length.
- QT:Z:qualities Phred quality of the sample barcode sequence in BC tag.
- Same encoding as QUAL, i.e., Phred score + 33.
- CB:Z:str Cell identifier, consisting of the optionally-corrected
- cellular barcode sequence and an optional suffix. The sequence part is similar to the CR tag
- CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the
- cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY.
- CY:Z:qualities+ Phred quality of the cellular barcode sequence in CR tag
- Same encoding as QUAL, i.e., Phred score + 33.
- MI:Z:str Molecular Identifier. A unique ID within the SAM file
- for the source molecule from which this read is derived.
- OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases,
- with quality scores (optionally) stored in the BZ tag.
- BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular
- identifier sequence in the OX tag. Same encoding as QUAL, i.e., Phred score + 33.
- RX:Z:sequence+ Sequence bases from the unique molecular identifier.
- These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file.
- QX:Z:qualities+ Phred quality of the unique molecular identifier
- sequence in the RX tag. Same encoding as QUAL, i.e., Phred score + 33
- OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM;)+ The original alignment
- information of the record prior to realignment or unalignment by a subsequent tool.
OC:Z:cigar Original CIGAR, usually before realignment.
- CT:Z:strand ;type (;key (=value )?)* Complete read annotation tag
- used for consensus annotation dummy features.
- PT:Z:annotag(|annotag)* where each annotag matches
- start;end;strand;type(;key(=value)?)* Read annotations for parts of the padded read sequence.
see https://samtools.github.io/hts-specs/SAMtags.pdf
Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.
Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone