SeqAn3 3.3.0-rc.1
The Modern C++ library for sequence analysis.
 
Reading SAM files

Construction and specialisation

The seqan3::sam_file_input class comes with four constructors: one that takes a file name, one that takes an existing stream together with a known format, and a variant of each that additionally takes reference information. Constructing from a file name automatically picks the format based on the extension of the file name. Constructing from a stream is useful if you have a non-file stream, such as std::cin or a std::istringstream. It also comes in handy if you cannot rely on file-extension-based detection but know that your input has a certain format.
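To illustrate what extension-based detection means in practice, here is a minimal standalone sketch. The names toy_format and guess_format are hypothetical and exist only for this illustration; this is not how SeqAn implements the detection, and it ignores details such as compressed files.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical helper for illustration only; this is NOT SeqAn's implementation.
enum class toy_format
{
    sam,
    bam
};

std::optional<toy_format> guess_format(std::filesystem::path const & file_name)
{
    std::string const extension = file_name.extension().string(); // includes the leading '.'

    if (extension == ".sam")
        return toy_format::sam;
    if (extension == ".bam")
        return toy_format::bam;

    return std::nullopt; // unknown extension: the caller has to name the format explicitly
}
```

A stream such as std::cin has no file name to inspect, which is why the stream constructor asks for the format as a separate argument.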

Passing reference information, e.g.

  • ref_ids: The names of the reference sequences, e.g. "chr1", "chr2", ...
  • ref_sequences: The reference sequences, in the same order as the ref_ids.

comes in handy once you want to convert the CIGAR string, read from your file, into an actual alignment. This will be covered in the section Transforming the CIGAR information into an actual alignment.

In most cases the template parameters are deduced automatically:

#include <filesystem>
#include <fstream>

#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    // Create the temporary file.
    auto tmp_file = std::filesystem::temp_directory_path() / "my.sam";
    std::ofstream tmp_stream{tmp_file};
    tmp_stream << sam_file_raw;
    tmp_stream.close();

    {
        seqan3::sam_file_input fin{tmp_file}; // SAM format assumed, regular std::ifstream taken as stream
    } // fin is closed here, before the file is deleted

    // Clean up the temporary file.
    std::filesystem::remove(tmp_file);
}

Reading from a std::istringstream:

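#include <sstream>

#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto input = R"(@HD	VN:1.6	SO:coordinate
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*)";

int main()
{
    seqan3::sam_file_input fin{std::istringstream{input}, seqan3::format_sam{}};
    //                    ^ no need to specify the template arguments
}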

Note that this is not the same as writing sam_file_input<> (with angle brackets). With angle brackets, the template parameters are explicitly set to their default values; without them, automatic deduction takes place and chooses the parameters depending on the constructor arguments. For opening from a file, sam_file_input<> would also have worked, but for opening from a stream it would not have.
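The difference between omitting the angle brackets and writing empty ones is a general C++17 rule, not something specific to SeqAn. The following sketch shows it with a hypothetical class template toy_file that has a single defaulted stream parameter; the real seqan3::sam_file_input has different template parameters and its own deduction rules.

```cpp
#include <fstream>
#include <sstream>
#include <type_traits>
#include <utility>

// Hypothetical stand-in used only to illustrate deduction; NOT seqan3::sam_file_input.
template <typename stream_t = std::ifstream>
struct toy_file
{
    // From a file name: nothing to deduce, so the default stream type is used.
    explicit toy_file(char const * /*file_name*/)
    {}

    // From an existing stream: the stream type is deduced from the argument.
    explicit toy_file(stream_t && existing_stream) : stream{std::move(existing_stream)}
    {}

    stream_t stream{};
};

// No angle brackets: class template argument deduction inspects the constructor arguments.
using from_name_t = decltype(toy_file{"reads.sam"});
using from_stream_t = decltype(toy_file{std::istringstream{"@HD"}});

static_assert(std::is_same_v<from_name_t, toy_file<std::ifstream>>);
static_assert(std::is_same_v<from_stream_t, toy_file<std::istringstream>>);

// Empty angle brackets: the defaults are fixed before the arguments are looked at,
// so toy_file<> is toy_file<std::ifstream> and cannot take a std::istringstream.
static_assert(std::is_constructible_v<toy_file<>, char const *>);
static_assert(!std::is_constructible_v<toy_file<>, std::istringstream>);
```

In this toy setting the file-name constructor works either way, while the stream constructor only works when the type is deduced or spelled out in full.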

You can define your own traits type to further customise the types used and returned by this class; see seqan3::sam_file_input_default_traits for more details. As mentioned above, specifying at least one template parameter yourself means that you lose automatic deduction. The following is equivalent to the automatic type deduction example with a stream from above:

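#include <sstream>

#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/utility/type_list/type_list.hpp>

// SAM columns must be separated by tabs.
auto input = R"(@HD	VN:1.6	SO:coordinate
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*)";

int main()
{
    // The default types; you can adjust this list if you don't want to read all this data.
    // (Field order assumed to match the default of the class template.)
    using default_fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::ref_id,
                                          seqan3::field::ref_offset, seqan3::field::cigar, seqan3::field::mapq,
                                          seqan3::field::qual, seqan3::field::flag, seqan3::field::mate,
                                          seqan3::field::tags, seqan3::field::header_ptr>;
    using sam_file_input_t = seqan3::sam_file_input<seqan3::sam_file_input_default_traits<>, // The traits
                                                    default_fields,                          // The fields to read
                                                    // Which formats are allowed:
                                                    seqan3::type_list<seqan3::format_sam>>;
    sam_file_input_t fin{std::istringstream{input}, seqan3::format_sam{}};
}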

Reading record-wise

You can iterate over this file record-wise:

#include <sstream>

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw}, seqan3::format_sam{}};

    for (auto & rec : fin)
    {
        seqan3::debug_stream << "id: " << rec.id() << '\n';
        seqan3::debug_stream << "read sequence: " << rec.sequence() << '\n';
        seqan3::debug_stream << "mapping position: " << rec.reference_position() << '\n';
        seqan3::debug_stream << "mapping quality: " << rec.mapping_quality() << '\n';
        // there are more fields read by default
    }
}

In the above example, rec has the type seqan3::sam_file_input::record_type, which is a specialisation of seqan3::record and behaves like a std::tuple. Instead of using the named member functions shown above (e.g. rec.sequence()), you could also use std::get<0> or even std::get<seqan3::dna5_vector> to retrieve the sequence, but this is not recommended because it is more error-prone.

Note
It is important to write auto & and not just auto, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:
#include <sstream>
#include <utility>
#include <vector>

#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw}, seqan3::format_sam{}};

    using record_type = typename decltype(fin)::record_type;
    std::vector<record_type> records{}; // store all my records in a vector

    for (auto & rec : fin)
        records.push_back(std::move(rec)); // move the data out of the buffer instead of copying it
}
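Why moving out of the record is safe can be seen in a small standalone sketch with a reused buffer. The type toy_record and the function read_all are hypothetical stand-ins that read one id per line; they only mimic the "refill on every iteration" behaviour described in the note and are not part of SeqAn.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical one-field record; a real SAM record has many more members.
struct toy_record
{
    std::string id;
};

std::vector<toy_record> read_all(std::istream & input)
{
    std::vector<toy_record> records{};
    toy_record buffer{}; // a single buffer that is reused for every line

    while (std::getline(input, buffer.id))    // overwrites ("refills") the buffer
        records.push_back(std::move(buffer)); // steal the contents instead of copying them

    return records;
}
```

The moved-from buffer is left in a valid but unspecified state, which does not matter here because the next read overwrites it completely.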

Reading record-wise (custom fields)

If you want to skip specific fields from the record you can pass a non-empty fields trait object to the seqan3::sam_file_input constructor to select the fields that should be read from the input. For example, you may only be interested in the mapping flag and mapping quality of your SAM data to get some statistics. The following snippets demonstrate the usage of such a fields trait object.

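#include <sstream>

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    // Only read the flag and the mapping quality.
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw},
                               seqan3::format_sam{},
                               seqan3::fields<seqan3::field::flag, seqan3::field::mapq>{}};

    for (auto & rec : fin)
    {
        seqan3::debug_stream << "flag: " << rec.flag() << '\n';
        seqan3::debug_stream << "mapping quality: " << rec.mapping_quality() << '\n';
    }
}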

When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids parameter) are ignored and the respective value in the record stays empty.

Reading record-wise (decomposed records)

Instead of calling member functions on the record, you can also use structured bindings to decompose the record into its elements. Taking the example of reading only the flag and mapping quality from before, you can also write:

#include <sstream>

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw},
                               seqan3::format_sam{},
                               seqan3::fields<seqan3::field::flag, seqan3::field::mapq>{}};

    for (auto & [flag, mapq] : fin) // the order is the same as specified in fields!
    {
        seqan3::debug_stream << "flag: " << flag << '\n';
        seqan3::debug_stream << "mapping quality: " << mapq << '\n';
    }
}

In this case you immediately get the two elements of the tuple: flag of seqan3::sam_file_input::flag_type and mapq of seqan3::sam_file_input::mapq_type.

Note
But beware: with structured bindings you must get the order of the elements right! The names are bound by position, not by meaning.
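The following standalone sketch shows how silently this can go wrong. It uses a plain std::tuple as a hypothetical stand-in for a two-field record (flag first, mapping quality second); the type alias and function names are made up for this illustration.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical two-field record in the order {flag, mapping quality}.
using toy_record = std::tuple<std::uint16_t, std::uint8_t>;

int flag_with_correct_order(toy_record const & rec)
{
    auto const & [flag, mapq] = rec; // flag binds to element 0, mapq to element 1
    (void)mapq;
    return flag;
}

int flag_with_swapped_names(toy_record const & rec)
{
    auto const & [mapq, flag] = rec; // compiles fine, but "flag" is now element 1
    (void)mapq;
    return flag; // actually returns the mapping quality
}
```

Both functions compile without a warning, because the compiler only sees positions. With the member functions on the record this mix-up cannot happen.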

Transforming the CIGAR information into an actual alignment

In SeqAn, we represent an alignment as a tuple of two seqan3::aligned_sequences.

The conversion from a CIGAR string to an alignment can be done with the function seqan3::alignment_from_cigar. You need to pass the CIGAR vector, the reference sequence, the 0-based position at which the read was aligned, and the read sequence. Apart from the reference sequence, all of this is already in the record when reading a SAM file:

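#include <sstream>

#include <seqan3/alignment/cigar_conversion/alignment_from_cigar.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>

using namespace seqan3::literals;

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6
@SQ	SN:ref	LN:34
read1	41	ref	1	61	1S1M1D1M1I	ref	10	300	ACGT	!##$	AS:i:2	NM:i:7
read2	42	ref	2	62	1H7M1D1M1S2H	ref	10	300	AGGCTGNAG	!##$&'()*	xy:B:S,3,4,5
read3	43	ref	3	63	1S1M1P1M1I1M1I1D1M1S	ref	10	300	GGAGTATA	!!*+,-./
)";

int main()
{
    // The reference sequence might be read from a different file.
    seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;

    seqan3::sam_file_input fin{std::istringstream{sam_file_raw}, seqan3::format_sam{}};
    // You will probably read it from a file, e.g., like this:
    // seqan3::sam_file_input fin{"test.sam"};

    for (auto && rec : fin)
    {
        auto alignment = seqan3::alignment_from_cigar(rec.cigar_sequence(),
                                                      reference,
                                                      rec.reference_position().value(),
                                                      rec.sequence());
        seqan3::debug_stream << alignment << '\n';
    }
    // prints:
    // (ACT-,C-GT)
    // (CTGATCGAG,AGGCTGN-A)
    // (T-G-A-TC,G-AGTA-T)
}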

The code will print the following:

(ACT-,C-GT)
(CTGATCGAG,AGGCTGN-A)
(T-G-A-TC,G-AGTA-T)
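To make the CIGAR semantics behind these three lines easier to follow, here is a standalone sketch that expands a CIGAR string over plain strings. The function expand_cigar is hypothetical and is not seqan3::alignment_from_cigar; in particular, treating N like a deletion is an assumption of this sketch. It takes a 0-based reference start, so the 1-based SAM positions 1, 2 and 3 from the example become 0, 1 and 2.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Standalone sketch (NOT seqan3::alignment_from_cigar): expands a CIGAR string into a
// pair of gapped strings {reference, query}. ref_start is 0-based.
std::pair<std::string, std::string>
expand_cigar(std::string_view cigar, std::string_view reference, std::size_t ref_start, std::string_view query)
{
    std::string gapped_reference{};
    std::string gapped_query{};
    std::size_t ref_pos = ref_start;
    std::size_t query_pos = 0;
    std::size_t count = 0;

    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + static_cast<std::size_t>(c - '0'); // accumulate the operation length
            continue;
        }

        switch (c)
        {
        case 'M': case '=': case 'X': // aligned columns: consume both sequences
            gapped_reference += reference.substr(ref_pos, count);
            gapped_query += query.substr(query_pos, count);
            ref_pos += count;
            query_pos += count;
            break;
        case 'D': case 'N': // deletion (N is treated the same here): gap in the query
            gapped_reference += reference.substr(ref_pos, count);
            gapped_query.append(count, '-');
            ref_pos += count;
            break;
        case 'I': // insertion: gap in the reference
            gapped_reference.append(count, '-');
            gapped_query += query.substr(query_pos, count);
            query_pos += count;
            break;
        case 'S': // soft clipping: bases are in the query but not in the alignment
            query_pos += count;
            break;
        case 'H': // hard clipping: bases are not stored in the query at all
            break;
        case 'P': // padding: gap in both rows
            gapped_reference.append(count, '-');
            gapped_query.append(count, '-');
            break;
        default:
            throw std::invalid_argument{"unknown CIGAR operation"};
        }
        count = 0;
    }

    return {gapped_reference, gapped_query};
}
```

For example, expand_cigar("1S1M1D1M1I", reference, 0, "ACGT") walks through read1: the soft clip skips A, the match pairs A with C, the deletion pairs C with a gap, the second match pairs T with G, and the insertion pairs a gap with T.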

Views on files

Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:

#include <ranges>
#include <sstream>

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>

// SAM columns must be separated by tabs.
auto sam_file_raw = R"(@HD	VN:1.6	SO:coordinate	GO:none
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r003	0	ref	29	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	237	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
)";

int main()
{
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw}, seqan3::format_sam{}};

    auto minimum_length10_filter = std::views::filter(
        [](auto const & rec)
        {
            return std::ranges::size(rec.sequence()) >= 10;
        });

    for (auto & rec : fin | minimum_length10_filter) // only records with sequence length >= 10 will "appear"
        seqan3::debug_stream << rec.id() << '\n';
}

End of file

You can check whether a file is at its end by comparing begin() and end() (if they are the same, the file is at its end).
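The same begin/end comparison is how any single-pass input range signals exhaustion. As an analogy only, the sketch below uses std::istream_iterator from the standard library rather than the SeqAn file iterator; count_tokens is a hypothetical helper.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Analogy with a standard input iterator; NOT the SeqAn file iterator.
// Counts whitespace-separated tokens by advancing until the iterator equals the end marker.
std::size_t count_tokens(std::istream & input)
{
    std::istream_iterator<std::string> it{input};
    std::istream_iterator<std::string> const end{}; // default-constructed: end of stream

    std::size_t count = 0;
    for (; it != end; ++it)
        ++count;

    return count; // at this point it == end, i.e. the input is exhausted
}
```

An empty input yields an iterator that equals the end marker right away, which corresponds to a file whose begin() and end() compare equal.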

Formats

We currently support reading the following formats: