"Some" NGS analysis

Khristian Kotov

Introduction

While browsing job announcements on LinkedIn, I found a particularly attractive opening for an Algorithm Developer position at a company specializing in Next Generation Sequencing (NGS) analysis. I emailed the contact person and received a task with the following description:

"Please do some analysis on the file testReads.fastq.gz and write a small report on your findings.

Notes:

  • This file contains NGS reads for some human DNA regions.
  • The fastq file format is described at http://en.wikipedia.org/wiki/FASTQ_format, the Phred quality score is in sanger format not illumina.
  • The report should be in PDF format and figures rather than descriptive words is preferred. Please also make clear description on your figures.
  • The report could be anything that you find interesting on the file. For example, it could be patterns you found on the data, comments on the quality of the data or the variants detected on this data, etc."

Input

The input file contains 84205 four-line records; one example record is shown below:


@HKVCQHG01APQJG rank=0000016 x=176.0 y=10.0 length=108
CTCGCGTGTCAAGACTCGGCAGCATCTCCACCTTAAGATGAGCTCTAATTTGTTGTTATGTGGCTCCTGTAAGTATAGTGGAGAACAGTGACGATCGCGACACGCGAG
+
IIIIIIIIIHEEEHIG;<8;DIIIIG>43;//1111?9IIIIII?;66111?13?88??I???DDFFIH?<<>DIIIFDBBFB==@EIIIIIIIIIIIIIIIIIIIIF

The second line contains a sequence made of the letters \(A\), \(C\), \(G\), \(T\), and sometimes \(N\)

The fourth line encodes the Phred+33 quality of the measurement at each position as follows:

  • convert a symbol to its ASCII code (e.g. 'I' \(\rightarrow\) 73)
  • subtract 33, as the Sanger encoding starts at ASCII code 33 (e.g. Q['I'] = 73 - 33 = 40)
  • the probability of an error at a given position is then given by \(p = 10^{-Q/10}\)
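
As a quick illustration (a minimal sketch of my own, not part of the submitted code), this decoding takes only a few lines of C++:

#include <cmath>
#include <cstdio>
#include <string>

// Convert one Phred+33 quality character into an error probability
double errorProbability(char symbol) {
    int q = int(symbol) - 33;          // e.g. 'I' -> 73 - 33 = 40
    return std::pow(10.0, -q / 10.0);  // p = 10^(-Q/10)
}

int main() {
    std::string quality = "II;/?";     // a few symbols from the example record
    for (char c : quality)
        std::printf("'%c' -> Q=%2d, p=%.5f\n", c, int(c) - 33, errorProbability(c));
    return 0;
}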

The letter \(N\) appears in only 432 records, at positions with quality 0

Problems

I had neither performed an NGS analysis before nor had any idea of what I was expected to do

So I came up with a few simple checks, without knowing how practical they would be:

  • count the number of unique patterns of a certain length in the data
  • scan all sequences and identify patterns that repeat more often than others
  • group similar reads together and find how many clusters they form

For these checks I need:

  • an efficient engine to search for patterns in the data
    • avoid string comparisons in favor of hashing
  • fast machinery that compares sequences and measures their similarity
    • an all-to-all comparison of the reads takes \(\mathcal{O}(N^2) \sim \mathcal{O}(10^{10})\) iterations!
  • an algorithm to merge similar sequences into clusters

Toolbox

I coded up all of the necessary algorithms in several C++ files:

  • Hashing machinery is implemented in the toolbox.h file (the encoding idea is sketched after this list)

    • the sequence2number function converts a short (<32 symbols) sequence into a number
    • the NumericSequence class represents a sequence as an array of such numbers
    • the LookUpTable hash table is initialized with the patterns sought and provides "find" functionality
  • The Needleman-Wunsch sequence alignment algorithm is implemented in the alignment.cc file

    • alignmentScoreMatrix computes the score matrix for the two sequences being aligned
    • reconstruction inserts gaps based on the scores and achieves the best match
  • A hierarchical clustering algorithm is implemented in the cluster.cc file as a stand-alone executable

    • it reads a text file formatted as "read #, read #, distance\n"
    • it merges reads into clusters until the distance between reads is greater than some cutoff
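
To give a flavor of the hashing machinery, below is a minimal sketch of what a sequence2number-style encoder can look like; the 2-bit packing is my assumption of a natural implementation, and the actual toolbox.h code may differ:

#include <cstdint>
#include <cstdio>
#include <string>

// Pack a short (<32 symbols) A/C/G/T sequence into a 64-bit integer,
// two bits per symbol, so that patterns can be hashed and compared in O(1)
uint64_t sequence2number(const std::string& seq) {
    uint64_t code = 0;
    for (char c : seq) {
        code <<= 2;
        switch (c) {
            case 'C': code |= 1; break;
            case 'G': code |= 2; break;
            case 'T': code |= 3; break;
            default:  break;  // 'A' (and 'N') map to 0
        }
    }
    return code;
}

int main() {
    // "ACGT" -> 00 01 10 11 in binary = 27
    std::printf("%llu\n", (unsigned long long)sequence2number("ACGT"));
    return 0;
}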

Workflow

The pattern frequency test is implemented in the analysis.cc file

The search for similar reads is implemented in the overlap.cc file

Clustering the reads is completed by executing ./cluster output.csv CUTOFF (a sketch of the merging step follows)
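
The merging step itself can be sketched with a union-find structure; this is my own illustration of single-linkage merging over the "read #, read #, distance" input, and cluster.cc may be organized differently:

#include <cstdio>
#include <cstdlib>
#include <vector>

// Union-find with path compression; distinct roots identify clusters
static std::vector<int> parent;
static int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }

int main(int argc, char** argv) {
    if (argc < 3) { std::fprintf(stderr, "usage: %s input.csv CUTOFF\n", argv[0]); return 1; }
    std::FILE* in = std::fopen(argv[1], "r");
    if (!in) { std::perror(argv[1]); return 1; }
    const double cutoff = std::atof(argv[2]);
    const int nReads = 84205;                    // number of reads in the input file
    parent.resize(nReads);
    for (int i = 0; i < nReads; ++i) parent[i] = i;
    int a, b; double d;
    // every line of the input is "read #, read #, distance"
    while (std::fscanf(in, "%d , %d , %lf", &a, &b, &d) == 3)
        if (d <= cutoff) parent[find(a)] = find(b);   // single-linkage merge
    std::fclose(in);
    int nClusters = 0;
    for (int i = 0; i < nReads; ++i) if (find(i) == i) ++nClusters;
    std::printf("%d clusters\n", nClusters);
    return 0;
}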

Analysis: data variability

To warm up, let's count the number of unique patterns of a certain length \(L\) found in the data

The number of unique combinations of the 4 letters grows as \(4^L\) up to \(L \sim 10\), where the growth slows down
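
A minimal sketch of such a counter, reusing the 2-bit packed representation from the Toolbox section (again my own illustration rather than the actual analysis.cc code):

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_set>
#include <vector>

// Count distinct patterns of length L over all reads, using the
// 2-bit packed representation as the set key (valid for L < 32)
size_t countUnique(const std::vector<std::string>& reads, int L) {
    std::unordered_set<uint64_t> seen;
    for (const std::string& r : reads)
        for (size_t pos = 0; pos + L <= r.size(); ++pos) {
            uint64_t code = 0;
            for (int i = 0; i < L; ++i) {
                code <<= 2;
                switch (r[pos + i]) {
                    case 'C': code |= 1; break;
                    case 'G': code |= 2; break;
                    case 'T': code |= 3; break;  // 'A'/'N' map to 0
                }
            }
            seen.insert(code);
        }
    return seen.size();
}

int main() {
    std::vector<std::string> reads = {"CTCGCGTGTCAAGACT", "CTCGCGTGTCAAGACA"};
    for (int L = 2; L <= 12; L += 2)
        std::printf("L = %2d: %zu unique patterns\n", L, countUnique(reads, L));
    return 0;
}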

Analysis: pattern frequency test

Examining output files from the previous step with something like this:

sed -e 's|,| |g' *.csv | awk 'BEGIN{m=0;p=""} $1!~/ID/{if(m<$3){m=$3;p=$2}} END{print p}'

allows us to identify the most frequent relatively short patterns as follows:

length sequence frequency
20 ---------AAGACTCGGCAGCATCTCCA--- 47K
21 --------GAAGACTCGGCAGCATCTCCA--- 24K
26 TCAGACACGAAGACTCGGCAGCATCT------ 12K
30 TCAGACACGAAGACTCGGCAGCATCTCCAT-- 3.8K
32 CTTCCATTGACCACATCTCCTCTGACTTCAAA 2.2K


That is, ~2.6% of the reads (2.2K out of 84205) contain the CTTCCATTGACCACATCTCCTCTGACTTCAAA pattern

Analysis: grouping similar records

Grouping reads that are up to 9 mismatches/gaps (edit distance) apart gives 1662 clusters
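
The distance used here is the classic edit distance; below is a textbook dynamic-programming sketch with unit costs (my own illustration: the report's actual distances come from the Needleman-Wunsch code in alignment.cc):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Classic edit distance: the minimum number of substitutions,
// insertions, and deletions needed to turn a into b
int editDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = int(i);
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = int(j);
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i-1][j] + 1,                        // deletion
                                 d[i][j-1] + 1,                        // insertion
                                 d[i-1][j-1] + (a[i-1] != b[j-1]) });  // substitution
    return d[a.size()][b.size()];
}

int main() {
    std::printf("%d\n", editDistance("GATTACA", "GACTATA"));
    return 0;
}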

Cluster density diminishes slowly from the center of a cluster to its periphery

An excess in bin #10 suggests that clustering with such a cutoff merges some fine-scale structure

Using 8 for the cutoff removes the excess and results in 1992 clusters

What was I really expected to demonstrate?

I failed the test because I did not identify that there were several samples mixed in a single file

Wet labs sequence multiple samples in one run, tagging each sample with a "barcode"

De-multiplexing such data can be achieved by clustering the reads with similar beginnings

Grouping sequences with 10 consecutive matches within the first 15 symbols (barcodes2.cc) yields:

cluster barcode frequency
1 .{0,2}TCGCGTGTC 19.3K
2 .{0,2}TATCGCGAG 20.7K
3 .{0,2}GTGTCTCTA 21.4K
4 .{0,2}TCAGACACG 21.9K

(the .{0,2} pattern allows any 0, 1, or 2 symbols; usually it is 'C' for #1 and #3 and 'A' for #2 and #4)

Only about 0.5K reads are left out of these 4 major clusters (a toy read-assignment sketch follows)
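
To illustrate how a read can be assigned to one of these barcode clusters, here is a toy sketch using std::regex with the patterns above; barcodes2.cc itself relies on the matching machinery described earlier, so this is an illustration only:

#include <cstdio>
#include <regex>
#include <string>
#include <vector>

int main() {
    // The four barcodes recovered above; up to two extra leading symbols allowed
    std::vector<std::regex> barcodes = {
        std::regex("^.{0,2}TCGCGTGTC"), std::regex("^.{0,2}TATCGCGAG"),
        std::regex("^.{0,2}GTGTCTCTA"), std::regex("^.{0,2}TCAGACACG")
    };
    // The example read from the Input section starts with 'C' + TCGCGTGTC
    std::string read = "CTCGCGTGTCAAGACTCGGCAGCATCTCCACC";
    for (size_t i = 0; i < barcodes.size(); ++i)
        if (std::regex_search(read, barcodes[i]))
            std::printf("read assigned to cluster %zu\n", i + 1);
    return 0;
}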

What do these data tell us?

Aligning the reads to the human genome reveals targeted sequencing of the following genes:

chromosome gene id description
chr17 BRCA1 NM_027676 Breast cancer 1, early onset
chr13 BRCA2 NM_000059 Breast cancer 2, early onset


A simple workflow finds 25 (48) high-quality variants with respect to the hg19 (hg38) reference genome

Some of these can be found in dbSNP database: rs3752451, rs1801406, rs206075, rs206076, rs206080, rs169547, rs9534262, rs3092994, rs1799966, rs1060915, rs1799949

The four barcodes identify four different individuals (this can be checked via the homo-/heterozygosity of the variants)

Summary


I am happy I had a chance to play with a dataset of decent size and to show that I understand how important efficiency is. I also learned how to analyze these data.

Nonetheless, some points in the evaluation are still a mystery to me: "The data is generated with certain enrichment technology and certain sequencers." If only I knew what the implications of using different sequencers are, and why that would be important.

I did my best to jump into NGS data analysis, but I learned that little things I am not aware of can make a big difference. Still, I believe that learning such things is only a question of time for me.