We provide easy, affordable methods for obtaining information about your own genes. www.genomeliberty.com
Genetic Testing is Changing the Conversation
While this library policy is annoying, it does not put my health or life at risk. In the case of some medical policies, however, people’s health and the lives of their children can be threatened. Many hospitals and medical practices have set procedures and policies that, while intended to provide better outcomes for patients as a whole, do harm to individuals. As an example, a patient (let's call him Steve) has a heart attack; fortunately, he survives and is recovering with his family. Then, to help prevent another attack, Steve is told he could take Plavix, a drug that inhibits blood clots. But, although he does not know it, Steve is one of the 14% of people who carry a different, less-functional version of the gene (CYP2C19) that metabolizes the drug. This means the drug could actually be harmful to him (almost a 4X greater risk of stroke), even though it helps the other 86% of people. This is a completely avoidable risk to your health that can be quickly identified with a fast, cheap genetic test, and such tools are desperately needed by patients and their doctors.
At Genome Liberty, we are dedicated to providing this easy-to-order, affordable tool. By being a trusted partner in your personal health, Genome Liberty will help medical professionals and patients change the conversation. We aim to change the paradigm from one of group mentality to one of individually tailored treatment. Indeed, a simple genetic test would have shown that Steve should have been given a different drug.
Regulations meant to help the whole tend to disregard the individual. We at Genome Liberty believe that each person is an individual whose medical care should be personally tailored. A patient is not just a number and should be treated as an independent-minded and intelligent person.
We’ve made genetic testing easy, affordable, and delivered to you, already interpreted by a trained geneticist. If you are a doctor, would you recommend this to your patients? If you are a patient, do you feel this would be a good tool to bring with you to the doctor? In either case, it is your DNA, and your right to know, and we want to help.
[Embedded YouTube video]
We believe that access to your own genetic information is an inalienable right, and this right was recently reinstated by the U.S. Supreme Court.
Thus, we aim to provide readily accessible genetic testing information that can help you and your doctor choose the best course of action that leverages the latest advances in genetics, genomics, and personalized medicine.
#genetic testing #genome #Bioinformatics #Crowdfunding #RocketHub #JeffreyRosenfeld #ChristopherMason #SupremeCourt #GenePatents #humanGenome #Wellness
Political Genomics: All the President’s DNA
As the 2012 political season heats up, the Democratic and Republican parties (and their Super PACs) are trying to do whatever they can to get an edge for their candidates – from the presidential race to thousands of other state and local elections. Usually, the emphasis is on finding political and character flaws in one’s opponent rather than highlighting one’s own qualities.
Offensive or Defensive Genetics
So how might personal genetic information enter the political sphere? A candidate could have his or her genome profiled and publicize the genetic traits that inspire strength and confidence. Attributes such as longevity, a low propensity for cardiac disease or cancer, and the absence of a mutation predisposing to Alzheimer’s could emphasize the viability of a candidate. This would extend the current practice where (some) candidates release limited health information and tout the longevity of their parents and relatives. Moreover, the candidate could point to features of their ethnic background, perhaps to emphasize ancestral diversity (or purity), depending on the specifics of the political contest.
Personal genome screens can also shed light on (if not necessarily predict) complex traits such as intelligence and obesity. A flamboyant governor could argue that his excess weight was due to an inherited predisposition rather than to a lack of exercise or willpower.
A more insidious possibility would be for a candidate to surreptitiously obtain a DNA sample from his opponent and have it profiled. (Reporters for New Scientist magazine showed this was feasible a few years ago.) As any CSI aficionado can attest, obtaining a suitable sample is fairly trivial: a discarded diner coffee cup or hair left in a comb. If the genomic profile pointed to a predisposition to serious diseases, such as Parkinson’s, this would be prized ammunition for the campaign.
An alternative approach would be for an independent organization to obtain DNA from the candidates and then to provide reports to the public. This would allow the electorate to evaluate if the candidates’ genetic profiles were in any way relevant. Extreme perhaps, but this would acknowledge the inevitability of political genetics and would prevent campaigns from only releasing a partial or biased genetic profile.
Privacy or Responsibility
Should genetic profiling of political candidates be allowed or even required? Some argue that a person’s genetic information is private and not for public dissemination. Everyone has quirks in their genome they would not necessarily want shared with others. Additionally, a person’s genetics are (for now) completely beyond their control. This falls in line with the 2008 Genetic Information Nondiscrimination Act (GINA), which prohibits employers and health insurance companies from discriminating against people based upon their genetics.
But those in favor of releasing genetic information for candidates view it as an issue of disclosure just as candidates disclose (some) information about their health and finances. Does the public have a right to know if the future President is predisposed to a debilitating genetic disease? Had voters known that President Reagan was likely to get Alzheimer’s while in office, this would surely have affected their voting. This viewpoint was summarized by the Harvard geneticist George Church in the Wall Street Journal: "I would be shocked if Americans and people in other countries don't want this type of data [about political candidates]. It is not like we are collecting horoscope data or tea-leaf data. These are real facts, just as real as bank accounts and the influence of political action committees or family members."
The legality of testing someone’s DNA without consent has not been clearly determined. There are privacy laws that might be construed to cover genetic material, but these laws were written long before genome sequencing became a reality. Daniel Vorhaus, an expert on genetics law, discussed the issues on the Genomics Law Report blog. After analyzing the federal and state laws that might govern the issue, he concludes: “There exists a wide range of scenarios where surreptitious genetic testing, should it occur, would fall squarely within a legal gray area.”
Of course, any potential intrusion into the genetic privacy of a political candidate would likely spur extreme anger from that candidate’s campaign and provoke a full slate of legal and other recriminations. Unapproved publication of the genetic profile of a sitting president would surely unleash a multitude of national security laws.
While the technology for political genomics exists today, there have still not been any reported cases of offensive or defensive genetic testing in the United States. Perhaps candidates are scared of opening a Pandora’s box: they may have as much, if not more, hidden in their own genomes as their opponents do. Even so, we are probably moving towards an era in which politicians will come under increasing pressure to disclose their genomes. American society is moving towards knowing as much as possible about our presidential candidates, and their genomes are an important piece of that knowledge.
Next Generation Variant Callers
In this post, we will focus on the calling of short sequence variants such as SNPs and indels and will not discuss structural variants, which present their own large set of challenges.
Traditional Variant Calling
The general technique for calling variants from sequencing data follows this paradigm:
1. Sequence short reads from an Illumina, SOLiD or Ion Torrent machine.
2. Align the reads to the standard reference sequence.
3. Iterate through the genome to identify locations where there are a sufficient number of reads to indicate the presence of a non-reference base in either a heterozygous or a homozygous context. These locations are called as SNPs.
4. Use a gapped mapping technique to identify locations where there is either a compression or an expansion of the genomic sequence indicating an indel.
5. Integrate the results of #3 and #4 to produce a set of variant calls for downstream analysis.
This is obviously a rough summary of variant calling and it skips steps such as base quality score recalibration and the Base Alignment Quality (BAQ) technique that are often used to improve the robustness of variant calls. Importantly, alignment and base calling are performed sequentially, and SNPs and indels are called separately. These two features lead to some shortcomings in the variant calls that can be produced.
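To make step 3 concrete, here is a minimal, illustrative sketch of naive pileup-based SNP calling in Python. It is not the algorithm used by any production caller; the thresholds and the input format are simplifications invented for this example.

from collections import Counter

# A naive genotype call at a single position from the bases of the reads
# covering it. The thresholds are arbitrary illustrative values.
def call_snp(ref_base, pileup_bases, min_depth=10, het_frac=0.2, hom_frac=0.8):
    depth = len(pileup_bases)
    if depth < min_depth:
        return "no_call"
    counts = Counter(pileup_bases)
    # Most common non-reference base and its count (0 if every read matches).
    alt_base, alt_count = max(
        ((b, c) for b, c in counts.items() if b != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    alt_frac = alt_count / depth
    if alt_frac >= hom_frac:
        return "hom_alt"
    if alt_frac >= het_frac:
        return "het"
    return "hom_ref"

# 12 reads cover the site, 5 support a G over the reference A -> heterozygous.
print(call_snp("A", list("AAAAAAAGGGGG")))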
The requirement for an initial alignment of short (100bp) sequencing reads limits the amount of variation from the reference genome that can be detected. A run of 15 consecutive mismatched bases in a genome will not be detected, since the aligner will be unable to properly map the reads supporting that variant. This shortcoming remains regardless of the number of reads at that locus, because the number of mismatches allowed by an aligner is limited in order to let it run in a reasonable amount of computational time. The procedure of calling SNPs and indels separately and then integrating the calls can also lead to problems at complex loci containing a SNP opposite an indel. Let’s assume that the SNP caller made this call of a single SNP:
ATGTATGTA
ATGTGTGTA
and the indel caller produced this call of a 3 base deletion:
ATGTATGTA
ATGT---TA
Should it be assumed that there is a heterozygous SNP opposite a heterozygous indel, or is there a more complex variant that is also supported by the reads? When a program is only tasked with finding a particular type of variant, such as an indel or a SNP, it will never detect anything else.
Fortunately, some newer variant callers are more sophisticated and call SNPs, indels, and complex variants simultaneously. There are many different variant callers available; a quick scan of any issue of Bioinformatics will include at least one, and probably more. I will highlight three variant callers that I think are leading the way and take different approaches to the task: FreeBayes, which performs local physical phasing; the Complete Genomics caller, which uses local de novo assembly; and Cortex, which relies completely on de novo assembly.
FreeBayes - haplotype-based calling
The FreeBayes algorithm was developed by Erik Garrison in Gabor Marth’s group at Boston College and is one of the main callers utilized by the 1000 Genomes Project. It relies on an alignment of sequencing reads to the genome, but rather than just looking at individual bases, it calls variants based upon haplotypes up to the length of sequencing reads. Through a complex computational process (outlined in this preprint), the program takes reads from a single or multiple samples and determines the number of different haplotypes that appear in a local region. It then gives a probability for the presence of each of the haplotypes in each of the samples. Based upon these probabilities, FreeBayes gives the best approximation for the diploid genotypes in each individual. This technique has a few important benefits over a traditional variant caller:
A large number of different alleles can be called at each location across samples. For many variant callers, each location has a reference and an alternate allele and each sample is reported as either homozygous reference, heterozygous, or homozygous variant. With FreeBayes, there could potentially be 2,000 different alleles represented among the diploid genomes of the 1000 Genomes samples. In practice, this number of haplotypes is not found, but this is due to a limitation of human genomic variation and not the algorithm.
Nearby variants are identified on the same haplotype so they are in phase. When there are two SNPs separated by a reference base, it is not generally noted whether they are on the same chromosome or opposing chromosomes. These two possibilities are critical to distinguish when performing functional analysis to determine if variants are synonymous or non-synonymous. For example, given the reference sequence ACA and SNPs in both the 5’ and 3’ A, it is important to know whether there is a homozygous sequence TCG, or a heterozygous ACG and TCA. With calls from FreeBayes, this phasing can extend for several dozen bases.
All types of sequence-level variants are called simultaneously. Rather than employing a separate caller for SNPs, indels, and multi-nucleotide polymorphisms, FreeBayes calls all haplotypes at a locus regardless of what type of difference they contain relative to the reference genome. In a long haplotype, there could be a mix of multiple SNPs and indels. Additionally, the two alleles at a locus could be identified as extremely different from each other if that categorization is supported by the underlying reads.
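As a rough illustration of the haplotype idea (and not of FreeBayes's actual Bayesian model), the sketch below treats every read that fully spans a short window as one observation of a candidate haplotype, so nearby variants seen on the same read stay phased. The coordinates, reads, and frequency cutoff are invented for the example.

from collections import Counter

def window_haplotypes(reads, window_start, window_end, min_frac=0.2):
    # reads: (alignment start position, read sequence) pairs; only reads that
    # completely span the window contribute a haplotype observation.
    observed = []
    for start, seq in reads:
        if start <= window_start and start + len(seq) >= window_end:
            observed.append(seq[window_start - start:window_end - start])
    counts = Counter(observed)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items() if n / total >= min_frac}

# Reference window is ACA. Reads supporting ACG and TCA in roughly equal
# numbers point to two phased heterozygous haplotypes, not a homozygous TCG.
reads = [(100, "TTACGTT"), (102, "ACGTTAA"), (100, "TTTCAGG"), (101, "TTCAGGC")]
print(window_haplotypes(reads, window_start=102, window_end=105))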
The one major shortcoming of FreeBayes is that it is reliant on a prior alignment of reads to the genome. This requirement limits the scale of variants that can be called since highly divergent sequences would not be aligned properly by any algorithm. But with the lengths of reads increasing, this should become less problematic. With a 250bp read, many local mismatches or gaps could occur without preventing a valid alignment.
The Complete Genomics Caller
The variant caller from Complete Genomics (CG) is innovative and is based entirely upon a local de novo assembly technique. The CG variant caller takes the view that the best way to call variants is to use an assembly. Since a complete de novo assembly of a genome is extremely computationally taxing, it does the next best thing and uses a local assembly. The technique is encapsulated in this figure from their paper:
[Figure from the Complete Genomics paper: reads aligned to the reference with a per-base reference score, and local de novo assembly of the reads recruited around a low-scoring region.]
Reads are all aligned to the genome (blue) and a reference score is calculated based upon the number of reads at each location and the quality of the alignment. This score gives the likelihood of a base being homozygous reference. Whenever the reference score drops (purple box), then an assembly is performed to determine the variation that has occurred. Based upon the alignment of paired ends of reads (blue), all of the reads that are aligned with the variable base and a surrounding region are recruited (yellow). These yellow reads are then de novo assembled in order to determine their sequence.
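The sketch below is a simplified illustration of that reference-score idea, not Complete Genomics' published scoring function; the weights, threshold, and toy pileup are invented for the example.

def reference_scores(ref, pileups, depth_weight=1.0, mismatch_penalty=10.0):
    # pileups holds, for each reference position, the list of read bases
    # covering that position. The score is high when many reads agree with
    # the reference and drops when reads disagree or coverage is thin.
    scores = []
    for ref_base, bases in zip(ref, pileups):
        mismatches = sum(1 for b in bases if b != ref_base)
        scores.append(depth_weight * len(bases) - mismatch_penalty * mismatches)
    return scores

def regions_to_assemble(scores, threshold=0.0):
    # Contiguous stretches whose score falls below the threshold; these are
    # the regions that would be handed to local assembly (or left as no-calls
    # if too few reads can be recruited).
    regions, start = [], None
    for i, s in enumerate(scores):
        if s < threshold and start is None:
            start = i
        elif s >= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions

ref = "ACGTACGT"
pileups = [list("AAAA"), list("CCCC"), list("GGGG"), list("TTTT"),
           list("AAAA"), list("TTTT"), list("GGGG"), list("TTTT")]
print(regions_to_assemble(reference_scores(ref, pileups)))  # [(5, 6)]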
Here are the benefits of this method of variant calling:
Every variant, including SNPs, is called using de novo assembly rather than by calling SNPs first and then looking for more complex variants. In this way, variants close to a SNP, such as other SNPs or indels that would confound variant detection in a traditional method, are included in the process.
Because there are times when the evidence is insufficient to make a variant call properly, the program includes the possibility of calling a base as a ‘no-call’. These no-calls are regions where there is not enough read data to either support a call of homozygous reference or any type of variant. In the normal procedure, whenever no variant is called, the sequence is assumed to be reference whether or not this is correct. For downstream analysis, these no-call regions can be filtered to prevent unjustified conclusions.
The variants are called over a small region of the genome rather than at each specific base individually. This allows for the calling of longer haplotypes, just as with FreeBayes. Even better, since assembly is used, the length of a variant that could be called is not limited to an individual read.
A drawback to this technique is that the reads are relatively short (35bp), so there will be regions of the genome where neither end of a read pair can be uniquely aligned. In such repetitive regions, there will not be any reads whose mates map uniquely to anchor the local assembly. Additionally, this variant caller is only available as part of the Complete Genomics package, and one cannot simply run it separately.
Cortex
The Cortex algorithm was developed by Zamin Iqbal and Mario Caccamo in England. It is based upon the de Bruijn graph model of genome assemblers, but takes a novel approach by using what their paper calls colored de Bruijn graphs. The ideal method to call variants in a genome would be to do a complete de novo assembly and then align it to the reference genome. This is not feasible with current sequencing technologies due to their short read length and the complexity of the human genome. What Cortex does is assemble the genome as much as possible and then use this assembly to call variants. If two genomes are being assembled and compared, each is assigned its own color in the graph. After the assembly graph is completed, a variation between the genomes shows up as a divergence between the two colors. In the case of comparing a single sample to the reference, the reference genome is also broken up into a de Bruijn graph and given its own color for the comparison. After a variant has been detected between the reference graph and a genome of interest, the assembled contig is mapped back to the reference genome so that the correct genomic coordinates of the variant can be determined.
If there were two genomes colored as blue and red, then variants would be called as shown in this figure:
[Figure: variant calls as bubbles in a two-color (blue and red) de Bruijn graph, illustrating a heterozygous variant, a homozygous variant, and a repeat.]
Since genomes are diploid, there are two red and two blue lines. The heterozygous variant has one blue allele matching the two red alleles with the variant blue allele separated. The homozygous variant has both blue and red lines separate from each other, and a repeat is represented by a bubble in the graph where there are multiple copies of the same sequence. While this graphical color view simplifies the procedure for explanation, in reality the work is all done within complex computational data structures. Because Cortex does an almost full de novo assembly, it has larger time and memory requirements than other variant calling algorithms. Even so, it has been used on large datasets including whole populations from the 1000 Genomes Project.
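As a toy illustration of the colored idea (far short of Cortex's full de Bruijn graph machinery), the sketch below simply compares the k-mer sets of two samples; k-mers private to one color are where the colored paths would diverge, i.e. candidate variant bubbles. The k value and read sequences are made up for the example.

def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def private_kmers(sample_a, sample_b, k=5):
    # k-mers seen in one sample's reads but not the other's.
    a = set().union(*(kmers(r, k) for r in sample_a))
    b = set().union(*(kmers(r, k) for r in sample_b))
    return a - b, b - a

blue = ["ACGTAGGTACC", "CGTAGGTACCA"]  # the blue sample carries ...AGGTA...
red = ["ACGTACGTACC", "CGTACGTACCA"]   # the red sample carries ...ACGTA...
only_blue, only_red = private_kmers(blue, red)
print(sorted(only_blue))
print(sorted(only_red))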
Here are a few of the benefits of Cortex:
The variant calling does not require a reference and can be performed directly between two or more sample genomes. This is extremely beneficial for researchers who work on non-model organisms that lack a reference genome. Using Cortex, they could directly examine the diversity of genomes of a species.
When using Cortex, there is no alignment at all to a reference, just the combined assembly of the sample genomes and then the calling of variants between them. This is extremely helpful if the sample genomes have many loci where they all have the same variant relative to the reference. Only the variants between the sample genomes would be detected by Cortex.
Cortex can call any type of variants between the genomes. As with FreeBayes and the Complete Genomics caller, this means that there is no difficulty in calling complex variants or multi-nucleotide polymorphisms. Since a de novo assembly is performed, the length of variants is theoretically unlimited, but in practice, it is limited by the coverage of the genome and the particulars of the de Bruijn graph construction.
Conclusion
As we have seen, there are many different innovations being made in the field of variant calling. Since these tools are either open source (FreeBayes and Cortex) or have had their details published (CG), their novel features can be incorporated into other tools. Along these lines, a new version of GATK is being released that will incorporate haplotype-based calling and some level of local de novo assembly. Due to their greater complexity, these tools require more computational power than traditional variant callers, but their increased power is worth the trade-off. Even if a researcher is only concerned with SNPs, using a sophisticated variant detector will ensure that their SNP calls are more robust, with fewer false positives and incorrect calls.
THE ELUSIVE APPEAL OF EXOME SEQUENCING
In the past few years, the price of sequencing has plummeted, and the complete sequence of an individual can now be obtained for a few thousand dollars. Even so, many scientists have opted to sequence just the exome (coding regions) of an individual and to ignore the rest of the genome. This focus on the exome has some justification, but I think it is shortsighted: despite the higher cost, sequencing a complete genome is more valuable, even if that means sequencing fewer samples.
The usual arguments in favor of exome sequencing run roughly as follows:
A. The sequencing of an exome is much cheaper than the sequencing of a genome; it must be substantially cheaper to sequence 1% of the genome than the whole genome.
B. We don’t understand how to interpret non-coding variants, and therefore we should limit our sequencing to genes that are well annotated.
C. Variants that are associated with a genetic disease are more likely to be found in a coding region, since they directly alter the structure of a protein.
I am not going to deny that there is some validity to these points, but I don’t think that they outweigh the shortcomings of exome sequencing and the benefits of whole genome sequencing that I will outline below. I understand that this is a contentious issue, and I welcome your comments whether you agree or disagree with my position.
1. Cost
The first reason people generally look to exome sequencing is cost. Intuitively, sequencing 1% of the genome (the exome) should be cheaper than sequencing the entire genome. While this is true, the price differential is nowhere near 1:100 and is closer to 2:1 or 3:1, depending upon how the costs of sequencing are calculated. Currently, a whole genome costs ~$4,000 and an exome costs ~$1,500. Why are these prices so close to each other? The answer is that the actual reagent cost of running the sequencer is not the only factor in the cost of a genome or an exome. Either type of experiment requires library prep, along with the costs associated with setting up a sequencing run of any size. For an exome, there is the additional cost of purchasing the selection kit, which allows one to extract the coding sequences from the raw DNA either using a microarray or in solution. This kit can cost several hundred dollars and is therefore a substantial portion of the cost of exome sequencing.
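A quick back-of-the-envelope comparison using those approximate prices makes the point; the budget figure below is arbitrary.

genome_price, exome_price = 4_000, 1_500   # approximate prices quoted above
budget = 45_000                            # an arbitrary illustrative budget
print(budget // exome_price, "exomes vs", budget // genome_price, "genomes")
# roughly a 3:1 ratio of exomes to genomes, nowhere near 100:1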
Because of the lack of a strong cost differential, the economic argument favoring exome sequencing is not very strong. For the same amount of funding, a researcher would need to choose between, say, 30 exomes and 10 genomes. While 30 samples are obviously better than 10, this is not a great differential; it is much less than the 1:100 ratio that one might naively expect from the relative sizes of the exome and the genome. An additional factor affecting the cost of exome sequencing is the time required to perform the hybridization: the Nimblegen protocol requires 72 hours of hybridization, and the Agilent approach requires 24 hours. These times add a delay to the turnaround from sample to sequence, which may be problematic for clinical applications. As an example, the Ion Torrent machine is being pitched as a tool for rapid sequencing that will produce results in a single day. When an exome is targeted using Agilent or Nimblegen, this grows to at least two to four days.
2. Exome coverage
The definition of an exome is somewhat elusive. It can refer to:
a) All of the coding exons of the genome
b) (a) plus microRNA genes
c) (a) plus 5’ UTR and 3’ UTR regions
d) Unannotated transcripts that have been discovered in RNA-seq experiments or from the ENCODE project
e) All "functional" portions of the genome
These five definitions include very different portions of the genome, and some of them, such as (e), are difficult to define in and of themselves. It has been shown in multiple studies that there is pervasive transcription across substantial portions of the genome. Should all of these regions be considered part of the exome? In general they are not included in the exome kits, since their inclusion would push the size of the exome much closer to that of the genome, and any potential savings from the lesser amount of sequencing would shrink. Instead, the exome is generally limited to coding genes with some level of annotation, along with microRNAs and, to some extent, UTRs.
Each of the different vendors that produce exome kits has taken a different approach to defining the exome. A recent paper (http://www.nature.com/nbt/journal/v29/n10/full/nbt.1975.html) compared the exome selection offerings from the three main players in the field: Agilent, Nimblegen, and Illumina.
This figure gives a great comparison of the different technologies. Firstly, the approaches to selecting the exome sequence differ: Nimblegen uses overlapping DNA baits, Agilent uses RNA baits which are distinct but contiguous, and Illumina uses distinct DNA baits that are not contiguous and contain breaks of un-targeted sequence. Because of this, Nimblegen uses many times the number of probes of the other two technologies. The rest of the figure shows Venn diagrams illustrating the overlap between the targeted regions. For two different definitions of human genes, RefSeq and Ensembl, there is substantial agreement between the technologies, as indicated by the 28.5 and 28.4 Mb of sequence that they all cover. The biggest discrepancy is with regard to UTR regions, where Illumina targets 28 Mb that are missing from the other two platforms.
A different technique to assess coverage is to look at the amount of the exome target from a particular kit that is covered at a sufficient threshold to make a confident call of a variant. For many scientists, a threshold of 20x coverage is required to trust a variant derived from an exome sequence; any loci with less coverage are ignored. Since the typical sequencing coverage for an exome is 80x, in theory it should be no problem to achieve 20x coverage of the entire targeted region. In practice, this is not the case, for three reasons. Firstly, exome sequencing, as with all sequencing, produces reads in a statistical distribution and not evenly along the genome. Randomly, some regions are going to have their DNA sequenced more often and thus have a higher number of reads; this idea forms the basis of the famous Lander-Waterman statistics that are used for designing sequencing projects. The second reason for variation in coverage is that some of the baits used for selecting the exomic DNA will have a higher affinity than other baits, mainly due to GC content. Those probes with higher affinity for their targets will produce greater amounts of sequenced DNA. The final concern is due to the repetitive nature of the genome. The selection probes need to target a unique location in the genome to ensure that they are truly obtaining the DNA that they intend to select. If the targeted region is repeated in the genome, then sequence from all of the matching regions will be selected equally. Many human genes share domains with other proteins, and any shared sequences cannot be targeted. This is equivalent to the problem of uniquely mapping short sequencing reads: any reads that map to more than one location in the genome cannot be uniquely placed and are generally discarded. These concerns are illustrated in this figure from Agilent regarding their SureSelect sequencing:
This is an old figure, but I think that while the numbers might have changed a bit, the overall message remains: the read depth is extremely variable, and you do not achieve anything close to 100% coverage of the exome. While accurate data is available for 80% of the exome (depth > 20x), this also means that 20% of the exome is missed. In other words, for a disease study where an exomic variant correlates with the disease, there is a one-in-five chance of not having the variant included in the data. A researcher could conclude that there is no coding variant associated with their disorder when, in actuality, it simply fell into the 20% that was missed. An error level of 20% is not trivial and cannot be lightly dismissed.
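To illustrate why a deep average coverage does not guarantee that every targeted base reaches 20x, here is a toy calculation (assuming SciPy is available). The Poisson model is the idealized uniform-sampling case in the Lander-Waterman spirit; the overdispersed negative binomial, with arbitrarily chosen parameters, mimics the capture biases described above. Neither is a model of any particular platform.

from scipy.stats import poisson, nbinom

mean_cov, min_cov = 80, 20

# Idealized uniform sampling: essentially every base reaches 20x.
print("Ideal P(coverage >= 20):", poisson.sf(min_cov - 1, mean_cov))

# Overdispersed coverage (variance much larger than the mean): a noticeable
# fraction of the target stays below 20x even though the average is 80x.
n = 1.0                  # dispersion parameter; smaller means more biased
p = n / (n + mean_cov)   # keeps the mean at 80x
print("Biased P(coverage >= 20):", nbinom.sf(min_cov - 1, n, p))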
3. Whole Genomes
When a whole genome is sequenced, many of the issues surrounding exome sequencing are not relevant. There is no need to buy a hybridization kit or to wait for the hybridization to complete. While there are sequencing biases (as there are in any sequencing experiment), there are not the additional biases introduced by the exome selection. Overall, there is probably the standard 5% error in sequencing, giving a confidence level of 95%. But the biggest gain from whole-genome sequencing is that the entire genome (excluding some unclonable regions) is obtained. If one wants to focus on the exome because it is easier to understand and interpret, one can easily filter out the non-coding portions of the genome to obtain an in silico exome. This is an easy operation, and if a positive result is not found in the exome, you already have the rest of the genome sequenced and can begin looking for an intronic variant related to splicing, or a non-coding promoter or enhancer variant. In a traditional exome experiment, this is not possible: if no variant is found in the exome, there is no result, and one needs to go back and sequence the whole genome from scratch.
To give a picture of the fraction of disease-associated variants that are coding or non-coding, I looked at the UCSC collection of GWAS studies. The current list contains 5,454 unique SNP loci that were identified as part of a GWAS study. Of these SNPs, 3,047 (56%) are not within coding genes. Thus, more than half of the identified important genomic variants are not in coding regions and would not be covered by exomes. (Some of these SNPs may be in UTRs or non-coding RNAs, which are targeted by some of the platforms.) I see this as a betting situation. Would you rather spend $1,500 and have a 44% chance of getting the answer, or spend $4,000 and have a 95% chance of getting the answer? I think that the $4,000 genome is much more reasonable. Just because we don’t understand non-coding sequence does not mean that we can or should ignore it. As scientists, we have an obligation to try our best to investigate human disease and not to focus only on things that are easy to understand.
As a final point, there has been some recent talk concerning variants that are only found by exome sequencing and not genome sequencing. These results are not an apples-to-apples comparison. The exomes are generally sequenced at 80x coverage, while the genomes are sequenced at 30x coverage. For the specific variants under discussion, 80x coverage is required to identify them with any technique, whether that coverage is of just the exome or of the entire genome. If the whole genome were sequenced to 80x for a true comparison, then I am confident that there would not have been an advantage for the exome over the genome.
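As a sketch of that in silico exome idea, restricting whole-genome calls to coding regions is just an interval filter. The coordinates below are invented for illustration; a real pipeline would use a BED file of exons and a VCF of variants.

def in_silico_exome(variants, exons):
    # variants: (chromosome, position) tuples from a whole-genome call set;
    # exons: (chromosome, start, end) intervals. Returns the coding subset,
    # while the full call set stays available for later non-coding analysis.
    coding = []
    for chrom, pos in variants:
        if any(c == chrom and start <= pos < end for c, start, end in exons):
            coding.append((chrom, pos))
    return coding

variants = [("chr1", 1500), ("chr1", 9200), ("chr2", 3050)]
exons = [("chr1", 1000, 2000), ("chr2", 3000, 3200)]
print(in_silico_exome(variants, exons))  # chr1:1500 and chr2:3050 fall in exons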