Authors:
Jong Bhak1, Taehyung Kim2, Jongsun Park, Sunghoon Lee2, Sungwoong Jho1, Suan Jho1, Minseob Lee, Rosalynn Gill, George Church3, and Byungchul Kim1
Affiliations:
1. Genome Research Foundation, Suwon, 443-270, South Korea.
2. TheragenEtex, South Korea
3. Harvard Medical School, Boston, MA, USA
Contacts:
Unique Paper Identifier: 2011.11.05.01.01.JongBhak
Abstract
A Caucasian female human genome analyses are reported. It was sequenced using Illumina GA2 next generation sequencers and 90GB of short DNA fragments of ~90 base pairs were obtained. Overall coverage on to the reference human geome (HG19) was 99.2% with sequencing depth of 33 folds. 3.5 million SNPs, 200,000 short indels, 10,000 long structural variations, and 125,000 copy number variations were found. We found 10,000 non-synonymous mutations resulting in 100 genes affected by amino acid changes. Out of 100, we found 2 presumably critical mutations that resulted in some structural changes. The whole analysis was carried out by a bioinformatics pipeline called 'Genome Engine' which finished the whole analysis process in about 2 weeks, semi-automatically. The donor of the genome is Rosalynn Gill who donated all the rights and information associated with the data to the public as a part of PGP (Personal Genome Project). Gill's genome was the first publically available female human genome and the data were available from http://bioftp.org since April 2010 with analyses results. Here we present the second stage comparative analyses of her genome.
Introduction
NGS (next generation sequencers) produced many personal genomes in various research projects such as PGP (Personal Genome Project), 1000 genome project, YH genome project of China, Korean Genome Project (KGP) by Genome Research Foundation in Korea, Japanese genome project, and Indian genome project. By 2011, the number of genomes has increased from 10s in 2010 to hundreds. However, relatively few female human genomes have been competely publicized. In 2008, Leiden university announced that they had sequenced the first female human genome. However, the claimed to be the first female human genome data were not publicized nor its research report was announced by Nov. 2011. In the mean time, PGP has publicized the first female genome, Rosalynn Gill's, on the web in April 2010 with extensive analyses that were similar to the previous genomes such as James Watson, YH, and SJK. Gill was PGP number 9 and many of phenotypic and clinical data points were also public at the time of data publication.
The data have been used to be compared with various existing genomes and 10s of Korean genomes that have been available through KPGP (Korean Personal Genome Project). Here, we report the second stage analyses of PGP9 using automated genome pipeline called "Genome Engine". Genome Engine was developed to analyse the first Korean human genome in 2008. It is a personal genome analyzer with common functions such as detecting SNPs, Indels, SVs, CNVs, chart drawing functions, and genome browsing. The main idea of the second stage female genome analysis was to apply an already common genome analysis package to annotate it as much as possible with minimum effort.
Results
|
Read length (bp)
|
Number of reads
|
Total bases (Gb)
|
Rate of mapped reads by BWA
|
Rate of unmapped reads
|
|
|
|
|
|
|
Illumina sequence data result by using short read mapping program.
|
Read length (bp)
|
Numbers of read (M)
|
Total bases(Gb)
|
Numbers of aligned reads(M)
|
Aligned bases (Gb)
|
Percentage of genome covered
|
Coverage depth
|
|
|
|
|
|
|
|
|
|
Assembler
|
Contigs
|
n50(bp)
|
Minimum Length (bp)
|
Total Length (bp)
|
Scafolds
|
Reads Used
|
|
|
|
|
|
|
|
|
|
|
Genotyped
n
|
Samtools
|
GATK
|
|||
|
n
|
%
|
n
|
%
|
|||
|
covered by sequence
|
Homozygote
|
|
|
|
|
|
|
Heterozygote
|
|
|
|
|
|
|
|
All
|
|
|
|
|
|
|
|
Concordant calls
|
Homozygote
|
|
|
|
|
|
|
Heterozygote
|
|
|
|
|
|
|
|
All
|
|
|
|
|
|
|
|
All disagreements
|
GT>Seq
|
|
|
|
|
|
|
Seq>GT
|
|
|
|
|
|
|
|
Other discordances
|
|
|
|
|
|
|
|
|
|
Diploid
|
||
|
|
|
Total
|
Novel
|
|
|
non-exonic
|
Homozygous
|
|
|
|
|
Heterozyous
|
|
|
||
|
Subtotal
|
|
|
||
|
UTR
|
Homozygous
|
|
|
|
|
Heterozyous
|
|
|
||
|
Subtotal
|
|
|
||
|
CDS
|
Homozygous
|
|
|
|
|
Heterozyous
|
|
|
||
|
Subtotal
|
|
|
||
|
Sum
|
Total
|
|
|
|
|
|
Total
|
Homozygous
|
Heterozygous
|
|
Total SNPs
|
|
|
|
|
Autosomal SNPs
|
|
|
|
|
CDS nsSNPs
|
|
|
|
|
Novel
|
|
|
|
|
non-exonic
|
|
|
|
|
UTR
|
|
|
|
|
CDS
|
|
|
|
List of nsSNPs detected in the PGP9 genome.
Total number of calls and fraction that match previous entries in dbSNP.
|
|
Number of short indels
|
%dbSNP
|
|
Heterozygotes
|
|
|
|
Homozygoues
|
|
|
|
All indels
|
|
|
Homozagous deletion figure -
Heterozagous deletion figure -
Annotation
Out of 90GB short DNA reads, 6% were not matched to the human reference. These were subject to de novo assembly program XXXX. We found 2 genes in the unassigned salvaged fragments. We assigned 25,xxx ORFs to PGP9. There were YYY genes mutated non-synonymously and they are listed in Table 1.
What are the mutated genes?
Any gene that is causing any clinical symptom to Gill?
PGP9 genome statistics with numbers of variants found (Polyphen 2 prediction, PharmGKB or HuGENet, GeneTests, nonsense and frameshift mutations, priority
|
Individual
|
Number of nonsyn. with
probably
damaging
Polyphen 2
prediction
|
Number of vars in
PharmGKB or
HuGENet
|
# of nonsyn
vars in genes
with clinical
testing
(GeneTests)
|
# of nonsense
and frameshift
mutations
|
# with
priority
score of
4 or
more
|
|
PGP9
|
|
|
|
|
|
SNPs Potestially associated with clinical phenotypes derived from the database of human gene mutation data (HGMD), OMIM, SNPedia and other hypotheses. DM.
|
Gene Symbol
|
HGMD
|
Phenotype
|
Disease Realationship
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Result of homozygous and heterozygous OMIM HGMD.
|
|
Gene Symbol
|
OMIM ID
|
Phenotype
|
Disease Realationship
|
|
homozygous
|
ABCF1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
heterozygous
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The ancestry of PGP9 was European maternally (XXX type) and paternally (XXX), Fig. 1. a.
We compared PGP9 with 10 other genomes.
How many unique SNPs in her genome? (compared to dbSNP and other genomes)
Compare Gill with Maryana genome.
|
Category
|
Velvet
|
|
|
|
Number of Contigs
|
Length (bp)
|
|
Total
|
|
|
|
Hs build 36.3
|
0
|
0
|
|
Hs GRCh 37
|
|
|
|
Hs Alt
|
|
|
|
Hs Other
|
|
|
|
Chimpanzee
|
|
|
|
Orangutan
|
|
|
|
Rhesus Macaque
|
|
|
|
Marmoset
|
|
|
|
Herpesvirus 4
|
|
|
|
Bos taurus
|
|
|
|
Remaning
|
|
|
Number of SNPs in the genome and the sequenced exome containing regions
|
Individual
|
Genomic SNPs
|
Novel SNPs
|
Coding exon SNPs
|
|
PGP9
|
|
|
|
|
NB1
|
|
|
|
|
MD8
|
|
|
|
|
TK1
|
|
|
|
|
ABT SOLID
|
|
|
|
|
ABT exome
|
|
|
|
|
NA 18507
|
|
|
|
|
NA 19240
|
|
|
|
|
J. Watson
|
|
|
|
|
J.C. Venter
|
|
|
|
|
NA12891
|
|
|
|
|
NA 12892
|
|
|
|
|
Chinese
|
|
|
|
|
Korean (SJK)
|
|
|
|
Conclusion
Methods. DNA extraction, library preparation, and massively parallel sequencing (MPS)
Genomic DNA (gDNA) was extracted from whole blood with a QIAamp DNA blood kit according to the manufacturer's instructions (Qiagen). Libraries were prepared according to the manufacturer's instructions (Illumina). Briefly, 5 ㎍ of gDNA in 200 ㎕ nuclease free water was fragmented by bioruptor (Diagenode) at high power for 30 min. (30 sec ON and 30 sec OFF). Overhangs of fragmented gDNA were converted to blunt ends using T4 DNA ligase and Klenow enzyme. Subsequently, an ‘A’ base was added to the ends of double-stranded DNA using Klenow exo (3' to 5' exo minus).
The paired end adaptor (Illumina) with a single ‘T’ base overhang at the 3' end was ligated to the above products. The PE adaptor ligated products were separated on a 2% agarose gel and excised from the gel at positions approximately for span size ranges (100bp, 200bp, 300bp, and 400bp). Size-selected DNA fragments were enriched by PCR with PE primers 1.1 and 2.1 (Illumina). The concentration of libraries was measured by both nanodrop (Thermo) and Qubit IT (Invitrogen). Finally, the libraries were validated by Experion (BIO-RAD). The gDNA library was sequenced using the Illumina 1G genome analyzer according to the manufacturer’s instructions.
Short read alignment
We used a fast short-read alignment program, BWA (ver. XXXX) with default parameters. BWA utilizes the read-pair information of paired-end reads. Using the read-pair information, BWA corrects wrong alignments, adds confidence to correct alignments, and
Calling SNPs Using MAQ, we aligned short reads to the NCBI reference genome and produced a consensus genotype sequence from the alignment. The options used for SNP calling were: minimum four read depth (-d 4), maximum depth (-D 100) to filter out randomly placed repetitive hits, consensus quality (Q20), adjacent sequence quality (Q20), and no SNP call if any indel occurred in the 3bp flanking region. De novo assembly of unmapped reads and mapping assembled contigs Velvet version 0.7.27 (Zerbino and Birney 2008) was used to assemble unmapped reads into contigs with hash length 235, coverage cutoff 23, and minimum contigs length 100. Among 41.2 million unmapped 75bp reads, we selected 24.6 million high quality reads. We filtered out low quality reads with total scores below 1500 or reads with ‘N’. Assembled contigs were aligned against the unanchored scaffolds of NCBI build 36 to find homologous sequence regions using the Blat alignment algorithm (Kent 2002). Before the alignment, the repeat sequences of contigs were masked by RepeatMasker (http://www.repeatmasker.org/). To select significantly homologous alignments, thresholds >95% identity and >80% coverage of contigs were used as thresholds. We performed a Blastx alignment against the NCBI refseq protein database of human
Detection of short indels We called short indels (from 29 bp deletion to 14 bp insertion) by using the MAQ pairedend indel detection method. Short indels were confirmed when they were identified in both strands with a minimum of three reads. When more than one indel candidates co-occured in a window size of 20bp, they were filtered out because alignment errors were found to represent most cases. Detection of structural variants (SV)
Genetic ancestry The chromosome Y haplotype data from Y Chromosomal Consortium (YCC) were used for the paternal lineage analysis. The mitochondrial DNA sequence was aligned with rCRS (Revised Cambridge Reference Sequence of the human mitochondrial DNA) (Andrews, Kubacka et al. 1999), and identified variations were mapped to a MitoVariome set (http://variome.kobic.re.kr/MitoVariome), which is an extended set of Mitomap (Brandon, Lott et al. 2005). The haplogroup-diagnostic nucleotide sequence variants were assembled from reference information (Kong, Bandelt et al. 2006). To make an autosomal phylogenic tree, Allele Sharing Distance (ASD) and Neighbor Joining (NJ) methods were used (Saitou and Nei 1987; Bowcock, Ruiz-Linares et al. 1994; Mountain and Cavalli-Sforza 1997). Four personal genomes, Watson, HuRef (Venter), YH (Han Chinese), and SJK, were used with HapMap Phase III samples of four populations, YRI (Yoruba), CHB (Chinese), CEU (Caucasian), and JPT (Japanese). HapMap samples had no relationship within them. Eight randomly chosen Koreans and the donor’s mother were genotyped with a Illumina 1M bead array. To calculate ASD, the proportion (Pi) of shared alleles between individuals was estimated by --- 1) Where the number of shared alleles S is added over all loci u, distance between individuals (Di ) is estimated by Di = 1 - Pi --- 2)
Acknowledgement
JB thanks TheragenEtex for generously supporting genomics research.
References
댓글 0