Analysis of human female personal genome

Automatic analysis of the first publically available female human genome by next generation sequencers as a part of PGP (Personal Genome Project).

Authors:

Jong Bhak¹, Taehyung Kim², Jongsun Park, Sunghoon Lee², Sungwoong Jho¹, Suan Jho¹, Minseob Lee, Rosalynn Gill, George Church³, and Byungchul Kim¹

Affiliations:

1. Genome Research Foundation, Suwon, 443-270, South Korea.

2. TheragenEtex, South Korea

3. Harvard Medical School, Boston, MA, USA

Contacts:

jongbhak@gmail.com

Unique Paper Identifier: 2011.11.05.01.01.JongBhak

Abstract

A Caucasian female human genome analyses are reported. It was sequenced using Illumina GA2 next generation sequencers and 90GB of short DNA fragments of ~90 base pairs were obtained. Overall coverage on to the reference human geome (HG19) was 99.2% with sequencing depth of 33 folds. 3.5 million SNPs, 200,000 short indels, 10,000 long structural variations, and 125,000 copy number variations were found. We found 10,000 non-synonymous mutations resulting in 100 genes affected by amino acid changes. Out of 100, we found 2 presumably critical mutations that resulted in some structural changes. The whole analysis was carried out by a bioinformatics pipeline called 'Genome Engine' which finished the whole analysis process in about 2 weeks, semi-automatically. The donor of the genome is Rosalynn Gill who donated all the rights and information associated with the data to the public as a part of PGP (Personal Genome Project). Gill's genome was the first publically available female human genome and the data were available from http://bioftp.org since April 2010 with analyses results. Here we present the second stage comparative analyses of her genome.

Introduction

NGS (next generation sequencers) produced many personal genomes in various research projects such as PGP (Personal Genome Project), 1000 genome project, YH genome project of China, Korean Genome Project (KGP) by Genome Research Foundation in Korea, Japanese genome project, and Indian genome project. By 2011, the number of genomes has increased from 10s in 2010 to hundreds. However, relatively few female human genomes have been competely publicized. In 2008, Leiden university announced that they had sequenced the first female human genome. However, the claimed to be the first female human genome data were not publicized nor its research report was announced by Nov. 2011. In the mean time, PGP has publicized the first female genome, Rosalynn Gill's, on the web in April 2010 with extensive analyses that were similar to the previous genomes such as James Watson, YH, and SJK. Gill was PGP number 9 and many of phenotypic and clinical data points were also public at the time of data publication.

The data have been used to be compared with various existing genomes and 10s of Korean genomes that have been available through KPGP (Korean Personal Genome Project). Here, we report the second stage analyses of PGP9 using automated genome pipeline called "Genome Engine". Genome Engine was developed to analyse the first Korean human genome in 2008. It is a personal genome analyzer with common functions such as detecting SNPs, Indels, SVs, CNVs, chart drawing functions, and genome browsing. The main idea of the second stage female genome analysis was to apply an already common genome analysis package to annotate it as much as possible with minimum effort.

Results

Whole-genome sequencing results for PGP9

Read length (bp)	Number of reads	Total bases (Gb)	Rate of mapped reads by BWA	Rate of unmapped reads

Illumina sequence data result by using short read mapping program.

Read length (bp)	Numbers of read (M)	Total bases(Gb)	Numbers of aligned reads(M)	Aligned bases (Gb)	Percentage of genome covered	Coverage depth

매핑되지 않은 read는 de novo로 보고 Velvet으로 assembly 분석. 이때 de novo가 quality score가 20 이상인 것만 사용한다.

리드 시퀀싱 에러 짐작하기

시퀀싱 에러 율을 짐작하기 위해 chromosomes X로 predominant한건 true로 나머진 error로 본다.

시퀀스 quality score를 계산한다. r=

염기 간 치환 에러 비율

C와 A간, G와 T 등을 Illumina Genome Analyzer로 관찰한다.

Result of de novo assembly.

Assembler	Contigs	n50(bp)	Minimum Length (bp)	Total Length (bp)	Scafolds	Reads Used

Genotyping result

Comparation of X chromosome SNP calls made from sequence vs. genotype data

		Genotyped n	Samtools		GATK
		Genotyped n	n	%	n	%
covered by sequence	Homozygote
	Heterozygote
	All
Concordant calls	Homozygote
	Heterozygote
	All
All disagreements	GT>Seq
All disagreements	Seq>GT
Other discordances

Genotyping of the PGP9 sample was carried out using the Illumina HumanHap BeadChip. GT>Seq denotes a heterozygous genotyping SNP call where there is a homozygous sequencing SNP call, Seq>GT denotes the converse, I.e. a heterozygous sequencing SNP call where there is a homozygous genotyping call. Other discordances are differences in the two SNP calls that cannot be accounted for by one allele being missing from one call.

Short indels identificaion in PGP9

		Diploid
		Total	Novel
non-exonic	Homozygous
	Heterozyous
	Subtotal
UTR	Homozygous
	Heterozyous
	Subtotal
CDS	Homozygous
	Heterozyous
	Subtotal
Sum	Total

Statistics of SNPs identified in PGP9 genome sequences through application x Compared with dbSNP

	Total	Homozygous	Heterozygous
Total SNPs
Autosomal SNPs
CDS nsSNPs
Novel
non-exonic
UTR
CDS

List of nsSNPs detected in the PGP9 genome.

Distribution of size of PGP 9 indels.

Figure. X: Length of indel (bp) Y: Number of cases

Analysis of short indel calls in human genome data for PGP9.

Total number of calls and fraction that match previous entries in dbSNP.

	Number of short indels	%dbSNP
Heterozygotes
Homozygoues
All indels

Homozagous deletion figure -

Heterozagous deletion figure -

Homozagous x kb inversion figure -

Annotation

Out of 90GB short DNA reads, 6% were not matched to the human reference. These were subject to de novo assembly program XXXX. We found 2 genes in the unassigned salvaged fragments. We assigned 25,xxx ORFs to PGP9. There were YYY genes mutated non-synonymously and they are listed in Table 1.

What are the mutated genes?

Any gene that is causing any clinical symptom to Gill?

By using Trait-o-matric we found and classified phenotypic correlations for variations in whole genomes.

We found x SNPs that were potentially associated with clinical phenotypes were identified. Of these, x were relatively common SNPs previously associated with risk of complex disorders or traits. For example, the genome of PGP9 contained x SNPs that have shown associations with susceptibility to various cancers, x SNPs with type 2 diabetes mellitus, The genome of PGP9 also contained x non-synonymous SNPs in genes associated with complex or Mendelian disorders or traits. Of these, x were stop codons and x were homozygous. Among Mendelian traits, PGP9 was homozygous for a variant conferring wet earwax that has a high allele frequency in caucasians. score of 4 or more)

PGP9 genome statistics with numbers of variants found (Polyphen 2 prediction, PharmGKB or HuGENet, GeneTests, nonsense and frameshift mutations, priority

Individual	Number of nonsyn. with probably damaging Polyphen 2 prediction	Number of vars in PharmGKB or HuGENet	# of nonsyn vars in genes with clinical testing (GeneTests)	# of nonsense and frameshift mutations	# with priority score of 4 or more
PGP9

SNPs Potestially associated with clinical phenotypes derived from the database of human gene mutation data (HGMD), OMIM, SNPedia and other hypotheses. DM.

Gene Symbol	HGMD	Phenotype	Disease Realationship

Result of homozygous and heterozygous OMIM HGMD.

	Gene Symbol	OMIM ID	Phenotype	Disease Realationship
homozygous	ABCF1


heterozygous

Ancestry

The ancestry of PGP9 was European maternally (XXX type) and paternally (XXX), Fig. 1. a.

Geographic map showing the region of ancestry. MT type, mitochomdrial haplogroup. Figure -

Autosomal phylogenic tree using allele sharing distanve (ASD) and Neighbor Joining (NJ) methods Figure-

Comparative genomics

We compared PGP9 with 10 other genomes.

How many unique SNPs in her genome? (compared to dbSNP and other genomes)

Compare Gill with Maryana genome.

Result of blast analysis of de novo contigs

Category	Velvet
	Number of Contigs	Length (bp)
Total
Hs build 36.3	0	0
Hs GRCh 37
Hs Alt
Hs Other
Chimpanzee
Orangutan
Rhesus Macaque
Marmoset
Herpesvirus 4
Bos taurus
Remaning

Number of SNPs in the genome and the sequenced exome containing regions

Individual	Genomic SNPs	Novel SNPs	Coding exon SNPs
PGP9
NB1
MD8
TK1
ABT SOLID
ABT exome
NA 18507
NA 19240
J. Watson
J.C. Venter
NA12891
NA 12892
Chinese
Korean (SJK)

Number of SNPs identified for pair of mitochodrial sequences.

"CRS"= Cambridge Reference Sequence.

Conclusion

Methods. DNA extraction, library preparation, and massively parallel sequencing (MPS)

Genomic DNA (gDNA) was extracted from whole blood with a QIAamp DNA blood kit according to the manufacturer's instructions (Qiagen). Libraries were prepared according to the manufacturer's instructions (Illumina). Briefly, 5 ㎍ of gDNA in 200 ㎕ nuclease free water was fragmented by bioruptor (Diagenode) at high power for 30 min. (30 sec ON and 30 sec OFF). Overhangs of fragmented gDNA were converted to blunt ends using T4 DNA ligase and Klenow enzyme. Subsequently, an ‘A’ base was added to the ends of double-stranded DNA using Klenow exo (3' to 5' exo minus).

The paired end adaptor (Illumina) with a single ‘T’ base overhang at the 3' end was ligated to the above products. The PE adaptor ligated products were separated on a 2% agarose gel and excised from the gel at positions approximately for span size ranges (100bp, 200bp, 300bp, and 400bp). Size-selected DNA fragments were enriched by PCR with PE primers 1.1 and 2.1 (Illumina). The concentration of libraries was measured by both nanodrop (Thermo) and Qubit IT (Invitrogen). Finally, the libraries were validated by Experion (BIO-RAD).

The gDNA library was sequenced using the Illumina 1G genome analyzer according to the manufacturer’s instructions.

Short read alignment

We used a fast short-read alignment program, BWA (ver. XXXX) with default parameters. BWA utilizes the read-pair information of paired-end reads. Using the read-pair

information, BWA corrects wrong alignments, adds confidence to correct alignments, and

accurately maps reads to repetitive sequences if their mate pairs are confidently aligned.

Calling SNPs

Using MAQ, we aligned short reads to the NCBI reference genome and produced a

consensus genotype sequence from the alignment. The options used for SNP calling were:

minimum four read depth (-d 4), maximum depth (-D 100) to filter out randomly placed

repetitive hits, consensus quality (Q20), adjacent sequence quality (Q20), and no SNP call

if any indel occurred in the 3bp flanking region.

De novo assembly of unmapped reads and mapping assembled contigs

Velvet version 0.7.27 (Zerbino and Birney 2008) was used to assemble unmapped reads into contigs with hash length 235, coverage cutoff 23, and minimum contigs length 100. Among 41.2 million unmapped 75bp reads, we selected 24.6 million high quality reads. We filtered out low quality reads with total scores below 1500 or reads with ‘N’.

Assembled contigs were aligned against the unanchored scaffolds of NCBI build 36 to find homologous sequence regions using the Blat alignment algorithm (Kent 2002). Before the alignment, the repeat sequences of contigs were masked by RepeatMasker (http://www.repeatmasker.org/). To select significantly homologous alignments, thresholds >95% identity and >80% coverage of contigs were used as thresholds. We performed a Blastx alignment against the NCBI refseq protein database of human

(NCBI build 36, 37,742 proteins), chimpanzee (NCBI build 2, 51,947 proteins), and mouse (NCBI build 37, 34,966 proteins). Hits with E-value <0.0001, percent identity >40%, and protein coverage >40% were used as significance thresholds.

Detection of short indels

We called short indels (from 29 bp deletion to 14 bp insertion) by using the MAQ pairedend indel detection method. Short indels were confirmed when they were identified in both strands with a minimum of three reads. When more than one indel candidates co-occured in a window size of 20bp, they were filtered out because alignment errors were found to represent most cases.

Detection of structural variants (SV)

We detected SVs based on span size and orientation information of each paired-end read. Paired-end reads with an anomalously long span size (more than double the average span size of each DNA library) were identified as SV candidates (deletion and inversion), especially when they had a minimum of three reads in the region, maximum 100 read depth and mapping quality (Q20). SV candidates either found in repeat regions of the genome or having more than 100kb of genomic deletions were filtered out. For insertion detection larger than the short indels (-29 - +14 bp), the longest 300bp span size of our paired-end libraries was used. Thus, we could fill 175~250bp insert gaps between short inserts and large inserts. The criteria used for detecting these insertions absent from the reference genome in the range of 175~250bp were: minimum four read depth, maximum 60 read depth to filter out randomly placed hits in repetitive structure region, and mapping quality (Q20).

Genetic ancestry

The chromosome Y haplotype data from Y Chromosomal Consortium (YCC) were used for the paternal lineage analysis. The mitochondrial DNA sequence was aligned with rCRS (Revised Cambridge Reference Sequence of the human mitochondrial DNA) (Andrews, Kubacka et al. 1999), and identified variations were mapped to a MitoVariome set (http://variome.kobic.re.kr/MitoVariome), which is an extended set of Mitomap (Brandon, Lott et al. 2005). The haplogroup-diagnostic nucleotide sequence variants were assembled from reference information (Kong, Bandelt et al. 2006). To make an autosomal phylogenic tree, Allele Sharing Distance (ASD) and Neighbor Joining (NJ) methods were used (Saitou and Nei 1987; Bowcock, Ruiz-Linares et al. 1994; Mountain and Cavalli-Sforza 1997). Four personal genomes, Watson, HuRef (Venter), YH (Han Chinese), and SJK, were used with HapMap Phase III samples of four populations, YRI (Yoruba), CHB (Chinese), CEU (Caucasian), and JPT (Japanese). HapMap samples had no relationship within them. Eight randomly chosen Koreans and the donor’s mother were genotyped with a Illumina 1M bead array. To calculate ASD, the proportion (Pi) of shared alleles between individuals was estimated by --- 1) Where the number of shared alleles S is added over all loci u, distance between individuals

(Di ) is estimated by

Di = 1 - Pi --- 2)

Phylogenic analyses were conducted in MEGA4 (Tamura, Dudley et al. 2007).

Acknowledgement

JB thanks TheragenEtex for generously supporting genomics research.

References

Analysis of human female personal genome

댓글 0