Geuvadis RNA sequencing project of 1000 Genomes samples


The Geuvadis project aims to bring together knowledge and resources on medical genome sequencing at a European level, to allow researchers to develop and test new hypotheses on the genetic basis of disease, and to develop standards in sequencing data processing, storage, submission and related areas. The analysis of samples from the medical field using RNA and DNA sequencing will allow the project to set up standards for operating procedures and for the biological and medical interpretation of sequence data in relation to clinical phenotypes.

In the RNA-sequencing work package of the Geuvadis project (Lappalainen et al. Nature 2013), we have combined transcriptome and genome sequencing data by performing mRNA and small RNA sequencing on 465 lymphoblastoid cell line (LCL) samples from 5 populations of the 1000 Genomes Project: the CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI). Of these samples, 423 were part of the 1000 Genomes Phase 1 dataset (Abecasis et al. Nature 2012) with low-coverage whole genome and high-coverage exome sequencing data. The remaining 42 are part of the later phases of 1000 Genomes, with Omni 2.5M SNP array data available at the time of this study; their genotypes were imputed from the array data using Phase 1 as the reference.

The main paper presenting the data set and summarizing the key findings, with a focus on transcriptome variation and its genetic component, was published in Nature in September 2013 by Lappalainen et al.: Transcriptome and genome sequencing uncovers functional variation in humans (in press). A companion paper on reproducibility and technical variation in RNA-seq was published at the same time in Nature Biotechnology by ‘t Hoen et al.: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories (in press). Additionally, there will be future companion papers on splicing variation (Ferreira et al. submitted) and loss-of-function variation (Rivas et al. in preparation), as well as on other aspects of the data.
The Geuvadis RNA-sequencing data are freely and openly available. The main portal for accessing the data is EBI ArrayExpress (accessions E-GEUV-1, E-GEUV-2, E-GEUV-3). For visualisation of the results we created the Geuvadis Data Browser, where quantifications and QTLs can be viewed, searched, and downloaded.
Data access schema for Geuvadis RNAseq data.
The main access site for the data created and analyzed by the Geuvadis RNA-sequencing project is EBI ArrayExpress, where the data are stored under three accessions: E-GEUV-1 for the mRNA post-QC samples used in the analyses of this paper, E-GEUV-2 for the small RNA post-QC samples, and E-GEUV-3 for all the sequenced data.
1) Raw reads in the form of fastq files are stored in ENA under the accessions ERP001942 and ERP001941, and are also accessible through ArrayExpress (the ENA and FASTQ columns).
2) mRNA mapped reads are stored in and accessible from EBI ArrayExpress. Files of mapped small RNA reads are not provided, because mapping to different references for different analytical purposes is more complex and the large number of multimapping reads makes the files very large.
3) Genotype data that have been used in Geuvadis data analysis are available from the EBI ArrayExpress site under accession E-GEUV-1, and the vcf files also include a functional reannotation of all the variants. The original data created by the 1000 Genomes Project are available on the 1000 Genomes web site.
4 and 5) Geuvadis analysis results for gene, transcript, exon, and repeat quantifications and QTLs will be available from the EBI ArrayExpress site under accession E-GEUV-1, and miRNA quantifications and mirQTLs under accession E-GEUV-2.
6) mRNA mapping results per sample, down to the level of individual reads, can be visualized in the Ensembl Genome Browser via the links from ArrayExpress (the Ensembl icon).
7) The Geuvadis data browser was created specially for the Geuvadis RNA-seq project to visualize quantification and QTL results. It allows searching by variant ID, gene and region, and after publication also the download of quantification and QTL data by region.
8) Original genotype data can be viewed and downloaded in the 1000 Genomes Browser.
9) Protocol and sample metadata information is available in ArrayExpress (mRNA QC+; miRNA QC+; All QC+/-). The project wiki is openly accessible and contains additional analysis results and method descriptions.
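As a rough illustration of the access routes listed above, the following Python sketch collects the accessions named in this page and builds a request for the raw-read file listing in ENA. The accessions come from the text; the ENA Portal API "filereport" endpoint, its query parameters and the column names are assumptions that are not stated on this page, so check them against the ENA documentation before relying on them. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Accessions as listed on this page.
ARRAYEXPRESS = {
    "E-GEUV-1": "mRNA post-QC samples, genotypes, quantifications and QTLs",
    "E-GEUV-2": "small RNA post-QC samples, miRNA quantifications and mirQTLs",
    "E-GEUV-3": "all sequenced data",
}
ENA_STUDIES = ("ERP001942", "ERP001941")

# Assumption: ENA Portal API file-report endpoint (not named on this page).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(study, fields=("run_accession", "fastq_ftp")):
    """Build the URL of a tab-separated listing of the runs in one ENA study."""
    query = urlencode({
        "accession": study,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Turn the tab-separated report into one dict per run, keyed by column name."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def fastq_paths(row):
    """Split the fastq_ftp column; paired-end runs are assumed to be ';'-separated."""
    return [path for path in row.get("fastq_ftp", "").split(";") if path]


if __name__ == "__main__":
    for study in ENA_STUDIES:
        print(filereport_url(study))
```

The script only prints the request URLs; fetching them (for example with urllib.request) and passing the response text to parse_filereport is left to the reader, since the exact columns returned depend on the fields requested.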
The tools and protocol for read mapping are described in
Do I need someone’s permission to download the data and analyze it? Can I publish my findings?
You can download and analyze the data freely, as long as you include the proper citation in all presentations and publications based on these data. 
Are you part of the 1000 Genomes Project?
The RNA-sequencing data and analysis were not produced as part of the 1000 Genomes Project, so this should not be referred to as the 1000 Genomes RNA-seq project. However, we are happy and grateful users of 1000 Genomes genotype data and samples, and many of the co-authors are part of that consortium as well.
What protocol did you use for RNA extraction / library prep / mapping / etc?
See the protocol descriptions under the ArrayExpress accessions and the Supplementary Material of Lappalainen et al. Nature 2013. If that doesn’t answer your questions, send us an email.
I need slightly different quantification / association data than you provide. Is this available somewhere?
If this is something close to what we provide, we might have these data – you can send us an email to ask, but you might need to start from the raw data.
I don’t understand the file format / data that is provided
See the README file of the data directories, check Google for common formats like bam and vcf, and read the Supplementary Material of Lappalainen et al. Nature 2013. If that doesn’t answer your questions, send us an email.
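For readers new to the vcf format, the following is a minimal Python sketch of how one such file could be read with the standard library only. It assumes the standard VCF layout (eight fixed tab-separated columns, then FORMAT, then one column per sample, with the genotype as the first subfield of each sample column); the function names are hypothetical, and the README in the data directory remains the authority on what the Geuvadis files actually contain.

```python
import gzip

# The eight fixed columns defined by the VCF format.
VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_vcf(lines):
    """Yield one dict per variant line from an iterable of VCF text lines."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample IDs follow the FORMAT column
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_FIXED, fields[:8]))
        record["POS"] = int(record["POS"])
        # Assumption: GT is the first subfield of every sample column.
        record["genotypes"] = dict(
            zip(samples, (field.split(":")[0] for field in fields[9:]))
        )
        yield record
```

A typical use would be `with open_text(path) as handle: for record in read_vcf(handle): ...`, where `path` points to a downloaded vcf file. Dedicated libraries handle the format more completely; this sketch only shows how the columns are laid out.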
The main paper of the project and the data, which should be cited in all presentations and publications based on these data:
Lappalainen et al. Nature 2013: Transcriptome and genome sequencing uncovers functional variation in humans (in press).
Companion paper on technical variation and reproducibility of RNA-seq data:
‘t Hoen et al. Nature Biotechnology 2013: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories (in press).
Information on data
Tuuli Lappalainen
Emmanouil Dermitzakis
Technical assistance in data access
Natalja Kurbatova
Information on the GEUVADIS project
Xavier Estivill
Gabrielle Bertier