
This page lists data sets generated from our research projects that can be viewed and downloaded. Some of the data have been published, while others have not yet. If you are interested in using them for further global analysis, please contact us first [Dr. Jianbing Yan]. We will be pleased to share the data for any specific gene or region analysis.

Genetic Resources

  • Maize is an ideal crop for association mapping due to its great genetic diversity and rapid linkage disequilibrium (LD) decay. Successful association mapping in a species first requires the creation of a suitable germplasm collection that reflects the genetic diversity, extent of LD decay and genetic relatedness in a population, which together determine mapping resolution and power. Generally, germplasm collections need to encompass adequate genetic diversity to cover most variations for the traits of interest.

    • We have assembled a global germplasm collection of more than 1,000 elite maize inbred lines released from the major temperate and tropical/subtropical breeding programs of China, CIMMYT and the Germplasm Enhancement of Maize (GEM) project in the US. A total of 527 inbred lines assayed with genome-wide single nucleotide polymorphism (SNP) markers are listed below, for which adaptation data obtained from field experiments are also available. RNA sequencing was performed on 368 of these 527 lines [368 diversity inbred lines and pop-structure][0.1 MB] using kernels harvested 15 days after pollination. We also provide 7 SSRs [SSR information and related PCR results] that can be used to detect all of the AMP lines.


Related papers:
Characterization of a global germplasm collection and its potential utilization for analysis of complex quantitative traits in maize.

Genetic analysis and characterization of a new maize association mapping panel for quantitative trait loci dissection.


MODEM: multi-omics data envelopment and mining in maize (2016).

In addition, we have developed 11 RIL populations and one BC2F6 population, each containing 200 families. All the families were genotyped with the 50K Maize SNP array [50K SNPs for 12 populations download][18 MB] and phenotyped in at least 6 locations. More RIL populations are under development and will be phenotyped in more locations.

Linkage maps [maps download] for the above 12 populations were constructed from SNPs on the 50K array [raw data] with the software CARTHAGENE, and the recombination dynamics along the maize genome were investigated. [download for software, code and guide]


Related papers:
Genome-wide dissection of the maize ear genetic architecture using multiple populations (2016).

Genome-wide recombination dynamics are associated with phenotypic variation in maize (2016).

Genotypic Data:

  • The RNA-seq project (in collaboration with CAAS, BGI and CAU) has generated more than 3.6 million SNPs for the 368 diverse inbred lines. The genotypic data of each line are released below, containing high-quality SNPs (missing rate less than 0.6) combined with SNPs from the MaizeSNP50 BeadChip (more than 1.06 million in total) [Download].


Related papers:
Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels (2013).

RNA sequencing reveals the complex regulatory network in the maize kernel (2013).

To increase the power of association analysis, we imputed the high-density genotypes [MAF<5% filtered, 66.3MB] to the whole panel of 513 lines using an integrated IBD and KNN model.


Related paper:
Genome Wide Association Studies Using a New Nonparametric Model Reveal the Genetic Architecture of 17 Agronomic Traits in an Enlarged Maize Association Panel (2014).

Recently, we created a new integrated variant map with much higher density (1.25M SNPs with MAF≥0.05) and an enlarged panel size (n=540) by combining genotypes from the previous RNA sequencing and 50K array with newly identified genotypes from a high-density array (600K) and GBS technology. This dataset was applied in re-mapping the eQTL landscape for maize kernel and should provide a valuable resource for future genetic studies. Because the related files are large, the final merged genotype set (in hapmap format), together with the separate raw sets genotyped by the different strategies, is available Here [1.25M with 540 size].


Related paper:
Distant eQTLs and non-coding sequences play critical roles in regulating gene expression and quantitative trait variation in maize (2016).
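
For readers working with the merged genotype set in hapmap format, the Python sketch below shows one way the MAF≥0.05 threshold could be checked on a single row. It assumes the common tab-delimited HapMap layout with 11 leading metadata columns followed by one two-letter genotype call per line (for example `AA` or `AG`, with `NN` for missing); the released files may use a different encoding. The function name `minor_allele_frequency` and the toy row are illustrative only.

```python
from collections import Counter

N_META_COLS = 11       # assumption: standard HapMap layout with 11 metadata columns
MISSING = {"N", "-"}   # assumption: symbols used for missing alleles


def minor_allele_frequency(hapmap_row):
    """Return the MAF of one HapMap row, or None if every call is missing."""
    fields = hapmap_row.rstrip("\n").split("\t")
    calls = fields[N_META_COLS:]
    # Count individual alleles across all two-letter genotype calls.
    counts = Counter(
        allele for call in calls for allele in call if allele not in MISSING
    )
    total = sum(counts.values())
    if total == 0:
        return None
    if len(counts) < 2:
        return 0.0  # monomorphic site
    # The second most common allele is treated as the minor allele.
    return counts.most_common(2)[1][1] / total


# Toy row: 11 placeholder metadata fields followed by five genotype calls.
meta = ["snp1", "A/G", "1", "1000", "+", "NA", "NA", "NA", "NA", "NA", "NA"]
row = "\t".join(meta + ["AA", "AA", "AG", "GG", "NN"])
maf = minor_allele_frequency(row)
print(maf, maf >= 0.05)
```

Missing calls are excluded from the denominator here, which is one common convention; other tools may treat them differently.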

We also genotyped some RILs (only those close to publication are listed here):


Zong3 X 87-1:

[261 SSR markers for the 294 RILs][156KB].

[Genotypes and map information of the 3,184 bins for the 256 RILs][1.64MB].

[261 SSR markers for the 441 crosses][232KB].


Related papers:
Genetic basis of grain yield heterosis in an “immortalized F2s” maize population (2014).

Performance prediction of F1 hybrids between recombinant inbred lines derived from two elite maize inbred lines (2012).

SV MAP

The present SV map was derived from deep DNA re-sequencing data of the association panel, based on the haplotypes between SK, B73 and Mo17.


[eqtl.psv.19707][1.88MB].

[GWAS.pSV.21081][2.96MB].

[pSV.80614][8.63MB].

[SV.382254][31.37MB].

[SV.386014][102.42MB].


Related papers:
A reference tropical maize genome provides insights into structural variation and facilitates cloning of a yield related gene (2019).

Moisture Content traits and AUDDC traits of association panel

The association panel was planted in five locations as follows: Jilin (14JL; Gongzhuling, E 124°69′, N 43°79′), Liaoning (14SY; Shenyang, E 123°43′, N 41°81′), Henan (14HeN; Xinxiang, E 113°91′, N 35°31′) and Hubei (14WH; Wuhan, E 114°32′, N 30°58′) in 2014, and Hainan (13HN; Sanya, E 109°51′, N 18°25′) in 2013. The phenotypic data (including moisture content traits and AUDDC traits for five locations and BLUP) can be found below:


[AUDDC.csv.19707][433KB].

TOP algorithm related data

Target-oriented prioritization (TOP) is a flexible machine learning algorithm that combines aspects of genomic prediction (GP) and maximum likelihood testing. This method learns the inherent correlations among traits in the training population, balances the selection on multiple traits, and predicts the degree of similarity between a tested material (inbred/hybrid) and a commercial variety. The maize NCII population included 5,820 F1 hybrids that were created by crossing 194 maternal inbred lines, a subset of the maize Complete-diallel plus Unbalanced Breeding-derived Inter-Cross (CUBIC) population (Liu et al., 2020), with 30 diverse elite paternal lines. From the previous analyses, ~13.8M SNPs were available for the CUBIC lines (Liu et al., 2020). We filtered SNPs using PLINK (Chang et al., 2015), removing those with a minor allele frequency (MAF) <0.05, an expected missing rate >10% in the hybrid population, or an LD r2>0.3, finally obtaining 156,269 SNPs to represent the genotypes of the 5,820 F1 hybrids. The genotype data consist of genotype indicators: 0 for the homozygote of the major allele, 1 for the heterozygote and 2 for the homozygote of the minor allele. The phenotype data cover a total of 18 agronomic traits, including flowering traits (days to tassel, days to anther and days to silk), plant architecture traits (plant height, ear height, ear leaf width, ear leaf length, tassel length, tassel branch number) and yield traits (cob weight, ear weight, ear diameter, ear length, ear row number, kernel number per ear, kernel number per row, kernel weight per ear, length of barren tip).


[Data.rar][16.5MB].
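
The 0/1/2 genotype coding and the MAF and missing-rate filters described for the TOP data can be sketched in Python as follows. This is an illustrative reimplementation, not the PLINK commands used to produce the released file. The function names, the two-letter genotype strings and the use of `None` for a missing call are assumptions, and the LD-based pruning step (r2>0.3) is not shown.

```python
from collections import Counter


def encode_snp(calls):
    """Code calls such as "AA"/"AG"/"GG" as 0, 1 or 2 copies of the minor allele.

    0 = homozygote of the major allele, 1 = heterozygote,
    2 = homozygote of the minor allele; missing calls (None) stay None.
    Assumes a biallelic marker.
    """
    counts = Counter(a for call in calls if call is not None for a in call)
    if not counts:
        return [None] * len(calls)
    major = counts.most_common(1)[0][0]
    return [
        None if call is None else sum(a != major for a in call)
        for call in calls
    ]


def passes_filters(coded, min_maf=0.05, max_missing=0.10):
    """Keep a SNP only if MAF >= min_maf and missing rate <= max_missing."""
    observed = [g for g in coded if g is not None]
    if not observed:
        return False
    missing_rate = 1 - len(observed) / len(coded)
    if missing_rate > max_missing:
        return False
    minor_freq = sum(observed) / (2 * len(observed))
    return min(minor_freq, 1 - minor_freq) >= min_maf


# Toy marker genotyped in ten hypothetical hybrids.
calls = ["AA", "AG", "GG", "AA", "AA", "AA", "AA", "AA", "AA", "AA"]
coded = encode_snp(calls)
print(coded, passes_filters(coded))
```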

Expression data of 368 lines:

  • The expression level in DAP15 kernels of the 368-line association panel was quantified based on RNA-seq. Read counts for each gene were calculated and scaled according to RPKM. After RPKM normalization, all genes with a median expression level larger than zero for each sample were included, and the overall distribution of expression levels among the 368 lines for each gene was normalized using a normal quantile transformation.

To download expression data of the 368 panel
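
As a rough illustration of the two normalization steps named above, the Python sketch below computes RPKM from a read count and then applies a rank-based normal quantile transformation to one gene's values across lines. The exact rank offset used for the released data is not stated on this page; the `rank / (n + 1)` convention, the average ranks for ties, the function names and the toy numbers are all assumptions.

```python
from statistics import NormalDist


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def normal_quantile_transform(values):
    """Map values to standard-normal quantiles by rank (ties get average ranks)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Assumed convention: rank / (n + 1) keeps probabilities strictly in (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


# Toy example: one gene of 2,000 bp in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))
# Toy expression values of one gene across three lines.
print(normal_quantile_transform([3.0, 1.0, 2.0]))
```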


Related papers:
RNA sequencing reveals the complex regulatory network in the maize kernel (2013).


Distant eQTLs and non-coding sequences play critical roles in regulating gene expression and quantitative trait variation in maize (2016).


MODEM: multi-omics data envelopment and mining in maize (2016).

PAN-transcriptome related:

  • The deep RNA-seq of the 368 inbred lines (DAP15 kernel) is described above. Novel transcripts were de novo assembled using a modified “assemble-then-align” strategy. After filtering and clustering steps, 2355 reference novel genes (transcripts) were obtained for the 368 panel, which were applied in further association mapping and in estimating the maize pan-size. The novel sequences (fasta format) and their annotation are available below:

[novel gene sequences (2355)]

[Annotation of Novel sequences]

[novel genes variation (2355)]

  • Additionally, we investigated extreme variation at the transcript level by analyzing the above RNA-seq data. We identified almost one-third (13,443) of nuclear genes as showing expression presence and absence variation (ePAV) in maize. The ePAV genes (dispensable transcriptome) were further shown to undergo different genetic mechanisms and to play different regulatory roles compared with core expressed genes: they tend to be regulated much more by distant eQTLs and are likely to function as regulators. We thus believe these newly identified "markers" might be useful in your further specific studies; they are available below.

[ePAV variation (13443)]

  • Further, the kernel metabolome (including 616 metabolic traits) and 17 agronomic phenotypes were used to explore the genetic architecture of the pan-transcriptome and thus to give a more general evaluation of the phenotypic contribution from dispensable expressed genes and novel ones. Importantly, the ePAV states, rather than the genomic variation, have been demonstrated to be effective and easily interpreted markers that complement SNP-GWAS studies in understanding genome regulatory complexity and in applications to quantitative trait loci (QTL) cloning. All the association mapping results are provided here:

[PAN GWAS details]


Related paper:
Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation (2016).

Phenotypic Data:

  • The association panel and RILs were planted in multiple locations as follows: Honghe autonomous prefecture, Yunnan province; Sanya, Hainan province; Wuhan, Hubei province; and Ya’an, Sichuan province in 2010, and Chongqing; Hebi, Henan province; Nanning, Guangxi province; and Kunming, Yunnan province in 2011. The locations range from 18 to 35 degrees north latitude and from 102 to 114 degrees east longitude. The phenotypic data (including agronomic, metabolic and grain quality traits) are listed below.

[Agronomic traits(blup) of association panel][117 KB].

[Amino acids traits of association panel][102 KB].

[Metabolic traits (Experiment-1 of association panel)][380 KB].

[Metabolic traits (Experiment-2 of association panel)][526 KB].

[Metabolic traits (Experiment-3 of association panel)][508 KB].

[Metabolic traits (RIL-B73 X BY804)][249 KB].

[Metabolic traits (RIL-Zong3 X 87-1)][237 KB].


Related papers:
Genome Wide Association Studies Using a New Nonparametric Model Reveal the Genetic Architecture of 17 Agronomic Traits in an Enlarged Maize Association Panel (2014).


Metabolome-based genome-wide association study of maize kernel leads to novel biochemical insights (2014).


Genomic, Transcriptomic, and Phenomic Variation Reveals the Complex Adaptation of Modern Maize Breeding (2015).


Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation (2016).


MODEM: multi-omics data envelopment and mining in maize (2016).

We also phenotyped some RILs (only those close to publication are listed here):


By804 X B73:

A maize recombinant inbred line population (By804/B73) which was derived from a cross between normal line B73 and high-oil line By804 was planted at Huazhong Agricultural University field experiment station (Wuhan, E 109°51', N 18°25') in 2013. Phenotypes including seven agronomic traits (plant height (PH), ear height (EH), length of ear leaf (LL), width of ear leaf (LW), tassel length (TL), tassel branch number (TBN) and fresh shoot biomass) as well as 79 metabolic traits (profiled from three tissue types by using GC-MS) were measured and the dataset is provided below.

[79 metabolic traits (from three tissues) & 7 agronomic traits][421KB].


Related paper:
Genetic Determinants of the Network of Primary Metabolism and Their Relationships to Plant Performance in a Maize Recombinant Inbred Line Population (2015).


Zong3 X 87-1:

[Phenotypes of the 294 RILs and 441 crosses.csv][178KB].

Ear traits on ROAM population (10 RIL populations):


A maize ROAM (Random-Open-parent Association Mapping) population consists of a set of 10 independent recombinant inbred line (RIL) populations, i.e., B73×BY804, KUI3×B77, K22×CI7, DAN340×K22, ZHENG58×SK, YU87-1×BK, ZONG3×YU87-1, DE3×BY815, K22×BY815 and BY815×KUI3. These ten RIL populations were planted in eight trials during the summer and winter of 2011 and 2012 in five locations in China with one random-block replication per location. Four ear traits, ear length (EL), ear row number (ERN), ear weight (EW) and cob weight (CW), were measured, and the dataset of the best linear unbiased prediction values is provided below. Relevant analyses and results based on these data can be found below.

[4 ear traits from ROAM population][124 KB].


Related paper:
Genome-wide dissection of the maize ear genetic architecture using multiple populations (2016).

Tea metabolic data


Raw metabolic data of catechins and gallic acid in three leaf samples (young leaf, the third leaf and mature leaf) of 176 tea accessions.

[Tea metabolic data][31.29 MB].

Software and packages:

  • Anderson-Darling (A-D) test


A new method based on the A-D test was developed for genome-wide association studies (GWAS). An open R package [including manual and sample data] makes it easy to run the A-D test for GWAS and is available here:

[R package for ADGWAS]


Related paper:
Genome Wide Association Studies Using a New Nonparametric Model Reveal the Genetic Architecture of 17 Agronomic Traits in an Enlarged Maize Association Panel (2014).
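
To illustrate the idea of a nonparametric, distribution-based test at a single marker, the Python sketch below computes one common form of the k-sample Anderson-Darling statistic (without tie correction) for trait values grouped by genotype class. It is not the R package linked above and may differ from the published method; the function name and toy data are hypothetical, tied trait values are assumed to be absent, and no p-value is derived.

```python
from bisect import bisect_right


def anderson_darling_ksample(groups):
    """k-sample Anderson-Darling statistic, assuming no tied values.

    groups: one list of trait values per genotype class at a marker.
    Larger values indicate that the group distributions differ more.
    """
    samples = [sorted(g) for g in groups if len(g) > 0]
    if len(samples) < 2:
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(x for sample in samples for x in sample)
    n_total = len(pooled)
    statistic = 0.0
    for sample in samples:
        n_i = len(sample)
        inner = 0.0
        for j in range(1, n_total):
            # Observations in this group that are <= the j-th smallest pooled value.
            m_ij = bisect_right(sample, pooled[j - 1])
            inner += (n_total * m_ij - j * n_i) ** 2 / (j * (n_total - j))
        statistic += inner / n_i
    return statistic / n_total


# Toy trait values for two genotype classes at one hypothetical SNP.
separated = anderson_darling_ksample([[1.0, 2.0], [3.0, 4.0]])
interleaved = anderson_darling_ksample([[1.0, 3.0], [2.0, 4.0]])
print(separated, interleaved)
```

In the toy data the fully separated groups give a larger statistic than the interleaved ones, which is the behaviour a marker-trait association scan would rely on.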

  • Random-Open-parents Association Mapping (ROAM)


These R scripts are designed for ROAM, a newly proposed multi-parental population design for large-scale genetic analysis, and cover genotype imputation, projection, bin extraction, kinship calculation, joint linkage mapping (JLM) and genome-wide association study (GWAS) analyses.

[ROAM source codes]


Related paper:
Genome-wide dissection of the maize ear genetic architecture using multiple populations (2016).

  • Genetic Map Construction Software (HDGenMap)


    HDGenMap is designed for genetic map construction based on CARTHAGENE.

    [HDGenMap]


    Related paper:
    Genome-wide recombination dynamics are associated with phenotypic variation in maize (2016).

  • Protocols from Maizego lab:

    • Single Tetrad-stage Microspore Sequencing in Maize


    The protocol from isolating single tetrad-stage microspores to single cell sequencing is available here:

    [maize single tetrad-stage microspore sequencing protocol]


    Related paper:
    Dissecting meiotic recombination based on tetrad analysis by single-microspore sequencing in maize (2015).

    Collective resources for CUBIC population:

    • The CUBIC (Complete-diallel plus Unbalanced Breeding-like Inter-Cross) population consists of 1404 progenies descended from 24 Chinese elite inbred lines. The 24 founders were selected from four unrelated groups and have been widely used in Chinese breeding over the past century. Generally, CUBIC descends from the traditional MAGIC design with the integration of the diallel cross to incorporate information from phenotypic selection. Since the conventional MAGIC design requires 2^N founder lines with at least N generations for initial inter-cross of all founders, the integration of the diallel cross allows escape from arbitrary founder number and saves time during subsequent population development.

      [see schematic diagram for population design]

      All progenies were re-sequenced with ~1x coverage, and the 24 founders with 11x coverage; 194 lines were further genotyped using a maize200K array to cross-validate variant discovery. Briefly, approximately 5 Tb of raw sequencing data were assembled, and popular pipelines were employed to characterize over 14 million high-quality SNPs.

      [Downloading raw sequencing reads] [Downloading SNPs in PLINK formats]

      The 1,404 lines and 30 checks, together with all founders, were evaluated in five locations for twenty-three agronomic traits. The best linear unbiased predictor (BLUP) values for each line were used to reduce environmental noise in the phenotypic data.

      [Downloading BLUP phenotypes for 23 agronomic traits]

      An IBD map of the contributions from the twenty-four founders to the 1,404 progeny lines was constructed using a hidden Markov model (HMM) with several modifications of the approach of Mott et al. (2000). In the modified HMM, the hidden states are the progenitor IBD states and the observed states are the SNP genotype calls; the model can also be applied to other multi-parental populations.

      [HMM script and demo for IBD map construction]

      To aid functional gene identification, a subset of 391 progenies was randomly selected from the CUBIC population for RNA sequencing. These lines and the founder parents were grown at the Hainan field station in the winter of 2016. At the V9 stage (the stage with the fastest leaf tissue growth), total RNA was extracted from the tissue of the 11th leaf. Illumina 150-bp paired-end sequencing was performed using the HiSeq X-Ten protocols. Each sample had on average ~20 million raw reads. Reads with high sequencing quality were retained and mapped onto the B73 AGPv3.25 reference using STAR (Dobin et al., 2013); only uniquely mapped reads were used to quantify gene expression levels with HTSeq (Anders et al., 2015). The expression of each gene was normalized using the software DESeq2 (Love et al., 2014), and only genes expressed in more than 60% of the lines were retained for eQTL mapping. The top 10 PEER (Stegle et al., 2012) factors, together with the top ten genotypic PCs, were used as covariates in eQTL mapping with EMMAX (Kang et al., 2010).

      [Downloading raw RNA-seq reads] [Downloading expression quantifications] [Downloading eQTL results]

    Reference(s): CUBIC: an atlas of genetic architecture promises directed maize improvement (2020).
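
The HMM idea behind the IBD map described above, with founder IBD states as hidden states and SNP calls as observations, can be illustrated with a toy Viterbi decoder in Python. This is a minimal sketch and not the released HMM script or the modified Mott et al. model: the uniform prior, the single `switch_prob` per marker interval, the simple match/mismatch emission with `error_rate`, and all names and data are assumptions made for illustration.

```python
import math


def viterbi_ibd(progeny, founders, switch_prob=0.01, error_rate=0.01):
    """Most likely founder-of-origin path for one progeny line.

    progeny:  list of alleles (0/1), one per SNP.
    founders: dict mapping founder name -> list of alleles (0/1) per SNP.
    """
    names = list(founders)
    k = len(names)
    if k < 2:
        raise ValueError("need at least two founders")
    log_stay = math.log(1 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1))

    def log_emit(name, pos):
        match = founders[name][pos] == progeny[pos]
        return math.log(1 - error_rate if match else error_rate)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Uniform prior over founders at the first SNP.
    score = {f: math.log(1 / k) + log_emit(f, 0) for f in names}
    backpointers = []
    for pos in range(1, len(progeny)):
        new_score, pointers = {}, {}
        for cur in names:
            best_prev = max(names, key=lambda p: score[p] + log_trans(p, cur))
            new_score[cur] = (
                score[best_prev] + log_trans(best_prev, cur) + log_emit(cur, pos)
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(names, key=lambda f: score[f])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy data: the progeny matches founder A on the first half, founder B on the second.
founders = {"A": [0, 0, 0, 0, 1, 1, 1, 1], "B": [1, 1, 1, 1, 0, 0, 0, 0]}
progeny = [0, 0, 0, 0, 0, 0, 0, 0]
print(viterbi_ibd(progeny, founders))
```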



    High-throughput CRISPR/Cas9 gene editing (initial stage):

    • The integration of high-throughput gene editing and genetic mapping has great potential in rapid functional gene cloning and validation. Here, we report the development of a CRISPR/Cas9-based editing platform adapted to high-throughput gene knockouts in maize, and its application in functional gene identification by integrating over one thousand candidate genes corresponding to agronomic and nutritional traits, which are derived from genetic mapping and comparative genomic analysis.

      [The pipeline for high-throughput genome-editing]

      With low-cost capture-based deep sequencing [downloading capture primer pairs], 412 precisely edited sequences covering 118 genes were identified. This mutant profile [downloading] was parallel to that observed in human cell lines and highly predictable.

      Raw WGS and RNA-seq reads of the transformation receptor (KN5585), and raw reads of capture-based sequencing for 60 batches have been deposited in the Genome Sequence Archive of BIG Data Center with accession code "CRA001955". The assembled genomic and transcriptomic contigs can be directly obtained from the netdisc [MaizeGo Resources -> CRISPR-Contigs]


    Reference(s):
    High-Throughput CRISPR/Cas9 Mutagenesis Streamlines Trait Gene Identification in Maize (2020)