Phenotype simulation

The below demonstrates the simulation of

  • Binary trait
  • Quantitative trait
  • Ordinal categorical trait
  • Time-to-event trait (to be updated)

The phenotype data system.file("extdata", "simuPHENO.txt", package = "GRAB") is simulated as below line by line.

First, we simulate covariates data frame in R

library(GRAB)
set.seed(678910)
FamFile = system.file("extdata", "simuPLINK.fam", package = "GRAB")
FamData = data.table::fread(FamFile)
IID = FamData$V2  # Individual ID
n = length(IID)   # sample size
Covar = data.table::data.table(IID = IID,
                               AGE = rnorm(n, 60), 
                               GENDER = rbinom(n, 1, 0.5))

Then, we simulate linear predictors eta

beta.AGE = 0.5
beta.GENDER = 0.5
Covar = Covar %>% mutate(eta = beta.AGE * AGE + beta.GENDER * GENDER)

To simulate phenotypes for subjects with a given family structure, please add a random effect to the linear predictors eta as below.

bVec = GRAB.SimubVec(500, 50, "10-members", tau = 1)
Covar = merge(Covar, bVec)
PhenoData = Covar %>% mutate(eta = eta + bVec)

Next, we simulate phenotypes using eta

A. binary trait

PhenoData = PhenoData %>% mutate(BinaryPheno = GRAB.SimuPheno(eta, traitType = "binary", 
                                                              control = list(pCase=0.1)),
                                                              seed = 1)
PhenoData %>% select(BinaryPheno) %>% table()
# BinaryPheno
#   0   1
# 881 119

B. quantitative trait

PhenoData = PhenoData %>% mutate(QuantPheno = GRAB.SimuPheno(eta, traitType = "quantitative", 
                                                             control = list(sdError=1)))

C. ordinal categorical trait

PhenoData = PhenoData %>% mutate(OrdinalPheno = GRAB.SimuPheno(eta, traitType = "ordinal", 
                                                               control = list(pEachGroup = c(8,1,1))))
# OrdinalPheno
#   0   1   2
# 801 109  90

D. time-to-event trait

TimeToEventPheno = GRAB.SimuPheno(PhenoData$eta, traitType = "time-to-event", control = list(eventRate = 0.1))
PhenoData = cbind(PhenoData, TimeToEventPheno)

The simulated phenotype data is as below

head(PhenoData)
# Key: <IID>
#         IID      AGE GENDER      eta       bVec BinaryPheno  seed QuantPheno
#      <char>    <num>  <int>    <num>      <num>       <int> <num>      <num>
# 1:   Subj-1 59.61118      0 29.59961 -0.2059781           0     1 -0.4837045
# 2:  Subj-10 61.31636      0 29.32101 -1.3371678           0     1 -3.2131581
# 3: Subj-100 59.16625      0 29.90944  0.3263173           0     1  0.4453062
# 4: Subj-101 60.18490      1 28.99876 -1.5936861           0     1 -1.2842710
# 5: Subj-102 58.76700      1 27.90020 -1.9832990           0     1 -1.5138808
# 6: Subj-103 59.98608      1 29.11236 -1.3806732           0     1 -0.3199512
#    OrdinalPheno   SurvTime SurvEvent
#           <num>      <num>     <num>
# 1:            0 0.04815559         0
# 2:            0 0.06259781         0
# 3:            0 0.03009825         0
# 4:            0 0.01281380         0
# 5:            2 0.19564637         0
# 6:            0 0.08905056         0

Output the simulated phenotypes to a file

extDir = tempdir()
extFile = file.path(extDir, "simuPHENO.txt")
data.table::fwrite(PhenoData, extFile, 
                   row.names = F, quote = F, col.names = T, sep = "\t")                  

The distribution of simulated phenotypes

The distribution of simulated phenotypes compared to linear predicators eta is as below.

  • quantitative phenotype: positive correction between eta and phenotypes
  • binary phenotype: higher eta, higher possibility of being cases
  • ordinal categorical phenotype: higher eta, higher possibility of being groups with larger number

Distribution of simulated phenotypes