Simulation of life histories

Introduction

The life course or life history is conceptualized as a sequence of states and transitions between states. A state is a personal attribute, e.g. marital status or number of children ever born, or a combination of personal attributes. The set of possible states is the state space. It is the set of possible values of a personal attribute or combination of attributes. An individual’s life history depends on several factors, including individual characteristics, social context and personal experiences. In this paper, the influencing factors are limited to personal characteristics. The individual life history is modelled as a multistate stochastic process. The transitions the individual experiences and the states occupied after the transitions depend on personal characteristics and chance. The parameters of the multistate process are transition rates that vary with age (and covariates). They are estimated from empirical data. Parameter estimators are obtained by combining data from different individuals. As a result, the individual life history generated by the model is synthetic. Individuals are virtual individuals and the population is a virtual population.

The vignette consists of four sections. Section 2 is a brief review of the theory of multistate modelling. Section 3 covers the simulation of a single fertility history. Section 4 is an application. Fertility histories of women in the United States are simulated using data from the Human Fertility Database (HFD) (www.humanfertility.org) and the Human Mortality Database (HMD) (www.mortality.org). Individuals with identical characteristics have different life histories due to random factors ¹. VirtualPop has several vignettes. To list the vignettes, use $browseVignettes("VirtualPop")$ and to see the code, use edit(vignette(x)).

Theoretical background: simulating life histories

Let _kY(x) denote the state individual k occupies at age x. _kY(x) is a discrete random variable, a Bernoulli random variable if r=2 and a multinomial random variable if r>2. The probability that k is in state i at age x is the state probability _kp_i(x) = Pr[_kY(x) = i], with $\sum_{i=1}^{r} {_kp_i(x)=1}$. The probability that k is in state i at age x and in state j at age x + τ is the joint probability

The probability is the product of the probability that k is in i at x and the conditional probability that k is in j at x + τ, provided k is in i at x:

The conditional probability is the transition probability and is denoted by _kp_j|i(x, x + τ), which is more conveniently written as _kp_ij(x, x + τ) and, if τ = 1, it is also written as _kp_ij(x). The probability that k is in j at x + τ is

which is known as state equation. It expresses the state probability at x + τ in terms of the state probabilities at x and the transition probabilities.

State probabilities may be combined in a state vector _kS(x) and transition probabilities in the transition matrix _kP(x, x + τ):

where p_ii(τ) = 1 − ∑_j ≠ i_kp_ij(τ).

The element _kp_ij(x, x + τ) denotes the probability that k who resides in location i at age x resides in j at age x + τ. The diagonal element _kp_ii(x, x + τ) is the probability that k, who is in i at x, is also in i at x + τ. It does not imply that k stays in i from x to x + τ on a continuous basis. Multiple transitions are allowed during the interval (x, x + τ).

The state vector at x + τ is

where_kS(t) is a column vector and ’ denotes transpose. This matrix expression is a fundamental equation in multistate modelling. In this vignette, the state equation is specified at the individual level. The transition probabilities of an individual of age x depend on personal attributes at x, the history of attributes, and contextual factors. In this vignette, the dependencies are restricted.

If τ is small and tends to zero, is the instantaneous rate of transition (transition intensity) at age x. Multistate models are fully specified by their transition intensities. If the instantaneous transition rates are independent of the history of k, the stochastic process governed by the transition intensities is a continuous-time Markov process. A common assumption in demography is that _kμ_ij(x) is piecewise constant (e.g. age-specific rates).

The instantaneous transition rates are combined in a transition rate matrix: where μ_ii(τ) = ∑_j = i_kμ_ij(τ).

The cumulative transition rate during the age interval from 0 to x is the cumulative hazard:

The formulation assumes that, throughout the interval from 0 to x, individual k may enter any state j from any state i unless _kμ_ij(τ) is zero. A state that can be entered and left is a transient state, while a state that can be entered but not left is an absorbing state. To be able to move from i to j at time τ, i.e. during the small interval from τ to τ + dτ, individual k must be in i at age τ. The current state occupied determines possible destination states. The state k occupies at τ is determined endogenously, i.e. by the model. Individual k is in i at age τ and at risk of a transition to j if _kY(τ) = i. Let _kY_i(τ) be an indicator variable which is one if k is in i at τ and zero otherwise:_kY_i(τ) = I((_kY(τ) = i). The quantity _kY_i(τ)dτ is the time k spends in i during the interval (τ, τ + dτ). It is also the duration of exposure to the risk (duration at risk) of leaving i. The total duration of exposure before age x is ∫₀^x_kY_i(τ)dτ. It is the duration of stay on i between 0 and x. The stay does not need to be on a continuous basis. Individual k may leave i and return to i between ages 0 and x. The transition intensity at age x is the product of the instantaneous transition rate and the indicator variable:

The function _kμ_ij(τ) is the hazard function, _kY_i(τ) the exposure function and _kλ_ij(τ) the intensity function (Aalen, Borgan, and Gjessing 2008, 31; Cook and Lawless 2018, 29). The expression _kμ_ij(τ)dτ is the probability that a transition occurs during the interval (τ, τ + dτ), provided k is at risk of experiencing the transition; _kY_i(τ)dτ is the duration of exposure during the interval (τ, τ + dτ); and _kλ_ij(τ)dτ is the expected number of transitions from i to j by individual k during the interval (τ, τ + dτ). The integral ∫₀^x_kλ_ij(τ)dτ is the expected number of (i,j)-transitions between ages 0 and x. It is denoted by _kN_ij(x). The sequence of random variables {_kN_ij(τ); 0 ≤ τ ≤ x} is a counting process (Aalen, Borgan, and Gjessing 2008, 25). The presentation of life history processes as counting processes is a logical approach in demography because event occurrences are related to exposures. Willekens (2014) adopts the counting process perspective in multistate demographic modelling. Notice that the cumulative hazard _kΛ_ij(x) may also be interpreted as an expected number of transitions between ages 0 and x. It is the expected number, provided individual k remains at risk of an (i,j)-transition during the entire period from 0 to x. If during the interval from 0 to x, individual k is at risk during the subinterval from x₁ to x₂ with 0 ≤ x₁, x₂ ≤ x, then the cumulative hazard may be written as

During the interval, k may stop being at risk by moving to another destination or by censoring the observation. In the simulation, the cumulative hazard needs to be computed for a period during which individual k is exposed to the risk of a (i,j)-transition. Often, a period of exposure is endogenous and starts with the occurrence of another transition. For instance, the birth of a first child triggers a period at risk of a birth of a second child. The probability that k, who is in i at x₁, is in j at age x₂ is exp[−_kΛ_ij(x₁, x₂)]. The probability that k, who is in i at x₁, is still in i at x (x > x₁) is To determine the waiting time to leaving i for individual k, who is in i at x_1, a random number u is drawn from the standard uniform distribution and the root of the equation

is determined. It gives the duration between x₁ and the age at transition, i.e.x_u − x₁, with x_u the age for which the above equation holds. The destination j is determined by sampling the multinomial distribution with parameters {_kp_i1(x),_kp_i2(x), ….,_kp_ir(x)}

and ∑_j ≠ i_kp_ij^*(x) = 1. Sampling a multinomial distribution consists of two steps. First, a random number u is drawn from the standard uniform distribution. Second, the position of u in the parameter space is determined. If u < _kp_i1^*(x), the destination is state 1. If _kp_i1^*(x) ≤ u < _kp_i2^*(x), the destination is state 2, etc. The method is equivalent to creating a vector with endpoints 0 and 1 and breakpoints _kp_i1^*(x), _kp_i1^*(x) + _kp_i2^*(x), _kp_i1^*(x) + _kp_i2^*(x) + _kp_i3^*(x), etc. and to determine the segment that contains u.

Some transitions may not be possible, leading to an adjustment of the matrix of instantaneous transition rates _kμ(x). In the next section, the above theory is applied to simulate fertility careers. Each female member of the population becomes at risk of conception at menarche and remains at risk until menopause (conditional on survival and in the absence of sterilization). Fertility rates are zero initially, turn positive during adolescence and become zero again at menopause. The age at birth of the first child is determined by drawing a random waiting time from a piecewise exponential distribution with the age-specific first birth rates as parameters. If the waiting time exceeds the maximum reproductive age, the woman remains childless. Assume singleton births (no twins or triplets). Women with a first child are exposed to the risk of a second child. Exposure starts at the age at birth of the first child and ends at the age at birth of the second child or the maximum reproductive age. Women with two children are exposed to the risk of a third child, and so on. Formally, women with i children are at risk of an additional child. The risk period starts at the age of birth of the i-th child and ends at birth of the i+1-st child or the end of the reproductive period (age at menopause). If i=0, exposure starts at the lowest age of the reproductive period (age at menarche).

Assuming no multiple births, a feasible transition is from i births to i+1 births. Other transitions are not possible. Hence, the transition intensities _kμ_ij(t) are 0 except if j=i+1, with i the current number of biological children ever born to k. The matrix of party-specific instantaneous fertility rates is: with r the highest parity considered in the analysis and _kμ_rr(x) the instantenous rate that a woman with r children has another child. For instance, if r is five, then a woman with five children may have another child. Beyond parity five, the fertility rate is independent of the parity. The matrix of transition intensities defines a hierarchical model. A hierarchical model is characterized by having r living states connected by r-1 rates of transition (Schoen 2016, 2019).

In the presence of mortality, the reproductive life course may be interrupted by death. Birth of the next child and death are competing risks. In the model, it is assumed that the rate of death is independent of the number of children ever born. In that case, individual lifespans may be simulated first, followed by the simulation of fertility careers.

Illustration: simulation of a single life history

Consider a most simple case. An individual has one attribute with three possible values (A, B and C). The three possible values is the state space. The individual moves between the three states at constant rates. The transition rates are:

A <- matrix(c(   0.0,0.10,0.05,        
                 0.07,0.0,0.03,
                 0.02,0.05,0.0),nrow=3,byrow=TRUE)
namstates <- c("A","B","C")
dimnames(A) <- list(origin=namstates,destination=namstates)
diag(A) <- -rowSums(A)
B <- -A
B
#>       destination
#> origin     A     B     C
#>      A  0.15 -0.10 -0.05
#>      B -0.07  0.10 -0.03
#>      C -0.02 -0.05  0.07

The transition rate matrix B has the format of the matrix in equation 5.

The constant transition rates are used to generate life histories during segments of life. The function sim.msm() of the msm package simulates one realisation from a continuous-time Markov process up to a given time. Note that sim.msm() requires that the states are numeric values. It also requires that the transition matrix has origin in rows and destination in columns (Jackson 2011, 6). The following code chunk generates a history between ages 20 and 40 for an individual who starts in state A at age 20.

bio <- msm::sim.msm (qmatrix=-B,mintime=20,maxtime=40,start=1)
bio
#> $states
#> [1] 1 3 3
#> 
#> $times
#> [1] 20.00000 38.53716 40.00000
#> 
#> $qmatrix
#>       destination
#> origin     A     B     C
#>      A -0.15  0.10  0.05
#>      B  0.07 -0.10  0.03
#>      C  0.02  0.05 -0.07

The function produces a list of three objects:

Sequence of states occupied. The first state is the state at the start of the simulation. The last state is the state occupied at the end of the simulation, which is the same as the state occupied after the last transition.
Ages of transition. The list includes the starting age and ending age.
The transition matrix used to generate the life histories

At the start of the simulation, the individual is in state A (state 1) and exposed to the risk of moving to state B (state 2) or C (state 3). A random draw from the exponential waiting time distribution gives 18.54 years (since the start of the simulation and 38.54 years since the start of the time scale). Since 18.54 is the duration at which the value of the cumulative probability distribution is equal to u, we may determine the value of u produced by the random draw:

The probability that an individual leaves state A before the end of the observation at 40 is 1 − exp[−0.15 * 20] = 0.95. Hence, if a random number is drawn that exceeds 0.95, the waiting time to the transition exceeds the observation window. In that case, the transition does not occur due to censoring.

The destination state is determined by a Bernoulli trial with two possible outcomes: B or C and probability 0.10/(0.10 + 0.05) = 0.667 that B is the outcome.

The function sim.msm() is also used by Cook and Lawless (2018, 339).

After this illustration of the mechanism, we return to fertility. Suppose a female member of the virtual population experiences throughout her life the age- and parity-specific fertility rates of the United States in 2019. The fertility career is a realization (sample path) of a continuous-time Markov process. The function Sim_bio() of VirtualPop is used to generate the fertility career. Unlike in sim.msm(), the transition rates used by Sim_bio() are age-specific. The fertility rates of the United States in 2019 are distributed with the VirtualPop package. They are loaded with the data function of R base:

library (VirtualPop)
rates <- NULL
data(rates)

The object rates has three components: (a) the death rates by age and sex (ASDR), (b) the fertility rates by age and parity (ASFR), and (c) the fertility rates in a format required by multistate models (ratesM). The rates for ages 25 to 29 are:

rates$ratesM[26:29,,]
#> , , Origin = par0
#> 
#>     Destination
#> Age     par0     par1 par2 par3 par4 par5 par6
#>   25 0.05250 -0.05250    0    0    0    0    0
#>   26 0.05646 -0.05646    0    0    0    0    0
#>   27 0.06218 -0.06218    0    0    0    0    0
#>   28 0.06976 -0.06976    0    0    0    0    0
#> 
#> , , Origin = par1
#> 
#>     Destination
#> Age  par0    par1     par2 par3 par4 par5 par6
#>   25    0 0.16041 -0.16041    0    0    0    0
#>   26    0 0.15638 -0.15638    0    0    0    0
#>   27    0 0.15205 -0.15205    0    0    0    0
#>   28    0 0.15047 -0.15047    0    0    0    0
#> 
#> , , Origin = par2
#> 
#>     Destination
#> Age  par0 par1    par2     par3 par4 par5 par6
#>   25    0    0 0.14116 -0.14116    0    0    0
#>   26    0    0 0.13336 -0.13336    0    0    0
#>   27    0    0 0.12156 -0.12156    0    0    0
#>   28    0    0 0.11499 -0.11499    0    0    0
#> 
#> , , Origin = par3
#> 
#>     Destination
#> Age  par0 par1 par2    par3     par4 par5 par6
#>   25    0    0    0 0.13117 -0.13117    0    0
#>   26    0    0    0 0.12155 -0.12155    0    0
#>   27    0    0    0 0.11328 -0.11328    0    0
#>   28    0    0    0 0.10449 -0.10449    0    0
#> 
#> , , Origin = par4
#> 
#>     Destination
#> Age  par0 par1 par2 par3    par4     par5 par6
#>   25    0    0    0    0 0.14558 -0.14558    0
#>   26    0    0    0    0 0.13723 -0.13723    0
#>   27    0    0    0    0 0.12618 -0.12618    0
#>   28    0    0    0    0 0.11658 -0.11658    0
#> 
#> , , Origin = par5
#> 
#>     Destination
#> Age  par0 par1 par2 par3 par4    par5     par6
#>   25    0    0    0    0    0 0.14558 -0.14558
#>   26    0    0    0    0    0 0.13723 -0.13723
#>   27    0    0    0    0    0 0.12618 -0.12618
#>   28    0    0    0    0    0 0.11658 -0.11658
#> 
#> , , Origin = par6
#> 
#>     Destination
#> Age  par0 par1 par2 par3 par4 par5 par6
#>   25    0    0    0    0    0    0    0
#>   26    0    0    0    0    0    0    0
#>   27    0    0    0    0    0    0    0
#>   28    0    0    0    0    0    0    0

The following function call simulates the fertility career of a hypothetical person born on June 12th 1990.

popsim <- data.frame (ID=3,
           born=1990.445,
           start=0,
           end=55,
           st_start="par0")
ch <- suppressWarnings(Sim_bio (datsim=popsim,ratesM=rates$ratesM))
ch
#> $age_startSim
#> [1] 0
#> 
#> $age_endSim
#> [1] 56
#> 
#> $nstates
#> [1] 1
#> 
#> $states
#> [1] 1
#> 
#> $path
#> [1] "par0"
#> 
#> $ages_trans
#> [1] 0 0

The individual has a first child at age 28 and a second at age 34. Since the date of birth is known, the calendar dates at childbirth can be determined. The first child is born on July 3rd 2018 and the second on 27th March 2025.

format(lubridate::date_decimal(1990.445+28.0567), "%Y-%m-%d")
#> [1] "2018-07-03"
format(lubridate::date_decimal(1990.445+34.789), "%Y-%m-%d")
#> [1] "2025-03-27"

The simulation window covers the entire reproductive period. The simulation may be interrupted at any age during the reproductive period, analogous to the censoring of a longitudinal observation. To assess the validity of the simulation, the simulation window must coincide with the observation window (see vignette Validity of the simulation). The argument end of the Sim_bio() function is an individual-level variable indicating the end of the simulation for the individual.

Sim_bio() is used in the function Children() of VirtualPop. The function Children() produces two objects (data frames), one related to the mother (and father) and the other related to the new-born children. Using Children() to simulate the fertility career of the individual in the virtual population with ID equal to 1, is

Children (dat0=dataLH[c(1,3),1:11],rates)
#> $data
#>   ID gen    sex   bdated   ddated x_D IDpartner IDmother IDfather jch nch id.1
#> 1  1   1 Female 2019.053 2112.222  35      1177       NA       NA  NA   3    4
#> 3  3   1 Female 2019.158 2091.099  35      9416       NA       NA  NA   1    7
#>   id.2 id.3 id.4 id.5 id.6 id.7 id.8 id.9    age.1    age.2    age.3 age.4
#> 1    5    6   NA   NA   NA   NA   NA   NA 25.13885 28.95929 32.28581    NA
#> 3   NA   NA   NA   NA   NA   NA   NA   NA 30.53096       NA       NA    NA
#>   age.5 age.6 age.7 age.8 age.9
#> 1    NA    NA    NA    NA    NA
#> 3    NA    NA    NA    NA    NA
#> 
#> $dch
#>   ID gen    sex   bdated   ddated      x_D IDpartner IDmother IDfather jch
#> 1  4   2 Female 2044.192 2131.248 87.05666        NA        1       NA   1
#> 2  5   2   Male 2048.012 2131.956 83.94424        NA        1       NA   2
#> 3  6   2 Female 2051.339 2138.235 86.89618        NA        1       NA   3
#>    7   2   Male 2049.689 2145.910 96.22127        NA        3       NA   1

Now we turn to an application: the simulation of fertility careers of women in the United States.

Application: fertility histories of women in the United States

The model described in the previous section is used to simulate individual fertility histories in the presence of mortality.The data are fertility and mortality rates of the United States in 2019. The data are downloaded from the Human Mortality Database (HMD) and the Human Fertility Database (HFD). The retrieval of data from the HMD and HFD is described in another vignette (Tutorial). The mortality and fertility rates are assumed to remain constant during the simulation. Consequently, the simulated life histories convey information embedded in the mortality and fertility rates of 2019. They should not be interpreted as descriptions of historical trends, projections or forecasts. That requires time-varying rates.

The age-specific fertility and mortality rates of the population of the United States in 2019 are distributed with the VirtualPop package for illustrative purpose. The rates are contained in the data object rates. The rates can be used after being loaded with the data function of R base:

library (VirtualPop)
rates <- NULL
data(rates)

The creation of the virtual population from fertility and mortality rates involves several steps. The remainder of this section is a description of the steps to generate the virtual population. The steps are:

Determine the size of the virtual population
Create a data frame with selected data on each individual in the virtual population.
- Add ID of partner (function Partnership())
Generate individual fertility histories (for members of initial population - generation 1) (function Children())
Add, for each new-born, date of birth, sex, ID of mother and ID father
Add, for each new-born, ID of partner
Generate individual fertility histories for members of generation 2 (function Children())
Generate individual fertility histories for members of generation 3 (function Children())
Generate individual fertility histories for members of generation 4 (function Children())

The simulation starts by determining the size of the virtual population. Each individual in the virtual population is assigned a unique identification number (ID) and a date of birth. If all births occur in the same year or same period, then the population consists of a single birth cohort. A real population may be approximated by distributing births over many years. In this section, it is assumed that all births occur in a single year. Dates of birth are obtained by sampling from a probability distribution of births during the year. In this paper, a uniform distribution is assumed (no seasonal effects). The date of birth of an individual is the year of birth plus a fraction of a year, equal to a random number from the standard uniform distribution. The date of birth is expressed as a real number (decimal numeral system) and is referred to as decimal date of birth. The format is particularly useful to compute ages and durations. Decimal dates may be converted into calendar dates. Note that the computer does not store calendar dates, but stores dates as time (seconds or milliseconds) elapsed since a reference date and time, e.g. midnight, 1st January 1970. The conversion into calendar dates is not at all straightforward and software packages may give different results. The conversion should account for months of different length and leap years. The R library contains several date functions that correctly handle chronological objects. For instance, the date_decimal() function of the lubridate package, which is part of tidyverse, converts a decimal date into a calendar date. For a review and discussion of issues related to ages and durations in the simulation of life histories, see Willekens (2013).

At the start of the simulation, each individual is randomly assigned a sex using the empirical age-specific sex ratio at birth in the base year. Other personal attributes may be added, but they are omitted in this illustration.

The simulation includes a very simple partner formation process. Partners are selected at random from the opposite sex. Since the process is purely random, partnership formation may be simulated to occur at birth. When more boys are born than girls, some males remain without a partner.

For each individual in the virtual population, a data record is created. It contains the individual’s ID, date of birth and sex. It also contains a variable for the generation to which the individual belongs. Members of the initial population constitute generation one. The code is

refyear <- 2019
ages <- c(0:110)
ncohort <- 1000
ID <- 1:ncohort
sex <- rbinom(ncohort,1,prob=1/2.05)
sex <- factor (sex,levels=c(0,1),labels=c("Male","Female"),ordered=TRUE)
# Population size by sex
nmales <- length(sex[sex=="Male"])
nfemales <- length(sex[sex=="Female"])
gen <- rep(1,ncohort) # generation 1
# Decimal date of birth
bdated <- refyear+runif(ncohort)
# Create data frame
d <- data.frame (ID=ID,gen=gen,sex=sex,bdated=bdated,ddated=NA,x_D=NA,IDpartner=NA,IDmother=NA,IDfather=NA,jch=NA,nch=NA)
# Ages at death, obtained by sampling a peicewise-exponential distribution, using the rpexp function of the msm package
d$x_D[d$sex=="Male"] <- msm::rpexp(n=nmales,rate=rates$ASDR[,"Males"],t=ages)
d$x_D[d$sex=="Female"] <- msm::rpexp(n=nfemales,rate=rates$ASDR[,"Females"],t=ages)
# Decimal data of death
d$ddated <- d$bdated+d$x_D

Individual records include empty spaces for the IDs of the partner, the mother and father of the individual. Their IDs are determined later. The spaces are included to facilitate the merging of data on multiple generations into a single data frame.

The simple partnership model is implemented in function Partnership(d). It accepts the data frame d as input and returns the data frame augmented by the partner’s ID.

d <- VirtualPop::Partnership(dLH=d)

The first lines of the data frame of individual records are:

head(d)
#>   ID gen    sex   bdated   ddated      x_D IDpartner IDmother IDfather jch nch
#> 1  1   1   Male 2019.257 2109.674 90.41694       762       NA       NA  NA  NA
#> 2  2   1 Female 2019.221 2048.229 29.00860        61       NA       NA  NA  NA
#> 3  3   1 Female 2019.460 2111.601 92.14055       968       NA       NA  NA  NA
#> 4  4   1 Female 2019.581 2111.684 92.10307       398       NA       NA  NA  NA
#> 5  5   1   Male 2019.503 2106.110 86.60664       429       NA       NA  NA  NA
#> 6  6   1   Male 2019.794 2094.515 74.72116       663       NA       NA  NA  NA

For women, ages at childbirth are determined by sampling from waiting time distributions with parameters equal to the fertility rates by age and parity. When a child is born, it receives a unique ID and a new individual data record is created. The generation variable is set to 2. The date of birth is determined by the date of birth of the mother and her exact age at birth of the child. The ID of the mother is added to the child’s record and the child’s ID is added to the data records of the mother and her partner, assumed to be the father. The record linkages enable to track the descending line of offspring and ascending line of parentage. It significantly facilitates the study of genealogies and kinship networks. The new-born is also assigned a sex using the empirical sex ratio at birth and is randomly allocated a partner. The lifespan and the fertility career of each member of the second generation is determined by sampling the appropriate probability distributions. The function Children(dat0, rates) does the computations. The first argument is the individual data structure,d in this illustration, and the second is the object with the transition rates. For instance,

dch1 <- VirtualPop::Children(dat0=d,rates=rates)

The object dch1 has two components: data and dch. The first components is comparable to the women’s file in demographic surveys and the second to the children’s file (see e.g. Demographic and Health Survey (DHS) [https://dhsprogram.com]). The object data is the input object d with, for each mother and father, information on

number of children, and
the ID of each child and the mother’s age at birth.

The second object dch is a data frame with individual records for the new-borns (second generation). It has the ID of the child, the generation, sex, date of birth, the birth order (variable jch), the ID of the mother, the ID of the father, and the date of death of the child. The ID of the father is the ID of the partner of the mother:

dch1$dch$IDfather <- dch1$data$IDpartner[dch1$dch$IDmother]

The first records are shown below.

head(dch1$data)
#>   ID gen    sex   bdated   ddated      x_D IDpartner IDmother IDfather jch nch
#> 1  1   1   Male 2019.257 2109.674 90.41694       762       NA       NA  NA   0
#> 2  2   1 Female 2019.221 2048.229 29.00860        61       NA       NA  NA   2
#> 3  3   1 Female 2019.460 2111.601 92.14055       968       NA       NA  NA   2
#> 4  4   1 Female 2019.581 2111.684 92.10307       398       NA       NA  NA   3
#> 5  5   1   Male 2019.503 2106.110 86.60664       429       NA       NA  NA   0
#> 6  6   1   Male 2019.794 2094.515 74.72116       663       NA       NA  NA   0
#>   id.1 id.2 id.3 id.4 id.5 id.6 id.7 id.8 id.9    age.1    age.2   age.3 age.4
#> 1   NA   NA   NA   NA   NA   NA   NA   NA   NA       NA       NA      NA    NA
#> 2 1001 1002   NA   NA   NA   NA   NA   NA   NA 22.21320 24.75549      NA    NA
#> 3 1003 1004   NA   NA   NA   NA   NA   NA   NA 31.59449 33.09805      NA    NA
#> 4 1005 1006 1007   NA   NA   NA   NA   NA   NA 22.62514 29.67398 32.5137    NA
#> 5   NA   NA   NA   NA   NA   NA   NA   NA   NA       NA       NA      NA    NA
#> 6   NA   NA   NA   NA   NA   NA   NA   NA   NA       NA       NA      NA    NA
#>   age.5 age.6 age.7 age.8 age.9
#> 1    NA    NA    NA    NA    NA
#> 2    NA    NA    NA    NA    NA
#> 3    NA    NA    NA    NA    NA
#> 4    NA    NA    NA    NA    NA
#> 5    NA    NA    NA    NA    NA
#> 6    NA    NA    NA    NA    NA
head(dch1$dch)
#>     ID gen    sex   bdated   ddated      x_D IDpartner IDmother IDfather jch
#> 1 1001   2   Male 2041.434 2125.934 84.50009        NA        2       61   1
#> 2 1002   2 Female 2043.976 2101.600 57.62420        NA        2       61   2
#> 3 1003   2 Female 2051.055 2122.476 71.42117        NA        3      968   1
#> 4 1004   2   Male 2052.559 2125.463 72.90399        NA        3      968   2
#> 5 1005   2 Female 2042.206 2136.599 94.39282        NA        4      398   1
#> 6 1006   2   Male 2049.255 2128.263 79.00805        NA        4      398   2

The members of the second generation partner with individuals of the same generation. Partners are allocated randomly.

d2 <- VirtualPop::Partnership (dLH=dch1$dch)

The object d2 is dch1$dch augmented by the IDs of partners. The data frame contains the data on the second generation in a format that is consistent with the data frame of the first generation, which facilitates merging the data frames. The Children() function is used to obtain information on the third generation (children of women of the second generation):

dch2 <-  VirtualPop::Children(dat0=d2,rates=rates)

Partnership in the third generation is the outcome of a random search:

d3 <- VirtualPop::Partnership (dLH=dch2$dch)

A similar procedure is followed to generate the fourth generation (children of mothers of the third generation) and higher-order generations.

dch3 <-  VirtualPop::Children(dat0=d3,rates=rates)
d4 <- VirtualPop::Partnership (dLH=dch3$dch)
dch4 <-  VirtualPop::Children(dat0=d4,rates=rates)
d4 <- dch4$data[,1:which (colnames(dch4$data)=="nch")]

The object d4 has data on the fourth generation, including on the children of members of the fourth generation.

Data on the four generations is merged to a single data frame.

data4 <- rbind(dch1$data,dch2$data,dch3$data,dch4$data)

The data frame that results is similar to the dataLH data object included in the VirtualPop package.

References

Aalen, Odd, Ornulf Borgan, and Hakon Gjessing. 2008. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media.

Caswell, Hal. 2009. “Stage, Age and Individual Stochasticity in Demography.” Oikos 118 (12): 1763–82.

Cook, Richard J, and Jerald F Lawless. 2018. Multistate Models for the Analysis of Life History Data. Chapman; Hall/CRC.

Jackson, Christopher. 2011. “Multi-State Models for Panel Data: The Msm Package for R.” Journal of Statistical Software 38: 1–28. https://doi.org/10.18637/jss.v038.i08.

Schoen, Robert. 2016. “Hierarchical Multistate Models from Population Data: An Application to Parity Statuses.” PeerJ 4: e2535.

———. 2019. Analytical Family Demography. Vol. 47. Springer.

Willekens, Frans. 2013. “Chronological Objects in Demographic Research.” Demographic Research 28: 649–80.

———. 2014. Multistate Analysis of Life Histories with R. Springer.

For a discussion of the distinction between individual heterogeneity and individual stochasticity, see Caswell (2009).↩︎