Bellwether

pcvr v0.1.0

Josh Sumner, DDPSC Data Science Core Facility

2023-10-16

Code
library(pcvr)
library(data.table) # for fread
library(ggplot2)
library(patchwork) # for easy ggplot manipulation/combination

Example Bellwether (Lemnatech) Workflow

The Bellwether phenotyping facility at the Donald Danforth Plant Science Center allows for high-throughput, image-based phenotyping of up to 1140 plants over the course of several weeks. This generates a massive amount of image data which is typically analysed using plantCV, a Python-based image analysis tool developed and maintained by the Data Science Core Facility at DDPSC. The plantCV output from a Bellwether experiment consists of numeric phenotypes commonly broken into two categories: single value traits and multi value traits. Single value traits are phenotypes where one image yields one value, such as plant height or plant area. Multi value traits require multiple numbers to describe a single image and are currently limited to color histograms in various color spaces. Here we will focus only on the hue channel of HSV color, but there are lots of options with your data. This package is being developed to help with common analysis tasks that arise when using plantCV output. If your goal or experiment seems to be unsupported then please consider raising an issue on github so we know what directions to take development in the future.

Installation of pcvr from github is possible with the remotes or devtools packages. You may need to restart R after installing or reinstalling packages.

Code
devtools::install_github("joshqsumner/pcvr", build_vignettes = TRUE)
library(pcvr)
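
The same installation can be done with remotes if you prefer (a minimal equivalent call, assuming the remotes package is installed):

Code
remotes::install_github("joshqsumner/pcvr", build_vignettes = TRUE) # equivalent to the devtools call above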

If you clone that repository and are making edits then you can easily use your local version with: devtools::load_all("file/path/to/local/pcvr")

Functions in pcvr use colorblind friendly palettes from the viridis package if they specify color/fill scales. Some functions do not specify a color/fill scale and use ggplot2 defaults. If you use these functions or base work off of them please be mindful of your choices regarding color.

As a final note before starting into pcvr, this vignette is laid out to help guide analyses from Bellwether experiments, but these functions tend to be generalizable and can be used for other plantCV data as well. Some code chunks in this vignette are only presented to demonstrate syntax and are not meant to be run locally if you are following along. Those are identified by style:

Code
complicatedFunction("syntax") # do not run this style
Code
1 + 1 # run this style
## [1] 2
Code
support <- seq(0, 1, 0.0001) # this style is simulated data
plot(support, dbeta(support, 5, 5), type = "l", main = "simulated example")

Read In Data

read.pcv

Bellwether data can be very large, so this vignette will use a small bellwether dataset that has already been subset to remove most multi-value traits (most color histograms). Both the single value data and multi value data are available on github. For single value traits being read into wide format there is not a compelling reason to use read.pcv in place of directly calling fread or read.csv. The benefits of read.pcv only show up if you need to filter data outside of R or if you want to use your data in a different format than it is stored in (wide going to long and vice versa).

The single value traits can be read in with read.pcv, here using data.table::fread for speed.

Code
base_url <- "https://raw.githubusercontent.com/joshqsumner/pcvrTestData/main/"
base_url2 <- "https://media.githubusercontent.com/media/joshqsumner/pcvrTestData/main/"
sv <- read.pcv(
  filepath = paste0(base_url, "pcv4-single-value-traits.csv"),
  reader = "fread"
)
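
If you would rather work with the data in long format (one row per trait observation), the same file can be read with mode = "long". This is a sketch of the syntax rather than a step needed for this workflow:

Code
sv_long <- read.pcv(
  filepath = paste0(base_url, "pcv4-single-value-traits.csv"),
  mode = "long", reader = "fread"
)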

You will generally want to join some metadata to your phenotypes or parse metadata from your barcodes. Metadata key files will generally look and be used like this:

Code
key <- read.csv(paste0(base_url, "smallPhenotyperRun_key.csv"))
head(key)
##          barcode genotype fertilizer
## 1 Em004ZDB130826       MM         50
## 2 Em002ZDC130783      B73          0
## 3 Em002ZDC130771      B73          0
## 4 Em005ZDC130867     Mo17          0
## 5 Em005ZDA130872     Mo17        100
## 6 Em005ZDC130864     Mo17          0
Code
sv <- merge(sv, key, by = "barcode")
table(sv$genotype, sv$fertilizer)
##        
##           0  50 100
##   B73   193 133 241
##   MM    255 208 292
##   Mo17  364 221 304
##   W605S 182 168 263

If we did not have a key file then we would parse our barcodes doing something like this:

Code
genotype <- substr(sv$barcode, 3, 5)
genotype <- ifelse(genotype == "002", "B73",
  ifelse(genotype == "003", "W605S",
    ifelse(genotype == "004", "MM", "Mo17")
  )
)
fertilizer <- substr(sv$barcode, 8, 8)
fertilizer <- ifelse(fertilizer == "A", 100,
  ifelse(fertilizer == "B", 50, 0)
)
table(genotype, fertilizer)
##         fertilizer
## genotype   0  50 100
##    B73   193 133 241
##    MM    255 208 292
##    Mo17  364 221 304
##    W605S 182 168 263

We’ll also convert some of the phenotypes from pixel units to meaningful units. Here we know that the chips on our color card are 1.2 cm on each side, so we can use the following conversion. You can convert these phenotypes to whatever units you like, but you need to know the size of the color card chips in some real unit and color correct in plantCV so that the color chip measurements are returned.

Code
chip_size_px <- mean(c(sv$median_color_chip_height_median, sv$median_color_chip_width_median)) # ~52

px_per_cm <- chip_size_px / 1.2 # ~ 43.5
pixels_per_cmsq <- px_per_cm^2 # ~ 1890

sv$area_cm2 <- sv$area_pixels / pixels_per_cmsq
sv$height_cm <- sv$height_pixels / px_per_cm

Very large datasets

Sometimes plantCV output may be too large to be read into memory in R. In that case read.pcv has a filters argument which will filter rows using awk on linux/unix outside of R. That feature would work as shown below to read only single value traits into memory. This can take a few minutes but allows the entire workflow to be documented in one R file.

Code
example <- read.pcv("prohibitivelyLargeFile.csv",
  filters = list(
    "trait in area_pixels, area_above_reference_pixels, area_below_reference_pixels",
    "sample in default"
  )
)

read.pcv.4.x

Before the release of PlantCV version 4 there were several widely used branches of PlantCV that output results as a single long format csv file. The original goal of read.pcv was to make it easier to work with those files. Now that the default output has changed the read.pcv function has been somewhat simplified but it can still be used for these “PlantCV 4.x” legacy style data.

Code
sv <- read.pcv(paste0(base_url2, "smallPhenotyperRun.csv"),
  mode = "wide",
  reader = "fread"
)

if (TRUE) { # we can parse barcodes for the metadata that we need
  sv$genotype <- substr(sv$barcode, 3, 5)
  sv$genotype <- ifelse(sv$genotype == "002", "B73",
    ifelse(sv$genotype == "003", "W605S",
      ifelse(sv$genotype == "004", "MM", "Mo17")
    )
  )
  sv$fertilizer <- substr(sv$barcode, 8, 8)
  sv$fertilizer <- ifelse(sv$fertilizer == "A", "100",
    ifelse(sv$fertilizer == "B", "50", "0")
  )
} else { # or we might use a key file and join it to our data
  key <- read.csv(paste0(base_url, "smallPhenotyperRun_key.csv"))
  sv <- merge(sv, key, by = "barcode")
}

read.pcv.3

Legacy Bellwether data (plantCV version 3 output) can also be read in with read.pcv.3, a wrapper around read.pcv which attempts to handle several common data-joining tasks for bellwether data internally. This should be considered experimental: it is based on plantCV version 3 and earlier outputs, bellwether experiments can take many formats, and this has only been tested with a few datasets. With the older plantCV output the data is already in a wider format, so here the default mode is “long”. Note that while read.pcv works fine with this older data and still has some added benefits, the reasons to use it in place of data.table::fread or base::read.csv are less compelling.

Here we have examples of reading in various amounts of plantCV 3 bellwether data. In the final example we also list a conversion to take area from pixels to \(\text{cm}^2\) for the 5MP camera that was used prior to 2023. Note that the conversion would change for a different camera such as the current 18MP camera. It is a good idea to check your color chip sizes if you are not sure about the appropriate conversion.

Code
onlyPhenos <- read.pcv.3(file = paste0(base_url, "pcv3Phenos.csv"), metaCol = NULL)
colnames(onlyPhenos)

phenosAndMeta <- read.pcv.3(
  file = paste0(base_url, "pcv3Phenos.csv"), metaCol = "meta",
  metaForm = "vis_view_angle_zoom_horizontal_gain_exposure_v_new_n_rep",
  joinSnapshot = "id"
)
colnames(phenosAndMeta)

all <- read.pcv.3(
  file = paste0(base_url, "pcv3Phenos.csv"),
  snapshotFile = paste0(base_url, "pcv3Snapshot.csv"),
  designFile = paste0(base_url, "pcv3Design.csv"),
  metaCol = "meta", metaForm = "vis_view_angle_zoom_horizontal_gain_exposure_v_new_n_rep",
  joinSnapshot = "id", conversions = list(area = 13.2 * 3.7 / 46856)
)
colnames(all)

Other metadata

Often we want to convert our timestamp data from the Lemnatech into days after start (DAS), days after planting (DAP), or days after emergence (DAE). By default the bw.time function will add columns called DAS, DAP, and DAE to your data. Days after emergence requires a phenotype and a cutoff value to classify emergence; here an area greater than 10 pixels is considered an emerged plant. In this example, with a planting delay of 0, DAS and DAP will be the same, but both are still created for the purpose of the example.

Code
out <- bw.time(sv,
  plantingDelay = 0, phenotype = "area_pixels", cutoff = 10,
  timeCol = "timestamp", group = c("barcode", "rotation"), plot = TRUE
)
out$plot
## [[1]]

## 
## [[2]]

## 
## [[3]]

Code
sv <- out$data
dim(sv)
## [1] 2824   52

Note that in these plots, and particularly clearly in the DAE plot, there are lots of plants which show a vertical line on their last day, indicating that our grouping might be incorrect. For this data that is due to the imaging schedule changing and getting 2 separate images at each rotation on the last day.

Before moving on we’ll also check the grouping in our data. Here we see that we have lots of plants with more than one image per day.

Code
checkGroups(sv, c("DAS", "barcode", "rotation", "genotype", "fertilizer"))
## There are 126 observations that are not uniquely identified.
## The max number of duplicates is 2.
## Run `sv[duplicated(interaction(sv$DAS, sv$barcode, sv$rotation, sv$genotype, sv$fertilizer)),]` to see the duplicated rows,
##  or sv[interaction(sv$DAS, sv$barcode, sv$rotation, sv$genotype, sv$fertilizer)=='19.Em002ZDC130765.0.B73.0',] to see the first duplicated instance.

We can combine images taken at different angles of the same plant on a given day with aggregate, where we will also remove some columns we aren’t using. In this case we use the mean of observations, but some people prefer a sum. Either way is fine. Here we also remove the DAE and DAP columns since we will not be using them.

Code
phenotypes <- colnames(sv)[c(19:35, 43:45, 48:49)]
phenoForm <- paste0("cbind(", paste0(phenotypes, collapse = ", "), ")")
groupForm <- "DAS+timestamp+barcode+genotype+fertilizer"
form <- as.formula(paste0(phenoForm, "~", groupForm))
sv_ag_with_outliers <- aggregate(form, data = sv, mean, na.rm = TRUE)
dim(sv_ag_with_outliers)
## [1] 1450   27

Outlier Removal

The bw.outliers function can be used to remove outliers relative to a phenotype using Cook’s distance. Here, because the experimental design had these plants germinate on the machine, around 1 percent of the data are removed as outliers. The plot shows removed data points in red, although here that is hard to see.

Code
out <- bw.outliers(
  df = sv_ag_with_outliers, phenotype = "area_pixels",
  group = c("DAS", "genotype", "fertilizer"), plotgroup = c("barcode")
)
## Warning in bw.outliers(df = sv_ag_with_outliers, phenotype = "area_pixels", :
## 18 groupings had all observations removed
Code
sv_ag <- out$data
out$plot

Code
dim(sv_ag)
## [1] 1439   27

It is also useful to check our grouping assumptions again; here we see that there are still some plants with multiple images from a single day.

Code
checkGroups(sv_ag, c("DAS", "barcode", "genotype", "fertilizer"))
## There are 63 observations that are not uniquely identified.
## The max number of duplicates is 2.
## Run `sv_ag[duplicated(interaction(sv_ag$DAS, sv_ag$barcode, sv_ag$genotype, sv_ag$fertilizer)),]` to see the duplicated rows,
##  or sv_ag[interaction(sv_ag$DAS, sv_ag$barcode, sv_ag$genotype, sv_ag$fertilizer)=='19.Em002ZDC130765.B73.0',] to see the first duplicated instance.

Watering Data

We might also want to check the watering data, which can be read easily from json with bw.water.

Code
water <- bw.water(paste0(base_url, "metadata.json"))
## Using the first watering time, 2023-04-13 23:28:17.58, as beginning of experiment to assign DAS
Code
water$genotype <- substr(water$barcode, 3, 5)
water$genotype <- ifelse(water$genotype == "002", "B73",
  ifelse(water$genotype == "003", "W605S",
    ifelse(water$genotype == "004", "MM", "Mo17")
  )
)
water$fertilizer <- substr(water$barcode, 8, 8)
water$fertilizer <- ifelse(water$fertilizer == "A", "100",
  ifelse(water$fertilizer == "B", "50", "0")
)

ggplot(water[water$weight_after != -1, ], aes(
  x = DAS,
  y = water_amount, group = barcode, color = genotype
)) +
  facet_wrap(~ factor(fertilizer, levels = c("0", "50", "100"))) +
  geom_line() +
  pcv_theme() +
  guides(color = guide_legend(title = "Condition", override.aes = list(linewidth = 5))) +
  labs(y = "Watering Amount (g)") +
  theme(legend.position = "bottom")

A common use for watering data is to look at water use efficiency (WUE). Here we calculate an approximation of WUE based on the change in some phenotype (area) over the change in weight between waterings. Note that the plants in this example are very young, so this data is dominated by noise.

Code
test <- pwue(df = sv_ag, w = water, pheno = "area_pixels", time = "timestamp", id = "barcode")

ggplot(test, aes(x = DAS, y = pWUE, color = genotype, group = barcode)) +
  geom_line() +
  guides(color = guide_legend(override.aes = list(linewidth = 5))) +
  labs(y = expression("Pseudo WUE (" ~ frac(
    Delta ~ textstyle("Area")[" pixels"],
    Delta ~ textstyle("Weight")[" g"]
  ) ~ ")")) +
  pcv_theme()
## Warning: Removed 88 rows containing missing values or values outside the scale range
## (`geom_line()`).

FREM

Now that our data is read in and has undergone some basic quality control, we want to know which phenotypes are best explained by our design variables. The frem function partitions variance using a fully random effects model (frem). Here we provide our dataframe, the design variables, the phenotypes, and the column representing time. By default frem will use the last timepoint, but this is controlled with the time argument. For this first example we mark the singular model fits (markSingular = TRUE) to indicate places where lme4::lmer had some convergence issues. Generally these issues are minor and do not cause problems for interpreting these models, but you can pass additional arguments through to lme4::lmer if desired.

Code
frem(sv_ag,
  des = c("genotype", "fertilizer"),
  phenotypes = c(
    "area_pixels", "area_above_reference_pixels", "area_below_reference_pixels",
    "convex_hull_area_pixels", "convex_hull_vertices_none", "ellipse_angle_degrees",
    "ellipse_eccentricity_none", "ellipse_major_axis_pixels", "ellipse_minor_axis_pixels",
    "height_pixels", "height_above_reference_pixels", "height_below_reference_pixels",
    "horizontal_reference_position_none", "hue_circular_mean_degrees", "hue_circular_std_degrees",
    "hue_median_degrees", "perimeter_pixels", "solidity_none", "width_pixels"
  ),
  timeCol = "DAS", cor = TRUE, returnData = FALSE, combine = FALSE, markSingular = TRUE, time = NULL
)
## [[1]]

## 
## [[2]]

Here we look at how much variance in each phenotype was explained over the course of the experiment. We could also specify a set of times (time = c(10:14) for example) if we are most interested in a particular timeframe.

Code
frem(sv_ag,
  des = c("genotype", "fertilizer"),
  phenotypes = c(
    "area_pixels", "area_above_reference_pixels", "area_below_reference_pixels",
    "convex_hull_area_pixels", "convex_hull_vertices_none", "ellipse_angle_degrees",
    "ellipse_eccentricity_none", "ellipse_major_axis_pixels",
    "ellipse_minor_axis_pixels",
    "height_pixels", "height_above_reference_pixels", "height_below_reference_pixels",
    "horizontal_reference_position_none", "hue_circular_mean_degrees", "hue_circular_std_degrees",
    "hue_median_degrees", "perimeter_pixels", "solidity_none", "width_pixels"
  ),
  timeCol = "DAS", cor = FALSE, returnData = FALSE, combine = FALSE, markSingular = FALSE, time = "all"
)
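
For instance, restricting the variance partitioning to the mid-experiment window mentioned above would look like this sketch (using a reduced phenotype list to keep the example short):

Code
frem(sv_ag,
  des = c("genotype", "fertilizer"),
  phenotypes = c("area_pixels", "height_pixels", "hue_circular_mean_degrees"),
  timeCol = "DAS", cor = FALSE, returnData = FALSE, combine = FALSE,
  markSingular = FALSE, time = c(10:14)
)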

This informs our next steps. Hue and size based phenotypes are well explained by our design variables so we might decide to focus more complex analyses on those. In this experiment mini maize is one of the genotypes and the fertilizer treatment has an option with no nitrogen, so this intuitively makes sense and passes an eye check.

Single Value Traits

Most analyses focus on single value traits, that is, phenotypes where one object in an image returns one numeric value such as area or height. These can be compared longitudinally or with respect to individual days. Note that we do not recommend using a PCA of the single value traits to look for differences in your treatment groups; these traits are interpretable on their own and interdependent enough that a PCA is not appropriate.

Growth Trendlines

Trendlines help us decide what next steps make the most sense and give a general impression of which conditions yielded healthier plants.

Code
ggplot(sv_ag, aes(
  x = DAS, y = area_cm2, group = interaction(genotype, fertilizer, lex.order = TRUE),
  color = genotype
)) +
  facet_wrap(~ factor(fertilizer, levels = c("0", "50", "100"))) +
  geom_smooth(method = "loess", se = TRUE, fill = "gray90") +
  geom_line(aes(group = barcode), linewidth = 0.15) +
  labs(
    y = expression("Area" ~ "(cm"^2 ~ ")"),
    color = "Genotype"
  ) +
  guides(color = guide_legend(override.aes = list(linewidth = 5))) +
  pcv_theme() +
  theme(axis.text.x.bottom = element_text(angle = 0))
## `geom_smooth()` using formula = 'y ~ x'

Single day comparisons

Non-longitudinal data (or single day data taken from larger longitudinal data) can be compared using the conjugate function. conjugate uses conjugate priors to make simple Bayesian comparisons for several types of data. Optionally, region of practical equivalence (ROPE) testing is also supported. Most distributions supported by this function can be used with either wide or long single value data or wide multi-value data. See ?pcvr::conjugate or the conjugate tutorial for details on usage and available options.

Code
mo17_area <- sv_ag[sv_ag$genotype == "Mo17" & sv_ag$DAS > 18 & sv_ag$fertilizer == 100, "area_cm2"]
b73_area <- sv_ag[sv_ag$genotype == "B73" & sv_ag$DAS > 18 & sv_ag$fertilizer == 100, "area_cm2"]

area_res_t <- conjugate(s1 = mo17_area, s2 = b73_area, method = "t", plot = TRUE, rope_range = c(-5, 5))

Here we can see a dramatic difference in our posterior distributions between genotypes and the distribution of differences between these samples is entirely outside of our ROPE range.
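
The returned object also contains the numeric results behind the plot. Since the exact element names can differ between pcvr versions, a quick structural look is the safest way to explore it:

Code
str(area_res_t, max.level = 1) # see which components conjugate returned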

Relative Tolerance

Often bellwether experiments involve comparing stress tolerance between groups. For example, in this dataset we might want to know which genotype shows the most resilience to reduced fertilizer. To easily check this we can transform our data with relativeTolerance. Details on this function can be read with ?pcvr::relativeTolerance. Note that if you are going to be making a longitudinal model anyway then it makes more sense to use the model to assess relative tolerance rather than transforming data into a different unit.

Code
rt <- relativeTolerance(sv_ag,
  phenotypes = c("area_cm2", "height_cm"),
  grouping = c("fertilizer", "genotype", "DAS"), control = "fertilizer", controlGroup = "100"
)


ggplot(
  rt[rt$phenotype == "area_cm2" & rt$DAS %in% c(10:12), ],
  aes(x = DAS, y = mu_rel, fill = interaction(fertilizer, genotype))
) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(ymin = mu_rel - 1.96 * se_rel, ymax = mu_rel + 1.96 * se_rel),
    position = position_dodge(width = 0.9), width = 0.3
  ) +
  pcv_theme() +
  labs(y = "Relative Tolerance", fill = "Fertilizer\nand Genotype")

Looking at the entire data for the area phenotype is very busy so we might subset to look at something more specific, where we see some odd products of the plants germinating on the bellwether system.

Code
pd <- rt[rt$phenotype == "area_cm2" & rt$DAS %in% c(5:19) & rt$fertilizer == "0", ]
pd$upper_2se <- pd$mu_rel + 2 * pd$se_rel
pd$lower_2se <- pd$mu_rel - 2 * pd$se_rel

ggplot(pd, aes(x = DAS, y = mu_rel, fill = genotype)) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(
    ymin = lower_2se, ymax = upper_2se,
    group = genotype
  ), position = "dodge") +
  pcv_theme() +
  labs(y = "Relative Tolerance")

Cumulative Phenotypes

Sometimes we might want to use the cumulative sum of a phenotype over time.

Code
cp <- cumulativePheno(sv_ag, phenotypes = c("area_cm2", "height_cm"), group = "barcode", timeCol = "DAS")

We can check that this worked correctly with trendlines:

Code
ggplot(cp, aes(x = DAS, y = area_cm2_csum, color = genotype, group = barcode)) +
  facet_wrap(~ factor(fertilizer, levels = c("0", "50", "100"))) +
  geom_line() +
  pcv_theme() +
  labs(
    y = expression("Cumulative Sum of Area" ~ "(cm"^2 ~ ")"),
    color = "Genotype"
  )

Longitudinal Modeling

Longitudinal modeling is the most comprehensive way to use the single value traits from a bellwether experiment. Longitudinal modeling can also be complicated compared to single timepoint analyses. Statistical complications including changes in variance, non-linearity, and autocorrelation present potential problems in analyses. To address these we recommend using hierarchical models. pcvr attempts to lower the barrier to entry for these models with helper functions for use with brms, nlme, nlrq, mgcv, and nls. Here we focus only on brms models, but there are tutorials that cover more options.

Growth Model Forms

Based on literature and observed trends there are 13 growth models that pcvr supports. The six most common of those are shown here using the growthSim function but the remaining options could be visualized in the same way.

Code
simdf <- growthSim("logistic",
  n = 20, t = 25,
  params = list("A" = c(200, 160), "B" = c(13, 11), "C" = c(3, 3.5))
)
l <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Logistic") +
  theme_minimal() +
  theme(legend.position = "none")

simdf <- growthSim("gompertz",
  n = 20, t = 25,
  params = list("A" = c(200, 160), "B" = c(13, 11), "C" = c(0.2, 0.25))
)
g <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Gompertz") +
  theme_minimal() +
  theme(legend.position = "none")

simdf <- growthSim("monomolecular",
  n = 20, t = 25,
  params = list("A" = c(200, 160), "B" = c(0.08, 0.1))
)
m <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Monomolecular") +
  theme_minimal() +
  theme(legend.position = "none")

simdf <- growthSim("exponential",
  n = 20, t = 25,
  params = list("A" = c(15, 20), "B" = c(0.095, 0.095))
)
e <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Exponential") +
  theme_minimal() +
  theme(legend.position = "none")

simdf <- growthSim("linear", n = 20, t = 25, params = list("A" = c(1.1, 0.95)))
ln <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Linear") +
  theme_minimal() +
  theme(legend.position = "none")

simdf <- growthSim("power law", n = 20, t = 25, params = list("A" = c(16, 11), "B" = c(0.75, 0.7)))
pl <- ggplot(simdf, aes(time, y, group = interaction(group, id))) +
  geom_line(aes(color = group)) +
  labs(title = "Power Law") +
  theme_minimal() +
  theme(legend.position = "none")

(l + g + m) / (e + ln + pl)

Typically at least one of these models will be a good fit to your bellwether data, with gompertz models being the most broadly useful so far. In this experiment the plants were not germinated before being added to the machine, so we might not see asymptotic size. Still, conceptually we know that these plants will stop growing in the near future so we might use a gompertz model in place of the exponential model that looks most like our loess trendlines.

Model setup

Coding a multilevel Bayesian model can be a difficult and time consuming process. Even with the greatly simplified syntax used by brms this can present a barrier to entry for some people who could benefit from using very robust models. To help get around this potential issue for the specific case of measuring growth over time pcvr includes several functions to work with brms, the first of which is growthSS, a self-starter helper function for use with brms::brm.

submodel options

For the purposes of this vignette we will only consider the default Student T model family; for examples of other model families (useful for count or circular data especially) see the Advanced Growth Modeling tutorial on github or the longitudinal growth vignette.

There are several ways to consider variance over time. By default almost all modeling assumes homoscedasticity, that is, constant variance across predictor variables (time, here). That assumption is very unrealistic in biological settings since all seeds/seedlings start from a very low area but grow differently through the experiment. The growthSS function can use any of the main growth model options as a model for distributional parameters (here, sigma). Splines or asymptotic models will often yield the best fit to your data’s variance.

Prior Distributions

An important part of Bayesian statistics is setting an appropriate prior. These represent your knowledge about the field and are used along with your collected data to yield results. Priors should generally be weak relative to your data, meaning that if your prior belief is wrong then your experiment can move the posterior distribution away from the prior in a meaningful way.

In growthSS priors can be specified as a brmsprior object (in which case it is used as is), a named list (names representing parameters), or a numeric vector, where values will be used to generate lognormal priors with a long right tail. Lognormal priors with long right tails are used because the values for our growth curves are strictly positive and the lognormal distribution is easily interpreted. The tail is a product of the variance, which is assumed to be 0.25 for simplicity and to ensure priors are wide, so only a location parameter needs to be provided. If a list is used then each element can be length 1, in which case every group uses the same prior, or a vector of the same length as unique(data$group), where group is your grouping variable from the form argument to growthSS. If a vector is used then a warning will be printed asking you to check that the assumed order of groups is correct. The growthSim function can be useful in thinking about what a reasonable prior distribution might be, although priors should not be picked by trying to get a great fit by eye to your collected data.

We can check the priors made by growthSS with the plotPrior function, which can take a list of priors or the growthSS output.

Code
priors <- list("A" = 130, "B" = 10, "C" = 0.2)
priorPlots <- plotPrior(priors)
priorPlots[[1]] / priorPlots[[2]] / priorPlots[[3]]

Looking at the prior distributions this way is useful, but for those still getting familiar with a given growth model the parameter values may not be very intuitive. To help with picking reasonable priors while learning the meaning of the model parameters, the plotPrior function can also simulate growth curves by making draws from the specified prior distributions. Here is an example of using plotPrior in this way to pick between possible sets of prior distributions for a gompertz model. For asymptotic distributions the prior on “A” is added to the y margin. For distributions with an inflection point the prior on “B” is shown in the x margin. Arbitrary numbers of priors can be compared in this manner, but more than two or three can be cluttered, so an iterative process is recommended if you are learning about your growth model.

Code
twoPriors <- list("A" = c(100, 130), "B" = c(6, 12), "C" = c(0.5, 0.25))
plotPrior(twoPriors, "gompertz", n = 100)[[1]]

Using growthSS

Now we’re ready to define the necessary variables in our data and use the growthSS function.

Code
sv_ag$group <- interaction(sv_ag$fertilizer, sv_ag$genotype)

The brms package is not always imported by pcvr (since dependencies is NA by default when installing from github), so before fitting models you may need to install that package. For details on installing brms and either rstan or cmdstanr (with cmdstanr being recommended), see those packages' documentation. Note that if you install pcvr from github with dependencies = TRUE then cmdstanr and brms will be installed and ready to use.

Code
library(brms)
library(cmdstanr)
cmdstanr::install_cmdstan()

Here our priors are informed by a general understanding of what we expect to see for a plant on the bellwether system. In general the example priors are weak relative to the data, as discussed above.

Code
ss <- growthSS(
  model = "gompertz", form = area_cm2 ~ DAS | barcode / group, sigma = "spline", df = sv_ag,
  start = list("A" = 130, "B" = 10, "C" = 0.5), type = "brms"
)

Now we have most of our model components in the ss object. Since we specified a gompertz model we have three parameters, the asymptote (A), the inflection point (B), and the growth rate (C). For other model options see ?pcvr::growthSim for details on the parameters.
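
It can be helpful to glance at what growthSS assembled before fitting. A quick sketch (the exact element names may differ slightly between pcvr versions):

Code
names(ss) # components growthSS assembled for brms::brm, expected to include the formula, priors, and data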

Before trying to fit the model it is generally a good idea to check one last plot of the data and make sure you have everything defined correctly.

Code
ggplot(sv_ag, aes(x = DAS, y = area_cm2, group = barcode, color = group)) +
  geom_line() +
  theme_minimal() +
  labs(
    y = expression("Area" ~ "(cm"^2 ~ ")"),
    color = "Genotype\nand Soil"
  )

This looks okay, there are no strange jumps in the data or glaring problems.

Running Models

The fitGrowth function takes the output from growthSS and in this case will call brms::brm. Any additional arguments to brms::brm can still be specified, a few examples of which are shown here.

Code
fit <- fitGrowth(ss,
  iter = 1000, cores = 2, chains = 2, backend = "cmdstanr",
  control = list(adapt_delta = 0.999, max_treedepth = 20)
) # options to increase performance

Check Model Fit

We can visualize credible intervals from a brms model and compare them to our growth trendlines to get an intuitive sense of how well the model fit. Note that since this vignette does not load brms these are only pictures of the output from brmPlot and brmViolin. The code is present to run locally if you have brms installed and choose to.

Code
brmPlot(fit, form = area_cm2 ~ DAS | barcode / group, df = ss$df) +
  labs(y = expression("Area" ~ "(cm"^2 ~ ")"))

Test Hypotheses

Now we probably have some ideas about what we want to test in our data. The brms::hypothesis function offers incredible flexibility to test all kinds of hypotheses. For consistency with other backends the testGrowth function calls brms::hypothesis to evaluate hypotheses. For some comparisons pcvr has a helper function called brmViolin to visualize posterior distributions and the posterior probability of some hypotheses associated with them.

Code
brmViolin(
  model = fit, params = NULL,
  hyp = "num/denom>1.05", compareX = c("0.B73", "50.B73", "100.B73"), againstY = "0.B73",
  group_sep = "[.]", groups_into = c("soil", "genotype"), x = "soil", facet = "genotype",
  returnData = FALSE
)

This shows that we have a posterior probability greater than 99 percent of an asymptotic size at least 5 percent higher with the 100 type soil when compared against the 0 type soil. Note that this data does not have any plants that reached asymptotic size, so the model uses the incomplete data to estimate where an asymptote would be. In a normal experiment the plants would be more mature but here the asymptote parameter is artificially inflated for the 100 soil treatment group due to their slower growth rate.

There are a lot of options for how to use this function and even more ways to use brms::hypothesis, so this example should not be seen as the only way to compare your models.
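
As a rough illustration of using brms::hypothesis directly, a single directional comparison might look like the sketch below. The parameter names here are hypothetical; check the fitted model's actual coefficient names (for example with variables(fit) after loading brms) before running anything like this.

Code
# hypothetical coefficient names - replace them with names taken from variables(fit)
brms::hypothesis(fit, "A_group100.B73 > A_group0.B73")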

Multi Value Traits

Working with multi value traits presents different statistical challenges than single value traits. Generally reading the data in as wide format makes for a significantly smaller object in memory, since long format duplicates lots of metadata (identifiers are repeated for each row, and every image can have hundreds of rows). Note that for many questions, even about color, it is not necessary to use the entire color histograms. Make sure that you have a good reason to use the complete color data before going down this particular path for too long. As an example, a very simple comparison of the circular mean of hue will show our treatment effect in this data.

Code
ggplot(sv_ag[sv_ag$DAS == 18, ], aes(
  x = fertilizer, y = hue_circular_mean_degrees,
  fill = as.character(fertilizer)
)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.05, size = 0.5) +
  scale_fill_manual(values = c(viridis::viridis(3, 1, 0.1)), breaks = c("0", "50", "100")) +
  pcv_theme() +
  theme(legend.position = "none") +
  facet_wrap(~genotype, scales = "free_x") +
  scale_x_discrete(limits = c("0", "50", "100")) +
  labs(y = "Hue Circular Mean (degrees)", x = "Soil and Genotype")

Code
hue_wide <- read.pcv(paste0(base_url2, "pcv4-multi-value-traits.csv"),
  mode = "wide", reader = "fread"
)
hue_wide$genotype <- substr(hue_wide$barcode, 3, 5)
hue_wide$genotype <- ifelse(hue_wide$genotype == "002", "B73",
  ifelse(hue_wide$genotype == "003", "W605S",
    ifelse(hue_wide$genotype == "004", "MM", "Mo17")
  )
)
hue_wide$fertilizer <- substr(hue_wide$barcode, 8, 8)
hue_wide$fertilizer <- ifelse(hue_wide$fertilizer == "A", "100",
  ifelse(hue_wide$fertilizer == "B", "50", "0")
)
hue_wide <- bw.time(hue_wide, timeCol = "timestamp", group = "barcode", plot = FALSE)

phenotypes <- colnames(hue_wide)[grepl("hue_frequencies", colnames(hue_wide))]
phenoForm <- paste0("cbind(", paste0(phenotypes, collapse = ", "), ")")
groupForm <- "DAS+barcode+genotype+fertilizer"
form <- as.formula(paste0(phenoForm, "~", groupForm))
hue_wide <- aggregate(form, data = hue_wide, mean, na.rm = TRUE)

Joyplots

Joyplots are a common way to look at lots of distributions. Here we check the hue histograms as joyplots using the 18th day and add a new fill for the hue colorspace.

Code
p <- pcv.joyplot(hue_wide[hue_wide$DAS == 18, ],
  index = "hue_frequencies",
  group = c("fertilizer", "genotype")
)
p + scale_fill_gradientn(colors = scales::hue_pal(l = 65)(360)) +
  scale_y_discrete(limits = c("0", "50", "100"))

We can also specify a y argument to show multiple days across several variables.

Code
p <- pcv.joyplot(hue_wide[hue_wide$DAS %in% c(5, 10, 15), ],
  index = "hue_frequencies", group = c("fertilizer", "genotype"),
  y = "DAS", id = NULL
)
p + scale_fill_gradientn(colors = scales::hue_pal(l = 65)(360)) +
  scale_y_discrete(limits = c("5", "10", "15"))

As mentioned previously the conjugate function can be used with wide multi-value data. To use multi value data the samples should be data frames or matrices representing color histograms. Here we compare our color histograms assuming a lognormal distribution and find that they are very similar in this parameterization.

Code
mo17_sample <- hue_wide[
  hue_wide$genotype == "Mo17" & hue_wide$DAS > 18 & hue_wide$fertilizer == 100,
  grepl("hue_freq", colnames(hue_wide))
]
b73_sample <- hue_wide[
  hue_wide$genotype == "B73" & hue_wide$DAS > 18 & hue_wide$fertilizer == 100,
  grepl("hue_freq", colnames(hue_wide))
]

hue_res_ln <- conjugate(
  s1 = mo17_sample, s2 = b73_sample, method = "lognormal",
  plot = TRUE, rope_range = c(-10, 10), hypothesis = "equal"
)

Ordination

Ordinations are another common way to look at multi value traits. The pcadf function runs ordinations and optionally returns the PCs with metadata.

Code
pcadf(hue_wide, cols = "hue_frequencies", color = "genotype", returnData = FALSE) +
  facet_wrap(~ factor(fertilizer, levels = c("0", "50", "100")))

Earth Mover’s Distance

Since color data is exported from plantCV as histogram data we can also use Earth Mover’s Distance (EMD) to compare images. Conceptually, EMD quantifies how much work it would take to turn one histogram into another. Here we do pairwise comparisons of all our rows and return a long dataframe of those distances. Note that even running several cores in parallel this can take a lot of time for larger datasets, since the number of comparisons can quickly become unwieldy. The output will also require more work to keep analyzing, so make sure this is what you want to be doing before using EMD to compare color histograms. If you are only interested in a change of the mean then this is probably not the best way to use your data.
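
As a toy illustration of the concept (separate from pcv.emd), the 1-D EMD between two histograms with equal total mass and unit-width bins reduces to the sum of absolute differences of their cumulative sums:

Code
h1 <- c(0, 2, 5, 2, 1, 0) # counts per bin for two small example histograms
h2 <- c(1, 1, 2, 3, 2, 1)
h1 <- h1 / sum(h1) # rescale so both histograms have total mass 1
h2 <- h2 / sum(h2)
sum(abs(cumsum(h1) - cumsum(h2))) # the "work" needed to morph one histogram into the other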

Here is a quick example of a situation where EMD makes a lot of sense. In this simulated data we have five generating distributions: normal, lognormal, bimodal, trimodal, and uniform. We could use Gaussian mixtures to characterize the multi-modal histograms, but that gets clunky when comparing against the unimodal or uniform distributions. The conjugate function would not work here since these distributions do not share a common parameterization. Instead, we can use EMD.

Code
set.seed(123)

simFreqs <- function(vec, group) {
  s1 <- hist(vec, breaks = seq(1, 181, 1), plot = FALSE)$counts
  s1d <- as.data.frame(cbind(data.frame(group), matrix(s1, nrow = 1)))
  colnames(s1d) <- c("group", paste0("sim_", 1:180))
  s1d
}

sim_df <- rbind(
  do.call(rbind, lapply(1:10, function(i) {
    simFreqs(rnorm(200, 50, 10), group = "normal")
  })),
  do.call(rbind, lapply(1:10, function(i) {
    simFreqs(rlnorm(200, log(30), 0.25), group = "lognormal")
  })),
  do.call(rbind, lapply(1:10, function(i) {
    simFreqs(c(rlnorm(125, log(15), 0.25), rnorm(75, 75, 5)), group = "bimodal")
  })),
  do.call(rbind, lapply(1:10, function(i) {
    simFreqs(c(rlnorm(100, log(15), 0.25), rnorm(50, 50, 5), rnorm(50, 90, 5)), group = "trimodal")
  })),
  do.call(rbind, lapply(1:10, function(i) {
    simFreqs(runif(200, 1, 180), group = "uniform")
  }))
)

sim_df_long <- as.data.frame(data.table::melt(data.table::as.data.table(sim_df), id.vars = "group"))
sim_df_long$bin <- as.numeric(sub("sim_", "", sim_df_long$variable))

ggplot(sim_df_long, aes(x = bin, y = value, fill = group), alpha = 0.25) +
  geom_col(position = "identity", show.legend = FALSE) +
  pcv_theme() +
  facet_wrap(~group)

Our plots show very different distributions, so we get EMD between our images and see that we do have some trends shown in the resulting heatmap.

Code
sim_emd <- pcv.emd(
  df = sim_df, cols = "sim_", reorder = c("group"),
  mat = FALSE, plot = TRUE, parallel = 1, raiseError = TRUE
)
## Estimated time of calculation is roughly 3.1 seconds using 1 cores in parallel.
Code
sim_emd$plot

Now we can filter edge strength during our network building step for EMD > 0.5, and plot our network.

Code
n <- pcv.net(sim_emd$data, filter = 0.5)
net.plot(n, fill = "group")

The distributions separate very well from each other, but we don’t actually see our uniform distribution here. That is because the uniform distribution does not have self-similar replicates and is also not very similar to the other distributions. If we change our filtering then we can even see which generating distributions are most similar to each other. Here we pass 0.5 as a string, which tells pcv.net to use the top 50 percent of EMD values instead of EMD values > 0.5.

Code
n <- pcv.net(sim_emd$data, filter = "0.5")
net.plot(n, fill = "group")

Just as we’d expect, our uniform distribution shows up now and is the most different. Now changing the edgeFilter in the net.plot function would let us fine tune this plot more to show finer distinctions between our other four generating distributions.

Here is an example of how we might use hue data.

Code
EMD <- pcv.emd(
  df = hue_wide[hue_wide$DAS %in% c(5, 12, 19), ], cols = "hue_frequencies",
  reorder = c("fertilizer", "genotype", "DAS"),
  mat = FALSE, plot = TRUE, parallel = 12, raiseError = TRUE
)

EMD can get very heavy with large datasets. For a recent lemnatech dataset using only the images from every 5th day there were \(6332^2\) = 40,094,224 pairwise EMD values. In long format that’s a 40 million row dataframe, which is unwieldy. To get around this problem we might decide to use the hue circular mean as a single value trait or to aggregate some of our data with the mv_ag function.

Starting with our complete hue data we have 1417 image histograms. The mv_ag function takes a group argument and randomly picks members of that group to have their histograms combined. Note that histograms are rescaled to sum to 1 before they are sampled. Here we return one example with 2 histograms kept per group and one example where each group is summarized into 1 histogram. If a group has as many or fewer images than the n_per_group argument then no aggregation is done for that group, but the data are still rescaled.

Code
hue_ag1 <- mv_ag(df = hue_wide, group = c("DAS", "genotype", "fertilizer"), n_per_group = 2)
dim(hue_ag1)
## [1] 455 183
Code
hue_ag2 <- mv_ag(hue_wide, group = c("DAS", "genotype", "fertilizer"), n_per_group = 1)
dim(hue_ag2)
## [1] 236 183

Network Analysis

As it stands our EMD data is potentially difficult to use for those unfamiliar with distance matrix based analysis. Here we represent our distances as a network to help use the results.

Code
set.seed(456)
net <- pcv.net(EMD$data, meta = c("fertilizer", "genotype", "DAS"), filter = 0.5)
Code
net.plot(net, fill = "DAS", shape = "fertilizer", size = 2)

This is a much more complicated network! Parsing biological meaning out of this would require more work than the first example with 5 very different distributions, but there is a pretty strong cluster in the upper right corner of plants with low fertilizer at the end of the experiment. In general, if you choose to use EMD, with or without networks, start small and consider what each step should mean conceptually for your experiment.

Conclusion

This vignette will be periodically updated as pcvr changes. Generally the tutorials are updated more frequently and can be a good resource. Once again if your goal or experiment has some set of questions or needs that are not met so far please consider making an issue on github to help the Data Science Core continue to work on this project in ways that will help the plantCV community.