This post provides R code to estimate growth chart LMS parameters1 given a table of measurements and percentile values (or Z-scores). The initial rationale was to explore problems with LMS parameterization in a publication describing growth charts for Noonan syndrome2, with the findings uploaded to MedRxiv3.
Load packages
Show the code
library(tidyverse)
library(knitr)
library(minpack.lm) # nlsLM() for robust nonlinear least squares
library(ggh4x) # facet_grid2(), for multiple panels with free scales
Load data
The source data was extracted from the supplemental materials from2 using Tabula and collected to a single datagrame.
Show the code
##### Load original published data tables #####
# `cappa_full.csv` contains: chart, age / units, gender, measure / units, M, S, L, 3 / 10 / 25 / 50 / 75 / 90 / 97 percentiles
<- read.csv("cappa_full.csv") |> as_tibble()
df_orig
|> select(measure, gender, age, M, S, L, p_3:p_97) |> head() |> knitr::kable() df_orig
measure | gender | age | M | S | L | p_3 | p_10 | p_25 | p_50 | p_75 | p_90 | p_97 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
height_velocity | f | 0.0 | 19.301 | 0.262 | 0.780 | 10.353 | 13.069 | 15.955 | 19.301 | 22.780 | 26.015 | 29.298 |
height_velocity | f | 0.2 | 18.601 | 0.264 | 0.781 | 9.923 | 12.557 | 15.355 | 18.601 | 21.976 | 25.114 | 28.298 |
height_velocity | f | 0.4 | 17.903 | 0.266 | 0.782 | 9.497 | 12.048 | 14.759 | 17.903 | 21.172 | 24.212 | 27.297 |
height_velocity | f | 0.6 | 17.207 | 0.267 | 0.784 | 9.077 | 11.544 | 14.166 | 17.207 | 20.370 | 23.311 | 26.296 |
height_velocity | f | 0.8 | 16.515 | 0.269 | 0.785 | 8.663 | 11.045 | 13.577 | 16.515 | 19.570 | 22.412 | 25.295 |
height_velocity | f | 1.0 | 15.827 | 0.271 | 0.786 | 8.254 | 10.551 | 12.994 | 15.828 | 18.775 | 21.516 | 24.297 |
There were problems with the LMS parameters, so the goal was to re-estimate them from the percentile values (3 / 10 / 25 / 50 / 75 / 90 / 97). The LMS parameterization relates the measurement value X with the parameters L, M, S, and Z as follows:
or
Prepare training data
Show the code
<- df_orig |> # wide to long format with published X at percentiles
df pivot_longer(
:p_97,
p_3names_to = "percentile",
names_prefix = "p_",
values_to = "X",
names_transform = as.numeric
|>
) mutate(Z = qnorm(percentile / 100))
<- df |>
df_train select(measure, gender, age, X, Z)
|> head(n = 14) |> knitr::kable() df_train
measure | gender | age | X | Z |
---|---|---|---|---|
height_velocity | f | 0.0 | 10.353 | -1.8807936 |
height_velocity | f | 0.0 | 13.069 | -1.2815516 |
height_velocity | f | 0.0 | 15.955 | -0.6744898 |
height_velocity | f | 0.0 | 19.301 | 0.0000000 |
height_velocity | f | 0.0 | 22.780 | 0.6744898 |
height_velocity | f | 0.0 | 26.015 | 1.2815516 |
height_velocity | f | 0.0 | 29.298 | 1.8807936 |
height_velocity | f | 0.2 | 9.923 | -1.8807936 |
height_velocity | f | 0.2 | 12.557 | -1.2815516 |
height_velocity | f | 0.2 | 15.355 | -0.6744898 |
height_velocity | f | 0.2 | 18.601 | 0.0000000 |
height_velocity | f | 0.2 | 21.976 | 0.6744898 |
height_velocity | f | 0.2 | 25.114 | 1.2815516 |
height_velocity | f | 0.2 | 28.298 | 1.8807936 |
Functions
The meat of the code is the estimate_LMS_parameters()
function which takes a dataframe including columns of X
and Z
, optionally groups (for example, by anthropometric measurement, sex, and age), and performs non-linear least squares regression via minpack.lm::nlsLM()
to estimate L
, M
, and S
for each group.
The estimation function could be improved to more explicitly handle the case where L = 0
, as the objective function converges to a different form as L
approaches 0, but setting starting values for L
that include values just above and below 0 seems to work. Because we are trying to estimate 3 parameters, at least 3 points per group are needed. In this dataset, we have 7 X / Z points per group.
##### Estimate_LMS_parameters #####
<- function(df, ...) {
estimate_LMS_parameters # takes a dataframe of columns Z and X and optional (...) grouping variables
# returns each group's estimated LMS parameters, as well as RMSE, R-squared,
# convergence, n, and method comment
#
# Requires: `tidyverse` and `minpack.lm` libraries
library(tidyverse)
library(minpack.lm) # nlsLM() for robust nonlinear least squares
<- 1e-6 # used for L ≈ 0 and for when assessing `best_obj`
tolerance <- 0.01 # max RMSE for regression to be considered "converged"
rmse_tolerance
# Vectorized model function for optimization
<- function(params, Z) {
LMS_model_vec <- params[1]
L <- params[2]
M <- params[3]
S if (abs(L) < tolerance) { # When L ≈ 0, use exponential form
return(M * exp(S * Z))
else { # Normal case
} return(M * (1 + L*S*Z)^(1/L))
}
}
# Function to estimate parameters for a single group
<- function(group_data) {
estimate_single_group <- group_data$Z
Z <- group_data$X
X <- nrow(group_data)
n
if (n < 3) { # Need at least 3 points to estimate 3 parameters
return(data.frame(
L = NA, M = NA, S = NA, rmse = NA,
r_squared = NA, converged = FALSE,
n_obs = n, method = "insufficient_data"))
}
# Test discontinuity at L = 0 (normal distribution with no skew)
#
# tibble(
# measure = "test", age = 0, gender = "m",
# percentile = c(3, 10, 50, 90, 97)
# ) |>
# mutate(
# Z = qnorm(percentile / 100),
# X = 50 * exp(1 * Z) # formula for L = 0; M = 50, S = 1
# ) |>
# estimate_LMS_parameters(measure, age, gender)
#
# L_starts <- c(-5, 5) # fails to converge
# L_starts <- c(-5, -0.01, 0.01, 5) # converges to L = 1.165e-06
# Sets of starting values
<- c(mean(X), median(X)) # estimating the median
M_starts <- c(-5, -0.01, 0.01, 5) # handle potential L = 0
L_starts <- c(0.01, 1) # always positive
S_starts
<- NULL
best_params <- Inf
best_obj <- "failed"
best_method
# Try different starting combinations
for (M_start in M_starts) {
for (L_start in L_starts) {
for (S_start in S_starts) {
tryCatch({ # Using minpack.lm::nlsLM() robust nonlinear least squares
# objective function NOT normalized to # of points
# RMSE = sqrt(obj_va / n)
if (is.null(best_params) || best_obj > tolerance) {
<- nlsLM(
fit # X ~ ifelse( # this didn't work
# abs(L) < tolerance,
# M * exp(S * Z),
# M * (1 + L * S * Z)^(1/L)
# ),
~ M * (1 + L * S * Z)^(1/L), # will fail if L is exactly 0
X data = data.frame(X = X, Z = Z),
start = list(L = L_start, M = M_start, S = S_start),
control = nls.lm.control(maxiter = 1000))
<- coef(fit)
params <- sum(residuals(fit)^2)
obj_val if (obj_val < best_obj) {
<- obj_val
best_obj <- params
best_params <- "nlsLM"
best_method
}
}error = function(e) {
}, # Unsuccessful fitting --> continue
})
}
}
}
# Check if found a solution
if (is.null(best_params)) {
return(data.frame(
L = NA, M = NA, S = NA,
rmse = NA, r_squared = NA, converged = FALSE,
n_obs = n, method = "all_failed"))
}
# Calculate fit statistics
<- LMS_model_vec(best_params, Z)
predicted <- X - predicted
residuals <- sqrt(mean(residuals^2))
rmse
# R-squared
<- sum((X - mean(X))^2)
tss <- sum(residuals^2)
rss <- 1 - rss/tss
r_squared
# Check convergence quality
<- (rmse < rmse_tolerance) && # OK criteria
converged all(is.finite(best_params)) && # pretty unrestrictive criteria
> 0) # pretty unrestrictive criteria
(r_squared
return(data.frame(
L = best_params[1],
M = best_params[2],
S = best_params[3],
rmse = rmse,
r_squared = r_squared,
converged = converged,
n_obs = n,
method = best_method
))
}
# Apply to each group
<- rlang::ensyms(...)
grouping_syms <- df
results if (length(grouping_syms) > 0) {
<- results |> group_by(!!!grouping_syms)
results
}<- results |>
results group_modify(~ estimate_single_group(.x)) %>%
ungroup()
return(results)
}
(Also included is a vectorized helper function from the peditools
package to calculate X from L
, M
, S
, and Z
, which could also probably be improved by not using strict equality of L
with 0.)
Show the code
##### z_lms_to_x #####
# From `peditools`
<- function (Z, L, M, S) {
z_lms_to_x <- function(...) {
expand.arguments <- list(...)
dotList <- max(sapply(dotList, length))
max.length lapply(dotList, rep, length = max.length)
}<- expand.arguments(Z, L, M, S)
temp = temp[[1]]
Z = temp[[2]]
L = temp[[3]]
M = temp[[4]]
S ifelse(L != 0, M * (1 + L * S * Z)^(1/L), M * exp(S * Z))
}
Use the training data to estimate LMS parameters
The full dataframe df_train
includes 4 anthropometric measures, 2 genders, and 100 ages, so LMS parameters will be estimated from 800 separate groups of seven X / Z pairs (one for each of the 7 percentiles (3 / 10 / 25 / 50 / 75 / 90 / 97).
Show the code
<- estimate_LMS_parameters(df_train, measure, gender, age) # group by measure, gender, and age
df_est
|> head() |> knitr::kable() df_est
measure | gender | age | L | M | S | rmse | r_squared | converged | n_obs | method |
---|---|---|---|---|---|---|---|---|---|---|
bmi | f | 0.0 | 0.0204592 | 14.14973 | 0.0923936 | 0.0001710 | 1 | TRUE | 7 | nlsLM |
bmi | f | 0.2 | 0.0124989 | 14.33114 | 0.0928675 | 0.0001556 | 1 | TRUE | 7 | nlsLM |
bmi | f | 0.4 | 0.0053028 | 14.50763 | 0.0933272 | 0.0002511 | 1 | TRUE | 7 | nlsLM |
bmi | f | 0.6 | 0.0007589 | 14.67681 | 0.0937863 | 0.0002292 | 1 | TRUE | 7 | nlsLM |
bmi | f | 0.8 | -0.0076834 | 14.83683 | 0.0942656 | 0.0001313 | 1 | TRUE | 7 | nlsLM |
bmi | f | 1.0 | -0.0158573 | 14.98664 | 0.0947315 | 0.0002322 | 1 | TRUE | 7 | nlsLM |
The LMS estimation results include some per-group statistics (like RMSE, R-squared, and number of observations). The RMSE values are quite good but not perfect, as they should be if the percentile values were all calculated correctly from source LMS values.This imperfect convergence reflects problems with the source data.
Show the code
summary(df_est |> mutate(measure = as.factor(measure), gender = as.factor(gender), method = as.factor(method)))
measure gender age L
bmi :200 f:400 Min. : 0.00 Min. :-3.8136
height :200 m:400 1st Qu.: 4.95 1st Qu.:-1.1473
height_velocity:200 Median : 9.90 Median :-0.1807
weight :200 Mean : 9.90 Mean :-0.3799
3rd Qu.:14.85 3rd Qu.: 0.6690
Max. :19.80 Max. : 2.6379
M S rmse r_squared
Min. : 0.8091 Min. :0.04280 Min. :3.656e-05 Min. :0.9996
1st Qu.: 11.2486 1st Qu.:0.09131 1st Qu.:1.768e-04 1st Qu.:1.0000
Median : 17.4175 Median :0.14562 Median :2.264e-04 Median :1.0000
Mean : 42.7057 Mean :0.17739 Mean :4.009e-03 Mean :1.0000
3rd Qu.: 54.2090 3rd Qu.:0.22781 3rd Qu.:2.879e-04 3rd Qu.:1.0000
Max. :166.2587 Max. :0.48378 Max. :1.325e-01 Max. :1.0000
converged n_obs method
Mode :logical Min. :7 nlsLM:800
FALSE:53 1st Qu.:7
TRUE :747 Median :7
Mean :7
3rd Qu.:7
Max. :7
Compare reported X values with those calculated from re-estimated LMS values
The newly estimated LMS values were used to recalculate the X values. (In the technical brief, the problems exhibited by the published LMS values were also shown.)
Show the code
# Functionally assess estimatedLMS values
<- df_est |>
df_assess transmute(measure, gender, age, estL = L, estM = M, estS = S) |>
left_join(
|> select(measure, gender, age, L, M, S, everything()),
df_orig by = c("measure", "gender", "age")
|>
) mutate(
deltaL = estL - L, deltaM = estM - M, deltaS = estS - S
|>
) pivot_longer(p_3:p_97, names_to = "percentile", names_prefix = "p_", values_to = "X") |>
mutate(
# Calculate X from original and re-estimated LMS values
percentile = as.numeric(percentile),
calcX = z_lms_to_x(qnorm(percentile / 100.0), L, M, S), # using the originally published LMS values
estX = z_lms_to_x(qnorm(percentile / 100.0), estL, estM, estS)
|>
) select(
chart, age, age_units, gender, measure, measure_units, L, M, S, estL, estM, estS, deltaL, deltaM, deltaS, percentile, X, calcX, estX )
Overlaying the reported X values with those calculated from the re-estimated LMS values show near-perfect alignment. (Each panel is reproduced in larger size below, to better visualize the minor imperfections present from the flawed original percentile measurement data.)
Show the code
# Plot X, calcX, and estX
<- df_assess |>
g mutate(percentile = factor(percentile, ordered = TRUE)) |>
ggplot(aes(age, X, group = percentile)) +
geom_line(size = 0.2) +
# geom_point(aes(age, calcX), color = "red", size = 0.4, alpha = 0.5) +
geom_point(aes(age, estX), color = "blue", size = 0.3, alpha = 0.2) +
facet_grid2(gender ~ measure, scales = "free_y", independent = "y") +
theme_bw() +
labs(
x = "Age (years)",
y = "X measurement"
) g
So, that’s how to regenerate LMS parameters from tables of percentile values. Theoretically, if the table of percentiles were correctly generated directly from LMS values, 3 values per group would be sufficient to perfectly recreate the LMS parameters. In that case, it’d probably best to use 10 / 50 / 90 percentile values, to avoid the decreased precision at more extreme Z scores.
Individual plots of reported vs calculated X values from re-estimated LMS
Show the code
for (m in c("bmi", "height", "height_velocity", "weight")) {
{# for (g in c("f", "m")) {
<- df_assess |>
g # filter(measure == m, gender == g) |>
filter(measure == m) |>
mutate(percentile = factor(percentile, ordered = TRUE)) |>
ggplot(aes(age, X, group = percentile)) +
geom_line() +
# geom_point(aes(age, calcX), color = "red", size = 0.4, alpha = 0.5) +
geom_point(aes(age, estX), color = "blue", size = 0.4, alpha = 0.5) +
theme_bw() +
labs(
x = "Age (years)",
y = "X measurement"
+
) labs(
# title = paste0(m, " - ", g)
title = paste0(m)
+
) facet_wrap(~gender)
print(g)
} }