Test-retest reliability and short-term variability of quantitative light reflex pupillometry in a mixed memory clinic cohort

Background: Quantitative light reflex pupillometry (qLRP) may be a promising digital biomarker in neurodegenerative diseases such as Alzheimer's disease (AD), as neuropathological changes have been found in the midbrain structures governing the light reflex. Studies investigating test-retest reliability and short-term, intra-subject variability of qLRP in these patient groups are missing. Our objective was therefore to investigate the test-retest reliability and short-term, intra-subject variability of qLRP in a memory clinic setting, where patients with neurodegenerative disease are frequently evaluated.
Methods: Test-retest reliability study. We recruited patients from a tertiary memory clinic, and qLRP was carried out at a baseline visit and then repeated on day 3-14 and on day 21-35 using a hand-held pupillometer. We evaluated the test-retest reliability of qLRP by calculating intraclass correlation coefficients (ICCs) and intra-subject, short-term variability by fitting linear mixed models. We compared ICCs for subgroups based on age, sex, disease severity (MCI vs. mild dementia), AD diagnosis, and amount of neurodegeneration (cerebrospinal fluid total tau levels).
Results: In total, 40 patients (mean age 72 years, 15 female, 22 with mild dementia) were included in the study. We found good-excellent reliability (ICC range 0.86-0.93) for most qLRP parameters. qLRP parameters exhibited limited intra-subject variability, and we found no large sources of variability when examining subgroups.
Conclusion: qLRP was found to have acceptable test-retest reliability, and the study results pave the way for research using longitudinal or cross-sectional measurements to assess the construct in identifying and prognosticating neurodegenerative diseases.


Introduction
Quantitative light reflex pupillometry (qLRP) has been proposed as a novel marker in neurodegenerative diseases such as Alzheimer's disease (AD) [1], but studies investigating the test-retest reliability of the method in these patient groups are missing [2,3]. Also, in cross-sectional studies, the aspect of intra-subject, short-term variability may be relevant. Test-retest reliability is defined as the ability of a measurement method to quantify the same property under replicated conditions [4]. The intra-subject variability is defined as the amount that a given measured parameter varies for each repeated measurement in a single individual [5].
In AD in particular, prognostic and disease-tracking markers as well as case-finding tools are urgently needed to accommodate the advent of disease-modifying anti-amyloid therapies, which have recently been approved for use [6]. Due to its low-cost, non-invasive nature, qLRP could be a promising digital disease-tracking and prognostic biomarker and aid in case-finding, but the test-retest reliability and the intra-subject, short-term variability of qLRP when used in neurodegenerative diseases are essentially not known. Light reflex pupillometry is the objective measurement of the pupillary response (contraction of the m. constrictor pupillae) to external light stimuli. It can be carried out bedside using a hand-held point-and-shoot pupillometer [7]. The reflex arc is, with modulation by cortical areas such as the frontal eye field, governed by midbrain structures. Most important for the reflex is the Edinger-Westphal nucleus, which has been shown to undergo neurodegeneration in patients with AD [8-10]. qLRP has been investigated in recent studies [11,12] as a marker for AD, which describe that several metrics of qLRP are affected in AD, possibly as a result of disrupted autonomic function [1]. qLRP has shown excellent test-retest reliability in healthy individuals [7] and in critically ill patients [13], but no studies exist on the test-retest reliability and short-term, intra-subject variability in patients suspected of neurodegenerative disease. In addition, the underlying factors that might give rise to a posited increased variability are not known.
Patients with neurodegenerative disease exhibit progressive neuronal dysfunction [14] and thus may exhibit higher degrees of variability in qLRP parameters. The specific presence of neurodegeneration as reflected by neurodegenerative markers, such as tau protein measured in cerebrospinal fluid, could for example be related to the intra-subject variability, but this has not been investigated previously. Further, patients with dementia are more often treated with antipsychotic or antidepressant medication, which could influence the pupillary state through modulation of the autonomic nervous system and thereby qLRP parameters [15,16]. Therefore, studies that quantify intra-subject variability in patients undergoing diagnostic evaluation for neurodegenerative disease are needed before the introduction of longitudinal measurements to clinical practice in individual patients.
In the present study, we examined patients in a memory clinic who were undergoing diagnostic evaluation for neurodegenerative disease at serial visits over a short period to investigate the test-retest reliability and intra-subject variability of qLRP. We hypothesized that qLRP could be reliably measured and that measurements would exhibit limited short-term, intra-subject variability.

Study design
Test-retest reliability study.

Participants
Patients referred to a tertiary memory clinic for diagnostic evaluation for a possible neurodegenerative disorder were consecutively recruited with the following inclusion criteria: undergoing lumbar puncture due to suspicion of neurodegenerative disease, a Mini-Mental State Examination (MMSE) total score > 20, and ability to cooperate with the qLRP procedure. We applied the following exclusion criteria: severe bilateral ophthalmological disorders (severe untreated cataract, glaucoma, or similar), a history of stroke within the last 3 months, previous or current major psychiatric conditions (schizophrenia, bipolar affective disorder, or psychosis), alcohol or substance abuse within the last two years, and participation in intervention studies. The inclusion period was January 2022 to May 2023. Patients included in the study had the following diagnoses, which were established using the relevant diagnostic criteria: MCI or dementia due to AD [17,18], dementia with Lewy bodies [19], frontotemporal dementia [20], cerebral amyloid angiopathy [21], and vascular dementia [22]. The criteria for syndromes included mild cognitive impairment (MCI), using the International Working Group criteria [23] or the National Institute on Aging-Alzheimer's Association (NIA-AA) criteria [17], and the NIA-AA dementia criteria [24]. We did not estimate an a priori sample size due to the explorative nature of the study. The study was registered at clinicaltrials.gov (NCT05175664). Permission from the regional ethics committee was acquired before the commencement of the study (H-21044863). We followed the tenets of the 1975 Helsinki Declaration, and written informed consent was obtained from all patients.

Quantitative light reflex pupillometry
Patients were recruited on the day of lumbar puncture scheduled as part of diagnostic work-up due to suspicion of neurodegenerative disease. After a successful lumbar puncture followed by 30 min of rest, the light reflex was measured at this visit (referred to as baseline) in an examination room without disturbances (between 09:00 and 15:30). This procedure was repeated on day 3-14 after lumbar puncture (visit 2) and again on day 21-35 (visit 3) under similar conditions (see study flow chart, Fig. 1), but not always by the same examiner. The examiner was not blinded to previous measurements.
Ophthalmoscopy was performed before the qLRP procedure to check for opacities of the lens and vitreous body. Approximately 5 min of adaptation to ambient light was provided afterward. Before the procedure, ambient light intensity was measured using the Light Meter App for iPhone. Direct sunlight was avoided, and room lighting was calibrated to normal ambient light conditions at 300-1500 lx [25], simulating real-world conditions.
We used the PLR-3000 (NeurOptics®) Pupillometer system to measure the light reflex. The PLR-3000 Pupillometer is a research-grade, hand-held unit that can wirelessly transfer measurements and raw data in Excel format. It uses a proprietary algorithm to calculate various pupillometric parameters (see Fig. 2 for an explanation). The settings for the pupillometer were: Positive Pulse Stimulus, 180 μW light intensity, 0.84-s stimulus duration, stimulus onset after 1 s, measurement duration of 10 s, and no background light. The pupillometer tracks the pupillary size using an infrared camera recording at a 30 Hz frame rate, and it is calibrated by the manufacturer. The device was used according to the manufacturer's instructions. In short, the patients were seated and instructed to stare at a fixed point on a wall behind the examiner. If patients had difficulty keeping the examined eye open, the examiner kept the eye open manually during the recording. The right eye was measured first, and a total of three acceptable measurements (without excessive blinking or artifacts) were carried out before measuring the contralateral eye in the same manner.

Pre-processing of data
The summarized and raw pupillometric data were transferred in Excel format to a computer and imported into RStudio, where low-quality recordings (flagged by the pupillometer) were visually inspected using the clintools package function dilations, which was adapted for this purpose. If blinks were present in the pre-stimulus and contraction phases, the recording was discarded. The pupillometer does not calculate dilation velocity and T75 when blinking occurs during the recording, and as such some patients in whom blinking could not be avoided had measurements missing for these variables (N = 2 observations). The T75 parameter is dependent upon the pupil returning to 75% of the baseline value, and in cases where this did not occur the parameter could not be calculated (N = 3 observations). We did not interpolate pupillary tracings for blinks or artifacts and did not calculate derivatives of qLRP parameters, and thus only pupillometric parameters provided by the PLR-3000 Pupillometer's proprietary algorithm were used for analysis: baseline pre-stimulus pupil diameter (mm), peak constriction pupil diameter (mm), delta change between baseline and peak pupil diameter (%), latency (msec), average constriction velocity (mm/s), maximum constriction velocity (mm/s), average dilation velocity (mm/s), and time to reach 75% of the baseline value after peak constriction (T75) (sec). A mean of up to three acceptable measurements was used for analysis. The right eye measurements were used for analysis, as a significant effect of the number of trials was present and some qLRP parameters were shown to vary between eyes (see Supplementary material). In cases where significant monocular disease was present (one patient had congenital unilateral blindness and one patient had unilateral central vein thrombosis), the left eye measurements were used.
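The averaging rule described above (a mean of up to three acceptable measurements per visit) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `visit_mean` and the convention of marking a discarded recording with `None` are assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence


def visit_mean(trials: Sequence[Optional[float]], max_trials: int = 3) -> Optional[float]:
    """Average the first `max_trials` acceptable trials of one parameter at one visit.

    A discarded recording (e.g., blink during the contraction phase) is
    represented by None. This representation is an assumption of the sketch.
    """
    accepted = [t for t in trials if t is not None][:max_trials]
    # No acceptable trial means the visit value is missing for this parameter.
    return mean(accepted) if accepted else None
```

With this rule, a visit with two acceptable trials still yields a value, whereas a visit with none is treated as missing and drops out of complete-case analyses.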

Cognitive testing and demographic data
The Mini-Mental State Examination [26] assesses global cognitive function and was administered by a trained nurse at an initial evaluation visit to the memory clinic. A history of ophthalmological conditions, subjective visual acuity, medications, and eye surgery was recorded. Visual acuity was not measured objectively. Demographic data, including age, sex, educational level, information on medication use, and eye surgery, were extracted from medical files and entered into an electronic database (REDCap, Vancouver).

Cerebrospinal fluid
We measured total tau in the cerebrospinal fluid (CSF) as a marker of neurodegeneration to investigate whether elevated levels were associated with poorer test-retest reliability of qLRP. CSF was obtained by lumbar puncture, which was performed due to suspicion of a neurodegenerative disorder, and CSF was handled according to standard operating procedures. Total tau was measured in CSF using a commercially available enzyme-linked immunosorbent assay (ELISA) (Innotest, Fujirebio, Ghent) as part of routine diagnostic workup at the hospital lab. The assay used by our hospital laboratory changed during the study period (after 1/7-2022) to a sandwich electrochemiluminescence immunoassay (Roche, Cobas 8000). We used the reference cutoffs of our hospital lab before and after the measurement method changed (before 1/7-2022: 400 ng/L; after 1/7-2022: 250 ng/L) to dichotomize groups with normal and high total tau.
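The date-dependent dichotomization can be expressed as a small rule. The sketch below is illustrative only: it assumes that "1/7-2022" denotes 1 July 2022, that samples taken on the change date itself fall under the new assay, and that "high" means strictly above the cutoff. The names `ASSAY_CHANGE` and `tau_is_high` are hypothetical.

```python
from datetime import date

# Assumption: "1/7-2022" is read as day/month-year, i.e., 1 July 2022.
ASSAY_CHANGE = date(2022, 7, 1)
CUTOFF_BEFORE_NG_L = 400.0  # reference cutoff before the assay change
CUTOFF_AFTER_NG_L = 250.0   # reference cutoff after the assay change


def tau_is_high(total_tau_ng_l: float, sampled_on: date) -> bool:
    """Classify CSF total tau as high using the cutoff valid on the sampling date.

    Assumption: a sample taken on the change date uses the new cutoff.
    """
    cutoff = CUTOFF_BEFORE_NG_L if sampled_on < ASSAY_CHANGE else CUTOFF_AFTER_NG_L
    return total_tau_ng_l > cutoff
```

The same concentration can therefore fall in different groups depending on the assay in use, which is why the cutoff is tied to the sampling date rather than applied uniformly.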

Statistical analysis
We present summary statistics of variables according to their distribution (normal or non-normal) and present them as means with corresponding standard deviations and medians and ranges where appropriate. We calculated Spearman's rho for correlations between all pupillometric and demographic variables and constructed a correlogram using the function corrplot in the ggcorrplot package in R. We fitted linear mixed models (LMMs) using the lme4 package and the function lme, assuming an unstructured covariance pattern (only one missing measurement). To test for the effect of the number of trials, we fitted separate univariate models with patients as a random effect and trial as a fixed effect, using each pupillometric variable measured at the baseline visit for up to three acceptable trials as the outcome. We extracted fixed effect estimates from the models and calculated 95% confidence intervals (95% CIs) using the confint function. To investigate the inter-visit, intra-subject variability, we fitted LMMs with patients and visits as random effects and visits as a fixed effect. The residuals for the random effects from separate LMMs including each pupillometric parameter were used to derive a proposed clinically significant change representing the standard deviation of the residuals, reflecting the intra-subject, short-term variability. Further, these same LMMs were fitted with adjustment for baseline pupil diameter included as a log-transformed fixed effect, as this measure is tightly correlated with the other variables and has been shown to exhibit a logarithmic relationship with constriction velocity [7,27,28]. Model assumptions were not checked for individual LMMs, as the models were assumed to be robust to the non-normality of variables due to few missing data [29].
To investigate the test-retest reliability, we calculated estimates of intraclass correlation coefficients (ICCs) from a two-way agreement model (McGraw and Wong convention ICC (A,3)) [30], considering complete cases only (acceptable measurements for all three visits), for all pupillometric parameters with corresponding 95% CIs, using the irr package and the function icc for average measurements, as we used the mean of three measurements at each visit. For ICCs calculated based on single measurements, the ICC (A,1) model was used. To investigate possible sources of variability, ICCs were also calculated for subgroups dichotomized according to the median of age (> and < 72 years), sex (male vs. female), disease severity (MCI vs. mild dementia), total tau concentration in CSF (> 400 ng/L before 1/7-2022 and > 250 ng/L after this date), diagnosis of AD (vs. non-AD), and antidepressant medication use (yes vs. no), and the 95% CIs for the ICC estimates were compared, with an overlap indicating no difference between groups. The ICC estimates were evaluated according to the intervals provided by Koo & Li [4]. We fitted univariate linear regressions for the relationship between ambient light intensity measured in lux and all pupillometric parameters measured at baseline separately. Linear regression model assumptions were checked by inspection of Q-Q and scale-location plots. If assumptions were violated, transformation of the data was tried, and a Spearman's rank correlation coefficient was calculated instead if no non-linear transformation fitted the data better. R (version 4.2.2) was used for all analyses [31]. A p-value < 0.05 was considered significant, and only two-tailed tests were used. In the supplementary material, a full read-out of the results from R with the source code used to produce the output is provided.
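For readers unfamiliar with the two-way agreement ICC, the point estimates can be sketched from two-way ANOVA mean squares. The Python function below is an illustrative reimplementation of the McGraw and Wong ICC(A,1) and ICC(A,k) formulas, not the irr::icc code used in the study. The function name `icc_agreement` is hypothetical, and the input is assumed to be a complete patients-by-visits table (no missing values). Confidence intervals are not computed here.

```python
from typing import List, Tuple


def icc_agreement(data: List[List[float]]) -> Tuple[float, float]:
    """Return (ICC(A,1), ICC(A,k)) for an n-subjects x k-visits table.

    Assumes complete cases: every row has exactly k values.
    """
    n = len(data)
    k = len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), visits (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-visit mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average
```

Because the agreement form keeps the between-visit mean square in the denominator, a systematic shift between visits lowers the ICC, which is the relevant behavior when absolute values are to be compared across serial measurements.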

Results
A total of 47 patients were recruited, of whom 40 completed all visits, comprising the analyzed cohort (see Fig. 1 for a study flow chart).
The mean age of patients was 71.6 years (range 60.8-81.7), and there was a slight overrepresentation of males (62.5%) (see Table 1 for baseline characteristics).
The median MMSE score was 27. The patients had a wide range of diagnoses, and > 50% of the subjects were diagnosed with AD. Most baseline pupillometric variables, except latency, were moderately to strongly positively correlated (see Fig. 2 and Fig. 3), while educational level was weakly positively correlated with average constriction velocity, although this correlation was not found when we examined the full dataset (i.e., with all available measurements, see Supplementary Figs. 1 and 2). No further correlations were found between pupillometric parameters and demographic variables.

Test-retest reliability and short-term variability of qLRP
We examined the test-retest reliability of measurements across visits by calculating ICCs based on the means of measurements from each visit. These results are shown in Table 2.
We generally found good-excellent reliability of measurements (ICC range 0.86-0.93) for most pupillometric variables, although T75 showed moderate reliability (ICC 0.56). ICCs based on the first measurements from each visit showed moderate-good reliability (see Supplementary Table 2). We further explored the inter-visit variability by fitting LMMs for all variables with visits and patients as random effects. The residuals of these models indicate the amount of variation that could occur randomly, and we have defined a clinically significant change for each variable based on the standard deviation of these residuals, reflecting the intra-subject, short-term variability (Table 2). This value can be used to indicate whether an individual change (either due to pharmacological influence or neurodegenerative disease processes) could be ascribed to the actual intervention or disease process and not the intra-subject, short-term variability. In Table 3 we also present mean values for all pupillometric measurements at each visit, and Fig. 4 shows spaghetti plots of all measurements for all patients for a subset of pupillometric variables (see Supplementary Fig. 3 for the spaghetti plots of the remaining variables). As can be seen from the spaghetti plots in Fig. 4, some individual observations for T75 were far from the prior or subsequent observations.
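To make the idea of a residual-based threshold concrete, the sketch below computes a residual standard deviation after removing patient and visit means from a complete patients-by-visits table, and then checks an observed change against it. This is a simplified two-way ANOVA analogue and not the LMM procedure used in the study; the function names are hypothetical, and the rule "a change larger in magnitude than the residual SD" is one reading of how the proposed value might be applied.

```python
from math import sqrt
from typing import List


def within_subject_sd(data: List[List[float]]) -> float:
    """Residual SD of an n-patients x k-visits table after removing
    patient and visit means (complete data assumed)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_resid = sum(
        (data[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    return sqrt(ss_resid / ((n - 1) * (k - 1)))


def exceeds_variability(change: float, residual_sd: float) -> bool:
    """True if an observed change is larger in magnitude than the residual SD."""
    return abs(change) > residual_sd
```

A change that does not exceed the threshold cannot be distinguished from ordinary visit-to-visit fluctuation in the same patient under this simplified reading.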

Possible sources of variability
To further explore whether the reliability of qLRP measurements could be influenced by test-day conditions such as ambient lighting and the number of trials, we investigated the association between these variables and qLRP parameters. We found a significant association between the number of trials and baseline pupil diameter, peak constriction diameter, latency, and average constriction velocity, although the effects were small (see Supplementary Table 3 and Supplementary Figs. 4 and 5). We found an association between ambient light and T75 (P = 0.01), and a trending association between ambient light intensity and relative change in pupillary size (P = 0.06) and dilation velocity (P = 0.07), although assumptions for the linear regression model for T75 were violated due to non-linearity and heteroscedasticity (see Supplementary Table 4). Log transformation of ambient light and T75 did not improve the model fit (data not shown). A statistically significant negative correlation (Spearman's rho = −0.36, p = 0.03) between ambient light and T75 was found.
The results of models with adjustment for baseline pupil diameter are shown in Supplementary Table 5. The adjustment for baseline pupil diameter reduced the residuals of some of the models, indicating that the parameter could explain part of the inter-visit variability, mainly for the qLRP parameters peak constriction velocity, delta, and constriction velocity.
Note: The proposed clinically significant change is derived from residuals of linear mixed-effects models. Patients and visits are included as random effects. The ICCs are calculated from the means of the first three acceptable qLRP measurements on complete cases (acceptable qLRP measurements at all three visits). a N = 39 patients. b N = 38 patients. c N = 37 patients.

Sub-group analyses
We investigated whether certain groups were more prone to exhibit poorer test-retest reliability for qLRP measurements, possibly reflecting higher intra-subject, short-term variability. We investigated sex (male vs. female), disease severity (MCI vs. mild dementia), age groups (< and > 72 years), normal vs. high CSF total tau, AD vs. non-AD, as well as antidepressant medicine use (yes vs. no). None of these subgroups had significantly different ICC values for any qLRP parameter (see Supplementary Material).
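The subgroup comparison rests on whether the 95% CIs of two ICC estimates overlap. A minimal sketch of that check is given below; the function name `intervals_overlap` and the example interval values in the comments are hypothetical and are not results from the study.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound) of a 95% CI


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical example: subgroup CIs of (0.80, 0.95) and (0.70, 0.85) overlap,
# so the subgroups would be treated as not differing under this criterion.
```

Overlap of intervals is a coarse screen rather than a formal test of a difference between two ICCs, so it is best read as showing no evidence of a large difference.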

Feasibility and data quality
The procedure was generally well tolerated with no adverse events. One patient could not cooperate due to discomfort associated with qLRP. For one patient, no acceptable measurements could be obtained at the last visit due to excessive blinking.

Discussion
In this test-retest study, we investigated the reliability and intra-subject, short-term variability of quantitative light reflex pupillometry in a memory clinic setting. We generally found good-excellent reliability and limited intra-subject, short-term variability for most qLRP parameters. We found that ambient light possibly influenced T75, but this could not fully explain the poor test-retest reliability of this particular qLRP parameter. Also, a small effect of the number of trials was found for some pupillometric variables, which could account for some intra-subject variability. Otherwise, we found no inherent sources of intra-subject, short-term variability pertaining to certain sub-groups of patients with regard to sex, age, disease severity, or antidepressant medication use.
qLRP has previously shown a high degree of reliability in both healthy individuals [7] and critically ill patients [13]. However, this study is, to our knowledge, the first to investigate the test-retest reliability and intra-subject, short-term variability of qLRP in a memory clinic setting. Our results corroborate these findings and extend the generalizability to include cohorts of patients with cognitive dysfunction and varying degrees of neurodegeneration.
We found that patients generally tolerated the procedure well, with no adverse events registered, and only one patient could not cooperate due to discomfort. The data loss was limited to a single visit for one patient for a few qLRP parameters (dilation velocity and T75). This indicates that the feasibility of applying this procedure in a memory clinic is high, although the study was not designed to examine this aspect.
We did not identify serious sources of patient or condition-related variability, such as sex, disease severity, and ambient light, although our sample size may not have been large enough to identify small sources of variability. We found that ambient light intensity did influence T75, a variable that is dependent upon the speed of the pupil to return to its baseline diameter. Also, the number of trials had a small effect on some pupillary parameters. This could be mitigated by controlling light conditions to keep a normal ambient light interval (300-1500 lx) [25], limiting the number of trials, and using proper measurement technique (keeping the eyelids open manually to avoid excessive blinking). Another solution to controlling light conditions is to perform measurements in complete darkness, although this may be uncomfortable and requires time for adaptation that may not be suited for everyday clinical use. We find that three trials seem to produce reliable serial measurements without affecting the qLRP parameters to a large degree. Although we did not identify significant differences between eyes, we propose that qLRP measurements should be performed on the same eye for serial evaluation.
The disease severity of patients (MCI vs. mild dementia) did not seem to influence the test-retest reliability of qLRP. This means that longitudinal qLRP could perform equally well in both early and later-stage disease, although we did not include patients with an MMSE total score lower than 20. Thus, we cannot generalize findings to patients with moderate-severe dementia, for which the neurodegenerative processes might be hypothesized to influence qLRP parameters more severely. Surprisingly, we did not find an increased variability when dichotomizing groups according to CSF total tau levels, which indicates that the presence of active neurodegenerative processes does not influence the variability of qLRP.
We found a high degree of correlation between qLRP variables. The baseline pupil diameter was correlated with all but one qLRP parameter (latency). This has previously been found [7,15], and it has therefore been proposed to only use relative measures such as relative change in pupillary size. Further, the observed correlation also led us to investigate whether baseline pupil diameter could explain some of the inter-visit variability. Indeed, adjusting for baseline pupil diameter reduced residuals of the models for peak pupil diameter, delta, and constriction velocity. This could indicate that baseline pupil diameter, as a marker of arousal and/or affective state, has a general effect on these light reflex properties [32-34]. Another explanation could be that all these parameters are correlated to a higher-order affective state, which is an aspect we could not investigate since we did not have data relating to the affective state of the patients.
We only investigated metrics that were provided by the manufacturer to increase the reproducibility of our study and to resemble a clinical setting. For example, we did not report on qLRP parameters such as the maximum pupillary constriction acceleration, which has previously been found to be altered in patients with AD [1], and we cannot draw conclusions about the test-retest reliability of this metric.
An interesting possible application of qLRP is disease tracking in Alzheimer's disease, as the light reflex has been found to be altered in patients with AD also at a pre-symptomatic stage [11,35]. We found that qLRP could be measured equally reliably in patients with an AD and a non-AD diagnosis, and our study therefore supports this proposed application. Previous studies have focused on cross-sectional measurements in specific disease groups, such as patients with AD and patients with Parkinson's disease with and without cognitive impairment [1], which are disorders frequently encountered in memory clinics. While some groups have shown excellent separation of disease vs. healthy groups [11], others have failed to replicate these results [36,37]. It has been proposed recently that longitudinal measurements may better capture neurodegenerative processes in these diseases [2,3], the further exploration of which the present study supports.
We did not find a statistically significant correlation between MMSE score and qLRP variables, which has previously been shown in patients with AD [38]. Also, age would be expected to correlate with qLRP parameters, which we could not show. There are several explanations for this. First, the study was not powered for this aspect, as we were mainly interested in the test-retest reliability and not cross-sectional associations. Second, our cohort consisted of a mixture of dementia etiologies. This again limited power in detecting significant correlations with MMSE specifically. Third, the MMSE total scores were quite restricted, and MMSE is generally not considered a sensitive test of cognitive decline. Finally, qLRP may not be a marker of disease severity but perhaps a trait marker of disease, which our study was not designed to investigate. As such, qLRP still presents a promising approach, and a study in a large cross-sectional cohort conducted by our group has shown correlations between MMSE, age, and qLRP parameters (Gramkow et al., unpublished). Efforts should be made in future studies to evaluate whether qLRP could be used according to a cut point, which effectively would aid the clinician in determining the presence of a disease such as Alzheimer's disease, in which qLRP has shown the most promising results [1,11,38]. The present study had a purposefully limited follow-up period, and future studies should evaluate a longitudinal design allowing for longer individual trajectories with repeat measurements of qLRP.
The strengths of this study are the prospective design, the relatively large sample size, and a cohort that represents the heterogeneous profile of memory clinic patients. We also performed measurements in out-of-lab clinical settings reflecting real-world conditions, where qLRP is expected to be applied. As such, external factors that could influence measurements (e.g., scheduled lumbar puncture) were allowed in the research design. This likely increases the external validity of our results.
We are also aware of certain limitations pertaining to our study. We had a small but noteworthy drop-out rate, and the application of complete case analysis may limit the generalizability of our results on the test-retest reliability. Also, the cohort had relatively high MMSE total scores, and we did not include patients with moderate-severe dementia. It is possible that these more advanced disease states could lead to higher intra-subject variability, which we could not investigate in this study. To add to this, we only included patients who had lumbar puncture performed, which might have induced selection bias.
In conclusion, most pupillometric parameters, except T75, showed good-excellent test-retest reliability and limited intra-subject, short-term variability. We propose that a mean of three measurements is used for analysis, as this method gives the smallest amount of variability. Pupillometry could be reliably used regardless of the age and sex of subjects, the amount of neurodegeneration present, a diagnosis of AD, and the use of antidepressant medication. Considering our results, we find that qLRP is a reliable tool that can be used in patients undergoing evaluation in a memory clinic. This work paves the way for future studies that could evaluate changes in qLRP related to neurodegenerative processes or pharmacological interventions.

Fig. 2. A plot of the pupillary constriction as a function of time. The various metrics that are extracted by the PLR-3000 pupillometer are shown. The star marks the beginning of the light stimulus.

Fig. 3. A correlogram showing correlations between baseline pupillary measurements (mean of the first three acceptable measurements at the baseline visit) and demographic variables. Spearman's rho is given, and only significant correlations are shown (NS = not significant). For the pupillometric measurements, only complete cases are considered (conditioned on having at least one acceptable T75 measurement), and one patient was excluded due to not having a baseline T75 value (N = 39 patients).

Table 1
Study cohort characteristics.

Table 2
Intraclass correlation coefficients and proposed clinically significant change.

Table 3
Means of the first three acceptable qLRP measurements at the three study visits.