Open Access
Wuhan Univ. J. Nat. Sci.
Volume 30, Number 2, April 2025
Page(s) 169 - 183
DOI https://doi.org/10.1051/wujns/2025302169
Published online 16 May 2025

© Wuhan University 2025

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

Subgroups are usually defined based on biomarkers such as blood sugar, blood pressure, gene or protein expression levels, etc[1]. When the efficacy of a new therapy is evaluated in clinical studies, differences in these characteristics may lead different patients to respond differently to the same therapy[2]. Therefore, when physicians assess therapeutic effect, they should consider not only the average effect across the overall population, but also the heterogeneous effects within subgroups. Many methods for subgroup identification have been proposed in the research literature, such as methods based on tree structures[3], global models[4], and clustering analysis[5]. In clinical practice, a straightforward and easily interpretable approach is often preferred: dividing patients into subgroups according to whether a single continuous biomarker exceeds a certain threshold. This is the threshold model we focus on.

However, in the threshold model described above, a single variable is used to classify subgroups, which may give a limited representation of subgroup characteristics and can be compromised by an inappropriate choice of threshold variable. Numerous studies suggest that subgroups may arise from the combined influence of multiple variables. For instance, Jiang et al[6] demonstrated that multiple single nucleotide polymorphisms (SNPs) can jointly alter the risk of developing specific diseases, whereas any single SNP alone lacks this capability; Fan et al[7] showed that a risk score defined as a function of multiple predictors is useful for identifying subgroups of AIDS patients who benefit more from treatment. Similarly, Vander Weele et al[8] discussed the importance of using multiple variables to identify optimal treatment subgroups. He et al[9] and Wei et al[10] extended this model to survival and longitudinal data, respectively. While they utilized penalization methods to identify significant biomarkers, they did not perform variable selection for the included covariates.

Moreover, the majority of current studies on threshold models focus on traditional mean regression. However, we sometimes place greater emphasis on other quantiles of the data; for example, we may prefer to conduct heterogeneity analysis on patients in worse condition and provide targeted medical solutions. Consequently, we consider introducing quantile regression to analyze the threshold model. Threshold quantile regression was initially introduced by Cai et al[11] and applied to quantile self-exciting threshold autoregressive time series models. Galvao et al[12] studied threshold quantile autoregressive processes. Furthermore, Lee et al[13] developed general tests for threshold effects in regression models, and Su and Xu[14] presented a systematic estimation and inference procedure for threshold quantile regression. On this basis, Zhang et al[15] proposed a more general single-index threshold quantile regression model. However, no existing literature addresses variable selection for covariates and biomarkers in single-index threshold quantile regression.

In practical applications, incorporating an excessive number of biomarkers for subgroup division may diminish the interpretability of the subgroups, and including an excessive number of covariates may also introduce multicollinearity. Consequently, we prefer to select the variables most pertinent to the response from the model's explanatory variables. Common variable selection methods include those based on penalties. This approach, initially proposed by Tibshirani in 1996[16], is known as Lasso estimation, which uses an L1 norm penalty. Subsequent work includes the Fused Lasso penalty[17], Adaptive Lasso[18], Graphical Lasso[19], and so on. Furthermore, because the Lasso shrinks all coefficients equally, the resulting estimator is often biased. To this end, Fan et al[20] introduced the smoothly clipped absolute deviation (SCAD) penalty and proved that the resulting estimator satisfies the oracle property.

Our paper extends the single-index threshold quantile regression[15] to longitudinal data and introduces the SCAD penalty for variable selection of covariates and biomarkers. To this end, we propose an efficient algorithm similar to that of Zhang et al[21], which transforms the estimation of the covariate regression coefficients and threshold parameters into a penalized linear quantile regression. Furthermore, through pseudo observations the problem reduces to an unpenalized linear quantile regression. In addition, the covariates and the biomarkers are penalized separately. We use a two-step grid search algorithm[22] to select the shrinkage parameters that minimize the Bayesian information criterion (BIC), and then substitute the selected shrinkage parameters into the iterative algorithm for parameter estimation. Finally, random simulations and a real data analysis illustrate the practicality of our method.

1 Models and Estimation

We assume that $y_{ij}$ is the response variable of the $i$-th individual at the $j$-th observation, $i=1,\ldots,n$, $j=1,\ldots,m$, with total number of observations $N=nm$; $z_{ij}=(z_{1,ij},\ldots,z_{p,ij})^{\mathrm T}$ is the $p$-dimensional covariate vector with $z_{1,ij}=1$; $\tilde z_{ij}$ is a subset of $z_{ij}$; and $x_i=(x_{i0},\ldots,x_{id})^{\mathrm T}$ is the baseline biomarker vector including an intercept term. We consider the following single-index threshold quantile regression with $\tau\in(0,1)$:

$$y_{ij}=z_{ij}^{\mathrm T}\beta(\tau)+\tilde z_{ij}^{\mathrm T}\delta(\tau)\, I(x_i^{\mathrm T}\gamma(\tau)>0)+\varepsilon_{ij},\qquad i=1,\ldots,n,\; j=1,\ldots,m.$$

The regression coefficients $\beta(\tau)$ characterize the baseline effect of $z_{ij}$ at the $\tau$-th quantile of $y_{ij}$; the regression coefficients $\delta(\tau)$ describe the heterogeneity of $\tilde z_{ij}$ within the subgroup; and the threshold coefficients $\gamma(\tau)$ determine the subgroup according to whether the linear combination $x_i^{\mathrm T}\gamma(\tau)$ exceeds 0. We consider time-invariant biomarkers so that time-varying biomarkers, influenced by treatment, do not split repeated observations of the same subject into different subgroups. In addition, the error vectors $(\varepsilon_{i1},\ldots,\varepsilon_{im})$ and $(\varepsilon_{k1},\ldots,\varepsilon_{km})$ are assumed to be independent for any $i\neq k$ in the index set $\{1,\ldots,n\}$, but the components of $(\varepsilon_{i1},\ldots,\varepsilon_{im})$ may be correlated for each fixed $i\in\{1,\ldots,n\}$. Assume $P(\varepsilon_{ij}<0\,|\,z_{ij},x_i)=\tau$.

To ensure identifiability of $\gamma(\tau)$, following Zhang et al[15] and Horowitz[23], we assume that at least one threshold variable has a nonzero coefficient and that, conditional on the other threshold variables, its distribution is absolutely continuous with respect to the Lebesgue measure. The variables in $x_i$ can be rearranged so that $x_{1i}$ meets this condition, and the model can then be rewritten as:

$$Q_{y_{ij}}(\tau\,|\,z_{ij},x_i)=z_{ij}^{\mathrm T}\beta(\tau)+\tilde z_{ij}^{\mathrm T}\delta(\tau)\, I(x_{1i}+x_{2i}^{\mathrm T}\psi(\tau)>0)$$

where $x_i=(x_{1i},x_{2i}^{\mathrm T})^{\mathrm T}$ and $\psi(\tau)=\gamma_{-1}(\tau)/\gamma_1(\tau)$, with $\gamma_1(\tau)$ and $\gamma_{-1}(\tau)$ denoting the coefficients of $x_{1i}$ and $x_{2i}$, respectively. If $\gamma_1(\tau)$ is positive, this normalization does not alter the coefficients $\beta(\tau)$ and $\delta(\tau)$; if $\gamma_1(\tau)$ is negative, $\beta(\tau)$ and $\delta(\tau)$ should be redefined to keep the model consistent[15]. For ease of notation, we drop $(\tau)$ from the parameters below.

In order to estimate all parameters $\theta=(\beta^{\mathrm T},\delta^{\mathrm T},\psi^{\mathrm T})^{\mathrm T}$, a computationally efficient approach is to ignore the possible correlation between repeated observations within the longitudinal data, that is, to minimize the following objective function under the working independence framework:

$$l_n(\theta)=\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left\{y_{ij}-z_{ij}^{\mathrm T}\beta-\tilde z_{ij}^{\mathrm T}\delta\, I(x_{1i}+x_{2i}^{\mathrm T}\psi>0)\right\}$$

where $\rho_\tau(u)=\{\tau-I(u<0)\}u$ is the quantile loss function. To address the discontinuity of the objective function, similar to Seo and Linton[24], we smooth the indicator function with an integrated kernel, for example the cumulative distribution function of the standard normal distribution $\Phi(\cdot)$, which yields the following smoothed objective function:

$$S_n(\theta;h_n)=\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left\{y_{ij}-z_{ij}^{\mathrm T}\beta-\tilde z_{ij}^{\mathrm T}\delta\,\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\psi}{h_n}\right)\right\}$$

Following Zhang et al[15], this article selects the bandwidth $h_n=0.5\log(N)/N$.
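To make the smoothing step concrete, the following is a minimal sketch (not the authors' code) of $S_n(\theta;h_n)$, assuming the longitudinal data are stacked into numpy arrays of length $N=nm$, with the time-invariant threshold variables repeated across the $m$ visits.

```python
import numpy as np
from scipy.stats import norm

def quantile_loss(u, tau):
    """rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def smoothed_objective(beta, delta, psi, y, Z, Zt, X1, X2, tau, h_n):
    """S_n(theta; h_n): the indicator I(x1 + x2' psi > 0) replaced by Phi((x1 + x2' psi) / h_n)."""
    subgroup = norm.cdf((X1 + X2 @ psi) / h_n)        # smoothed subgroup membership, length N
    resid = y - Z @ beta - (Zt @ delta) * subgroup    # residuals of the threshold model
    return quantile_loss(resid, tau).mean()           # (1/nm) * sum of check losses

# Bandwidth used in the paper: h_n = 0.5 * log(N) / N with N = n * m.
```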

Furthermore, we wish to select the important covariates in $z_{ij}$ and the important threshold variables in $x_{2i}$. In practice, only a few exposure variables of interest, such as treatment indicators or drug doses, are included in $\tilde z_{ij}$, so variable selection for the interaction terms is not considered. Consequently, the objective function with the corresponding penalty terms is formulated as:

$$Q_n(\theta;h_n)=S_n(\theta;h_n)+\sum_{l=1}^{p}p_{\lambda_1}(|\beta_l|)+\sum_{l=1}^{d}p_{\lambda_2}(|\psi_l|)$$

where $p_\lambda(\cdot)$ is the penalty function, and $\lambda_1$ and $\lambda_2$ are the shrinkage tuning parameters controlling the partially linear regression parameters and the threshold parameters, respectively. We consider the following non-convex SCAD penalty function:

$$p_\lambda(|\alpha|)=\lambda|\alpha|\, I(0<|\alpha|\le\lambda)+\frac{(a^2-1)\lambda^2-(|\alpha|-a\lambda)^2}{2(a-1)}\, I(\lambda<|\alpha|\le a\lambda)+\frac{(a+1)\lambda^2}{2}\, I(|\alpha|>a\lambda),$$

where $\lambda>0$ and $a=3.7$, the value at which the SCAD penalty approximately minimizes the Bayes risk[20]; therefore, the only nuisance parameters involved in the model are $\lambda_1>0$ and $\lambda_2>0$.
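For reference, a hedged sketch of the SCAD penalty and its derivative follows (the derivative is the weight used by the local linear approximation below); the function names are illustrative only.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|alpha|) evaluated at t = |alpha| >= 0 (Fan and Li, 2001)."""
    t = np.abs(t)
    small = lam * t
    middle = ((a ** 2 - 1) * lam ** 2 - (t - a * lam) ** 2) / (2 * (a - 1))
    large = (a + 1) * lam ** 2 / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) for t >= 0; used below as the LLA weight on |beta_l| and |psi_l|."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
```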

Since the objective function $Q_n(\theta;h_n)$ is non-convex and not everywhere differentiable, we use the local linear approximation (LLA) algorithm of Zou and Li[25] to replace the non-convex SCAD penalty with a convex approximation. Specifically,

$$p_{\lambda_1}(|\beta_l|)\approx p_{\lambda_1}(|\hat\beta_l^{(0)}|)+p'_{\lambda_1}(|\hat\beta_l^{(0)}|)\left(|\beta_l|-|\hat\beta_l^{(0)}|\right),\qquad \text{for } \beta_l\approx\hat\beta_l^{(0)}$$

$$p_{\lambda_2}(|\psi_l|)\approx p_{\lambda_2}(|\hat\psi_l^{(0)}|)+p'_{\lambda_2}(|\hat\psi_l^{(0)}|)\left(|\psi_l|-|\hat\psi_l^{(0)}|\right),\qquad \text{for } \psi_l\approx\hat\psi_l^{(0)}$$

Moreover, even if we adopt a profile estimation similar to Zhang et al[15], estimating the slopes $(\beta^{\mathrm T},\delta^{\mathrm T})^{\mathrm T}$ and the threshold parameters $\psi$ separately, the objective function remains non-convex in $\psi$ when $(\beta^{\mathrm T},\delta^{\mathrm T})^{\mathrm T}$ is fixed. Owing to the slow computation of genetic algorithms and the sensitivity of the Nelder-Mead method to initial values, we employ a first-order Taylor expansion of $\Phi$,

$$\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\psi}{h_n}\right)\approx\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(0)}}{h_n}\right)+\Phi'\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(0)}}{h_n}\right)\frac{x_{2i}^{\mathrm T}}{h_n}\left(\psi-\hat\psi^{(0)}\right),$$

where $\Phi'(\cdot)$ denotes the probability density function of the standard normal distribution. Accordingly, with initial estimates $\hat\delta$ and $(\hat\beta,\hat\psi)$, we can derive a new objective function $\tilde Q(\beta,\psi)$ with respect to $\beta$ and $\psi$:

$$\tilde Q(\beta,\psi)=\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left(\tilde y_{ij}-z_{ij}^{\mathrm T}\beta-\tilde x_i^{\mathrm T}\psi\right)+N\sum_{l=1}^{p}p'_{\lambda_1}(|\hat\beta_l|)|\beta_l|+N\sum_{l=1}^{d}p'_{\lambda_2}(|\hat\psi_l|)|\psi_l|$$

where

$$\tilde y_{ij}=y_{ij}-\tilde z_{ij}^{\mathrm T}\hat\delta\,\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi}{h_n}\right)+\tilde z_{ij}^{\mathrm T}\hat\delta\,\Phi'\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi}{h_n}\right)\frac{x_{2i}^{\mathrm T}\hat\psi}{h_n},$$

$$\tilde x_i=\tilde z_{ij}^{\mathrm T}\hat\delta\,\Phi'\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi}{h_n}\right)\frac{x_{2i}}{h_n}.$$

Now the estimation of $\beta$ and $\psi$ is essentially a penalized linear quantile regression problem; the specific iterative algorithm is as follows:

Step 0   Initialize $(\hat\beta^{(0)},\hat\psi^{(0)})$.

Step 1   Given $(\hat\beta^{(k-1)},\hat\psi^{(k-1)})$, we have:

$$\hat\delta^{(k)}=\arg\min_{\delta}\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left\{y_{ij}-z_{ij}^{\mathrm T}\hat\beta^{(k-1)}-\tilde z_{ij}^{\mathrm T}\delta\,\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(k-1)}}{h_n}\right)\right\}$$

Step 2   Given $\hat\delta^{(k)}$, the estimates $(\hat\beta^{(k)},\hat\psi^{(k)})$ are derived by minimizing the following objective function:

$$\tilde Q(\beta,\psi)=\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left(\tilde y_{ij}-z_{ij}^{\mathrm T}\beta-\tilde x_i^{\mathrm T}\psi\right)+N\sum_{l=1}^{p}p'_{\lambda_1}(|\hat\beta_l^{(k-1)}|)|\beta_l|+N\sum_{l=1}^{d}p'_{\lambda_2}(|\hat\psi_l^{(k-1)}|)|\psi_l|,$$

where

$$\tilde y_{ij}=y_{ij}-\tilde z_{ij}^{\mathrm T}\hat\delta^{(k)}\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(k-1)}}{h_n}\right)+\tilde z_{ij}^{\mathrm T}\hat\delta^{(k)}\Phi'\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(k-1)}}{h_n}\right)\frac{x_{2i}^{\mathrm T}\hat\psi^{(k-1)}}{h_n},$$

and $\tilde x_i=\tilde z_{ij}^{\mathrm T}\hat\delta^{(k)}\Phi'\!\left(\dfrac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi^{(k-1)}}{h_n}\right)\dfrac{x_{2i}}{h_n}$. Repeat Step 1 and Step 2 until the $L_2$ norm of the difference between successive estimates of all parameters falls below the tolerance.
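The two steps can be sketched as follows, assuming the same stacked numpy arrays as before and using statsmodels' QuantReg as a generic linear quantile regression solver; this is an illustrative implementation, not the authors' code. Step 2 itself is solved through the pseudo-observation device described next.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.regression.quantile_regression import QuantReg

def step1_delta(y, Z, Zt, X1, X2, beta, psi, tau, h_n):
    """Step 1: given (beta, psi), regress y - Z beta on Zt scaled by Phi((x1 + x2' psi) / h_n)."""
    w = norm.cdf((X1 + X2 @ psi) / h_n)
    return QuantReg(y - Z @ beta, Zt * w[:, None]).fit(q=tau).params

def pseudo_responses(y, Zt, X1, X2, delta, psi, h_n):
    """Form the Step 2 pseudo responses y~ and pseudo design x~ from the Taylor expansion of Phi."""
    u = (X1 + X2 @ psi) / h_n
    s = Zt @ delta                                    # z~' delta, one value per observation
    y_tilde = y - s * norm.cdf(u) + s * norm.pdf(u) * (X2 @ psi) / h_n
    X_tilde = (s * norm.pdf(u) / h_n)[:, None] * X2
    return y_tilde, X_tilde
```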

Finally, using the facts $\rho_\tau(cv)=c\rho_\tau(v)$ for $c>0$ and $|\beta_l|=\rho_\tau(\beta_l)+\rho_\tau(-\beta_l)$, and setting $c_l=Np'_{\lambda_1}(|\hat\beta_l^{(k-1)}|)$ and $d_l=Np'_{\lambda_2}(|\hat\psi_l^{(k-1)}|)$, we can rewrite the first and second penalty terms as follows:

$$\sum_{l=1}^{p}\left\{\rho_\tau(c_l\beta_l)+\rho_\tau(-c_l\beta_l)\right\},\qquad \sum_{l=1}^{d}\left\{\rho_\tau(d_l\psi_l)+\rho_\tau(-d_l\psi_l)\right\}.$$

Furthermore, we construct an "unpenalized" linear quantile regression by incorporating additional pseudo observations. The augmented variables are defined as follows:

$$\tilde y_l^{+}=\begin{cases}\tilde y_l, & l=1,\ldots,N\\ 0, & l=N+1,\ldots,N+2p+2d\end{cases}$$

$$\tilde z_l^{+}=\begin{cases} z_l, & l=1,\ldots,N,\\ (0,\ldots,0,c_l,0,\ldots,0), & l=N+1,\ldots,N+p,\\ (0,\ldots,0,-c_l,0,\ldots,0), & l=N+p+1,\ldots,N+2p,\\ (0,\ldots,0), & l=N+2p+1,\ldots,N+2p+2d,\end{cases}$$

$$\tilde x_l^{+}=\begin{cases} x_l, & l=1,\ldots,N,\\ (0,\ldots,0), & l=N+1,\ldots,N+2p,\\ (0,\ldots,0,d_l,0,\ldots,0), & l=N+2p+1,\ldots,N+2p+d,\\ (0,\ldots,0,-d_l,0,\ldots,0), & l=N+2p+d+1,\ldots,N+2p+2d,\end{cases}$$

then the objective function in Step 2 degenerates into the linear quantile regression
$$\sum_{l=1}^{N+2p+2d}\rho_\tau\left(\tilde y_l^{+}-\tilde z_l^{+\mathrm T}\beta-\tilde x_l^{+\mathrm T}\psi\right),$$
which can then be solved directly.
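A minimal sketch of this pseudo-observation device, under the same stacked-array assumptions: the SCAD-weighted $\ell_1$ terms become extra rows with zero responses, so Step 2 reduces to a single ordinary (unpenalized) linear quantile regression.

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def penalized_qr_via_augmentation(y_tilde, Z, X_tilde, c, d, tau):
    """Step 2: c_l = N p'_{lambda1}(|beta_l|) and d_l = N p'_{lambda2}(|psi_l|) from the previous iterate."""
    N, p = Z.shape
    q_dim = X_tilde.shape[1]
    # pseudo rows: (+/- c_l) in the beta block, (+/- d_l) in the psi block, responses all zero
    Z_aug = np.vstack([Z, np.diag(c), -np.diag(c), np.zeros((2 * q_dim, p))])
    X_aug = np.vstack([X_tilde, np.zeros((2 * p, q_dim)), np.diag(d), -np.diag(d)])
    y_aug = np.concatenate([y_tilde, np.zeros(2 * p + 2 * q_dim)])
    fit = QuantReg(y_aug, np.hstack([Z_aug, X_aug])).fit(q=tau)
    return fit.params[:p], fit.params[p:]             # (beta_hat, psi_hat)
```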

In practice, BIC is frequently employed for tuning parameter selection in variable selection problems,

$$\mathrm{BIC}(\lambda_1,\lambda_2)=\log\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\rho_\tau\left\{y_{ij}-z_{ij}^{\mathrm T}\hat\beta_{\lambda_1}-\tilde z_{ij}^{\mathrm T}\hat\delta\,\Phi\!\left(\frac{x_{1i}+x_{2i}^{\mathrm T}\hat\psi_{\lambda_2}}{h_n}\right)\right\}\right)+\frac{\log(N)}{2N}d_\lambda,$$

where $d_\lambda$ represents the total count of non-zero coefficients in both the slope and threshold components, and $N$ denotes the total number of observations. Following Ruppert and Carroll[22], we employ a two-step grid search algorithm to derive the optimal tuning parameters $(\lambda_1,\lambda_2)$.
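An illustrative sketch of the BIC criterion, assuming the fitted $(\hat\beta,\hat\delta,\hat\psi)$ come from the iterative algorithm above for a given $(\lambda_1,\lambda_2)$; the two-step grid search then scans a coarse grid of $(\lambda_1,\lambda_2)$ pairs and refines around the coarse BIC minimizer.

```python
import numpy as np
from scipy.stats import norm

def bic(y, Z, Zt, X1, X2, beta_hat, delta_hat, psi_hat, tau, h_n):
    """BIC(lambda1, lambda2) evaluated at the fit obtained with those tuning parameters."""
    u = norm.cdf((X1 + X2 @ psi_hat) / h_n)
    resid = y - Z @ beta_hat - (Zt @ delta_hat) * u
    check_loss = np.sum(resid * (tau - (resid < 0)))                    # summed quantile loss
    N = len(y)
    d_lambda = np.count_nonzero(beta_hat) + np.count_nonzero(psi_hat)   # nonzero slope + threshold coefs
    return np.log(check_loss) + np.log(N) / (2 * N) * d_lambda
```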

2 Simulation

In order to assess the performance of our proposed estimation in finite samples, we consider the following data generation model:

$$y_{ij}=z_{ij}^{\mathrm T}\beta_0+\tilde z_{ij}^{\mathrm T}\delta_0\, I\{x_{1i}+x_{2i}^{\mathrm T}\psi_0+0.1\Phi^{-1}(U_i)>0\}+(1+0.1z_{2,ij})\varepsilon_{ij},$$

where the covariates are $z_{ij}=(1,z_{1,ij},\ldots,z_{p-1,ij})^{\mathrm T}$ with $z_{1,ij}\sim \mathrm{Exp}(1)$, $z_{2,ij}\sim N(-1,1)$, and the remaining covariates $z_{3,ij},\ldots,z_{p-1,ij}$ independent and uniform $U(0,1)$; the interaction variables are $\tilde z_{ij}=(1,z_{1,ij})^{\mathrm T}$; the threshold variables are $x_{1i}\sim N(0,1)$ and $x_{2i}=(1,x_{2i}^{*},\ldots,x_{di}^{*})^{\mathrm T}$ with $x_{2i}^{*}\sim N(1,1)$ and the remaining components $x_{3i}^{*},\ldots,x_{di}^{*}$ independent and uniform $U(0,1)$ as well. We set $\beta_0=(1,1,1,\underbrace{0,\ldots,0}_{p-3})^{\mathrm T}$, $\delta_0=(1,1)^{\mathrm T}$, and $\psi_0=(-1,1,\underbrace{0,\ldots,0}_{d-2})^{\mathrm T}$. In our data generation model, however, the true value of $(\beta_{01},\beta_{03},\psi_{01})$ varies with the quantile. The random error term is $\varepsilon_{ij}=\xi+e_{ij}$, where $\xi$ ensures $P(\varepsilon_{ij}<0\,|\,z_{ij},x_i)=\tau$. We consider two error distributions:

Case 1   The random error vector follows a multivariate normal distribution, that is, $\varepsilon_i=(\varepsilon_{i1},\ldots,\varepsilon_{im})^{\mathrm T}\sim \mathrm{MVN}(0,\Sigma)$, and the covariance matrix $\Sigma$ has an AR(1) correlation structure with correlation coefficient $\rho=0.3$, where part of the model coefficients are set to

$$(\beta_{01},\beta_{03},\psi_{01})=\left(1+\Phi^{-1}(\tau),\; 1+0.1\Phi^{-1}(\tau),\; 0.1\Phi^{-1}(\tau)-1\right).$$

Case 2   The random error vector follows a multivariate $t$ distribution, that is, $\varepsilon_i=(\varepsilon_{i1},\ldots,\varepsilon_{im})^{\mathrm T}\sim \mathrm{MVT}_3(0,\Sigma)$, and the covariance matrix shares the same structure, where $(\beta_{01},\beta_{03},\psi_{01})=\left(1+q_t(\tau,\mathrm{df}=3),\; 1+0.1\,q_t(\tau,\mathrm{df}=3),\; 0.1\Phi^{-1}(\tau)-1\right)$ and $q_t(\tau,\mathrm{df}=3)$ is the $\tau$-quantile of the $t$ distribution with 3 degrees of freedom (df).
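The Case 1 data-generating process can be sketched as follows. This is our reading of the description above: in particular, we take $U_i$ to be uniform on $(0,1)$, $\mathrm{Exp}(1)$ to be the rate-1 exponential, and the shift $\xi=-\Phi^{-1}(\tau)$ to center the errors so that $P(\varepsilon_{ij}<0)=\tau$.

```python
import numpy as np
from scipy.stats import norm

def simulate_case1(n=400, m=4, p=10, d=10, tau=0.5, rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # covariates z_ij = (1, z1, ..., z_{p-1}): z1 ~ Exp(1), z2 ~ N(-1, 1), rest ~ U(0, 1)
    Z = np.ones((n, m, p))
    Z[..., 1] = rng.exponential(1.0, (n, m))
    Z[..., 2] = rng.normal(-1.0, 1.0, (n, m))
    Z[..., 3:] = rng.uniform(0.0, 1.0, (n, m, p - 3))
    # threshold variables: x1 ~ N(0, 1); x2 = (1, x2*, U(0,1), ...), x2* ~ N(1, 1)
    x1 = rng.normal(0.0, 1.0, n)
    X2 = np.ones((n, d))
    X2[:, 1] = rng.normal(1.0, 1.0, n)
    X2[:, 2:] = rng.uniform(0.0, 1.0, (n, d - 2))
    # true parameters at quantile tau (Case 1)
    beta0 = np.zeros(p); beta0[:3] = [1 + norm.ppf(tau), 1.0, 1 + 0.1 * norm.ppf(tau)]
    delta0 = np.array([1.0, 1.0])
    psi0 = np.zeros(d); psi0[:2] = [0.1 * norm.ppf(tau) - 1, 1.0]
    # AR(1)-correlated normal errors, shifted so that P(eps_ij < 0 | z, x) = tau
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    eps = rng.multivariate_normal(np.zeros(m), Sigma, n) - norm.ppf(tau)
    subgroup = (x1 + X2 @ psi0 + 0.1 * norm.ppf(rng.uniform(size=n)) > 0).astype(float)
    Zt = Z[..., :2]                                    # interaction covariates z~ = (1, z1)
    y = (np.einsum('nmp,p->nm', Z, beta0)
         + np.einsum('nmq,q->nm', Zt, delta0) * subgroup[:, None]
         + (1 + 0.1 * Z[..., 2]) * eps)
    return y, Z, Zt, x1, X2
```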

We assess the performance of our estimation across varying numbers of parameters ($p=d=10$ or $p=d=20$) and different quantiles ($\tau=0.25,0.5,0.75$); each replication generates $n=400$ or $800$ subjects, each observed $m=4$ times, and the simulation is repeated 500 times. We evaluate the model using four criteria. 1) True Negative (TN): the average number of zero parameters correctly estimated as zero; a higher TN indicates better performance. 2) False Negative (FN): the average number of non-zero parameters erroneously estimated as zero; a lower FN indicates better performance. 3) Correct (%): the percentage of replications in which the non-zero parameters of the true model are correctly identified; the closer this percentage is to 100, the more accurate the estimation. 4) MSE: the mean squared error of the parameter estimates; a smaller MSE means the estimate $\hat\theta$ is closer to the true value $\theta_0$ and the estimation is better. We also compare the mean squared error of the oracle estimator (O.MSE) with that of our method (P.MSE); the closer P.MSE is to O.MSE, the better the performance of our approach.
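As a rough illustration of these criteria (with "Correct" read as exact recovery of the zero/non-zero pattern, which is our interpretation, not a definition from the paper):

```python
import numpy as np

def selection_metrics(est, truth, tol=1e-8):
    """est: (R, q) estimated coefficients over R replications; truth: (q,) true coefficients."""
    zero = np.abs(truth) < tol
    est_zero = np.abs(est) < tol
    tn = est_zero[:, zero].sum(axis=1).mean()           # TN: average # of true zeros set to zero
    fn = est_zero[:, ~zero].sum(axis=1).mean()          # FN: average # of true nonzeros set to zero
    correct = 100.0 * np.mean(np.all(est_zero == zero, axis=1))   # exact support recovery rate
    mse = np.mean(np.sum((est - truth) ** 2, axis=1))   # MSE of the parameter estimates
    return {"TN": tn, "FN": fn, "Correct(%)": correct, "MSE": mse}
```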

Tables 1 and 2 present the variable selection results of the single-index threshold quantile regression when the error terms follow a multivariate normal distribution or a multivariate $t(3)$ distribution, respectively. In the ideal scenario, the TN component for the linear part is $p-3$ and for the threshold part is $d-2$, so the oracle TN is 15 when $p=d=10$ and 35 when $p=d=20$. In all settings, our method compresses the true zero coefficients to zero and correctly identifies all relevant covariates (FN=0), and P.MSE is very close to O.MSE, which indicates that the model estimation is very effective.

Tables 3 and 4 show the estimates of the identified nonzero parameters obtained by our method when the error terms follow a multivariate normal distribution or a multivariate $t(3)$ distribution, respectively. The biases of all nonzero parameters remain small across the various parameter counts ($p=d=10$ or $p=d=20$) and quantiles ($\tau=0.25,0.5,0.75$), and all metrics (biases, SDs, and RMSEs) decrease as the sample size increases.

We further conduct a sensitivity comparison for two bandwidth rates, $r=\log(N)/N$ and $r_{\mathrm{adj}}=1/N^{5}$, under $p=d=10$ and $\tau=0.5$, keeping the remaining settings the same as before. The RMSEs of the proposed estimators in Fig. 1 show that these two bandwidths do not exert significant effects on the performance of the proposed method.

Fig. 1   The sensitivity diagram of parameter estimation under the bandwidth rates $r$ and $r_{\mathrm{adj}}$ with $p=d=10$ and $\tau=0.5$

Table 1

Variable selection in the case where $\varepsilon_i\sim\mathrm{MVN}(0,\Sigma)$ and $\Sigma$ follows an AR(1) structure

Table 2

Variable selection in the case where $\varepsilon_i\sim\mathrm{MVT}_3(0,\Sigma)$ and $\Sigma$ follows an AR(1) structure

Table 3

Estimation results of parameters in the case where $\varepsilon_i\sim\mathrm{MVN}(0,\Sigma)$ and $\Sigma$ follows an AR(1) structure

Table 4

Estimation results of parameters in the case where $\varepsilon_i\sim\mathrm{MVT}_3(0,\Sigma)$ and $\Sigma$ follows an AR(1) structure

3 Real Data Analysis

We apply the proposed method to analyze the PA.3 trial dataset of the NCIC Clinical Trials Group. In this trial, patients with locally advanced or metastatic pancreatic cancer were randomly assigned to either erlotinib plus gemcitabine or gemcitabine monotherapy. In the primary analysis, survival in the erlotinib plus gemcitabine group was significantly improved compared with the gemcitabine monotherapy group[26]. Shultz et al[27] found that, when each biomarker was evaluated individually, none was significantly associated with erlotinib. However, He et al[9] selected CA19-9 and AXL from 47 biomarkers to define a treatment-sensitive subgroup and found that erlotinib plus gemcitabine prolonged patients' survival time compared with gemcitabine monotherapy. Similarly, Wei et al[10] selected CA19-9 and AXL from 6 biomarkers to define a treatment-sensitive subgroup and found that erlotinib plus gemcitabine improved patients' quality of life (QoL) scores compared with gemcitabine monotherapy. However, these studies focused only on selecting biomarkers for defining subgroups, not covariates. Moreover, none of them investigated whether distinct subgroup effects exist for patients at various QoL levels. Therefore, we use our method to select relevant covariates and biomarkers, and assess the differential therapeutic effect of erlotinib in subgroups defined by linear combinations of these biomarkers, as well as the differences in effects at different quantile levels of the outcome.

A total of 377 patients were included in the analysis, of whom 196 received erlotinib, and their quality of life was assessed using the global QoL score. For each patient in the trial, QoL scores were recorded at baseline and at 4, 8, 12, and 16 weeks after treatment. We focus on the change in a patient's QoL score relative to baseline (cQoL), and missing values were imputed using multiple imputation. The Q-Q plot of the imputed variable $y=\mathrm{cQoL}$ and the Shapiro-Wilk normality test indicate that the data do not follow a normal distribution. The calculated skewness is 0.139 3, indicating a right-skewed distribution, so quantile regression is considered more suitable than mean regression. The covariates included in the analysis are: baseline QoL score (bQoL), time point (Time), baseline age (Age), treatment indicator B (B=1 if the patient received erlotinib plus gemcitabine; B=0 if the patient received gemcitabine alone), and pain level (Painmeas). The twelve biomarkers included in the analysis are proteins whose baseline expression levels are associated with lower overall survival or with metastatic disease, specifically: AXL: AXL receptor tyrosine kinase; CA19-9: Carbohydrate antigen 19-9; IL8: Interleukin-8; CEA: Carcinoembryonic antigen; MUC-1: Mucin 1, cell surface associated; PDGFRA: Platelet-derived growth factor receptor alpha; EGFR: Epidermal growth factor receptor; PDK1: Pyruvate dehydrogenase kinase 1; BMP-2: Bone morphogenetic protein 2; HER2: Erb-b2 receptor tyrosine kinase 2; PF4: Platelet factor 4; GAS6: Growth arrest-specific protein 6. We standardize the 12 biomarkers and impute missing values using the median.

It is worth noting that our model requires specifying a significant biomarker $x_1$ and assumes its coefficient to be nonzero. This choice should primarily be guided by prior knowledge, such as expert opinion or existing literature. For the PA.3 dataset, existing studies have selected CA19-9 and AXL as significant biomarkers in mean regression[10] and Cox regression[9] analyses based on threshold models for $y$. Hence, we may consider using CA19-9 or AXL as the significant threshold variable for defining subgroups, with one designated as $x_1$ and the rest stored in $x_2$; we can also compare the differences in subgroup analysis when a different biomarker is designated as $x_1$. We construct the following single-index threshold quantile regression to fit the data:

$$\begin{aligned} \mathrm{cQoL}_{ij}= {} & \beta_1+\beta_2\,\mathrm{bQoL}_i+\beta_3\,\mathrm{Time}_{ij}+\beta_4\,\mathrm{Age}_i+\beta_5 B_i+\beta_6\,\mathrm{Painmeas}_i\\ & +\delta_1 I(x_{1i}+x_{2i}^{\mathrm T}\psi>0)+\delta_2 B_i\, I(x_{1i}+x_{2i}^{\mathrm T}\psi>0)+\varepsilon_{ij},\qquad i=1,\ldots,377,\; j=1,\ldots,4.\end{aligned}$$

We set $y=\mathrm{cQoL}$, $z=(1,\mathrm{bQoL},\mathrm{Time},\mathrm{Age},B,\mathrm{Painmeas})$, $\tilde z=(1,B)$, $x_1=$ CA19-9 or AXL, $x=(1,$ CA19-9, AXL, IL8, CEA, MUC-1, PDGFRA, EGFR, PDK1, IP-10, HER2, PF4, GAS6$)$, and $x_2$ denotes the part of $x$ remaining after removing $x_1$, assuming that the error term satisfies $P(\varepsilon_{ij}<0\,|\,z_{ij},x_i)=\tau$.
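As an illustration of how the design matrices for this model might be assembled, the sketch below uses a hypothetical long-format DataFrame; the DataFrame name "df" and its column names (e.g. "cQoL", "Painmeas", and the biomarker columns) are placeholders, not the trial's actual variable names.

```python
import numpy as np
import pandas as pd

def build_design(df: pd.DataFrame, x1_name: str = "AXL"):
    """Assemble y, z, z~, x1, x2 for the single-index threshold quantile regression above."""
    n_rows = len(df)
    y = df["cQoL"].to_numpy()
    Z = np.column_stack([np.ones(n_rows), df[["bQoL", "Time", "Age", "B", "Painmeas"]]])
    Zt = np.column_stack([np.ones(n_rows), df["B"]])           # interaction block (1, B)
    biomarkers = ["CA19_9", "AXL", "IL8", "CEA", "MUC1", "PDGFRA", "EGFR",
                  "PDK1", "IP10", "HER2", "PF4", "GAS6"]
    bm = (df[biomarkers] - df[biomarkers].mean()) / df[biomarkers].std()   # standardize
    X1 = bm[x1_name].to_numpy()                                 # fixed threshold variable
    rest = [b for b in biomarkers if b != x1_name]
    X2 = np.column_stack([np.ones(n_rows), bm[rest]])           # (1, remaining biomarkers)
    return y, Z, Zt, X1, X2
```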

To conduct subgroup analysis of the outcome at different levels as comprehensively as possible, we consider seven quantile levels $\tau=0.2,0.3,\ldots,0.8$, and the same two-step grid search method as in the simulation study is used to select the tuning parameters $\lambda_1$ and $\lambda_2$. Table 5 presents the point estimate, bootstrap standard error (BS-SE), 95% confidence interval (95%-CI), and p-value of each parameter at the above quantile levels with $x_1=\mathrm{AXL}$ fixed.

From Table 5, in terms of variable selection our method sets the coefficients of observation time (Time) and the treatment indicator (B) to 0 across all quantiles from 0.2 to 0.8; it also sets the coefficient of age (Age) to 0 at quantiles 0.4 to 0.8, while its effect at quantiles 0.2 to 0.3 is not statistically significant. Thus, we conclude that the effects of observation time (Time) and age (Age) on patients' QoL scores are not significant. However, the selection of the treatment effect (B) as an unrelated variable may be due to masking by the interaction between treatment and subgroup, rendering the main effect insignificant; this requires further confirmation in the stratified analysis. In addition, CA19-9 is selected as an important biomarker at all quantiles, and the coefficients of the remaining 10 biomarkers are compressed to 0. The biomarkers we select are consistent with the results of He et al[9] and Wei et al[10].

In terms of subgroup analysis, within the middle and lower quantiles (0.2-0.6), the p-value of the interaction coefficient $\delta_2$ between treatment and subgroup is less than 0.05, demonstrating a significant difference in therapeutic effect between the two subgroups at the 0.05 significance level. However, within the higher quantiles (0.7-0.8), the corresponding p-value exceeds 0.05, indicating that the difference in therapeutic effect between the two subgroups is not significant at the 0.05 level; a possible reason is that no treatment-sensitive subgroup exists at these levels, which requires further exploration through stratified analysis.

Next, we define the biomarker-positive and biomarker-negative subgroups according to the linear combination of the fixed biomarker AXL, the intercept, and the selected biomarker CA19-9, with the corresponding coefficients $(\psi_0,\psi_1)$ as shown in Table 5. Linear quantile regression is then conducted on all covariates separately for the samples in the two subgroups; the estimated therapeutic effects and their 95% confidence intervals at different quantiles are presented in Fig. 2, with the confidence intervals derived from rank tests. Figure 2 reveals significant heterogeneity in therapeutic effect between the two subgroups. Specifically, at all quantiles within the biomarker-positive subgroup, the estimated therapeutic effects are positive and the 95% confidence intervals mostly lie above the zero line, indicating that erlotinib has a positive impact on improving QoL scores. Therefore, we believe that with $x_1=\mathrm{AXL}$ fixed, the identified biomarker-positive subgroup is the treatment-sensitive subgroup, in which patients receiving erlotinib plus gemcitabine have significantly improved quality of life scores compared with gemcitabine alone. However, within the biomarker-negative subgroup, the 95% confidence intervals for the therapeutic effect contain zero, indicating that the results are not significant. Therefore, we believe that in this subgroup patients do not benefit more from erlotinib plus gemcitabine than from gemcitabine alone.

Fig. 2   The 95% confidence intervals of the therapeutic effect when fixing $x_1=\mathrm{AXL}$

Moreover, Table 6 presents the same four quantities at the corresponding quantile levels with $x_1=\text{CA19-9}$ fixed. As can be seen from Table 6, the main results of variable selection and subgroup analysis are similar to those in Table 5 with $x_1=\mathrm{AXL}$ fixed.

Similarly, we also define the biomarker-positive and biomarker-negative subgroups according to the linear combination with the fixed biomarker CA19-9. The corresponding estimated therapeutic effects and their 95% confidence intervals at different quantiles are presented in Fig. 3. Figure 3 shows that there is still significant heterogeneity in the therapeutic effect between the two subgroups. Unlike the case with $x_1=\mathrm{AXL}$ fixed, however, when $x_1=\text{CA19-9}$ is fixed the treatment-sensitive subgroup with a significant therapeutic effect is the identified biomarker-negative subgroup: patients in this subgroup who receive erlotinib plus gemcitabine significantly improve their overall quality of life scores, whereas within the biomarker-positive subgroup patients do not benefit more from the combination of erlotinib and gemcitabine. This indicates that the 0-1 values of the subgroup indicator are reversed in the two scenarios, that is, the interaction coefficient $\delta_2$ between treatment and subgroup is negative when $x_1=\text{CA19-9}$ and positive when $x_1=\mathrm{AXL}$. It also explains why the treatment coefficient $\beta_5$ is selected as a significantly related variable when $x_1=\text{CA19-9}$, while it is compressed to 0 when $x_1=\mathrm{AXL}$.

Fig. 3   The 95% confidence intervals of the therapeutic effect when fixing $x_1=\text{CA19-9}$

In summary, combining the results of the subgroup analyses with $x_1=\mathrm{AXL}$ and $x_1=\text{CA19-9}$, we can see that our method identifies treatment-sensitive subgroups. In these subgroups, patients receiving erlotinib plus gemcitabine benefit more in terms of improved QoL scores than those receiving gemcitabine alone, and erlotinib is effective for patients at different QoL levels.

Finally, when fixing $x_1=\mathrm{AXL}$ or $x_1=\text{CA19-9}$ separately, we also want to compare whether the defined subgroups are consistent. Figure 4 visualizes the overlap of the subgroups obtained in the two cases. When $x_1=\mathrm{AXL}$ is fixed, the treatment-sensitive (biomarker-positive) subgroup is defined as group 1 and the treatment-insensitive (biomarker-negative) subgroup as group 2; conversely, when $x_1=\text{CA19-9}$ is fixed, the treatment-sensitive (biomarker-negative) subgroup is defined as group 1 and the treatment-insensitive (biomarker-positive) subgroup as group 2. Figure 4 shows the intersections of group 1 and group 2 across the two scenarios: the orange bars labeled 11 indicate patients assigned to group 1 in both scenarios, the blue bars labeled 22 indicate patients assigned to group 2 in both scenarios, and the labels 12 and 21 represent patients assigned to different subgroups in the two scenarios, uniformly displayed with gray bars.

Fig. 4   Schematic diagram of subgroup overlap for the two different settings

To display clearly the proportion of patients assigned to different subgroups in the two scenarios, we set the maximum count of each subplot to 100 and compare the heights of the orange and gray bars. The overlap of the treatment-sensitive subgroups identified in the two scenarios is very high, indicating that our method of subgroup identification is robust to the choice of $x_1$.

Table 5

Estimation results of parameters when fixing $x_1=\mathrm{AXL}$

Table 6

Estimation results of parameters when fixing $x_1=\text{CA19-9}$

4 Conclusion

This article considers a single-index threshold quantile regression for subgroup analysis and selects both the covariates and the biomarkers that define the subgroups. An efficient algorithm is proposed that smooths the subgroup index function and locally linearly approximates the non-convex penalty. Based on pseudo observations, the estimation problem degenerates into a linear quantile regression, and all unknown parameters are solved iteratively through a two-step estimation process. Numerical simulations demonstrate that the proposed algorithm performs well with a moderate number of variables, and the analysis of real data from the PA.3 trial shows that it can distinguish between treatment-sensitive and treatment-insensitive patient subgroups.

However, our model considers only a single threshold, so patients can only be divided into two subgroups. Multiple thresholds could be introduced to divide subjects into multiple subgroups with different covariate effects; for related studies, please refer to Li et al[28-29]. In addition, the identifiability constraint in our model requires specifying a threshold variable $x_1$ with a non-zero coefficient. In practical applications, this variable can be determined from professional knowledge; when prior knowledge is not available, it may be chosen by extending the score-type specification test of Zhang et al[30] for selecting non-zero threshold variables. Finally, accounting for the within-subject correlation of longitudinal data and conducting variable selection and parameter estimation when $p>n$ are directions for future research.

References

1. Alosh M, Huque M F, Bretz F, et al. Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials[J]. Statistics in Medicine, 2017, 36(8): 1334-1360.
2. Sachdev J C, Sandoval A C, Jahanzeb M. Update on precision medicine in breast cancer[J]. Cancer Treatment and Research, 2019, 178: 45-80.
3. Su X, Tsai C L, Wang H, et al. Subgroup analysis via recursive partitioning[J]. Journal of Machine Learning Research, 2009, 10(2): 141-158.
4. Lipkovich I, Dmitrienko A, Denne J, et al. Subgroup identification based on differential effect search: A recursive partitioning method for establishing response to treatment in patient subpopulations[J]. Statistics in Medicine, 2011, 30(21): 2601-2621.
5. Zhang Z H, Seibold H, Vettore M V, et al. Subgroup identification in clinical trials: An overview of available methods and their implementations with R[J]. Annals of Translational Medicine, 2018, 6(7): 122.
6. Jiang Z Y, Du C G, Jablensky A, et al. Analysis of schizophrenia data using a nonlinear threshold index logistic model[J]. PLoS One, 2014, 9(10): e109454.
7. Fan A L, Song R, Lu W B. Change-plane analysis for subgroup detection and sample size calculation[J]. Journal of the American Statistical Association, 2017, 112(518): 769-778.
8. Vander Weele T J, Luedtke A R, Vander Laan M J, et al. Selecting optimal subgroups for treatment using many covariates[J]. Epidemiology, 2019, 30(3): 334-341.
9. He Y, Lin H Z, Tu D S. A single-index threshold Cox proportional hazard model for identifying a treatment-sensitive subset based on multiple biomarkers[J]. Statistics in Medicine, 2018, 37(23): 3267-3279.
10. Wei K C, Zhu H C, Qin G Y, et al. Multiply robust subgroup analysis based on a single-index threshold linear marginal model for longitudinal data with dropouts[J]. Statistics in Medicine, 2022, 41(15): 2822-2839.
11. Cai Y Z, Stander J. Quantile self-exciting threshold autoregressive time series models[J]. Journal of Time Series Analysis, 2008, 29(1): 186-202.
12. Galvao Jr A F, Montes-Rojas G, Olmo J. Threshold quantile autoregressive models[J]. Journal of Time Series Analysis, 2011, 32(3): 253-267.
13. Lee S, Seo M H, Shin Y. Testing for threshold effects in regression models[J]. Journal of the American Statistical Association, 2011, 106(493): 220-231.
14. Su L J, Xu P. Common threshold in quantile regressions with an application to pricing for reputation[J]. Econometric Reviews, 2019, 38(4): 417-450.
15. Zhang Y Y, Wang H J, Zhu Z Y. Single-index thresholding in quantile regression[J]. Journal of the American Statistical Association, 2022, 117(540): 2222-2237.
16. Tibshirani R. Regression shrinkage and selection via the lasso[J]. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1996, 58(1): 267-288.
17. Tibshirani R, Saunders M, Rosset S, et al. Sparsity and smoothness via the fused lasso[J]. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2005, 67(1): 91-108.
18. Zou H. The adaptive lasso and its oracle properties[J]. Journal of the American Statistical Association, 2006, 101(476): 1418-1429.
19. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso[J]. Biostatistics, 2008, 9(3): 432-441.
20. Fan J Q, Li R Z. Variable selection via nonconcave penalized likelihood and its oracle properties[J]. Journal of the American Statistical Association, 2001, 96(456): 1348-1360.
21. Zhang Y, Lian H, Yu Y. Ultra-high dimensional single-index quantile regression[J]. Journal of Machine Learning Research, 2020, 21(224): 1-25.
22. Ruppert D, Carroll R J. Theory & methods: Spatially-adaptive penalties for spline fitting[J]. Australian & New Zealand Journal of Statistics, 2000, 42(2): 205-223.
23. Horowitz J L. A smoothed maximum score estimator for the binary response model[J]. Econometrica, 1992, 60(3): 505.
24. Seo M H, Linton O. A smoothed least squares estimator for threshold regression models[J]. Journal of Econometrics, 2007, 141(2): 704-735.
25. Zou H, Li R Z. One-step sparse estimates in nonconcave penalized likelihood models[J]. Annals of Statistics, 2008, 36(4): 1509-1533.
26. Moore M J, Goldstein D, Hamm J, et al. Erlotinib plus gemcitabine compared with gemcitabine alone in patients with advanced pancreatic cancer: A phase III trial of the National Cancer Institute of Canada Clinical Trials Group[J]. Journal of Clinical Oncology, 2007, 25(15): 1960-1966.
27. Shultz D B, Pai J, Chiu W, et al. A novel biomarker panel examining response to gemcitabine with or without erlotinib for pancreatic cancer therapy in NCIC clinical trials group PA.3[J]. PLoS One, 2016, 11(1): e0147995.
28. Li J L, Jin B S. Multi-threshold accelerated failure time model[J]. The Annals of Statistics, 2018, 46(6A): 2657-2682.
29. Li J L, Li Y G, Jin B S, et al. Multithreshold change plane model: Estimation theory and applications in subgroup identification[J]. Statistics in Medicine, 2021, 40(15): 3440-3459.
30. Zhang L W, Wang H J, Zhu Z Y. Testing for change points due to a covariate threshold in quantile regression[J]. Statistica Sinica, 2014, 24(4): 1859-1877.
