Once again we have let things slip and once again we crave the indulgence of our loyal readers. It is well over 3 weeks since we last posted. Today we’ll take a look at a review on druglikeness which appeared in a high impact journal. Strictly it is the parent journal that has the high impact factor but we’re not going to get bogged down by that minor detail. This article has already been cited here in connection with categorical sins. Much (mainly white) noise about it has been made where we work.
Actually we’re not going to review the entire article. Those of you who have nothing better to do than read this column will know that we are typically underwhelmed by publications on druglikeness and believe that as a concept it is rather over-rated. Instead we will focus on one particular piece of data analysis that is described in the target publication. Are we being lazy? Read on and you can make your own minds up. You are adults after all.
We’d now like you to take a look at Figure 3a in the featured article. The horizontal axis is ClogP and the vertical axis is promiscuity. Promiscuity? Are the drugs getting up to something naughty about which we shouldn’t be writing in a family friendly blog? Promiscuity in this plot is defined by the number of assays for which at least 30% inhibition is observed at a concentration of 10 micromolar. The plot suggests a strong relationship between promiscuity and lipophilicity, doesn’t it? Well that’s what the authors of the article want you to think but, as loyal and cultured readers of the Crapshoot, you really should know better by now.
Now let’s take a closer look at Figure 3a. First the horizontal axis is not ClogP but Median ClogP. Where did the median come from? A reasonable question and, if you’ll just let us continue, everything will become abundantly clear. Well sort of abundantly clear. The authors appear to have computed the median ClogP for each value of promiscuity. Why have they done this? The quick answer to this very reasonable question is to go take a look at Box S3 in the supplementary information.
The manner in which Figure 3a has been constructed gives it some rather unusual characteristics. Most importantly each value of promiscuity is represented by a single point regardless of the number of drugs with that value of promiscuity. This distorts the original data by emphasizing the tails of the distribution and we think it’s a rather naughty thing to do. Plotting the data as the authors have done displays the underlying trend in the data while hiding the variation in ClogP for individual values of the promiscuity. This makes the trend easier to see but prevents us from knowing how strong it really is.
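To make the construction concrete, here is a minimal sketch of how a plot like this could be put together. The numbers are invented toy values (not the data from the article) and the helper name `median_summary` is our own. The point is simply that the group size and the spread of ClogP within each group, both of which we compute here, never make it onto a plot of the medians.

```python
from collections import defaultdict
from statistics import median

def median_summary(clogp, promiscuity):
    """Collapse (ClogP, promiscuity) pairs to one median ClogP per promiscuity value."""
    groups = defaultdict(list)
    for x, p in zip(clogp, promiscuity):
        groups[p].append(x)
    # Each row: (promiscuity, median ClogP, number of compounds, ClogP range)
    return [
        (p, median(xs), len(xs), max(xs) - min(xs))
        for p, xs in sorted(groups.items())
    ]

# Invented toy data: most compounds hit few assays, very few hit many.
clogp       = [1.0, 4.5, 2.0, 3.0, 0.5, 2.5, 5.0, 3.5, 4.0, 5.5]
promiscuity = [0,   0,   0,   0,   0,   1,   1,   1,   2,   3]

summary = median_summary(clogp, promiscuity)
for p, med, n, spread in summary:
    print(f"promiscuity={p}  median ClogP={med}  n={n}  ClogP range={spread}")
```

In this toy set the point for promiscuity 0 stands in for five compounds spanning four log units, while the points for promiscuity 2 and 3 each stand in for a single compound, yet all four points would carry equal weight in the plot.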
One common approach to quantifying the strength of the relationship between two properties is to fit one to the other using regression. Typically one starts by assuming a linear relationship but other functional forms (e.g. polynomial) are used if the plot suggests non-linearity. One measure of the quality of fit is r-squared, which is the proportion of the variance in the dependent variable that is explained by the regression model. The r-squared ranges in value from 0 (no fit) to 1 (perfect fit).
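For readers who like to see the arithmetic, the following is a minimal sketch of r-squared for a straight-line fit, written in plain Python with no libraries. The function names and the example numbers are ours and chosen only for illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y):
    """Proportion of the variance in y explained by the linear fit."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

x = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = [2.0 * xi + 1.0 for xi in x]   # points lie exactly on a line
noisy = [3.0, 1.0, 4.0, 2.0, 5.0]        # same x values, scattered y values

print(r_squared(x, perfect))  # 1.0
print(r_squared(x, noisy))    # 0.25
```

For a straight-line fit with a single descriptor, r-squared is just the square of the correlation coefficient r, so the scattered example above has r of 0.5 and r-squared of 0.25.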
Now let’s go back to Figure 3. It appears the authors have done the linear regression on the summary data shown in Figure 3a rather than the full set of original data. They quote an r value of 0.83 which corresponds to an r-squared of 0.69. It’s a good time to take another look at Box S3 in the supplementary material. The data from which the summary shown in Figure 3a was generated is distributed between two plots, one for acids and bases and the other for neutrals, quaternary bases and zwitterions. We were a little curious about how the ClogP values were derived for the quaternary bases and why the authors decided to group the charge types as they did. However that is not a path that we wish to go down right now and we’ll not make further mention of these concerns. The plots show that promiscuity will be low when ClogP is very low. However maintaining potency when ClogP is that low is simply not going to be an option for many targets and you’re going to run into permeability problems if you drop ClogP too far. The question we’d like to pose to you, our loyal readers, is whether you’d expect an r-squared value of 0.69 for either of the two plots in Box S3.
Let’s pause for a moment to review what we’ve learned. Firstly, quote r rather than r-squared because the latter can never exceed the former in magnitude and your less alert readers may not even notice. Secondly, and more importantly, averaging (in this case taking the median of) one variable over each of the categories of the other is likely to give you an optimistic view of the strength of the underlying relationship. This is the basis of Categorical Sin and, to help convince you of the fundamental sinfulness of analyzing data in this manner, consider the situation in which there are only two categories of promiscuity (yes or no). Now suppose the median ClogP values are different for the two categories. What do you expect r-squared to be? Everybody get 1? It really is an honor to write for such clever, cultured readers.
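If you’d rather not take our word for it, here is a small sketch of both lessons on invented data. None of these numbers come from the article: we simply build five promiscuity levels whose ClogP values overlap heavily, and the names `pearson_r`, `offsets` and so on are ours. The correlation is computed once on the individual compounds and once on the per-level medians.

```python
from collections import defaultdict
from math import sqrt
from statistics import median

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented data: ClogP drifts up by 0.5 per promiscuity level, but within
# each level it is spread over six log units, so the levels overlap heavily.
offsets = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
clogp, promiscuity = [], []
for p in range(5):
    for d in offsets:
        clogp.append(0.5 * p + d)
        promiscuity.append(p)

# Collapse to one median ClogP per promiscuity level.
groups = defaultdict(list)
for x, p in zip(clogp, promiscuity):
    groups[p].append(x)
levels = sorted(groups)
medians = [median(groups[p]) for p in levels]

r_raw = pearson_r(clogp, promiscuity)
r_summary = pearson_r(medians, levels)
print(f"raw data: r = {r_raw:.2f}, r-squared = {r_raw ** 2:.2f}")
print(f"medians:  r = {r_summary:.2f}, r-squared = {r_summary ** 2:.2f}")

# Two categories with different medians: two points always define a line.
r_two = pearson_r([2.0, 3.5], [0, 1])
print(f"two categories: r-squared = {r_two ** 2:.2f}")
```

On this toy set the individual compounds give r of 0.33 and r-squared of 0.11, the medians give r-squared of 1.00, and the two-category case gives 1.00 as well. Same data, very different impressions.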
Sadly this is not the only example of Categorical Sin that we have encountered in the peer-reviewed literature (see 1, 2). Why do the reviewers not pick these things up? It is for journal editors to fret over and it would be grossly unfair to speculate about possible family connections with the unfortunate rogue trader who famously lost his Barings in the city state of Singapore.
Friday, May 16, 2008
Breaking stone in Changi
Labels: az, categorical sin, data analysis, literature reviews, nrdd, oral drugs, stamp collecting