Start with what I said in my review of Robert Plomin’s Blueprint.
Plomin is excited by polygenic scores, a recent development in genetic studies. Researchers use large databases of DNA-sequenced individuals to identify combinations of hundreds of genes that correlate with traits.
The most predictive polygenic score so far is for height, which explains 17 percent of the variance in adult height… height at birth scarcely predicts adult height. The predictive power of polygenic scores is greater than that of any other predictor, even the height of the individuals’ parents.
One can view this 17 percent figure either as encouraging or not. It represents progress over attempts to find one or two genes that predict height, an effort that is futile. But compared to the 80 percent heritability of height it seems weak.
Plomin is optimistic that with larger sample sizes better polygenic scores will be found, but I am skeptical.
My question, to which I do not have the answer, is this: if height is 80 percent heritable, why is the statistical correlation found between genes and height only 17 percent?
I do not know any biology. But as a statistician, here is how I would go about developing a polygenic score.
1. I would work with one gender at a time. Assume we have a sample of 100,000 adults of one gender, with measurements of height and DNA sequences. I would throw out the middle 80,000 and just work with the top and bottom deciles.
2. For every gene, sum up the total number in the top decile with that gene and the total number in the bottom decile with that gene, and see where the differences are the greatest. If 8500 in the top decile have a particular gene and 1200 in the bottom decile have the gene, that is a huge difference. 7500 and 7200 would be a small difference. Take the 100 largest differences and build a score that is a weighted average of the presence of those genes.
3. To try to improve the score, see whether adding the gene with the 101st largest difference improves predictive power. My guess is that it won’t.
4. Also to try to improve the score, see whether adding two-gene interactions helps the score. That is, does having gene 1 and gene 2 make a difference other than what you would expect from having each of those genes separately? My guess is that some of these two-gene interactions will prove significant, but not many.
It seems to me that one should be able to extract most of the heritability from the data by doing this. But perhaps this approach is not truly applicable.
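A minimal sketch of steps 1 and 2 on simulated data follows. Everything in it is an assumption made for illustration: the sample size, the number of variants, the 0/1 coding of "has the gene", the simulated 80 percent heritability, and the choice to use the raw carrier-count difference as the weight. It is not how published polygenic scores are built.

```python
import numpy as np

def decile_score_weights(genotypes, height, n_top=100):
    """Keep the top and bottom height deciles and weight the n_top
    variants whose carrier counts differ most between the two groups."""
    lo_cut, hi_cut = np.quantile(height, [0.1, 0.9])
    top = genotypes[height >= hi_cut]
    bottom = genotypes[height <= lo_cut]
    # Carriers in the top decile minus carriers in the bottom decile
    diff = top.sum(axis=0) - bottom.sum(axis=0)
    chosen = np.argsort(-np.abs(diff))[:n_top]
    weights = np.zeros(genotypes.shape[1])
    weights[chosen] = diff[chosen]
    return weights

# Simulated data (hypothetical numbers): 50 causal variants out of 500
rng = np.random.default_rng(0)
n_people, n_variants, n_causal = 20_000, 500, 50
genotypes = rng.binomial(1, 0.3, size=(n_people, n_variants))
effects = np.zeros(n_variants)
effects[:n_causal] = rng.normal(0, 1, n_causal)
genetic = genotypes @ effects
# Noise variance is a quarter of the genetic variance, so heritability is 0.8
noise = rng.normal(0, np.sqrt(genetic.var() * 0.25), n_people)
height = genetic + noise

# Build the score on one half of the sample, evaluate on the other half
train = np.arange(n_people) < n_people // 2
weights = decile_score_weights(genotypes[train], height[train], n_top=100)
score = genotypes[~train] @ weights
r = np.corrcoef(score, height[~train])[0, 1]
variance_explained = r ** 2
print(round(variance_explained, 2))
```

The simulation is far kinder than real data: variants are independent of each other, effects are purely additive, and there are only 500 candidates rather than millions.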
Another possibility is that heritability comes from factors other than DNA. Perhaps the reliance on twin studies to try to separate environmental factors from genetic factors is flawed, and the heritability of height comes in large part from environmental factors. Or perhaps DNA is not the only biological force affecting heritability, and we need to start looking for that other force.
Another possibility is that scientists are working with much smaller sample sizes. If you have a sample of one thousand, then the top decile just has one hundred cases in it, and that is not enough to pick out the important DNA differences.
As a related possibility, the effective sample sizes might be small, because of a lot of duplication. Suppose that the top decile in your sample had mostly Scandinavians, and the bottom decile had mostly Mexicans. Your score will be good at separating Scandinavians from Mexicans, but it will be of little use in predicting heights within a group of Russians or Greeks or Kenyans or Scots.
I am just throwing out wild guesses about why polygenic scores do not work very well. I probably misunderstand the problem. I wish that someone could explain it to me.
It’s not just genes. Even when it is genes, it’s not the presence or absence of a gene but rather the presence of a specific variant of a gene, version 1 vs version 1.0001. That said, we are just scratching the surface of the operation of gene regulatory networks. Non-coding RNAs seem to play a large role in regulation, and they are not understood at all, despite being pervasive. To get a flavor of this, have a look at the introductory material in Isabelle Peter and Eric Davidson’s Genomic Control Process: Development and Evolution (2015). Most of it is way technical, but you will get a good sense of what’s involved. And they don’t even get into non-coding RNAs that much.
Another possibility is that Kling, Plomin, Charles Murray, Handle, and others are beholden to a False Meme about how genes work.
Genes are transcribed to mRNA, mRNA is translated to protein. mRNA may be translated multiple ways. Genes are made up of about 10,000 to 15,000 DNA bases. A single person’s individual gene can have many differences from another person’s individual gene (your DNA fingerprint that makes you unique). Some of these differences do not matter (multiple 3 base DNA sequences can encode the same amino acid). Some will alter the amino acid encoded (specifically the 3D structure – altering how it binds to stuff) – this can change how the protein functions. Some changes will make the protein unusable – generally bad. Changes in protein function mean the protein may be more likely to interact with other proteins, or less likely – this affects the cellular pathways (up regulated or down regulated). There is now significant evidence that what was thought to be “junk” DNA is transcribed and has a function. There is also a great deal of redundancy: if changing a single protein significantly modified a critical pathway, you would be dead. Maybe you have one gene that results in the down regulation of a critical pathway for height and two more that up regulate it. That is just a quick overview, but there is a lot of noise and a weak signal with many causes (some not even genetic, like nutrition; there is “heritability” of nutrition – do you eat similarly to your parents, and your children to you – and you can say the same of exercise and many others). Ultimately you are looking at a result and finding a reason for that result – just association. Finally, the “reason” (genetics) is not even close to being completely understood.
That doesn’t explain the 80% heritability very well.
My claim is simple: that prediction based on the models derived from polygenic analysis will converge with heritability estimates.
Phenotypic traits are highly heritable when (1) they are mostly due to genes / nature, and (2) those genes are inherited.
The explanation for why the factors are difficult to pin down early in the effort is clear – there are lots of important genes and interactions and lots of variants – some very rare so hard to compare – and it takes a huge number of high quality genomes and a ton of processing power to make good models. But what is the explanation for why heritability and predictability should remain diverged over the long term?
Look at Dr. Stephen Hsu’s “Accurate Genomic Prediction of Human Height”, using the 500k sample set of the UKbiobank cohort, from over two years ago. “Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits” (the first trait is height) “For example, predicted heights correlate ~0.65 with actual height.”
Ok, that does not line up at all with that 17% figure which claims to be the “most predictive”. How does this adjust your priors and skepticism now? And that was over two years ago, and things have gotten better.
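When comparing these figures, note that a correlation and a share of variance explained are on different scales: the second is the square of the first. The short calculation below applies that to the numbers quoted above; the function names are only illustrative.

```python
def variance_explained(r):
    """Share of variance explained by a predictor that correlates r with the trait."""
    return r ** 2

def correlation(r2):
    """Correlation implied by a given share of variance explained."""
    return r2 ** 0.5

hsu_r2 = variance_explained(0.65)   # about 0.42, in line with the quoted "~40 percent"
plomin_r = correlation(0.17)        # about 0.41, the correlation implied by 17 percent
print(round(hsu_r2, 2), round(plomin_r, 2))
```

So the two numbers inside the Hsu quote agree with each other, and the gap with the 17 percent figure is real rather than an artifact of units.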
See Wainschtein’s “Recovery of trait heritability from whole genome sequence data” from March 2019.
“we assigned 47.1 million genetic variants to groups based upon their minor allele frequencies (MAF) and linkage disequilibrium (LD) with variants nearby, and estimated and partitioned variation accordingly. The estimated heritability was 0.79 (SE 0.09) for height and 0.40 (SE 0.09) for BMI, consistent with pedigree estimates.”
So, in a short amount of time, we’ve gone from “17%” as “most predictive”, to another study saying 40%, to a new one getting close to the heritability range.
I think results like these should shift your judgment into a more genomics-optimistic camp.
A big issue with 1 and 2 is that variants (which version of a gene you have; everyone has every gene, but they vary at certain positions from person to person, so we call each version a ‘variant’) are often highly correlated with one another, and this is especially likely to be true if genes are close to each other (I won’t get into the biology of why, but it’s called ‘linkage’ if you wish to read Wikipedia about it). If a ‘tall variant’ and a ‘short variant’ are correlated with each other, then you’ll underestimate the magnitude of both. This is also a reason not to ‘throw out the middle.’ It would be a mistake to assume average height people have ‘middling’ variants; many, rather, may have many tall variants and many short variants that cancel each other out.
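The masking effect of correlated variants can be seen in a small simulation. This is a sketch under made-up assumptions: two 0/1 variants with effects of +1 and -1, where the second matches the first 90 percent of the time, plus unit-variance noise. Looking at each variant on its own gives an effect near zero, while fitting both together recovers the true sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Hypothetical linked pair: 'short' copies 'tall' 90% of the time
tall = rng.binomial(1, 0.5, n)
copied = rng.random(n) < 0.9
short = np.where(copied, tall, rng.binomial(1, 0.5, n))
height = 1.0 * tall - 1.0 * short + rng.normal(0, 1, n)

def marginal_effect(variant, trait):
    """Mean trait difference between carriers and non-carriers,
    ignoring every other variant."""
    return trait[variant == 1].mean() - trait[variant == 0].mean()

tall_alone = marginal_effect(tall, height)     # far below the true +1
short_alone = marginal_effect(short, height)   # far above the true -1

# Joint least-squares fit on both variants at once
X = np.column_stack([np.ones(n), tall, short])
joint = np.linalg.lstsq(X, height, rcond=None)[0]
print(round(tall_alone, 2), round(short_alone, 2), np.round(joint[1:], 2))
```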
You’re right about 4 at first, but probably wrong that there won’t be many interaction effects. There are many, and what’s more, probably even a decent number of 3 or 4 gene interaction effects. Genes often (actually, usually) function in cascades or networks. Gene A activates gene B activates gene C activates gene D, for example. If the phenotype is determined by the final product, gene D, then knocking out any 1 of them will have the same effect as knocking out any combination of 2, 3, or all 4. On the other hand, a mutation that increases activity by 100% in gene A or gene B individually will increase gene D activity by 100%; but if you have both mutations, gene D (and your phenotype) increases not by 200% but by 300% because the effects are multiplicative rather than additive.
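The arithmetic of the cascade example can be checked directly. The toy model below assumes, as the example does, that gene D's activity is simply the product of the upstream multipliers; real pathways are not this tidy.

```python
def gene_d_activity(a_multiplier=1.0, b_multiplier=1.0, baseline=1.0):
    """Toy cascade: D's activity scales with the product of upstream activities."""
    return baseline * a_multiplier * b_multiplier

def percent_increase(new, old=1.0):
    return 100.0 * (new - old) / old

only_a = percent_increase(gene_d_activity(a_multiplier=2.0))   # 100.0
only_b = percent_increase(gene_d_activity(b_multiplier=2.0))   # 100.0
both = percent_increase(gene_d_activity(2.0, 2.0))             # 300.0
additive_guess = only_a + only_b                               # 200.0
print(only_a, only_b, both, additive_guess)
```

A score that only adds per-variant weights would predict the 200% figure and miss the extra 100%.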
There are undoubtedly many high dimensional patterns in the relationship between genotype and phenotype. This is problematic: even when there surely are high dimensional patterns, when you look for such patterns, you’re very likely to find ones that don’t exist as well (see all the machine learning findings of important patterns that nonetheless fail to replicate). And of course, once we start to take into account 2nd or even 3rd order interaction effects, the necessary sample size balloons to prohibitively large numbers.
For a good visual illustration of why the complexity of the system makes higher order patterns difficult to determine, this is what a gene signalling pathway looks like (this one is important in cancer): https://www.cusabio.com/statics/images/pathway/MAPK-signaling-pathway-picture.png
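To get a feel for how quickly the search space grows, one can count the candidate interaction terms. The figure of one million variants is an assumed round number, not taken from any particular study.

```python
from math import comb

def interaction_terms(n_variants, order):
    """Number of distinct sets of `order` variants that could interact."""
    return comb(n_variants, order)

n = 1_000_000  # assumed round number of measured variants
pairs = interaction_terms(n, 2)     # 499,999,500,000 possible two-way terms
triples = interaction_terms(n, 3)   # on the order of 10**17 three-way terms
print(f"{pairs:,} pairs, {triples:.2e} triples")
```

Each of those terms is another hypothesis to test, which is why both the false positive problem and the sample size requirement get so much worse.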
The best polygenic scores for height explain around 40% of the phenotypic variance.
By analyzing men and women separately you would severely impede your ability to find causal variants. The very same genetic variants explain height variation in men and women. In a regression of height on height genes and sex, the intercepts differ (women are shorter on average) but the slopes are very similar. So, to maximize statistical power, you should partial out the effect of sex (by z-scoring within sexes, for example) and analyze everyone together.
Similarly, discarding the middle portion of the height distribution would lead to much lower statistical power. The same genetic variants explain height differences in those of average height as in those who are very short or very tall, so it makes no sense to throw data away in the way you propose (there are some very rare variants causing dwarfism and gigantism, but those contribute almost nothing to overall population variation).
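Z-scoring within sexes is a small operation. A minimal sketch, with made-up heights and a hypothetical helper name, might look like this:

```python
import numpy as np

def zscore_within_groups(values, groups):
    """Subtract each group's mean and divide by its standard deviation,
    so the group-level difference (here, sex) is removed."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = (values[mask] - values[mask].mean()) / values[mask].std()
    return out

# Made-up example data
heights_cm = [178.0, 182.0, 170.0, 165.0, 160.0, 170.0]
sex = ["M", "M", "M", "F", "F", "F"]
z = zscore_within_groups(heights_cm, sex)
print(np.round(z, 2))
```

After this step the men and women can be pooled into one sample, since each person's value now says only how tall they are relative to their own sex.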
There are something like 50 million places in the human genome where people differ from each other. Perhaps a few thousand, or less than 0.01%, of those genetic differences affect height and each of them alone has only a tiny effect. The problem for geneticists is to pinpoint true positive needles (causal variants) in the haystack of false positives. Extremely large samples are needed for sufficient statistical power and even then the estimates can be biased by population stratification (environmental confounding).
Furthermore, the DNA chips used in today’s genome-wide association studies contain a few million variants at most, so these studies cannot even in principle recover the full heritability which is strongly influenced by very rare variants (you need whole-genome sequencing to observe all variants). The Lello et al. study linked to above found pretty much all of the genetic variance for height that the DNA chips available to them can capture; the rest is explained by rare variants they did not observe. The sample size they needed was about 500,000 people, which is much less than is needed for most other complex phenotypes, given that most of them are more polygenic than height, with smaller effect sizes per variant.
But as a statistician, here is how I would go about developing a polygenic score.
I’d also look at genes specific to pygmies, since they’re probably knocked-out versions of some common genes for height.
I think it may be that they are going to have to get much more sophisticated in the analysis in terms of interaction effects. Height involves a broad collection of things, from hormones to bone cell replication factors, that are considered independently but all interact with each other. It is possible that some subset of genetic material has a positive effect on height in some people and a negative effect in others. It may be one of the most important factors, but largely cancels out in the broad statistical analysis, because it can have a positive or negative effect, depending on the particular permutation of genes in an individual.
If I look at the math behind the basic score, it’s a summation of weights times values; very little interaction is generally considered. It does remind one a bit of the first layer of a neural network model. If that second layer or third layer could be added to the calculation, it may reveal which combinations of values lead to higher scores, and you could learn the weightings via backpropagation. By only considering the impact of each SNP individually, much of the story is being missed.
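For concreteness, here is what a weighted-sum score looks like, next to the same score with a single explicit two-variant interaction term added. The genotypes, weights, and function names are all invented for illustration, and the interaction term is a much simpler stand-in for the extra network layers described above.

```python
import numpy as np

def linear_score(genotypes, weights):
    """Basic score: for each person, the sum of weight times allele count."""
    return genotypes @ weights

def score_with_pair(genotypes, weights, i, j, pair_weight):
    """Same score plus one interaction term between variants i and j."""
    return genotypes @ weights + pair_weight * genotypes[:, i] * genotypes[:, j]

# Rows are people, columns are variants, entries are allele counts (0, 1, or 2)
genotypes = np.array([[0, 1, 2],
                      [1, 1, 0],
                      [2, 0, 1]])
weights = np.array([0.5, -0.2, 0.1])

base = linear_score(genotypes, weights)
with_pair = score_with_pair(genotypes, weights, i=0, j=1, pair_weight=0.4)
print(base, with_pair)
```

Only the second person carries both variant 0 and variant 1, so only that person's score changes when the interaction term is added.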
Read Stephen Hsu’s blog.
https://infoproc.blogspot.com/
He’s done some original work in this field on height from genetics and also explains the state of the field very well. Large datasets needed. Better predictors are non-linear with the size of dataset. There’s a predictor of the size of dataset needed to get better results.
Accurate Genomic Prediction of Human Height
Louis Lello et al. (including Stephen Hsu) 2017
Determination of Nonlinear Genetic Architecture using Compressed Sensing
Chiu Man Ho & Stephen D. H. Hsu 2015