At the AEI, Dalton Conley commented on Charles Murray’s new book. At minute 30, Conley starts to discuss polygenic scores. At around minute 35, he points out that the polygenic score for height, which seems to do much better than polygenic scores for other traits, still does a terrible job. The score, which has been based primarily on data from Europeans, under-predicts heights of Africans by 6 inches.
As you know, I am a skeptic on polygenic scores. The exercise reminds me too much of macroeconomic modeling. Economic history did not design the types of experiments that we need in order to gauge the effect of fiscal and monetary policy. What we want are lots of time periods in which very little changed other than fiscal and monetary policy. But we don’t have that. And as you increase the sample size by, say, going back in time and adding older decades to your data set, you add all sorts of new potential causal variables. Go back 70 years and fluctuations are centered in steel and automobiles. Go back 150 years and they are centered in the farm sector.
Similarly, evolution did not design the types of experiments that we need in order to gauge the effect of genes on traits. That is, it didn’t take random samples of people from different geographic locations and different cultures and assign them the same genetic variation, so that a statistician could neatly separate the effect of genes from that of location or culture.
If I understand Conley correctly, he suggests looking at genetic variation within families. I am not sure what advantage that offers that is not outweighed by the disadvantage of reducing the likely range of genetic combinations you can observe.
As I read through Jamie A. Davies’ book “Life Unfolding,” I am even more convinced that we are making a mistake equating nature with the deterministic parts of genetics and ignoring the much larger “emergent” component captured in molecular/cellular biology.
I think there is an important case for comparing an individual’s genome against at least their genetic parents’ genomes and sometimes even their grandparents’: to detect mutations that differ from the common alleles in the population.
Our genome is an incredibly useful tool but it is a building block for life, not the causative source of truth. Our genome represents twenty thousand enzymes, nothing more. The predictive power of enzymes is very limited given our current ignorance about the mechanics of emergent order and self-assembly. The genome represents a treasure trove of statistical correlation since the history of evolution is captured in its sequence, but its sequence holds little causative information. Mitochondrial DNA has been instrumental in statistical analysis but I’m confident that its causative contribution is nil. Lynn Margulis is another non-consensus scientist we should celebrate.
There is more information in the conversation RogerSweeny and I had the last time Kling posted about polygenic scores.
Well done. I salute your effort in indexing previous conversations.
Once in a while, the comments here are positively splendid. But it’s hard to find them later. More grease to your elbow.
Technically, the light grey timestamp attached to every comment is a link that appends a unique #fragment-id to the post link.
I don’t know if Google search is more useful now that this site supports secure TLS certificates. If my elbows truly had grease, that is something I would have checked already.
Now that I’ve watched the AEI video I can safely say that Dalton Conley is not promoting the type of intra-family genetic analysis that I’m referring to in the quote. He means that new sibling/twin/adoption studies should be done to tease out how different alleles contribute to the factors psychologists measure.
This is part of a larger question as lots of predictive models emerge from the oceans of data we are collecting. We need a very big conversation to take place on the ethics of applying prediction (as opposed to actuarial measures), and how dangerous it is. The skepticism expressed here is primarily over the quality of this class of predictive models. It should also be about whether they are a good idea at all, regardless of their quality.
Why would accurate predictive models be a bad idea?
In my mind, “quality” and “predictive” are two words describing the exact same characteristic of a model. Given a specific dataset, you can create any number of models that fit the historical data. The predictive power depends on whether this model works equally well with new datasets and/or scenarios.
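A minimal sketch of that point in Python, with synthetic data (the dataset, the train/test split, and the polynomial degrees are all invented for illustration): two models fit the same history, but only the held-out data reveals which one actually predicts.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, size=x.shape)    # underlying process: linear plus noise

train, test = slice(0, 12), slice(12, 20)       # earlier points as "history", later points as "new data"
for degree in (1, 9):                           # a simple model vs. a heavily overfit one
    model = np.poly1d(np.polyfit(x[train], y[train], degree))
    in_sample  = np.mean((model(x[train]) - y[train]) ** 2)
    out_sample = np.mean((model(x[test])  - y[test])  ** 2)
    print(f"degree {degree}: in-sample MSE {in_sample:.3f}, out-of-sample MSE {out_sample:.3f}")

The degree-9 polynomial hugs the historical points and then falls apart on the new range, which is exactly the gap between fitting historical data and predicting.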
Absent knowledge of the biomechanical mechanism associated with a gene variant, I am very skeptical of claims about this gene variant or that being associated with a particular trait or variation in a trait. I am not even sure that a gene variant works the same from one human subpopulation to the next, as I have read reports of the so-called “warrior” gene variant being associated with higher violence in Northern Europeans and having no effect in other subpopulations. My confidence in the correctness of any of these correlation findings is very low.
There are meaningful genetic differences between populations? I thought Richard Lewontin said that was racist nonsense.
Based on my crude knowledge, Richard Lewontin was brilliant, annoying, very sure of himself, but not to be relied upon on this issue.
Greg Cochran’s take here:
https://westhunt.wordpress.com/2015/01/20/lewontin-wins-the-craaford-prize/
P.S.: Most of what I know about Lewontin is the particular assertion provided by Roger Sweeny, which is repeated to this day.
That and the fact that Robert L. Trivers seems to have hated him. Trivers and Edward O. Wilson were on one side of the Sociobiology discussion, for lack of a better term. Stephen J. Gould and Richard Lewontin were on the other side. Wikipedia has details.
See Wilson’s memoirs, or Trivers’ memoirs. See also Trivers’ reprinted papers with introductions and post hoc comments by Trivers, here.
Trivers, R. L. (2002) Natural Selection and Social Theory: Selected Papers of Robert L. Trivers. (Evolution and Cognition Series) Oxford University Press, Oxford. ISBN 0-19-513062-6
Polygenic scores are like many things in life: interesting, probably mean nothing, but may mean something. It is basically a genetic association study (coffee makes you live longer… no wait, with this group its use was associated with a shorter life… no wait, this group had a longer life span… no…)
I have never worked with polygenic scores, but I have been involved in GWAS studies of several complex diseases. As an example, asthma is definitely hereditary, but there is no single causal gene. So you sequence a cohort (and there are large cohorts that include data from many extended families) and look for differences between the cases and controls. For example, look for the cases being homozygous for a SNP (Single Nucleotide Polymorphism) and the controls being heterozygous for the same SNP. There is definitely something to be learned from this exercise and the answer is in there somewhere. The problem is that the DNA sequence is just one part: not all your DNA is transcribed, the RNA may be modified after transcription, the RNA is translated, the protein then has to act but may be blocked, and on and on.
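As a rough sketch of that case/control comparison, in Python with made-up counts for a single hypothetical SNP (a real GWAS tests millions of SNPs and has to correct for multiple testing, population stratification, relatedness, and so on):

# How many copies of the risk allele (0, 1, or 2) each person carries.
cases    = {0: 120, 1: 260, 2: 220}   # hypothetical asthma cases
controls = {0: 310, 1: 240, 2: 50}    # hypothetical healthy controls

def allele_counts(genotypes):
    """Collapse genotype counts into (risk-allele copies, other-allele copies)."""
    risk  = genotypes[1] + 2 * genotypes[2]
    other = genotypes[1] + 2 * genotypes[0]
    return risk, other

case_risk, case_other = allele_counts(cases)
ctrl_risk, ctrl_other = allele_counts(controls)

# Allelic odds ratio: how much more common is the risk allele among cases?
odds_ratio = (case_risk / case_other) / (ctrl_risk / ctrl_other)
print(f"risk allele in cases: {case_risk}/{case_risk + case_other}, "
      f"in controls: {ctrl_risk}/{ctrl_risk + ctrl_other}, OR = {odds_ratio:.2f}")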
There is definitely something there, and GWAS (or a polygenic score) may point in an interesting direction but it probably means nothing – until you are lucky.
Because they would inevitably deter and discourage unpredictable outcomes. We would eventually take the shape of our predictions.
That last comment was in response to RogerSweeny – “Why would accurate predictive models be a bad idea?”
I don’t understand what you mean. Polygenic scores are derived from people’s genomes. Your genome is the outcome of your parents’ sperm/egg merger event. Polygenic scores happen after that happens. Maybe I just don’t know what you mean by “outcome”.
Really?
Polygenic scores are used to predict traits, which is what I mean by outcomes.
Widespread use of such calculations will cause it to inevitably become better to be predictably intelligent than to be unpredictably intelligent. It would create a counter-evolutionary pressure on society. That’s bad.
I asked why accurate predictive models would be a bad idea and you say it will “become better to be predictably intelligent than to be unpredictably intelligent.”
1) If the models are accurate predictors, there won’t be many “unpredictably intelligent”.
2) Why would people care about someone’s polygenic score if they can just give that person an IQ test? Why would the “unpredictably intelligent” have a bad time?
I gather that you’re thinking there will be a substantial number of people who are smart but have low polygenic scores. They will then be kept down and will get fewer of their genes into the next generation, which will make the next generation stupider (that is how I interpret “would create a counter-evolutionary pressure on society.”).
That seems unlikely to me.
I suppose we could ponder the possibility that polygenic scores would just become so accurate that there would be nothing to worry about, or that they just wouldn’t be used instead of actuarial data. But neither of those two possibilities is in the spirit of this discussion.
Societies change, and the traits that are valued at any given moment change too. We need unpredictable adaptations. They are important. We don’t want algorithms defining what works. We want success to emerge from trial and error.
Success can only emerge through trial and error over long time periods with high selection pressures.
Neither of these is relevant in the modern era. There are no strong selection pressures (welfare state, generally high level of abundance). The environment can change far faster than evolution can adapt to it.
asdf-
If you are viewing success as the widespread evolutionary genetic adaptation of the species, then yes, you are correct. But people are matched to tasks every day. The benefits of allowing for unexpected emergent traits are immediate.
Or perhaps you’re afraid that polygenic scores will be used for selective abortion, and that their availability will make selective abortion more acceptable? Well before thirteen weeks, the polygenic scores come back, “IQ: 100, height 5’6″, shy. Not what we were looking for. Make an appointment at Planned Parenthood.”
We can only hope. There are only four possible outcomes.
1) The superior outbreed the inferior.
2) The inferior are selected against in utero.
3) The inferior are selected against post utero.
4) The inferior outbreed the superior, civilization collapses.
#1 is a favorite of mine, though it would require high levels of fertility amongst the superior. They appear unwilling to commit to that at this time, and there are some that worry the natural resources of the planet can’t support such indefinite growth.
#2 seems far more humane than #3 or #4. It is clearly the path of least resistance if you want to avoid #3 or #4. So much so that I think anyone worried about #3 or #4 should be devoting all possible effort into making #2 a reality.
We can only hope? Given historical evidence, eugenic fantasies are at minimum a sign of an inferior mind if not a clinical disorder. Good luck with that.
Blah blah blah.
If my hope is a fantasy…god help us all.
As a side note, Dalton Conley mentions the Burakumin caste in Japan together with other self-identifying ethno-linguistic groups in the former Yugoslavia and the Hutus and Tutsis in Rwanda. I’ve never heard of this group/class before.
Burakumin got a couple of mentions in James Clavell’s SHOGUN. The Japanese equivalent of “untouchables.” I’d guess it’s not that unusual to find bottom castes in human societies — think of Hispanic farm workers in American rural areas.
Not sure that the facts support you on rural Hispanics. Pew reports that rural Hispanics are more likely than other Hispanics to intermarry with whites, that white-Hispanic marriages are the most common intermarriage pairings, and that increasing numbers of Hispanics are finding opportunity in the rural Midwest. Moreover, rural Hispanics enjoy longer life expectancy than rural whites. This is not what I think of when I think of an untouchable caste.
Polygenic scores still sound like a three-body problem: multiple genes interacting with each other, physical factors such as diet and environment, and social factors.
The problem is that genes interact with each other in ways we don’t even understand very well, so a polygenic score is going to miss all of that information. You being 5 foot 2 could be the result of dozens or more genes interacting with each other, including genes that don’t seem to have anything at all to do with bone length and volume, or with bones at all, and interactions that played out during the first stages of cell differentiation.
We will eventually figure more of this out, and polygenic scores will improve and improve. I remember, when the human genome was first fully decoded, the shock at how few genes there actually were and at the similarity of that genome across vastly different species. The answer, of course, is in the vastly different matrices of interactions that arise from just those small differences.
I’m experiencing an awful sense of ironic deja vu. I’ve spent the last 20 years politely explaining The Bell Curve (IQ matters and is heritable), The Nurture Assumption (Nature:Nurture is 50:50 but Nurture==0), and The Blank Slate (progressives are on a Nurture-Only Crusade) to anyone who emphasizes the opposite. I was met with either blank stares or eye rolling and my friends would politely explain that it was my “weird politics”.
Now Charles Murray has published Human Diversity and the emphasis has switched (this is the irony) to what I see as a mistaken but consensus narrative about genes that is being promoted by none other than Charles Murray himself.
This is a plea to Yancey or any of the smart and eloquent commenters on this blog to try to simplify and clarify the Genome-is-20K-Enzymes-Nothing-More(for-now) and Biological-Emergence-is-THE-FUTURE messages from my first comment above (or explain why I should shut up). I’m terrible at communicating. Don’t let me suffer another two decades. Please help me Obi Wan(s).
Let’s say some trait is 100% determined by a simple, non-interactive, linear combination of effects from 30 genes with three common variants each. If you aren’t simplifying or taking short cuts, such that your model is the true model with no error, by merely observing every possibility and compiling a giant table mapping genes to trait values, how many genomes do you need to observe (with sufficient resolution to correctly identify which variant one has for each gene)?
3^30 ≈ 206 trillion, more than 25,000 times the number of humans who actually exist. And life is much more complicated than that! However, the scenario of determinism isn’t so far-fetched. Notice that absent things like malnutrition, tumors, injuries, or major illnesses, identical twins nearly always have nearly identical heights, so even the high Falconer’s Formula estimates are likely underestimates, and there is a very low-variance mapping from all those genes to height.
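A quick check of that arithmetic in Python, taking world population as a round 8 billion:

num_genes = 30
variants_per_gene = 3
possible_genomes = variants_per_gene ** num_genes   # every row of the giant lookup table
world_population = 8_000_000_000                    # assumed round figure

print(f"{possible_genomes:,} possible combinations")   # 205,891,132,094,649 (about 206 trillion)
print(f"{possible_genomes / world_population:,.0f} times the number of humans alive")   # roughly 25,700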
Now, the question is: what is the shape of the curve of expected error for the best inferable models, across all numbers of observations from 0 (maximum error, the average human variance in height) to 3^30 (zero error)?
I think it’s reasonable to expect two kinds of shapes.
1. Exponential Decay, but, if this were the case, one wouldn’t have diminishing returns until one already had a low-error model.
2. Some kind of inverted S-curve. You take a few observations, and your model improves slowly. You take a lot – perhaps a significant fraction of all possibilities – and your model improves quickly over that range. Eventually you have a model with very low error, and then you are in the realm of diminishing returns.
Now, there are a few big GWAS studies with hundreds of thousands of people. But in the scenario above, 206K is a billion times fewer than the 206 trillion above, so you would only be about a billionth of the way there (roughly a ten-millionth of a percent): barely scratching the surface.
With either error curve shape above, the “barely scratching the surface” amount of observations wouldn’t be close to the diminishing returns range.
So, I think there is reason to be quite optimistic about the degree to which future, massive GWAS studies can improve our ability to predict polygenic trait values.
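To make that concrete, here is a toy rendering of the two shapes in Python. The functional forms and every parameter (half-life, midpoint, width) are invented purely for illustration, not fit to anything real; the only point is that n on the order of 10^5 sits on the flat early part of either curve when the full table has roughly 2×10^14 entries.

import math

N_TOTAL = 3 ** 30          # size of the full lookup table from the scenario above
E0 = 1.0                   # error with zero observations (normalized)

def exponential_decay(n, half_life=N_TOTAL / 20):
    # Assumed: error halves every (N_TOTAL / 20) observations.
    return E0 * 0.5 ** (n / half_life)

def inverted_s(n, midpoint=N_TOTAL / 2, width=N_TOTAL / 20):
    # Assumed: a logistic "inverted S" centered halfway through the table.
    return E0 / (1 + math.exp((n - midpoint) / width))

for n in (2e5, 2e10, 2e13, 3 ** 30):
    print(f"n = {n:.0e}: exponential-decay error {exponential_decay(n):.3f}, "
          f"inverted-S error {inverted_s(n):.3f}")

Under either assumed shape, today’s sample sizes leave the error essentially at its starting value, far from the diminishing-returns region.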
I think your optimism is based on the determinism built into your stated assumption. There have been a handful of genes, like the infamous huntingtin gene, whose effects are causative and linear but these types of genes proved to be the exception. The next hope was that a combination of genes would exhibit causative behavior but what we have found amounts to weak correlation at best.
The best analogy I can think of is using historical market prices to predict a wide range of socio-economic measures. Certain models may be able to make highly accurate predictions, and these predictions could be refined with a polygenic-like inclusion of many independent prices, but these correlation-based models are only right until they are wrong.
Each gene perfectly predicts the chemical signature of a specific molecule. Those molecules are critical in the behavior of downstream biochemical pathways, but these behaviors and the conditions under which they occur are emergent. The deterministic behavior of genes just waiting to be unlocked is a false meme.
“The deterministic behavior of genes just waiting to be unlocked is a false meme.”
I disagree. If this were true, identical twins wouldn’t be so identical. Yeah, “it’s complicated”, but if code reliably maps to traits, then it’s not in principle impossible to statistically infer a predictive model.
There are two questions here:
1. Assuming similar life histories, to what degree are code and macro-traits correlated? Relatedly, does that correlation decrease with the number of genes that influence a trait?
My position is that the correlation is usually very strong even for “hard” traits influenced in small amounts by large numbers of genes. Identical twins are identical in many of those traits, like height.
2. Under scenarios of different statistical relationships of the n’s, what is the relationship between n and the best inferable model’s errors?
For the second answer, Kling thinks the n’s are very redundant, and that the declining error curve quickly becomes extremely flat after a few hundred K. In other words, the model to predict height from code using this technique is already nearly as good as it’s going to get, at least without additional techniques or breakthroughs.
My guess is that diminishing returns won’t hit until model accuracy has improved considerably.
False Meme: The Genome is Uncracked Computer Code
We will recognize that this meme was false in three decades just like we recognize today that The Nurture Assumption was a false meme.
The nature (vs. environmental) component of life is part deterministic (20K genes map to 20K enzymes) and part freakishly cool emergent processes. You don’t just get DNA from your parents; you also get one of your mother’s fertilized egg cells, which is an identical chemical copy (other than the DNA) of your grandmother’s eggs, etc. I’m going to link to my conversation with RogerSweeny again for more details and a rough model.
20,000 genes map to 20,000 proteins (mostly enzymes). Humans have about 200 specialized cell types. The DSM-5 contains about 300 mental-health disorders. The ICD-10 contains an additional 70,000 health conditions. Those 20K enzymes almost never map directly to traits or conditions, and it is not a special combination of genes that maps to traits either.
Life unfolds in a mostly emergent way and this process is magnificent. From your mother’s fertilized egg, no genes are even expressed until you get to 32 cells; all the machinery required to start life and create the first set of specialized cells is emergent, based on nothing but the shape, size, and chemical composition of the fertilized egg, which itself is shaped by evolution.
It is not random or unsophisticated. It is sophisticated beyond our current comprehension. It is this emergence that is uncracked and exciting. The enzymes transcribed from genes get the ball rolling but they are not the stars of the show, just bit players in a magnificent dance with an invisible choreographer (evolution).
FGML: Functional Genome Markup Language
The genome is like a file format that represents strings of amino acids. These variable-length strings are built from an alphabet of 20 amino acids, which we will number from 1 to 20 and separate with commas. We will have one line per gene in our file:
9, 17, 2, 1, 20…
19, 10, 9, 18…
…
The .fgml file for a human will have 50K lines/genes. Each gene is transcribed into either a protein molecule or an RNA molecule. There are 20K protein molecules that act as enzymes, each breaking down a specific molecule into simpler ones or combining simpler molecules into a more complex one. Some of the RNA is the machinery that performs protein transcription, some of it acts as a mediator that modifies transcription rates, and most of it has no known function (yet).
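A minimal sketch of a parser for this hypothetical .fgml format in Python (the format, the file extension, and the example lines are inventions of this comment, not a real standard):

def parse_fgml(text):
    """Return one list of amino-acid numbers (1-20) per gene line."""
    genes = []
    for line in text.strip().splitlines():
        genes.append([int(token) for token in line.split(",") if token.strip()])
    return genes

example = """9, 17, 2, 1, 20
19, 10, 9, 18"""
for i, gene in enumerate(parse_fgml(example), start=1):
    print(f"gene {i}: {len(gene)} amino acids -> {gene}")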
Emergence: Activator Site and Start-of-Gene Marker
We will add a special number 0 (zero) that acts as the Start-of-Gene Marker and separates the gene proper from a preamble of amino acids that represents an Activator Site:
8, 10, 2, 0, 9, 17, 2, 1, 20…
19, 18, 0, 19, 10, 9, 18…
0, 6, 5, 18…
…
The Activator Site acts as a chemical socket. A molecule in the cytoplasm of the cell with a matching chemical plug attaches to the Activator Site. When a molecule is attached, transcription of the gene is repressed. Depending on the shape of the molecule attached to the Activator Site, a different molecule in the cell’s cytoplasm can pull it away temporarily from the Activator Site and allow the gene to be transcribed. When a line/gene starts with 0, it is automatically transcribed along with the preceding gene, and bundles of these operons are common.
Enzymes transcribed from genes can modify the state of molecules in the cytoplasm. Molecules in the cytoplasm can modify the transcription of genes. This feedback loop forms the basis of the emergence that builds the structures/traits we recognize in living things.
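A toy version of that loop in Python, in the spirit of the description above (all names and numbers are made up to illustrate the feedback idea, not real biochemistry): an inducer in the cytoplasm unblocks the gene, the gene’s enzyme consumes the inducer, and the repressor re-binds once the inducer is gone.

def simulate(steps=8):
    inducer = 10   # molecules of some inducer floating in the cytoplasm
    enzyme = 0     # copies of the enzyme this gene encodes
    for t in range(steps):
        repressed = inducer == 0             # no inducer -> repressor sits on the Activator Site
        if not repressed:
            enzyme += 1                      # gene is transcribed, enzyme accumulates
        inducer = max(0, inducer - enzyme)   # the enzyme consumes the inducer
        print(f"t={t} inducer={inducer} enzyme={enzyme} repressed={repressed}")

simulate()

The gene switches itself off: the enzyme it produces removes the very molecule that was keeping its Activator Site open.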
Our current statistical methods for genetic analysis consider only the gene without the Activator Site preamble, and only the deterministic 64 Codon => (20 Amino Acids + Start-of-Gene + End-of-Gene) mapping, without any consideration of the emergent behavior, which is not deterministic yet creates identical twins every time given the same DNA.
And history will be the judge.