
Data Mining is a Fad

"Arguing in My Spare Time," No. 2.24

Arnold Kling

Dec. 15, 1999

May not be redistributed commercially without the author's permission.

In the spring of 2000, I am scheduled to teach a night-school course on starting an Internet business. The students may be disappointed, because my background is limited to one entrepreneurial effort. Moreover, I have no experience at all with one of the most challenging aspects of starting a business, which is obtaining funding. In 1994, the only thing crazier than starting a web business would have been expecting someone else to finance it.

If I have any advice to offer, it is to try to start a business that stands to benefit from a "killer trend." It seems to me that my own experience supports the hypothesis that one can make plenty of mistakes and still come out ahead if the trends are moving one's way.

An analogy would be going up to bat in a baseball game where there is a thirty-mile-per-hour wind blowing from left field to right field. If you try to aim toward left, you will have to connect perfectly to have a chance at a hit. Aim toward center, and you still have to make pretty good contact. But if you go with the wind toward right field, you leave plenty of margin for error. A pop fly could turn into a home run.

This raises the question of how one spots a trend. In particular, how do you distinguish a genuine trend from its impostor, a fad?

One of the best examples of something that was a fad rather than a trend was "push technology." This meant services, with names such as Pointcast or Backweb, which sent information to your PC rather than waiting for you to "pull" the information. In the summer of 1996, both Business Week and Wired had cover stories on "push" technology, proclaiming it an important trend. Instead, "push" technology sputtered and failed spectacularly relative to its hype.

"Push" technology had several salient characteristics that, in retrospect, could have enabled one to identify it as a fad.

  1. It was a product designed to serve the wishes of corporations, not the needs of consumers. Corporations are frustrated by the fact that on the Internet the consumer holds the initiative in the communication process. "Push" technology attempted to reverse this, and instead to put corporations in control.
  2. Having not clamored for "push" technology in the first place, consumers became downright annoyed with it as they began to evaluate its impact. Most of us who tried Pointcast ended up uninstalling it, because it interrupted us with ads and slowed down our computers and Internet connections.
  3. The technology sounded more sophisticated than it really was.

If "push technology" was the hot buzzword in 1996, then "data mining" is the equivalent today.

So what exactly is data mining? Data mining sounds like an esoteric concept, but in fact it can be understood easily by comparing it to something with which we all are familiar, multinomial logistic regression.

Or not.

I will attempt to explain data mining in layman’s terms.

When economists handle data, we start with some expectations about relationships in the data. For example, we might expect that if we were to look at a cross-section of consumers, we would find that high-income consumers tend to buy more luxury automobiles than low-income consumers. The expectation that the purchase of luxury automobiles depends on income is called a specification of our model.

In the pristine theory of econometrics, we bring a specification into contact with the data, and based on how well the data correspond to the specification, we draw conclusions about hypotheses. In practice, however, once we meet the data, we become fascinated with it, and we come up with new hypotheses that lead to new specifications and new confrontations with the data.

About twenty years ago, Edward Leamer wrote a book called "Specification Searches," which pointed out that the practice of re-using the same data in a trial-and-error process of fine-tuning specifications means that the reliability of econometric results is vastly overstated. Of course, forbidding people from doing specification searches is neither possible nor desirable. Still, specification searching is a practice whose blessings are decidedly mixed.
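To make the process concrete, here is a minimal sketch of a specification search. The variable names, the synthetic data, and the use of ordinary least squares are my own illustrative assumptions; nothing here comes from Leamer's book or any particular study. The point is simply that several specifications are tried against the same data and the best-fitting one is kept.

```python
# A sketch of a specification search: several candidate specifications are
# tried against the same synthetic data, and the best-fitting one is kept.
# The variable names and data are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.normal(50, 15, n)                  # hypothetical income data
age = rng.normal(45, 10, n)                     # hypothetical age data
luxury = 0.02 * income + rng.normal(0, 1, n)    # hypothetical purchases

def r_squared(columns, y):
    """Fit y on the given columns (plus a constant) by least squares."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

specifications = {
    "income only": [income],
    "income + age": [income, age],
    "income + age + income^2": [income, age, income ** 2],
}
best = max(specifications, key=lambda name: r_squared(specifications[name], luxury))
print("best-fitting specification:", best)
```

Because the same data set judges every candidate, the winning specification looks more reliable than it really is, which was Leamer's complaint.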

Data mining is an automated process for doing specification searches. In one data mining technique, called "neural networks," the computer searches through all possible specifications to a user-specified level of non-linearity. For example, suppose that the variable that we are trying to predict is Y, and we have two independent variables, X and Z.

Y could represent the probability that you will like a book of feminist poetry, X could represent the number of poetry books you have purchased and Z could represent the number of books on women’s issues you have purchased. Amazon recently recommended a book of feminist poetry to me, so I am thinking that they used an algorithm like this. But I am waiting for them to recommend a book of conservative, African-American feminist poetry about the Internet media culture, in order to reflect a more precise customization based on my past purchases.

Anyway, we can think of the neural network as starting with a linear relationship. That is, it looks at how Y is predicted by the equation:

Y = aX + bZ + c,

where a, b, and c are the parameters chosen to give the best fit to the data.
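As a concrete illustration, here is a minimal sketch of this first-order stage. I am assuming that ordinary least squares stands in for whatever fitting procedure a given neural-network package actually uses, and the data for X, Z, and Y are made up.

```python
# First-order stage: fit Y = a*X + b*Z + c by least squares on made-up data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=500)        # e.g., number of poetry books purchased
Z = rng.normal(size=500)        # e.g., number of books on women's issues
Y = 0.4 * X + 0.3 * Z + rng.normal(scale=0.1, size=500)

design = np.column_stack([X, Z, np.ones_like(X)])
(a, b, c), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Y = {a:.2f}*X + {b:.2f}*Z + {c:.2f}")
```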

Next, the neural network looks at "second-order" terms. The equation would be

Y = pX + qZ + rX*X + sZ*Z + tX*Z + u,

where the lower-case letters are coefficients found by confronting this specification with the data.
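Continuing the sketch, the second-order stage simply adds squared and cross-product columns to the same kind of fit. As before, the data and the use of least squares are my own illustrative assumptions.

```python
# Second-order stage: add X*X, Z*Z, and X*Z columns and refit on made-up data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=500)
Z = rng.normal(size=500)
Y = 0.4 * X + 0.3 * Z + 0.2 * X * Z + rng.normal(scale=0.1, size=500)

design = np.column_stack([X, Z, X * X, Z * Z, X * Z, np.ones_like(X)])
(p, q, r, s, t, u), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Y = {p:.2f}*X + {q:.2f}*Z + {r:.2f}*X*X + {s:.2f}*Z*Z + {t:.2f}*X*Z + {u:.2f}")
```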

Next, the neural network might look at "third-order" terms. However, in practice, there often are so many independent variables (X’s and Z’s) that it is not possible to investigate to that level.

In fact, neural network algorithms are used in cases where there are very large numbers of independent variables. Because an exhaustive specification search is impossible with so many variables, the neural network operates by trial and error. It is as if some coefficients are held constant while others are allowed to change. If increasing a coefficient improves the fit, the computer will continue to increase that coefficient until the fit no longer improves. In this way, the algorithm tunes the coefficients to arrive at the best equation for prediction.
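Here is a toy version of that trial-and-error tuning: hold every coefficient but one fixed, nudge the free one as long as the fit improves, then move on to the next. This is a simple coordinate search that I am using to illustrate the idea; it is not any particular vendor's algorithm, and the data and step size are invented.

```python
# Toy coordinate search: tune one coefficient at a time while the fit improves.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))                 # many independent variables
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ true_w + rng.normal(scale=0.1, size=300)

def sse(w):
    """Sum of squared prediction errors for coefficient vector w."""
    return float(np.sum((Y - X @ w) ** 2))

w = np.zeros(5)
step = 0.1
for sweep in range(25):
    for j in range(5):                        # hold the others constant
        for direction in (step, -step):
            while True:
                trial = w.copy()
                trial[j] += direction
                if sse(trial) < sse(w):       # keep moving while fit improves
                    w = trial
                else:
                    break
print("tuned coefficients:", np.round(w, 2))
```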

Data mining software programs are licensed for hundreds of thousands of dollars. For that same money, you could hire several statisticians to undertake specification searches for a year. However, there is only so much information in any data set, and therefore there are only so many specifications worth trying. In the end, my guess is that the humans will do a better job than the neural networks of filtering out the noise and finding the signal that is in the data.

But the hypothesis that humans can do specification searches more effectively and more cheaply is not what troubles me about data mining software. What bothers me about it is that it is so-o-o 1991.

In 1991, there was a lot of mileage to be gotten out of analyzing data from consumers in order to place them in finer classifications. For example, one could see that credit scoring systems might be used to help sort mortgage borrowers more effectively into different risk buckets. I cannot resist adding that the leading credit scoring companies used systems that were developed using specification searches done by humans, rather than by neural networks.

Today, I believe that the biggest opportunities are not in extracting data from consumers but in providing data to them. For example, in the past, market research firms like Claritas or CACI Marketing have used statistical cluster analysis (another human technique) of data from the Census Bureau and other sources in order to help direct marketers identify the best target markets for solicitations. With the advent of the Web, it is the consumers that are gaining access to this data. Microsoft’s homeadvisor.com lets consumers view the Claritas clusters by zip code. Our homefair.com site lets consumers view the CACI clusters.
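For readers who want a picture of what cluster analysis does, here is a minimal k-means sketch on made-up zip-code demographics. The fields, the data, and the choice of k-means are my own illustrative assumptions; I am not describing the actual Claritas or CACI methodology.

```python
# A toy k-means clustering of made-up zip-code demographics.
import numpy as np

rng = np.random.default_rng(4)
# columns: median income ($000s), share with a college degree, median age
zips = rng.normal(loc=[55, 0.3, 40], scale=[20, 0.1, 8], size=(1000, 3))

def kmeans(data, k=4, iters=50):
    """Assign each row to the nearest center, then recompute the centers."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        distances = ((data[:, None, :] - centers) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(zips)
print("cluster profiles (income, college share, age):")
print(np.round(centers, 1))
```

Each cluster profile plays the role of a marketing segment; the difference today is that the consumer, not just the marketer, can look up which segment a zip code falls into.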

The other night, I heard a talk by the head of the National Medical Library. He pointed out another category of data to which consumers are starting to gain access: our own medical records. He described a demonstration project focused on low-income families in the western U.S., in which the families were provided with smart cards that include immunization records. He pointed out the irony that these poor families now have handy access to information that many of us have to scramble to obtain whenever we get ready to go overseas or send our children to camp.

If you want to get in front of a trend that is robust, develop services that enable consumers to see more data and to share data more safely. Online banking will take off when it enables consumers to understand better where their money is going. As my essay on "People Protocols" argued, hardware or software that enables consumers to establish and protect their identity online will fill a major need.

I’ve been wrong before. MicroStrategy may be valued at $10 billion by the time you read this. Michael Saylor and other scions of corporate data mining, who promise corporations big rewards for extracting data from consumers, could end up hitting a home run. But I see the thirty-mile-per-hour wind blowing in the other direction.