The Null Device


This juxtaposition between content and automatically served ads has recently been brought to my attention:

It would appear that the culprit is Google's ad-serving algorithm, which seems to be based on keywords or word frequencies in the content; in this case, I'm guessing that it noticed that the page was about strings, and it had an ad targeted at a number of keywords, including "string". And, hence, we have an illustration of why naïve keyword-based content matching can fail.
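To make the failure mode concrete, here is a minimal sketch of what a naïve keyword matcher might look like. This is only my guess at the general shape of such a thing, not Google's actual algorithm; the ad inventory, the keyword sets and the names `tokenize` and `pick_ad` are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, crudely dropping a plural 's'."""
    words = re.findall(r"[a-z+]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

def pick_ad(page_text, ads):
    """Return the name of the ad whose keywords occur most often on the page, or None."""
    counts = Counter(tokenize(page_text))
    best_ad, best_score = None, 0
    for name, keywords in ads.items():
        score = sum(counts[k] for k in keywords)
        if score > best_score:
            best_ad, best_score = name, score
    return best_ad

# Hypothetical ad inventory: ad name -> keywords the advertiser bid on.
ADS = {
    "underwear": {"string", "thong", "lace"},
    "compilers": {"compiler", "template", "linker"},
}

page = ("The C++ string class stores characters. "
        "Strings can be concatenated; a string has a length.")
print(pick_ad(page, ADS))
```

The matcher only counts words, so a page that says "string" three times wins the underwear ad, regardless of the fact that everything else on the page is about programming.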

I imagine that Google could do better than this. They have (a) a copy of the entire Web, stored and indexed as they see fit, and (b) huge quantities of parallel-processing power to crunch through it and the data derived from it. They have already used these to great effect in building statistical models of language, which they use in things like their language tools and the context-sensitive spelling correction in Wave. I imagine, though, that the same resources could be used to implement a machine-learning system, taking content classification beyond word frequencies.

Imagine, for example, if there were a classification engine, trained on millions of web pages (and auxiliary data about them), that, when fed a web page or document, could assign it a score along several axes with some degree of accuracy. Some axes could be the obvious things: a "sex" axis, for example (with thongs falling on one side and C++ classes well on the other), could be used for things like SafeSearch. An "emotional response" axis could be used to classify how likely content is to arouse strong emotions; on one end would be accounts of lurid violence and depravity, and on the other end things like source code and stationery catalogues, with art, celebrity gossip and LiveJournal angst falling in the spaces between. As soon as a page crossed a certain point on the axis, the ad-serving algorithm could stop matching ads by keywords (you don't want ads for airfares next to a piece about an air crash, for example), or even invert the matching (so that topical ads aren't shown).
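The last step, inverting the keyword match once a page crosses a point on the axis, might be sketched like this. It assumes the scores have already come out of some trained classifier, which I'm not attempting to write here; the `AxisScores` type, the `select_ads` function and the threshold value are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    """Scores a (hypothetical) trained classifier assigned to one page."""
    sex: float      # 0.0 (C++ classes) .. 1.0 (thongs)
    emotion: float  # 0.0 (stationery catalogues) .. 1.0 (lurid violence)

# Arbitrary cut-off, chosen for illustration only.
EMOTION_THRESHOLD = 0.6

def select_ads(scores, keyword_matches, all_ads):
    """Serve keyword-matched ads on calm pages; on emotive pages, serve only
    the ads that did NOT match the page's keywords."""
    if scores.emotion < EMOTION_THRESHOLD:
        return list(keyword_matches)
    return [ad for ad in all_ads if ad not in keyword_matches]

inventory = ["airfares", "stationery", "hosting"]

# A piece about an air crash: keyword-matches "airfares", scores high on emotion.
crash = AxisScores(sex=0.0, emotion=0.9)
print(select_ads(crash, ["airfares"], inventory))

# A travel guide: same keyword match, low emotional score.
guide = AxisScores(sex=0.0, emotion=0.1)
print(select_ads(guide, ["airfares"], inventory))
```

The air-crash page gets everything except the airfares ad, while the travel guide gets the airfares ad as usual.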

In fact, one need not restrict oneself to pre-imagined axes; it's conceivable that an ad-serving company with Google's resources could set up a learning engine, program it to categorise pages along a dozen arbitrary axes, and see what comes out and what it's useful for, in turn coming up with a model for clustering web content into crisply defined categories that no human would think of. Of course, for all I know, someone at Google (or Microsoft or Facebook) could be doing this right now.

(via David Gerard)