The Null Device

Strings and strings

This juxtaposition between content and automatically served ads has recently been brought to my attention:
It would appear that the culprit is Google's ad-serving algorithm, which seems to be based on keywords or word frequencies in the content; in this case, I'm guessing that it noticed that the page was about strings, and it had an ad targeted at a number of keywords, including "string". And, hence, we have an illustration of why naïve keyword-based content matching can fail.
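To make the failure mode concrete, here is a minimal sketch of naïve keyword matching. It is not Google's algorithm, which I can only guess at; the ad inventory, the keyword sets and the `match_ads` function are all invented for illustration.

```python
import re
from collections import Counter


def keywords(text):
    """Count lowercase words, crudely stripping a trailing 's' so that
    'strings' is counted as 'string'."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)


def match_ads(page_text, ads):
    """Return the names of ads whose keywords occur in the page,
    ranked by total keyword frequency (highest first)."""
    counts = keywords(page_text)
    scores = {name: sum(counts[k] for k in kws) for name, kws in ads.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [name for name, score in ranked if score > 0]


# Hypothetical ad inventory: names and keywords are made up.
ADS = {
    "lingerie": {"string", "thong", "bikini"},
    "compilers": {"compiler", "debugger"},
}

page = "In C++, strings are objects; the string class manages its own buffer of characters."
print(match_ads(page, ADS))
```

The matcher has no notion of which sense of "string" the page means, so the programming page attracts the lingerie ad purely on word frequency.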

I imagine that Google could do better than this. They have (a) a copy of the entire Web, stored and indexed as they see fit, and (b) huge quantities of parallel-processing power to crunch through it and the data derived from it. They have already used this to great effect in building statistical models of language, which they use in things like their language tools and the context-sensitive spelling correction in Wave. I imagine, though, that it could also be used to implement a machine-learning system, taking content classification beyond word frequencies.

Imagine, for example, if there were a classification engine, trained on millions of web pages (and auxiliary data about them), that, when fed a web page or document, could assign it a score along several axes with some degree of accuracy. Some axes could be the obvious things: a "sex" axis, for example (with thongs falling on one side and C++ classes well on the other), could be used for things like SafeSearch. An "emotional response" axis could be used to classify how likely content is to arouse strong emotions; on one end would be accounts of lurid violence and depravity, and on the other end things like source code and stationery catalogues, with art, celebrity gossip and LiveJournal angst falling in the spaces between. As soon as a page crossed a certain point on the axis, the ad-serving algorithm could stop matching ads by keywords (you don't want ads for airfares next to a piece about an air crash, for example), or even invert the matching (so that topical ads aren't shown).
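The gating step might look something like the sketch below. The classifier itself is assumed to exist and is not shown; `AxisScores`, `EMOTION_THRESHOLD` and `choose_ads` are hypothetical names, and the threshold value is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class AxisScores:
    """Scores a (hypothetical) trained classifier might assign to a page."""
    sex: float      # 0.0 = C++ classes, 1.0 = thongs
    emotion: float  # 0.0 = stationery catalogue, 1.0 = lurid violence


# Arbitrary cut-off, chosen only for illustration.
EMOTION_THRESHOLD = 0.7


def choose_ads(scores, page_keywords, ads):
    """Pick ads for a page given its axis scores.

    Below the threshold, ads are matched by keyword as usual. Above it,
    the matching is inverted: only ads sharing no keywords with the page
    are eligible.
    """
    topical = [name for name, kws in ads.items() if kws & page_keywords]
    if scores.emotion < EMOTION_THRESHOLD:
        return topical
    return [name for name in ads if name not in topical]


# Made-up inventory and pages.
ads = {"airfares": {"flight", "airfare"}, "stationery": {"pen", "notebook"}}
crash_report = AxisScores(sex=0.0, emotion=0.9)
travel_guide = AxisScores(sex=0.0, emotion=0.1)

print(choose_ads(crash_report, {"flight", "crash"}, ads))
print(choose_ads(travel_guide, {"flight", "hotel"}, ads))
```

The same keyword, "flight", yields an airfare ad on the travel guide and suppresses it on the crash report; all the hard work is in producing trustworthy scores in the first place.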

In fact, one need not restrict oneself to pre-imagined axes; it's conceivable that an ad serving company with Google's resources could set up a learning engine, program it to categorise pages according to a dozen arbitrary axes, and see what comes about and what it's useful for, in turn coming up with a model for clustering web content into crisply defined categories that no human would think of. Of course, for all I know, someone at Google (or Microsoft or Facebook) could be doing this right now.
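As a toy illustration of letting categories emerge rather than defining them in advance, here is plain k-means over pages that have already been scored along two unnamed axes. K-means is my own choice of stand-in, not something any of these companies is known to use for this, and the scores are invented.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means. Centroids are seeded from the first k points so that
    the result is repeatable; returns a cluster index for each point."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up scores for five pages along two arbitrary learned axes.
pages = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.1), (0.85, 0.9), (0.2, 0.15)]
print(kmeans(pages, k=2))
```

Nothing in the output says what the two clusters mean; working out whether a cluster is useful for anything would still be a job for a human.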

There are 3 comments on "Strings and strings":

Posted by: gusset http://blog.gusset.co.uk Thu Nov 5 17:14:16 2009

Remember that Gmail already has context-based ad filtering: http://homepage.mac.com/joester5/art/gmail.html

Posted by: acb http://dev.null.org/acb/ Thu Nov 5 17:37:36 2009

That seems to be a simple keyword-based hack. It doesn't categorise the content in an N-dimensional space, but merely kills ads if some hot-button words are found.

Posted by: gusset http://blog.gusset.co.uk Fri Nov 6 09:01:45 2009

True. Although it shows at least some thought has been given to it. Just not enough yet.
