The Null Device

2010/1/22

A Google engineer writes about how Google's search engine attempts to understand synonyms:

We use many techniques to extract synonyms, that we've blogged about before. Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts. In the above example "photos" was an obvious synonym for "pictures," but it's not always a good synonym. For example, it's important for us to recognize that in a search like [history of motion pictures], "motion pictures" means something special (movies), and "motion photos" doesn't make any sense. Another example is the term "GM." Most people know the most prominent meaning: "General Motors." For the search [gm cars], you can see that Google bolds the phrase "General Motors" in the search results. This is an indication that for that search we thought "General Motors" meant the same thing as "GM." Are there any other meanings? Many people can think of the second meaning, "genetically modified," which is bolded when GM is used in queries about crops and food, like in the search results for [gm wheat]. It turns out that there are more than 20 other possible meanings of the term "GM" that our synonyms system knows something about. GM can mean George Mason in [gm university], gamemaster in [gm screen star wars], Gangadhar Meher in [gm college], general manager in [nba gm] and even gunners mate in [navy gm].
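
To make the idea of context-dependent synonyms concrete, here is a toy sketch in Python. It is emphatically not how Google does it: the quoted post describes a system learned from petabytes of documents and search logs, whereas this is a hand-written lookup table. The table `GM_EXPANSIONS` and the function `expand_gm` are hypothetical names made up for illustration; only the query examples come from the post.

```python
from typing import Optional

# Hypothetical hand-written table: a context word that appears alongside "gm"
# mapped to the meaning the quoted post gives for that query.
GM_EXPANSIONS = {
    "cars": "general motors",
    "wheat": "genetically modified",
    "university": "george mason",
    "screen": "gamemaster",
    "college": "gangadhar meher",
    "nba": "general manager",
    "navy": "gunners mate",
}


def expand_gm(query: str) -> Optional[str]:
    """Return a guessed expansion of "gm" based on the other words in the query.

    Returns None if the query does not contain "gm" or if no known
    context word is present.
    """
    words = query.lower().split()
    if "gm" not in words:
        return None
    # The first context word found in the table decides the meaning.
    for word in words:
        if word in GM_EXPANSIONS:
            return GM_EXPANSIONS[word]
    return None


if __name__ == "__main__":
    for q in ["gm cars", "gm wheat", "nba gm", "gm screen star wars"]:
        print(q, "->", expand_gm(q))
```

The point of the sketch is only that the same token maps to different expansions depending on its neighbours; a real system would have to learn and rank those associations rather than hard-code them.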

ai google language 1

The Economist looks at some of the more complicated human languages, ones which make legendarily tricky languages like Greek and Latin look simple:

For sound complexity, one language stands out. !Xóõ, spoken by just a few thousand, mostly in Botswana, has a blistering array of unusual sounds. Its vowels include plain, pharyngealised, strident and breathy, and they carry four tones. It has five basic clicks and 17 accompanying ones. The leading expert on the !Xóõ, Tony Traill, developed a lump on his larynx from learning to make their sounds. Further research showed that adult !Xóõ-speakers had the same lump (children had not developed it yet).
Beyond Europe things grow more complicated. Take gender. Twain’s joke about German gender shows that in most languages it often has little to do with physical sex. “Gender” is related to “genre”, and means merely a group of nouns lumped together for grammatical purposes. Linguists talk instead of “noun classes”, which may have to do with shape or size, or whether the noun is animate, but often rules are hard to see. George Lakoff, a linguist, memorably described a noun class of Dyirbal (spoken in north-eastern Australia) as including “women, fire and dangerous things”. To the extent that genders are idiosyncratic, they are hard to learn. Bora, spoken in Peru, has more than 350 of them.
Berik, a language of New Guinea, also requires words to encode information that no English speaker considers. Verbs have endings, often obligatory, that tell what time of day something happened; telbener means “[he] drinks in the evening”. Where verbs take objects, an ending will tell their size: kitobana means “gives three large objects to a man in the sunlight.” Some verb-endings even say where the action of the verb takes place relative to the speaker: gwerantena means “to place a large object in a low place nearby”. Chindali, a Bantu language, has a similar feature. One cannot say simply that something happened; the verb ending shows whether it happened just now, earlier today, yesterday or before yesterday. The future tense works in the same way.

Asked which language might be the hardest for an Anglophone to learn, the Economist posits Tuyuca, an Amazonian language with between 50 and 140 genders and a tendency to build compound words from morphemes, among other features:

Most fascinating is a feature that would make any journalist tremble. Tuyuca requires verb-endings on statements to show how the speaker knows something. Diga ape-wi means that “the boy played soccer (I know because I saw him)”, while diga ape-hiyi means “the boy played soccer (I assume)”. English can provide such information, but for Tuyuca that is an obligatory ending on the verb. Evidential languages force speakers to think hard about how they learned what they say they know.

The most complex languages tend to be the ones from isolated areas (like the Amazon and the highlands of New Guinea). This makes some sense; after all, trade and cultural exchange would serve to smooth languages, polishing off the rough edges, generalising special cases and cutting enough corners to allow foreigners and travellers to understand them. If a language is used by people in a wide variety of environments, its structures are going to become more generic and adaptable. Conversely, if you and your ancestors have spent all your lives knowing the objects and routines of one kind of environment, then the circumstances of your lives will seem timeless and absolute, and specialised word genders and grammatical cases for them will seem like common sense.

(via BBC) culture language linguistics 1