Semantic Web: What sources can Google Squared trust?

I was playing around with Google Squared, prompted by the news that Google has improved its data by monitoring user edits and has increased the scope of data you can input.

Google Squared hints at how the semantic web may be structured: it provides data in a standardised way so that web applications can manipulate it easily. Tim Berners-Lee's TED talk about it is here:

A big issue with making the semantic web work is which sources you use for information. At the moment it looks like Wikipedia is a favoured source for Google Squared (in keeping with Google search favouring Wikipedia), with most information in this "Mountains in Peru" example coming from that one source. It's debatable whether this is good enough, since standard practice in universities is to regard Wikipedia as a popular rather than a scholarly source; as the Yale guidelines put it: "to rely on Wikipedia—even when the material is accurate—is to position your work as inexpert and immature."

Another possible source of data is HTML tables, parsed for where rows and columns intersect. A typical HTML table contains organised information, and that structure gives the data on the page a higher-level meaning.

For example, a table of mountain heights will likely have columns headed "Name of Mountain" and "Height above sea level in feet". A standard Google query for [highest mountain in the world] will probably not find our example page, as those keywords are not on the page.
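To make the idea of reading a table "where data intersects" concrete, here is a minimal sketch in Python using only the standard library. The sample markup, the `TableParser` class and the `table_to_records` helper are all made up for illustration; this is not how Google Squared or any real crawler works, just one way each cell could be tied to its column header.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Pair every body cell with the header of its column."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment, assumed to use the first row as its header.
SAMPLE = """
<table>
  <tr><th>Name of Mountain</th><th>Height above sea level in feet</th></tr>
  <tr><td>Mt. Everest</td><td>29,029</td></tr>
  <tr><td>K2</td><td>28,251</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
```

Each record now says that "29,029" is the "Height above sea level in feet" of "Mt. Everest", which is exactly the relationship a plain keyword index does not capture.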

A semantic search engine, however, could recognise that "Mt. Everest" and "29,029" answer the query and give just one search result - the "right" one. It could similarly compute answers to queries such as: [median height of mountains in nepal]; [smallest three mountains in new zealand]; [what unit is used to measure mountain height] - all queries which would be a lot harder to answer via traditional search. This is something that Google CEO Eric Schmidt has often expressed a desire for:

So I don’t know how to characterize the next 10 years except to say that we’ll get to the point – the long-term goal is to be able to give you one answer, which is exactly the right answer over time. Okay, you know, the question I’ll ask today, how many Americans have – what percentage of Americans have passports?…The Google’s answer was a site, which was somebody who had attempted to answer that question and had multiple answers. It’s quite interesting actually to read…So you go to a very good definitive site. And what I’d like to do is to get to the point where we could read his site and then summarize what it says, and answer the question…Along with the citation and so forth and so on.
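Once table rows are available as structured records, the mountain queries mentioned earlier come down to small computations. The sketch below assumes that structure already exists; the record layout, the function names and the sample rows (approximate heights, typed in for illustration) are my own assumptions rather than anything a real search engine exposes.

```python
from statistics import median

# Illustrative sample rows; heights are approximate and in feet.
MOUNTAINS = [
    {"name": "Mt. Everest", "height_ft": "29,029"},
    {"name": "Kangchenjunga", "height_ft": "28,169"},
    {"name": "Lhotse", "height_ft": "27,940"},
    {"name": "Makalu", "height_ft": "27,838"},
]


def to_int(text):
    """Turn a table cell such as '29,029' into the integer 29029."""
    return int(text.replace(",", ""))


def highest(records):
    """[highest mountain]: the record with the largest height."""
    return max(records, key=lambda r: to_int(r["height_ft"]))


def median_height(records):
    """[median height of mountains]: middle value of all heights."""
    return median(to_int(r["height_ft"]) for r in records)


def smallest(records, n):
    """[smallest n mountains]: the n records with the lowest heights."""
    return sorted(records, key=lambda r: to_int(r["height_ft"]))[:n]


print(highest(MOUNTAINS)["name"])
print(median_height(MOUNTAINS))
print([m["name"] for m in smallest(MOUNTAINS, 3)])
```

The hard part is not the arithmetic but deciding which rows to trust in the first place.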

  
This is another reason, should you need one, not to use tables for layout when designing your website - you may be uncrawlable by tomorrow's semantic search bots. I've seen evidence that Bing is already toying with this way of treating tabled data. Wolfram|Alpha looks to do something similar, but it controls all its own data: it doesn't crawl the web at large, and instead trusts its own team to input formatted data. As Wolfram|Alpha and Bing are already getting cosy with each other, it could be only a short while before Wolfram is using live web data to generate its results, increasing its scope and usefulness a hundredfold.

Going back to the Google Squared example about Mountains in Peru, there still seems to be some work to do on sources: the one non-Wikipedia source was the website MundoAndion, as can be seen in this screenshot:

[Screenshot: Google-squared]

However, clicking through to the result shows that its credited source page is, at the time of writing, an error page with no data at all:
...not even in the Google cache, which hints that perhaps Google Squared doesn't use the Google search crawler to gather its data.

It all comes down to sources - how do you decide which pages to source the information from? Even if MundoAndion's data were up, who vouches for its accuracy? Knowing Google, this will in future need to be done in a scalable, automated manner, so manual review is out - what about TrustRank? Number of citation links? Traffic? Result bounce rate? And how do you cite the true source of the information, rather than a rehash done on another blog or website?

These are all questions I'm sure Google or someone else will solve eventually, although I think it is worth keeping track of how these sources are determined - the danger is that the more trust is put in semantic search, the less variety there is in results, which could mean one opinion or political view dominating another. In our highest mountain example above, how would it deal with Mt. Everest's country of residence - Nepal or China?

Nothing is ever 100% right, and since people have differing views and opinions, those views should be represented in semantic results.
