Every 3-4 years, there’s a big shift or addition to the key metrics Google (and, to a lesser extent, MSN/Bing and Yahoo!) uses to order competitive search results.
1996 – 1999: On-page keyword usage + meta data
1999 – 2002: PageRank + On-page
2002 – 2005: Anchor text + Domain name + PageRank + On-page
2005 – 2009: Domain authority + Diversity of linking domains + Topic modeling + Anchor text + Domain name + PageRank + On-page
In 2010 and 2011, we’ve already seen the entry of social signals from Facebook and Twitter. The recent clickstream stories revealed that both Google and Bing employ clickstream data (Bing has done so publicly for the last 3 years, Google more quietly and probably longer), though this likely is a relatively small data point for both.
It’s my belief that the next generation of ranking signals will rely on three (relatively) new groups of metrics.
#1: Brand Signals
One of the reasons Google took so long to penalize JCPenney (it was first reported to me as spam in late 2009) is that their human raters and user data likely suggested it was actually quite a good result for searches like “dresses” and “bedding.” The brand name meant that people felt good about the listing, and Google, up until the bad press, felt no need to take punitive action even if the methodology was manipulative (I’m pretty sure they knew about the manipulation for a long time, but wanted to solve it algorithmically).
For millions of retail, transactional-focused searches, Google’s results are, to be honest, easily and often gamed. We could find hundreds of examples in just a few hours, but the one below serves the purpose pretty well.
I just bought some new yellow Pumas (these ones), but the best possible page Google could return (probably this one) is nowhere to be found, and most of the first two pages of results aren’t specific enough – a good number don’t even offer any yellow Pumas that I could find!
Google wants to solve this, and one very good way is to separate the “brands” that produce happy searchers and customers from the “generics” – sites they’ve often classified as “thin affiliates” or “poor user experiences.” As webmasters and supporters of small business on the web, we might complain, but as searchers, even we can agree that Puma, Amazon and Zappos would be pretty good results for a query like the above.
So what types of signals might Google employ to determine if a site is a “brand” or not?
These are just a few examples of data types and sources – Google/Bing can look at dozens, possibly hundreds of inputs (including applying machine learning to selected subsets of brand vs. non-brand sites to identify pattern matches that might not be instantly apparent to human algorithm creators).
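To make the machine-learning idea a bit more concrete, here is a minimal sketch of training a very simple classifier (a perceptron) on sites labeled as brand or non-brand. Everything in it is an assumption for illustration: the three yes/no features, the toy training data and the function names are hypothetical, and nothing here reflects what Google or Bing actually use.

```python
# Toy perceptron that learns to separate "brand" from "non-brand" sites.
# Features are hypothetical yes/no flags (1 = yes, 0 = no):
#   [people search for the site by name,
#    site lists a physical address,
#    site has active social profiles]
TRAINING_DATA = [
    ([1, 1, 1], 1),  # label 1 = brand
    ([1, 1, 0], 1),
    ([1, 0, 1], 1),
    ([0, 0, 0], 0),  # label 0 = non-brand
    ([0, 1, 0], 0),
    ([0, 0, 1], 0),
]


def predict(weights, bias, features):
    """Return 1 (brand) if the weighted sum of features is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(data, epochs=20):
    """Standard perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0] * len(data[0][0])
    bias = 0
    for _ in range(epochs):
        for features, label in data:
            error = label - predict(weights, bias, features)
            if error != 0:
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
    return weights, bias


weights, bias = train(TRAINING_DATA)
```

On this toy data the learned weights lean most heavily on the first feature. A real system would work with far more inputs and far messier labels, but the principle of letting labeled examples decide the weighting is the same.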
As you might imagine, many manipulative sites could copy a number of these signals, but the engines can likely still have a significant impact on quality by using them. The Vince update from 2009 is often pointed to as a first effort along these lines from Google.
#2: Entity Associations
Search engines have, classically, relied on a relatively universal algorithm – one that rates pages based on the metrics available, without massive swings between verticals. In the past few years, however, savvy searchers and many SEOs have noted a distinct shift to a model where certain types of sites have a greater opportunity to perform for certain queries. The odds aren’t necessarily stacked against outsiders, but the engines appear to bias toward the types of content providers that are likely to fulfill the users’ intent.
For example, when a user performs a search for “lamb shanks,” it could make a lot of sense to give an extra boost to sites whose content is focused on recipes and food.
This same logic could apply to “The King’s Speech” where the engine might bias toward film-focused sites like RottenTomatoes, IMDB, Flixster or Metacritic.
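As a rough illustration of that kind of bias, the sketch below re-orders results by multiplying the score of any site whose vertical matches the query's vertical. The site names, scores, vertical labels and the 1.25 boost are all made up for the example; how the engines actually weigh this (if they do) isn't public.

```python
def rerank(results, query_vertical, boost=1.25):
    """Sort results by score after boosting sites in the query's vertical.

    `results` is a list of dicts with hypothetical keys:
    "site", "vertical" and "base_score".
    """
    def adjusted(result):
        multiplier = boost if result["vertical"] == query_vertical else 1.0
        return result["base_score"] * multiplier

    return sorted(results, key=adjusted, reverse=True)


results = [
    {"site": "general-portal.example", "vertical": "general", "base_score": 0.80},
    {"site": "recipe-site.example", "vertical": "food", "base_score": 0.70},
]

# A query like "lamb shanks" is assumed to have been classified as "food".
ranked = rerank(results, query_vertical="food")
```

With the boost applied, the recipe site moves ahead of the general portal even though its base score is lower. For a query in a different vertical the original order stands, which matches the idea that outsiders aren't shut out, just not favored.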
Bill Slawski has written brilliantly about entities in the past:
Rather than just looking for brands, it’s more likely that Google is trying to understand when a query includes an entity – a specific person, place, or thing, and if it can identify an entity, that identification can influence the search results that you see…
…I’ve written about the topic before, when Google was granted a patent named Query rewriting with entity detection back in May of 2009, which I covered in Boosting Brands, Businesses, and Other Entities: How a Search Engine Might Assume a Query Implies a Site Search.
Google’s recent acquisition of Metaweb is noteworthy for a number of reasons. One of them is that Metaweb has developed an approach to cataloging different names for the same entity, so that, for example, when Google sees names on the Web such as Terminator or Governator or Conan the Barbarian or Kindergarten Cop, it can easily associate those mentions with Arnold Schwarzenegger.
Entity associations can be used to help bolster brand signals, classify query types (and types of results), and probably help with triggering vertical/universal results like Places/Maps, Images, Videos, etc.
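The alias cataloging described in the Metaweb quote can be pictured as a lookup from many names to one entity. The sketch below is a deliberately naive version, assuming a hand-built alias table and plain substring matching; the table and the function name are hypothetical, and a real system would need to handle ambiguity (plenty of things are called “Terminator”).

```python
# Hypothetical alias table: lowercase alias -> canonical entity name.
ALIASES = {
    "terminator": "Arnold Schwarzenegger",
    "governator": "Arnold Schwarzenegger",
    "conan the barbarian": "Arnold Schwarzenegger",
    "kindergarten cop": "Arnold Schwarzenegger",
}


def find_entities(text):
    """Return the set of canonical entities whose aliases appear in the text."""
    lowered = text.lower()
    return {entity for alias, entity in ALIASES.items() if alias in lowered}


mentions = find_entities("Is the Governator the guy from Kindergarten Cop?")
```

Both aliases in the example sentence resolve to the same entity, so the result is a single entry rather than two separate mentions.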
#3: Human Quality Raters and (Trusted) User Behavior
Last November, I wrote a post on my personal blog called “The Algorithm + the Crowd are Not Enough”:
In the last decade, the online world has been ruled by two twin forces: The Crowd and The Algorithm. The collective “users” of the Internet (The Crowd) create, click, and rate, while mathematical equations add scalability and findability to these overwhelming quantities of data (The Algorithm). Like the moon over the ocean, the pull of these two forces helps create the tides of popularity (and obscurity) on the Internet. Information is more accessible, useful, and egalitarian than ever before.
But lately, at least to me, the weaknesses of this crowdsourced + algorithmic system are showing, and the next revolution feels inevitable.
Given that Google’s just launched a Chrome web extension to allow users to block sites of their choosing in the SERPs, and given the many attempts to leverage user data in the search results (remember SideWiki, SearchWiki, Starred Results), it’s a good bet that the pure-algorithm bias is slowly seeping away. Bing uses a panel of search quality reviewers, as does Google (though the latter continues to be very secretive about it).
Both are looking at clickstream data (a form of user-based information). Here’s a former Google search quality engineer noting that Google’s used the same form of clickstream analysis via their toolbar that they railed against Bing for applying.
All of this strongly suggests that more user and usage information will be gathered and used to help rank results. It’s far tougher to access than link data and particularly hard to game without appearing “unnatural” compared to the normal web traffic patterns. I’ve talked before about how I don’t like the direct signals of clicks on search results, but many ancillary data points could be collected and used, including information about where users have “good” user experiences on the web.
I’m looking forward to your thoughts on the next generation of ranking signals and what Google/Bing might do next to overcome problems like JCPenneyGate, content farms and the perception of spam among technophiles. It seems hard to imagine that either will simply rest on a system they know can be gamed.
p.s. I’d also add that vertical/universal results and more “instant answers” will continue to rise in importance/visibility in the SERPs for both engines (though these aren’t really classic “ranking signals”).
p.p.s. If you’re PRO and interested in the brand signals in particular (and some suggested brand-building tactics), feel free to join our webinar this Friday.