by Brent K
Google is now an irrevocably integral part of many of our lives. The company has spent years building its many, varied, and mostly free products around one basic principle: bringing the right content to the user faster. That principle extends to its primary source of revenue, carefully, almost artfully coordinated relevant advertisements. Despite its many products, the company’s name is still known primarily for one thing: internet search. Virtually nobody these days says “let’s go research that on the internet”; instead we say “let’s Google it.” Even Ask.com, Yahoo!, and Microsoft’s recently renovated Bing haven’t achieved the kind of ubiquity that makes a brand part of our everyday vocabulary. Certainly, then, Google must apply some very careful methods in its search engine technology. I’ll elaborate on two of these: Google Panda and Google Penguin.
Google Panda and Google Penguin are both algorithms that optimize Google’s search results, but they perform two very different functions. According to Google’s official blog, in a post announcing Panda’s launch in February 2011, “This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on.” Conversely, Google’s Inside Search blog described Penguin on its launch day, April 24th, 2012, as an “important algorithm change targeted at webspam. The change will decrease rankings for sites that we believe are violating Google’s existing quality guidelines.”
Google Panda, then, is essentially a rating system aimed at bringing you the most relevant content for your search query. It combines simple keyword matching, the robustness and relevance of a page’s content, and a generalized statistical model derived from thousands of ratings by human quality testers, who judged websites on measures of quality (design, trustworthiness, and speed) and on whether they would return to the site. Panda’s algorithm is tailored to find sites that have positive answers to these questions, also found on Google’s official blog:
• Would you trust the information presented in this article?
• Is this article written by an expert or enthusiast who knows the topic well, or is it more shallow in nature?
• Does the site have duplicate, overlapping, or redundant articles on the same or similar topics with slightly different keyword variations?
• Would you be comfortable giving your credit card information to this site?
• Does this article have spelling, stylistic, or factual errors?
• Are the topics driven by genuine interests of readers of the site, or does the site generate content by attempting to guess what might rank well in search engines?
• Does the article provide original content or information, original reporting, original research, or original analysis?
• Does the page provide substantial value when compared to other pages in search results?
• How much quality control is done on content?
• Does the article describe both sides of a story?
• Is the site a recognized authority on its topic?
• Is the content mass-produced by or outsourced to a large number of creators, or spread across a large network of sites, so that individual pages or sites don’t get as much attention or care?
• Was the article edited well, or does it appear sloppy or hastily produced?
• For a health related query, would you trust information from this site?
• Would you recognize this site as an authoritative source when mentioned by name?
• Does this article provide a complete or comprehensive description of the topic?
• Does this article contain insightful analysis or interesting information that is beyond obvious?
• Is this the sort of page you’d want to bookmark, share with a friend, or recommend?
• Does this article have an excessive amount of ads that distract from or interfere with the main content?
• Would you expect to see this article in a printed magazine, encyclopedia or book?
• Are the articles short, unsubstantial, or otherwise lacking in helpful specifics?
• Are the pages produced with great care and attention to detail vs. less attention to detail?
• Would users complain when they see pages from this site?
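The rater questions above feed a statistical model built from measurable page signals. As a rough illustration of how such a model could work, here is a toy weighted-signal classifier; every feature name, weight, and threshold below is my own invention for the sketch, not anything Google has published:

```python
# Toy sketch of a Panda-style quality score: a weighted linear combination
# of page-level signals, where the weights would be fit against human rater
# labels. All feature names and weights here are hypothetical.

WEIGHTS = {
    "original_content_ratio": 0.35,   # share of text not duplicated elsewhere
    "depth_score": 0.25,              # proxy for thorough topical coverage
    "trust_signals": 0.20,            # e.g. author bylines, contact info
    "ad_density_penalty": -0.30,      # fraction of the page taken up by ads
    "spelling_error_rate": -0.15,     # errors per hundred words
}

def quality_score(features: dict) -> float:
    """Sum each signal times its weight; missing signals count as zero."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

def panda_adjustment(features: dict, threshold: float = 0.3) -> str:
    """Demote pages scoring below the threshold, promote the rest."""
    return "promote" if quality_score(features) >= threshold else "demote"

# Two example pages: one thorough and original, one thin and ad-heavy.
high_quality = {
    "original_content_ratio": 0.9,
    "depth_score": 0.8,
    "trust_signals": 0.7,
    "ad_density_penalty": 0.1,
    "spelling_error_rate": 0.02,
}
thin_page = {
    "original_content_ratio": 0.2,
    "depth_score": 0.1,
    "trust_signals": 0.1,
    "ad_density_penalty": 0.8,
    "spelling_error_rate": 0.3,
}
```

Under these invented weights, the thorough page clears the threshold and the thin, ad-heavy one does not, mirroring the promote/demote behavior the blog post describes.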
Matt Cutts, head of Google’s webspam team, described the sort of websites Panda filters out in an interview with Wired: “It was like, ‘What’s the bare minimum that I can do that’s not spam?’ It sort of fell between our respective groups. And then we decided, okay, we’ve got to come together and figure out how to address this.”
Google Penguin is a much simpler algorithm, built to tackle a problem inherent in many search engines, with or without optimization technologies: webspam, or spamdexing. Spamdexing is the deliberate manipulation of search engine indexes to illegitimately improve a website’s ranking in a given search result. Most of the time, the technique boils down to injecting words that are invisible to the user but visible in the page’s code to Google’s (and other search engines’) indexing spiders. These words, and occasionally links as well, are complete non sequiturs relative to the content of the page. Sometimes the content of the page itself is nothing but nonsense stuffed with keywords that search engine spiders weight heavily. Another common technique under the spamdexing umbrella is the generation of duplicate content, wherein a website almost or completely clones the content of another site in the hopes of cloning that site’s search rank.
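That last trick, duplicate content, can be caught, at least in sketch form, with a standard technique called shingling: break each page’s text into overlapping word n-grams and compare the resulting sets with Jaccard similarity. The shingle size and similarity threshold below are illustrative choices of mine, not anything from Google:

```python
def shingles(text: str, k: int = 3) -> set:
    """All overlapping k-word shingles of the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Set overlap |A ∩ B| / |A ∪ B|; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def looks_duplicated(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag near-clones: very high shingle overlap suggests copied content."""
    return jaccard(shingles(page_a), shingles(page_b)) >= threshold

# A page, a near-clone with one word changed, and an unrelated page.
original = "the quick brown fox jumps over the lazy dog near the river bank"
clone    = "the quick brown fox jumps over the lazy dog near the river bend"
fresh    = "search engines index pages by crawling links and scoring content"
```

The near-clone shares almost every shingle with the original and gets flagged, while the unrelated page shares none. Real deduplication systems scale this idea up with sketching tricks like MinHash rather than comparing full shingle sets.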
Now that we know what Panda and Penguin are, we should ask how useful they are. This is, however, much harder to answer scientifically. Simply put, most people are either quietly satisfied with Google’s efforts, or they write long, statistically supported blog posts about why Google’s new algorithms are ruining lives. Despite my sarcastic spin, the life-ruining effect unfortunately does hold true in some cases; the Penguin algorithm is producing an awful lot of false positives.
Sarah Needleman and Emily Maltby of the Wall Street Journal wrote an article of this latter sort, noting that “… some small businesses say they are scrambling to avoid being relegated to the Internet’s junk bin.” They cite examples such as Andrew Strauss, who owns a website selling dog clothing, bedding, and the like. According to the same article, “Traffic through Google has plunged by 96%, he says. Mr. Strauss expects his six-year-old business to generate sales of $25,000 this month, down from $68,000 in March, the month before the changes. ‘We’re completely crippled now,’ he says.” Even Ralph Slate, owner of HockeyDB.com, who has never invested a cent in pushing links to his site anywhere, is affected: “‘I have never paid for a link, and I don’t do link-sharing sites,’ says Mr. Slate. ‘I don’t do keyword stuffing. That’s why this is so frustrating.’”
However, most larger, more established sites typically benefited, according to a careful statistical study conducted by Declan McCullagh of CNET: “CNET’s analysis found no significant change among the very top sites, which remained the same. Wikipedia, YouTube, Amazon.com, and IMDB stayed in the same enviable tier one positions, respectively. Hulu.com surged to position No. 22 from No. 51. Twitter, Facebook, and Huffington Post each moved up a single notch, with Yelp, Flickr, Apple.com, and WebMD slipping a bit. Government Web sites got a boost, with WhiteHouse.gov climbing from No. 125 to No. 79, and NASA, the Centers for Disease Control, and the National Institutes of Health increasing as well.”
Similarly, as expected, other pages that were more superfluous went down in rankings: “Among the Web sites that slid in visibility: WikiHow and eHow, which is consistent with other reports that Panda lowered the ranking of so-called content farms.”
[CNET’s article includes a visualization of this analysis, with red representing big sites like Amazon and Wikipedia and the colors fading to blue for smaller sites.]
So, effectively, Google’s algorithm updates help big business and hinder little business. But the updates still have a practical advantage: they correctly reduce or remove search results that are annoying, spam-ridden, or unnecessary, as well as results from honest but poorly built websites that do not adequately answer our queries. They have a philosophical benefit as well, if only in that they attempt to make search results more pertinent, and in doing so open up the field of possibilities to various other methods of SEO.
Singhal, Amit, and Cutts, Matt (Feb 24th, 2011). The Official Google Blog. “Finding more high-quality sites in search.” Retrieved from http://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html
Cutts, Matt (Apr 24th, 2012). Google Inside Search Blog. “Another step to reward high-quality sites.” Retrieved from http://insidesearch.blogspot.com/2012/04/another-step-to-reward-high-quality.html
Sullivan, Danny (Oct 21st, 2008). SearchEngineLand.com. “What Is Search Engine Spam? The Video Edition.” Retrieved from http://searchengineland.com/what-is-search-engine-spam-the-video-edition-15202
Needleman, Sarah E. and Maltby, Emily (May 16th, 2012). The Wall Street Journal. “As Google Tweaks Searches, Some Get Lost in the Web.” Retrieved from http://online.wsj.com/article/SB10001424052702303505504577406751747002494.html
McCullagh, Declan (Apr 18th, 2011). CNET.com. “Testing Google’s Panda algorithm: CNET analysis.” Retrieved from http://news.cnet.com/8301-31921_3-20054797-281.html
Levy, Steven (Mar 3rd, 2011). Wired.com. “TED 2011: The ‘Panda’ That Hates Farms: A Q&A With Google’s Top Search Engineers.” Retrieved from http://www.wired.com/business/2011/03/the-panda-that-hates-farms/2/