
1 Million Websites Ignoring SEO

Research Methodology

Abstract: Here’s how we went about gathering the data we needed to expose the high percentage of websites neglecting fundamental SEO elements. The results of our research are highlighted in the infographic published at http://FreeSEOScorecard.com.

We were curious: how many websites have correctly implemented fundamental SEO elements on their home page, arguably a website’s most important page?

The fundamental elements we were interested in evaluating were:

  1. Title tag
  2. Meta-Description tag (admittedly, not strictly an SEO element)
  3. H1 (headline) tag
  4. Sitemap.xml
  5. Robots.txt

From an SEO standpoint, it’s pretty basic stuff. We didn’t get into the deeper, cooler analysis that SEOmoz does (backlinks, social media, keywords, anchor text, competition, etc.), since our focus is “SEO for mere mortals.”

If there are 141 million active websites, our focus is on the 99% of sites (the bottom 139.6 million) that don’t have dedicated SEO experts but nonetheless worry about their inbound traffic, or lack thereof.

Quants

To help us acquire the big data set and make sense of the results, we teamed up with the online-behavior engineering quants at OBTO Tech.

The following is the procedure they laid out for us:

Test environment

Pinging 1 million websites takes some tricky coding and smart planning. Obviously, you want to get it right the first time so you don’t have to start all over again after you’ve already analyzed 750k sites.

Data sources

We obtained the list of the top 1 million sites from Alexa.com. By top-level domain (TLD), the list breaks down like this:

TLD           % of 1 million domains
.com          52.7
.net          5.8
.org          4.1
.ru           3.8
.de           3.8
.co.uk        1.8
.info         1.5
.com.br       1.3
all others    25.2


In case you want to do your own research, here’s the Alexa 1 million site file we used (.csv format, compressed as a .rar file, 9 MB download).

Criteria

For the SEO elements we were interested in, we used the following criteria (a code sketch of these checks follows the list):

  • Title: exists, only 1 on the web page, 65 characters or less
  • H1: exists, only 1 on the web page
  • Meta-description: exists, only 1 on page, 160 characters or less
  • Sitemap.xml: exists on the website
  • Robots.txt: exists on the website
  • Urllist.txt: exists on the website
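
To make the on-page checks concrete, here is a minimal sketch, in Python, of how a single home page could be scored against the title, H1, and meta-description criteria above. This is not OBTO Tech’s actual code; the function names and the example URL are only illustrative. The file-existence checks (sitemap.xml, robots.txt, urllist.txt) are sketched at the end of this write-up.

from html.parser import HTMLParser
from urllib.request import urlopen

class SEOTagCounter(HTMLParser):
    """Count <title>, <h1>, and <meta name="description"> tags and capture their text."""

    def __init__(self):
        super().__init__()
        self.titles = []             # text of each <title> tag found
        self.h1_count = 0            # number of <h1> tags found
        self.meta_descriptions = []  # content attribute of each meta-description tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_descriptions.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.titles:
            self.titles[-1] += data

def check_home_page(url):
    """Score one home page against the three on-page criteria listed above."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = SEOTagCounter()
    parser.feed(html)
    return {
        "title_ok": len(parser.titles) == 1 and len(parser.titles[0].strip()) <= 65,
        "h1_ok": parser.h1_count == 1,
        "meta_description_ok": (len(parser.meta_descriptions) == 1
                                and len(parser.meta_descriptions[0]) <= 160),
    }

print(check_home_page("http://example.com"))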


Procedure

To keep costs and complexity to a minimum, OBTO Tech used a single server to ping the 1 million sites. However, given the latency of each site’s HTTP response, their calculations showed that requesting the sites one at a time would take far too long to complete our research.

To speed things up, Gearman was used to implement a basic map/reduce system:

  • Map: Several threads were created to reduce the bottleneck of waiting for each HTTP response. Each thread would send one request to the given URL, analyze the response, and store the analysis in the filesystem.
  • Reduce: Each thread would grab a range of the stored analyses (e.g., sites 1-100) and send the compiled report back to the main Gearman worker.

See the hand-drawn sketch of the map/reduce system below.

The entire project took about three days to complete.
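
For illustration, here is a simplified sketch of the same map/reduce idea using a plain Python thread pool instead of Gearman. It relies on the check_home_page() function sketched earlier; the results directory, batch size, and worker count are assumptions, not OBTO Tech’s actual setup.

import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

RESULTS_DIR = Path("analyses")       # assumed location for the per-site analyses
RESULTS_DIR.mkdir(exist_ok=True)

def map_site(rank, domain):
    """Map step: fetch and score one site, then store the analysis on disk."""
    try:
        result = check_home_page("http://" + domain)   # scoring function sketched earlier
    except Exception as exc:                           # unreachable sites are recorded too
        result = {"error": str(exc)}
    (RESULTS_DIR / f"{rank}.json").write_text(json.dumps(result))

def reduce_range(start, end):
    """Reduce step: compile one report for a range of stored analyses (e.g. sites 1-100)."""
    totals = {"title_ok": 0, "h1_ok": 0, "meta_description_ok": 0, "checked": 0}
    for rank in range(start, end + 1):
        path = RESULTS_DIR / f"{rank}.json"
        if not path.exists():
            continue
        result = json.loads(path.read_text())
        if "error" in result:
            continue
        totals["checked"] += 1
        for key in ("title_ok", "h1_ok", "meta_description_ok"):
            totals[key] += int(result[key])
    return totals

# Map: many worker threads fetch in parallel so slow HTTP responses overlap.
sites = [(1, "example.com"), (2, "example.org")]   # in practice, the Alexa 1 million list
with ThreadPoolExecutor(max_workers=50) as pool:
    for rank, domain in sites:
        pool.submit(map_site, rank, domain)

# Reduce: compile the per-range report once the map step has finished.
print(reduce_range(1, 100))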

Results

Web page element        Criteria                                                        % of sites
Title tag               Single title tag, not longer than 65 characters                 64.9%
H1 tag                  Single H1 tag                                                   32.2%
Meta-Description tag    Single Meta-Description tag, not longer than 160 characters     38.5%
                        Pages with all 3 of the above metrics correct                   9.6%
Title tag               No title tag on the page                                        6%
Title tag               More than 1 title tag on the page                               2%
H1 tag                  No H1 tag on the page                                           55%
H1 tag                  More than 1 H1 tag on the page                                  14%
Meta-Description tag    No meta-description tag on the page                             36%
Meta-Description tag    More than 1 meta-description tag on the page                    2%
Sitemap.xml *           File present in root folder of website                          30%
Robots.txt              File present in root folder of website                          61%
urllist.txt             File present in root folder of website                          2%

*Sitemap.xml: We looked for the default file, sitemap.xml. We estimate that fewer than 2% of sites “hide” their sitemap file by naming it something other than the default.
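
As a rough illustration of the file-existence checks, the sketch below (again Python, not the original code) simply asks whether the default filename answers at the site root; note that a naive check like this would over-count sites that return a 200 response for any URL.

from urllib.request import Request, urlopen

def root_file_exists(domain, filename):
    """Return True if http://<domain>/<filename> answers successfully."""
    request = Request(f"http://{domain}/{filename}", method="HEAD")
    try:
        with urlopen(request, timeout=10) as response:
            return response.getcode() == 200
    except Exception:            # 404s, timeouts, DNS failures all count as "not present"
        return False

for name in ("sitemap.xml", "robots.txt", "urllist.txt"):
    print(name, root_file_exists("example.com", name))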