1 Million Websites Ignoring SEO: Research Methodology
We were curious to know how many websites had correctly implemented fundamental SEO elements on their home pages, arguably a website's most important page.
The fundamental elements we were interested in evaluating were:
- Title tag
- Meta-Description tag (albeit not strictly an SEO element)
- H1 (headline) tag
From an SEO standpoint, it’s pretty basic stuff. We didn’t get into all the deeper, cool stuff like SEOmoz does (backlinks, social media, keywords, anchor text, competition, etc.), given that our focus is “SEO for mere mortals.”
If there are 141 million active websites, our focus is on the 99% of sites (the bottom 139.6 million) that don’t have dedicated SEO experts but that nonetheless have websites and do worry about their inbound traffic, or lack thereof.
To help us acquire the big-data set and make sense of the results, we teamed up with the online-behavior engineering quants at OBTO Tech.
The following is the procedure they laid out for us:
Pinging 1 million websites takes some tricky coding and smart planning. Obviously, you want to get it right the first time so you don’t have to start all over again after you’ve already analyzed 750k sites.
We were able to get the list of the top 1 million sites from Alexa.com. The top 1 million sites breaks down like this:
| TLD | % of 1 million domains |
| --- | --- |
In case you want to do your own research, here’s the Alexa 1 million site file we used (.csv format, as compressed .rar file, 9M download).
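If you do want to run your own analysis, the Alexa file is a plain two-column CSV of rank and domain. A minimal sketch of loading it (the `sample` rows here are illustrative, not taken from the actual file):

```python
import csv
import io

# Hypothetical sample rows in the Alexa top-sites CSV format: rank,domain
sample = "1,google.com\n2,facebook.com\n3,youtube.com\n"

def load_alexa_list(csv_text):
    """Return a list of (rank, domain) tuples from Alexa-style CSV text."""
    rows = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        rows.append((int(rank), domain))
    return rows

sites = load_alexa_list(sample)
print(sites[0])  # (1, 'google.com')
```

For the real file, replace `io.StringIO(sample)` with `open("alexa_top_1m.csv")` (after extracting the .rar archive).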
For the SEO elements we were interested in, we used the following criteria:
- Title: exists, only 1 on the web page, 65 characters or less
- H1: exists, only 1 on the web page
- Meta-description: exists, only 1 on page, 160 characters or less
- Sitemap.xml: exists on the website
- Robots.txt: exists on the website
- Urllist.txt: exists on the website
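The first three on-page criteria can be checked with a simple HTML pass. This is a sketch using only the Python standard library, not OBTO Tech's actual analyzer; the `evaluate` function and its keys are our own naming:

```python
from html.parser import HTMLParser

class SEOAudit(HTMLParser):
    """Counts the on-page elements the study looked at:
    <title>, <h1>, and <meta name="description">."""
    def __init__(self):
        super().__init__()
        self.titles = []              # text of each <title> found
        self.h1_count = 0
        self.meta_descriptions = []   # content of each meta description
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_descriptions.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

def evaluate(html):
    """Apply the study's pass/fail criteria to one page's HTML."""
    audit = SEOAudit()
    audit.feed(html)
    return {
        "title_ok": len(audit.titles) == 1 and len(audit.titles[0]) <= 65,
        "h1_ok": audit.h1_count == 1,
        "meta_ok": (len(audit.meta_descriptions) == 1
                    and len(audit.meta_descriptions[0]) <= 160),
    }

page = ('<html><head><title>Example</title>'
        '<meta name="description" content="A short description."></head>'
        '<body><h1>Hello</h1></body></html>')
print(evaluate(page))  # all three criteria pass on this page
```

A page counts toward the "all 3 correct" bucket only when every key comes back `True`.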
To keep costs and complexity to a minimum, OBTO Tech used a single server to ping the 1 million sites. However, given the latency of website (HTTP-request) response times, their calculations showed that it was going to take a very long time to complete our research.
To speed things up, Gearman was used to implement a basic map/reduce system:
- Map: several threads were created to reduce the bottleneck of waiting for each HTTP response. Each thread would send one request to its assigned URL, analyze the response, and store the analysis in the filesystem.
- Reduce: each thread would grab a range of the stored analyses (e.g., sites 1–100) and send the compiled report back to the main Gearman worker.
See hand drawn sketch of the map/reduce system, below.
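The same map/reduce shape can be sketched with a plain thread pool; Gearman coordinated the work in the actual study, but the idea is identical: many concurrent fetches (map), then a fold over the stored per-site results (reduce). The function names here are our own:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Map step: fetch one URL; return (url, body) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError:
        return url, None

def map_fetch(urls, workers=50):
    """Run many fetches concurrently so one slow HTTP response
    doesn't block the rest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

def reduce_results(analyses):
    """Reduce step: fold the per-site results into one compiled tally."""
    ok = sum(1 for _, body in analyses if body is not None)
    return {"fetched": ok, "failed": len(analyses) - ok}
```

With ~50 workers and a 10-second timeout, a single machine can grind through a list this size in days rather than months, which matches the three-day runtime below.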
The entire project took about three days to complete.
| Web page element | Criteria | % of sites |
| --- | --- | --- |
| Title tag | Single title tag, not longer than 65 characters | 64.9% |
| H1 tag | Single H1 tag | 32.2% |
| Meta-description tag | Single meta-description tag, not longer than 160 characters | 38.5% |
| All 3 of the above | Page meets all three criteria | 9.6% |
| Title tag | No title tag on the page | 6% |
| Title tag | More than 1 title tag on the page | 2% |
| H1 tag | No H1 tag on the page | 55% |
| H1 tag | More than 1 H1 tag on the page | 14% |
| Meta-description tag | No meta-description tag on the page | 36% |
| Meta-description tag | More than 1 meta-description tag on the page | 2% |
| Sitemap.xml * | File present in root folder of website | 30% |
| Robots.txt | File present in root folder of website | 61% |
| Urllist.txt | File present in root folder of website | 2% |
*Sitemap.xml: We looked for the default file, sitemap.xml. We estimate that less than 2% of sites “hide” their sitemap file by renaming it something other than the default.
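Checking for the three root-folder files amounts to one HEAD request per filename. A sketch under the same assumption noted above (only the default filenames are checked, so renamed sitemaps are missed); the `opener` parameter is our addition so the function can be exercised without network access:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

ROOT_FILES = ["sitemap.xml", "robots.txt", "urllist.txt"]

def root_file_exists(domain, filename, opener=urlopen, timeout=10):
    """HEAD-request http://<domain>/<filename>; True on a 2xx response.
    Only the default filename is checked, so a renamed sitemap
    counts as missing."""
    req = Request(f"http://{domain}/{filename}", method="HEAD")
    try:
        with opener(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (HTTPError, URLError, OSError):
        return False
```

Calling `root_file_exists(domain, name)` for each name in `ROOT_FILES` yields the three presence flags tallied in the table above.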