1 Million Websites Ignoring SEO: Research Methodology
We were curious to know how many websites had correctly implemented fundamental SEO elements on their home pages, arguably a website's most important page.
The fundamental elements we were interested in evaluating were:
- Title tag
- Meta-Description tag (albeit not strictly an SEO element)
- H1 (headline) tag
From an SEO standpoint, it’s pretty basic stuff. We didn’t get into all the deeper, cool stuff like SEOmoz does (backlinks, social media, keywords, anchor text, competition, etc.), given that our focus is “SEO for mere mortals.”
If there are 141 million active websites, our focus is on the 99% of sites (the bottom 139.6 million) that don’t have dedicated SEO experts but that nonetheless have websites and do worry about their inbound traffic, or lack thereof.
To help us acquire the big-data set and make sense of the results, we teamed up with the online-behavior engineering quants at OBTO Tech.
The following is the procedure they laid out for us:
Pinging 1 million websites takes some tricky coding and smart planning. Obviously, you want to get it right the first time so you don’t have to start all over again after you’ve already analyzed 750k sites.
We were able to get the list of the top 1 million sites from Alexa.com. The top 1 million sites breaks down like this:
| TLD | % of 1 million domains |
| --- | --- |
In case you want to do your own research, here’s the Alexa 1 million site file we used (.csv format, as compressed .rar file, 9M download).
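If you do want to run your own analysis, the Alexa file is a plain two-column CSV of rank and domain. A minimal sketch of loading it (the `sample` rows here are illustrative, not taken from the actual file):

```python
import csv
import io

# Hypothetical sample rows in the Alexa top-sites CSV format: rank,domain
sample = "1,google.com\n2,facebook.com\n3,youtube.com\n"

def load_alexa_list(csv_text):
    """Return a list of (rank, domain) tuples from Alexa-style CSV text."""
    rows = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        rows.append((int(rank), domain))
    return rows

sites = load_alexa_list(sample)
print(sites[0])  # (1, 'google.com')
```

For the real file, replace `io.StringIO(sample)` with `open("alexa_top_1m.csv")` (after extracting the .rar archive).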
For the SEO elements we were interested in, we used the following criteria:
- Title: exists, only 1 on the web page, 65 characters or less
- H1: exists, only 1 on the web page
- Meta-description: exists, only 1 on page, 160 characters or less
- Sitemap.xml: exists on the website
- Robots.txt: exists on the website
- Urllist.txt: exists on the website
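The first three on-page criteria can be checked with a simple HTML pass. This is a sketch using only the Python standard library, not OBTO Tech's actual analyzer; the `evaluate` function and its keys are our own naming:

```python
from html.parser import HTMLParser

class SEOAudit(HTMLParser):
    """Counts the on-page elements the study looked at:
    <title>, <h1>, and <meta name="description">."""
    def __init__(self):
        super().__init__()
        self.titles = []              # text of each <title> found
        self.h1_count = 0
        self.meta_descriptions = []   # content of each meta description
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_descriptions.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

def evaluate(html):
    """Apply the study's pass/fail criteria to one page's HTML."""
    audit = SEOAudit()
    audit.feed(html)
    return {
        "title_ok": len(audit.titles) == 1 and len(audit.titles[0]) <= 65,
        "h1_ok": audit.h1_count == 1,
        "meta_ok": (len(audit.meta_descriptions) == 1
                    and len(audit.meta_descriptions[0]) <= 160),
    }

page = ('<html><head><title>Example</title>'
        '<meta name="description" content="A short description."></head>'
        '<body><h1>Hello</h1></body></html>')
print(evaluate(page))  # all three criteria pass on this page
```

A page counts toward the "all 3 correct" bucket only when every key comes back `True`.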
To keep costs and complexity to a minimum, OBTO Tech used a single server to ping the 1 million sites. However, given the latency of website (HTTP-request) response times, their calculations showed that it was going to take a very long time to complete our research.
To speed things up, Gearman was used to implement a basic map/reduce system:
- Map: several threads were created to reduce the bottleneck of waiting for each HTTP response. Each thread would send one request to its assigned URL, analyze the response, and store the analysis in the filesystem.
- Reduce: each thread would grab a range of the stored analyses (e.g., sites 1–100) and send the compiled report back to the main Gearman worker.
See hand drawn sketch of the map/reduce system, below.
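The same map/reduce shape can be sketched with a plain thread pool; Gearman coordinated the work in the actual study, but the idea is identical: many concurrent fetches (map), then a fold over the stored per-site results (reduce). The function names here are our own:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Map step: fetch one URL; return (url, body) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError:
        return url, None

def map_fetch(urls, workers=50):
    """Run many fetches concurrently so one slow HTTP response
    doesn't block the rest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

def reduce_results(analyses):
    """Reduce step: fold the per-site results into one compiled tally."""
    ok = sum(1 for _, body in analyses if body is not None)
    return {"fetched": ok, "failed": len(analyses) - ok}
```

With ~50 workers and a 10-second timeout, a single machine can grind through a list this size in days rather than months, which matches the three-day runtime below.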
The entire project took about three days to complete.
| Web page element | Criteria | % of sites |
| --- | --- | --- |
| Title tag | Single title tag, not longer than 65 characters | 64.9% |
| H1 tag | Single H1 tag | 32.2% |
| Meta-description tag | Single meta-description tag, not longer than 160 characters | 38.5% |
| All 3 of the above | Page meets all three criteria | 9.6% |
| Title tag | No title tag on the page | 6% |
| Title tag | More than 1 title tag on the page | 2% |
| H1 tag | No H1 tag on the page | 55% |
| H1 tag | More than 1 H1 tag on the page | 14% |
| Meta-description tag | No meta-description tag on the page | 36% |
| Meta-description tag | More than 1 meta-description tag on the page | 2% |
| Sitemap.xml * | File present in root folder of website | 30% |
| Robots.txt | File present in root folder of website | 61% |
| Urllist.txt | File present in root folder of website | 2% |
*Sitemap.xml: We looked for the default file, sitemap.xml. We estimate that less than 2% of sites “hide” their sitemap file by renaming it something other than the default.
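Checking for the three root-folder files amounts to one HEAD request per filename. A sketch under the same assumption noted above (only the default filenames are checked, so renamed sitemaps are missed); the `opener` parameter is our addition so the function can be exercised without network access:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

ROOT_FILES = ["sitemap.xml", "robots.txt", "urllist.txt"]

def root_file_exists(domain, filename, opener=urlopen, timeout=10):
    """HEAD-request http://<domain>/<filename>; True on a 2xx response.
    Only the default filename is checked, so a renamed sitemap
    counts as missing."""
    req = Request(f"http://{domain}/{filename}", method="HEAD")
    try:
        with opener(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (HTTPError, URLError, OSError):
        return False
```

Calling `root_file_exists(domain, name)` for each name in `ROOT_FILES` yields the three presence flags tallied in the table above.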