
Why server logs matter for SEO

Server log analysis can provide unparalleled insights into crawl prioritization, enabling SEO teams to fine-tune their crawl budget management for better rankings.

The majority of website operators are unaware of the importance of web server logs. They do not record, much less analyze, their website’s server logs. Large brands, in particular, fail to capitalize on server log analysis, and any server log data they do not record is irretrievably lost.

Organizations that choose to embrace server log analysis as part of their ongoing SEO efforts often excel in Google Search. If your website consists of 100,000 pages or more and you wish to find out how and why server logs pose a tremendous growth opportunity, keep reading.

Why server logs matter

Each time a bot requests a URL hosted on a web server, a log record entry is automatically created reflecting the information exchanged in the process. When covering an extended period of time, server logs become representative of the history of requests received and of the responses returned.

The information retained in server log files typically includes the client IP address, request date and time, the page URL requested, the HTTP response code, the volume of bytes served as well as the user agent and the referrer.

While server logs are created at every instance a web page is requested, including user browser requests, search engine optimization focuses exclusively on the use of bot server log data. This is relevant with regard to legal considerations touching on data protection frameworks such as GDPR/CCPA/DSGVO. Because no user data is ever included for SEO purposes, raw, anonymized web server log analysis remains unencumbered by otherwise potentially applicable legal regulations.

It’s worth mentioning that, to some extent, similar insights are possible based on Google Search Console Crawl stats. However, these samples are limited in volume and time span covered. Unlike Google Search Console with its data reflecting only the last few months, it is exclusively server log files that provide a clear, big picture outlining long-term SEO trends.

The valuable data within server logs

Each time a bot requests a page hosted on the server, a log instance is created recording a number of data points, including:

The IP address of the requesting client.
The exact time of the request, often based on the server’s internal clock.
The URL that was requested.
The HTTP method that was used for the request.
The response status code returned (e.g., 200, 301, 404, 500 or other).
The user agent string from the requesting entity (e.g., a search engine bot name like Googlebot/2.1).

A typical server log record sample may look like this:

150.174.193.196 - - [15/Dec/2021:11:25:14 +0100] "GET /index.html HTTP/1.0" 200 1050 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)" "www.example.ai"

In this example:

150.174.193.196 is the IP of the requesting entity.

[15/Dec/2021:11:25:14 +0100] is the date and time of the request, followed by the time zone offset.

"GET /index.html HTTP/1.0" is the HTTP method used (GET), the file requested (index.html) and the HTTP protocol version used.

200 is the server HTTP status code response returned.

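1050 is the byte size of the server response.

"-" is the referring URL, which is empty in this example.

"Googlebot/2.1 (+http://www.google.com/bot.html)" is the user agent of the requesting entity.

"www.example.ai" is an additional field appended after the user agent, most likely the hostname that was requested.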

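To make the breakdown above concrete, here is a minimal Python sketch that splits the sample record into named fields. It assumes the record follows the widely used combined log layout, in which the quoted field before the user agent is the referrer, plus one extra trailing quoted field. The names `LOG_PATTERN`, `parse_log_line` and the group name `trailing` are hypothetical, and a real server may be configured with a different field order.

```python
import re
from datetime import datetime

# Assumed layout: combined log format plus one optional trailing quoted field.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
    r'(?: "(?P<trailing>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    # Keeps the time zone offset, so records from different servers stay comparable.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


SAMPLE = (
    '150.174.193.196 - - [15/Dec/2021:11:25:14 +0100] '
    '"GET /index.html HTTP/1.0" 200 1050 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)" "www.example.ai"'
)

record = parse_log_line(SAMPLE)
print(record["ip"], record["status"], record["path"])
```

Lines that do not match return None rather than raising an error, so malformed records can be counted and inspected separately instead of stopping the analysis.
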
How to use server logs

From an SEO perspective, there are three primary reasons why web server logs provide unparalleled insights:

Helping to separate undesirable bot traffic with no SEO significance from desirable search engine bot traffic originating from legitimate bots such as Googlebot, Bingbot or YandexBot.
Providing SEO insights into crawl prioritization, thereby giving the SEO team an opportunity to proactively tweak and fine-tune their crawl budget management.
Allowing for monitoring and providing a track record of the server responses sent to search engines.

Fake search engine bots can be a nuisance, but they only rarely affect websites. There are a number of specialized service providers like Cloudflare and AWS Shield that can help in managing undesirable bot traffic. In the process of analyzing web server logs, fake search engine bots tend to play a subordinate role.

In order to accurately gauge which parts of a website are being prioritized by major search engines, all other bot traffic has to be filtered out when performing a log analysis. Depending on the markets targeted, the focus can be on search engine bots like Google, Apple, Bing, Yandex or others.

Especially for websites where content freshness is key, how frequently those sites are being re-crawled can critically impact their usefulness for users. In other words, if content changes are not picked up swiftly enough, user experience signals and organic search rankings are unlikely to reach their full potential.

Only through server log filtering is it possible to accurately gauge relevant search engine bot traffic.
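
As a first pass, log records can be narrowed down to those that claim to come from a search engine crawler. The Python sketch below assumes each record is already a dictionary with a `user_agent` key (a hypothetical structure) and matches on a short, illustrative list of crawler tokens. Because user agent strings are easy to spoof, this step only identifies claimed bot traffic and does not replace IP-based verification.

```python
# Illustrative, not exhaustive: lowercase tokens found in well-known crawler user agents.
SEARCH_BOT_TOKENS = ("googlebot", "bingbot", "yandexbot", "applebot")


def claimed_search_bot(user_agent):
    """Return the matching crawler token, or None if the user agent names none of them."""
    lowered = user_agent.lower()
    for token in SEARCH_BOT_TOKENS:
        if token in lowered:
            return token
    return None


def filter_bot_records(records):
    """Keep only records whose user agent claims to be a known search engine bot."""
    return [r for r in records if claimed_search_bot(r["user_agent"]) is not None]


records = [
    {"user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
]
print(len(filter_bot_records(records)))
```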

While Google is inclined to crawl all information available and re-crawl already known URL patterns regularly, its crawl resources are not limitless. That’s why, for large websites consisting of hundreds of thousands of landing pages, re-crawl cycles depend on Google’s crawl prioritization allocation algorithms.

That allocation can be positively stimulated with reliable up-time, highly responsive web services, optimized specifically for a fast experience. These steps alone are conducive to SEO. However, only by analyzing complete server logs that cover an extended period of time is it possible to identify the degree of overlap between the total volume of all crawlable landing pages, the typically smaller number of relevant, optimized and indexable SEO landing pages represented in the sitemap and what Google regularly prioritizes for crawling, indexing and ranking.

Such a log analysis is an integral part of a technical SEO audit and the only method to uncover the degree of crawl budget waste. It also shows whether crawlable filtering, placeholder or lean content pages, an open staging server or other obsolete parts of the website continue to impair crawling and ultimately rankings. Under certain circumstances, such as a planned migration, it is specifically the insights gained through an SEO audit, including server log analysis, that often make the difference between success and failure for the migration.

Additionally, log analysis offers critical SEO insights for large websites. It can provide an answer to how long Google needs to recrawl the entire website. If that answer happens to be excessively long (months or longer), action may be warranted to make sure the indexable SEO landing pages are crawled. Otherwise, there’s a great risk that any SEO improvements to the website go unnoticed by search engines for potentially months after release, which in turn is a recipe for poor rankings.
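
One simple way to approximate the recrawl question is to measure how long it took, from the first logged bot request, until every indexable landing page had been requested at least once. The Python sketch below assumes the bot hits are available as (timestamp, URL) pairs and that the list of indexable URLs comes from the sitemap. The function name `time_to_full_crawl` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def time_to_full_crawl(hits, indexable_urls):
    """Return the time from the first hit until all indexable URLs were requested once.

    hits: iterable of (timestamp, url) tuples from verified bot requests.
    Returns None if the logs never cover every indexable URL.
    """
    remaining = set(indexable_urls)
    ordered = sorted(hits)
    if not ordered or not remaining:
        return None
    start = ordered[0][0]
    for timestamp, url in ordered:
        remaining.discard(url)
        if not remaining:
            return timestamp - start
    return None


hits = [
    (datetime(2021, 12, 1), "/a"),
    (datetime(2021, 12, 2), "/a"),
    (datetime(2021, 12, 3), "/b"),
]
print(time_to_full_crawl(hits, ["/a", "/b"]))
```

A result of None means the log period is too short to answer the question, or that some indexable pages are not being crawled at all.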

A high degree of overlap between indexable SEO landing pages and what Google crawls regularly is a positive SEO KPI.
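
The overlap can be expressed as a simple set comparison. The Python sketch below assumes two inputs that are not part of the article: the URLs listed in the sitemap and the URLs that verified search engine bots requested during the log period. The function name `crawl_overlap` and the result keys are hypothetical.

```python
def crawl_overlap(sitemap_urls, crawled_urls):
    """Compare sitemap URLs with the URLs that bots actually requested."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    covered = sitemap & crawled
    return {
        # Share of sitemap URLs that were crawled at least once.
        "sitemap_coverage": len(covered) / len(sitemap) if sitemap else 0.0,
        # Crawled URLs that are not in the sitemap: candidates for crawl budget waste.
        "crawled_outside_sitemap": sorted(crawled - sitemap),
        # Sitemap URLs that bots never requested in the log period.
        "never_crawled": sorted(sitemap - crawled),
    }


result = crawl_overlap(
    sitemap_urls=["/a", "/b", "/c", "/d"],
    crawled_urls=["/a", "/b", "/filter?color=red"],
)
print(result["sitemap_coverage"])
```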

Server responses are critical for great Google Search visibility. While Google Search Console does offer an important glimpse into recent server responses, any data Google Search Console offers to website operators must be considered a representative, yet limited sample. Although this can be useful to identify egregious issues, with a server log analysis it is possible to analyze and identify all HTTP responses, including any quantitatively relevant non-200 OK responses that can jeopardize rankings. When such responses are excessive, they can be indicative of performance issues (e.g., 503 Service Unavailable returned during scheduled downtime).

Excessive non-200 OK server responses have a negative impact on organic search visibility.
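
Counting responses by status code is a straightforward way to quantify this. The Python sketch below assumes the bot records are dictionaries with an integer `status` key (a hypothetical structure) and reports the share of responses that were not 200 OK. What counts as excessive is left to the reader's judgment.

```python
from collections import Counter


def status_summary(records):
    """Return (counts per status code, share of non-200 responses)."""
    counts = Counter(r["status"] for r in records)
    total = sum(counts.values())
    non_200 = total - counts.get(200, 0)
    share = non_200 / total if total else 0.0
    return counts, share


records = [{"status": 200}, {"status": 200}, {"status": 404}, {"status": 503}]
counts, share = status_summary(records)
print(counts.most_common(), share)
```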

Where to get started

Despite the potential that server log analysis has to offer, most website operators do not take advantage of the opportunities presented. Server logs are either not recorded at all or regularly overwritten or incomplete. The overwhelming majority of websites do not retain server log data for any meaningful period of time. This is good news for any operators willing to, unlike their competitors, collect and utilize server log files for search engine optimization.

When planning server log data collection, it is worth noting which data fields at a minimum must be retained in the server log files in order for the data to be usable. The following list can be considered a guideline:

remote IP address of the requesting entity.
user agent string of the requesting entity.
request scheme (e.g., was the HTTP request for http or https or wss or something else).
request hostname (e.g., which subdomain or domain was the HTTP request for).
request path, which is often the file path on the server as a relative URL.
request parameters, which can be a part of the request path.
request time, including date, time and time zone.
request method.
response HTTP status code.
response timings.

When the request path is recorded as a relative URL, the hostname and scheme of the request are often missing from the server log files. It is therefore important to check with your IT department whether the request path is a relative URL and, if so, to make sure the hostname and scheme are recorded as well. An easy workaround is to record the entire request URL as one field, which includes the scheme, hostname, path and parameters in one string.
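
When the scheme, hostname, path and parameters are logged as separate fields, the full request URL can be rebuilt during analysis. The Python sketch below uses the standard library's `urlunsplit` for that. The function name `full_request_url` and the sample values are hypothetical.

```python
from urllib.parse import urlunsplit


def full_request_url(scheme, hostname, path, query=""):
    """Combine separately logged fields into one absolute URL string."""
    # The last tuple element is the fragment, which is never sent to the server.
    return urlunsplit((scheme, hostname, path, query, ""))


print(full_request_url("https", "www.example.ai", "/index.html", "page=2"))
# prints https://www.example.ai/index.html?page=2
```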

When collecting server log files, it is also important to include logs originating from CDNs and other third-party services the website may be using. Check with these third-party services about how to extract and save the log files on a regular basis.

Overcoming obstacles to server log analysis

Often, two main obstacles are put forward to counter the urgent need to retain server log data: cost and legal concerns. While both factors are ultimately determined by individual circumstances, such as budgeting and legal jurisdiction, neither has to pose a serious roadblock.

Cloud storage can be a long-term option, and physical hardware storage is also likely to cap the cost. With hard drives of approximately 20 TB retailing below $600, the hardware cost is negligible. Given that the price of storage hardware has been in decline for years, ultimately the cost of storage is unlikely to pose a serious challenge to server log recording.

Additionally, there will be a cost associated with the log analysis software or with the SEO audit provider rendering the service. While these costs must be factored into the budget, once more it is easy to justify in the light of the advantages server log analysis offers.

While this article is intended to outline the inherent benefits of server log analysis for SEO, it should not be considered legal advice. Such legal advice can only be given by a qualified attorney in the context of the legal framework and relevant jurisdiction. A number of laws and regulations such as GDPR/CCPA/DSGVO can apply in this context. Especially when operating from the EU, privacy is a major concern. However, for the purpose of a server log analysis for SEO, any user-related data is of no relevance. Any records that cannot be conclusively verified as search engine bot requests based on IP address are to be ignored.
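
A common way to verify a bot by IP address is a reverse DNS lookup followed by a confirming forward lookup. The Python sketch below illustrates that idea and is not the article's prescribed procedure. The hostname suffixes in `TRUSTED_SUFFIXES` are assumptions for illustration and should be checked against each search engine's current documentation. The lookup functions are injectable so the logic can be exercised without network access, and the default forward lookup covers IPv4 addresses only.

```python
import socket

# Assumed suffixes for illustration; confirm against each search engine's documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def verify_bot_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if the IP resolves to a trusted hostname that resolves back to it."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        if not hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
            return False
        # Forward confirmation guards against forged reverse DNS entries.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed DNS lookups (socket.herror and socket.gaierror).
        return False


# Offline demonstration with stubbed lookups and a documentation-range IP address.
stub_reverse = lambda addr: "crawl-192-0-2-10.googlebot.com"
stub_forward = lambda host: ["192.0.2.10"]
print(verify_bot_ip("192.0.2.10", stub_reverse, stub_forward))
```

Since DNS lookups are slow, results would typically be cached per IP address when processing large log files.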

With regard to privacy concerns, any log data that does not validate as a confirmed search engine bot must not be used and can instead be deleted or anonymized after a defined period of time, based on relevant legal recommendations. This tried and tested approach is being applied by some of the largest website operators on a regular basis.
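
As an illustration of the anonymization option, the Python sketch below truncates the IP address of every record that is not a verified bot. The prefix lengths (/24 for IPv4, /48 for IPv6), the record structure and the function names are assumptions, and whether truncation is sufficient in a given jurisdiction is a question for legal counsel.

```python
import ipaddress


def anonymize_ip(ip):
    """Zero out the host part of an IP address (assumed prefixes: /24 IPv4, /48 IPv6)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def scrub_records(records, is_verified):
    """Keep verified bot records as they are and anonymize the IP of all others."""
    scrubbed = []
    for record in records:
        if is_verified(record):
            scrubbed.append(record)
        else:
            scrubbed.append({**record, "ip": anonymize_ip(record["ip"])})
    return scrubbed


print(anonymize_ip("203.0.113.77"))
# prints 203.0.113.0
```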

When to get started

The major question remaining is when to start collecting server log data. The answer is now!

Server log data can only be applied in a meaningful way and lead to actionable advice if it is available in sufficient volume. The critical mass of server logs’ usefulness for SEO audits typically ranges between six and thirty-six months, depending on how large a website is and its crawl prioritization signals.

It is important to note that unrecorded server logs cannot be acquired at a later stage. Chances are that any efforts to retain and preserve server logs initiated today will bear fruit as early as the following year. Hence, collecting server log data must commence at the earliest possible time and continue uninterrupted for as long as the website is in operation and aims to perform well in organic search.

Source: Kaspar Szymanski, Jan 11, 2022