Julien says: “My insight is about using log files and analysing how search engine robots crawl your website. I know there are people who are very keen on doing this and very advanced at it, but even for smaller websites, it's sometimes very, very important.
As some of your audience will already have guessed, I come from France, and I still live in France. We've been doing log file analysis in France for probably 15-20 years already, so people have quite a lot of experience – especially on the enterprise SEO side of the business – but, on the rest of the SEO spectrum, it's a bit overlooked.
What we tend to say is that it's useful for larger websites when you've got tens of thousands or millions of pages, and that's true. It's very important on such large websites to be monitoring your crawl and making sure that not too many strange things are happening. But even on smaller websites, it's probably a good idea to try and grab your log files every now and then and see what's happening because you might find quite a lot of surprises.”
Let's start at the very beginning and just ask you, what are log files and what is log file analysis?
“Basically, the software hosting your website is called a web server, and it registers every request that's made to it. Whenever a user or a robot comes to download a web page to display in your browser, for example, it will have to request quite a lot of files: the actual HTML file, but also images, JavaScript, etc. All of this is registered in what we call log files.
Those are plain text files with several fields that register every part of the traffic that's coming to the server. To be honest, nowadays it's a bit more complex with CDNs, etc. It might be a bit trickier than this, but it's always the same logic. It's registering traffic that's coming to a website.”
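To make this concrete – a minimal sketch, not part of Julien's workflow – a raw access log line in the common Apache/NGINX "combined" format can be pulled apart into its fields with a few lines of Python. The sample line and regular expression here are illustrative; your server's configured format may differ:

```python
import re

# Assumes the common Apache/NGINX "combined" log format; adjust to your server's config.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# An illustrative log line, not real traffic.
sample = ('66.249.66.1 - - [12/Mar/2024:10:15:32 +0000] '
          '"GET /blog/some-article HTTP/1.1" 200 5123 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(sample)
if match:
    hit = match.groupdict()
    print(hit["path"], hit["status"], hit["user_agent"])
```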
Assuming that you're not using a CDN, how do you get access to your log files?
“I would suggest the easiest way is to either ask IT, your web developer, host service, etc. to give them to you. Otherwise – it's not easy to describe this – but they are always stored in the same kinds of places in your web server.
If you are using smaller hosting services, they often already provide those files in a specific place in your management account. The issue you might have is that, with some services – I'm thinking of a few, so I won't be pointing too many fingers.”
You can mention any software you want. It doesn't matter.
“Yeah, some services like HubSpot or Shopify, for example, do not provide log files, sadly (at least to my knowledge).
You might be blocked from accessing those files sometimes, but it's mostly edge cases and 90% of the time you can get those. You will also find that, in some countries, it's a legal requirement to store log files for at least a year. In some countries, if your host is telling you that they don't have your log files, they are breaking the law.”
You mentioned earlier on that it can be a bit more challenging if you use a CDN to access all your log files. What happens if you use something like Cloudflare? What would be the process to make sure that you access all the log files you need?
“If you are using a CDN, it will depend a lot on how it's configured. You can still have access to log files directly on your own web server, but it might not be the entirety of your traffic.
You can also get some information directly in your Cloudflare interface, for example. It's completely feasible to set up a daily export of the logs generated by the CDN. It won't be exactly the same format as what you might be used to on an Apache or NGINX server.”
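As a rough illustration (not a description of Julien's setup), CDN log exports such as Cloudflare's are often delivered as newline-delimited JSON rather than classic access-log lines. The file name and field names below (ClientRequestURI, EdgeResponseStatus, ClientRequestUserAgent) are assumptions – check which fields your own export is actually configured to include:

```python
import gzip
import json

# Sketch for a newline-delimited JSON export from a CDN. The file name and the
# field names used below are assumptions; verify them against your own export.
googlebot_hits = []
with gzip.open("cdn-logs-2024-03-12.json.gz", "rt") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        if "Googlebot" in record.get("ClientRequestUserAgent", ""):
            googlebot_hits.append({
                "url": record.get("ClientRequestURI"),
                "status": record.get("EdgeResponseStatus"),
            })

print(f"{len(googlebot_hits)} Googlebot requests in this export")
```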
I'm sorry for jumping in. If you have certain elements of your page that are delivered by a CDN, and other elements delivered by your own web server, do you attempt to tie the log files together?
“It's probably a bit of an edge case, but I think you would probably end up using the CDN files because, most of the time, that's what registers the entirety of the traffic. In fact, that's what we use CDNs for – to cache traffic and make sure that our web servers are not getting too many requests.
Again, it's very specific to how things are set up and, nowadays, everybody has a different way of doing stuff like this. I can't give many generic answers on this. The best idea is to get help from your IT department.”
What software do you use to analyse your log files?
“Nowadays we've got plenty of tools available on the market. I'm eager to use two French SEO tools, which are Oncrawl and Botify. Especially Oncrawl, because I've been using mainly this tool for many, many years and I know it quite well now.
There are tonnes of other tools on the market nowadays. I'm always coming back to those two because they are probably the most experienced, and also the ones with which I've got the most experience myself. It's always easier to find your way around.
But, since those log files are simple text files, you can also analyse them with any other solution. For example, back in the day, we did not have either Botify or Oncrawl and we used our terminal or the console on our computers to analyse this.
Nowadays, if I was to do it by myself, I would probably use a Python script with a pandas DataFrame. There are tonnes of solutions.”
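For a sense of what that Python-and-pandas approach might look like – a sketch with made-up sample data, not Julien's actual script – once log lines are parsed into dictionaries you can load them into a DataFrame and start asking questions:

```python
import pandas as pd

# "hits" stands in for parsed log lines (one dict per request); in practice you
# would build this list from your real access logs.
hits = [
    {"path": "/blog/some-article", "status": 200, "user_agent": "Googlebot/2.1"},
    {"path": "/old-page",          "status": 301, "user_agent": "Googlebot/2.1"},
    {"path": "/",                  "status": 200, "user_agent": "Mozilla/5.0"},
]
df = pd.DataFrame(hits)

# Keep only requests that declare themselves as Googlebot.
googlebot = df[df["user_agent"].str.contains("Googlebot", na=False)]

# Which URLs Googlebot requests most, and which status codes it gets back.
print(googlebot["path"].value_counts().head(20))
print(googlebot["status"].value_counts())
```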
I guess it depends on the size of the site as well, because if you're using Oncrawl, then you're probably using a cloud service so you can probably crawl larger websites.
“That’s an important point. As your website's size and traffic grow, it becomes more and more important to get proper software to analyse your data, just because, by yourself, you might have trouble handling that much data.
In that particular context, I think cloud services are the best option. As you mentioned, those tools also have a crawler. It's very interesting because you can compare what you see when you crawl your own website with what search engines would see, for example.
In fact, that's what the core of log analysis is – to understand whether the vision you've got of your own website and where you want users and bots to go is what's actually happening on your website.”
What discrepancies do you tend to see when you're comparing your own crawls with crawls that you see from search engines, and what do you do about those discrepancies?
“First of all, I would take a step back. The first issue you might face is the actual amount of data you get. Because, as we said, we've got CDNs, different servers, etc., it's sometimes tricky to get the entire flow of data.
So, the first step I would take is to make sure that the number of Googlebot hits I see in a file is consistent with what I can see in Google Search Console, for example. It won't be the exact same numbers, of course, but if it's within 2-5% it's okay.
If that’s not the case, you've got a problem somewhere, so you have to dive a bit deeper and understand what's happening. That would be my first big piece of advice: make sure that the files you are analysing are realistic and are making sense.”
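That sanity check can be as simple as comparing two totals for the same period; the numbers below are invented for illustration:

```python
# Compare Googlebot hits counted in your logs with the crawl requests reported
# by Search Console for the same period. Both figures here are made up.
hits_in_logs = 48_700
hits_in_search_console = 50_200

difference = abs(hits_in_logs - hits_in_search_console) / hits_in_search_console
print(f"Difference: {difference:.1%}")  # ~3.0% here, within the 2-5% comfort zone

if difference > 0.05:
    print("More than 5% apart - check whether you are missing (or duplicating) a log source.")
```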
What would likely be the error here? Would there be more crawl traffic that you're discovering through your log file analysis compared with traffic through Search Console or the other way around? What would typically be causing that issue?
“Hopefully you've got the right amount of data the first time, and that's usually the case. What often happens is that you don't get enough data because you were probably not looking at the right server.
We talked earlier about a CDN caching traffic and not showing everything to the server. If you are grabbing the log files from the server, you might not get the entire logs.
What happens sometimes, of course, is the other way around: you get more data in your log files than you are supposed to find. Most of the time when this happens, it's because your web server is not configured the right way and you are registering a single request twice, for example.”
What would be one or two examples of how log file analysis can help to direct SEO strategy?
“I think one of the first key elements would be to see whether (especially Googlebot, because that's the one we are mostly focusing on) Googlebot is seeing errors on your website. For errors, I'm talking about anything that's not a 200 OK HTTP response code – either redirects, client errors, or server errors.
With redirects, it's normal to get 301s, etc., but you have to make sure that the internal links on your website are pointing directly to the final URL (if you change URLs, for example), and not to a redirected one. That's one of the classic mistakes that we can make.
Then, regarding other kinds of errors, it's always important to try and fix those, and also to make sure that they don't end up being the vast majority of the crawl – especially server errors (every error that starts with a 5, like 503, etc.). When the rate of those errors is increasing, you might end up getting de-indexed by Google and not showing up in the results. This can be very bad for your traffic, of course.
It's easy to recover if you are fast enough to fix the problem, but it's one of those topics where, when you're working on larger websites, you have to be very consistent with looking at what's going on and making sure that there are no issues.”
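A hypothetical sketch of that kind of breakdown, assuming a DataFrame of Googlebot hits with a numeric status column as in the earlier example:

```python
import pandas as pd

# Sample Googlebot responses for illustration only.
googlebot = pd.DataFrame({"status": [200, 200, 301, 404, 503, 200, 500]})

# Group responses into classes: 2xx, 3xx, 4xx, 5xx.
status_class = (googlebot["status"] // 100).astype(str) + "xx"
breakdown = status_class.value_counts(normalize=True).sort_index()
print(breakdown)

# A rising share of 5xx responses is the signal to watch for.
server_error_rate = (googlebot["status"] >= 500).mean()
print(f"Server error rate: {server_error_rate:.1%}")
```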
How often should you be crawling your site and conducting log file analysis?
“That's a good question. I think it depends on the context. We might have strategies for crawling that might be surprising sometimes. For the vast majority of websites, I'd recommend having at least a monthly crawl.
With or without log file analysis, you will still find many things, if there are issues. If there are no issues, it's still a good thing to just make sure that things are evolving the way you want them to evolve.
Regarding log files, if you are working on a website that's larger than, let's say, a hundred thousand pages, I would recommend setting up log file monitoring – basically, grabbing daily files and feeding them into cloud-based software. If it's a smaller website, you can probably get away with just auditing your files every now and then – maybe once a year is probably enough.”
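As a rough idea of what such monitoring could look like at its simplest – an illustrative sketch rather than a recommendation of any particular tool – you could aggregate Googlebot hits per day and watch the trend:

```python
import pandas as pd

# Aggregate Googlebot hits per day; "googlebot" is assumed to have a parsed
# "time" column, as in the earlier sketches. Sample data for illustration only.
googlebot = pd.DataFrame({
    "time": pd.to_datetime(["2024-03-11 10:15", "2024-03-11 12:40", "2024-03-12 09:05"]),
    "status": [200, 301, 200],
})

daily_hits = googlebot.set_index("time").resample("D").size()
print(daily_hits)
```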
You've shared what SEOs should be doing in 2024. Now let's talk about what SEOs shouldn't be doing. What's something that's seductive in terms of time, but ultimately counterproductive? What's something that SEOs shouldn't be doing in 2024?
“I'm quite sure it's not just about SEOs, but I think one of our major issues right now is being a bit too confident about what AI agents can provide. I mean, I'm totally pro using ChatGPT and other models like this to help us daily. In fact, I've been using some of those models for years now, especially in the coding part of my job.
It's very useful and it can help you get things done a lot faster, but you still have to make sure you are not too confident about what's going on with those. People might have heard about models hallucinating, for example. That’s where they are basically inventing stuff, or just putting words and numbers together because the sentence makes sense when you read it – but it often ends up not being actual, reliable data.
There are workarounds being developed that can help you get more reliable data or results on a larger scale. But, if you do not take those additional steps, and you want to use AI on a larger scale, you will probably quite quickly end up with text that makes no real sense to people who know the actual topic, for example.
That would be my main concern about SEO, and our broader society really, for the coming time.”
Julien Deneuville is Senior SEO Manager at Turo, and you can find him over at Turo.com.