Minimalistic User Statistics with Matomo's Server Log Analytics
Privacy and data protection are becoming more and more important issues for web developers. A lot of people were reminded of this last month, when the Austrian Data Protection Authority ruled that Austrian websites using Google Analytics are in violation of the GDPR.
Now it's easy to blame the EU and their GDPR for all the trouble. But the ruling raises a valid concern: by including a simple script, we allow Google to track our users' every move - not just on our site but across many others - and collect unfathomable amounts of data about them.
Most of this data is never used by us to improve the user experience or gain any insight into our users' behavior, but Google collects it nonetheless. It's the price we pay for a free service - but in the end it's the price our users are paying. Luckily, awareness of this issue has been increasing for a while.
Are there Alternatives to Google Analytics?
There are good reasons why so many people use Google Analytics: it's free, it's powerful, and adding it to a website only takes a few seconds - and last but not least it's seemingly used on every other website as well.
Use a Privacy-Friendly Analytics Provider
But even though it may at times seem like it, Google is not the only player in this space. There are other very capable options that can be used just as easily without violating the GDPR and your users' right to privacy. Just pop in their script instead of Google's and you're done.
Here are three that I have used and would recommend:
Plausible (paid) is made and hosted in the EU, doesn't use cookies and doesn't collect personally identifiable information about your users
Fathom (paid) is a Canadian company with a separate EU-based infrastructure for GDPR compliance, which doesn't rely on cookies and doesn't collect data on individual visitors
Matomo (paid/free) offers both cloud hosted (with servers in Germany/EU) as well as self-hosted analytics, letting you decide what data you want to collect and whether you want to use cookies and/or anonymize any user data
Yes, some of those tools cost money. But that's actually a good thing because it means the companies behind them try to deliver a great product that works for everyone, instead of secretly reselling the data they collected on your site.
Only gather the Data you actually need
Now, I should point out that while these are very capable tools, none of them give you the wealth of data that Google Analytics provides. Turning people's personal information into advertising dollars is Google's business model, after all.
But it might be worth the time to take a step back and look at what data you actually need for your business goals.
Is it important to you to know which blog posts one individual user read? Does it help your business to know that a user is 29-45 years old and also interested in basketball? When is the last time you even looked at 98% of the data that Google is collecting for you (and themselves)?
In the end there are probably only a few key metrics you actually need and analyze. Which pages are visited how often, where are visitors coming from, and are they converting? Have a look at your processes and see what other information you need - and check whether any of the tools above can provide that without sending all your users' data to Google.
Self-hosted Matomo: a great Option for Personal Sites
Now what information do I care about on this personal site you are currently reading? Certainly not your age, location, or favorite musical, that's for sure. All I would really like to know boils down to:
is anybody even reading this?
if so, which posts are the most popular?
how are people finding those posts?
That's it. Any other reports I might look at every now and then when I'm trying to avoid actual work, but otherwise they would just take up server space. I simply have no use for that information and it wouldn't change anything about how I run this site.
If you're managing a sales or marketing oriented site, your needs will obviously be different, so you will have to figure out the best fit yourself. For my needs, any of the tools above is more than enough. Which is why I went with the most basic solution: Matomo.
Advantage 1: Self-hosted Analytics
As mentioned above, Matomo can be hosted in the cloud or on your own server. Using your own infrastructure not only lets you control where the data is stored (making sure nobody else can access it), but also means that you can use Matomo for free, which is always great for small sites that don't generate a lot of income.
Advantage 2: Server Log Analytics
Like all the other tools, Matomo comes with a tracking script that you can include on your site. And since you can run it under a subdomain of your choosing, it's a lot harder for ad blockers and other browser extensions to block than a script hosted on Google's servers.
But what's even better is that you can ditch the tracking script altogether and instead run server-side analytics. So instead of JavaScript sending data to Matomo, it's your server logs that are being parsed and turned into analytics data.
This has a couple of benefits:
it doesn't collect any more data than what your server already does anyway
there is no additional load time for a tracking script, making your site faster
since it doesn't use JavaScript, no ad blocker or other tool can disable it, meaning you really see every click to your website
Those are some interesting advantages, which is why I decided to set it up for this site and see if it's something I can work with.
Setting up Matomo Server Log Analytics
Downloading and installing Matomo on your server takes only a few minutes. Afterwards you can create a user, log in to the backend and set up the site you want to track. By default it gives you instructions to set up the JavaScript tracking script, but that can be ignored if you want to use the server log method.
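For reference, the download-and-unpack step can look something like this; the target directory is just an example, and the zip URL is the one Matomo publishes for its latest release:

```
# grab the latest Matomo release and unpack it into the web root
# (the /var/www/ path is an example - use whatever directory your web server serves from)
wget https://builds.matomo.org/matomo.zip
unzip matomo.zip -d /var/www/
# then open the new matomo/ directory in a browser and follow the installation wizard
```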
Instead you will have to use the Python script that Matomo provides, which can be found at web/misc/log-analytics/import_logs.py inside your installation, to read your log file. The command is fairly simple - it uses Python to execute the script, naming the site ID, the number of processes you want to use, some flags to enable tracking for errors and redirects, as well as the path of the access log.
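Put together, the invocation might look roughly like this - the Matomo path, the analytics URL, the site ID and the log file location are all placeholders for your own setup:

```
# import the access log into Matomo (all paths and URLs below are examples):
#   --url        base URL of your Matomo installation
#   --idsite     ID of the site you set up in the backend
#   --recorders  number of parallel processes to use
python /var/www/matomo/misc/log-analytics/import_logs.py \
  --url=https://stats.example.com \
  --idsite=1 \
  --recorders=2 \
  --enable-http-errors \
  --enable-http-redirects \
  /var/log/nginx/access.log
```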
You can simply SSH into your server and execute this command manually. But for useful analytics you will want to run it on a regular basis instead, so you don't miss out on any data.
The easiest way to do this is to set up a cron task running at an interval. How this is done depends on your hosting provider and operating system, so I'll leave it to you to google that one. Along with the command to be executed, you also need to set the execution time - I used 0 */6 * * * to execute the script every six hours, on the hour.
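As a sketch, the crontab entry could look like the following - again, the script and log paths as well as the Matomo URL are placeholders:

```
# run the Matomo log import every six hours, on the hour
0 */6 * * * python /var/www/matomo/misc/log-analytics/import_logs.py --url=https://stats.example.com --idsite=1 --recorders=2 --enable-http-errors --enable-http-redirects /var/log/nginx/access.log > /dev/null 2>&1
```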
Getting Data from the Server Access Log
To see what data we will have available after scanning the server logs, let's take a look at a few sample entries of the kind you'd find in a server's access log:
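The lines below are placeholders in the widely used combined log format rather than real visitors, but they show the shape of the data:

```
203.0.113.42 - - [05/Feb/2022:14:32:17 +0000] "GET / HTTP/1.1" 200 5326 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
203.0.113.17 - - [05/Feb/2022:14:35:02 +0000] "GET /blog/some-post/ HTTP/1.1" 200 8412 "https://t.co/" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_2 like Mac OS X) ..."
```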
As you can tell, the data is very limited. Here's what we get, in the above order:
the user's IP address
timestamp of the request
the resource that was requested (e.g. the front page /)
HTTP response code
number of bytes sent to fulfill the request
where the user came from (referrer)
user agent string, very useful to filter out bot requests
It's especially noteworthy that the referrer doesn't tell you much about how visitors got there - you can see that a user came from Google or Twitter, but not what they searched for or which tweet they clicked on.
Some of this can be compensated for with additional tools (e.g. Google Search Console to get the search terms), but it's definitely something that will make your life a bit harder if you rely on SEO and social media for your site.
Is Server Log Analytics enough?
This is something I might have been willing to deal with. However, I did find one more limitation that finally made it a bit too cumbersome to use as my only analytics tool: the inability to dynamically filter out my own requests.
With a JS-based tracking system I can usually set a cookie or install a browser extension so my own clicks on my site aren't counted. At the very least I can configure my ad blocker to not load the tracking script.
But server logs don't offer that option - and Matomo only lets you exclude users by IP address, which changes frequently for most people. This is my personal site, so a good portion of visits are by me - analytics that include my own clicks aren't of much value to me.
So as much as I love the idea of limiting my analytics to the server logs, I ended up going back to using the tracking script provided by Matomo. It's still hosted on my own server, and I decided not to use cookies and to anonymize IPs, so at least my users' data is a lot safer than with Google.
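For reference, a minimal sketch of what that embed looks like with cookies disabled - the stats.example.com subdomain and the site ID are placeholders, and IP anonymization is configured in Matomo's privacy settings rather than in the snippet itself:

```html
<script>
  // standard Matomo tracking snippet, served from my own subdomain (placeholder URL)
  var _paq = window._paq = window._paq || [];
  _paq.push(['disableCookies']);      // don't set any tracking cookies
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u = "https://stats.example.com/";
    _paq.push(['setTrackerUrl', u + 'matomo.php']);
    _paq.push(['setSiteId', '1']);
    var d = document, g = d.createElement('script'), s = d.getElementsByTagName('script')[0];
    g.async = true; g.src = u + 'matomo.js';
    s.parentNode.insertBefore(g, s);
  })();
</script>
```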