Access Log in Apache – Learn how to parse it

If you run Apache on your server and have a large access log, great. Otherwise, search Google for sample logs; a surprising number are publicly available.

inurl:access.log filetype:log

Goaccess Installation

echo "deb $(lsb_release -cs) main" | sudo tee -a /etc/apt/sources.list.d/goaccess.list
wget -O - | sudo apt-key add -
sudo apt-get update
sudo apt-get install goaccess

Run goaccess on the Access Log

goaccess -f access.log

Then select the log format with the space bar:

[Image: goaccess format]

Now you have all data available to review:

[Image: goaccess screenshot]

Use awk to Parse Access Log

awk is better suited to scripting. Here are a few examples that extract particular fields from the file (the field numbers assume the standard common/combined log format, split on whitespace):

awk '{print $1}' access.log         # ip address (%h)
awk '{print $2}' access.log         # RFC 1413 identity (%l)
awk '{print $3}' access.log         # userid (%u)
awk '{print $4, $5}' access.log     # date/time (%t)
awk '{print $9}' access.log         # status code (%>s)
awk '{print $10}' access.log        # size (%b)
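
The awk one-liners above print one whitespace-separated field per line. As a rough sketch of the same idea in Python, the snippet below counts how often each value of a field occurs, which is handy for questions like "which IP made the most requests?". The helper name count_field and the three sample log lines are made up for illustration; in practice you would pass an open file instead of the list.

```python
from collections import Counter


def count_field(lines, index):
    """Count the values of one whitespace-separated field (0-based index)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > index:  # ignore short or empty lines
            counts[fields[index]] += 1
    return counts


# Made-up sample lines in common log format (illustration only).
sample = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 209',
    '10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 1024',
]

ip_counts = count_field(sample, 0)      # same field as awk $1
status_counts = count_field(sample, 8)  # same field as awk $9

print(ip_counts.most_common(1))
print(status_counts)
```

Note that awk fields are numbered from 1 while Python list indexes start at 0, so awk's $9 is index 8 here.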

Python Script for Parsing Access Log:

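You can also use Python to parse the log. The script below matches each line against a regular expression for the combined log format and writes the captured fields to a CSV file.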

import csv
import re

log_file_name = "access.log"
csv_file_name = "parsed.csv"

parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.+)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

# Fields are separated by whitespace; allow trailing whitespace before end of line
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

# newline='' lets the csv module handle line endings itself
with open(log_file_name) as log_file, open(csv_file_name, 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['host', 'user', 'time', 'request', 'status', 'size', 'referer', 'user agent'])

    for line in log_file:
        m = pattern.match(line)
        if m is None:
            continue  # skip lines that are not in combined log format
        csv_out.writerow(m.groups())
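
The regular expression captures every field as a string, so some conversion is usually needed before doing any analysis. Here is a minimal sketch of that step. The function name normalize and the sample dictionary are hypothetical; the dictionary keys mirror the group names in the pattern, so the result of m.groupdict() could be passed in directly. It assumes the default Apache timestamp format and an English locale for the month abbreviation.

```python
from datetime import datetime


def normalize(fields):
    """Convert raw captured strings into typed values (illustrative helper)."""
    # A well-formed request looks like "GET /path HTTP/1.1"; keep others whole.
    parts = fields["request"].split()
    if len(parts) == 3:
        method, path, protocol = parts
    else:
        method, path, protocol = None, fields["request"], None

    return {
        # Default %t format, e.g. 10/Oct/2023:13:55:36 +0000
        "time": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(fields["status"]),
        # %b logs '-' when no body was sent; treat that as 0 bytes
        "size": 0 if fields["size"] == "-" else int(fields["size"]),
    }


# Made-up values for illustration only.
raw = {
    "time": "10/Oct/2023:13:55:36 +0000",
    "request": "GET /index.html HTTP/1.1",
    "status": "200",
    "size": "-",
}
row = normalize(raw)
print(row["method"], row["path"], row["status"], row["size"])
```

Treating '-' as 0 is a design choice that keeps the size column numeric; if you need to tell "no body" apart from "empty body", use None instead.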