If you run Apache on your server and have a large access log, great. Otherwise, you can find sample logs with a Google search. It’s amazing how many are publicly available.
inurl:access.log filetype:log
GoAccess Installation
echo "deb http://deb.goaccess.io/ $(lsb_release -cs) main" | sudo tee -a /etc/apt/sources.list.d/goaccess.list
wget -O - https://deb.goaccess.io/gnugpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install goaccess
Run GoAccess on the Access Log
goaccess -f access.log
Then select the log format with the space bar and press Enter to confirm.
Now all of the data is available to review.
Use awk to Parse Access Log
Awk is more useful for scripting. Here are some examples of how to extract particular fields from the file (awk splits each line on whitespace, so the field numbers assume the standard Apache log layout):
awk '{print $1}' access.log # ip address (%h)
awk '{print $2}' access.log # RFC 1413 identity (%l)
awk '{print $3}' access.log # userid (%u)
awk '{print $4, $5}' access.log # date/time (%t), split across two fields
awk '{print $9}' access.log # status code (%>s)
awk '{print $10}' access.log # size (%b)
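The same whitespace splitting can be sketched in Python when you want to count values rather than just print them. This is a minimal illustration, not a replacement for awk: the helper name field_counts and the sample lines are made up for the example, and the index is 0-based, so awk's $9 becomes index 8.

```python
from collections import Counter

def field_counts(lines, index):
    """Count values of the whitespace-separated field at a 0-based index."""
    counts = Counter()
    for line in lines:
        fields = line.split()          # same default splitting as awk
        if len(fields) > index:        # ignore short or blank lines
            counts[fields[index]] += 1
    return counts

# Made-up sample lines in the standard Apache layout (illustration only)
sample = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /b.gif HTTP/1.0" 404 209',
    '10.0.0.2 - - [10/Oct/2000:13:55:38 -0700] "GET /a.gif HTTP/1.0" 200 2326',
]

status_counts = field_counts(sample, 8)   # awk's $9, the status code
ip_counts = field_counts(sample, 0)       # awk's $1, the client address
print(status_counts.most_common())
print(ip_counts.most_common())
```

To run it on a real file, pass an open file object instead of the list, for example field_counts(open("access.log"), 8). Like the awk one-liners, this assumes the request line contains no extra spaces that would shift the field positions.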
Python Script for Parsing Access Log:
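You can also use Python to parse the log. The script below matches each line against a regular expression for the Apache combined log format and writes the captured fields to a CSV file.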
import csv
import re

log_file_name = "access.log"
csv_file_name = "parsed.csv"

# One regex fragment per field of the Apache combined log format
parts = [
    r'(?P<host>\S+)',        # host %h
    r'\S+',                  # ident %l (unused)
    r'(?P<user>\S+)',        # user %u
    r'\[(?P<time>.+)\]',     # time %t
    r'"(?P<request>.+)"',    # request "%r"
    r'(?P<status>[0-9]+)',   # status %>s
    r'(?P<size>\S+)',        # size %b (careful, can be '-')
    r'"(?P<referer>.*)"',    # referer "%{Referer}i"
    r'"(?P<agent>.*)"',      # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

# newline='' lets the csv module control line endings itself
with open(log_file_name) as log, open(csv_file_name, 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['host', 'user', 'time', 'request', 'status', 'size', 'referer', 'user agent'])
    for line in log:
        m = pattern.match(line)
        if m is None:
            continue  # skip lines that are not in the combined format
        csv_out.writerow(m.groups())
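Once the log is in CSV form, it is easy to summarise. As a rough sketch, the following totals the bytes sent per host. The function name bytes_per_host and the sample rows are made up for illustration; the only thing taken from the script above is the column header layout of parsed.csv.

```python
import csv
import io
from collections import Counter

def bytes_per_host(csv_file):
    """Sum the 'size' column per host; '-' (no body sent) counts as 0."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        size = row['size']
        totals[row['host']] += int(size) if size.isdigit() else 0
    return totals

# Made-up rows using the same header as parsed.csv (illustration only)
sample = io.StringIO(
    "host,user,time,request,status,size,referer,user agent\n"
    "127.0.0.1,-,10/Oct/2000:13:55:36 -0700,GET /a.gif HTTP/1.0,200,2326,-,curl\n"
    "127.0.0.1,-,10/Oct/2000:13:55:37 -0700,GET /b.gif HTTP/1.0,304,-,-,curl\n"
    "10.0.0.2,-,10/Oct/2000:13:55:38 -0700,GET /a.gif HTTP/1.0,200,100,-,curl\n"
)

totals = bytes_per_host(sample)
for host, total in totals.most_common():
    print(host, total)
```

For the real output, open the file with open("parsed.csv", newline="") and pass the file object to bytes_per_host in place of the in-memory sample.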