Tuesday 5 February 2013

awk for apache log


One thing I've never seen anyone else do, for reasons that I can't imagine, is to change the Apache log file format to a more easily parseable version with the information that actually matters to you.

For example, we never use HTTP basic auth, so we don't need to log those fields. I am interested in how long each request takes to serve, so we'll add that in. For one project, we also want to know (on our load balancer) if any servers are serving requests slower than others, so we log the name of the server we're proxying back to.

Here's an excerpt from one server's apache config:
# We don't want to log bots, they're our friends
BrowserMatch Pingdom.com robot

# Custom log format, for testing
#
#         date          proto   ipaddr  status  time    req     referer         user-agent
LogFormat "%{%F %T}t    %p      %a      %>s     %D      %r      %{Referer}i     %{User-agent}i" standard
CustomLog /var/log/apache2/access.log standard env=!robot
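
The excerpt above doesn't show the backend field I mentioned. On the load balancer itself, one way to capture which server handled the request is mod_proxy_balancer's BALANCER_WORKER_NAME environment variable (a sketch, assuming that module is in use; the %{...}e syntax logs an environment variable):

#         date          proto   ipaddr  status  time    worker  req     referer         user-agent
LogFormat "%{%F %T}t    %p      %a      %>s     %D      %{BALANCER_WORKER_NAME}e        %r      %{Referer}i     %{User-agent}i" balancer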

What you can't really tell from this is that between each field is a literal tab character (\t). This means that if I want to do some analysis in Python, say to show non-200 statuses, I can do this:

for line in file("access.log"):
  fields = line.rstrip("\n").split("\t")  # split on the literal tabs
  if fields[3] != "200":  # field 4 is the status code
    print line,

Or if I wanted to do 'who is hotlinking images?' it would be

if line[6] in ("","-") and "/images" in line[5]:
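
Spelled out as a complete command, the same check in awk would be something like this (a sketch; with this format, awk field 7 is the referer and field 6 the request line):

awk -F'\t' '($7 == "" || $7 == "-") && index($6, "/images")' access.log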

For IP counts in an access log, the usual regex-based approach:

cat log | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | sort -n | uniq -c | sort -n

becomes something like this (tab is already cut's default delimiter, and uniq -c needs its input sorted first):

cat log | cut -f 3 | sort | uniq -c | sort -n

Easier to read and understand, and far less computationally expensive (no regex), which, on 9 GB logs, makes a huge difference in how long it takes. Where this gets REALLY neat is when you want to do the same thing for user agents. If your logs are space-delimited, you have to do some regular expression matching or string searching by hand. With this format, it's simple:

cat log | cut -f 8 | sort | uniq -c | sort -n

Exactly the same as the above. In fact, any summary you want to do is essentially the same command.

Why on earth would I spend my system's CPU on awk and grep when cut will do exactly what I want orders of magnitude faster?


How do I use awk pattern scanning and processing language under bash scripts? Can you provide a few examples?

Awk is an excellent tool for building UNIX/Linux shell scripts. AWK is a programming language designed for processing text-based data, either in files or data streams, or through shell pipes. In other words, you can combine awk with shell scripts or use it directly at a shell prompt.
Print a Text File

awk '{ print }' /etc/passwd
OR
awk '{ print $0 }' /etc/passwd
Print Specific Field

Use : as the input field separator and print the first field only, i.e. usernames (all other fields are ignored):
awk -F':' '{ print $1 }' /etc/passwd
Send output to sort command using a shell pipe:
awk -F':' '{ print $1 }' /etc/passwd | sort
Pattern Matching

A line is printed only if the pattern matches. For example, display all lines from an Apache log file where the HTTP status code is 500 (in the default combined log format, the 9th field holds the status code of each request):
awk '$9 == 500 { print $0}' /var/log/httpd/access.log
The part outside the curly braces is called the "pattern", and the part inside is the "action". The comparison operators include the ones from C (awk also supports C's ternary ?: operator):

== != < > <= >=

If no pattern is given, the action applies to all lines. If no action is given, the entire line is printed, and a bare "print" does the same. Thus, the following are equivalent:
awk '$9 == 500 ' /var/log/httpd/access.log
awk '$9 == 500 {print} ' /var/log/httpd/access.log
awk '$9 == 500 {print $0} ' /var/log/httpd/access.log
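Patterns can combine these operators; for example, a sketch that matches every 5xx server error:
awk '$9 >= 500 && $9 < 600' /var/log/httpd/access.log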
Print Lines Containing tom, jerry OR vivek

Print lines that match any one of several alternative patterns:
awk '/tom|jerry|vivek/' /etc/passwd
Print 1st (or Nth) Line From File

awk "NR==1{print;exit}" /etc/resolv.conf
awk "NR==$line{print;exit}" /etc/resolv.conf
Simple Arithmetic

To get the sum of all the numbers in a column:
awk '{total += $1} END {print total}' earnings.txt
Shell cannot calculate with floating point numbers, but awk can:
awk 'BEGIN {printf "%.3f\n", 2005.50 / 3}'
Call AWK From Shell Script

Here is a shell script that lists all IP addresses accessing your website. It uses awk to process the log file, while the argument checking is done with ordinary shell commands.

#!/bin/bash
# Count hits per client IP address in a domain's Apache access log.
d=$1
OUT=/tmp/spam.ip.$$
HTTPDLOG="/www/$d/var/log/httpd/access.log"
[ $# -eq 0 ] && { echo "Usage: $0 domain-name"; exit 1; }
if [ -f "$HTTPDLOG" ]
then
    awk '{ print }' "$HTTPDLOG" > "$OUT"                       # copy the log
    awk '{ print $1 }' "$OUT" | sort -n | uniq -c | sort -n    # count per IP (field 1)
else
    echo "$HTTPDLOG not found. Make sure the domain exists and is set up correctly."
fi
/bin/rm -f "$OUT"
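
Assuming the script were saved as topips.sh (a name made up for this example), you would run it with the domain name as the argument:

chmod +x topips.sh
./topips.sh example.com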

AWK and Shell Functions

Here is another example. chrootCpSupportFiles() finds the shared libraries required by a program (such as perl or php-cgi) or by a shared library given on the command line, and copies them to the destination. The code calls awk to print selected fields from the ldd output:


chrootCpSupportFiles() {
# Set CHROOT directory name
local BASE="$1"         # JAIL ROOT
local pFILE="$2"        # binary whose library dependencies we copy

[ ! -d "$BASE" ] && mkdir -p "$BASE"

# Field 3 of ldd's output is the resolved library path;
# skip lines such as the vdso that start with "("
FILES="$(ldd "$pFILE" | awk '{ print $3 }' | egrep -v '^\(')"
for i in $FILES
do
  dcc="$(dirname "$i")"
  [ ! -d "$BASE$dcc" ] && mkdir -p "$BASE$dcc"
  /bin/cp "$i" "$BASE$dcc"
done

# The dynamic loader (ld-linux) appears in field 1; copy it as well
sldl="$(ldd "$pFILE" | grep 'ld-linux' | awk '{ print $1 }')"
sldlsubdir="$(dirname "$sldl")"
if [ ! -f "$BASE$sldl" ]
then
        /bin/cp "$sldl" "$BASE$sldlsubdir"
fi
}

This function can be called as follows:
chrootCpSupportFiles /lighttpd-jail /usr/local/bin/php-cgi
AWK and Shell Pipes

List your top 10 favorite commands:
history | awk '{print $2}' | sort | uniq -c | sort -rn | head
Sample Output:

   172 ls
   144 cd
    69 vi
    62 grep
    41 dsu
    36 yum
    29 tail
    28 netstat
    21 mysql
    20 cat

Another shell pipe: pull a domain's expiration date out of whois output (the field numbers depend on the registrar's output format):
whois cyberciti.com | awk '/Domain Expiration Date:/ { print $6"-"$5"-"$9 }'
Awk Program File

You can put all awk commands in a file and call the same from a shell script using the following syntax:
awk -f myprogram.awk input.txt
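For instance, a small program file might look like this (myprogram.awk and its contents are just an illustration):

# myprogram.awk: print the first field of every line,
# then report how many lines were read
{ print $1 }
END { print NR, "lines processed" }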
Awk in Shell Scripts - Passing Shell Variables to Awk

You can pass shell variables to awk using the -v option:


n1=5
n2=10
echo | awk -v x=$n1 -v y=$n2 -f program.awk

This assigns the value of n1 to the awk variable x (and n2 to y) before execution of the program begins. Such variable values are available inside the BEGIN block of an AWK program. Here program.awk contains:

BEGIN{ ans = x + y }   # x and y are set by -v before any input is read
{ print ans }          # runs once for the single empty line supplied by echo
END{}
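
For a quick calculation like this you don't need a program file at all; the same sum as a one-liner (a sketch):

awk -v x=5 -v y=10 'BEGIN { print x + y }'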
