One thing I've never seen anyone else do, for reasons that I can't imagine, is to change the Apache log file format to a more easily parseable version with the information that actually matters to you.
For example, we never use HTTP basic auth, so we don't need to log those fields. I am interested in how long each request takes to serve, so we'll add that in. For one project, we also want to know (on our load balancer) if any servers are serving requests slower than others, so we log the name of the server we're proxying back to.
Here's an excerpt from one server's apache config:
# We don't want to log bots, they're our friends
BrowserMatch Pingdom.com robot
# Custom log format, for testing
#
# date port ipaddr status time req referer user-agent
LogFormat "%{%F %T}t %p %a %>s %D %r %{Referer}i %{User-agent}i" standard
CustomLog /var/log/apache2/access.log standard env=!robot
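The excerpt above is from an ordinary web server. For the load-balancer case mentioned earlier, one way to record the backend (a sketch using mod_proxy_balancer, which exports the BALANCER_WORKER_NAME environment variable; adjust for whatever proxy module you actually use) is to tack an %{...}e field onto the end:
# balancer variant: last field is the backend worker we proxied to
LogFormat "%{%F %T}t %p %a %>s %D %r %{Referer}i %{User-agent}i %{BALANCER_WORKER_NAME}e" balancer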
What you can't really tell from this is that between each field is a literal tab character (\t). This means that if I want to do some analysis in Python, maybe show non-200 statuses for example, I can do this:
for line in file("access.log"):
    line = line.split("\t")
    if line[3] != "200":
        print line
Or if I wanted to ask 'who is hotlinking images?' (image requests whose referer is some other site; substitute your own domain for example.com) it would be
if line[6] not in ("", "-") and "example.com" not in line[6] and "/images" in line[5]:
For IP counts in an access log, the previous example:
cat log | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | sort -n | uniq -c | sort -n
becomes something like this (tab is cut's default field delimiter, and uniq -c only counts adjacent duplicates, so sort first):
cat log | cut -f 3 | sort | uniq -c | sort -n
Easier to read and understand, and far less computationally expensive (no regex), which, on 9 GB logs, makes a huge difference in how long it takes. Where this gets REALLY neat is when you want to do the same thing for User-agents. If your logs are space-delimited, you have to do some regular expression matching or string searching by hand. With this format, it's simple:
cat log | cut -f 8 | sort | uniq -c | sort -n
Exactly the same as the above. In fact, any summary you want to do is essentially exactly the same.
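For instance, counting status codes (field 4 in this layout) or pulling out the ten slowest requests by the %D field (field 5) is just as short; a quick sketch, assuming the same tab-separated format:
cat log | cut -f 4 | sort | uniq -c | sort -n
sort -t$'\t' -k5,5 -rn log | head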
Why on earth would I spend my system's CPU on awk and grep when cut will do exactly what I want orders of magnitude faster?
How do I use the awk pattern scanning and processing language in bash scripts? Can you provide a few examples?
Awk is an excellent tool for building UNIX/Linux shell scripts. AWK is a programming language designed for processing text-based data, either in files or data streams, or through shell pipes. In other words, you can combine awk with shell scripts or use it directly at a shell prompt.
Print a Text File
awk '{ print }' /etc/passwd
OR
awk '{ print $0 }' /etc/passwd
Print Specific Field
Use : as the input field separator and print the first field only, i.e. usernames (all other fields are ignored):
awk -F':' '{ print $1 }' /etc/passwd
Send output to sort command using a shell pipe:
awk -F':' '{ print $1 }' /etc/passwd | sort
Pattern Matching
You can print a line only if a pattern matches. For example, display all lines from an Apache log file where the HTTP status code is 500 (in the default log format, the 9th field holds the status code for each request):
awk '$9 == 500 { print $0}' /var/log/httpd/access.log
The part outside the curly braces is called the "pattern", and the part inside is the "action". The comparison operators include the ones from C:
== != < > <= >= (plus C's ?: conditional expression)
If no pattern is given, then the action applies to all lines. If no action is given, then the entire line is printed. If "print" is used all by itself, the entire line is printed. Thus, the following are equivalent:
awk '$9 == 500 ' /var/log/httpd/access.log
awk '$9 == 500 {print} ' /var/log/httpd/access.log
awk '$9 == 500 {print $0} ' /var/log/httpd/access.log
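Patterns and actions combine naturally. As a sketch (assuming the stock combined log format, where the 9th field is the status code and the 10th is the response size in bytes), this totals the bytes sent for successful requests:
awk '$9 == 200 { sum += $10 } END { print sum }' /var/log/httpd/access.log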
Print Lines Containing tom, jerry OR vivek
Print every line that matches any of the patterns:
awk '/tom|jerry|vivek/' /etc/passwd
Print 1st Line From File
awk "NR==1{print;exit}" /etc/resolv.conf
awk "NR==$line{print;exit}" /etc/resolv.conf
Simple Arithmetic
Get the sum of all the numbers in a column:
awk '{total += $1} END {print total}' earnings.txt
The shell cannot do floating-point arithmetic, but awk can:
awk 'BEGIN {printf "%.3f\n", 2005.50 / 3}'
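Building on the sum example, the END block can also compute an average of the first column (a sketch that guards against an empty input file):
awk '{ total += $1 } END { if (NR > 0) printf "%.3f\n", total / NR }' earnings.txt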
Call AWK From Shell Script
Here is a shell script that lists all IP addresses accessing your website. It uses awk to process the log file, while argument checking is done with ordinary shell commands.
#!/bin/bash
# Check usage first, before $1 is used anywhere
[ $# -eq 0 ] && { echo "Usage: $0 domain-name"; exit 1; }
d=$1
OUT=/tmp/spam.ip.$$
HTTPDLOG="/www/$d/var/log/httpd/access.log"
if [ -f "$HTTPDLOG" ]
then
    # copy the log, then count hits per client IP (field 1)
    awk '{ print }' "$HTTPDLOG" > "$OUT"
    awk '{ print $1 }' "$OUT" | sort -n | uniq -c | sort -n
else
    echo "$HTTPDLOG not found. Make sure the domain exists and is set up correctly."
fi
/bin/rm -f "$OUT"
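Assuming the script is saved as topip.sh (a name chosen here purely for illustration) and made executable, it takes the domain name as its only argument:
chmod +x topip.sh
./topip.sh example.com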
AWK and Shell Functions
Here is another example. chrootCpSupportFiles() finds the shared libraries required by a program (such as perl or php-cgi) or by a shared library specified on the command line, and copies them to the destination. This code calls awk to print selected fields from the ldd output:
chrootCpSupportFiles() {
    # Set CHROOT directory name
    local BASE="$1"     # JAIL ROOT
    local pFILE="$2"    # copy bin file libs
    [ ! -d "$BASE" ] && mkdir -p "$BASE" || :
    # Shared libraries needed by the binary: third field of ldd output
    FILES="$(ldd "$pFILE" | awk '{ print $3 }' | egrep -v '^\(')"
    for i in $FILES
    do
        dcc="$(dirname "$i")"
        [ ! -d "$BASE$dcc" ] && mkdir -p "$BASE$dcc" || :
        /bin/cp "$i" "$BASE$dcc"
    done
    # Copy the dynamic linker (ld-linux) too; its path is in the first field
    sldl="$(ldd "$pFILE" | grep 'ld-linux' | awk '{ print $1 }')"
    sldlsubdir="$(dirname "$sldl")"
    if [ ! -f "$BASE$sldl" ]
    then
        /bin/cp "$sldl" "$BASE$sldlsubdir"
    else
        :
    fi
}
This function can be called as follows:
chrootCpSupportFiles /lighttpd-jail /usr/local/bin/php-cgi
AWK and Shell Pipes
List your top 10 favorite commands:
history | awk '{print $2}' | sort | uniq -c | sort -rn | head
Sample Output:
172 ls
144 cd
69 vi
62 grep
41 dsu
36 yum
29 tail
28 netstat
21 mysql
20 cat
Get a domain's expiration date from whois output:
whois cyberciti.com | awk '/Domain Expiration Date:/ { print $6"-"$5"-"$9 }'
Awk Program File
You can put all your awk commands in a file and call it from a shell script using the following syntax:
awk -f myprogram.awk input.txt
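As a sketch, a hypothetical myprogram.awk that counts the non-empty lines in its input could look like this:
# myprogram.awk - count non-empty lines
NF > 0 { count++ }
END { print count, "non-empty lines" }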
Awk in Shell Scripts - Passing Shell Variables TO Awk
You can pass shell variables to awk using the -v option:
n1=5
n2=10
echo | awk -v x=$n1 -v y=$n2 -f program.awk
The -v option assigns the value of n1 to the awk variable x before execution of the program begins. Such values are available in the BEGIN block of an AWK program. Here is program.awk:
BEGIN{ans=x+y}
{print ans}
END{}
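The same -v assignments also work without a separate program file; a quick sketch at the prompt:
n1=5
n2=10
awk -v x="$n1" -v y="$n2" 'BEGIN { print x + y }'
# prints 15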