Parsing Apache Logs with tail, cut, sort, and uniq

A client experienced some intermittent website down time last week during the final few days of April 2021, and sent over that month’s Apache logs for me to see if there is anything out of the ordinary – excessive crawling, excessive probing, brute force password attacks, things of that nature. Below are a few commands I have used that I thought would be nice to keep handy for future uses. I am currently using Ubuntu 20.04 LTS.

While unrelated, just to form a complete picture, my client sent the logs to me in gz-compressed format. If you are not familiar on how to uncompress it, it is fairly straight forward:

gunzip 2021-APR.gz

Back on topic… I ended on parsing the file in three separate ways for me to get an overall view of things. I found that the final few days of April are represented in the roughly final 15,000 lines of the log file, so I decided to use the tail command as my main tool.

First, I did following command below to find which IP addresses hit the server the most:

tail -n 15000 filename.log | cut -f 1 -d ' ' | sort | uniq -c | sort -nr | more

Quick explanation:

The tail command pulls the final 15,000 lines from the log file (final few days of the month)
The cut command parses each line using a space delimiter and returns the first field (the IP addres)
The sort command sorts the results thus far
The uniq command groups the results thus far and provides a count
The second sort command reverse the sort so the highest result is on top
Finally, the more command creates screen-sized pagination so it’s easier to read

There is always more than one way to do something in Linux, of course. Just as an aside, the following functions very similarly:

cat filename.log | awk '{print $1}' | sort | uniq -c | sort -nr | tail -15000 | more

Then, I thought it would be nice to get an idea of how many requests were made per hour. This can be achieved with the command below.

tail -n 15000 filename.log | cut -f 4 -d ' ' | cut -f 1,2 -d ':' | sort | uniq -c | more

The main difference here is that I opted for the 4th (rather than 1st) result of the cut command, which gets me the timestamp element (rather than IP address), and then a second cut command parses it on the colon symbol and returns the first (date) and second (hour) for further grouping.

Finally, I tweaked it a little bit more so I get an idea of whether there was excessive requests within any minute-span. This can be achieved by expanding the second cut command slightly, as per below.

tail -n 15000 filename.log | cut -f 4 -d ' ' | cut -f 1,2,3 -d ':' | sort | uniq -c | more

Leave a Reply Cancel reply