Parsing access.log and error.log using Linux commands

Access logs

We are using the following format, which is also nginx's default format, named “combined”:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"

The fields are as follows:

$remote_addr – IP from which request was made
$remote_user – HTTP authenticated user. This will be empty (logged as "-") for most apps, as modern apps do not use HTTP-based authentication.
[$time_local] – timestamp as per server timezone
“$request” – full request line: HTTP method (GET, POST, etc.) + requested URI, including any query string + HTTP protocol version
$status – HTTP response code from server
$body_bytes_sent – size of server response in bytes
“$http_referer” – Referral URL (if present)
“$http_user_agent” – User agent as seen by server
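
To see how these fields map to the awk column numbers used in the commands below, here is a minimal sketch run against a single made-up log line. The IP address, path and user agent are invented for illustration only; the field positions assume the “combined” format shown above.

```shell
# A made-up line in the "combined" format (all values are illustrative only).
line='203.0.113.5 - - [11/Jan/2014:18:00:01 +0000] "GET /videos/?page=2 HTTP/1.1" 404 162 "-" "Mozilla/5.0 (X11; Linux x86_64)"'

# Splitting on whitespace (awk's default): $1 = client IP, $7 = request URI, $9 = status.
fields=$(printf '%s\n' "$line" | awk '{print $1, $7, $9}')
echo "$fields"

# Splitting on double quotes: $2 = request line, $4 = referer, $6 = user agent.
agent=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')
echo "$agent"
```

This is why the commands below use $9 for the status code and $7 for the URL, and switch to -F\" when a quoted field containing spaces is needed as a whole.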

Let's explore some commands that can help us analyse logs.

Sort access by Response Codes

cat access.log | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c | sort -rn

Sample Output:

The same thing can be done using awk:

awk '{print $9}' access.log | sort | uniq -c | sort -rn

Sample Output:

As you can see, the log shows that more than 700 requests returned 404!

Let's find out which links are broken.

The following command searches for requests that resulted in a 404 response and sorts them by the number of requests per URL, so the most frequently requested 404 pages come first.

awk '$9 == 404 {print $7}' access.log | sort | uniq -c | sort -rn

For EasyEngine, use this instead:

awk '($8 ~ /404/)' access.log | awk '{print $8}' | sort | uniq -c | sort -rn

Sample Output (truncated):

  21 /members/katrinakp/activity/2338/
  19 /blogger-to-wordpress/robots.txt
  14 /rtpanel/robots.txt

Similarly, for 502 (Bad Gateway) we can run the following command:

awk '($9 ~ /502/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn

Sample Output (truncated):

    728 /wp-admin/install.php
    466 /
    146 /videos/
    130 /wp-login.php
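
Since the 404 and 502 commands differ only in the status code, they could be wrapped in a small shell function. The following is a sketch, not part of the original commands: the function name top_urls_by_status is hypothetical, the sample log entries are made up, and the field numbers assume the “combined” format.

```shell
# Hypothetical helper: list URLs by hit count for a given status code.
# Assumes the "combined" format, where $9 is the status and $7 the request URI.
# Usage: top_urls_by_status STATUS LOGFILE
top_urls_by_status() {
    awk -v s="$1" '$9 == s {print $7}' "$2" | sort | uniq -c | sort -rn
}

# Made-up sample data so the sketch is runnable on its own.
log=$(mktemp)
cat > "$log" <<'EOF'
198.51.100.7 - - [11/Jan/2014:18:00:01 +0000] "GET /wp-login.php HTTP/1.1" 502 172 "-" "curl/7.29.0"
198.51.100.8 - - [11/Jan/2014:18:00:02 +0000] "GET / HTTP/1.1" 502 172 "-" "curl/7.29.0"
198.51.100.9 - - [11/Jan/2014:18:00:03 +0000] "GET /wp-login.php HTTP/1.1" 502 172 "-" "curl/7.29.0"
198.51.100.9 - - [11/Jan/2014:18:00:04 +0000] "GET /videos/ HTTP/1.1" 200 734 "-" "curl/7.29.0"
EOF

result=$(top_urls_by_status 502 "$log")
echo "$result"
rm -f "$log"
```

On a real server, the call would look like top_urls_by_status 404 access.log. Comparing with == rather than a regex match avoids accidentally matching a status that merely contains the digits.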

Who is requesting broken links (or URLs resulting in 502)?

awk -F\" '($2 ~ "/wp-admin/install.php"){print $1}' access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Sample Output:

     14 50.133.11.248
     12 97.106.26.244
     11 108.247.254.37
     10 173.22.165.123

404s for PHP files – mostly hacking attempts

awk '($9 ~ /404/)' access.log | awk -F\" '($2 ~ /^GET .*\.php/)' | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 20

Most requested URLs

awk -F\" '{print $2}' access.log | awk '{print $2}' | sort | uniq -c | sort -rn

Most requested URLs containing XYZ (here, XYZ is "ref")

awk -F\" '($2 ~ "ref"){print $2}' access.log | awk '{print $2}' | sort | uniq -c | sort -rn
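
The same quote-delimited splitting works for the other quoted fields. As a sketch, the following counts requests per full user agent string, assuming the “combined” format, where the user agent is field $6 when splitting on double quotes. The sample log entries are made up so that the example runs on its own.

```shell
# Build a tiny sample log (made-up entries) so the sketch runs on its own.
log=$(mktemp)
cat > "$log" <<'EOF'
198.51.100.7 - - [11/Jan/2014:18:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"
198.51.100.8 - - [11/Jan/2014:18:00:02 +0000] "GET /videos/ HTTP/1.1" 200 734 "-" "curl/7.29.0"
198.51.100.7 - - [11/Jan/2014:18:00:03 +0000] "GET /videos/ HTTP/1.1" 200 734 "-" "Mozilla/5.0 (X11; Linux x86_64)"
EOF

# With '"' as the separator, $6 is the whole user agent, spaces included.
top_agents=$(awk -F'"' '{print $6}' "$log" | sort | uniq -c | sort -rn)
echo "$top_agents"
rm -f "$log"
```

Printing a whitespace-delimited field such as $14 would return only one word of the user agent, which is why the quote separator is used here.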

Useful: Tweaking fastcgi-buffers using access logs

Recommended Reading: http://www.the-art-of-web.com/system/logs/ – explains log parsing very nicely.

16 responses to “Parsing access.log and error.log using Linux commands”

Oscar says:

November 4, 2013 at 4:38 pm

Nice article. I found an error in “404 for php files – mostly hacking attempts”, you have “combined_log” there instead of “access.log”.
- Rahul Bansal says:
  
  November 5, 2013 at 7:42 pm
  
  I did not get your question. Can you clarify?
  - Gaptek Update says:
    
    December 22, 2013 at 10:26 pm
    
    He said that you wrote:
    awk '($9 ~ /404/)' combined_log
    
    instead of
    awk '($9 ~ /404/)' access.log
    - Rahul Bansal says:
      
      December 24, 2013 at 1:08 pm
      
      Thanks, Gaptek, for the clarification. 🙂
      
      Thanks, Oscar, for pointing it out. 🙂
      
      I have updated the article to reflect the corrections.
Kalim says:

January 16, 2014 at 4:12 pm

Can we grep the 404 URLs from the access log for a particular time span? For example, I want to grep 404 errors between “11/Jan/2014:18:00:00” and “11/Jan/2014:18:10:00”.
- Rahul Bansal says:
  
  January 16, 2014 at 7:15 pm
  
  I am not sure about grep, but it might be possible with awk and/or custom shell scripting.
  
  Maybe run the logs through a loop, separate out the lines that satisfy the date criteria, and store those lines in a new file.
- rodrigo says:
  
  January 22, 2014 at 1:30 am
  
  I am trying out this parser and it looks very promising!
  
  # man goaccess
  
  If we want to parse only a certain time-frame from DATE a to DATE b, we can do:
  
  sed -n '/5\/Nov\/2010/,/5\/Dec\/2010/ p' access.log | goaccess -a
  - Rahul Bansal says:
    
    January 22, 2014 at 1:00 pm
    
    Does this really work with date formats and the days in between?
    - Kalim says:
      
      January 22, 2014 at 1:41 pm
      
      It works for whole days, but not for a particular time of day.
    - rodrigo ferroni says:
      
      January 22, 2014 at 9:16 pm
      
      Hi Rahul!
      
      I tried it and it worked!
      
      To do that, we need to use the sed command:
      
      # sed -n '/22\/Jan\/2014:11:40:03/,/22\/Jan\/2014:12:23:24/ p' access.log
      
      I read the following lines in sedfaq.txt:
      
      “Then I learned that sed could display only one paragraph of a file,
      beginning at the phrase “and where it came” and ending at the
      phrase “for all people”. My script looked like this:
      
      sed -n '/and where it came/,/for all people/p' myfile”
      - Rahul Bansal says:
        
        January 24, 2014 at 9:27 pm
        
        Glad to know that. 🙂
        
        I will give it a try someday.
ad says:

February 4, 2014 at 11:08 am

sort does not work

awk '{print $9}' access.log | sort | uniq -c | sort -r
877 503
4 301
42 500
3 405
30 499
3 "-"
29 502
24940 200
2312 404
21 304
1 504
1 403
1 302
- Rahul Bansal says:
  
  February 5, 2014 at 10:11 pm
  
  Can you please explain what the expected outcome is? I might be able to tweak the command accordingly.
  - ad says:
    
    February 6, 2014 at 5:04 pm
    
    I wanted the results to be sorted by the number of requests per status code.
    
    I also want to group all the requests by user agent.
    
    awk '{print $14}' access.log | sort | uniq -c | sort -r
    
    works, but that only checks the 14th word. I need a way to check the full user agent string.
    - Rahul Bansal says:
      
      February 11, 2014 at 4:51 pm
      
      I tested awk '{print $9}' access.log | sort | uniq -c | sort -r and cat access.log | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c | sort -r.
      
      Both worked on our server.
      
      Can you post the log_format value from your nginx config? If your access log lines have a different format, you will need to change $14 to something else.
      
      I doubt the OS/distro plays any role for these commands. Still, may I know which OS/distro you are using?