
Finding causes of heavy usage on a web server using the access log

Sometimes you notice your server loading slowly or going down, or that your server's resource usage is higher than expected at particular times. This may be due to heavy usage on your server. When web pages load slowly, it is often because bots and unwanted scripts are hitting your web server at that time. For dynamic websites, many plugins and modules provide additional functionality, but they can also hurt web server performance. Remove unwanted plugins and modules from your web server for better performance. Plugins and modules should be used to make websites more efficient, such as caching plugins.

In this article I am going to explain how you can find the causes of heavy usage on your web server.

CHECKING WEB SERVER ACCESS LOG LOCATION

You can check your web server access log to confirm exactly what is hitting your websites. First, find the access log location on your web server. In my case, the access log is located in the “/var/log/httpd/vhost/” directory.

# ls -l /var/log/httpd/vhost/
total 240
-rw-r--r-- 1 root root 236290 Dec 5 10:53 access.log
-rw-r--r-- 1 root root 2523 Dec 5 04:16 error.log
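
If you are not sure where your access log lives, you can search the Apache configuration for the CustomLog directive. The path below assumes a CentOS/RHEL-style layout under /etc/httpd; adjust it for your distribution.

# grep -Ri "customlog" /etc/httpd/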

EXAMINE ACCESS LOG

Now we can examine the access log to find the culprit IPs and bots that are hitting your server the most. Use the commands below to find which IPs are hitting your server most often. Log in to your server and navigate to the access log file.

#cd /var/log/httpd/vhost/
#ls -l
total 240
-rw-r--r-- 1 root root 236290 Dec 5 10:53 access.log
-rw-r--r-- 1 root root 2523 Dec 5 04:16 error.log

Top 10 IPs hitting your web server

#awk '{print $1}' access.log | sort | uniq -c | sort -nr | head
290 66.249.66.166
97 223.176.160.29
59 93.77.134.151
52 94.25.134.52
44 94.247.174.83
44 85.93.93.124
44 76.164.194.74
44 174.34.156.130
44 109.123.101.103
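
If the load spikes at particular times of day, it also helps to count requests per hour. The sketch below assumes the standard combined log format, where the fourth field is the timestamp (for example [05/Dec/2016:10:53:01); it prints the busiest hours first.

# awk '{print $4}' access.log | cut -d: -f1,2 | sort | uniq -c | sort -nr | head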

Use the host command to check which hosting company a specific IP is coming from.

If the host command is not found on your system, install it using the command below.

# yum install bind-utils
========================================================================================================================================================================
Package Arch Version Repository Size
========================================================================================================================================================================
Installing:
bind-utils x86_64 32:9.8.2-0.47.rc1.el6_8.3 updates 187 k
Updating for dependencies:
bind x86_64 32:9.8.2-0.47.rc1.el6_8.3 updates 4.0 M
bind-libs x86_64 32:9.8.2-0.47.rc1.el6_8.3 updates 890 k

Transaction Summary
========================================================================================================================================================================
Install 1 Package(s)

#host 66.249.66.166
166.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-166.googlebot.com.

Above you can see that the IP “66.249.66.166” belongs to Google. You can block such IPs if they are hitting your server unnecessarily.
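
To look up several of the top IPs in one pass, you can feed the IP list from the earlier awk command into host in a loop. This is only a convenience sketch; not every IP has reverse DNS, so some lookups will fail.

# awk '{print $1}' access.log | sort | uniq -c | sort -nr | head | awk '{print $2}' | while read ip; do host $ip; done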

BLOCK IPs USING .HTACCESS FILE

You can block culprit IPs using a .htaccess file in Apache. Navigate to your web server's document root and create a .htaccess file. In my case the document root is “/var/www/html/vhost/”.

#cd /var/www/html/vhost/
#vim .htaccess
Order allow,deny
Deny from 66.249.66.166
Allow from all

Save and exit.
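
Note that the Order/Deny/Allow directives are Apache 2.2 syntax. If your server runs Apache 2.4 with mod_authz_core, the rough equivalent in .htaccess would be the sketch below; test it against your own configuration.

<RequireAll>
    Require all granted
    Require not ip 66.249.66.166
</RequireAll>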

TOP TEN REQUESTED FILES AND DIRECTORIES ON THE WEB SERVER

We can check which files and directories on the web server are being requested the most.

Listing the top 10 files and directories being requested most on the web server

#awk '{print $7}' access.log | sort | uniq -c | sort -nk1 | tail -n10

8 /wp-content/themes/spacious/genericons/genericons.css?ver=3.3.1
8 /wp-content/uploads/2016/11/looklinux-bg.jpg
8 /wp-includes/js/jquery/jquery.js?ver=1.12.4
9 /wp-admin/
9 /wp-content/plugins/wp-pagenavi/pagenavi-css.css?ver=2.70
13 /questions/ask/
14 /wp-admin/nav-menus.php
126 /wp-admin/admin-ajax.php
220 /index.php
304 /

You can see above that the “/” directory was requested 304 times.
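
Besides request counts, it can be useful to see which URLs transfer the most data, since a few large files can account for most of the bandwidth. The sketch below assumes the combined log format, where field 7 is the request path and field 10 is the response size in bytes.

# awk '{bytes[$7] += $10} END {for (url in bytes) print bytes[url], url}' access.log | sort -nr | head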

EXAMINE SPIDERS, BOTS AND CRAWLERS ON WEB SERVER

You can examine your web server access log for bots, spiders and other crawlers that hit your server the most and consume server resources (memory and CPU). These crawlers can slow down your websites. For better web server performance you should create a “robots.txt” file in your web server's document root directory. It tells search engines which content should be indexed and which content should not be indexed.

FINDING TOP USER-AGENTS HITTING THE WEB SERVER

Use the command below to find out which user-agents are hitting your web server the most.

#cat example.com_access.log |awk -F'"' '/GET/ {print $6}' | cut -d' ' -f1 | sort | uniq -c | sort -rn

687 Mozilla/5.0
476 Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)
26 facebookexternalhit/1.1
6 MetaInspector/4.7.2
3 Baiduspider-image+(+http://www.baidu.com/search/spider.htm)

Above you can see all the robots that are hitting your web server.
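
To see how many requests a single crawler made, you can simply count occurrences of its user-agent string. The example below counts Googlebot hits; swap in any user-agent you found above.

# grep -ci "googlebot" access.log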

STOP ROBOTS FROM INDEXING WEB SERVER CONTENT

Most of the time, robots such as Yahoo, Google, msnbot and facebookexternalhit crawl your sites, causing load spikes and consuming server resources. For better web server performance you may need to block these robots.

BLOCKING GOOGLEBOT:

Above, using the host command on an IP from the access log, we found that “66.249.66.166” belongs to Google.

#host 66.249.66.166
166.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-166.googlebot.com.

Now I am going to block Googlebot using the “robots.txt” file.

First navigate to your web server's document root directory and create a “robots.txt” file.

In my case the web server document root directory is “/var/www/html/vhost/”.

#cd /var/www/html/vhost/
#vim robots.txt
#Block Googlebot
User-agent: Googlebot
Disallow: /

Field explanations:

#Block Googlebot – This is only a comment.
User-agent: – The bot name.
Disallow: – The path to stop crawling. Here we block indexing of everything on the server.
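
robots.txt only works if crawlers can fetch it from the root of your site, so it is worth checking that it is served correctly after you create it. The hostname below is a placeholder; use your own domain.

# curl -i http://example.com/robots.txt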

BLOCKING YAHOO:

You can also rate-limit a crawler in robots.txt by restricting its fetching activity. For example, to slow down the crawler, you can tell Yahoo not to fetch a page more than once every 20 seconds. Add the lines below to your robots.txt file to delay the crawler.

#cd /var/www/html/vhost/
#vim robots.txt
#Delay yahoo crawler for 20 seconds
User-agent: Slurp
Crawl-delay: 20

Field explanations:

#Delay yahoo crawler for 20 seconds – This is only a comment.
User-agent: – Slurp is the name of the Yahoo user agent.
Crawl-delay: – The user agent will wait 20 seconds between each fetch request.

CONFIGURE SLOWER FETCHING FOR GOOD BOTS

Sometimes you will want good bots to crawl your site for traffic purposes. Configure your robots.txt file as shown below.

#cd /var/www/html/vhost/
#vim robots.txt
# Slow crawling 3400 seconds for all bots
User-agent: *
Crawl-delay: 3400

Field explanations:

# Slow crawling 3400 seconds for all bots – This is only a comment.
User-agent: – “*” matches all user-agents.
Crawl-delay: – The user agent will wait 3400 seconds between each fetch request.

DISALLOW ALL BOTS FROM CRAWLING YOUR WEBSITES

Add the lines below to your robots.txt file to disallow all bots.

#cd /var/www/html/vhost/
#vim robots.txt
# Disallow all bots
User-agent: *
Disallow: /

DISALLOW A SPECIFIC FOLDER FROM BEING CRAWLED

Add the lines below to your robots.txt file to disallow only a specific folder/directory.

#cd /var/www/html/vhost/
#vim robots.txt
# Disallow specific directory
User-agent: *
Disallow: /Your_Directory_Name/

Field explanations:

# Disallow specific directory – This is only a comment.
User-agent: – “*” matches all user-agents.
Disallow: /Your_Directory_Name/ – Disallows only the mentioned directory.

ALLOW ALL BOTS TO CRAWL YOUR WEBSITES

If you want to allow all bots to crawl your site, add the lines below to your robots.txt file.

#cd /var/www/html/vhost/
#vim robots.txt
# Allow Everything for crawling
User-agent: *
Disallow:

If you are seeing 404 Not Found requests for robots.txt in your web log, create a robots.txt file with the lines above.
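
Putting several of the rules above together, a combined robots.txt might look like the sketch below. The directory name is only an example placeholder; list whatever paths you do not want crawled.

# Example combined robots.txt
User-agent: Googlebot
Disallow: /

User-agent: Slurp
Crawl-delay: 20

User-agent: *
Crawl-delay: 30
Disallow: /Your_Directory_Name/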

I hope this article helps you find the causes of heavy usage on your web server. If you have any queries or problems, please comment in the comment section.

Thanks:)
