As visitors to your website increase, so does the size of your log files. It has been estimated that they grow by 1MB for every 10,000 hits. This may not appear to be much of a problem, but be aware that a 'hit' here means a log entry for every item that apache serves (e.g. images, css files, pdfs), so just loading one page can create multiple log entries.
Ideally you want to reduce your current logs to manageable files, for example one file for each month. When apache is running it opens the log file and holds it open for writing. Reducing the file sizes can have performance benefits and also makes it easier to parse the log files for analysis.
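To put some rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The traffic figures are made up, and the 10,000-hits-per-MB ratio is just the estimate quoted above, so treat the result as a ballpark only.

```python
# Rough log-growth estimate. HITS_PER_MB is the estimate quoted in the
# text, not a measured value; the traffic numbers below are hypothetical.
HITS_PER_MB = 10_000


def monthly_log_mb(page_views_per_day, hits_per_page, days=30):
    """Return the estimated log growth in MB over `days` days."""
    hits = page_views_per_day * hits_per_page * days
    return hits / HITS_PER_MB


# Hypothetical site: 2,000 page views a day, 15 logged items per page.
print(monthly_log_mb(2000, 15))  # 90.0
# Same site if only the page itself were logged.
print(monthly_log_mb(2000, 1))   # 6.0
```

The gap between the two figures is the part of the log that comes purely from images, stylesheets and other page furniture.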
1. Tell apache what to log
The first step is to set up the format of our log files in httpd.conf. We will use the combined log format as this is preferred by most log file analyzers and indeed recommended by apache. The LogFormat directive allows us to specify the format of the log file entries and to give that format a name, in this case 'combined'.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
As mentioned earlier, apache will log every media item that it serves. It would greatly decrease the size of our logfiles if we could tell it to ignore items such as images. This is achieved by using the SetEnvIf directive. SetEnvIf allows us to set an environment variable on the basis of a request, which we can then use as a filter when configuring the actual log file for our host. The basic format for SetEnvIf is as follows:
SetEnvIf Attribute match_condition new_variable_name
The Attribute part can be one of the following:
Remote_Host - the hostname (if available) of the client making the request
Remote_Addr - the IP address of the client making the request
Server_Addr - the IP address of the server on which the request was received (only with versions later than 2.0.43)
Request_Method - the name of the method being used (GET, POST, et cetera)
Request_Protocol - the name and version of the protocol with which the request was made (e.g., "HTTP/0.9", "HTTP/1.1", etc.)
Request_URI - the resource requested on the HTTP request line
Match Condition is a Perl-compatible regular expression, and the last section lets us set the value of our environment variable in one of three ways:
1. Give it a literal value - myenvar=hello
2. Just set it (will give it a value of 1) - myenvar
3. Unset it - !myenvar
Now we tell apache not to log any images or other files that we don't want to count as pages.
SetEnvIfNoCase Request_URI "\.(gif|jpg|png|css|js|ico|eot)$" dontlog
Essentially we are asking apache to check the incoming URI and, if it matches any of the above extensions, to set an environment variable called 'dontlog'. We don't need to assign a literal value to this variable, as later when we decide whether to log or not we only need to check for the existence of this variable.
Please note the use of SetEnvIfNoCase instead of SetEnvIf. This works in an identical fashion to SetEnvIf except that it is case insensitive, which is good for us as people may name their images .JPG. This also stops the pesky requests for those damn favicons from appearing in your logs!
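Before putting a pattern like this into httpd.conf, it is worth trying it against a few sample URIs. The sketch below does that in Python, using a grouped form of the pattern (escaped dot, one alternation group, anchored at the end). Python's re module is not the same engine apache uses, but for a pattern this simple it should behave the same way. The dontlog function and the sample URIs are made up for illustration.

```python
import re

# Escaped dot, one alternation group, anchored at the end of the URI.
# re.IGNORECASE plays the role of SetEnvIfNoCase.
STATIC = re.compile(r"\.(gif|jpg|png|css|js|ico|eot)$", re.IGNORECASE)


def dontlog(uri):
    """Return True if a request for this URI would be left out of the log."""
    return STATIC.search(uri) is not None


samples = [
    "/index.html",
    "/images/logo.JPG",
    "/favicon.ico",
    "/jpg-gallery/",
    "/downloads/report.pdf",
]
for uri in samples:
    print(uri, "skip" if dontlog(uri) else "log")
# /index.html log
# /images/logo.JPG skip
# /favicon.ico skip
# /jpg-gallery/ log
# /downloads/report.pdf log
```

Note the /jpg-gallery/ case: without the escaped dot and the single group, a pattern made of loose alternatives would match "jpg" anywhere in the URI and quietly drop real pages from the log.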
2. Log Rotation
In its default configuration apache will log all requests to one log file, which will keep growing over time. As mentioned before, we really want to reduce this to something more manageable. Essentially, what we want to do is, on a specified date/time, stop logging to the current log file, create a new log file, back up the old log file and continue logging to the new one. The only major stumbling block is that apache holds the log file open for writing, so to perform any operations on that file we must stop apache first, which is not great if you are running 24-7!
Apache provides piped logs to help us get round this. Piped logs let us pass the output of our CustomLog directive to another program using the pipe operator '|'. This removes the need for apache to open the log file itself, as the log rotation program receives each entry from apache and deals with writing it to file.
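To make the idea of a piped log concrete, here is a toy version of such a program in Python. It is only a sketch of the principle, not how any real log rotation program is implemented; write_entries and its template argument are names I made up for illustration.

```python
import os
import sys
from datetime import datetime


def write_entries(lines, template, now=datetime.now):
    """Append each incoming line to a file whose path depends on the date.

    `template` is a path containing strftime codes. The path is worked
    out again for every line, so when the date rolls over the next entry
    lands in a new file and the sender never needs to be restarted.
    """
    for line in lines:
        path = now().strftime(template)
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "a") as fh:
            fh.write(line.rstrip("\n") + "\n")


# Only act as a filter when a template is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    write_entries(sys.stdin, sys.argv[1])
```

The point is simply that the program on the receiving end of the pipe owns the file, so it is free to switch files whenever it likes. For a real server you would use a proper tool rather than this.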
Consider the following CustomLog directive:
CustomLog /path/to/my/logfile/access_log combined env=!dontlog
This writes to the specified logfile (/path/to/my/logfile/access_log) in the combined format that we specified earlier, ignoring any request that has had the dontlog environment variable set by the rule we configured earlier.
We now need to pipe this to our log rotation program. Apache comes with its own program called rotatelogs, but I will be using cronolog as it allows better separation (by date) of log files.
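If you want to check what ends up in the file, a combined-format entry can be split back into its fields with a regular expression. The snippet below is a rough sketch in Python and has nothing to do with how apache or a log analyzer does it internally. The sample line is made up, and the simple pattern assumes there are no escaped quotes inside the quoted fields.

```python
import re

# One named group per field of the combined format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None


# Made-up sample entry.
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

entry = parse_line(sample)
print(entry["host"], entry["status"], entry["request"])
# 127.0.0.1 200 GET /index.html HTTP/1.0
```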
cronolog is available here
Once you have downloaded the latest version unpack the archive:
tar -xvzf cronolog-1.6.X.tar.gz
switch to the newly created directory and issue the following commands:
./configure
make
then switch to root user and run
make install
once you have installed cronolog (default location is /usr/local/sbin) we need to change our custom log directive
CustomLog "|/usr/local/sbin/cronolog /path/to/my/logfile/%m-%Y/access_log" combined env=!dontlog
This now pipes the log output to cronolog. I have told cronolog to rotate my logs once a month by using "%m-%Y"; this will place each new log in a directory named in the format MM-YYYY, e.g.:
/path/to/my/logfile/01-2007/access_log
/path/to/my/logfile/02-2007/access_log
you can specify a host of other date/time combinations which are shown on the cronolog website
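The date codes in the path are the familiar strftime-style ones, so you can get a feel for how a template will expand by trying it in Python first. This is just a sketch: Python's strftime is standing in for cronolog here, the expand helper is my own name, and the daily template is only an example of another layout, so check the cronolog documentation for the exact set of codes it supports.

```python
from datetime import datetime


def expand(template, when):
    """Expand strftime-style codes in a log path for a given date."""
    return when.strftime(template)


templates = [
    "/path/to/my/logfile/%m-%Y/access_log",     # one directory per month
    "/path/to/my/logfile/%Y/%m/%d/access_log",  # one directory per day
]

when = datetime(2007, 2, 14)
for template in templates:
    print(expand(template, when))
# /path/to/my/logfile/02-2007/access_log
# /path/to/my/logfile/2007/02/14/access_log
```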
once you have made these changes to your httpd.conf remember to restart apache so they take effect
3. AwStats
Awstats is a logfile analyzer. I also played around with webalizer, which does a similar thing, but in a toss-up I chose awStats. awStats is available here.
once you have downloaded and unpacked awStats you must move the contents of the unpacked directory to /usr/local/awstats
mv awstats-6.X /usr/local/awstats
then go to /usr/local/awstats/tools and run the configuration script
perl awstats_configure.pl
this will create a configuration file in /etc/awstats if you picked all the default options in the config script. If you entered www.mydomain.com as the website name, the config file will actually be called /etc/awstats/awstats.www.mydomain.com.conf; however, throughout the rest of the setup it is generally referred to as www.mydomain.com (awstats will add the awstats. and the .conf around it). NB. Please substitute www.mydomain.com with your real web address!
open up this config file and check the following lines
LogFile="/path/to/my/logfile/%MM-0-%YYYY-0/access_log"
LogType=W
LogFormat=1
SiteDomain="www.mydomain.com"
DirData="/path/to/awstats/datadir"
where LogFile is the location of our logfile. "%MM-0-%YYYY-0" tells awstats to read the logfile in the directory that we specified earlier; the -0 is a time offset where 0 = now, so it looks in the directory currentMonth(MM)-currentYear(YYYY). LogType is W for web (as opposed to FTP, Squid etc). LogFormat is 1 for the combined log format we set up earlier. SiteDomain is the domain for your site. DirData is the location where awstats will store its data files; this directory needs to be writable by the webserver and defaults to /var/lib/awstats.
Once you have configured the .conf file it's time to parse the logfiles and build the output. Use the following command:
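To see how the LogFile tags line up with the monthly directories, here is a small Python sketch that mimics the substitution. It is not awstats code: expand_logfile is a made-up helper, it only handles the two tags used above, and it assumes the number after the dash is an offset in hours, which is how I read the awstats documentation.

```python
import re
from datetime import datetime, timedelta

# Matches %YYYY-n and %MM-n, where n is taken to be "n hours ago".
TAG = re.compile(r"%(YYYY|MM)-(\d+)")


def expand_logfile(template, now):
    """Replace %YYYY-n / %MM-n with the year / month as of n hours ago."""
    def repl(match):
        then = now - timedelta(hours=int(match.group(2)))
        return then.strftime("%Y" if match.group(1) == "YYYY" else "%m")
    return TAG.sub(repl, template)


# Hypothetical run early on the first day of a month.
run_time = datetime(2007, 2, 1, 4, 0)
print(expand_logfile("/path/to/my/logfile/%MM-0-%YYYY-0/access_log", run_time))
# /path/to/my/logfile/02-2007/access_log
print(expand_logfile("/path/to/my/logfile/%MM-24-%YYYY-24/access_log", run_time))
# /path/to/my/logfile/01-2007/access_log
```

If that reading is right, an update that runs shortly after midnight on the first of the month with an offset of 0 will already be looking at the new month's directory, so it is worth checking that the last entries of the old month get picked up.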
/usr/local/awstats/wwwroot/cgi-bin/awstats.pl -update -config=www.mydomain.com
if you have several sites then copy/paste a new .conf file in /etc/awstats, changing the above sections for each domain. Multiple configs can be run with one command:
/usr/local/awstats/tools/awstats_updateall.pl now
when this is done we should be able to see the output for each domain by browsing to:
http://www.mydomain.com/awstats/awstats.pl?config=www.mydomain.com
where config=www.mydomain.com is the name of the config file we created earlier. This can be changed depending on which of your configs you wish to view (if you have more than one domain).
Now we need to add a cron entry to update this daily.
0 4 * * * /usr/local/awstats/tools/awstats_updateall.pl now > /dev/null 2>&1
This will update the stats every day at 4am
if you don't want to use the web browser to access awstats.pl directly (as this may increase load on the server) you can generate static reports using the following command
perl awstats.pl -config=www.mydomain.com -output -staticlinks > awstats.mydomain.html
you will then have to add another cron entry to build these at regular intervals.
HTH!