The story of one forgotten Apache Log.

A long time ago in a galaxy far, far away, while analyzing the remaining free space on a Linux server that hosted some 50-100 websites on all sorts of subjects, I discovered that an administrator had forgotten to enable log rotation for one of the sites.

He had forgotten it a couple of years earlier. Time passed, the server kept working as usual, there was plenty of space on modern disks, and the log kept growing (like the turnip from the children’s fairy tale). When it finally caught my eye, I could hardly believe what I was seeing.

The file was huge; it occupied more than 135 GB. 135 gigabytes! of access history. Meanwhile, less than 40 GB of free space remained on the partition. Normally, the fix for such an omission is trivial: delete the oversized file and send apache a signal to re-open its logs, after which it will cheerfully continue writing reports about its work to a newly created file; of course, you shouldn’t forget to enable rotation for that file.

But sometimes website owners and web developers badly need the information from these very logs. Keeping Murphy’s law and my previous experience of deleting supposedly useless yet suddenly extremely important files in mind, I took on the task of splitting and compressing the file as if rotation had been enabled all along, say, on a monthly basis. A couple of years of logs would then look like 24 + N compressed archives plus one file of acceptable size.

First of all, I looked through the standard Linux utilities suited to something like this; chances are I was not the first to face such a task and a solution already existed – all I had to do was find it.

Let’s start with the simplest commands and then move on to the “heavy stuff”:

  • tail

tail -n 1500 BIG_BIG_LOGFILE.log would show the last 1500 lines.

tail -n 1500 BIG_BIG_LOGFILE.log | gzip -c > log_tail1500.log.gz will create a compressed file with the last 1500 lines of our log. However, we have no idea which line to start from, and the size of the source file won’t change.

  • head

head -n $((TOTAL_LINES - 1500)) BIG_BIG_LOGFILE.log will print the whole file except the last 1500 lines to standard output. There is no sense in using this command here: there is no room on the disk to save the result, and the operation would take 30-40 minutes to run.

  • grep

grep -e '\[../Oct/2012:..:..:.. -0200\]' BIG_BIG_LOGFILE.log will print every record from October 2012 to standard output. The output can be compressed just like the output of tail, and just as with tail, the size of the source file won’t change. Inverse filtering is possible, but there is not enough free space to store its output.

  • split

is very similar to what I wanted to do; however, it splits by lines or by bytes, whereas I wanted to split by months, as if the log had been rotated on schedule. Moreover, the version shipped with the distribution in question could not pipe its chunks through another program for compression (newer GNU versions have a --filter option for that), and there wasn’t enough space on the disk for another copy of the same size.

  • awk

a full-fledged programming language and almost a runtime environment, created specifically for processing text. It could implement everything mentioned above and much more, but life had other plans and I ended up using python for the job…

Yes, python; of course, it would have been possible to use awk, but the task simply coincided with my wish to try python on something real, even something as contrived as this. Undoubtedly, the same code could be written in C, but python attracted me, like a shiny new toy in a child’s hands, with its syntactic sugar, convenient lists with slicing, the REPL from iPython, and all the other wonderful things yet to be mastered.

With four functions at hand – read, seek, write, truncate – the algorithm looked like this:

  1. Find the line in the log file where the LAST MONTH begins.
  2. Write everything from that line onward into a separate compressed file with the month and year in its name: XXXX-Mon-2012.log.gz
  3. Cut the tail off the file starting from that line (the truncate function).
  4. Repeat steps 1, 2, 3 until the remaining size is acceptable.

Before all that, the log had to be redirected to another file, because apache knows nothing about step 3: as far as it is concerned, the file never shrank. Its very first record after that would return the file size to the original 135 GB, with the cut-off region reading back as zeros instead of the deleted content, so the operation would not actually win back the disk space.
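
Here is a tiny standalone sketch of that effect (my own illustration, not part of the original script): a writer that keeps its old offset simply re-extends the truncated file.

# Illustration only: simulate a writer (apache) that still remembers its old offset.
with open('sparse_demo.log', 'wb') as f:
    f.write(b'x' * 1024 * 1024)      # pretend this is the old log content

with open('sparse_demo.log', 'r+b') as f:
    f.truncate(0)                    # the admin "frees" the space...
    f.seek(1024 * 1024)              # ...but the writer is still at its old offset
    f.write(b'new record\n')         # the apparent size jumps back; the gap reads as zeros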

For writing and testing, the iPython environment was installed; it adds quite a few convenient goodies on top of the plain python prompt, such as command history and auto-completion of commands and object methods/parameters.

Given the impressive size of the file and the almost 100% ascending order of its records, the first step begged for a classic binary search built on the basic file operations read and seek. But there is one catch: seek takes a byte position as a parameter, while we need the exact beginning of a line; and we need to find not one value but several, one per month present in the file. So a combined method was used: binary search selects a sufficiently large region that is guaranteed to contain several records (3 to 10), and within that region we scan sequentially for a probable month change.

To tell the truth, I never finished the program; the file was successfully cut into pieces, each compressed as it was written to disk. All the necessary functions were written and called directly from the iPython console. Of course, other functions appeared besides the ones shown below, but those clearly led my ideas in the wrong direction, so I simply won’t include them in the post.

I began by writing a function that extracts the date from a log line and converts it into a python date; the re module makes it possible to do this in almost one line:

import re
import datetime

def date_in_str(line):
    # Pull the timestamp out of the square brackets, e.g. [10/Oct/2012:13:55:36 -0200]
    got_it = re.search(r"\[(?P<date>.*?) (?P<timezone>.*?)\]", line)
    dt = None
    if got_it is not None:
        try:
            dt = datetime.datetime.strptime(got_it.group('date'), '%d/%b/%Y:%H:%M:%S')
        except ValueError:
            dt = None
    return dt
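
For example, on a made-up access-log line:

sample = '127.0.0.1 - - [10/Oct/2012:13:55:36 -0200] "GET /index.html HTTP/1.0" 200 2326'
print(date_in_str(sample))    # 2012-10-10 13:55:36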

Then let’s use this function to find the first date in an arbitrary region read from our giant file:

def first_date(in_buf):
    # Return (offset within the buffer, parsed date, the line itself)
    # for the first line that carries a recognisable date.
    line_start = 0
    all_lines = in_buf.splitlines()
    for my_line in all_lines:
        dt = date_in_str(my_line)
        if dt is not None:
            return line_start, dt, my_line
        line_start = line_start + len(my_line) + 1   # +1 for the newline stripped by splitlines()
    return None

in_buf is a buffer – a chunk of characters obtained with the read function.

Then a function was written that determines whether a month changes somewhere among the lines of an arbitrary buffer and, if so, returns the offset of the start of that line within the buffer:

import sys

def find_month_change(in_buf):
    all_lines = in_buf.splitlines()
    prev_date = None
    line_start = 0
    for my_line in all_lines:
        dt = date_in_str(my_line)
        if dt is not None:
            sys.stdout.write('.')        # crude progress indicator
            sys.stdout.flush()
            if prev_date is None:
                prev_date = dt
            elif prev_date.month != dt.month:
                return line_start        # offset of the first line of the new month
            else:
                prev_date = dt
        line_start = line_start + len(my_line) + 1
    return None
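
A quick sanity check on a couple of made-up records (the dots come from the progress indicator):

toy_buf = ('1.2.3.4 - - [31/Oct/2012:23:59:58 -0200] "GET / HTTP/1.0" 200 512\n'
           '1.2.3.4 - - [01/Nov/2012:00:00:03 -0200] "GET / HTTP/1.0" 200 512\n')
print(find_month_change(toy_buf))    # prints two dots, then the offset of the November line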

Another small function reads buf_size bytes from the given offset in the file and returns the first date found there along with its position:

buf_size = 1024 * 64

def get_date_in_file(my_file, offset):
    # Read one buffer's worth of data at the requested offset and find the first date in it.
    my_file.seek(offset, 0)
    buf = my_file.read(buf_size)
    return first_date(buf)

All that is left is to combine them into a function that uses binary search to find the boundary line of the last month. Here is the first version:

def get_it(my_file, start):
    first_offset, first_dt, _ = get_date_in_file(my_file, start)
    first_month = first_dt.year * 12 + first_dt.month

    my_file.seek(0, 2)                       # jump to the end of the file
    f_size = my_file.tell()
    position = f_size - buf_size

    last_offset, last_dt, _ = get_date_in_file(my_file, position)
    last_month = last_dt.year * 12 + last_dt.month

    not_found = True
    end_shr = f_size
    while not_found:
        position = start + int((end_shr - start) / 2) - buf_size

        my_file.seek(position, 0)
        buf = my_file.read(buf_size)
        offset, mid_dt, _ = first_date(buf)
        mid_month = mid_dt.year * 12 + mid_dt.month
        mid_date_offset = position + offset

        if mid_month == (last_month - 1):
            month_chng_offset = find_month_change(buf)
            if month_chng_offset is not None:
                not_found = False
                return position + month_chng_offset
        elif mid_month == last_month:
            end_shr = mid_date_offset
        else:
            start = mid_date_offset

This function is not immune to mistakes; it was written on the assumption that:

  • Records are ordered strictly ascending – this condition is almost 100% fulfilled; the occasional date backlash is insignificant and doesn’t affect the stability of the algorithm.
  • There are records for every month – in this case, 100% fulfilled.
  • A month change will definitely fall inside the read buffer, i.e. the last line of the buffer won’t be the last line of the month, nor an incomplete line of the next month from which the full date can’t be read – here, it was a matter of luck.

On reflection it becomes clear that the third assumption may sometimes not hold. The reading algorithm could be extended: if the records in the buffer come very close to the end of the last day of the month, keep reading lines one by one until a line with the next day’s date is encountered.
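
A possible sketch of that fallback (my own addition, not part of the original code), which simply walks forward line by line from a given position until the month changes:

def scan_forward_for_month_change(my_file, position):
    # Hypothetical helper: read line by line from `position` until a record
    # from the next month shows up, and return the offset where it starts.
    my_file.seek(position, 0)
    prev_date = None
    line_start = position
    for my_line in my_file:
        dt = date_in_str(my_line)
        if dt is not None:
            if prev_date is not None and prev_date.month != dt.month:
                return line_start
            prev_date = dt
        line_start += len(my_line)
    return None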

Despite the above-mentioned shortcomings, the function did its job. I called it manually in iPython, passing the open file and an offset close enough to find a month. From time to time I sampled dates from various parts of the file, building a rough map of how the records were distributed. If the function failed, I called it again with different parameters, and it returned the real index of the start of the line where the month changes.

After receiving the long hoped-for index, I called another function that saved everything from that index onward in compressed form:

import subprocess

def compress_tail(in_file, out_file):
    # Pipe everything from the current position of in_file through gzip into out_file.
    compressor = subprocess.Popen(['gzip', '-6'], stdin=subprocess.PIPE, stdout=out_file)
    buff = in_file.read(1024 * 1024)
    while True:
        compressor.stdin.write(buff)
        if len(buff) < 1024 * 1024:
            break
        buff = in_file.read(1024 * 1024)
    compressor.communicate()                 # close stdin and wait for gzip to finish
    return compressor

Keeping it simple and not pulling in the gzip module or anything like it, we follow the Unix way and spawn a child process with its standard streams wired up: STDIN is fed from our giant log, positioned at the offset where the month begins, and STDOUT is a newly opened file whose name contains the month and year plus a “gz” extension. We feed the tail of the log to the child process in 1 MB portions.

The last step is calling truncate(offset - 1).
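
Put together, one iteration of the manual cycle from the iPython console looked roughly like this (a sketch only, keeping to the Python 2-era conventions of the code above; the file names and the month label are illustrative):

big_log = open('BIG_BIG_LOGFILE.log', 'r+')

# 1. find the offset where the last month begins (0 is a safe lower bound;
#    a closer guess just speeds the search up)
offset = get_it(big_log, 0)

# 2. compress everything from that offset into a monthly archive
big_log.seek(offset, 0)
out_file = open('XXXX-Oct-2012.log.gz', 'wb')
compress_tail(big_log, out_file)
out_file.close()

# 3. cut the compressed tail off the source file
big_log.truncate(offset - 1)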