Easy web access log processing with R

I recently needed to diagnose a problem on my web site, so I wanted to analyse the raw access log file.  There are many out-of-the-box analytical utilities available, but none of them allowed me to drill down into the access patterns in enough detail.

So I located my web access log and downloaded it.  It is a text file whose fields are delimited by spaces but can be enclosed in square brackets “[…]” or quotes “…”.  In addition, some of the fields contain several pieces of information that can be broken down to facilitate analysis.  Here is an example of one line, which represents one access request:

I decided to do an exercise to see how R can handle a file like this, and would like to share my findings with you.

R is a very powerful analytical language with many building blocks (libraries) designed to make data analysis easy.  One of the basic requirements for data analysis is the data itself, and R has many libraries for reading it, tailored to different needs.  In this article, I will focus on data stored in simple text files, a very common use case for ad-hoc data analysis without a complex database or data warehouse.

There is a very handy library in R called readr.  readr is part of the core tidyverse, so remember to load it in your R project with library(readr) or library(tidyverse).  Note that R is case sensitive, so library must be written in lower case.

readr provides several core functions for reading flat files, including:

  • read_csv(): to read comma separated (CSV) files
  • read_tsv(): to read tab separated files
  • read_delim(): to read general delimited files
  • read_fwf(): to read fixed width files
  • read_table(): to read tabular files where columns are separated by white space
  • read_log(): to read some of the most common web log files

Bingo! read_log() may just be exactly what I need for this job, so let’s give it a try.

Here are two simple lines of code to read the access log file:

> library(readr)
> accesslog <- read_log("access.log", skip=0, col_names=FALSE)
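For readers without a log at hand, here is a minimal, self-contained sketch of the same call.  The two log lines are made up for illustration (they follow the widely used Common Log Format, which may differ from the layout of the log analysed in this article), and sample_lines, sample_path and sample_log are hypothetical names.

```r
library(readr)

# Two made-up requests in Common Log Format (illustrative only, not real data).
sample_lines <- c(
  '192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
  '192.0.2.7 - - [11/Oct/2000:08:02:11 -0700] "GET /missing.html HTTP/1.0" 404 512'
)

# Write them to a temporary file so read_log() has a file to parse.
sample_path <- tempfile(fileext = ".log")
writeLines(sample_lines, sample_path)

# Bracketed and quoted fields each come back as a single column.
sample_log <- read_log(sample_path, col_names = FALSE)
print(dim(sample_log))
print(sample_log)
```

Each sample line has seven space-delimited fields, so the result should have seven generically named columns.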

The entire log file is now in accesslog.  Internally it is stored as a “list”, and its class shows that it is a tibble, which is a kind of “data.frame”.  You can check this yourself:

> typeof(accesslog)
[1] "list"
> class(accesslog)
[1] "tbl_df" "tbl" "data.frame"

The accesslog object now has 6584 rows and 11 fields.

Here is a view of accesslog; as you can see, the column names are generic for now:

Next let’s create some user friendly column names:

> colnames(accesslog) = c("IP", "identd", "HTTP_User", "Timestamp", "URL", 
+ "Status", "Size", "Domain", "Referrer", "User Agent", "Dont_Know")
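As an aside, read_log() also accepts a character vector for col_names, so the names can be supplied while reading instead of afterwards.  The sketch below assumes a made-up line with nine fields in the common "combined" layout; the sample line, log_names, named_path and named_log are hypothetical and do not come from the log used in this article.

```r
library(readr)

# One made-up request with nine fields (illustrative only, not real data).
named_line <- paste(
  '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700]',
  '"GET /index.html HTTP/1.0" 200 2326',
  '"http://www.example.com/start.html" "Mozilla/5.0 (X11; Linux x86_64)"'
)
named_path <- tempfile(fileext = ".log")
writeLines(named_line, named_path)

# Hypothetical names, one per field in the sample line above.
log_names <- c("IP", "identd", "HTTP_User", "Timestamp", "URL",
               "Status", "Size", "Referrer", "User_Agent")

# Passing the vector as col_names names the columns at read time.
named_log <- read_log(named_path, col_names = log_names)
print(names(named_log))
```

The number of names has to match the number of fields in the file, so it is worth checking ncol() on a first read with generic names before settling on the list.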

And now we see the user friendly names:

My analysis requires slicing and dicing the timestamp field so that I get the date, the hour, and the specific weekday.  So let’s do some data transformation next:

> library(stringr)
> # Pull the column with $ so str_match() gets a character vector, not a one-column tibble
> ts = accesslog$Timestamp
> accesslog$Date = as.Date(str_match(ts, "[0-9]{1,2}\\/[A-Za-z]{3}\\/[0-9]{4}")[,1], format="%d/%b/%Y")
> # Column 2 of the str_match() result holds the captured hour
> accesslog$Hour = str_match(ts, "[0-9]{4}:([0-9]{2}):")[,2]
> accesslog$Weekday = weekdays(accesslog$Date)
> View(accesslog)

These few lines of code extract parts of the Timestamp field and create new columns for Date, Hour, and Weekday.  They use the stringr library for the string manipulation.  If you get a bit lost here, please look up some documentation on pattern matching in R.  It is pretty straightforward once you understand the syntax.
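To see what the two patterns pick out, it can help to try them on a single string first.  The sketch below uses a made-up timestamp rather than a value from the real log, and the variable names are hypothetical.  It also shows str_extract(), a stringr function that returns only the matched text, as an alternative when no capture group is needed.

```r
library(stringr)

# A made-up timestamp in the usual access-log layout (not from the real log).
ts <- "10/Oct/2000:13:55:36 -0700"

# str_extract() returns only the matched text.
date_text <- str_extract(ts, "[0-9]{1,2}/[A-Za-z]{3}/[0-9]{4}")

# str_match() returns a matrix: column 1 is the whole match,
# column 2 is the first (...) capture group, here the hour.
hour_match <- str_match(ts, "[0-9]{4}:([0-9]{2}):")
hour_text <- hour_match[, 2]

# %b parses abbreviated month names, which depends on the session locale.
parsed_date <- as.Date(date_text, format = "%d/%b/%Y")

print(date_text)
print(hour_match)
print(hour_text)
print(weekdays(parsed_date))
```

Because str_match() returns the whole match alongside the captured group, only one of its columns is wanted for the Hour field.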

Now, here is our final table with the new extracted fields:

If you want to analyse the data using another tool, it is also simple to save it to a CSV file for future use:

> write.csv(accesslog, file="access.csv")
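One detail worth knowing: by default write.csv() also writes the row names as an extra first column.  The sketch below, which uses a small made-up data frame as a stand-in for accesslog, shows how row.names = FALSE avoids that and how the file can be read back to check it.  The names demo, csv_path and restored are hypothetical.

```r
# A tiny made-up table standing in for accesslog (illustrative only).
demo <- data.frame(
  IP = c("192.0.2.1", "192.0.2.7"),
  Status = c(200L, 404L),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from adding a column of row numbers.
write.csv(demo, file = csv_path, row.names = FALSE)

# Read the file back to confirm that only the real columns were saved.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
print(restored)
```

Whether to keep the row numbers is a matter of taste; leaving them out tends to make the file easier to load in other tools.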

What I would like to highlight here is how simple and elegant this solution is with R.  It takes only about 10 lines of code.

I will continue to use the data to create some analysis graphs in R.  Stay tuned, and I will share them next.
