Tuesday, December 13, 2011

Trend predictions with RRDtool - the not-so-common RRD tricks that might come in handy

Background

Have you ever been in a situation where your boss asked you to analyze a set of data and make a prediction based on it? If not, how about this situation: you have been collecting data from your routers, servers, and storage for the past six months, but still have no clue what to do with it.

Does any of this sound familiar, or have you ever been in one of those situations before? If the answer is "yes", then here's a hint: how about making a trend prediction? With trend prediction and analysis you can 'predict', with a degree of certainty, when a certain value will be reached.

Data Analysis

You must be asking: how would I do that? Well, say hello to statistics. By definition (based on Wikipedia): "Statistics is the study of the collection, organization, analysis and interpretation of data." In that sense, by analyzing the existing data you can construct new data points; this process is called extrapolation.

Extrapolation is actually a by-product of curve fitting. Curve fitting itself is the process of constructing a mathematical function that best fits a series of data points. This literally means that the process creates a model function on which all the data points fit together.

Based on the behaviour of the data, curve fitting can be categorized into two:
  1. Linear curve fitting (for linear data: least squares, root-square, etc.)
  2. Non-linear curve fitting (for non-linear data: polynomial, conic, French curve, etc.)
Since the focus of this article is only to apply simple statistics to practical day-to-day scenarios, there won't be any discussion of the non-linear methods here.

The Tools

Fortunately, there are a few tools (freely downloadable and open source) that can do statistical analysis and curve fitting at the same time. The ones I know of are full-blown mathematical packages where you can do all the fancy mathematical stuff, far beyond just statistical analysis. Which raises the question: "Do we need a full-blown mathematical tool just for predicting a linear data set?" The answer is: "No."

Learning all those tools might be a bit of a challenge, given that we would only be using a small portion of their features. The good news is that there's a less intimidating tool that can do much the same job without being as much of an overkill as the ones mentioned above.

RRDTool

The tool's name is RRD. For those working in IT this name sounds very familiar, especially since this tool has long been used for graphing data such as bandwidth, storage capacity, number of concurrent users, and so on. This open-source tool is so common and popular that it has become the de facto standard for graphing; mature open-source projects such as Cacti, Nagios, and Zenoss use RRDtool as a critical component of their products.

Which brings me to my point: rather than learning a new set of full-blown tools, why don't we fully exploit the one we already use on an everyday basis, the one that has become the de facto standard?

RRDTool Documentation

Before proceeding to the next section, I am assuming that the audience is well aware of what RRDtool is for and how it works, especially given that most readers of this blog work in IT and are fully aware of what RRD can do. But for those who are not familiar with RRDtool, fear not: the project has matured and so has its documentation. For those who are still new to RRD, a good RRD tutorial can be found at this link. A good knowledge of RPN (Reverse Polish Notation) will also come in handy, as RPN is used heavily in RRD, especially for data calculation, operator functions, and logic flow. An RPN tutorial can also be found at this URL.

RRDTool Statistic Functions

While scouring through the RRDtool documentation, I stumbled across these very interesting functions: LSLINT, LSLSLOPE, PREDICT, and PREDICTSIGMA, on this page. After looking around the Internet, I finally found these three websites that can be used as references for those functions:
  1. Piece of Unix Hints
  2. Circle of Archimedes
  3. RRDTool Dev Mailing-list
Unfortunately, those functions are only available in RRDtool versions 1.3.x and 1.4.x; anything below 1.3.x will not work. So make sure that your RRDtool version is at least 1.3.x.
To check the version, use this command:
shell$ rrdtool -v
Reconstructing the data from pre-1.3.x

The good thing about RRD is that backward compatibility is maintained, which means trend prediction can still be run against data collected by pre-1.3.x installations. All you need to do is get hold of the *.rrd file.

Even if that fails, there is still a workaround to recover the data by making full use of rrdtool dump: extract the RRD data and convert it into XML, then regenerate the *.rrd file from that XML with rrdtool restore.

This is how the workaround works in detail:
shell$ rrdtool dump old_data_from_pre_1.3.rrd > old_data_from_pre_1.3.xml 
shell$ rrdtool restore  old_data_from_pre_1.3.xml  converted_data_from_pre_1.3.rrd 
Generating The Trend Prediction with RRDtool

Predicting the data trend with RRD can be done in two ways:
- Trend prediction using a linear data set
- Trend prediction using a non-linear data set

Trend prediction using a linear data set

For a linear data set, as described in the section above, the least-squares method is used. With the least-squares method, the fitted line always follows this function:

y = a(x) + b

Where "a" denotes the gradient or the slope of the line and "b" is the constant value of the function (or some people call this "y-intercept").

Fortunately, these values can be obtained directly from the following functions in RRD (note: you need to run version 1.3.x or greater to be able to do this):

LSLSLOPE = "a", which denotes the slope of the line (or gradient)
LSLINT = "b", which denotes the constant value of the function (or "y-intercept")

LSLSLOPE and LSLINT are derived from the RRD data via the least-squares method (note: this assumes that the data being projected is linear).
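If you just want to see the two raw values that RRD computes, the minimal sketch below simply prints them to the terminal. It assumes an RRD file named proc_util.rrd with a data source called proc, matching the full script further down; apart from that, nothing here is specific to my setup:

#!/bin/sh
# Minimal sketch: print the least-squares slope and intercept of a data source.
# Assumes "proc_util.rrd" with a data source named "proc" (as in the full script below).
rrdtool graph /dev/null \
  --start=-4w --end=now \
  DEF:a="proc_util.rrd":proc:AVERAGE \
  VDEF:slope=a,LSLSLOPE \
  VDEF:cons=a,LSLINT \
  PRINT:slope:"slope (a) = %lf" \
  PRINT:cons:"y-intercept (b) = %lf"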

To see this put to use in a full graph, please look at the following script
(spoiler alert: you need to be familiar with rrdtool and RPN, Reverse Polish Notation, before you will be able to follow it):
1:   #!/bin/sh
2:   rrdtool graph proc_util_LSL.png \
3:   --imgformat=PNG \
4:   --start=-4w \
5:   --end='+2m' \
6:   --upper-limit=100 \
7:   --title="Monthly Number of Processes on `uname`" \
8:   --font TITLE:12: \
9:   --units-exponent=2 \
10:   --vertical-label="Processes" \
11:   --height=250 \
12:   --width=600 \
13:   DEF:a="proc_util.rrd":proc:AVERAGE:step=300 \
14:   DEF:a1="proc_util.rrd":proc:MAX:step=300 \
15:   DEF:a2="proc_util.rrd":proc:MIN:step=300 \
16:   DEF:a3="proc_util.rrd":proc:LAST:step=300 \
17:   CDEF:idle=a,0.00,EQ,INF,0,IF \
18:   VDEF:avg=a,AVERAGE \
19:   VDEF:slope=a,LSLSLOPE \
20:   VDEF:cons=a,LSLINT \
21:   CDEF:lsl2=a,POP,slope,COUNT,*,cons,+ \
22:   CDEF:pred=lsl2,90,100,LIMIT \
23:   VDEF:minpred=pred,FIRST \
24:   VDEF:maxpred=pred,LAST \
25:   GPRINT:a:AVERAGE:"No. of Processes Avg\: %8.0lf" \
26:   GPRINT:a1:MAX:"Max\: %8.0lf" \
27:   GPRINT:a2:MIN:"Min\: %8.0lf" \
28:   GPRINT:a3:LAST:"Last\: %8.0lf\n" \
29:   AREA:a#00B2EE:"Processes" \
30:   AREA:idle#AFEEEE \
31:   AREA:pred#BCD2EE \
32:   LINE2:avg#00FF00:"Avg":dashes=5 \
33:   LINE3:lsl2#ff0000:"Least Sqr Pred.\n":dashes=8 \
34:   LINE2:90 \
35:   AREA:5#FF000022::STACK \
36:   AREA:5#FF000044::STACK \
37:   GPRINT:minpred:"reaching 90 on \: %c\n":strftime \
38:   GPRINT:maxpred:"reaching 100 on \: %c\n":strftime \
39:   COMMENT:"\s" \
40:   COMMENT:"Graph Created on\: `date sed -e 's/:/\\\:/g'`\r" \
41:   COMMENT:"tiskanto(at)gmail(dot)com\r"  
The script might look a bit complex at first, but if you look closely, the most important parts sit on just these three lines, whereas the rest are just 'cosmetics':
19:   VDEF:slope=a,LSLSLOPE \
20:   VDEF:cons=a,LSLINT \
21:   CDEF:lsl2=a,POP,slope,COUNT,*,cons,+ \  
Line 19: this is where I get the gradient, the "a" in y = a(x) + b.
Line 20: this is where I get the constant value, the "b" in y = a(x) + b.

Line 21:
this one is a bit tricky. Since we are dealing with RPN, everything happens on a stack. What we want is to build the line y = a(x) + b for every time step of the graph, which is done with the 'COUNT' function. The data series 'a' is pushed onto the stack only so that the CDEF runs over the same time steps as the original data; the 'POP' command then immediately discards that value again, leaving the stack free for the actual calculation.

So, at each time step, 'COUNT' (the index of that step) is multiplied by 'a' (the gradient) and 'b' (the constant value) is added to the result; with this, the full linear function y = a(x) + b is constructed across the whole graphing window, including the future part of the time range.
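To make the stack handling clearer, here is a step-by-step walk-through of how line 21 evaluates at every time step (my own annotation of the script above, written as shell comments; 'slope' and 'cons' are the VDEFs computed on lines 19 and 20):

# CDEF:lsl2=a,POP,slope,COUNT,*,cons,+  evaluates as:
#   a      -> push the value of the data series (only so the CDEF spans the same time steps)
#   POP    -> discard it again; the stack is now empty
#   slope  -> push the least-squares gradient "a"
#   COUNT  -> push the index of the current time step (1, 2, 3, ...)
#   *      -> slope * COUNT
#   cons   -> push the y-intercept "b"
#   +      -> slope * COUNT + cons, i.e. y = a(x) + b for this step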

I know it can be a bit confusing at first, but don't worry: once you get the hang of it, the rest should be pretty easy.

The 'cosmetics' that I've put in this script are only there to please the eye and shouldn't be that difficult to follow, given that everything is documented nicely in RRD. Some of the cosmetic tricks shown here even came from the URL references already listed in the section above (if you read them carefully).

Following is the final result from the script:



As you can see from the above picture, using the least-squares method we can predict when the process count will hit 90 and 100. With this kind of method we can forecast the growth of a system, so proactive measures can be taken before it becomes a problem.

Trend prediction using a non-linear data set

For a non-linear data set, there are several curve-fitting methods that can be used. The one RRD uses, however, is based on average values (the method is called a 'sliding window average'). There is a catch to this method: because it relies on a sliding-window average, the prediction is only valid for up to a few days ahead. The reason is that an average gets 'skewed' over time, and beyond that horizon the prediction is no longer precise or valid. For a more thorough prediction a polynomial method might be needed, which RRD does not support at the moment.

The details of how this method works can be read in the documentation (look for the 'PREDICT' and 'PREDICTSIGMA' functions). These functions work as follows (again in CDEF notation):

CDEF:predict_value=<shift>,-n,<window>,x,PREDICT

where:
- shift is the number of seconds each window is shifted backward in time (one shift per past period)
- -n is how many of those shifts to take (the minus sign tells RRD to use multiples 1..n of the single shift value)
- window is the number of seconds over which the sliding-window average is calculated
- x is the data series being predicted

So, for example, a statement like this:

CDEF:predict_value=86400,-7,900,x,PREDICT

means that the prediction is calculated from the data at the same time over each of the past 7 days (24 hrs x 7), with each value averaged over a 15-minute window.

Note:
86400 is equal to 60 seconds x 60 minutes x 24 hours (one day)
900 is equal to 60 seconds x 15 minutes

The syntax for PREDICTSIGMA is the same as for PREDICT.

To see how this function works, please look at the following script. Also bear in mind that, for the purposes of this example, I won't be plotting PREDICTSIGMA, because it is only used to calculate the 'upper' and 'lower' bounds of the prediction around the actual data.
1:   #!/bin/sh
2:   rrdtool graph test123.png \
3:   --imgformat=PNG \
4:   --start=-1w \
5:   --end='+1w' \
6:   --title="Monthly CPU Util. on test123" \
7:   --font TITLE:12: \
8:   --units-exponent=2 \
9:   --vertical-label="CPU Util." \
10:   --height=250 \
11:   --width=600 \
12:   DEF:a="test123.rrd":5min_cpu:AVERAGE:step=300 \
13:   DEF:a1="test123.rrd":5min_cpu:MAX:step=300 \
14:   DEF:a2="test123.rrd":5min_cpu:MIN:step=300 \
15:   DEF:a3="test123.rrd":5min_cpu:LAST:step=300 \
16:   CDEF:idle=a,0.00,EQ,INF,0,IF \
17:   CDEF:smooth=a,1800,TREND \
18:   CDEF:predict=86400,-7,900,a,PREDICT \
19:   CDEF:sigma=86400,-7,900,a,PREDICTSIGMA \
20:   VDEF:avg=a,AVERAGE \
21:   GPRINT:a:AVERAGE:"CPU Util Avg\: %8.0lf" \
22:   GPRINT:a1:MAX:"Max\: %8.0lf" \
23:   GPRINT:a2:MIN:"Min\: %8.0lf" \
24:   GPRINT:a3:LAST:"Last\: %8.0lf\n" \
25:   AREA:a#00B2EE:"CPU Util\n" \
26:   AREA:idle#AFEEEE \
27:   LINE2:avg#00FF00:"Avg\n":dashes=5 \
28:   LINE2:predict#FF00FF:"Trend Prediction\n":dashes=5 \
29:   LINE1:smooth#FF0000:"Sliding Window Smoothing": \
30:   COMMENT:"\s" \
31:   COMMENT:"Graph Created on\: `date sed -e 's/:/\\\:/g'`\r" \
32:   COMMENT:"tiskanto(at)gmail(dot)com\r"  
Again, as described before, the only interesting parts of the script sit on the following lines, whereas the other parts are simply 'cosmetics':
17:   CDEF:smooth=a,1800,TREND \
18:   CDEF:predict=86400,-7,900,a,PREDICT \
Line 17 is a TREND function that creates a half-hour sliding-window average of 'a'.
Line 18 is where the prediction is calculated, based on the past 7 x 24 hours of data, with the sliding-window average computed over a 15-minute interval.

Following is the final result from the script.



If you look more closely at this graph, the prediction captures the Saturday and Sunday trends quite accurately. This can easily be seen from the peaks on Saturday and Sunday in the extrapolated line.
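As a side note, the 'sigma' CDEF computed on line 19 of the script above is never actually drawn. If you do want to visualize the uncertainty around the prediction, PREDICTSIGMA can be combined with PREDICT into upper and lower bands. The lines below are only a sketch of the idea, meant to be appended to the existing script: the names 'upper' and 'lower' and the two-sigma multiplier are my own choices, not something prescribed by RRDtool:

CDEF:upper=predict,sigma,2,*,+ \
CDEF:lower=predict,sigma,2,*,- \
LINE1:upper#FF8800:"Prediction + 2 sigma\n":dashes=3 \
LINE1:lower#FF8800:"Prediction - 2 sigma\n":dashes=3 \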

One Last Thing

Remember that we discussed earlier that this 'sliding window average' method can only be 'trusted' or considered 'valid' for a certain number of days, due to the way it works. Any prediction based on an average will get 'skewed' over time. The same thing happens with this method, and we can demonstrate it by modifying the 'sliding window' value in the PREDICT function and seeing how it behaves, for instance with the variant sketched below.
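For example, a variant like the one below widens the sliding-window average from 15 minutes to a full day. The exact window value used for the graph below is not the point here; 86400 is simply an illustrative choice, and 'predict_wide' is a name I made up:

# same day-by-day shifts as before, but the average is now taken over an 86400-second (1-day) window
CDEF:predict_wide=86400,-7,86400,a,PREDICT \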



As can be seen above, the prediction gets 'skewed' simply by increasing the sliding-window value. The predicted trend moves towards the overall average as time goes by (the overall average is drawn in green).

Conclusion

Trend prediction and capacity planning, especially in IT, can be a complex matter, but it can be done easily provided the user knows how to use the tools effectively. As explained before, you don't need to be very good at statistics to create a good, decent trend prediction for your systems. RRDtool, which has become the de facto standard for graphing in IT, can do this job as well as other off-the-shelf commercial mathematical tools, to some degree of course. With a relatively short learning curve, a user who is already familiar with RRD can turn this little tool into a very powerful and useful one, and not just for plotting.

I hope you find this article helpful. It took quite a while to collate this information into one place, as there are only a handful of articles on the Internet explaining RRD trend prediction. Credit also goes to the blogs in the reference section above, which gave me insight into how RRDtool trend prediction works.

Wednesday, June 8, 2011

Analyzing spam distribution with OpenBSD's spamd, Google's Fusion Tables API & Python

The following are the results I got from analyzing spam distribution data with Google's Fusion Tables (GFT) API, Python, and OpenBSD. It took me three nights (all of it done just before going to bed) to code and test the prototype. At the end of the exercise I was quite satisfied with the results; there were some minor things that could have been done better, but overall I am quite pleased with everything.

The whole point of this exercise was to satisfy my curiosity about spam traffic distribution across the globe and to map the culprits geographically, just to get a clear view of what is happening.

To get the final results, I needed to set up these three components:
  • OpenBSD's spamd, which was used to harvest the spam offenders' IP addresses. This was done by grey-trapping and grey-listing the spammers. (If you don't know what grey-listing and grey-trapping are, do a search on Google; there are plenty of resources to read.)
  • Python, which I used to extract the data from OpenBSD's spamd logs/block list and to generate GeoIP information from the offenders' IP addresses (a minimal sketch of the extraction step is shown right after this list).
  • Google Fusion Tables (GFT), which was used for presentation and data manipulation. The data extracted from OpenBSD was then transferred to GFT via the GFT API, using Python.
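For the extraction step, here is a minimal sketch of the idea. It assumes spamd's greylisting database is readable via spamdb and that trapped entries use the usual pipe-separated "TRAPPED|address|expire" layout; the output file name is just an illustrative choice (my actual prototype did this part, together with the GeoIP lookup, in Python):

#!/bin/sh
# Minimal sketch: collect the IP addresses currently grey-trapped by spamd.
# Assumes pipe-separated spamdb output where trapped entries start with "TRAPPED".
spamdb | awk -F'|' '$1 == "TRAPPED" { print $2 }' | sort -u > trapped_ips.txt
wc -l trapped_ips.txt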
There were some caveats (or should I say 'issues') when transferring the data with the GFT API:
  1. There are design constraints on GFT, as described here: requests to GFT servers are limited to 1 MB per request, and the API allows only 5 requests per second.
  2. If you have a huge amount of data (in the range of thousands of rows) that needs to be loaded into GFT, I'd suggest using GFT's bulk transfer/import (via file). But again, you need to convert the data into CSV first before importing it into GFT.
  3. In several tests I ran into problems when transferring/populating data to GFT via the API. The most I could get was 'only' around 1,700-1,800 records in one go before the connection was reset by the peer (maybe Google's engineers thought I was DoS-ing their servers). There were also occasions where the connection simply got stuck or stalled even though my internet connection was still up.
  4. You also need to write your own API client code, as the GFT API is pretty much language agnostic. This means a client can be written in any language, be it Python, PHP, Java, JavaScript, or even shell script. The only thing to remember is how the data is transferred and treated; all of this information can be found in the GFT Developer's Guide at this URL.
Overall, I can say that GFT is quite useful, especially for visualizing data for an audience. Although it is still at a very early beta stage, its features are quite powerful. GFT also allows users to create views and tables, filter and query the data, and map it to geographic locations (thanks to geocoding, although this may take much longer, as GFT needs to translate names and locations into latitude/longitude).

The following are the spam source data that were extracted and processed with GFT:

This is a spam distribution map based on the originating IP addresses (please click on one of the dots on the map to get the full details of the location and IP address).


Heat-map analysis can also be done with Google Fusion Tables, as can be seen from this map.


With GFT, spam distribution can also be classified and grouped based on the two-digit country code.
Note:
  • A1 denotes IPs coming from anonymous proxies
  • A2 denotes IPs coming from satellite providers
Please click on one of the peaks to see the country code along with the number of originating spam sources.




The intensity map was produced by aggregating by country in GFT (in database terms, a 'group by' aggregate). Please click on a colored area to see the number of spam sources originating from that region.

Conclusion:
  • GFT can be used to visualize data easily; it could also become an invaluable tool for data representation and analysis, provided the user knows how to use it.
  • There are constraints on the GFT API, especially when uploading/populating a large volume of data in one go. The maximum I could get was 'only' 1,700 to 1,800 records at a time before the transfer was interrupted with a 'connection reset by peer' error (possibly a DoS-mitigation mechanism on GFT's side).
  • The API itself is language agnostic, in the sense that there is no official client library for GFT; what you need to do is follow the GFT protocol. This is a double-edged sword: you have to write your own client code, but as long as you know how the protocol works you can write it in any language you like.
  • Some of the features, such as aggregation, filtering, and simulation over time (when a date data type is present), are quite useful and can come in handy.
  • In the long run I think GFT has a bright future and might gain traction with non-technical users.