Wednesday, June 8, 2011

Analyzing spam distribution with OpenBSD's spamd, Google's Fusion Tables API & Python

Below are the results I got from analyzing spam distribution data with Google's Fusion Tables (GFT) API, Python and OpenBSD. It took me three nights (all of it done just before going to bed) to code and test the prototype. In the end I was quite satisfied with the results; a few minor things could have been done better, but overall I am pleased with how it turned out.

The whole point of this exercise was to satisfy my curiosity about how spam traffic is distributed across the globe, and to map the culprits geographically to get a clear view of what is happening.

To get the final results, I needed to set up three components:
  • OpenBSD's spamd, which was used to harvest the spam offenders' IP addresses. This was done by grey-trapping and grey-listing the spammers. (If you don't know what grey-listing and grey-trapping are, do a search on Google; there are plenty of resources to read.)
  • Python, which was used to extract the data from OpenBSD's spamd logs/block list and to generate the GeoIP info from the offenders' IP addresses.
  • Google's Fusion Tables (GFT), which was used for presentation and data manipulation. The data extracted from OpenBSD was then transferred to GFT with the GFT Python API.
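A minimal sketch of the extraction step, assuming the input is in the pipe-separated layout that spamdb(8) prints (for example `TRAPPED|source IP|expire`). The sample addresses are documentation-range placeholders, and `fake_lookup` is a hypothetical stand-in for a real GeoIP database lookup; neither comes from the actual data set.

```python
import ipaddress

# Sample text in the pipe-separated layout printed by spamdb(8).
# The addresses are documentation-range placeholders, not real offenders.
SAMPLE = """\
WHITE|192.0.2.10|||1307400000|1307403600|1310515600|2|5
GREY|198.51.100.7|mail.example.net|<a@example.net>|<b@example.org>|1307500000|1307514400|1307514400|1|0
TRAPPED|203.0.113.5|1307586400
TRAPPED|203.0.113.99|1307590000
TRAPPED|203.0.113.5|1307599999
TRAPPED|not-an-ip|1307590000
"""


def trapped_ips(spamdb_text):
    """Return the unique, valid IPs from TRAPPED entries, in first-seen order."""
    seen = []
    for line in spamdb_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2 or fields[0] != "TRAPPED":
            continue
        try:
            ip = str(ipaddress.ip_address(fields[1]))
        except ValueError:
            continue  # skip malformed addresses
        if ip not in seen:
            seen.append(ip)
    return seen


def annotate(ips, lookup_country):
    """Pair each IP with a country code from a caller-supplied lookup function."""
    return [(ip, lookup_country(ip)) for ip in ips]


def fake_lookup(ip):
    """Hypothetical lookup; a real one would query a GeoIP database."""
    return {"203.0.113.5": "A1"}.get(ip, "??")


rows = annotate(trapped_ips(SAMPLE), fake_lookup)
print(rows)
```

Passing the lookup in as a function keeps the parsing independent of whichever GeoIP library is installed.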
There were some exceptions (or should I say 'issues') when transferring the data with the GFT API:
  1. GFT has design constraints, as described here: requests to the GFT servers are limited to 1 MB each, and the API allows at most 5 requests per second.
  2. If you have a huge list of data (in the range of thousands of rows) to load into GFT, I'd suggest using GFT's bulk transfer/import (via file). But again, you first need to convert the data to CSV before importing it into GFT.
  3. In several tests I had problems populating GFT via the API. The most I could get through was 'only' around 1700 to 1800 rows in one go before the connection was reset by peer (maybe Google's engineers thought I was DoS-ing their servers). There were also occasions where the connection simply stalled even though my internet connection was still up.
  4. You also need to write your own API client code, as the GFT API is pretty much language agnostic. This means the client can be written in any language, be it Python, PHP, Java, JavaScript or even shell script. The only thing to remember is how the data is transferred and treated; all of this is covered in the GFT Developer's Guide, which can be found at this URL.
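One way to stay inside the two limits in item 1 is to group statements into request bodies below the size cap and to space the requests out. The sketch below only illustrates that idea: the `INSERT` text, the table name `spam_sources` and the `send` callable are hypothetical placeholders, not the GFT protocol, and the real HTTP call is left to the caller.

```python
import time


def batch_by_size(statements, max_bytes=1_000_000, sep=";"):
    """Group statements into request bodies no larger than max_bytes."""
    batch, size = [], 0
    sep_len = len(sep.encode("utf-8"))
    for stmt in statements:
        stmt_len = len(stmt.encode("utf-8"))
        if stmt_len > max_bytes:
            raise ValueError("single statement exceeds the request limit")
        extra = stmt_len + (sep_len if batch else 0)
        if batch and size + extra > max_bytes:
            yield sep.join(batch)          # current body is full, emit it
            batch, size, extra = [], 0, stmt_len
        batch.append(stmt)
        size += extra
    if batch:
        yield sep.join(batch)


def send_all(bodies, send, max_per_second=5,
             sleep=time.sleep, clock=time.monotonic):
    """Call send(body) for each body, pausing so the rate limit is respected."""
    interval = 1.0 / max_per_second
    last = None
    sent = 0
    for body in bodies:
        if last is not None:
            wait = interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        send(body)
        sent += 1
    return sent


# Placeholder statements; the syntax and table name are illustrative only.
stmts = ["INSERT INTO spam_sources (ip) VALUES ('203.0.113.%d')" % i
         for i in range(10)]
bodies = list(batch_by_size(stmts, max_bytes=150))

outbox = []                                # stands in for the real HTTP call
send_all(bodies, outbox.append, sleep=lambda seconds: None)
print(len(bodies), "request bodies built")
```

Splitting the upload into smaller bodies like this also means a reset connection only costs the current batch rather than the whole run.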
Overall I can say that GFT is quite useful, especially for visualizing data for an audience. Although it is still at a very early beta stage, its features are quite powerful. GFT also lets users create views and tables, filter and query the data, and map it to geographic locations (thanks to geocoding, although this may take much longer as GFT needs to translate the names and locations into latitude and longitude).

Below is the spam source data that was extracted and processed with GFT:

This is a spam distribution map based on the originating IP address (please click on one of the dots on the map to get the full details of the location and IP address).


Heat map analysis can also be done with Google Fusion Tables, as can be seen from this map:


With GFT, spam distribution can also be classified and grouped by the two-character country code.
Note:
  • A1 denotes IPs that are coming from anonymous proxies.
  • A2 denotes IPs that are coming from satellite providers.
Please click on one of the peaks to see the country code and the number of spam sources originating there.




The intensity map was produced by aggregating the data by country in GFT (in database terms, a 'group by' aggregate). Please click on a colored area to see the number of spam sources originating from that region.
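For readers unfamiliar with 'group by', the same aggregation can be sketched locally in Python. The rows below are hypothetical placeholders rather than the real data set, and this mirrors the idea only, not how GFT computes it.

```python
from collections import Counter

# Hypothetical (ip, country_code) rows; A1 is the anonymous-proxy code noted above.
rows = [
    ("203.0.113.5", "CN"),
    ("203.0.113.9", "US"),
    ("198.51.100.2", "CN"),
    ("198.51.100.8", "A1"),
    ("192.0.2.44", "CN"),
]


def count_by_country(rows):
    """Count spam sources per country code (a 'group by' with a count)."""
    return Counter(country for _ip, country in rows)


counts = count_by_country(rows)
for country, total in counts.most_common():
    print(country, total)
```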

Conclusion:
  • GFT makes it easy to visualize data; it could also become an invaluable tool for data presentation and analysis, provided the user knows how to use it.
  • The GFT API has constraints, especially when uploading a large volume of data in one go. The most I could get through was 'only' 1700 to 1800 records before the transfer was interrupted by a 'connection reset by peer' error (possibly a DoS mitigation mechanism on GFT's side).
  • The API itself is language agnostic, in the sense that there is no specific API client for GFT. What you need to do is follow the GFT protocol and standard. It is a double-edged sword: you have to write your own code and there is no standard client, but as long as you know how it works you can write the client in any language you like.
  • Some of the features are quite useful, such as aggregate, filter and simulation (if there is a date column); these might come in handy sometimes.
  • In the long run I think GFT has a bright future and might gain traction with non-technical users.
