orly going thirty: Parsing VPC Flow Logs with Pandas

Here's a short code snippet that parses AWS VPC Flow Logs. It is handy when you start out with permissive security groups and want to tighten them up later: the accepted flows show which sources, destinations and ports are actually in use.

This script will (probably) fail if there are too many VPC flow log files, because every file is loaded into a single DataFrame and the Python interpreter will eventually run out of memory. Still, it's nice to see that Pandas read_csv can read S3 URLs directly (even gzipped CSV files).

You can also filter on the REJECT action to find all the IPs whose traffic was blocked, which often points to attack attempts.
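As a rough illustration of the REJECT idea, here is a minimal sketch that counts rejected records per source address. The sample records and the helper name top_rejected_sources are made up for this example, and a rejected flow is not necessarily an attack, so treat the output as a starting point for investigation.

```python
import io
import pandas as pd

# Hypothetical sample records, laid out like the columns used in this post.
SAMPLE = """\
1461000000 2 123456789010 eni-abc123de 203.0.113.12 172.31.16.21 44321 22 6 3 180 1461000000 1461000060 REJECT OK
1461000001 2 123456789010 eni-abc123de 203.0.113.12 172.31.16.21 44322 23 6 1 60 1461000001 1461000061 REJECT OK
1461000002 2 123456789010 eni-abc123de 198.51.100.7 172.31.16.21 51000 443 6 10 5200 1461000002 1461000062 ACCEPT OK
1461000003 2 123456789010 eni-abc123de 198.51.100.9 172.31.16.21 51001 3389 6 1 40 1461000003 1461000063 REJECT OK
"""

cols = ['timestamp', 'version', 'accountid', 'interfaceid', 'srcaddr',
        'dstaddr', 'srcport', 'dstport', 'protocol', 'packets', 'bytes',
        'start', 'end', 'action', 'logstatus']


def top_rejected_sources(df):
    """Count REJECT records per source address, most frequent first."""
    rejected = df[df['action'] == 'REJECT']
    return rejected.groupby('srcaddr').size().sort_values(ascending=False)


sample_df = pd.read_csv(io.StringIO(SAMPLE), sep=r'\s+', header=None, names=cols)
rejected_counts = top_rejected_sources(sample_df)
print(rejected_counts)
```

Grouping on ['srcaddr', 'dstport'] instead would also show which ports each source was probing.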

from boto.s3.connection import S3Connection
import pandas as pd
import os

srcbucket = 'flowlogs-bucket-orly'
aws_access_key = 'AKIAxxx'
aws_secret_key = 'oRCIxxx'

os.environ['AWS_ACCESS_KEY_ID'] = aws_access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_key

cols = [ 'timestamp', 'version', 'accountid', 'interfaceid', 'srcaddr', 'dstaddr', 'srcport', 'dstport', 'protocol', 'packets', 'bytes', 'start', 'end', 'action', 'logstatus']

# iterate over all the VPC flowlogs in the bucket
conn = S3Connection(aws_access_key, aws_secret_key)
bucket = conn.get_bucket(srcbucket)

frames = []
for key in bucket.list():
    keystr = key.name  # already a str on Python 3, so no encode() is needed
    if 'eni-' in keystr:
        s3url = 's3://' + srcbucket + '/' + keystr

        # pandas reads the (optionally gzipped) S3 object directly; s3:// URLs need the s3fs package
        part = pd.read_csv(s3url, sep=r'\s+', header=None, names=cols, low_memory=True)
        # keep only the accepted flows
        frames.append(part[part['action'] == 'ACCEPT'])

        print('Processed ' + keystr)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)

# count flows per (source, destination, destination port)
src = df.groupby(['srcaddr', 'dstaddr', 'dstport']).size().reset_index(name='count')

n = src.sort_values(by = ['count', 'srcaddr', 'dstaddr'], ascending=False)

print(n)
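One possible way around the memory limit mentioned earlier is to aggregate each file on its own and keep only the running counts, instead of holding every row in one DataFrame. The sketch below is not the script above; it is a variation under stated assumptions. The helper names accepted_counts and combine are hypothetical, and the demo feeds in-memory gzipped buffers with made-up records rather than real S3 URLs.

```python
import gzip
import io
import pandas as pd

cols = ['timestamp', 'version', 'accountid', 'interfaceid', 'srcaddr',
        'dstaddr', 'srcport', 'dstport', 'protocol', 'packets', 'bytes',
        'start', 'end', 'action', 'logstatus']
KEYS = ['srcaddr', 'dstaddr', 'dstport']


def accepted_counts(source, compression='infer'):
    """Read one flow log and return ACCEPT counts per (srcaddr, dstaddr, dstport)."""
    df = pd.read_csv(source, sep=r'\s+', header=None, names=cols,
                     compression=compression)
    return df[df['action'] == 'ACCEPT'].groupby(KEYS).size()


def combine(sources, compression='infer'):
    """Sum the per-file counts so only small aggregates stay in memory."""
    total = None
    for source in sources:
        counts = accepted_counts(source, compression)
        total = counts if total is None else total.add(counts, fill_value=0)
    if total is None:
        return pd.DataFrame(columns=KEYS + ['count'])
    return total.astype(int).reset_index(name='count')


def gz(text):
    """Demo helper: wrap text as an in-memory gzipped file."""
    return io.BytesIO(gzip.compress(text.encode('utf-8')))


# Made-up records standing in for two flow log files.
file_a = gz(
    "1461000000 2 123456789010 eni-abc123de 10.0.0.5 10.0.1.9 40000 443 6 5 300 1461000000 1461000060 ACCEPT OK\n"
    "1461000001 2 123456789010 eni-abc123de 10.0.0.5 10.0.1.9 40001 443 6 5 300 1461000001 1461000061 ACCEPT OK\n"
    "1461000002 2 123456789010 eni-abc123de 203.0.113.12 10.0.1.9 40002 22 6 1 60 1461000002 1461000062 REJECT OK\n"
)
file_b = gz(
    "1461000100 2 123456789010 eni-abc123de 10.0.0.5 10.0.1.9 40003 443 6 5 300 1461000100 1461000160 ACCEPT OK\n"
    "1461000101 2 123456789010 eni-abc123de 10.0.0.6 10.0.1.9 40004 5432 6 2 120 1461000101 1461000161 ACCEPT OK\n"
)

result = combine([file_a, file_b], compression='gzip')
print(result.sort_values('count', ascending=False))
```

With real logs, the list passed to combine would hold the s3:// URLs built in the loop, and the default compression='infer' lets pandas detect gzip from a .gz extension.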
