Today I needed to process some log files, looking for relationships between the recorded values. After parsing a log file I got the following table:
data = [
    ('timestamp', 'elapsed', 'error', 'retry', 'size', 'hosts'),
    (1379603191, 0.12, 2, 1, 123, 2313),
    (1379603192, 12.43, 0, 1, 3223, 2303),
    ...
    (1379609000, 0.43, 0, 1, 3223, 2303)
]
I easily converted this into a dict keyed by column name:
table = dict(zip(data[0], zip(*data[1:])))

# table now maps each column name to a tuple of that column's values:
# {
#     'timestamp': (1379603191, 1379603192, ..., 1379609000),
#     'elapsed': (0.12, 12.43, ..., 0.43),
#     ...
# }
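The transposition trick is easier to see on a tiny input. The following is a minimal sketch with made-up rows (the `sample` values are hypothetical, not taken from the real log); it also converts each column to a list, in case the columns need to be modified later.

```python
# Hypothetical sample rows; the real table comes from the parsed log file.
sample = [
    ('timestamp', 'elapsed', 'error'),
    (1, 0.5, 0),
    (2, 1.5, 1),
    (3, 2.5, 0),
]

header, rows = sample[0], sample[1:]

# zip(*rows) transposes the row tuples into one tuple per column;
# list() turns each column into a mutable list.
columns = {name: list(values) for name, values in zip(header, zip(*rows))}

print(columns['elapsed'])  # prints [0.5, 1.5, 2.5]
```

Note that zip() stops at the shortest row, so this assumes every row has as many fields as the header.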
In this way it was very easy to run basic stats:
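# Assumption: `stats` is any module exposing mean() and stdev();
# the standard-library statistics module is used here.
import statistics as stats

print([[k, max(v), min(v), stats.mean(v), stats.stdev(v)] for k, v in table.items()])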
Check data distributions:
from matplotlib import pyplot

pyplot.hist(table['elapsed'])
And even look for basic correlation between columns:
from itertools import combinations

from scipy.stats import pearsonr

for f1, f2 in combinations(table.keys(), 2):
    r, p_value = pearsonr(table[f1], table[f2])
    print("the correlation between %s and %s is: %s" % (f1, f2, r))
    print("the p-value for that correlation (see manual) is: %s" % p_value)
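To make clear what the correlation coefficient measures, here is a minimal pure-Python sketch of Pearson's r. The function name `pearson_r` and the input values are hypothetical, chosen for illustration; this is not scipy's implementation and it does not compute a p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not taken from the log.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing: 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly decreasing: -1.0
```

The explicit check for constant columns matters for log data: a field that never changes has zero variance, so its correlation with anything else is undefined.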
Or draw scatter plots:
from itertools import combinations

from matplotlib import pyplot
from scipy.stats import pearsonr

for f1, f2 in combinations(table.keys(), 2):
    pyplot.scatter(table[f1], table[f2], label="%s_%s" % (f1, f2))
    # add a title, axis labels and a legend
    r, p = pearsonr(table[f1], table[f2])
    pyplot.title("Correlation: %s v %s, %s" % (f1, f2, r))
    pyplot.xlabel(f1)
    pyplot.ylabel(f2)
    pyplot.legend(loc='upper left')  # show the legend in a suitable corner
    pyplot.savefig(f1 + "_" + f2 + ".png")
    pyplot.close()