Statistics 101 with IPython

Today I needed to process some log files to look for relationships between the recorded values. After parsing a log file I got the following table:

data = [ 
('timestamp', 'elapsed', 'error', 'retry', 'size', 'hosts'),
(1379603191, 0.12, 2, 1, 123, 2313),
(1379603192, 12.43, 0, 1, 3223, 2303),
...
(1379609000, 0.43, 0, 1, 3223, 2303)
]

It was easy to convert this into a dict keyed by column name:

table = dict(zip(data[0], zip(*data[1:])))
# table now looks like this (zip yields tuples, one per column):
# {
#   'timestamp': (1379603191, 1379603192, ..., 1379609000),
#   'elapsed': (0.12, 12.43, ..., 0.43),
#   ...
# }
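
To see why the one-liner works, here is a minimal sketch on a three-row sample: the first two and the last rows shown above, with the elided rows left out. `zip(*rows)` transposes rows into columns, and the outer `zip` pairs each column name with its values. The names `header`, `rows` and `columns` are only for illustration.

```python
# Sample rows copied from the table above; the elided rows are omitted.
data = [
    ('timestamp', 'elapsed', 'error', 'retry', 'size', 'hosts'),
    (1379603191, 0.12, 2, 1, 123, 2313),
    (1379603192, 12.43, 0, 1, 3223, 2303),
    (1379609000, 0.43, 0, 1, 3223, 2303),
]

header, rows = data[0], data[1:]
columns = list(zip(*rows))          # transpose: one tuple per column
table = dict(zip(header, columns))  # pair each column name with its values

print(table['elapsed'])  # (0.12, 12.43, 0.43)
```

If list semantics are needed (for example to append values later), wrapping each column in `list(...)` should be enough.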

With the data in this shape, running basic statistics is very easy:

import statistics as stats  # assumption: `stats` is the standard-library statistics module
print([(k, max(v), min(v), stats.mean(v), stats.stdev(v)) for k, v in table.items()])
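
For a more readable report, the same idea can be wrapped in a small helper. This is only a sketch: `summarize` and the `sample` dict are hypothetical names, the values come from the three rows shown earlier, and it assumes the standard-library `statistics` module (Python 3.4+), which the post does not name explicitly.

```python
import statistics

def summarize(table):
    """Return {column: (min, max, mean, stdev)} for a columnar dict."""
    return {
        name: (min(values), max(values),
               statistics.mean(values), statistics.stdev(values))
        for name, values in table.items()
    }

# Hypothetical three-row sample taken from the rows shown earlier.
sample = {
    'elapsed': (0.12, 12.43, 0.43),
    'size': (123, 3223, 3223),
}

summary = summarize(sample)
for name, (lo, hi, mean, sd) in summary.items():
    print("%-8s min=%s max=%s mean=%.2f stdev=%.2f" % (name, lo, hi, mean, sd))
```

`statistics.stdev` is the sample standard deviation (it divides by n - 1) and needs at least two values per column.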

Check data distributions:

from matplotlib import pyplot
pyplot.hist(table['elapsed'])
pyplot.show()  # not needed when IPython runs with %matplotlib enabled

And even look for basic correlation between columns:

from itertools import combinations
from scipy.stats import pearsonr
for f1, f2 in combinations(table.keys(), 2):
    r, p_value = pearsonr(table[f1], table[f2])
    print("the correlation between %s and %s is: %s" % (f1, f2, r))
    print("the two-sided p-value (null hypothesis: no correlation) is: %s" % p_value)
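
To make clear what the `r` value measures, here is a minimal pure-Python sketch of the Pearson coefficient: the covariance of the two columns divided by the product of their standard deviations. The function name `pearson_r` is hypothetical, and this is not how SciPy implements it; it also skips the p-value that `pearsonr` returns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, decreasing
print(r_up, r_down)  # 1.0 -1.0
```

A column with a single repeated value has zero variance, so `r` is undefined for it; this sketch raises `ZeroDivisionError` in that case.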

Or draw scatter plots:

from matplotlib import pyplot
for f1, f2 in combinations(table.keys(), 2):
    pyplot.scatter(table[f1], table[f2], label="%s_%s" % (f1, f2))
    # add legend and other labels
    r, p = pearsonr(table[f1], table[f2])
    pyplot.title("Correlation: %s v %s, %s" % (f1, f2, r))
    pyplot.xlabel(f1)
    pyplot.ylabel(f2)
    pyplot.legend(loc='upper left') # show the legend in a suitable corner
    pyplot.savefig(f1 + "_" + f2 + ".png")
    pyplot.close()
