Statistics 101 with ipython

Today I needed to process some log files, looking for relations between the fields. After parsing a log file I got the following table:

data = [ 
('timestamp', 'elapsed', 'error', 'retry', 'size', 'hosts'),
(1379603191, 0.12, 2, 1, 123, 2313),
(1379603192, 12.43, 0, 1, 3223, 2303),
...
(1379609000, 0.43, 0, 1, 3223, 2303)
]

I easily converted this into a dict of columns, keyed by the header names:

table = dict(zip(data[0], zip(*data[1:])))
# table now looks like this (each column is a tuple):
{
'timestamp': (1379603191, 1379603192, ..., 1379609000),
'elapsed': (0.12, 12.43, ..., 0.43),
...
}
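To try the transposition in isolation, here is a minimal sketch that uses only the three rows shown in the table (the elided rows are left out). The helper name to_columns is my own, not something from the original session.

```python
# The three sample rows shown in the post; the elided rows are left out.
data = [
    ('timestamp', 'elapsed', 'error', 'retry', 'size', 'hosts'),
    (1379603191, 0.12, 2, 1, 123, 2313),
    (1379603192, 12.43, 0, 1, 3223, 2303),
    (1379609000, 0.43, 0, 1, 3223, 2303),
]

def to_columns(rows):
    """Turn a header row plus data rows into {column_name: tuple_of_values}."""
    header, body = rows[0], rows[1:]
    # zip(*body) transposes the rows into columns; the outer zip pairs
    # each column with its header name.
    return dict(zip(header, zip(*body)))

table = to_columns(data)
print(table['elapsed'])  # (0.12, 12.43, 0.43)
```

The inner zip(*...) is what does the work: it unpacks the rows as separate arguments, so zip walks them in parallel and yields one tuple per column.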

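With the data in this shape, basic stats are a one-liner:

import statistics as stats  # assumption: any module exposing mean() and stdev() works here
print([[k, max(v), min(v), stats.mean(v), stats.stdev(v)] for k, v in table.items()])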
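A slightly more readable variant of the same idea, as a sketch: it assumes the standard library statistics module (Python 3.4+) rather than whatever stats module the original session had loaded, and the helper name summarize is hypothetical.

```python
import statistics

def summarize(table):
    """Return {column: (min, max, mean, stdev)} for a dict of numeric columns."""
    summary = {}
    for name, values in table.items():
        # statistics.stdev is the sample standard deviation and needs
        # at least two values per column.
        summary[name] = (min(values), max(values),
                         statistics.mean(values), statistics.stdev(values))
    return summary

# Two columns taken from the sample rows in the post.
sample = {
    'elapsed': (0.12, 12.43, 0.43),
    'size': (123, 3223, 3223),
}
for name, (lo, hi, mean, stdev) in sorted(summarize(sample).items()):
    print("%-8s min=%s max=%s mean=%.2f stdev=%.2f" % (name, lo, hi, mean, stdev))
```

Returning a dict instead of printing a nested list makes it easy to pick out a single column afterwards, for example summarize(sample)['elapsed'].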

Check the distribution of a column:

from matplotlib import pyplot
pyplot.hist(table['elapsed'])

And even look for basic correlation between columns:

from itertools import combinations
from scipy.stats import pearsonr
for f1, f2 in combinations(table.keys(), 2):
    r, p_value = pearsonr(table[f1], table[f2])
    print("the correlation between %s and %s is: %s" % (f1, f2, r))
    # p_value: probability that uncorrelated data would give an |r| at least this large
    print("the p-value (see the scipy manual) is: %s" % p_value)
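To make clear what the r value means, here is a small pure-Python sketch of the Pearson coefficient: the covariance of the two columns divided by the product of their spreads. It is only an illustration, not how scipy implements it, it does not compute the p-value, and the function name pearson_r is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every value from its column mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        # A constant column has no variation, so r is undefined.
        raise ValueError("a constant column has no defined correlation")
    return covariance / spread

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
```

The result is always between -1 and 1. Note that a column that never changes has no defined correlation with anything, so it may be worth dropping constant columns before looping over all the pairs.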

Or draw scatter plots:

from matplotlib import pyplot
for f1, f2 in combinations(table.keys(), 2):
    pyplot.scatter(table[f1], table[f2], label="%s_%s" % (f1, f2))
    # add legend and other labels
    r, p = pearsonr(table[f1], table[f2])
    pyplot.title("Correlation: %s v %s, %s" % (f1, f2, r))
    pyplot.xlabel(f1)
    pyplot.ylabel(f2)
    pyplot.legend(loc='upper left') # show the legend in a suitable corner
    pyplot.savefig(f1 + "_" + f2 + ".png")
    pyplot.close()

Java native logging a-la-printf with slf4j

I just saw that the new Java logging framework slf4j supports parameterized log messages, like:

log.info("log this {} {} {} and {}", 1, 2, "three", 4);

More on http://www.catosplace.net/blogs/personal/?p=442
Enjoy


import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingTest {

    // JUnit rule that exposes the name of the currently running test method
    @Rule public TestName name = new TestName();

    final Logger logger =
        LoggerFactory.getLogger(LoggingTest.class);

    @Test
    public void testA() {
        // each {} placeholder is filled with the matching argument
        logger.info("{} being run...", name.getMethodName());
    }

    @Test
    public void testB() {
        logger.info("{} being run...", name.getMethodName());
    }
}

Logging a-la-printf with java

My C background makes me very unhappy when logging with Java. I always have to type

log.info(String.format("Logging field %s: with value: %d", field, value));

Today I decided to extend the default log4j logger, implementing my own log.info & co. using Java varargs.

I created two classes:

MyLoggerFactory implements LoggerFactory:

  • implements makeNewLoggerInstance(String name), which returns new MyLogger(name)

MyLogger extends Logger:

  • private MyLoggerFactory factory = new MyLoggerFactory();
  • overrides getLogger(Class clazz)
  • adds info(String format, Object... params)
Now I can use MyLogger:


MyLogger log = (MyLogger) MyLogger.getLogger(getClass());

log.info("Test base:string %s int %d" , "string", 0);

Further info here http://www.beknowledge.com/archives/article/extending-log4j-to-create-custom-logging-components

Log for the bare necessities, the simple bare necessities

Why did you do that if you weren’t asked to? Improving performance means doing only the minimum work. As stated in the “Performance” section of http://logging.apache.org/log4j/1.2/manual.html , we’d better check whether we really need to create tons of strings in Java. So a guard like

if (log.isDebugEnabled()) {
    log.debug("Create" + "this" + message + "string");
}

prevents creating unneeded debug/trace strings.