Virtuous cycle

Bartlomiej Owczarek weblog


Clustering

I have this rather large file with loan data and I’m running some statistics on it. During the day it was standard pivoting. But now in the evening I decided to check whether I can get anything out of some fancy black-box clustering.

The best thing to get, of course, would be a set of clusters with significantly different loan performance (i.e. share of bad loans) than the others.

I’m using Cluto.

It is not particularly user-friendly. It requires input files that I have to generate more or less manually from Access and then fine-tune. And it is not an Excel add-in but a command-line program. But thanks to that it can handle my 100k records (my Excel version has a 64k-row limit).
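The manual export step could probably be scripted. Below is a minimal sketch in Python, assuming the dense matrix layout as I understand it from the Cluto manual: a first line with the number of rows and columns, then one line of space-separated numbers per record. The function name `to_cluto_dense` is made up for illustration, and the exact format should be checked against the manual before relying on it.

```python
def to_cluto_dense(rows):
    """Render numeric records as dense-matrix text.

    Assumed layout (verify against the Cluto manual):
    a header line "<n_rows> <n_cols>", then one line per record
    with the values separated by single spaces.
    """
    rows = [list(r) for r in rows]
    if not rows:
        raise ValueError("no records to write")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all records must have the same number of variables")

    lines = [f"{len(rows)} {n_cols}"]
    for r in rows:
        # float() rejects non-numeric cells early instead of writing garbage.
        lines.append(" ".join(str(float(v)) for v in r))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two toy records with two variables each (not real loan data).
    print(to_cluto_dense([[1, 2.5], [3, 4]]), end="")
```

Categorical variables would still have to be turned into numbers first, which is where most of the fine-tuning goes.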

So far no results. But wait… just finished computing using the graph method. It took 17 minutes.

Nope. At the moment the most distinctive cluster is ca. 6.5% better than the average in terms of defaults, and ca. 11% better in terms of defaults considered fraud. I’m not impressed.
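For reference, this kind of comparison can be sketched in a few lines of Python. It assumes the cluster labels and a bad-loan flag are already loaded as two parallel lists, one entry per loan. The function name `bad_share_by_cluster` and the toy data are invented for illustration, and it reports both the absolute and the relative difference from the overall bad-loan share, since "better than the average" can be read either way.

```python
from collections import defaultdict


def bad_share_by_cluster(cluster_ids, is_bad):
    """Compare each cluster's share of bad loans with the overall share.

    Returns (overall_share, report), where report maps a cluster id to
    (share, share - overall, share / overall - 1).
    """
    totals = defaultdict(int)
    bads = defaultdict(int)
    for cluster, bad in zip(cluster_ids, is_bad):
        totals[cluster] += 1
        bads[cluster] += int(bool(bad))

    n_loans = sum(totals.values())
    if n_loans == 0:
        raise ValueError("no loans given")
    overall = sum(bads.values()) / n_loans

    report = {}
    for cluster in totals:
        share = bads[cluster] / totals[cluster]
        # Relative difference is undefined when there are no bad loans at all.
        relative = share / overall - 1 if overall > 0 else None
        report[cluster] = (share, share - overall, relative)
    return overall, report


if __name__ == "__main__":
    # Toy data: 8 loans in 2 clusters, 1 = bad loan.
    overall, report = bad_share_by_cluster(
        [0, 0, 0, 0, 1, 1, 1, 1],
        [1, 0, 0, 0, 1, 1, 0, 0],
    )
    print(f"overall bad share: {overall:.3f}")
    for cluster, (share, diff, rel) in sorted(report.items()):
        print(f"cluster {cluster}: share={share:.3f} diff={diff:+.3f} rel={rel:+.1%}")
```

A negative difference means the cluster has fewer bad loans than the average, i.e. it performs better.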

Guess they need to give me more data from the application. Currently I test on 6 variables, and some of them are loan-related rather than customer-related, so there is room for improvement.

Or maybe I should read the manual some more and figure out what the different optimization methods are.