scikit-learn ExtraTreesClassifier hanging during training
- by denson
I'm running scikit-learn on some rather large training datasets: ~1,600,000,000 rows with ~500 features. The platform is Ubuntu Server 14.04, and the hardware has 100 GB of RAM and 20 CPU cores.
The test datasets have about half as many rows.
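For a rough sense of scale, here is a back-of-the-envelope footprint calculation for the arrays. It is only a sketch: it assumes dense storage with 4 bytes per value (float32, which is what scikit-learn's tree code converts dense input to), and `dense_footprint_gb` is a hypothetical helper name, not something from my actual script.

```python
def dense_footprint_gb(n_rows, n_features, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense n_rows x n_features array."""
    return n_rows * n_features * bytes_per_value / 1e9

# Row and feature counts as quoted above; the test set has about half the rows.
train_gb = dense_footprint_gb(1_600_000_000, 500)
test_gb = dense_footprint_gb(800_000_000, 500)
print("train: %.1f GB, test: %.1f GB" % (train_gb, test_gb))
```

The same helper can be pointed at the 350-feature and 500+ feature variants to compare them against the available RAM.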
I set n_jobs = 10 and use forest_size = 3 * number_of_features, so about 1700 trees.
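The setup is roughly equivalent to the minimal sketch below. It assumes the estimator is `sklearn.ensemble.ExtraTreesClassifier` and uses a tiny synthetic array in place of the real data. The shapes, `n_jobs=2`, `verbose=1`, and `random_state=0` are illustrative choices, not my actual settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Tiny synthetic stand-in for the real data (illustrative shapes only).
rng = np.random.RandomState(0)
n_rows, n_features = 200, 10
X = rng.rand(n_rows, n_features).astype(np.float32)
y = rng.randint(0, 2, size=n_rows)

# Same sizing rule as described above: three trees per feature.
forest_size = 3 * n_features

clf = ExtraTreesClassifier(
    n_estimators=forest_size,
    n_jobs=2,        # 10 on the real machine
    verbose=1,       # report progress as batches of trees finish
    random_state=0,
)
clf.fit(X, y)
print(len(clf.estimators_))  # number of fitted trees
```

With `verbose` turned on, the progress messages should at least show how many trees finished before everything goes idle.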
If I reduce the number of features to about 350, training works fine, but with the full set of 500+ features the training phase never completes. The process is still running and holding about 20 GB of RAM, but it is using 0% CPU. I have also successfully trained on datasets with ~400,000 rows but twice as many features, which completes in only about 1 hour.
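One thing that could help pin down where the idle process is stuck is the standard-library `faulthandler` module. The sketch below assumes Python 3 on a Unix system; the log file name and the 30-minute interval are arbitrary choices, and the `fit` call is only indicated by a comment.

```python
import faulthandler
import os
import signal
import tempfile

# Hypothetical log location; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "fit_stacks.log")
log_file = open(log_path, "w")

# Dump the Python stack of every thread when the process receives SIGUSR1,
# e.g. from another shell with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)

# Also dump all stacks automatically every 30 minutes while training runs.
faulthandler.dump_traceback_later(1800, repeat=True, file=log_file)

# ... the long-running clf.fit(X_train, y_train) call would go here ...

# Stop the periodic dumps once training has returned.
faulthandler.cancel_dump_traceback_later()
```

This only shows the stack of the process it is installed in. Any worker processes started for `n_jobs` would need the same hook to be inspected.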
I am being careful to delete any arrays/objects that are not in use.
Does anyone have any ideas I might try?