What is the best way to store a table in C++?
Posted by Topo on Programmers. Published on 2013-02-23T06:31:46Z.
Tags: c++, data-structures
I'm programming a decision tree in C++ using a slightly modified version of the C4.5 algorithm. Each node represents an attribute (a column of the data set) and has one child per possible value of that attribute.
My problem is how to store the training data set, keeping in mind that each node works on a subset of it, so I need a quick way to select only a subset of rows and columns.
The main goal is to do this in the most memory-efficient and time-efficient way possible (in that order of priority).
The best way I have thought of is to keep the data in an array of arrays (or a std::vector of std::vectors) and, for each node, keep a list (array, vector, etc.) of the (column, row) pairs, probably tuples, that are valid for that node.
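A minimal sketch of that idea, assuming a row-major table of strings. The names `Table`, `Cell`, `Node` and `make_node` are hypothetical and only meant to make the layout concrete:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; cells are strings only to keep the sketch short.
using Table = std::vector<std::vector<std::string>>;  // table[row][column]
using Cell  = std::pair<std::size_t, std::size_t>;    // (column, row)

struct Node {
    std::vector<Cell> valid_cells;  // cells of the shared table this node may use
};

// Build the pair list for a node from the rows and columns it is allowed to see.
Node make_node(const std::vector<std::size_t>& rows,
               const std::vector<std::size_t>& columns) {
    Node node;
    node.valid_cells.reserve(rows.size() * columns.size());
    for (std::size_t r : rows)
        for (std::size_t c : columns)
            node.valid_cells.emplace_back(c, r);
    return node;
}
```

With this layout a node stores one pair per visible cell, so the list grows with rows × columns rather than rows + columns.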
I know there should be a better way to do this. Any suggestions?
UPDATE: What I need is something like this:
In the beginning I have this data:
Paris     4  5.0  True
New York  7  1.3  True
Tokio     2  9.1  False
Paris     9  6.8  True
Tokio     0  8.4  False
But for the second node I just need this data:
Paris     4  5.0
New York  7  1.3
Paris     9  6.8
And for the third node:
Tokio     2  9.1
Tokio     0  8.4
But with a table of millions of records with up to hundreds of columns.
What I have in mind is to keep all the data in a single matrix and, for each node, store only the indices of the columns and rows that node uses. Something like this:
Paris     4  5.0  True
New York  7  1.3  True
Tokio     2  9.1  False
Paris     9  6.8  True
Tokio     0  8.4  False
Node 2:
columns = [0,1,2]
rows = [0,1,3]
Node 3:
columns = [0,1,2]
rows = [2,4]
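A rough sketch of how that could look in C++, assuming a column-major table of strings and 32-bit indices. `Table`, `NodeView` and `split` are hypothetical names, and a real version would probably use typed or encoded columns instead of strings:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: one shared table, stored column by column.
struct Table {
    std::vector<std::vector<std::string>> columns;  // columns[c][r]
    const std::string& at(std::size_t row, std::size_t col) const {
        return columns[col][row];
    }
};

// A node owns no data, only the indices it is allowed to look at.
struct NodeView {
    std::vector<std::uint32_t> rows;
    std::vector<std::uint32_t> columns;
};

// Split a view on one column: one child view per distinct value found there.
// Each child keeps the parent's columns minus the split column, and the
// parent's rows are partitioned among the children.
std::map<std::string, NodeView> split(const Table& table,
                                      const NodeView& parent,
                                      std::uint32_t split_column) {
    std::vector<std::uint32_t> child_columns;
    for (std::uint32_t c : parent.columns)
        if (c != split_column) child_columns.push_back(c);

    std::map<std::string, NodeView> children;
    for (std::uint32_t row : parent.rows) {
        const std::string& value = table.at(row, split_column);
        auto it = children.find(value);
        if (it == children.end())
            it = children.emplace(value, NodeView{{}, child_columns}).first;
        it->second.rows.push_back(row);
    }
    return children;
}
```

Splitting the five-row table above on column 3 with this sketch gives a "True" child with rows [0,1,3] and a "False" child with rows [2,4], both with columns [0,1,2], which matches Node 2 and Node 3.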
This way, in the worst-case scenario (every node lists every row and column), the extra memory is just
sizeof(int) * (number_of_columns + number_of_rows) * number_of_nodes
That is a lot less than having an independent data matrix for each node.
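To put rough numbers on that comparison, here is a small sketch of both formulas. The function names are hypothetical, and the sizes used in the note below (1,000,000 rows, 100 columns, 10 nodes, 4-byte indices and 4-byte cells) are made up for illustration only:

```cpp
#include <cstddef>

// Worst-case bytes for the per-node index lists:
// every node lists every row and every column.
constexpr std::size_t index_overhead_bytes(std::size_t rows, std::size_t columns,
                                           std::size_t nodes, std::size_t index_size) {
    return index_size * (rows + columns) * nodes;
}

// Bytes needed if every node instead held its own full copy of the matrix.
constexpr std::size_t copy_bytes(std::size_t rows, std::size_t columns,
                                 std::size_t nodes, std::size_t cell_size) {
    return cell_size * rows * columns * nodes;
}
```

With those assumed sizes, `index_overhead_bytes(1000000, 100, 10, 4)` is 40,004,000 bytes (about 40 MB), while `copy_bytes(1000000, 100, 10, 4)` is 4,000,000,000 bytes (about 4 GB).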