Ideally, we would know the exact answer to every question. How
many people support presidential candidate A vs. B? How many people
suffer from H1N1 in a given state? Does this batch of manufactured
widgets have any defective parts? Knowing exact answers is expensive in
terms of time
and money and, in most
cases, is impractical if not impossible. Consider asking every person
in a region for their
candidate preference, testing every person with flu symptoms
for H1N1 (assuming every person reported when they had flu symptoms),
or destructively testing widgets to determine if they are "good"
(leaving no product to sell).
Fortunately, knowing exact answers isn't necessary, or even useful, in many situations. Understanding the direction of a trend or obtaining a statistically significant result may be sufficient to answer the underlying question: who is likely to win the election, have we likely reached a critical threshold for flu, or is this batch of widgets good enough to ship? Statistics helps us answer these questions with a certain degree of confidence; its focus is on how we collect data.
In data mining, the focus is on using data that has already been collected. In some cases we may have all the data (all purchases made by all customers); in others the data may have been collected using sampling (voters, their demographics, and candidate choice). Building data mining models on all of your data can be expensive in terms of time and hardware resources. Consider a company with 40 million customers. Do we need to mine all 40 million customer records to get useful data mining models? The quality of models built on all the data may be no better than that of models built on a relatively small sample. Determining how much data is a reasonable amount involves experimentation.
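To make the idea concrete, here is a minimal sketch in Python with pandas of drawing a random sample instead of mining every record; the file name and sample size are hypothetical, and later posts show how to sample directly inside Oracle Database.

```python
import pandas as pd

# Hypothetical customer data; in practice this would come from your database.
customers = pd.read_csv("customers.csv")

# Draw a simple random sample of 10,000 records (without replacement)
# instead of mining all rows; random_state makes the sample reproducible.
sample = customers.sample(n=10_000, random_state=42)
print(f"Mining {len(sample)} of {len(customers)} customers")
```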
When starting the model building process on a large dataset, it is often more efficient to begin with a small sample, perhaps 1,000 to 10,000 cases (records), depending on the algorithm, source data, and hardware. This lets you see quickly what issues might arise with the choice of algorithm, algorithm settings, data quality, and the need for further data preparation. Instead of waiting for a model to build on a large dataset only to find that the results don't meet expectations, you can iterate quickly on the small sample. Once you are satisfied with the results on the initial sample, take a larger sample to see if model quality improves and to get a sense of how the algorithm scales to the particular dataset. If model accuracy or quality continues to improve, consider increasing the sample size further.
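One way to run this experiment is sketched below in Python with pandas; the data file and the build_and_evaluate() helper are hypothetical stand-ins for your own source data, algorithm, and evaluation.

```python
import pandas as pd

def build_and_evaluate(sample: pd.DataFrame) -> float:
    """Placeholder: build a model on `sample` and return a quality score
    such as accuracy. Swap in your own algorithm and settings."""
    return 0.0  # stand-in value so the sketch runs end to end

customers = pd.read_csv("customers.csv")  # hypothetical source data

# Start small, then grow the sample and watch whether quality improves.
for n in (1_000, 10_000, 100_000):
    sample = customers.sample(n=min(n, len(customers)), random_state=42)
    print(f"{len(sample):>7} rows -> quality {build_and_evaluate(sample):.3f}")
# If quality is still improving at the largest sample, consider going bigger.
```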
Sampling in data mining is also used to produce a held-aside or test dataset for assessing the accuracy of classification and regression models. Here, we reserve some of the build data (data that includes known target values) to obtain an honest estimate of model error on data the model has not seen before. This sampling transformation is often called a split because the build data is split into two randomly selected sets, often with 60% of the records used for model building and 40% for testing.
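As a rough illustration in Python with pandas (the upcoming posts show the equivalent inside Oracle Database), a 60/40 split of hypothetical build data might look like this:

```python
import pandas as pd

build_data = pd.read_csv("build_data.csv")  # hypothetical: includes known target values

# Randomly select 60% of the records for model building...
train = build_data.sample(frac=0.6, random_state=42)
# ...and hold aside the remaining 40% for an honest estimate of model error.
test = build_data.drop(train.index)

print(len(train), "build rows,", len(test), "test rows")
```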
Sampling must be performed with care, as it can adversely affect model quality and usability. Even a truly random sample doesn't guarantee that all values of a given attribute are represented. This is particularly troublesome when the attribute with omitted values is the target: a predictive model that has never seen examples of a particular target value can never predict that value! For other attributes, the sampled values may collapse to a single value (a constant attribute) or be all unique (an identifier attribute), either of which may cause the attribute to be excluded during mining. Values of categorical predictor attributes that did not appear in the training data are not used when testing or scoring.
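A few simple checks can catch these problems before model building. The sketch below, in Python with pandas, assumes a hypothetical build file with a target column named "churned":

```python
import pandas as pd

data = pd.read_csv("build_data.csv")             # hypothetical build data
sample = data.sample(frac=0.1, random_state=42)

# A target value missing from the sample can never be predicted by the model.
missing_targets = set(data["churned"].unique()) - set(sample["churned"].unique())
if missing_targets:
    print("Target values absent from the sample:", missing_targets)

# Flag attributes that look constant or identifier-like in the sample.
for col in sample.columns:
    n_distinct = sample[col].nunique(dropna=True)
    if n_distinct <= 1:
        print(f"{col}: constant attribute (single value)")
    elif n_distinct == len(sample):
        print(f"{col}: identifier-like attribute (all values unique)")
```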
In subsequent posts, we'll talk about three sampling
techniques using Oracle Database: simple random sampling without
replacement, stratified sampling, and simple random sampling with
replacement.
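As a preview, here is roughly what the three techniques look like in Python with pandas; the data file, target column, and sampling fraction are hypothetical, and the posts themselves will show how to express each one with Oracle Database SQL.

```python
import pandas as pd

data = pd.read_csv("build_data.csv")  # hypothetical data with target column "churned"

# Simple random sampling without replacement: each row appears at most once.
srs_without = data.sample(frac=0.1, replace=False, random_state=42)

# Stratified sampling: sample each target class separately so every class
# keeps roughly its original proportion in the sample.
stratified = data.groupby("churned", group_keys=False).sample(frac=0.1, random_state=42)

# Simple random sampling with replacement: a row may be drawn more than once.
srs_with = data.sample(frac=0.1, replace=True, random_state=42)
```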