Extending Oracle CEP with Predictive Analytics
- by vikram.shukla(at)oracle.com
Introduction:
OCEP is often used as a
business rules engine to execute a set of business logic rules via CQL
statements, and take decisions based on the outcome of those rules. There are
times where configuring rules manually is sufficient because an application
needs to deal with only a small and well-defined set of static rules. However, in
many situations customers don't want to pre-define such rules for two reasons.
First, they are dealing with events with lots of columns and manually crafting
such rules for each column or a set of columns and combinations thereof is
almost impossible. Second, they are content with probabilistic outcomes and do not
care about 100% precision. The former is the case when a user is dealing with data
with high dimensionality, the latter when an application can live with "false"
positives as they can be discarded after further inspection, say by a Human
Task component in a Business Process Management software.
The primary goal of
this blog post is to show how this can be achieved by combining OCEP with
Oracle Data Mining® and leveraging the latter's rich set of algorithms and
functionality to do predictive analytics in real time on streaming events. The
secondary goal of this post is also to show how OCEP can be extended to invoke
any arbitrary external computation in an RDBMS from within CEP. The extensible
facility is known as the JDBC cartridge.
The rest of the post describes the steps required to
achieve this:
We
use the dataset available at http://blogs.oracle.com/datamining/2010/01/fraud_and_anomaly_detection_made_simple.html to showcase the capabilities. We use it to show how transaction
anomalies or fraud can be detected.
Building the model:
Follow the self-explanatory steps described at the
above URL to build the model. It is very
simple - it uses built-in Oracle Data Mining PL/SQL packages to cleanse,
normalize and build the model out of the dataset. You can also use graphical Oracle Data
Miner® to build the models.
To summarize, it involves:
Specifying which algorithms to use. In
this case we use Support Vector Machines as we're trying to find anomalies in
highly dimensional dataset.Build model on the data in the table for
the algorithms specified.
For this example, the table was populated in the scott/tiger
schema with appropriate privileges.
Configuring the Data Source:
This is the first step in building CEP application
using such an integration. Our
datasource looks as follows in the server config file. It is advisable that you use the Visualizer
to add it to the running server dynamically, rather than manually edit the
file.
<data-source>
<name>DataMining</name>
<data-source-params>
<jndi-names>
<element>DataMining</element>
</jndi-names>
<global-transactions-protocol>OnePhaseCommit</global-transactions-protocol>
</data-source-params>
<connection-pool-params>
<credential-mapping-enabled></credential-mapping-enabled>
<test-table-name>SQL SELECT 1 from DUAL</test-table-name>
<initial-capacity>1</initial-capacity>
<max-capacity>15</max-capacity>
<capacity-increment>1</capacity-increment>
</connection-pool-params>
<driver-params>
<use-xa-data-source-interface>true</use-xa-data-source-interface>
<driver-name>oracle.jdbc.OracleDriver</driver-name>
<url>jdbc:oracle:thin:@localhost:1522:orcl</url>
<properties>
<element>
<value>scott</value>
<name>user</name>
</element>
<element>
<value>{Salted-3DES}AzFE5dDbO2g=</value>
<name>password</name>
</element>
<element>
<name>com.bea.core.datasource.serviceName</name>
<value>oracle11.2g</value>
</element>
<element>
<name>com.bea.core.datasource.serviceVersion</name>
<value>11.2.0</value>
</element>
<element>
<name>com.bea.core.datasource.serviceObjectClass</name>
<value>java.sql.Driver</value>
</element>
</properties>
</driver-params>
</data-source>
Designing the EPN:
The EPN is very simple in this example.
We briefly describe each of the components.
The adapter ("DataMiningAdapter") reads
data from a .csv file and sends it to the CQL processor downstream. The
event payload here is same as that of the table in the database (refer to
the attached project or do a "desc table-name" from a SQL*PLUS prompt).
While this is for convenience in this example, it need not be the case.
One can still omit fields in the streaming events, and need not match all
columns in the table on which the model was built. Better yet, it does not
even need to have the same name as columns in the table, as long as you
alias them in the USING clause of the mining function. (Caveat: they still
need to draw values from a similar universe or domain, otherwise it
constitutes incorrect usage of the model).
There are
two things in the CQL processor ("DataMiningProc") that make scoring
possible on streaming events.
1.
User
defined cartridge function
Please refer to the
OCEP CQL reference manual to find more details about how to define such functions.
We include the function below in its entirety for illustration.
<?xml
version="1.0" encoding="UTF-8"?>
<jdbcctxconfig:config
xmlns:jdbcctxconfig="http://www.bea.com/ns/wlevs/config/application"
xmlns:jc="http://www.oracle.com/ns/ocep/config/jdbc">
<jc:jdbc-ctx>
<name>Oracle11gR2</name>
<data-source>DataMining</data-source>
<function name="prediction2">
<param name="CQLMONTH"
type="char"/>
<param name="WEEKOFMONTH"
type="int"/>
<param name="DAYOFWEEK"
type="char" />
<param name="MAKE"
type="char" />
<param name="ACCIDENTAREA" type="char"
/>
<param name="DAYOFWEEKCLAIMED" type="char"
/>
<param name="MONTHCLAIMED"
type="char" />
<param name="WEEKOFMONTHCLAIMED"
type="int" />
<param name="SEX"
type="char" />
<param name="MARITALSTATUS" type="char"
/>
<param name="AGE"
type="int" />
<param name="FAULT"
type="char" />
<param name="POLICYTYPE" type="char"
/>
<param name="VEHICLECATEGORY" type="char"
/>
<param name="VEHICLEPRICE"
type="char" />
<param name="FRAUDFOUND"
type="int" />
<param name="POLICYNUMBER"
type="int" />
<param name="REPNUMBER"
type="int" />
<param name="DEDUCTIBLE" type="int"
/>
<param name="DRIVERRATING" type="int"
/>
<param name="DAYSPOLICYACCIDENT" type="char"
/>
<param name="DAYSPOLICYCLAIM"
type="char" />
<param name="PASTNUMOFCLAIMS"
type="char" />
<param name="AGEOFVEHICLES"
type="char" />
<param name="AGEOFPOLICYHOLDER"
type="char" />
<param name="POLICEREPORTFILED"
type="char" />
<param name="WITNESSPRESNT"
type="char" />
<param name="AGENTTYPE"
type="char" />
<param name="NUMOFSUPP"
type="char" />
<param name="ADDRCHGCLAIM" type="char"
/>
<param name="NUMOFCARS"
type="char" />
<param name="CQLYEAR"
type="int" />
<param name="BASEPOLICY"
type="char" />
<return-component-type>char</return-component-type>
<sql><![CDATA[
SELECT
to_char(PREDICTION_PROBABILITY(CLAIMSMODEL, '0' USING *))
AS
probability
FROM (SELECT :CQLMONTH AS MONTH,
:WEEKOFMONTH AS WEEKOFMONTH, :DAYOFWEEK AS
DAYOFWEEK,
:MAKE AS MAKE,
:ACCIDENTAREA AS
ACCIDENTAREA,
:DAYOFWEEKCLAIMED AS
DAYOFWEEKCLAIMED,
:MONTHCLAIMED AS
MONTHCLAIMED,
:WEEKOFMONTHCLAIMED,
:SEX AS SEX,
:MARITALSTATUS AS
MARITALSTATUS,
:AGE AS AGE,
:FAULT AS FAULT,
:POLICYTYPE AS
POLICYTYPE,
:VEHICLECATEGORY AS VEHICLECATEGORY,
:VEHICLEPRICE AS
VEHICLEPRICE,
:FRAUDFOUND AS FRAUDFOUND,
:POLICYNUMBER AS
POLICYNUMBER,
:REPNUMBER AS
REPNUMBER,
:DEDUCTIBLE AS
DEDUCTIBLE,
:DRIVERRATING AS
DRIVERRATING,
:DAYSPOLICYACCIDENT
AS DAYSPOLICYACCIDENT,
:DAYSPOLICYCLAIM AS
DAYSPOLICYCLAIM,
:PASTNUMOFCLAIMS AS
PASTNUMOFCLAIMS,
:AGEOFVEHICLES AS
AGEOFVEHICLES,
:AGEOFPOLICYHOLDER
AS AGEOFPOLICYHOLDER,
:POLICEREPORTFILED
AS POLICEREPORTFILED,
:WITNESSPRESNT AS
WITNESSPRESENT,
:AGENTTYPE AS
AGENTTYPE,
:NUMOFSUPP AS
NUMOFSUPP,
:ADDRCHGCLAIM AS
ADDRCHGCLAIM,
:NUMOFCARS AS NUMOFCARS,
:CQLYEAR AS YEAR,
:BASEPOLICY AS
BASEPOLICY
FROM dual)
]]>
</sql>
</function>
</jc:jdbc-ctx>
</jdbcctxconfig:config>
2.
Invoking
the function for each event.
Once this function is
defined, you can invoke it from CQL as follows:
<?xml
version="1.0" encoding="UTF-8"?>
<wlevs:config xmlns:wlevs="http://www.bea.com/ns/wlevs/config/application">
<processor>
<name>DataMiningProc</name>
<rules>
<query id="q1"><![CDATA[
ISTREAM(SELECT S.CQLMONTH, S.WEEKOFMONTH, S.DAYOFWEEK, S.MAKE, :
S.BASEPOLICY,
C.F AS probability FROM
StreamDataChannel [NOW]
AS S,
TABLE(prediction2@Oracle11gR2(S.CQLMONTH, S.WEEKOFMONTH, S.DAYOFWEEK, S.MAKE,
..., S.BASEPOLICY) AS
F of char)
AS C)
]]></query>
</rules>
</processor>
</wlevs:config>
Finally,
the last stage in the EPN prints out the probability of the event being an
anomaly. One can also define a threshold in CQL to filter out events that
are normal, i.e., below a certain mark as defined by the analyst or
designer.
Sample Runs:
Now let's see how this behaves when events are
streamed through CEP. We use only two events for brevity, one normal and other
one not.
This is one of the "normal" looking events and the
probability of it being anomalous is less than 60%.
Event is:
eventType=DataMiningOutEvent object=q1
time=2904821976256 S.CQLMONTH=Dec, S.WEEKOFMONTH=5,
S.DAYOFWEEK=Wednesday, S.MAKE=Honda, S.ACCIDENTAREA=Urban,
S.DAYOFWEEKCLAIMED=Tuesday, S.MONTHCLAIMED=Jan, S.WEEKOFMONTHCLAIMED=1,
S.SEX=Female, S.MARITALSTATUS=Single, S.AGE=21, S.FAULT=Policy Holder,
S.POLICYTYPE=Sport - Liability, S.VEHICLECATEGORY=Sport, S.VEHICLEPRICE=more
than 69000, S.FRAUDFOUND=0, S.POLICYNUMBER=1, S.REPNUMBER=12, S.DEDUCTIBLE=300,
S.DRIVERRATING=1, S.DAYSPOLICYACCIDENT=more than 30, S.DAYSPOLICYCLAIM=more
than 30, S.PASTNUMOFCLAIMS=none, S.AGEOFVEHICLES=3 years,
S.AGEOFPOLICYHOLDER=26 to 30, S.POLICEREPORTFILED=No, S.WITNESSPRESENT=No,
S.AGENTTYPE=External, S.NUMOFSUPP=none, S.ADDRCHGCLAIM=1 year, S.NUMOFCARS=3 to
4, S.CQLYEAR=1994, S.BASEPOLICY=Liability, probability=.58931702982118561
isTotalOrderGuarantee=true\nAnamoly
probability: .58931702982118561
However, the following event is scored as an anomaly
with a very high probability of 89%. So
there is likely to be something wrong with it. A close look reveals that the
value of "deductible" field (10000) is not "normal". What exactly constitutes
normal here?. If you run the query on the database to find ALL distinct values
for the "deductible" field, it returns the following set: {300, 400, 500, 700}
Event is:
eventType=DataMiningOutEvent object=q1
time=2598483773496 S.CQLMONTH=Dec, S.WEEKOFMONTH=5,
S.DAYOFWEEK=Wednesday, S.MAKE=Honda, S.ACCIDENTAREA=Urban,
S.DAYOFWEEKCLAIMED=Tuesday, S.MONTHCLAIMED=Jan, S.WEEKOFMONTHCLAIMED=1,
S.SEX=Female, S.MARITALSTATUS=Single, S.AGE=21, S.FAULT=Policy Holder,
S.POLICYTYPE=Sport - Liability, S.VEHICLECATEGORY=Sport, S.VEHICLEPRICE=more
than 69000, S.FRAUDFOUND=0, S.POLICYNUMBER=1, S.REPNUMBER=12, S.DEDUCTIBLE=10000,
S.DRIVERRATING=1, S.DAYSPOLICYACCIDENT=more than 30, S.DAYSPOLICYCLAIM=more
than 30, S.PASTNUMOFCLAIMS=none, S.AGEOFVEHICLES=3 years,
S.AGEOFPOLICYHOLDER=26 to 30, S.POLICEREPORTFILED=No, S.WITNESSPRESENT=No,
S.AGENTTYPE=External, S.NUMOFSUPP=none, S.ADDRCHGCLAIM=1 year, S.NUMOFCARS=3 to
4, S.CQLYEAR=1994, S.BASEPOLICY=Liability, probability=.89171554529576691
isTotalOrderGuarantee=true\nAnamoly
probability: .89171554529576691
Conclusion:
By way of this example,
we show:
real-time scoring of events as they flow
through CEP leveraging Oracle Data Mining.how CEP applications can invoke complex arbitrary
external computations (function shipping) in an RDBMS.