Search Results

Search found 487 results on 20 pages for 'etl instrumentation'.

Page 19/20 | < Previous Page | 15 16 17 18 19 20 | Next Page >

Useful Command-line Commands on Windows

- by Sung Meister

The aim for this Wiki is to promote using a command to open up commonly used applications without having to go through many mouse clicks - thus saving time on monitoring and troubleshooting Windows machines. Answer entries need to specify Application name Commands Screenshot (Optional) Shortcut to commands && - Command Chaining %SYSTEMROOT%\System32\rcimlby.exe -LaunchRA - Remote Assistance (Windows XP) appwiz.cpl - Programs and Features (Formerly Known as "Add or Remove Programs") appwiz.cpl @,2 - Turn Windows Features On and Off (Add/Remove Windows Components pane) arp - Displays and modifies the IP-to-Physical address translation tables used by address resolution protocol (ARP) at - Schedule tasks either locally or remotely without using Scheduled Tasks bootsect.exe - Updates the master boot code for hard disk partitions to switch between BOOTMGR and NTLDR cacls - Change Access Control List (ACL) permissions on a directory, its subcontents, or files calc - Calculator chkdsk - Check/Fix the disk surface for physical errors or bad sectors cipher - Displays or alters the encryption of directories [files] on NTFS partitions cleanmgr.exe - Disk Cleanup clip - Redirects output of command line tools to the Windows clipboard cls - clear the command line screen cmd /k - Run command with command extensions enabled color - Sets the default console foreground and background colors in console command.com - Default Operating System Shell compmgmt.msc - Computer Management control.exe /name Microsoft.NetworkAndSharingCenter - Network and Sharing Center control keyboard - Keyboard Properties control mouse(or main.cpl) - Mouse Properties control sysdm.cpl,@0,3 - Advanced Tab of the System Properties dialog control userpasswords2 - Opens the classic User Accounts dialog desk.cpl - opens the display properties devmgmt.msc - Device Manager diskmgmt.msc - Disk Management diskpart - Disk management from the command line dsa.msc - Opens active directory users and computers dsquery - Finds any objects in the directory according to criteria dxdiag - DirectX Diagnostic Tool eventvwr - Windows Event Log (Event Viewer) explorer . - Open explorer with the current folder selected. explorer /e, . - Open explorer, with folder tree, with current folder selected. F7 - View command history find - Searches for a text string in a file or files findstr - Find a string in a file firewall.cpl - Opens the Windows Firewall settings fsmgmt.msc - Shared Folders fsutil - Perform tasks related to FAT and NTFS file systems ftp - Transfers files to and from a computer running an FTP server service getmac - Shows the mac address(es) of your network adapter(s) gpedit.msc - Group Policy Editor gpresult - Displays the Resultant Set of Policy (RSoP) information for a target user and computer httpcfg.exe - HTTP Configuration Utility iisreset - To restart IIS InetMgr.exe - Internet Information Services (IIS) Manager 7 InetMgr6.exe - Internet Information Services (IIS) Manager 6 intl.cpl - Regional and Language Options ipconfig - Internet protocol configuration lusrmgr.msc - Local Users and Groups Administrator msconfig - System Configuration notepad - Notepad? ;) mmsys.cpl - Sound/Recording/Playback properties mode - Configure system devices more - Displays one screen of output at a time mrt - Microsoft Windows Malicious Software Removal Tool mstsc.exe - Remote Desktop Connection nbstat - displays protocol statistics and current TCP/IP connections using NBT ncpa.cpl - Network Connections netsh - Display or modify the network configuration of a computer that is currently running netstat - Network Statistics net statistics - Check computer up time net stop - Stops a running service. net use - Connects a computer to or disconnects a computer from a shared resource, or displays information about computer connections odbcad32.exe - ODBC Data Source Administrator pathping - A traceroute that collects detailed packet loss stats perfmon - Opens Reliability and Performance Monitor ping - Determine whether a remote computer is accessible over the network powercfg.cpl - Power management control panel applet quser - Display information about user sessions on a terminal server qwinsta - See disconnected remote desktop sessions reg.exe - Console Registry Tool for Windows regedit - Registry Editor rasdial - Connects to a VPN or a dialup network robocopy - Backup/Restore/Copy large amounts of files reliably rsop.msc - Resultant Set of Policy (shows the combined effect of all group policies active on the current system/login) runas - Run specific tools and programs with different permissions than the user's current logon provides sc - Manage anything you want to do with services. schtasks - Enables an administrator to create, delete, query, change, run and end scheduled tasks on a local or remote system. secpol.msc - Local Security Settings services.msc - Services control panel set - Displays, sets, or removes cmd.exe environment variables. set DIRCMD - Preset dir parameter in cmd.exe start - Starts a separate window to run a specified program or command start. - opens the current directory in the Windows Explorer. shutdown.exe - Shutdown or Reboot a local/remote machine subst.exe - Associates a path with a drive letter, including local drives systeminfo -Displays a comprehensive information about the system taskkill - terminate tasks by process id (PID) or image name tasklist.exe - List Processes on local or a remote machine taskmgr.exe - Task Manager telephon.cpl - Telephone and Modem properties timedate.cpl - Date and Time title - Change the title of the CMD window you have open tracert - Trace route wmic - Windows Management Instrumentation Command-line winver.exe - Find Windows Version wscui.cpl - Windows Security Center wuauclt.exe - Windows Update AutoUpdate Client

Read the article
rsyslogd not monitoring all files

- by Tom O'Connor

So.. I've installed Logstash, and instead of using the logstash shipper (because it needs the JVM and is generally massive), I'm using rsyslogd with the following configuration. # Use traditional timestamp format $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat $IncludeConfig /etc/rsyslog.d/*.conf # Provides kernel logging support (previously done by rklogd) $ModLoad imklog # Provides support for local system logging (e.g. via logger command) $ModLoad imuxsock # Log all kernel messages to the console. # Logging much else clutters up the screen. #kern.* /dev/console # Log anything (except mail) of level info or higher. # Don't log private authentication messages! *.info;mail.none;authpriv.none;cron.none;local6.none /var/log/messages # The authpriv file has restricted access. authpriv.* /var/log/secure # Log all the mail messages in one place. mail.* -/var/log/maillog # Log cron stuff cron.* /var/log/cron # Everybody gets emergency messages *.emerg * # Save news errors of level crit and higher in a special file. uucp,news.crit /var/log/spooler # Save boot messages also to boot.log local7.* /var/log/boot.log In /etc/rsyslog.d/logstash.conf there are 28 file monitor blocks using imfile $ModLoad imfile # Load the imfile input module $ModLoad imklog # for reading kernel log messages $ModLoad imuxsock # for reading local syslog messages $InputFileName /var/log/rabbitmq/startup_err $InputFileTag rmq-err: $InputFileStateFile state-rmq-err $InputFileFacility local6 $InputRunFileMonitor .... $InputFileName /var/log/some.other.custom.log $InputFileTag cust-log: $InputFileStateFile state-cust-log $InputFileFacility local6 $InputRunFileMonitor .... *.* @@10.90.0.110:5514 There are 28 InputFileMonitor blocks, each monitoring a different custom application logfile.. If I run [root@secret-gm02 ~]# lsof|grep rsyslog rsyslogd 5380 root cwd DIR 253,0 4096 2 / rsyslogd 5380 root rtd DIR 253,0 4096 2 / rsyslogd 5380 root txt REG 253,0 278976 1015955 /sbin/rsyslogd rsyslogd 5380 root mem REG 253,0 58400 1868123 /lib64/libgcc_s-4.1.2-20080825.so.1 rsyslogd 5380 root mem REG 253,0 144776 1867778 /lib64/ld-2.5.so rsyslogd 5380 root mem REG 253,0 1718232 1867780 /lib64/libc-2.5.so rsyslogd 5380 root mem REG 253,0 23360 1867787 /lib64/libdl-2.5.so rsyslogd 5380 root mem REG 253,0 145872 1867797 /lib64/libpthread-2.5.so rsyslogd 5380 root mem REG 253,0 85544 1867815 /lib64/libz.so.1.2.3 rsyslogd 5380 root mem REG 253,0 53448 1867801 /lib64/librt-2.5.so rsyslogd 5380 root mem REG 253,0 92816 1868016 /lib64/libresolv-2.5.so rsyslogd 5380 root mem REG 253,0 20384 1867990 /lib64/rsyslog/lmnsd_ptcp.so rsyslogd 5380 root mem REG 253,0 53880 1867802 /lib64/libnss_files-2.5.so rsyslogd 5380 root mem REG 253,0 23736 1867800 /lib64/libnss_dns-2.5.so rsyslogd 5380 root mem REG 253,0 20768 1867988 /lib64/rsyslog/lmnet.so rsyslogd 5380 root mem REG 253,0 11488 1867982 /lib64/rsyslog/imfile.so rsyslogd 5380 root mem REG 253,0 24040 1867983 /lib64/rsyslog/imklog.so rsyslogd 5380 root mem REG 253,0 11536 1867987 /lib64/rsyslog/imuxsock.so rsyslogd 5380 root mem REG 253,0 13152 1867989 /lib64/rsyslog/lmnetstrms.so rsyslogd 5380 root mem REG 253,0 8400 1867992 /lib64/rsyslog/lmtcpclt.so rsyslogd 5380 root 0r REG 0,3 0 4026531848 /proc/kmsg rsyslogd 5380 root 1u IPv4 1200589517 0t0 TCP 10.10.10.90 t:40629->10.10.10.90:5514 (ESTABLISHED) rsyslogd 5380 root 2u IPv4 1200589527 0t0 UDP *:45801 rsyslogd 5380 root 3w REG 253,3 17999744 2621483 /var/log/messages rsyslogd 5380 root 4w REG 253,3 13383 2621484 /var/log/secure rsyslogd 5380 root 5w REG 253,3 7180 2621493 /var/log/maillog rsyslogd 5380 root 6w REG 253,3 43321 2621529 /var/log/cron rsyslogd 5380 root 7w REG 253,3 0 2621494 /var/log/spooler rsyslogd 5380 root 8w REG 253,3 0 2621495 /var/log/boot.log rsyslogd 5380 root 9r REG 253,3 1064271998 2621464 /var/log/custom-application.monolog.log rsyslogd 5380 root 10u unix 0xffff81081fad2e40 0t0 1200589511 /dev/log You can see that there are nowhere near 28 logfiles actually being read. I really had to get one file monitored, so I moved it to the top, and it picked it up, but I'd like to be able to monitor all 28+ files, and not have to worry. OS is Centos 5.5 Kernel 2.6.18-308.el5 rsyslogd 3.22.1, compiled with: FEATURE_REGEXP: Yes FEATURE_LARGEFILE: Yes FEATURE_NETZIP (message compression): Yes GSSAPI Kerberos 5 support: Yes FEATURE_DEBUG (debug build, slow code): No Atomic operations supported: Yes Runtime Instrumentation (slow code): No Questions: Why is rsyslogd only monitoring a very small subset of the files? How can I fix this so that all the files are monitored?

Read the article
ODI 12c - Parallel Table Load

- by David Allan

In this post we will look at the ODI 12c capability of parallel table load from the aspect of the mapping developer and the knowledge module developer - two quite different viewpoints. This is about parallel table loading which isn't to be confused with loading multiple targets per se. It supports the ability for ODI mappings to be executed concurrently especially if there is an overlap of the datastores that they access, so any temporary resources created may be uniquely constructed by ODI. Temporary objects can be anything basically - common examples are staging tables, indexes, views, directories - anything in the ETL to help the data integration flow do its job. In ODI 11g users found a few workarounds (such as changing the technology prefixes - see here) to build unique temporary names but it was more of a challenge in error cases. ODI 12c mappings by default operate exactly as they did in ODI 11g with respect to these temporary names (this is also true for upgraded interfaces and scenarios) but can be configured to support the uniqueness capabilities. We will look at this feature from two aspects; that of a mapping developer and that of a developer (of procedures or KMs). 1. Firstly as a Mapping Developer..... 1.1 Control when uniqueness is enabled A new property is available to set unique name generation on/off. When unique names have been enabled for a mapping, all temporary names used by the collection and integration objects will be generated using unique names. This property is presented as a check-box in the Property Inspector for a deployment specification. 1.2 Handle cleanup after successful execution Provided that all temporary objects that are created have a corresponding drop statement then all of the temporary objects should be removed during a successful execution. This should be the case with the KMs developed by Oracle. 1.3 Handle cleanup after unsuccessful execution If an execution failed in ODI 11g then temporary tables would have been left around and cleaned up in the subsequent run. In ODI 12c, KM tasks can now have a cleanup-type task which is executed even after a failure in the main tasks. These cleanup tasks will be executed even on failure if the property 'Remove Temporary Objects on Error' is set. If the agent was to crash and not be able to execute this task, then there is an ODI tool (OdiRemoveTemporaryObjects here) you can invoke to cleanup the tables - it supports date ranges and the like. That's all there is to it from the aspect of the mapping developer it's much, much simpler and straightforward. You can now execute the same mapping concurrently or execute many mappings using the same resource concurrently without worrying about conflict. 2. Secondly as a Procedure or KM Developer..... In the ODI Operator the executed code shows the actual name that is generated - you can also see the runtime code prior to execution (introduced in 11.1.1.7), for example below in the code type I selected 'Pre-executed Code' this lets you see the code about to be processed and you can also see the executed code (which is the default view). References to the collection (C$) and integration (I$) names will be automatically made unique by using the odiRef APIs - these objects will have unique names whenever concurrency has been enabled for a particular mapping deployment specification. It's also possible to use name uniqueness functions in procedures and your own KMs. 2.1 New uniqueness tags You can also make your own temporary objects have unique names by explicitly including either %UNIQUE_STEP_TAG or %UNIQUE_SESSION_TAG in the name passed to calls to the odiRef APIs. Such names would always include the unique tag regardless of the concurrency setting. To illustrate, let's look at the getObjectName() method. At <% expansion time, this API will append %UNIQUE_STEP_TAG to the object name for collection and integration tables. The name parameter passed to this API may contain %UNIQUE_STEP_TAG or %UNIQUE_SESSION_TAG. This API always generates to the <? version of getObjectName() At execution time this API will replace the unique tag macros with a string that is unique to the current execution scope. The returned name will conform to the name-length restriction for the target technology, and its pattern for the unique tag. Any necessary truncation will be performed against the initial name for the object and any other fixed text that may have been specified. Examples are:- <?=odiRef.getObjectName("L", "%COL_PRFEMP%UNIQUE_STEP_TAG", "D")?> SCOTT.C$_EABH7QI1BR1EQI3M76PG9SIMBQQ <?=odiRef.getObjectName("L", "EMP%UNIQUE_STEP_TAG_AE", "D")?> SCOTT.EMPAO96Q2JEKO0FTHQP77TMSAIOSR_ Methods which have this kind of support include getFrom, getTableName, getTable, getObjectShortName and getTemporaryIndex. There are APIs for retrieving this tag info also, the getInfo API has been extended with the following properties (the UNIQUE* properties can also be used in ODI procedures); UNIQUE_STEP_TAG - Returns the unique value for the current step scope, e.g. 5rvmd8hOIy7OU2o1FhsF61 Note that this will be a different value for each loop-iteration when the step is in a loop. UNIQUE_SESSION_TAG - Returns the unique value for the current session scope, e.g. 6N38vXLrgjwUwT5MseHHY9 IS_CONCURRENT - Returns info about the current mapping, will return 0 or 1 (only in % phase) GUID_SRC_SET - Returns the UUID for the current source set/execution unit (only in % phase) The getPop API has been extended with the IS_CONCURRENT property which returns info about an mapping, will return 0 or 1. 2.2 Additional APIs Some new APIs are provided including getFormattedName which will allow KM developers to construct a name from fixed-text or ODI symbols that can be optionally truncate to a max length and use a specific encoding for the unique tag. It has syntax getFormattedName(String pName[, String pTechnologyCode]) This API is available at both the % and the ? phase. The format string can contain the ODI prefixes that are available for getObjectName(), e.g. %INT_PRF, %COL_PRF, %ERR_PRF, %IDX_PRF alongwith %UNIQUE_STEP_TAG or %UNIQUE_SESSION_TAG. The latter tags will be expanded into a unique string according to the specified technology. Calls to this API within the same execution context are guaranteed to return the same unique name provided that the same parameters are passed to the call. e.g. <%=odiRef.getFormattedName("%COL_PRFMY_TABLE%UNIQUE_STEP_TAG_AE", "ORACLE")%> <?=odiRef.getFormattedName("%COL_PRFMY_TABLE%UNIQUE_STEP_TAG_AE", "ORACLE")?> C$_MY_TAB7wDiBe80vBog1auacS1xB_AE <?=odiRef.getFormattedName("%COL_PRFMY_TABLE%UNIQUE_STEP_TAG.log", "FILE")?> C2_MY_TAB7wDiBe80vBog1auacS1xB.log 2.3 Name length generation As part of name generation, the length of the generated name will be compared with the maximum length for the target technology and truncation may need to be applied. When a unique tag is included in the generated string it is important that uniqueness is not compromised by truncation of the unique tag. When a unique tag is NOT part of the generated name, the name will be truncated by removing characters from the end - this is the existing 11g algorithm. When a unique tag is included, the algorithm will first truncate the <postfix> and if necessary the <prefix>. It is recommended that users will ensure there is sufficient uniqueness in the <prefix> section to ensure uniqueness of the final resultant name. SUMMARY To summarize, ODI 12c make it much simpler to utilize mappings in concurrent cases and provides APIs for helping developing any procedures or custom knowledge modules in such a way they can be used in highly concurrent, parallel scenarios.

Read the article
Blogging tips for SQL Server professionals

- by jamiet

For some time now I have been intending to put some material together relating my blogging experiences since I began blogging in 2004 and that led to me submitting a session for SQLBits recently where I intended to do just that. That didn’t get enough votes to allow me to present however so instead I resolved to write a blog post about it and Simon Sabin’s recent post Blogging – how do you do it? has prompted me to get around to completing it. So, here I present a compendium of tips that I’ve picked up from authoring a fair few blog posts over the past 6 years. Feedburner Feedburner.com is a service that can consume your blog’s default RSS feed and provide another, replacement, feed that has exactly the same content. You can then supply that replacement feed on your blog site for other people to consume in their RSS readers. Why would you want to do this? Well, two reasons actually: It makes your blog portable. If you ever want to move your blog to a different URL you don’t have to tell your subscribers to move to a different feed. The feedburner feed is a pointer to your blog content rather than being a copy of it. Feedburner will collect stats telling you how many people are subscribed to your feed, which RSS readers they use, stuff like that. Here’s a sample screenshot for http://sqlblog.com/blogs/jamie_thomson/: It also tells you what your most viewed posts are: Web stats like these are notoriously inaccurate but then again the method of measurement here is not important, what IS important is that it gives you a trustworthy ranking of your blog posts and (in my opinion) knowing which are your most popular posts is more important than knowing exactly how many views each post has had. This is just the tip of the iceberg of what Feedburner provides and I recommend every new blogger to try it! Monitor subscribers using Google Reader If for some reason Feedburner is not to your taste or (more likely) you already have an established RSS feed that you do not want to change then Google provide another way in which you can monitor your readership in the shape of their online RSS reader, Google Reader. It provides, for every RSS feed, a collection of stats including the number of Google Reader users that have subscribed to that RSS feed. This is really valuable information and in fact I have been recording this statistic for mine and a number of other blogs for a few years now and as such I can produce the following chart that indicates how readership is trending for those blogs over time: [Good news for my fellow SQLBlog bloggers.] As Stephen Few readily points out, its not the numbers that are important but the trend. Search Engine Optimisation (SEO) SEO (or “How do I get my blog to show up in Google”) is a massive area of expertise which I don’t want (and am unable) to cover in much detail here but there are some simple rules of thumb that will help: Tags – If your blog engine offers the ability to add tags to your blog post, use them. Invariably those tags go into the meta section of the page HTML and search engines lap that stuff up. For example, from my recent post Microsoft publish Visual Studio 2010 Database Project Guidance: Title – Search engines take notice of web page titles as well so make them specific and descriptive (e.g. “Configuring dtsConfig connection strings”) rather than esoteric and meaningless in a vain attempt to be humorous (e.g. “Last night a DJ saved my ETL batch”)! Title(2) – Make your title even more search engine friendly by mentioning high level subject areas, not dissimilar to Twitter hashtags. For example, if you look at all of my posts related to SSIS you will notice that nearly all contain the word “SSIS” in the title even if I had to shoehorn it in there by putting it in square brackets or similar. Another tip, if you ARE putting words into your titles in this artificial manner then put them at the end so that they’re not that prominent in search engine results; they’re there for the search engines to consume, not for human beings. Images – Always add titles and alternate text (ALT attribute) to images in your blog post. If you use Windows 7 or Windows Vista then you can use Live Writer (which Simon recommended) makes this easy for you. Headings – If you want to highlight section headings use heading tags (e.g. <H1>, <H2>, <H3> etc…) rather than just formatting the text appropriately – again, Live makes this easy. These tags give your blog posts structure that is understood by search engines and RSS readers alike. (I believe it makes them more amenable to CSS as well – though that’s not something I know too much about). If you check the HTML source for the blog post you’re reading right now you’ll be able to scan through and see where I have used heading tags. Microsoft provide a free tool called the SEO Toolkit that will analyse your blog site (for free) and tell you what things you should change to improve SEO. Go read more and download for free at Search Engine Optimization Toolkit. Did I mention that it was free? Miscellaneous Tips If you are including code in your blog post then ensure it is formatted correctly. Use SQL Server Central’s T-SQL prettifier for formatting T-SQL code. Use images and videos. Personally speaking there’s nothing I like less when reading a blog than paragraph after paragraph of text. Images make your blog more appealing which means people are more likely to read what you have written. Be original. Don’t plagiarise other people’s content and don’t simply rewrite the contents of Books Online. Every time you publish a blog post tweet a link to it. Include hashtags in your tweet that are more likely to grab people’s attention. That’s probably enough for now - I hope this blog post proves useful to someone out there. If you would appreciate a related session at a forthcoming SQLBits conference then please let me know. This will likely be my last blog post for 2010 so I would like to take this opportunity to thank everyone that has commented on, linked to or read any of my blog posts in that time. 2011 is shaping up to be a very interesting for SQL Server observers with the impending release of SQL Server code-named Denali and I promise I’ll have lots more content on that as the year progresses. Happy New Year. @Jamiet

Read the article
SQL SERVER – Faster SQL Server Databases and Applications – Power and Control with SafePeak Caching Options

- by Pinal Dave

Update: This blog post is written based on the SafePeak, which is available for free download. Today, I’d like to examine more closely one of my preferred technologies for accelerating SQL Server databases, SafePeak. Safepeak’s software provides a variety of advanced data caching options, techniques and tools to accelerate the performance and scalability of SQL Server databases and applications. I’d like to look more closely at some of these options, as some of these capabilities could help you address lagging database and performance on your systems. To better understand the available options, it is best to start by understanding the difference between the usual “Basic Caching” vs. SafePeak’s “Dynamic Caching”. Basic Caching Basic Caching (or the stale and static cache) is an ability to put the results from a query into cache for a certain period of time. It is based on TTL, or Time-to-live, and is designed to stay in cache no matter what happens to the data. For example, although the actual data can be modified due to DML commands (update/insert/delete), the cache will still hold the same obsolete query data. Meaning that with the Basic Caching is really static / stale cache. As you can tell, this approach has its limitations. Dynamic Caching Dynamic Caching (or the non-stale cache) is an ability to put the results from a query into cache while maintaining the cache transaction awareness looking for possible data modifications. The modifications can come as a result of: DML commands (update/insert/delete), indirect modifications due to triggers on other tables, executions of stored procedures with internal DML commands complex cases of stored procedures with multiple levels of internal stored procedures logic. When data modification commands arrive, the caching system identifies the related cache items and evicts them from cache immediately. In the dynamic caching option the TTL setting still exists, although its importance is reduced, since the main factor for cache invalidation (or cache eviction) become the actual data updates commands. Now that we have a basic understanding of the differences between “basic” and “dynamic” caching, let’s dive in deeper. SafePeak: A comprehensive and versatile caching platform SafePeak comes with a wide range of caching options. Some of SafePeak’s caching options are automated, while others require manual configuration. Together they provide a complete solution for IT and Data managers to reach excellent performance acceleration and application scalability for a wide range of business cases and applications. Automated caching of SQL Queries: Fully/semi-automated caching of all “read” SQL queries, containing any types of data, including Blobs, XMLs, Texts as well as all other standard data types. SafePeak automatically analyzes the incoming queries, categorizes them into SQL Patterns, identifying directly and indirectly accessed tables, views, functions and stored procedures; Automated caching of Stored Procedures: Fully or semi-automated caching of all read” stored procedures, including procedures with complex sub-procedure logic as well as procedures with complex dynamic SQL code. All procedures are analyzed in advance by SafePeak’s Metadata-Learning process, their SQL schemas are parsed – resulting with a full understanding of the underlying code, objects dependencies (tables, views, functions, sub-procedures) enabling automated or semi-automated (manually review and activate by a mouse-click) cache activation, with full understanding of the transaction logic for cache real-time invalidation; Transaction aware cache: Automated cache awareness for SQL transactions (SQL and in-procs); Dynamic SQL Caching: Procedures with dynamic SQL are pre-parsed, enabling easy cache configuration, eliminating SQL Server load for parsing time and delivering high response time value even in most complicated use-cases; Fully Automated Caching: SQL Patterns (including SQL queries and stored procedures) that are categorized by SafePeak as “read and deterministic” are automatically activated for caching; Semi-Automated Caching: SQL Patterns categorized as “Read and Non deterministic” are patterns of SQL queries and stored procedures that contain reference to non-deterministic functions, like getdate(). Such SQL Patterns are reviewed by the SafePeak administrator and in usually most of them are activated manually for caching (point and click activation); Fully Dynamic Caching: Automated detection of all dependent tables in each SQL Pattern, with automated real-time eviction of the relevant cache items in the event of “write” commands (a DML or a stored procedure) to one of relevant tables. A default setting; Semi Dynamic Caching: A manual cache configuration option enabling reducing the sensitivity of specific SQL Patterns to “write” commands to certain tables/views. An optimization technique relevant for cases when the query data is either known to be static (like archive order details), or when the application sensitivity to fresh data is not critical and can be stale for short period of time (gaining better performance and reduced load); Scheduled Cache Eviction: A manual cache configuration option enabling scheduling SQL Pattern cache eviction based on certain time(s) during a day. A very useful optimization technique when (for example) certain SQL Patterns can be cached but are time sensitive. Example: “select customers that today is their birthday”, an SQL with getdate() function, which can and should be cached, but the data stays relevant only until 00:00 (midnight); Parsing Exceptions Management: Stored procedures that were not fully parsed by SafePeak (due to too complex dynamic SQL or unfamiliar syntax), are signed as “Dynamic Objects” with highest transaction safety settings (such as: Full global cache eviction, DDL Check = lock cache and check for schema changes, and more). The SafePeak solution points the user to the Dynamic Objects that are important for cache effectiveness, provides easy configuration interface, allowing you to improve cache hits and reduce cache global evictions. Usually this is the first configuration in a deployment; Overriding Settings of Stored Procedures: Override the settings of stored procedures (or other object types) for cache optimization. For example, in case a stored procedure SP1 has an “insert” into table T1, it will not be allowed to be cached. However, it is possible that T1 is just a “logging or instrumentation” table left by developers. By overriding the settings a user can allow caching of the problematic stored procedure; Advanced Cache Warm-Up: Creating an XML-based list of queries and stored procedure (with lists of parameters) for periodically automated pre-fetching and caching. An advanced tool allowing you to handle more rare but very performance sensitive queries pre-fetch them into cache allowing high performance for users’ data access; Configuration Driven by Deep SQL Analytics: All SQL queries are continuously logged and analyzed, providing users with deep SQL Analytics and Performance Monitoring. Reduce troubleshooting from days to minutes with database objects and SQL Patterns heat-map. The performance driven configuration helps you to focus on the most important settings that bring you the highest performance gains. Use of SafePeak SQL Analytics allows continuous performance monitoring and analysis, easy identification of bottlenecks of both real-time and historical data; Cloud Ready: Available for instant deployment on Amazon Web Services (AWS). As you can see, there are many options to configure SafePeak’s SQL Server database and application acceleration caching technology to best fit a lot of situations. If you’re not familiar with their technology, they offer free-trial software you can download that comes with a free “help session” to help get you started. You can access the free trial here. Also, SafePeak is available to use on Amazon Cloud. Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: PostADay, SQL, SQL Authority, SQL Performance, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Windows Azure Virtual Machine Readiness and Capacity Assessment for SQL Server

- by SQLOS Team

Windows Azure Virtual Machine Readiness and Capacity Assessment for Windows Server Machine Running SQL Server With the release of MAP Toolkit 8.0 Beta, we have added a new scenario to assess your Windows Azure Virtual Machine Readiness. The MAP 8.0 Beta performs a comprehensive assessment of Windows Servers running SQL Server to determine you level of readiness to migrate an on-premise physical or virtual machine to Windows Azure Virtual Machines. The MAP Toolkit then offers suggested changes to prepare the machines for migration, such as upgrading the operating system or SQL Server. MAP Toolkit 8.0 Beta is available for download here Your participation and feedback is very important to make the MAP Toolkit work better for you. We encourage you to participate in the beta program and provide your feedback at [email protected] or through one of our surveys. Now, let’s walk through the MAP Toolkit task for completing the Windows Azure Virtual Machine assessment and capacity planning. The tasks include the following: Perform an inventory View the Windows Azure VM Readiness results and report Collect performance data for determine VM sizing View the Windows Azure Capacity results and report Perform an inventory: 1. To perform an inventory against a single machine or across a complete environment, choose Perform an Inventory to launch the Inventory and Assessment Wizard as shown below: 2. After the Inventory and Assessment Wizard launches, select either the Windows computers or SQL Server scenario to inventory Windows machines. HINT: If you don’t care about completely inventorying a machine, just select the SQL Server scenario. Click Next to Continue. 3. On the Discovery Methods page, select how you want to discover computers and then click Next to continue. Description of Discovery Methods: Use Active Directory Domain Services -- This method allows you to query a domain controller via the Lightweight Directory Access Protocol (LDAP) and select computers in all or specific domains, containers, or OUs. Use this method if all computers and devices are in AD DS. Windows networking protocols -- This method uses the WIN32 LAN Manager application programming interfaces to query the Computer Browser service for computers in workgroups and Windows NT 4.0–based domains. If the computers on the network are not joined to an Active Directory domain, use only the Windows networking protocols option to find computers. System Center Configuration Manager (SCCM) -- This method enables you to inventory computers managed by System Center Configuration Manager (SCCM). You need to provide credentials to the System Center Configuration Manager server in order to inventory the managed computers. When you select this option, the MAP Toolkit will query SCCM for a list of computers and then MAP will connect to these computers. Scan an IP address range -- This method allows you to specify the starting address and ending address of an IP address range. The wizard will then scan all IP addresses in the range and inventory only those computers. Note: This option can perform poorly, if many IP addresses aren’t being used within the range. Manually enter computer names and credentials -- Use this method if you want to inventory a small number of specific computers. Import computer names from a files -- Using this method, you can create a text file with a list of computer names that will be inventoried. 4. On the All Computers Credentials page, enter the accounts that have administrator rights to connect to the discovered machines. This does not need to a domain account, but needs to be a local administrator. I have entered my domain account that is an administrator on my local machine. Click Next after one or more accounts have been added. NOTE: The MAP Toolkit primarily uses Windows Management Instrumentation (WMI) to collect hardware, device, and software information from the remote computers. In order for the MAP Toolkit to successfully connect and inventory computers in your environment, you have to configure your machines to inventory through WMI and also allow your firewall to enable remote access through WMI. The MAP Toolkit also requires remote registry access for certain assessments. In addition to enabling WMI, you need accounts with administrative privileges to access desktops and servers in your environment. 5. On the Credentials Order page, select the order in which want the MAP Toolkit to connect to the machine and SQL Server. Generally just accept the defaults and click Next. 6. On the Enter Computers Manually page, click Create to pull up at dialog to enter one or more computer names. 7. On the Summary page confirm your settings and then click Finish. After clicking Finish the inventory process will start, as shown below: Windows Azure Readiness results and report After the inventory progress has completed, you can review the results under the Database scenario. On the tile, you will see the number of Windows Server machine with SQL Server that were analyzed, the number of machines that are ready to move without changes and the number of machines that require further changes. If you click this Azure VM Readiness tile, you will see additional details and can generate the Windows Azure VM Readiness Report. After the report is generated, select View | Saved Reports and Proposals to view the location of the report. Open up WindowsAzureVMReadiness* report in Excel. On the Windows tab, you can see the results of the assessment. This report has a column for the Operating System and SQL Server assessment and provides a recommendation on how to resolve, if there a component is not supported. Collect Performance Data Launch the Performance Wizard to collect performance information for the Windows Server machines that you would like the MAP Toolkit to suggest a Windows Azure VM size for. Windows Azure Capacity results and report After the performance metrics are collected, the Azure VM Capacity title will display the number of Virtual Machine sizes that are suggested for the Windows Server and Linux machines that were analyzed. You can then click on the Azure VM Capacity tile to see the capacity details and generate the Windows Azure VM Capacity Report. Within this report, you can view the performance data that was collected and the Virtual Machine sizes. MAP Toolkit 8.0 Beta is available for download here Your participation and feedback is very important to make the MAP Toolkit work better for you. We encourage you to participate in the beta program and provide your feedback at [email protected] or through one of our surveys. Useful References: Windows Azure Homepage How to guides for Windows Azure Virtual Machines Provisioning a SQL Server Virtual Machine on Windows Azure Windows Azure Pricing Peter Saddow Senior Program Manager – MAP Toolkit Team

Read the article
T-SQL Tuesday #31 - Logging Tricks with CONTEXT_INFO

- by Most Valuable Yak (Rob Volk)

This month's T-SQL Tuesday is being hosted by Aaron Nelson [b | t], fellow Atlantan (the city in Georgia, not the famous sunken city, or the resort in the Bahamas) and covers the topic of logging (the recording of information, not the harvesting of trees) and maintains the fine T-SQL Tuesday tradition begun by Adam Machanic [b | t] (the SQL Server guru, not the guy who fixes cars, check the spelling again, there will be a quiz later). This is a trick I learned from Fernando Guerrero [b | t] waaaaaay back during the PASS Summit 2004 in sunny, hurricane-infested Orlando, during his session on Secret SQL Server (not sure if that's the correct title, and I haven't used parentheses in this paragraph yet). CONTEXT_INFO is a neat little feature that's existed since SQL Server 2000 and perhaps even earlier. It lets you assign data to the current session/connection, and maintains that data until you disconnect or change it. In addition to the CONTEXT_INFO() function, you can also query the context_info column in sys.dm_exec_sessions, or even sysprocesses if you're still running SQL Server 2000, if you need to see it for another session. While you're limited to 128 bytes, one big advantage that CONTEXT_INFO has is that it's independent of any transactions. If you've ever logged to a table in a transaction and then lost messages when it rolled back, you can understand how aggravating it can be. CONTEXT_INFO also survives across multiple SQL batches (GO separators) in the same connection, so for those of you who were going to suggest "just log to a table variable, they don't get rolled back": HA-HA, I GOT YOU! Since GO starts a new batch all variable declarations are lost. Here's a simple example I recently used at work. I had to test database mirroring configurations for disaster recovery scenarios and measure the network throughput. I also needed to log how long it took for the script to run and include the mirror settings for the database in question. I decided to use AdventureWorks as my database model, and Adam Machanic's Big Adventure script to provide a fairly large workload that's repeatable and easily scalable. My test would consist of several copies of AdventureWorks running the Big Adventure script while I mirrored the databases (or not). Since Adam's script contains several batches, I decided CONTEXT_INFO would have to be used. As it turns out, I only needed to grab the start time at the beginning, I could get the rest of the data at the end of the process. The code is pretty small: declare @time binary(128)=cast(getdate() as binary(8)) set context_info @time ... rest of Big Adventure code ... go use master; insert mirror_test(server,role,partner,db,state,safety,start,duration) select @@servername, mirroring_role_desc, mirroring_partner_instance, db_name(database_id), mirroring_state_desc, mirroring_safety_level_desc, cast(cast(context_info() as binary(8)) as datetime), datediff(s,cast(cast(context_info() as binary(8)) as datetime),getdate()) from sys.database_mirroring where db_name(database_id) like 'Adv%'; I declared @time as a binary(128) since CONTEXT_INFO is defined that way. I couldn't convert GETDATE() to binary(128) as it would pad the first 120 bytes as 0x00. To keep the CAST functions simple and avoid using SUBSTRING, I decided to CAST GETDATE() as binary(8) and let SQL Server do the implicit conversion. It's not the safest way perhaps, but it works on my machine. :) As I mentioned earlier, you can query system views for sessions and get their CONTEXT_INFO. With a little boilerplate code this can be used to monitor long-running procedures, in case you need to kill a process, or are just curious how long certain parts take. In this example, I added code to Adam's Big Adventure script to set CONTEXT_INFO messages at strategic places I want to monitor. (His code is in UPPERCASE as it was in the original, mine is all lowercase): declare @msg binary(128) set @msg=cast('Altering bigProduct.ProductID' as binary(128)) set context_info @msg go ALTER TABLE bigProduct ALTER COLUMN ProductID INT NOT NULL GO set context_info 0x0 go declare @msg1 binary(128) set @msg1=cast('Adding pk_bigProduct Constraint' as binary(128)) set context_info @msg1 go ALTER TABLE bigProduct ADD CONSTRAINT pk_bigProduct PRIMARY KEY (ProductID) GO set context_info 0x0 go declare @msg2 binary(128) set @msg2=cast('Altering bigTransactionHistory.TransactionID' as binary(128)) set context_info @msg2 go ALTER TABLE bigTransactionHistory ALTER COLUMN TransactionID INT NOT NULL GO set context_info 0x0 go declare @msg3 binary(128) set @msg3=cast('Adding pk_bigTransactionHistory Constraint' as binary(128)) set context_info @msg3 go ALTER TABLE bigTransactionHistory ADD CONSTRAINT pk_bigTransactionHistory PRIMARY KEY NONCLUSTERED(TransactionID) GO set context_info 0x0 go declare @msg4 binary(128) set @msg4=cast('Creating IX_ProductId_TransactionDate Index' as binary(128)) set context_info @msg4 go CREATE NONCLUSTERED INDEX IX_ProductId_TransactionDate ON bigTransactionHistory(ProductId,TransactionDate) INCLUDE(Quantity,ActualCost) GO set context_info 0x0 This doesn't include the entire script, only those portions that altered a table or created an index. One annoyance is that SET CONTEXT_INFO requires a literal or variable, you can't use an expression. And since GO starts a new batch I need to declare a variable in each one. And of course I have to use CAST because it won't implicitly convert varchar to binary. And even though context_info is a nullable column, you can't SET CONTEXT_INFO NULL, so I have to use SET CONTEXT_INFO 0x0 to clear the message after the statement completes. And if you're thinking of turning this into a UDF, you can't, although a stored procedure would work. So what does all this aggravation get you? As the code runs, if I want to see which stage the session is at, I can run the following (assuming SPID 51 is the one I want): select CAST(context_info as varchar(128)) from sys.dm_exec_sessions where session_id=51 Since SQL Server 2005 introduced the new system and dynamic management views (DMVs) there's not as much need for tagging a session with these kinds of messages. You can get the session start time and currently executing statement from them, and neatly presented if you use Adam's sp_whoisactive utility (and you absolutely should be using it). Of course you can always use xp_cmdshell, a CLR function, or some other tricks to log information outside of a SQL transaction. All the same, I've used this trick to monitor long-running reports at a previous job, and I still think CONTEXT_INFO is a great feature, especially if you're still using SQL Server 2000 or want to supplement your instrumentation. If you'd like an exercise, consider adding the system time to the messages in the last example, and an automated job to query and parse it from the system tables. That would let you track how long each statement ran without having to run Profiler. #TSQL2sDay

Read the article
T-SQL Tuesday #31 - Logging Tricks with CONTEXT_INFO

- by Most Valuable Yak (Rob Volk)

This month's T-SQL Tuesday is being hosted by Aaron Nelson [b | t], fellow Atlantan (the city in Georgia, not the famous sunken city, or the resort in the Bahamas) and covers the topic of logging (the recording of information, not the harvesting of trees) and maintains the fine T-SQL Tuesday tradition begun by Adam Machanic [b | t] (the SQL Server guru, not the guy who fixes cars, check the spelling again, there will be a quiz later). This is a trick I learned from Fernando Guerrero [b | t] waaaaaay back during the PASS Summit 2004 in sunny, hurricane-infested Orlando, during his session on Secret SQL Server (not sure if that's the correct title, and I haven't used parentheses in this paragraph yet). CONTEXT_INFO is a neat little feature that's existed since SQL Server 2000 and perhaps even earlier. It lets you assign data to the current session/connection, and maintains that data until you disconnect or change it. In addition to the CONTEXT_INFO() function, you can also query the context_info column in sys.dm_exec_sessions, or even sysprocesses if you're still running SQL Server 2000, if you need to see it for another session. While you're limited to 128 bytes, one big advantage that CONTEXT_INFO has is that it's independent of any transactions. If you've ever logged to a table in a transaction and then lost messages when it rolled back, you can understand how aggravating it can be. CONTEXT_INFO also survives across multiple SQL batches (GO separators) in the same connection, so for those of you who were going to suggest "just log to a table variable, they don't get rolled back": HA-HA, I GOT YOU! Since GO starts a new batch all variable declarations are lost. Here's a simple example I recently used at work. I had to test database mirroring configurations for disaster recovery scenarios and measure the network throughput. I also needed to log how long it took for the script to run and include the mirror settings for the database in question. I decided to use AdventureWorks as my database model, and Adam Machanic's Big Adventure script to provide a fairly large workload that's repeatable and easily scalable. My test would consist of several copies of AdventureWorks running the Big Adventure script while I mirrored the databases (or not). Since Adam's script contains several batches, I decided CONTEXT_INFO would have to be used. As it turns out, I only needed to grab the start time at the beginning, I could get the rest of the data at the end of the process. The code is pretty small: declare @time binary(128)=cast(getdate() as binary(8)) set context_info @time ... rest of Big Adventure code ... go use master; insert mirror_test(server,role,partner,db,state,safety,start,duration) select @@servername, mirroring_role_desc, mirroring_partner_instance, db_name(database_id), mirroring_state_desc, mirroring_safety_level_desc, cast(cast(context_info() as binary(8)) as datetime), datediff(s,cast(cast(context_info() as binary(8)) as datetime),getdate()) from sys.database_mirroring where db_name(database_id) like 'Adv%'; I declared @time as a binary(128) since CONTEXT_INFO is defined that way. I couldn't convert GETDATE() to binary(128) as it would pad the first 120 bytes as 0x00. To keep the CAST functions simple and avoid using SUBSTRING, I decided to CAST GETDATE() as binary(8) and let SQL Server do the implicit conversion. It's not the safest way perhaps, but it works on my machine. :) As I mentioned earlier, you can query system views for sessions and get their CONTEXT_INFO. With a little boilerplate code this can be used to monitor long-running procedures, in case you need to kill a process, or are just curious how long certain parts take. In this example, I added code to Adam's Big Adventure script to set CONTEXT_INFO messages at strategic places I want to monitor. (His code is in UPPERCASE as it was in the original, mine is all lowercase): declare @msg binary(128) set @msg=cast('Altering bigProduct.ProductID' as binary(128)) set context_info @msg go ALTER TABLE bigProduct ALTER COLUMN ProductID INT NOT NULL GO set context_info 0x0 go declare @msg1 binary(128) set @msg1=cast('Adding pk_bigProduct Constraint' as binary(128)) set context_info @msg1 go ALTER TABLE bigProduct ADD CONSTRAINT pk_bigProduct PRIMARY KEY (ProductID) GO set context_info 0x0 go declare @msg2 binary(128) set @msg2=cast('Altering bigTransactionHistory.TransactionID' as binary(128)) set context_info @msg2 go ALTER TABLE bigTransactionHistory ALTER COLUMN TransactionID INT NOT NULL GO set context_info 0x0 go declare @msg3 binary(128) set @msg3=cast('Adding pk_bigTransactionHistory Constraint' as binary(128)) set context_info @msg3 go ALTER TABLE bigTransactionHistory ADD CONSTRAINT pk_bigTransactionHistory PRIMARY KEY NONCLUSTERED(TransactionID) GO set context_info 0x0 go declare @msg4 binary(128) set @msg4=cast('Creating IX_ProductId_TransactionDate Index' as binary(128)) set context_info @msg4 go CREATE NONCLUSTERED INDEX IX_ProductId_TransactionDate ON bigTransactionHistory(ProductId,TransactionDate) INCLUDE(Quantity,ActualCost) GO set context_info 0x0 This doesn't include the entire script, only those portions that altered a table or created an index. One annoyance is that SET CONTEXT_INFO requires a literal or variable, you can't use an expression. And since GO starts a new batch I need to declare a variable in each one. And of course I have to use CAST because it won't implicitly convert varchar to binary. And even though context_info is a nullable column, you can't SET CONTEXT_INFO NULL, so I have to use SET CONTEXT_INFO 0x0 to clear the message after the statement completes. And if you're thinking of turning this into a UDF, you can't, although a stored procedure would work. So what does all this aggravation get you? As the code runs, if I want to see which stage the session is at, I can run the following (assuming SPID 51 is the one I want): select CAST(context_info as varchar(128)) from sys.dm_exec_sessions where session_id=51 Since SQL Server 2005 introduced the new system and dynamic management views (DMVs) there's not as much need for tagging a session with these kinds of messages. You can get the session start time and currently executing statement from them, and neatly presented if you use Adam's sp_whoisactive utility (and you absolutely should be using it). Of course you can always use xp_cmdshell, a CLR function, or some other tricks to log information outside of a SQL transaction. All the same, I've used this trick to monitor long-running reports at a previous job, and I still think CONTEXT_INFO is a great feature, especially if you're still using SQL Server 2000 or want to supplement your instrumentation. If you'd like an exercise, consider adding the system time to the messages in the last example, and an automated job to query and parse it from the system tables. That would let you track how long each statement ran without having to run Profiler. #TSQL2sDay

Read the article
SQL SERVER – SSIS Look Up Component – Cache Mode – Notes from the Field #028

- by Pinal Dave

[Notes from Pinal]: Lots of people think that SSIS is all about arranging various operations together in one logical flow. Well, the understanding is absolutely correct, but the implementation of the same is not as easy as it seems. Similarly most of the people think lookup component is just component which does look up for additional information and does not pay much attention to it. Due to the same reason they do not pay attention to the same and eventually get very bad performance. Linchpin People are database coaches and wellness experts for a data driven world. In this 28th episode of the Notes from the Fields series database expert Tim Mitchell (partner at Linchpin People) shares very interesting conversation related to how to write a good lookup component with Cache Mode. In SQL Server Integration Services, the lookup component is one of the most frequently used tools for data validation and completion. The lookup component is provided as a means to virtually join one set of data to another to validate and/or retrieve missing values. Properly configured, it is reliable and reasonably fast. Among the many settings available on the lookup component, one of the most critical is the cache mode. This selection will determine whether and how the distinct lookup values are cached during package execution. It is critical to know how cache modes affect the result of the lookup and the performance of the package, as choosing the wrong setting can lead to poorly performing packages, and in some cases, incorrect results. Full Cache The full cache mode setting is the default cache mode selection in the SSIS lookup transformation. Like the name implies, full cache mode will cause the lookup transformation to retrieve and store in SSIS cache the entire set of data from the specified lookup location. As a result, the data flow in which the lookup transformation resides will not start processing any data buffers until all of the rows from the lookup query have been cached in SSIS. The most commonly used cache mode is the full cache setting, and for good reason. The full cache setting has the most practical applications, and should be considered the go-to cache setting when dealing with an untested set of data. With a moderately sized set of reference data, a lookup transformation using full cache mode usually performs well. Full cache mode does not require multiple round trips to the database, since the entire reference result set is cached prior to data flow execution. There are a few potential gotchas to be aware of when using full cache mode. First, you can see some performance issues – memory pressure in particular – when using full cache mode against large sets of reference data. If the table you use for the lookup is very large (either deep or wide, or perhaps both), there’s going to be a performance cost associated with retrieving and caching all of that data. Also, keep in mind that when doing a lookup on character data, full cache mode will always do a case-sensitive (and in some cases, space-sensitive) string comparison even if your database is set to a case-insensitive collation. This is because the in-memory lookup uses a .NET string comparison (which is case- and space-sensitive) as opposed to a database string comparison (which may be case sensitive, depending on collation). There’s a relatively easy workaround in which you can use the UPPER() or LOWER() function in the pipeline data and the reference data to ensure that case differences do not impact the success of your lookup operation. Again, neither of these present a reason to avoid full cache mode, but should be used to determine whether full cache mode should be used in a given situation. Full cache mode is ideally useful when one or all of the following conditions exist: The size of the reference data set is small to moderately sized The size of the pipeline data set (the data you are comparing to the lookup table) is large, is unknown at design time, or is unpredictable Each distinct key value(s) in the pipeline data set is expected to be found multiple times in that set of data Partial Cache When using the partial cache setting, lookup values will still be cached, but only as each distinct value is encountered in the data flow. Initially, each distinct value will be retrieved individually from the specified source, and then cached. To be clear, this is a row-by-row lookup for each distinct key value(s). This is a less frequently used cache setting because it addresses a narrower set of scenarios. Because each distinct key value(s) combination requires a relational round trip to the lookup source, performance can be an issue, especially with a large pipeline data set to be compared to the lookup data set. If you have, for example, a million records from your pipeline data source, you have the potential for doing a million lookup queries against your lookup data source (depending on the number of distinct values in the key column(s)). Therefore, one has to be keenly aware of the expected row count and value distribution of the pipeline data to safely use partial cache mode. Using partial cache mode is ideally suited for the conditions below: The size of the data in the pipeline (more specifically, the number of distinct key column) is relatively small The size of the lookup data is too large to effectively store in cache The lookup source is well indexed to allow for fast retrieval of row-by-row values No Cache As you might guess, selecting no cache mode will not add any values to the lookup cache in SSIS. As a result, every single row in the pipeline data set will require a query against the lookup source. Since no data is cached, it is possible to save a small amount of overhead in SSIS memory in cases where key values are not reused. In the real world, I don’t see a lot of use of the no cache setting, but I can imagine some edge cases where it might be useful. As such, it’s critical to know your data before choosing this option. Obviously, performance will be an issue with anything other than small sets of data, as the no cache setting requires row-by-row processing of all of the data in the pipeline. I would recommend considering the no cache mode only when all of the below conditions are true: The reference data set is too large to reasonably be loaded into SSIS memory The pipeline data set is small and is not expected to grow There are expected to be very few or no duplicates of the key values(s) in the pipeline data set (i.e., there would be no benefit from caching these values) Conclusion The cache mode, an often-overlooked setting on the SSIS lookup component, represents an important design decision in your SSIS data flow. Choosing the right lookup cache mode directly impacts the fidelity of your results and the performance of package execution. Know how this selection impacts your ETL loads, and you’ll end up with more reliable, faster packages. If you want me to take a look at your server and its settings, or if your server is facing any issue we can Fix Your SQL Server. Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: Notes from the Field, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL Tagged: SSIS

Read the article
Fraud Detection with the SQL Server Suite Part 2

- by Dejan Sarka

This is the second part of the fraud detection whitepaper. You can find the first part in my previous blog post about this topic. My Approach to Data Mining Projects It is impossible to evaluate the time and money needed for a complete fraud detection infrastructure in advance. Personally, I do not know the customer’s data in advance. I don’t know whether there is already an existing infrastructure, like a data warehouse, in place, or whether we would need to build one from scratch. Therefore, I always suggest to start with a proof-of-concept (POC) project. A POC takes something between 5 and 10 working days, and involves personnel from the customer’s site – either employees or outsourced consultants. The team should include a subject matter expert (SME) and at least one information technology (IT) expert. The SME must be familiar with both the domain in question as well as the meaning of data at hand, while the IT expert should be familiar with the structure of data, how to access it, and have some programming (preferably Transact-SQL) knowledge. With more than one IT expert the most time consuming work, namely data preparation and overview, can be completed sooner. I assume that the relevant data is already extracted and available at the very beginning of the POC project. If a customer wants to have their people involved in the project directly and requests the transfer of knowledge, the project begins with training. I strongly advise this approach as it offers the establishment of a common background for all people involved, the understanding of how the algorithms work and the understanding of how the results should be interpreted, a way of becoming familiar with the SQL Server suite, and more. Once the data has been extracted, the customer’s SME (i.e. the analyst), and the IT expert assigned to the project will learn how to prepare the data in an efficient manner. Together with me, knowledge and expertise allow us to focus immediately on the most interesting attributes and identify any additional, calculated, ones soon after. By employing our programming knowledge, we can, for example, prepare tens of derived variables, detect outliers, identify the relationships between pairs of input variables, and more, in only two or three days, depending on the quantity and the quality of input data. I favor the customer’s decision of assigning additional personnel to the project. For example, I actually prefer to work with two teams simultaneously. I demonstrate and explain the subject matter by applying techniques directly on the data managed by each team, and then both teams continue to work on the data overview and data preparation under our supervision. I explain to the teams what kind of results we expect, the reasons why they are needed, and how to achieve them. Afterwards we review and explain the results, and continue with new instructions, until we resolve all known problems. Simultaneously with the data preparation the data overview is performed. The logic behind this task is the same – again I show to the teams involved the expected results, how to achieve them and what they mean. This is also done in multiple cycles as is the case with data preparation, because, quite frankly, both tasks are completely interleaved. A specific objective of the data overview is of principal importance – it is represented by a simple star schema and a simple OLAP cube that will first of all simplify data discovery and interpretation of the results, and will also prove useful in the following tasks. The presence of the customer’s SME is the key to resolving possible issues with the actual meaning of the data. We can always replace the IT part of the team with another database developer; however, we cannot conduct this kind of a project without the customer’s SME. After the data preparation and when the data overview is available, we begin the scientific part of the project. I assist the team in developing a variety of models, and in interpreting the results. The results are presented graphically, in an intuitive way. While it is possible to interpret the results on the fly, a much more appropriate alternative is possible if the initial training was also performed, because it allows the customer’s personnel to interpret the results by themselves, with only some guidance from me. The models are evaluated immediately by using several different techniques. One of the techniques includes evaluation over time, where we use an OLAP cube. After evaluating the models, we select the most appropriate model to be deployed for a production test; this allows the team to understand the deployment process. There are many possibilities of deploying data mining models into production; at the POC stage, we select the one that can be completed quickly. Typically, this means that we add the mining model as an additional dimension to an existing DW or OLAP cube, or to the OLAP cube developed during the data overview phase. Finally, we spend some time presenting the results of the POC project to the stakeholders and managers. Even from a POC, the customer will receive lots of benefits, all at the sole risk of spending money and time for a single 5 to 10 day project: The customer learns the basic patterns of frauds and fraud detection The customer learns how to do the entire cycle with their own people, only relying on me for the most complex problems The customer’s analysts learn how to perform much more in-depth analyses than they ever thought possible The customer’s IT experts learn how to perform data extraction and preparation much more efficiently than they did before All of the attendees of this training learn how to use their own creativity to implement further improvements of the process and procedures, even after the solution has been deployed to production The POC output for a smaller company or for a subsidiary of a larger company can actually be considered a finished, production-ready solution It is possible to utilize the results of the POC project at subsidiary level, as a finished POC project for the entire enterprise Typically, the project results in several important “side effects” Improved data quality Improved employee job satisfaction, as they are able to proactively contribute to the central knowledge about fraud patterns in the organization Because eventually more minds get to be involved in the enterprise, the company should expect more and better fraud detection patterns After the POC project is completed as described above, the actual project would not need months of engagement from my side. This is possible due to our preference to transfer the knowledge onto the customer’s employees: typically, the customer will use the results of the POC project for some time, and only engage me again to complete the project, or to ask for additional expertise if the complexity of the problem increases significantly. I usually expect to perform the following tasks: Establish the final infrastructure to measure the efficiency of the deployed models Deploy the models in additional scenarios Through reports By including Data Mining Extensions (DMX) queries in OLTP applications to support real-time early warnings Include data mining models as dimensions in OLAP cubes, if this was not done already during the POC project Create smart ETL applications that divert suspicious data for immediate or later inspection I would also offer to investigate how the outcome could be transferred automatically to the central system; for instance, if the POC project was performed in a subsidiary whereas a central system is available as well Of course, for the actual project, I would repeat the data and model preparation as needed It is virtually impossible to tell in advance how much time the deployment would take, before we decide together with customer what exactly the deployment process should cover. Without considering the deployment part, and with the POC project conducted as suggested above (including the transfer of knowledge), the actual project should still only take additional 5 to 10 days. The approximate timeline for the POC project is, as follows: 1-2 days of training 2-3 days for data preparation and data overview 2 days for creating and evaluating the models 1 day for initial preparation of the continuous learning infrastructure 1 day for presentation of the results and discussion of further actions Quite frequently I receive the following question: are we going to find the best possible model during the POC project, or during the actual project? My answer is always quite simple: I do not know. Maybe, if we would spend just one hour more for data preparation, or create just one more model, we could get better patterns and predictions. However, we simply must stop somewhere, and the best possible way to do this, according to my experience, is to restrict the time spent on the project in advance, after an agreement with the customer. You must also never forget that, because we build the complete learning infrastructure and transfer the knowledge, the customer will be capable of doing further investigations independently and improve the models and predictions over time without the need for a constant engagement with me.

Read the article
Profiling Startup Of VS2012 – SpeedTrace Profiler

- by Alois Kraus

SpeedTrace is a relatively unknown profiler made a company called Ipcas. A single professional license does cost 449€+VAT. For the test I did use SpeedTrace 4.5 which is currently Beta. Although it is cheaper than dotTrace it has by far the most options to influence how profiling does work. First you need to create a tracing project which does configure tracing for one process type. You can start the application directly from the profiler or (much more interesting) it does attach to a specific process when it is started. For this you need to check “Trace the specified …” radio button and enter the process name in the “Process Name of the Trace” edit box. You can even selectively enable tracing for processes with a specific command line. Then you need to activate the trace project by pressing the Activate Project button and you are ready to start VS as usual. If you want to profile the next 10 VS instances that you start you can set the Number of Processes counter to e.g. 10. This is immensely helpful if you are trying to profile only the next 5 started processes. As you can see there are many more tabs which do allow to influence tracing in a much more sophisticated way. SpeedTrace is the only profiler which does not rely entirely on the profiling Api of .NET. Instead it does modify the IL code (instrumentation on the fly) to write tracing information to disc which can later be analyzed. This approach is not only very fast but it does give you unprecedented analysis capabilities. Once the traces are collected they do show up in your workspace where you can open the trace viewer. I do skip the other windows because this view is by far the most useful one. You can sort the methods not only by Wall Clock time but also by CPU consumption and wait time which none of the other products support in their views at the same time. If you want to optimize for CPU consumption sort by CPU time. If you want to find out where most time is spent you need Clock Total time and Clock Waiting. There you can directly see if the method did take long because it did wait on something or it did really execute stuff that did take so long. Once you have found a method you want to drill deeper you can double click on a method to get to the Caller/Callee view which is similar to the JetBrains Method Grid view. But this time you do see much more. In the middle is the clicked method. Above are the methods that call you and below are the methods that you do directly call. Normally you would then start digging deeper to find the end of the chain where the slow method worth optimizing is located. But there is a shortcut. You can press the magic button to calculate the aggregation of all called methods. This is displayed in the lower left window where you can see each method call and how long it did take. There you can also sort to see if this call stack does only contain methods (e.g. WCF connect calls which you cannot make faster) not worth optimizing. YourKit has a similar feature where it is called Callees List. In the Functions tab you have in the context menu also many other useful analysis options One really outstanding feature is the View Call History Drilldown. When you select this one you get not a sum of all method invocations but a list with the duration of each method call. This is not surprising since SpeedTrace does use tracing to get its timings. There you can get many useful graphs how this method did behave over time. Did it become slower at some point in time or was only the first call slow? The diagrams and the list will tell you that. That is all fine but what should I do when one method call was slow? I want to see from where it was coming from. No problem select the method in the list hit F10 and you get the call stack. This is a life saver if you e.g. search for serialization problems. Today Serializers are used everywhere. You want to find out from where the 5s XmlSerializer.Deserialize call did come from? Hit F10 and you get the call stack which did invoke the 5s Deserialize call. The CPU timeline tab is also useful to find out where long pauses or excessive CPU consumption did happen. Click in the graph to get the Thread Stacks window where you can get a quick overview what all threads were doing at this time. This does look like the Stack Traces feature in YourKit. Only this time you get the last called method first which helps to quickly see what all threads were executing at this moment. YourKit does generate a rather long list which can be hard to go through when you have many threads. The thread list in the middle does not give you call stacks or anything like that but you see which methods were found most often executing code by the profiler which is a good indication for methods consuming most CPU time. This does sound too good to be true? I have not told you the best part yet. The best thing about this profiler is the staff behind it. When I do see a crash or some other odd behavior I send a mail to Ipcas and I do get usually the next day a mail that the problem has been fixed and a download link to the new version. The guys at Ipcas are even so helpful to log in to your machine via a Citrix Client to help you to get started profiling your actual application you want to profile. After a 2h telco I was converted from a hater to a believer of this tool. The fast response time might also have something to do with the fact that they are actively working on 4.5 to get out of the door. But still the support is by far the best I have encountered so far. The only downside is that you should instrument your assemblies including the .NET Framework to get most accurate numbers. You can profile without doing it but then you will see very high JIT times in your process which can severely affect the correctness of the measured timings. If you do not care about exact numbers you can also enable in the main UI in the Data Trace tab logging of method arguments of primitive types. If you need to know what files at which times were opened by your application you can find it out without a debugger. Since SpeedTrace does read huge trace files in its reader you should perhaps use a 64 bit machine to be able to analyze bigger traces as well. The memory consumption of the trace reader is too high for my taste. But they did promise for the next version to come up with something much improved.

Read the article
Windows Azure Virtual Machine Readiness and Capacity Assessment for SQL Server

- by SQLOS Team

Windows Azure Virtual Machine Readiness and Capacity Assessment for Windows Server Machine Running SQL Server With the release of MAP Toolkit 8.0 Beta, we have added a new scenario to assess your Windows Azure Virtual Machine Readiness. The MAP 8.0 Beta performs a comprehensive assessment of Windows Servers running SQL Server to determine you level of readiness to migrate an on-premise physical or virtual machine to Windows Azure Virtual Machines. The MAP Toolkit then offers suggested changes to prepare the machines for migration, such as upgrading the operating system or SQL Server. MAP Toolkit 8.0 Beta is available for download here Your participation and feedback is very important to make the MAP Toolkit work better for you. We encourage you to participate in the beta program and provide your feedback at [email protected] or through one of our surveys. Now, let’s walk through the MAP Toolkit task for completing the Windows Azure Virtual Machine assessment and capacity planning. The tasks include the following: Perform an inventory View the Windows Azure VM Readiness results and report Collect performance data for determine VM sizing View the Windows Azure Capacity results and report Perform an inventory: 1. To perform an inventory against a single machine or across a complete environment, choose Perform an Inventory to launch the Inventory and Assessment Wizard as shown below: 2. After the Inventory and Assessment Wizard launches, select either the Windows computers or SQL Server scenario to inventory Windows machines. HINT: If you don’t care about completely inventorying a machine, just select the SQL Server scenario. Click Next to Continue. 3. On the Discovery Methods page, select how you want to discover computers and then click Next to continue. Description of Discovery Methods: Use Active Directory Domain Services -- This method allows you to query a domain controller via the Lightweight Directory Access Protocol (LDAP) and select computers in all or specific domains, containers, or OUs. Use this method if all computers and devices are in AD DS. Windows networking protocols -- This method uses the WIN32 LAN Manager application programming interfaces to query the Computer Browser service for computers in workgroups and Windows NT 4.0–based domains. If the computers on the network are not joined to an Active Directory domain, use only the Windows networking protocols option to find computers. System Center Configuration Manager (SCCM) -- This method enables you to inventory computers managed by System Center Configuration Manager (SCCM). You need to provide credentials to the System Center Configuration Manager server in order to inventory the managed computers. When you select this option, the MAP Toolkit will query SCCM for a list of computers and then MAP will connect to these computers. Scan an IP address range -- This method allows you to specify the starting address and ending address of an IP address range. The wizard will then scan all IP addresses in the range and inventory only those computers. Note: This option can perform poorly, if many IP addresses aren’t being used within the range. Manually enter computer names and credentials -- Use this method if you want to inventory a small number of specific computers. Import computer names from a files -- Using this method, you can create a text file with a list of computer names that will be inventoried. 4. On the All Computers Credentials page, enter the accounts that have administrator rights to connect to the discovered machines. This does not need to a domain account, but needs to be a local administrator. I have entered my domain account that is an administrator on my local machine. Click Next after one or more accounts have been added. NOTE: The MAP Toolkit primarily uses Windows Management Instrumentation (WMI) to collect hardware, device, and software information from the remote computers. In order for the MAP Toolkit to successfully connect and inventory computers in your environment, you have to configure your machines to inventory through WMI and also allow your firewall to enable remote access through WMI. The MAP Toolkit also requires remote registry access for certain assessments. In addition to enabling WMI, you need accounts with administrative privileges to access desktops and servers in your environment. 5. On the Credentials Order page, select the order in which want the MAP Toolkit to connect to the machine and SQL Server. Generally just accept the defaults and click Next. 6. On the Enter Computers Manually page, click Create to pull up at dialog to enter one or more computer names. 7. On the Summary page confirm your settings and then click Finish. After clicking Finish the inventory process will start, as shown below: Windows Azure Readiness results and report After the inventory progress has completed, you can review the results under the Database scenario. On the tile, you will see the number of Windows Server machine with SQL Server that were analyzed, the number of machines that are ready to move without changes and the number of machines that require further changes. If you click this Azure VM Readiness tile, you will see additional details and can generate the Windows Azure VM Readiness Report. After the report is generated, select View | Saved Reports and Proposals to view the location of the report. Open up WindowsAzureVMReadiness* report in Excel. On the Windows tab, you can see the results of the assessment. This report has a column for the Operating System and SQL Server assessment and provides a recommendation on how to resolve, if there a component is not supported. Collect Performance Data Launch the Performance Wizard to collect performance information for the Windows Server machines that you would like the MAP Toolkit to suggest a Windows Azure VM size for. Windows Azure Capacity results and report After the performance metrics are collected, the Azure VM Capacity title will display the number of Virtual Machine sizes that are suggested for the Windows Server and Linux machines that were analyzed. You can then click on the Azure VM Capacity tile to see the capacity details and generate the Windows Azure VM Capacity Report. Within this report, you can view the performance data that was collected and the Virtual Machine sizes. MAP Toolkit 8.0 Beta is available for download here Your participation and feedback is very important to make the MAP Toolkit work better for you. We encourage you to participate in the beta program and provide your feedback at [email protected] or through one of our surveys. Useful References: Windows Azure Homepage How to guides for Windows Azure Virtual Machines Provisioning a SQL Server Virtual Machine on Windows Azure Windows Azure Pricing Peter Saddow Senior Program Manager – MAP Toolkit Team

Read the article
SSIS - Range lookups

- by Repieter

When developing an ETL solution in SSIS we sometimes need to do range lookups in SSIS. Several solutions for this can be found on the internet, but now we have built another solution which I would like to share, since it's pretty easy to implement and the performance is fast. You can download the sample package to see how it works. Make sure you have the AdventureWorks2008R2 and AdventureWorksDW2008R2 databases installed. (Apologies for the layout of this blog, I don't do this too often :)) To give a little bit more information about the example, this is basically what is does: we load a facttable and do an SCD type 2 lookup operation of the Product dimension. This is done with a script component. First we query the Data warehouse to create the lookup dataset. The query that is used for that is: SELECT [ProductKey] ,[ProductAlternateKey] ,[StartDate] ,ISNULL([EndDate], '9999-01-01') AS EndDate FROM [DimProduct] The output of this query is stored in a DataTable: string lookupQuery = @" SELECT [ProductKey] ,[ProductAlternateKey] ,[StartDate] ,ISNULL([EndDate], '9999-01-01') AS EndDate FROM [DimProduct]"; OleDbCommand oleDbCommand = new OleDbCommand(lookupQuery, _oleDbConnection); OleDbDataAdapter adapter = new OleDbDataAdapter(oleDbCommand); _dataTable = new DataTable(); adapter.Fill(_dataTable); Now that the dimension data is stored in the DataTable we use the following method to do the actual lookup: public int RangeLookup(string businessKey, DateTime lookupDate) { // set default return value (Unknown) int result = -1; DataRow[] filteredRows; filteredRows = _dataTable.Select(string.Format("ProductAlternateKey = '{0}'", businessKey)); for (int i = 0; i < filteredRows.Length; i++) { // check if the lookupdate is found between the startdate and enddate of any of the records if (lookupDate >= (DateTime)filteredRows[i][2] && lookupDate < (DateTime)filteredRows[i][3]) { result = (filteredRows[i][0] == null) ? -1 : (int)filteredRows[i][0]; break; } } filteredRows = null; return result; } This method is executed for every row that passes the script component. This is implemented in the ProcessInputRow method public override void Input0_ProcessInputRow(Input0Buffer Row) { // Perform the lookup operation on the current row and put the value in the Surrogate Key Attribute Row.ProductKey = RangeLookup(Row.ProductNumber, Row.OrderDate); } Now what actually happens?! 1. Every record passes the business key and the orderdate to the RangeLookup method. 2. The DataTable is then filtered on the business key of the current record. The output is stored in a DataRow [] object. 3. We loop over the DataRow[] object to see where the orderdate meets the following expression: (lookupDate >= (DateTime)filteredRows[i][2] && lookupDate < (DateTime)filteredRows[i][3]) 4. When the expression returns true (so where the data is between the Startdate and the EndDate), the surrogate key of the dimension record is returned We have done some testing with this solution and it works great for us. Hope others can use this example to do their range lookups.

Read the article
Using MS Standalone profiler in VS2008 Professional

- by fishdump

I am trying to profile my .NET dll while running it from VS unit testing tools but I am having problems. I am using the standalone command-line profiler as VS2008 Professional does not come with an inbuilt profiler. I have an open CMD window and have run the following commands (I instrumented it earlier which is why vsinstr gave the warning that it did): C:\...\BusinessRules\obj\Debug>vsperfclrenv /samplegclife /tracegclife /globalsamplegclife /globaltracegclife Enabling VSPerf Sampling Attach Profiling. Allows to 'attaching' to managed applications. Current Profiling Environment variables are: COR_ENABLE_PROFILING=1 COR_PROFILER={0a56a683-003a-41a1-a0ac-0f94c4913c48} COR_LINE_PROFILING=1 COR_GC_PROFILING=2 C:\...\BusinessRules\obj\Debug>vsinstr BusinessRules.dll Microsoft (R) VSInstr Post-Link Instrumentation 9.0.30729 x86 Copyright (C) Microsoft Corp. All rights reserved. Error VSP1018 : VSInstr does not support processing binaries that are already instrumented. C:\...\BusinessRules\obj\Debug>vsperfcmd /start:trace /output:foo.vsp Microsoft (R) VSPerf Command Version 9.0.30729 x86 Copyright (C) Microsoft Corp. All rights reserved. C:\...\BusinessRules\obj\Debug> I then ran the unit tests that exercised the instrumented code. When the unit tests were complete, I did... C:\...\BusinessRules\obj\Debug>vsperfcmd /shutdown Microsoft (R) VSPerf Command Version 9.0.30729 x86 Copyright (C) Microsoft Corp. All rights reserved. Waiting for process 4836 ( C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\vstesthost.exe) to shutdown... It was clearly waiting for VS2008 to close so I closed it... Shutting down the Profile Monitor ------------------------------------------------------------ C:\...\BusinessRules\obj\Debug> All looking good, there was a 3.2mb foo.vsp file in the directory. I next did... C:\...\BusinessRules\obj\Debug>vsperfreport foo.vsp /summary:all Microsoft (R) VSPerf Report Generator, Version 9.0.0.0 Copyright (C) Microsoft Corporation. All rights reserved. VSP2340: Environment variables were not properly set during profiling run and managed symbols may not resolve. Please use vsperfclrenv before profiling. File opened Successfully opened the file. A report file, foo_Header.csv, has been generated. A report file, foo_MarksSummary.csv, has been generated. A report file, foo_ProcessSummary.csv, has been generated. A report file, foo_ThreadSummary.csv, has been generated. Analysis completed A report file, foo_FunctionSummary.csv, has been generated. A report file, foo_CallerCalleeSummary.csv, has been generated. A report file, foo_CallTreeSummary.csv, has been generated. A report file, foo_ModuleSummary.csv, has been generated. C:\...\BusinessRules\obj\Debug> Notice the warning about environment variables and using vsperfclrenv? But I had run it! Maybe I used the wrong switches? I don't know. Anyway, loading the csv files into Excel or using the perfconsole tool gives loads of useful info with useless symbol names: *** Loading commands from: C:\temp\PerfConsole\bin\commands\timebytype.dll *** Adding command: timebytype *** Loading commands from: C:\temp\PerfConsole\bin\commands\partition.dll *** Adding command: partition Welcome to PerfConsole 1.0 (for bugs please email: [email protected]), for help type: ?, for a quickstart type: ?? > load foo.vsp *** Couldn't match to either expected sampled or instrumented profile schema, defaulting to sampled *** Couldn't match to either expected sampled or instrumented profile schema, defaulting to sampled *** Profile loaded from 'foo.vsp' into @foo > > functions @foo >>>>> Function Name Exclusive Inclusive Function Name Module Name -------------------- -------------------- -------------- --------------- 900,798,600,000.00 % 900,798,600,000.00 % 0x0600003F 20397910 14,968,500,000.00 % 44,691,540,000.00 % 0x06000040 14736385 8,101,253,000.00 % 14,836,330,000.00 % 0x06000041 5491345 3,216,315,000.00 % 6,876,929,000.00 % 0x06000042 3924533 <snip> 71,449,430.00 % 71,449,430.00 % 0x0A000074 42572 52,914,200.00 % 52,914,200.00 % 0x0A000073 0 14,791.00 % 13,006,010.00 % 0x0A00007B 0 199,177.00 % 6,082,932.00 % 0x2B000001 5350072 2,420,116.00 % 2,420,116.00 % 0x0A00008A 0 836.00 % 451,888.00 % 0x0A000045 0 9,616.00 % 399,436.00 % 0x0A000039 0 18,202.00 % 298,223.00 % 0x06000046 1479900 I am so close to being able to find the bottlenecks, if only it will give me the function and module names instead of hex numbers! What am I doing wrong? --- Alistair.

Read the article
How do you automap List<float> or float[] with Fluent NHibernate?

- by Tom Bushell

Having successfully gotten a sample program working, I'm now starting to do Real Work with Fluent NHibernate - trying to use Automapping on my project's class heirarchy. It's a scientific instrumentation application, and the classes I'm mapping have several properties that are arrays of floats e.g. private float[] _rawY; public virtual float[] RawY { get { return _rawY; } set { _rawY = value; } } These arrays can contain a maximum of 500 values. I didn't expect Automapping to work on arrays, but tried it anyway, with some success at first. Each array was auto mapped to a BLOB (using SQLite), which seemed like a viable solution. The first problem came when I tried to call SaveOrUpdate on the objects containing the arrays - I got "No persister for float[]" exceptions. So my next thought was to convert all my arrays into ILists e.g. public virtual IList<float> RawY { get; set; } But now I get: NHibernate.MappingException: Association references unmapped class: System.Single Since Automapping can deal with lists of complex objects, it never occured to me it would not be able to map lists of basic types. But after doing some Googling for a solution, this seems to be the case. Some people seem to have solved the problem, but the sample code I saw requires more knowledge of NHibernate than I have right now - I didn't understand it. Questions: 1. How can I make this work with Automapping? 2. Also, is it better to use arrays or lists for this application? I can modify my app to use either if necessary (though I prefer lists). Edit: I've studied the code in Mapping Collection of Strings, and I see there is test code in the source that sets up an IList of strings, e.g. public virtual IList<string> ListOfSimpleChildren { get; set; } [Test] public void CanSetAsElement() { new MappingTester<OneToManyTarget>() .ForMapping(m => m.HasMany(x => x.ListOfSimpleChildren).Element("columnName")) .Element("class/bag/element").Exists(); } so this must be possible using pure Automapping, but I've had zero luck getting anything to work, probably because I don't have the requisite knowlege of manually mapping with NHibernate. Starting to think I'm going to have to hack this (by encoding the array of floats as a single string, or creating a class that contains a single float which I then aggregate into my lists), unless someone can tell me how to do it properly. End Edit Here's my CreateSessionFactory method, if that helps formulate a reply... private static ISessionFactory CreateSessionFactory() { ISessionFactory sessionFactory = null; const string autoMapExportDir = "AutoMapExport"; if( !Directory.Exists(autoMapExportDir) ) Directory.CreateDirectory(autoMapExportDir); try { var autoPersistenceModel = AutoMap.AssemblyOf<DlsAppOverlordExportRunData>() .Where(t => t.Namespace == "DlsAppAutomapped") .Conventions.Add( DefaultCascade.All() ) ; sessionFactory = Fluently.Configure() .Database(SQLiteConfiguration.Standard .UsingFile(DbFile) .ShowSql() ) .Mappings(m => m.AutoMappings.Add(autoPersistenceModel) .ExportTo(autoMapExportDir) ) .ExposeConfiguration(BuildSchema) .BuildSessionFactory() ; } catch (Exception e) { Debug.WriteLine(e); } return sessionFactory; }

Read the article
Hardware/Software inventory open source projects

- by Dick dastardly

Dear Stackoverflowers I would like to develop a Network Inventory application that works on any operating system. Reports on every possible resource attacehd to a network. Reports all pertinent details of hardware and software. Thats (and i hate to use the phrase) my "End Game". However I am running before i can crawl here. I have no experience of this type of development, e.g. discovering a computers hardware and software settings. I've spent almost two weeks googling and come up short! :-(. So I am turning to you to ask these questions:- My first step is to find an existing open source project i can incorporate into my own code that extracts the fine grained details i am after, e.g. EVERYTHING there is to know about the hardaware and software on a single machine. Does this project exist? or do i have to develop that first? Have i got to write all this in C? I am guessing getting this information about a computer is going to be easier than for printers, scanners, routers etc... e.g. everything else you would find attached to a network. Once i have access to a single computers details i then need to investigate how i can traverse an entire newtork of printers, scanners, routers, load balancers, switches, firewalls, workstations, servers, storeage devices, laptops, monitors, the list goes on and on One problem i have is i dont have a 1000 machine newtork to play on! Is there any such resource available on theinternet? (is that a silly question?) Anywho, if you dont ask you wont find out! One aspect iam really looking forward to finding out how to travers the entire network, should i be using TCP/IP for this? Whats a good site, blog, usergorup, book for TCP/IP development? How do i go about getting through firewalls? How many questions can i ask in one go? :-) My previous question on this topic ended up with PYTHON being championed as the language/script to go with to develop this application in. Having looked at a few PYTHON examples they all seemed to be related to WINDOWS networks and interrogating Windows Management Instrumentation (WMI). I had the feeling you cant rely on whats in WMI, and even if you can that s no good for UNIX netwrks. Surely there exist common code for extracting hardware and software details from a computer? Why cant i find it on the internet? Pease help? Theres no prizes though :-( Thanks in advance I would like to appologise if i have broken forum rules or not tried hard enough on my own before asking for assistance. I just would like to start moving forward with this as its one of the best projects i have been involved with. I am inspired by the many differnt number of challenges involved and that if i manage to produce a useful application at the end of it it would hopefully be extremely helpful to many people. That sit Thanks in advance DD

Read the article
Android passing an arraylist back to parent activity

- by Nicklas O

Hi there. I've been searching for a simple example of this with no luck. In my android application I have two activities: 1. The main activity which is launched at startup 2. A second activity which is launched by pressing a button on the main activty. When the second activity is finished (by pressing a button) I want it to send back an ArrayList of type MyObject to the main activity and close itself, which the main activity can then do whatever with it. How would I go about achieving this? I have been trying a few things but it is crashing my application when I start the second activity. When the user presses button to launch second activity: Intent i = new Intent(MainActivity.this, secondactivity.class); startActivityForResult(i, 1); The array which is bundled back after pressing a button on the second activity: Intent intent= getIntent(); Bundle b = new Bundle(); b.putParcelableArrayList("myarraylist", mylist); intent.putExtras(b); setResult(RESULT_OK, intent); finish(); And finally a listener on the main activity (although I'm not sure of 100% when this code launches...) protected void onActivityResult(int requestCode, int resultCode, Intent data) { super.onActivityResult(requestCode, resultCode, data); if(resultCode==RESULT_OK && requestCode==1){ Bundle extras = data.getExtras(); final ArrayList<MyObject> mylist = extras.getParcelableArrayList("myarraylist"); Toast.makeText(MainActivity.this, mylist.get(0).getName(), Toast.LENGTH_SHORT).show(); } } Any ideas where I am going wrong? The onActivityResult() seems to be crashing my application. EDIT: This is my class MyObject, its called plan and has a name and an id import android.os.Parcel; import android.os.Parcelable; public class Plan implements Parcelable{ private String name; private String id; public Plan(){ } public Plan(String name, String id){ this.name = name; this.id = id; } public String getName(){ return name; } public void setName(String name){ this.name = name; } public String getId(){ return id; } public void setId(String id){ this.id = id; } public String toString(){ return "Plan ID: " + id + " Plan Name: " + name; } @Override public int describeContents() { // TODO Auto-generated method stub return 0; } @Override public void writeToParcel(Parcel dest, int flags) { dest.writeString(id); dest.writeString(name); } public static final Parcelable.Creator<Plan> CREATOR = new Parcelable.Creator<Plan>() { public Plan createFromParcel(Parcel in) { return new Plan(); } @Override public Plan[] newArray(int size) { // TODO Auto-generated method stub return new Plan[size]; } }; } This is my logcat E/AndroidRuntime( 293): java.lang.RuntimeException: Unable to instantiate activ ity ComponentInfo{com.daniel.android.groupproject/com.me.android.projec t.secondactivity}: java.lang.NullPointerException E/AndroidRuntime( 293): at android.app.ActivityThread.performLaunchActiv ity(ActivityThread.java:2417) E/AndroidRuntime( 293): at android.app.ActivityThread.handleLaunchActivi ty(ActivityThread.java:2512) E/AndroidRuntime( 293): at android.app.ActivityThread.access$2200(Activi tyThread.java:119) E/AndroidRuntime( 293): at android.app.ActivityThread$H.handleMessage(Ac tivityThread.java:1863) E/AndroidRuntime( 293): at android.os.Handler.dispatchMessage(Handler.ja va:99) E/AndroidRuntime( 293): at android.os.Looper.loop(Looper.java:123) E/AndroidRuntime( 293): at android.app.ActivityThread.main(ActivityThrea d.java:4363) E/AndroidRuntime( 293): at java.lang.reflect.Method.invokeNative(Native Method) E/AndroidRuntime( 293): at java.lang.reflect.Method.invoke(Method.java:5 21) E/AndroidRuntime( 293): at com.android.internal.os.ZygoteInit$MethodAndA rgsCaller.run(ZygoteInit.java:860) E/AndroidRuntime( 293): at com.android.internal.os.ZygoteInit.main(Zygot eInit.java:618) E/AndroidRuntime( 293): at dalvik.system.NativeStart.main(Native Method) E/AndroidRuntime( 293): Caused by: java.lang.NullPointerException E/AndroidRuntime( 293): at com.daniel.android.groupproject.login.<init>( login.java:51) E/AndroidRuntime( 293): at java.lang.Class.newInstanceImpl(Native Method ) E/AndroidRuntime( 293): at java.lang.Class.newInstance(Class.java:1479) E/AndroidRuntime( 293): at android.app.Instrumentation.newActivity(Instr umentation.java:1021) E/AndroidRuntime( 293): at android.app.ActivityThread.performLaunchActiv ity(ActivityThread.java:2409) E/AndroidRuntime( 293): ... 11 more

Read the article
Windows Mobile 6.5 GPS Device - WaitForMultipleObjects returns 258 (timeout)

- by wizmagister

I’ve created a GPS program that track positions in realtime in the background for Windows mobile 6.1 in 2008-2009. It ran fine on these devices for many years. For some reason, the same code never worked perfectly on Windows Mobile 6.5. After many hour of operations (mostly when nobody use the device), I receive a “Timeout” (code 258) from the function "WaitForMultipleObjects": this.GPSEvent_WaitValue = WaitForMultipleObjects(2, this.GPSEvent_Handles, 0, 45000); Again, this can work for hours and suddenly, it's just impossible to get another position without : UPDATE: - Restarting the device (GoogleMap confirms that there's no GPS device present!) It has something to do with Windows Mobile going to sleep and slowing up my thread. Here's the core code (adapted from Microsoft SDK Sample): /// <summary> /// When "WindowsMobile" wake up the program to check for a new position /// </summary> private void OnNextGPSEvent_Callback() { int SecondsToNextWakeUp = ETL.Mobile.Device.ScheduledCallback.MINIMUM_SECONDTONEXTWAKEUP; switch (this.SleepingState) { case SleepingStateType.SleepingForNextPosition: // Get position this.GPSEvent_WaitValue = (WaitForEventThreadResultType)WaitForMultipleObjects(2, this.GPSEvent_Handles, 0, 45000); switch (this.GPSEvent_WaitValue) { case WaitForEventThreadResultType.Event_LocationChanged: // Got a new position this.FireLocationChanged(this.GetCurrentPosition()); // Manage device shutdown (save battery) if (this.PositionFrequency > MIN_SECONDS_FREQUENCY_FORDEVICE_SHUTDOWN) { // Close device this.CloseDevice(); SecondsToNextWakeUp = (this.PositionFrequency - GPSDEVICE_LOAD_SECONDS_LOAD_TIME); this.SleepingState = SleepingStateType.SleepingBeforeDeviceWakeUp; } else { // Default Wait Time this.SleepingState = SleepingStateType.SleepingForNextPosition; } break; case WaitForEventThreadResultType.Event_StateChanged: break; case WaitForEventThreadResultType.Timeout: case WaitForEventThreadResultType.Failed: case WaitForEventThreadResultType.Stop: // >>>>>>>>>>>>>> This is where the error happens <<<<<<<<<<<<<<<<<<<<<<<<<<< // >>>>>>>>>>>>>> This is where the error happens <<<<<<<<<<<<<<<<<<<<<<<<<<< // >>>>>>>>>>>>>> This is where the error happens <<<<<<<<<<<<<<<<<<<<<<<<<<< // Too many errors this.ConsecutiveErrorReadingDevice++; if (this.ConsecutiveErrorReadingDevice > MAX_ERRORREADINGDEVICE) { this.CloseDevice(); SecondsToNextWakeUp = (this.PositionFrequency - GPSDEVICE_LOAD_SECONDS_LOAD_TIME); this.SleepingState = SleepingStateType.SleepingBeforeDeviceWakeUp; } else { // Default Wait Time this.SleepingState = SleepingStateType.SleepingForNextPosition; } break; } #endregion break; case SleepingStateType.SleepingBeforeDeviceWakeUp: this.OpenDevice(); SecondsToNextWakeUp = GPSDEVICE_LOAD_SECONDS_LOAD_TIME; this.SleepingState = SleepingStateType.SleepingForNextPosition; break; } if (this.IsListeningGPSEvent) { // Ajustement du prochain rappel this.NextGPSEvent_Callback.SecondToNextWakeUp = SecondsToNextWakeUp; this.NextGPSEvent_Callback.RequestWakeUpCallback(); } } /// <summary> ///Create Thread /// </summary> private void StartListeningThreadForGPSEvent() { // We only want to create the thread if we don't have one created already and we have opened the gps device if (this._GPSEventThread == null) { // Create and start thread to listen for GPS events this._GPSEventThread = new System.Threading.Thread(new System.Threading.ThreadStart(this.ListeningThreadForGPSEvent)); this._GPSEventThread.Start(); } } private void ListeningThreadForGPSEvent() { this.GPSEvent_WaitValue = WaitForEventThreadResultType.Stop; this.IsListeningGPSEvent = true; // Allocate handles worth of memory to pass to WaitForMultipleObjects this.GPSEvent_Handles = Helpers.LocalAlloc(12); Marshal.WriteInt32(this.GPSEvent_Handles, 0, this._StopHandle.ToInt32()); Marshal.WriteInt32(this.GPSEvent_Handles, 4, this._NewLocationHandle.ToInt32()); Marshal.WriteInt32(this.GPSEvent_Handles, 8, this._GPSDeviceStateChanged.ToInt32()); this.Start_NextGPSEvent_Timer(this.PositionFrequency); this.SleepingState = SleepingStateType.SleepingBeforeDeviceWakeUp; this.OnNextGPSEvent_Callback(); }

Read the article
Data Profiling without SSIS

Strangely enough for a predominantly SSIS blog, this post is all about how to perform data profiling without using SSIS. Whilst the Data Profiling Task is a worthy addition, there are a couple of limitations I’ve encountered of late. The first is that it requires SQL Server 2008, and not everyone is there yet. The second is that it can only target SQL Server 2005 and above. What about older systems, which are the ones that we probably need to investigate the most, or other vendor databases such as Oracle? With these limitations in mind I did some searching to find a quick and easy alternative to help me perform some data profiling for a project I was working on recently. I only had SQL Server 2005 available, and anyway most of my target source systems were Oracle, and of course I had short timescales. I looked at several options. Some never got beyond the download stage, they failed to install or just did not run, and others provided less than I could have produced myself by spending 2 minutes writing some basic SQL queries. In the end I settled on an open source product called DataCleaner. To quote from their website: DataCleaner is an Open Source application for profiling, validating and comparing data. These activities help you administer and monitor your data quality in order to ensure that your data is useful and applicable to your business situation. DataCleaner is the free alternative to software for master data management (MDM) methodologies, data warehousing (DW) projects, statistical research, preparation for extract-transform-load (ETL) activities and more. DataCleaner is developed in Java and licensed under LGPL. As quoted above it claims to support profiling, validating and comparing data, but I didn’t really get past the profiling functions, so won’t comment on the other two. The profiling whilst not prefect certainly saved some time compared to the limited alternatives. The ability to profile heterogeneous data sources is a big advantage over the SSIS option, and I found it overall quite easy to use and performance was good. I could see it struggling at times, but actually for what it does I was impressed. It had some data type niggles with Oracle, and some metrics seem a little strange, although thankfully they were easy to augment with some SQL queries to ensure a consistent picture. The report export options didn’t do it for me, but copy and paste with a bit of Excel magic was sufficient. One initial point for me personally is that I have had limited exposure to things of the Java persuasion and whilst I normally get by fine, sometimes the simplest things can throw me. For example installing a JDBC driver, why do I have to copy files to make it all work, has nobody ever heard of an MSI? In case there are other people out there like me who have become totally indoctrinated with the Microsoft software paradigm, I’ve written a quick start guide that details every step required. Steps 1- 5 are the key ones, the rest is really an excuse for some screenshots to show you the tool. Quick Start Guide Step 1 - Download Data Cleaner. The Microsoft Windows zipped exe option, and I chose the latest stable build, currently DataCleaner 1.5.3 (final). Extract the files to a suitable location. Step 2 - Download Java. If you try and run datacleaner.exe without Java it will warn you, and then open your default browser and take you to the Java download site. Follow the installation instructions from there, normally just click Download Java a couple of times and you’re done. Step 3 - Download Microsoft SQL Server JDBC Driver. You may have SQL Server installed, but you won’t have a JDBC driver. Version 3.0 is the latest as of April 2010. There is no real installer, we are in the Java world here, but run the exe you downloaded to extract the files. The default Unzip to folder is not much help, so try a fully qualified path such as C:\Program Files\Microsoft SQL Server JDBC Driver 3.0\ to ensure you can find the files afterwards. Step 4 - If you wish to use Windows Authentication to connect to your SQL Server then first we need to copy a file so that Data Cleaner can find it. Browse to the JDBC extract location from Step 3 and drill down to the file sqljdbc_auth.dll. You will have to choose the correct directory for your processor architecture. e.g. C:\Program Files\Microsoft SQL Server JDBC Driver 3.0\sqljdbc_3.0\enu\auth\x86\sqljdbc_auth.dll. Now copy this file to the Data Cleaner extract folder you chose in Step 1. An alternative method is to edit datacleaner.cmd in the data cleaner extract folder as detailed in this data cleaner wiki topic, but I find copying the file simpler. Step 5 – Now lets run Data Cleaner, just run datacleaner.exe from the extract folder you chose in Step 1. Step 6 – Complete or skip the registration screen, and ignore the task window for now. In the main window click settings. Step 7 – In the Settings dialog, select the Database drivers tab, then click Register database driver and select the Local JAR file option. Step 8 – Browse to the JDBC driver extract location from Step 3 and drill down to select sqljdbc4.jar. e.g. C:\Program Files\Microsoft SQL Server JDBC Driver 3.0\sqljdbc_3.0\enu\sqljdbc4.jar Step 9 – Select the Database driver class as com.microsoft.sqlserver.jdbc.SQLServerDriver, and then click the Test and Save database driver button. Step 10 - You should be back at the Settings dialog with a the list of drivers that includes SQL Server. Just click Save Settings to persist all your hard work. Step 11 – Now we can start to profile some data. In the main Data Cleaner window click New Task, and then Profile from the task window. Step 12 – In the Profile window click Open Database Step 13 – Now choose the SQL Server connection string option. Selecting a connection string gives us a template like jdbc:sqlserver://<hostname>:1433;databaseName=<database>, but obviously it requires some details to be entered for example jdbc:sqlserver://localhost:1433;databaseName=SQLBits. This will connect to the database called SQLBits on my local machine. The port may also have to be changed if using such as when you have a multiple instances of SQL Server running. If using SQL Server Authentication enter a username and password as required and then click Connect to database. You can use Window Authentication, just add integratedSecurity=true to the end of your connection string. e.g jdbc:sqlserver://localhost:1433;databaseName=SQLBits;integratedSecurity=true. If you didn’t complete Step 4 above you will need to do so now and restart Data Cleaner before it will work. Manually setting the connection string is fine, but creating a named connection makes more sense if you will be spending any length of time profiling a specific database. As highlighted in the left-hand screen-shot, at the bottom of the dialog it includes partial instructions on how to create named connections. In the folder shown C:\Users\<Username>\.datacleaner\1.5.3, open the datacleaner-config.xml file in your editor of choice add your own details. You’ll see a sample connection in the file already, just add yours following the same pattern. e.g.  <bean class="dk.eobjects.datacleaner.gui.model.NamedConnection"> <property name="name" value="SQLBits Local Connection" /> <property name="driverClass" value="com.microsoft.sqlserver.jdbc.SQLServerDriver" /> <property name="connectionString" value="jdbc:sqlserver://localhost:1433;databaseName=SQLBits;integratedSecurity=true" /> <property name="tableTypes"> <list> <value>TABLE</value> <value>VIEW</value> </list> </property> </bean> Step 14 – Once back at the Profile window, you should now see your schemas, tables and/or views listed down the left hand side. Browse this tree and double-click a table to select it for profiling. You can then click Add profile, and choose some profiling options, before finally clicking Run profiling. You can see below a sample output for three of the most common profiles, click the image for full size. I hope this has given you a taster for DataCleaner, and should help you get up and running pretty quickly.

Read the article
ASP.NET Web Forms Extensibility: Providers

- by Ricardo Peres

Introduction This will be the first of a number of posts on ASP.NET extensibility. At this moment I don’t know exactly how many will be and I only know a couple of subjects that I want to talk about, so more will come in the next days. I have the sensation that the providers offered by ASP.NET are not widely know, although everyone uses, for example, sessions, they may not be aware of the extensibility points that Microsoft included. This post won’t go into details of how to configure and extend each of the providers, but will hopefully give some pointers on that direction. Canonical These are the most widely known and used providers, coming from ASP.NET 1, chances are, you have used them already. Good support for invoking client side, either from a .NET application or from JavaScript. Lots of server-side controls use them, such as the Login control for example. Membership The Membership provider is responsible for managing registered users, including creating new ones, authenticating them, changing passwords, etc. ASP.NET comes with two implementations, one that uses a SQL Server database and another that uses the Active Directory. The base class is Membership and new providers are registered on the membership section on the Web.config file, as well as parameters for specifying minimum password lengths, complexities, maximum age, etc. One reason for creating a custom provider would be, for example, storing membership information in a different database engine. 1: <membership defaultProvider="MyProvider"> 2: <providers> 3: <add name="MyProvider" type="MyClass, MyAssembly"/> 4: </providers> 5: </membership> Role The Role provider assigns roles to authenticated users. The base class is Role and there are three out of the box implementations: XML-based, SQL Server and Windows-based. Also registered on Web.config through the roleManager section, where you can also say if your roles should be cached on a cookie. If you want your roles to come from a different place, implement a custom provider. 1: <roleManager defaultProvider="MyProvider"> 2: <providers> 3: <add name="MyProvider" type="MyClass, MyAssembly" /> 4: </providers> 5: </roleManager> Profile The Profile provider allows defining a set of properties that will be tied and made available to authenticated or even anonymous ones, which must be tracked by using anonymous authentication. The base class is Profile and the only included implementation stores these settings in a SQL Server database. Configured through profile section, where you also specify the properties to make available, a custom provider would allow storing these properties in different locations. 1: <profile defaultProvider="MyProvider"> 2: <providers> 3: <add name="MyProvider" type="MyClass, MyAssembly"/> 4: </providers> 5: </profile> Basic OK, I didn’t know what to call these, so Basic is probably as good as a name as anything else. Not supported client-side (doesn’t even make sense). Session The Session provider allows storing data tied to the current “session”, which is normally created when a user first accesses the site, even when it is not yet authenticated, and remains all the way. The base class and only included implementation is SessionStateStoreProviderBase and it is capable of storing data in one of three locations: In the process memory (default, not suitable for web farms or increased reliability); A SQL Server database (best for reliability and clustering); The ASP.NET State Service, which is a Windows Service that is installed with the .NET Framework (ok for clustering). The configuration is made through the sessionState section. By adding a custom Session provider, you can store the data in different locations – think for example of a distributed cache. 1: <sessionState customProvider=”MyProvider”> 2: <providers> 3: <add name=”MyProvider” type=”MyClass, MyAssembly” /> 4: </providers> 5: </sessionState> Resource A not so known provider, allows you to change the origin of localized resource elements. By default, these come from RESX files and are used whenever you use the Resources expression builder or the GetGlobalResourceObject and GetLocalResourceObject methods, but if you implement a custom provider, you can have these elements come from some place else, such as a database. The base class is ResourceProviderFactory and there’s only one internal implementation which uses these RESX files. Configuration is through the globalization section. 1: <globalization resourceProviderFactoryType="MyClass, MyAssembly" /> Health Monitoring Health Monitoring is also probably not so well known, and actually not a good name for it. First, in order to understand what it does, you have to know that ASP.NET fires “events” at specific times and when specific things happen, such as when logging in, an exception is raised. These are not user interface events and you can create your own and fire them, nothing will happen, but the Health Monitoring provider will detect it. You can configure it to do things when certain conditions are met, such as a number of events being fired in a certain amount of time. You define these rules and route them to a specific provider, which must inherit from WebEventProvider. Out of the box implementations include sending mails, logging to a SQL Server database, writing to the Windows Event Log, Windows Management Instrumentation, the IIS 7 Trace infrastructure or the debugger Trace. Its configuration is achieved by the healthMonitoring section and a reason for implementing a custom provider would be, for example, locking down a web application in the event of a significant number of failed login attempts occurring in a small period of time. 1: <healthMonitoring> 2: <providers> 3: <add name="MyProvider" type="MyClass, MyAssembly"/> 4: </providers> 5: </healthMonitoring> Sitemap The Sitemap provider allows defining the site’s navigation structure and associated required permissions for each node, in a tree-like fashion. Usually this is statically defined, and the included provider allows it, by supplying this structure in a Web.sitemap XML file. The base class is SiteMapProvider and you can extend it in order to supply you own source for the site’s structure, which may even be dynamic. Its configuration must be done through the siteMap section. 1: <siteMap defaultProvider="MyProvider"> 2: <providers><add name="MyProvider" type="MyClass, MyAssembly" /> 3: </providers> 4: </siteMap> Web Part Personalization Web Parts are better known by SharePoint users, but since ASP.NET 2.0 they are included in the core Framework. Web Parts are server-side controls that offer certain possibilities of configuration by clients visiting the page where they are located. The infrastructure handles this configuration per user or globally for all users and this provider is responsible for just that. The base class is PersonalizationProvider and the only included implementation stores settings on SQL Server. Add new providers through the personalization section. 1: <webParts> 2: <personalization defaultProvider="MyProvider"> 3: <providers> 4: <add name="MyProvider" type="MyClass, MyAssembly"/> 5: </providers> 6: </personalization> 7: </webParts> Build The Build provider is responsible for compiling whatever files are present on your web folder. There’s a base class, BuildProvider, and, as can be expected, internal implementations for building pages (ASPX), master pages (Master), user web controls (ASCX), handlers (ASHX), themes (Skin), XML Schemas (XSD), web services (ASMX, SVC), resources (RESX), browser capabilities files (Browser) and so on. You would write a build provider if you wanted to generate code from any kind of non-code file so that you have strong typing at development time. Configuration goes on the buildProviders section and it is per extension. 1: <buildProviders> 2: <add extension=".ext" type="MyClass, MyAssembly” /> 3: </buildProviders> New in ASP.NET 4 Not exactly new since they exist since 2010, but in ASP.NET terms, still new. Output Cache The Output Cache for ASPX pages and ASCX user controls is now extensible, through the Output Cache provider, which means you can implement a custom mechanism for storing and retrieving cached data, for example, in a distributed fashion. The base class is OutputCacheProvider and the only implementation is private. Configuration goes on the outputCache section and on each page and web user control you can choose the provider you want to use. 1: <caching> 2: <outputCache defaultProvider="MyProvider"> 3: <providers> 4: <add name="MyProvider" type="MyClass, MyAssembly"/> 5: </providers> 6: </outputCache> 7: </caching> Request Validation A big change introduced in ASP.NET 4 (and refined in 4.5, by the way) is the introduction of extensible request validation, by means of a Request Validation provider. This means we are not limited to either enabling or disabling event validation for all pages or for a specific page, but we now have fine control over each of the elements of the request, including cookies, headers, query string and form values. The base provider class is RequestValidator and the configuration goes on the httpRuntime section. 1: <httpRuntime requestValidationType="MyClass, MyAssembly" /> Browser Capabilities The Browser Capabilities provider is new in ASP.NET 4, although the concept exists from ASP.NET 2. The idea is to map a browser brand and version to its supported capabilities, such as JavaScript version, Flash support, ActiveX support, and so on. Previously, this was all hardcoded in .Browser files located in %WINDIR%\Microsoft.NET\Framework(64)\vXXXXX\Config\Browsers, but now you can have a class inherit from HttpCapabilitiesProvider and implement your own mechanism. Register in on the browserCaps section. 1: <browserCaps provider="MyClass, MyAssembly" /> Encoder The Encoder provider is responsible for encoding every string that is sent to the browser on a page or header. This includes for example converting special characters for their standard codes and is implemented by the base class HttpEncoder. Another implementation takes care of Anti Cross Site Scripting (XSS) attacks. Build your own by inheriting from one of these classes if you want to add some additional processing to these strings. The configuration will go on the httpRuntime section. 1: <httpRuntime encoderType="MyClass, MyAssembly" /> Conclusion That’s about it for ASP.NET providers. It was by no means a thorough description, but I hope I managed to raise your interest on this subject. There are lots of pointers on the Internet, so I only included direct references to the Framework classes and configuration sections. Stay tuned for more extensibility!

Read the article
How to Use USER_DEFINED Activity in OWB Process Flow

- by Jinggen He

Process Flow is a very important component of Oracle Warehouse Builder. With Process Flow, we can create and control the ETL process by setting all kinds of activities in a well-constructed flow. In Oracle Warehouse Builder 11gR2, there are 28 kinds of activities, which fall into three categories: Control activities, OWB specific activities and Utility activities. For more information about Process Flow activities, please refer to OWB online doc. Most of those activities are pre-defined for some specific use. For example, the Mapping activity allows execution an OWB mapping in Process Flow and the FTP activity allows an interaction between the local host and a remote FTP server. Besides those activities for specific purposes, the User Defined activity enables you to incorporate into a Process Flow an activity that is not defined within Warehouse Builder. So the User Defined activity brings flexibility and extensibility to Process Flow. In this article, we will take an amazing tour of using the User Defined activity. Let's start. Enable execution of User Defined activity Let's start this section from creating a very simple Process Flow, which contains a Start activity, a User Defined activity and an End Success activity. Leave all parameters of activity USER_DEFINED unchanged except that we enter /tmp/test.sh into the Value column of the COMMAND parameter. Then let's create the shell script test.sh in /tmp directory. Here is the content of /tmp/test.sh (this article is demonstrating a scenario in Linux system, and /tmp/test.sh is a Bash shell script): echo Hello World! > /tmp/test.txt Note: don't forget to grant the execution privilege on /tmp/test.sh to OS Oracle user. For simplicity, we just use the following command. chmod +x /tmp/test.sh OK, it's so simple that we’ve almost done it. Now deploy the Process Flow and run it. For a newly installed OWB, we will come across an error saying "RPE-02248: For security reasons, activity operator Shell has been disabled by the DBA". See below. That's because, by default, the User Defined activity is DISABLED. Configuration about this can be found in <ORACLE_HOME>/owb/bin/admin/Runtime.properties: property.RuntimePlatform.0.NativeExecution.Shell.security_constraint=DISABLED The property can be set to three different values: NATIVE_JAVA, SCHEDULER and DISBALED. Where NATIVE_JAVA uses the Java 'Runtime.exec' interface, SCHEDULER uses a DBMS Scheduler external job submitted by the Control Center repository owner which is executed by the default operating system user configured by the DBA. DISABLED prevents execution via these operators. We enable the execution of User Defined activity by setting: property.RuntimePlatform.0.NativeExecution.Shell.security_constraint= NATIVE_JAVA Restart the Control Center service for the change of setting to take effect. cd <ORACLE_HOME>/owb/rtp/sql sqlplus OWBSYS/<password of OWBSYS> @stop_service.sql sqlplus OWBSYS/<password of OWBSYS> @start_service.sql And then run the Process Flow again. We will see that the Process Flow completes successfully. The execution of /tmp/test.sh successfully generated a file /tmp/test.txt, containing the line Hello World!. Pass parameters to User Defined Activity The Process Flow created in the above section has a drawback: the User Defined activity doesn't accept any information from OWB nor does it give any meaningful results back to OWB. That's to say, it lacks interaction. Maybe, sometimes such a Process Flow can fulfill the business requirement. But for most of the time, we need to get the User Defined activity executed according to some information prior to that step. In this section, we will see how to pass parameters to the User Defined activity and pass them into the to-be-executed shell script. First, let's see how to pass parameters to the script. The User Defined activity has an input parameter named PARAMETER_LIST. This is a list of parameters that will be passed to the command. Parameters are separated from one another by a token. The token is taken as the first character on the PARAMETER_LIST string, and the string must also end in that token. Warehouse Builder recommends the '?' character, but any character can be used. For example, to pass 'abc,' 'def,' and 'ghi' you can use the following equivalent: ?abc?def?ghi? or !abc!def!ghi! or |abc|def|ghi| If the token character or '\' needs to be included as part of the parameter, then it must be preceded with '\'. For example '\\'. If '\' is the token character, then '/' becomes the escape character. Let's configure the PARAMETER_LIST parameter as below: And modify the shell script /tmp/test.sh as below: echo $1 is saying hello to $2! > /tmp/test.txt Re-deploy the Process Flow and run it. We will see that the generated /tmp/test.txt contains the following line: Bob is saying hello to Alice! In the example above, the parameters passed into the shell script are static. This case is not so useful because: instead of passing parameters, we can directly write the value of the parameters in the shell script. To make the case more meaningful, we can pass two dynamic parameters, that are obtained from the previous activity, to the shell script. Prepare the Process Flow as below: The Mapping activity MAPPING_1 has two output parameters: FROM_USER, TO_USER. The User Defined activity has two input parameters: FROM_USER, TO_USER. All the four parameters are of String type. Additionally, the Process Flow has two string variables: VARIABLE_FOR_FROM_USER, VARIABLE_FOR_TO_USER. Through VARIABLE_FOR_FROM_USER, the input parameter FROM_USER of USER_DEFINED gets value from output parameter FROM_USER of MAPPING_1. We achieve this by binding both parameters to VARIABLE_FOR_FROM_USER. See the two figures below. In the same way, through VARIABLE_FOR_TO_USER, the input parameter TO_USER of USER_DEFINED gets value from output parameter TO_USER of MAPPING_1. Also, we need to change the PARAMETER_LIST of the User Defined activity like below: Now, the shell script is getting input from the Mapping activity dynamically. Deploy the Process Flow and all of its necessary dependees then run the Process Flow. We see that the generated /tmp/test.txt contains the following line: USER B is saying hello to USER A! 'USER B' and 'USER A' are two outputs of the Mapping execution. Write the shell script within Oracle Warehouse Builder In the previous section, the shell script is located in the /tmp directory. But sometimes, when the shell script is small, or for the sake of maintaining consistency, you may want to keep the shell script inside Oracle Warehouse Builder. We can achieve this by configuring these three parameters of a User Defined activity properly: COMMAND: Set the path of interpreter, by which the shell script will be interpreted. PARAMETER_LIST: Set it blank. SCRIPT: Enter the shell script content. Note that in Linux the shell script content is passed into the interpreter as standard input at runtime. About how to actually pass parameters to the shell script, we can utilize variable substitutions. As in the following figure, ${FROM_USER} will be replaced by the value of the FROM_USER input parameter of the User Defined activity. So will the ${TO_USER} symbol. Besides the custom substitution variables, OWB also provide some system pre-defined substitution variables. You can refer to the online document for that. Deploy the Process Flow and run it. We see that the generated /tmp/test.txt contains the following line: USER B is saying hello to USER A! Leverage the return value of User Defined activity All of the previous sections are connecting the User Defined activity to END_SUCCESS with an unconditional transition. But what should we do if we want different subsequent activities for different shell script execution results? 1. The simplest way is to add three simple-conditioned out-going transitions for the User Defined activity just like the figure below. In the figure, to simplify the scenario, we connect the User Defined activity to three End activities. Basically, if the shell script ends successfully, the whole Process Flow will end at END_SUCCESS, otherwise, the whole Process Flow will end at END_ERROR (in our case, ending at END_WARNING seldom happens). In the real world, we can add more complex and meaningful subsequent business logic. 2. Or we can utilize complex conditions to work with different results of the User Defined activity. Previously, in our script, we only have this line: echo ${FROM_USER} is saying hello to ${TO_USER}! > /tmp/test.txt We can add more logic in it and return different values accordingly. echo ${FROM_USER} is saying hello to ${TO_USER}! > /tmp/test.txt if CONDITION_1 ; then ...... exit 0 fi if CONDITION_2 ; then ...... exit 2 fi if CONDITION_3 ; then ...... exit 3 fi After that we can leverage the result by checking RESULT_CODE in condition expression of those out-going transitions. Let's suppose that we have the Process Flow as the following graph (SUB_PROCESS_n stands for more different further processes): We can set complex condition for the transition from USER_DEFINED to SUB_PROCESS_1 like this: Other transitions can be set in the same way. Note that, in our shell script, we return 0, 2 and 3, but not 1. As in Linux system, if the shell script comes across a system error like IO error, the return value will be 1. We can explicitly handle such a return value. Summary Let's summarize what has been discussed in this article: How to create a Process Flow with a User Defined activity in it How to pass parameters from the prior activity to the User Defined activity and finally into the shell script How to write the shell script within Oracle Warehouse Builder How to do variable substitutions How to let the User Defined activity return different values and in what way can we leverage

Read the article
SQL SERVER – Weekly Series – Memory Lane – #039

- by Pinal Dave

Here is the list of selected articles of SQLAuthority.com across all these years. Instead of just listing all the articles I have selected a few of my most favorite articles and have listed them here with additional notes below it. Let me know which one of the following is your favorite article from memory lane. 2007 FQL – Facebook Query Language Facebook list following advantages of FQL: Condensed XML reduces bandwidth and parsing costs. More complex requests can reduce the number of requests necessary. Provides a single consistent, unified interface for all of your data. It’s fun! UDF – Get the Day of the Week Function The day of the week can be retrieved in SQL Server by using the DatePart function. The value returned by the function is between 1 (Sunday) and 7 (Saturday). To convert this to a string representing the day of the week, use a CASE statement. UDF – Function to Get Previous And Next Work Day – Exclude Saturday and Sunday While reading ColdFusion blog of Ben Nadel Getting the Previous Day In ColdFusion, Excluding Saturday And Sunday, I realize that I use similar function on my SQL Server Database. This function excludes the Weekends (Saturday and Sunday), and it gets previous as well as next work day. Complete Series of SQL Server Interview Questions and Answers Data Warehousing Interview Questions and Answers – Introduction Data Warehousing Interview Questions and Answers – Part 1 Data Warehousing Interview Questions and Answers – Part 2 Data Warehousing Interview Questions and Answers – Part 3 Data Warehousing Interview Questions and Answers Complete List Download 2008 Introduction to Log Viewer In SQL Server all the windows event logs can be seen along with SQL Server logs. Interface for all the logs is same and can be launched from the same place. This log can be exported and filtered as well. DBCC SHRINKFILE Takes Long Time to Run If you are DBA who are involved with Database Maintenance and file group maintenance, you must have experience that many times DBCC SHRINKFILE operations takes a long time but any other operations with Database are relatively quicker. mssqlsystemresource – Resource Database The purpose of resource database is to facilitates upgrading to the new version of SQL Server without any hassle. In previous versions whenever version of SQL Server was upgraded all the previous version system objects needs to be dropped and new version system objects to be created. 2009 Puzzle – Write Script to Generate Primary Key and Foreign Key In SQL Server Management Studio (SSMS), there is no option to script all the keys. If one is required to script keys they will have to manually script each key one at a time. If database has many tables, generating one key at a time can be a very intricate task. I want to throw a question to all of you if any of you have scripts for the same purpose. Maximizing View of SQL Server Management Studio – Full Screen – New Screen I had explained the following two different methods: 1) Open Results in Separate Tab - This is a very interesting method as result pan shows up in a different tab instead of the splitting screen horizontally. 2) Open SSMS in Full Screen - This works always and to its best. Not many people are aware of this method; hence, very few people use it to enhance performance. 2010 Find Queries using Parallelism from Cached Plan T-SQL script gets all the queries and their execution plan where parallelism operations are kicked up. Pay attention there is TOP 10 is used, if you have lots of transactional operations, I suggest that you change TOP 10 to TOP 50 This is the list of the all the articles in the series of computed columns. SQL SERVER – Computed Column – PERSISTED and Storage This article talks about how computed columns are created and why they take more storage space than before. SQL SERVER – Computed Column – PERSISTED and Performance This article talks about how PERSISTED columns give better performance than non-persisted columns. SQL SERVER – Computed Column – PERSISTED and Performance – Part 2 This article talks about how non-persisted columns give better performance than PERSISTED columns. SQL SERVER – Computed Column and Performance – Part 3 This article talks about how Index improves the performance of Computed Columns. SQL SERVER – Computed Column – PERSISTED and Storage – Part 2 This article talks about how creating index on computed column does not grow the row length of table. SQL SERVER – Computed Columns – Index and Performance This article summarized all the articles related to computed columns. 2011 SQL SERVER – Interview Questions and Answers – Frequently Asked Questions – Data Warehousing Concepts – Day 21 of 31 What is Data Warehousing? What is Business Intelligence (BI)? What is a Dimension Table? What is Dimensional Modeling? What is a Fact Table? What are the Fundamental Stages of Data Warehousing? What are the Different Methods of Loading Dimension tables? Describes the Foreign Key Columns in Fact Table and Dimension Table? What is Data Mining? What is the Difference between a View and a Materialized View? SQL SERVER – Interview Questions and Answers – Frequently Asked Questions – Data Warehousing Concepts – Day 22 of 31 What is OLTP? What is OLAP? What is the Difference between OLTP and OLAP? What is ODS? What is ER Diagram? SQL SERVER – Interview Questions and Answers – Frequently Asked Questions – Data Warehousing Concepts – Day 23 of 31 What is ETL? What is VLDB? Is OLTP Database is Design Optimal for Data Warehouse? If denormalizing improves Data Warehouse Processes, then why is the Fact Table is in the Normal Form? What are Lookup Tables? What are Aggregate Tables? What is Real-Time Data-Warehousing? What are Conformed Dimensions? What is a Conformed Fact? How do you Load the Time Dimension? What is a Level of Granularity of a Fact Table? What are Non-Additive Facts? What is a Factless Facts Table? What are Slowly Changing Dimensions (SCD)? SQL SERVER – Interview Questions and Answers – Frequently Asked Questions – Data Warehousing Concepts – Day 24 of 31 What is Hybrid Slowly Changing Dimension? What is BUS Schema? What is a Star Schema? What Snow Flake Schema? Differences between the Star and Snowflake Schema? What is Difference between ER Modeling and Dimensional Modeling? What is Degenerate Dimension Table? Why is Data Modeling Important? What is a Surrogate Key? What is Junk Dimension? What is a Data Mart? What is the Difference between OLAP and Data Warehouse? What is a Cube and Linked Cube with Reference to Data Warehouse? What is Snapshot with Reference to Data Warehouse? What is Active Data Warehousing? What is the Difference between Data Warehousing and Business Intelligence? What is MDS? Explain the Paradigm of Bill Inmon and Ralph Kimball. SQL SERVER – Azure Interview Questions and Answers – Guest Post by Paras Doshi – Day 25 of 31 Paras Doshi has submitted 21 interesting question and answers for SQL Azure. 1.What is SQL Azure? 2.What is cloud computing? 3.How is SQL Azure different than SQL server? 4.How many replicas are maintained for each SQL Azure database? 5.How can we migrate from SQL server to SQL Azure? 6.Which tools are available to manage SQL Azure databases and servers? 7.Tell me something about security and SQL Azure. 8.What is SQL Azure Firewall? 9.What is the difference between web edition and business edition? 10.How do we synchronize On Premise SQL server with SQL Azure? 11.How do we Backup SQL Azure Data? 12.What is the current pricing model of SQL Azure? 13.What is the current limitation of the size of SQL Azure DB? 14.How do you handle datasets larger than 50 GB? 15.What happens when the SQL Azure database reaches Max Size? 16.How many databases can we create in a single server? 17.How many servers can we create in a single subscription? 18.How do you improve the performance of a SQL Azure Database? 19.What is code near application topology? 20.What were the latest updates to SQL Azure service? 21.When does a workload on SQL Azure get throttled? SQL SERVER – Interview Questions and Answers – Guest Post by Malathi Mahadevan – Day 26 of 31 Malachi had asked a simple question which has several answers. Each answer makes you think and ponder about the reality of the IT world. Look at the simple question – ‘What is the toughest challenge you have faced in your present job and how did you handle it’? and its various answers. Each answer has its own story. SQL SERVER – Interview Questions and Answers – Guest Post by Rick Morelan – Day 27 of 31 Rick Morelan of Joes2Pros has written an excellent blog post on the subject how to find top N values. Most people are fully aware of how the TOP keyword works with a SELECT statement. After years preparing so many students to pass the SQL Certification I noticed they were pretty well prepared for job interviews too. Yes, they would do well in the interview but not great. There seemed to be a few questions that would come up repeatedly for almost everyone. Rick addresses similar questions in his lucid writing skills. 2012 Observation of Top with Index and Order of Resultset SQL Server has lots of things to learn and share. It is amazing to see how people evaluate and understand different techniques and styles differently when implementing. The real reason may be absolutely different but we may blame something totally different for the incorrect results. Read the blog post to learn more. How do I Record Video and Webcast How to Convert Hex to Decimal or INT Earlier I asked regarding a question about how to convert Hex to Decimal. I promised that I will post an answer with Due Credit to the author but never got around to post a blog post around it. Read the original post over here SQL SERVER – Question – How to Convert Hex to Decimal. Query to Get Unique Distinct Data Based on Condition – Eliminate Duplicate Data from Resultset The natural reaction will be to suggest DISTINCT or GROUP BY. However, not all the questions can be solved by DISTINCT or GROUP BY. Let us see the following example, where a user wanted only latest records to be displayed. Let us see the example to understand further. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Memory Lane, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

Read the article
Fraud Detection with the SQL Server Suite Part 1

- by Dejan Sarka

While working on different fraud detection projects, I developed my own approach to the solution for this problem. In my PASS Summit 2013 session I am introducing this approach. I also wrote a whitepaper on the same topic, which was generously reviewed by my friend Matija Lah. In order to spread this knowledge faster, I am starting a series of blog posts which will at the end make the whole whitepaper. Abstract With the massive usage of credit cards and web applications for banking and payment processing, the number of fraudulent transactions is growing rapidly and on a global scale. Several fraud detection algorithms are available within a variety of different products. In this paper, we focus on using the Microsoft SQL Server suite for this purpose. In addition, we will explain our original approach to solving the problem by introducing a continuous learning procedure. Our preferred type of service is mentoring; it allows us to perform the work and consulting together with transferring the knowledge onto the customer, thus making it possible for a customer to continue to learn independently. This paper is based on practical experience with different projects covering online banking and credit card usage. Introduction A fraud is a criminal or deceptive activity with the intention of achieving financial or some other gain. Fraud can appear in multiple business areas. You can find a detailed overview of the business domains where fraud can take place in Sahin Y., & Duman E. (2011), Detecting Credit Card Fraud by Decision Trees and Support Vector Machines, Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS. Dealing with frauds includes fraud prevention and fraud detection. Fraud prevention is a proactive mechanism, which tries to disable frauds by using previous knowledge. Fraud detection is a reactive mechanism with the goal of detecting suspicious behavior when a fraudster surpasses the fraud prevention mechanism. A fraud detection mechanism checks every transaction and assigns a weight in terms of probability between 0 and 1 that represents a score for evaluating whether a transaction is fraudulent or not. A fraud detection mechanism cannot detect frauds with a probability of 100%; therefore, manual transaction checking must also be available. With fraud detection, this manual part can focus on the most suspicious transactions. This way, an unchanged number of supervisors can detect significantly more frauds than could be achieved with traditional methods of selecting which transactions to check, for example with random sampling. There are two principal data mining techniques available both in general data mining as well as in specific fraud detection techniques: supervised or directed and unsupervised or undirected. Supervised techniques or data mining models use previous knowledge. Typically, existing transactions are marked with a flag denoting whether a particular transaction is fraudulent or not. Customers at some point in time do report frauds, and the transactional system should be capable of accepting such a flag. Supervised data mining algorithms try to explain the value of this flag by using different input variables. When the patterns and rules that lead to frauds are learned through the model training process, they can be used for prediction of the fraud flag on new incoming transactions. Unsupervised techniques analyze data without prior knowledge, without the fraud flag; they try to find transactions which do not resemble other transactions, i.e. outliers. In both cases, there should be more frauds in the data set selected for checking by using the data mining knowledge compared to selecting the data set with simpler methods; this is known as the lift of a model. Typically, we compare the lift with random sampling. The supervised methods typically give a much better lift than the unsupervised ones. However, we must use the unsupervised ones when we do not have any previous knowledge. Furthermore, unsupervised methods are useful for controlling whether the supervised models are still efficient. Accuracy of the predictions drops over time. Patterns of credit card usage, for example, change over time. In addition, fraudsters continuously learn as well. Therefore, it is important to check the efficiency of the predictive models with the undirected ones. When the difference between the lift of the supervised models and the lift of the unsupervised models drops, it is time to refine the supervised models. However, the unsupervised models can become obsolete as well. It is also important to measure the overall efficiency of both, supervised and unsupervised models, over time. We can compare the number of predicted frauds with the total number of frauds that include predicted and reported occurrences. For measuring behavior across time, specific analytical databases called data warehouses (DW) and on-line analytical processing (OLAP) systems can be employed. By controlling the supervised models with unsupervised ones and by using an OLAP system or DW reports to control both, a continuous learning infrastructure can be established. There are many difficulties in developing a fraud detection system. As has already been mentioned, fraudsters continuously learn, and the patterns change. The exchange of experiences and ideas can be very limited due to privacy concerns. In addition, both data sets and results might be censored, as the companies generally do not want to publically expose actual fraudulent behaviors. Therefore it can be quite difficult if not impossible to cross-evaluate the models using data from different companies and different business areas. This fact stresses the importance of continuous learning even more. Finally, the number of frauds in the total number of transactions is small, typically much less than 1% of transactions is fraudulent. Some predictive data mining algorithms do not give good results when the target state is represented with a very low frequency. Data preparation techniques like oversampling and undersampling can help overcome the shortcomings of many algorithms. SQL Server suite includes all of the software required to create, deploy any maintain a fraud detection infrastructure. The Database Engine is the relational database management system (RDBMS), which supports all activity needed for data preparation and for data warehouses. SQL Server Analysis Services (SSAS) supports OLAP and data mining (in version 2012, you need to install SSAS in multidimensional and data mining mode; this was the only mode in previous versions of SSAS, while SSAS 2012 also supports the tabular mode, which does not include data mining). Additional products from the suite can be useful as well. SQL Server Integration Services (SSIS) is a tool for developing extract transform–load (ETL) applications. SSIS is typically used for loading a DW, and in addition, it can use SSAS data mining models for building intelligent data flows. SQL Server Reporting Services (SSRS) is useful for presenting the results in a variety of reports. Data Quality Services (DQS) mitigate the occasional data cleansing process by maintaining a knowledge base. Master Data Services is an application that helps companies maintaining a central, authoritative source of their master data, i.e. the most important data to any organization. For an overview of the SQL Server business intelligence (BI) part of the suite that includes Database Engine, SSAS and SSRS, please refer to Veerman E., Lachev T., & Sarka D. (2009). MCTS Self-Paced Training Kit (Exam 70-448): Microsoft® SQL Server® 2008 Business Intelligence Development and Maintenance. MS Press. For an overview of the enterprise information management (EIM) part that includes SSIS, DQS and MDS, please refer to Sarka D., Lah M., & Jerkic G. (2012). Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft® SQL Server® 2012. O'Reilly. For details about SSAS data mining, please refer to MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley. SQL Server Data Mining Add-ins for Office, a free download for Office versions 2007, 2010 and 2013, bring the power of data mining to Excel, enabling advanced analytics in Excel. Together with PowerPivot for Excel, which is also freely downloadable and can be used in Excel 2010, is already included in Excel 2013. It brings OLAP functionalities directly into Excel, making it possible for an advanced analyst to build a complete learning infrastructure using a familiar tool. This way, many more people, including employees in subsidiaries, can contribute to the learning process by examining local transactions and quickly identifying new patterns.

Read the article
Solving Big Problems with Oracle R Enterprise, Part II

- by dbayard

Part II – Solving Big Problems with Oracle R Enterprise In the first post in this series (see https://blogs.oracle.com/R/entry/solving_big_problems_with_oracle), we showed how you can use R to perform historical rate of return calculations against investment data sourced from a spreadsheet. We demonstrated the calculations against sample data for a small set of accounts. While this worked fine, in the real-world the problem is much bigger because the amount of data is much bigger. So much bigger that our approach in the previous post won’t scale to meet the real-world needs. From our previous post, here are the challenges we need to conquer: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In this post, we will show how we moved from sample data environment to working with full-scale data. This post is based on actual work we did for a financial services customer during a recent proof-of-concept. Getting started with the Database At this point, we have some sample data and our IRR function. We were at a similar point in our customer proof-of-concept exercise- we had sample data but we did not have the full customer data yet. So our database was empty. But, this was easily rectified by leveraging the transparency features of Oracle R Enterprise (see https://blogs.oracle.com/R/entry/analyzing_big_data_using_the). The following code shows how we took our sample data SimpleMWRRData and easily turned it into a new Oracle database table called IRR_DATA via ore.create(). The code also shows how we can access the database table IRR_DATA as if it was a normal R data.frame named IRR_DATA. If we go to sql*plus, we can also check out our new IRR_DATA table: At this point, we now have our sample data loaded in the database as a normal Oracle table called IRR_DATA. So, we now proceeded to test our R function working with database data. As our first test, we retrieved the data from a single account from the IRR_DATA table, pull it into local R memory, then call our IRR function. This worked. No SQL coding required! Going from Crawling to Walking Now that we have shown using our R code with database-resident data for a single account, we wanted to experiment with doing this for multiple accounts. In other words, we wanted to implement the split-apply-combine technique we discussed in our first post in this series. Fortunately, Oracle R Enterprise provides a very scalable way to do this with a function called ore.groupApply(). You can read more about ore.groupApply() here: https://blogs.oracle.com/R/entry/analyzing_big_data_using_the1 Here is an example of how we ask ORE to take our IRR_DATA table in the database, split it by the ACCOUNT column, apply a function that calls our SimpleMWRR() calculation, and then combine the results. (If you are following along at home, be sure to have installed our myIRR package on your database server via “R CMD INSTALL myIRR”). The interesting thing about ore.groupApply is that the calculation is not actually performed in my desktop R environment from which I am running. What actually happens is that ore.groupApply uses the Oracle database to perform the work. And the Oracle database is what actually splits the IRR_DATA table by ACCOUNT. Then the Oracle database takes the data for each account and sends it to an embedded R engine running on the database server to apply our R function. Then the Oracle database combines all the individual results from the calls to the R function. This is significant because now the embedded R engine only needs to deal with the data for a single account at a time. Regardless of whether we have 20 accounts or 1 million accounts or more, the R engine that performs the calculation does not care. Given that normal R has a finite amount of memory to hold data, the ore.groupApply approach overcomes the R memory scalability problem since we only need to fit the data from a single account in R memory (not all of the data for all of the accounts). Additionally, the IRR_DATA does not need to be sent from the database to my desktop R program. Even though I am invoking ore.groupApply from my desktop R program, because the actual SimpleMWRR calculation is run by the embedded R engine on the database server, the IRR_DATA does not need to leave the database server- this is both a performance benefit because network transmission of large amounts of data take time and a security benefit because it is harder to protect private data once you start shipping around your intranet. Another benefit, which we will discuss in a few paragraphs, is the ability to leverage Oracle database parallelism to run these calculations for dozens of accounts at once. From Walking to Running ore.groupApply is rather nice, but it still has the drawback that I run this from a desktop R instance. This is not ideal for integrating into typical operational processes like nightly data warehouse refreshes or monthly statement generation. But, this is not an issue for ORE. Oracle R Enterprise lets us run this from the database using regular SQL, which is easily integrated into standard operations. That is extremely exciting and the way we actually did these calculations in the customer proof. As part of Oracle R Enterprise, it provides a SQL equivalent to ore.groupApply which it refers to as “rqGroupEval”. To use rqGroupEval via SQL, there is a bit of simple setup needed. Basically, the Oracle Database needs to know the structure of the input table and the grouping column, which we are able to define using the database’s pipeline table function mechanisms. Here is the setup script: At this point, our initial setup of rqGroupEval is done for the IRR_DATA table. The next step is to define our R function to the database. We do that via a call to ORE’s rqScriptCreate. Now we can test it. The SQL you use to run rqGroupEval uses the Oracle database pipeline table function syntax. The first argument to irr_dataGroupEval is a cursor defining our input. You can add additional where clauses and subqueries to this cursor as appropriate. The second argument is any additional inputs to the R function. The third argument is the text of a dummy select statement. The dummy select statement is used by the database to identify the columns and datatypes to expect the R function to return. The fourth argument is the column of the input table to split/group by. The final argument is the name of the R function as you defined it when you called rqScriptCreate(). The Real-World Results In our real customer proof-of-concept, we had more sophisticated calculation requirements than shown in this simplified blog example. For instance, we had to perform the rate of return calculations for 5 separate time periods, so the R code was enhanced to do so. In addition, some accounts needed a time-weighted rate of return to be calculated, so we extended our approach and added an R function to do that. And finally, there were also a few more real-world data irregularities that we needed to account for, so we added logic to our R functions to deal with those exceptions. For the full-scale customer test, we loaded the customer data onto a Half-Rack Exadata X2-2 Database Machine. As our half-rack had 48 physical cores (and 96 threads if you consider hyperthreading), we wanted to take advantage of that CPU horsepower to speed up our calculations. To do so with ORE, it is as simple as leveraging the Oracle Database Parallel Query features. Let’s look at the SQL used in the customer proof: Notice that we use a parallel hint on the cursor that is the input to our rqGroupEval function. That is all we need to do to enable Oracle to use parallel R engines. Here are a few screenshots of what this SQL looked like in the Real-Time SQL Monitor when we ran this during the proof of concept (hint: you might need to right-click on these images to be able to view the images full-screen to see the entire image): From the above, you can notice a few things (numbers 1 thru 5 below correspond with highlighted numbers on the images above. You may need to right click on the above images and view the images full-screen to see the entire image): The SQL completed in 110 seconds (1.8minutes) We calculated rate of returns for 5 time periods for each of 911k accounts (the number of actual rows returned by the IRRSTAGEGROUPEVAL operation) We accessed 103m rows of detailed cash flow/market value data (the number of actual rows returned by the IRR_STAGE2 operation) We ran with 72 degrees of parallelism spread across 4 database servers Most of our 110seconds was spent in the “External Procedure call” event On average, we performed 8,200 executions of our R function per second (110s/911k accounts) On average, each execution was passed 110 rows of data (103m detail rows/911k accounts) On average, we did 41,000 single time period rate of return calculations per second (each of the 8,200 executions of our R function did rate of return calculations for 5 time periods) On average, we processed over 900,000 rows of database data in R per second (103m detail rows/110s) R + Oracle R Enterprise: Best of R + Best of Oracle Database This blog post series started by describing a real customer problem: how to perform a lot of calculations on a lot of data in a short period of time. While standard R proved to be a very good fit for writing the necessary calculations, the challenge of working with a lot of data in a short period of time remained. This blog post series showed how Oracle R Enterprise enables R to be used in conjunction with the Oracle Database to overcome the data volume and performance issues (as well as simplifying the operations and security issues). It also showed that we could calculate 5 time periods of rate of returns for almost a million individual accounts in less than 2 minutes. In a future post, we will take the same R function and show how Oracle R Connector for Hadoop can be used in the Hadoop world. In that next post, instead of having our data in an Oracle database, our data will live in Hadoop and we will how to use the Oracle R Connector for Hadoop and other Oracle Big Data Connectors to move data between Hadoop, R, and the Oracle Database easily.

Read the article
Solving Big Problems with Oracle R Enterprise, Part I

- by dbayard

Abstract: This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set. Overview: Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls. R is great for doing advanced computations. Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns. This blog post shows how Oracle R Enterprise enables R plus the Oracle Database enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes. The problem: A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios. This information is needed for customer statements and the online web application. In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations. But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation. IRR – a problem that R is good at solving: Internal Rate of Return is an interesting calculation in that in most real-world scenarios it is impractical to calculate exactly. Rather, IRR is a calculation where approximation techniques need to be used. In this blog post, we will discuss calculating the “money weighted rate of return” but in the actual customer proof of concept we used R to calculate both money weighted rate of returns and time weighted rate of returns. You can learn more about the money weighted rate of returns here: http://www.wikinvest.com/wiki/Money-weighted_return First Steps- Calculating IRR in R We will start with calculating the IRR in standalone/desktop R. In our second post, we will show how to take this desktop R function, deploy it to an Oracle Database, and make it work at real-world scale. The first step we did was to get some sample data. For a historical IRR calculation, you have a balances and cash flows. In our case, the customer provided us with several accounts worth of sample data in Microsoft Excel. The above figure shows part of the spreadsheet of sample data. The data provides balances and cash flows for a sample account (BMV=beginning market value. FLOW=cash flow in/out of account. EMV=ending market value). Once we had the sample spreadsheet, the next step we did was to read the Excel data into R. This is something that R does well. R offers multiple ways to work with spreadsheet data. For instance, one could save the spreadsheet as a .csv file. In our case, the customer provided a spreadsheet file containing multiple sheets where each sheet provided data for a different sample account. To handle this easily, we took advantage of the RODBC package which allowed us to read the Excel data sheet-by-sheet without having to create individual .csv files. We wrote ourselves a little helper function called getsheet() around the RODBC package. Then we loaded all of the sample accounts into a data.frame called SimpleMWRRData. Writing the IRR function At this point, it was time to write the money weighted rate of return (MWRR) function itself. The definition of MWRR is easily found on the internet or if you are old school you can look in an investment performance text book. In the customer proof, we based our calculations off the ones defined in the The Handbook of Investment Performance: A User’s Guide by David Spaulding since this is the reference book used by the customer. (One of the nice things we found during the course of this proof-of-concept is that by using R to write our IRR functions we could easily incorporate the specific variations and business rules of the customer into the calculation.) The key thing with calculating IRR is the need to solve a complex equation with a numerical approximation technique. For IRR, you need to find the value of the rate of return (r) that sets the Net Present Value of all the flows in and out of the account to zero. With R, we solve this by defining our NPV function: where bmv is the beginning market value, cf is a vector of cash flows, t is a vector of time (relative to the beginning), emv is the ending market value, and tend is the ending time. Since solving for r is a one-dimensional optimization problem, we decided to take advantage of R’s optimize method (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optimize.html). The optimize method can be used to find a minimum or maximum; to find the value of r where our npv function is closest to zero, we wrapped our npv function inside the abs function and asked optimize to find the minimum. Here is an example of using optimize: where low and high are scalars that indicate the range to search for an answer. To test this out, we need to set values for bmv, cf, t, emv, tend, low, and high. We will set low and high to some reasonable defaults. For example, this account had a negative 2.2% money weighted rate of return. Enhancing and Packaging the IRR function With numerical approximation methods like optimize, sometimes you will not be able to find an answer with your initial set of inputs. To account for this, our approach was to first try to find an answer for r within a narrow range, then if we did not find an answer, try calling optimize() again with a broader range. See the R help page on optimize() for more details about the search range and its algorithm. At this point, we can now write a simplified version of our MWRR function. (Our real-world version is more sophisticated in that it calculates rate of returns for 5 different time periods [since inception, last quarter, year-to-date, last year, year before last year] in a single invocation. In our actual customer proof, we also defined time-weighted rate of return calculations. The beauty of R is that it was very easy to add these enhancements and additional calculations to our IRR package.)To simplify code deployment, we then created a new package of our IRR functions and sample data. For this blog post, we only need to include our SimpleMWRR function and our SimpleMWRRData sample data. We created the shell of the package by calling: To turn this package skeleton into something usable, at a minimum you need to edit the SimpleMWRR.Rd and SimpleMWRRData.Rd files in the \man subdirectory. In those files, you need to at least provide a value for the “title” section. Once that is done, you can change directory to the IRR directory and type at the command-line: The myIRR package for this blog post (which has both SimpleMWRR source and SimpleMWRRData sample data) is downloadable from here: myIRR package Testing the myIRR package Here is an example of testing our IRR function once it was converted to an installable package: Calculating IRR for All the Accounts So far, we have shown how to calculate IRR for a single account. The real-world issue is how do you calculate IRR for all of the accounts?This is the kind of situation where we can leverage the “Split-Apply-Combine” approach (see http://www.cscs.umich.edu/~crshalizi/weblog/815.html). Given that our sample data can fit in memory, one easy approach is to use R’s “by” function. (Other approaches to Split-Apply-Combine such as plyr can also be used. See http://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/). Here is an example showing the use of “by” to calculate the money weighted rate of return for each account in our sample data set. Recap and Next Steps At this point, you’ve seen the power of R being used to calculate IRR. There were several good things: R could easily work with the spreadsheets of sample data we were given R’s optimize() function provided a nice way to solve for IRR- it was both fast and allowed us to avoid having to code our own iterative approximation algorithm R was a convenient language to express the customer-specific variations, business-rules, and exceptions that often occur in real-world calculations- these could be easily added to our IRR functions The Split-Apply-Combine technique can be used to perform calculations of IRR for multiple accounts at once. However, there are several challenges yet to be conquered at this point in our story: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In our next blog post in this series, we will show you how Oracle R Enterprise solved these challenges.

Read the article

Search Results

Search found 487 results on 20 pages for 'etl instrumentation'.

Page 19/20 | < Previous Page | 15 16 17 18 19 20 | Next Page >

- by Sung Meister

- by Tom O'Connor

- by David Allan

- by jamiet

- by Pinal Dave

- by SQLOS Team

- by Most Valuable Yak (Rob Volk)

- by Most Valuable Yak (Rob Volk)

- by Pinal Dave

- by Dejan Sarka

- by Alois Kraus

- by SQLOS Team

- by Repieter

- by fishdump

- by Tom Bushell

- by Dick dastardly

- by Nicklas O

- by wizmagister

- by Ricardo Peres

- by Jinggen He

- by Pinal Dave

- by Dejan Sarka

- by dbayard

- by dbayard

< Previous Page | 15 16 17 18 19 20 | Next Page >