We have an X-Files problem with our .NET application. Or rather, with our hybrid Win32 and .NET application.
When it attempts to communicate with Oracle, it just dies. Vanishes. Goes to the big black void in the sky. No event log message, no exception, no nothing.
If we simply point the application at an MS SQL Server instead, which has the effect of replacing the usage of OracleConnection and related classes with SqlConnection and related classes, it works as expected.
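To be concrete, the switch amounts to a provider swap along these lines. This is only a minimal sketch, not our actual code: the factory class is made up, and I'm using System.Data.OracleClient here just for illustration, assuming the data access layer works against IDbConnection.

    using System.Data;
    using System.Data.OracleClient;
    using System.Data.SqlClient;

    static class ConnectionFactory
    {
        // Everything upstream only sees IDbConnection/IDbCommand, so switching
        // databases is just a matter of which concrete provider gets instantiated.
        public static IDbConnection Create(bool useOracle, string connectionString)
        {
            if (useOracle)
                return new OracleConnection(connectionString); // this path kills the process
            return new SqlConnection(connectionString);        // this path works as expected
        }
    }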
Today we had a breakthrough.
For some reason, a customer had figured out that by placing all the application files in a directory on his desktop, it worked as expected with Oracle as well. Moving the directory down to the root of the drive, or into C:\Temp, or, well, around a bit, made the crash reappear.
Basically it was 100% reproducible that the application worked if run from a directory on the desktop, and failed if run from a directory in the root.
Today we figured out that the difference that counted was whether there was a space in the directory name or not.
So, these directories would work:
C:\Program Files\AppDir\Executable.exe
C:\Temp Lemp\AppDir\Executable.exe
C:\Documents and Settings\someuser\Desktop\AppDir\Executable.exe
whereas these would not:
C:\CompanyName\AppDir\Executable.exe
C:\Programfiler\AppDir\Executable.exe <-- Program Files in Norwegian
C:\Temp\AppDir\Executable.exe
I'm hoping someone reading this has seen similar behavior and has an "aha, you need to twiddle the frob on the Oracle glitz driver configuration" answer or similar.
Anyone?
Followup #1: Ok, I've processed the procmon output now, both files captured from when I hit the button that attempts to open the window that triggers the cascade failure, and I've noticed that they mostly track each other: there are some smallish differences near the top of both files, and then they track each other a long way down.
However, at the point where one run fails, the other keeps going, and the next few lines of its log output are these:
ReadFile C:\oracle\product\10.2.0\db_1\BIN\orageneric10.dll SUCCESS Offset: 274 432, Length: 32 768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O
ReadFile C:\oracle\product\10.2.0\db_1\BIN\orageneric10.dll SUCCESS Offset: 233 472, Length: 32 768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O
After this, the working run continues to execute, while the failing run touches mscorwks.dll a few times before the threads close down and the app exits. In other words, the failed run never performs the orageneric10.dll reads shown above.
Followup #2: Figured I'd try to upgrade the Oracle client drivers, but 10.2.0.1 is apparently the highest version available for Windows Server 2003 and Windows XP clients.
Followup #3: Well, we've ended up with a black-box solution. Basically, we found that the problem lies somewhere in the interaction between XPO and Oracle. XPO has a system table it manages, called XPObjectType, with three columns: Oid, TypeName and AssemblyName. Due to how Oracle is configured in the databases we talk to, the column names were OID, TYPENAME and ASSEMBLYNAME. This would ordinarily not be a problem, except that XPO reads the schema information directly and checks whether the table is there with the right column names, and XPO doesn't handle case differences, so it sees an XPObjectType table with three unknown columns and none of the ones it expects.
Exactly what XPO does then I don't really know, but if I drop this table and recreate it with the right casing, using double quotes around all the column names, the problem doesn't crop up.
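For the record, the recreate boiled down to DDL along these lines, here wrapped in a small C# helper. This is a sketch only: the column types and lengths are my assumptions rather than necessarily what XPO itself would generate, you would want to preserve or repopulate any existing rows before dropping anything, and I've left the table name unquoted since XPO matched the table name regardless of case; it was the column names that needed quoting.

    using System.Data.OracleClient;

    static class XpObjectTypeFix
    {
        public static void Recreate(string connectionString)
        {
            using (var connection = new OracleConnection(connectionString))
            {
                connection.Open();

                // Drop the table whose column names were folded to upper case
                // (OID, TYPENAME, ASSEMBLYNAME) because it was created without quotes.
                using (var drop = new OracleCommand("DROP TABLE XPObjectType", connection))
                {
                    drop.ExecuteNonQuery();
                }

                // Recreate it with double quotes around the column names so Oracle
                // keeps the mixed case that XPO's schema check is looking for.
                const string createSql =
                    "CREATE TABLE XPObjectType (" +
                    "  \"Oid\" NUMBER(10) NOT NULL PRIMARY KEY," +
                    "  \"TypeName\" NVARCHAR2(254)," +
                    "  \"AssemblyName\" NVARCHAR2(254))";

                using (var create = new OracleCommand(createSql, connection))
                {
                    create.ExecuteNonQuery();
                }
            }
        }
    }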
Exactly where the space in the folder name comes into this, I still have no idea, but this problem had two tiers:
Stop the application from crashing at our customers' sites (short-term solution)
Fix the bug (long-term solution)
Right now tier 1 is solved; tier 2 will be put back into the queue for now and prioritized later. We're facing some bigger changes to our data tier anyway, so this might not be a problem we need to solve, at least if all our Oracle customers verify that the table fix actually gets rid of the problem.
I'll accept the answer by Dave Markle: even though Process Monitor (the big brother of File Monitor) didn't actually pinpoint the problem, I was able to use it to determine that after my breakpoint in user code, where XPO had built up the query for this table, no I/O happened until all the entries for the application shutting down were logged. That led me to believe this table was the culprit, or at least influenced the problem somehow.
If I manage to get to the real cause of this, I'll update the post.