20150331 - Chip - Mar 31 2015
I was using WinDbg to trace through the highly experimental deferment code when WinDebug764 crashed on boot. This began a three hour tour to restore and update Windows.
20150330 - Chip - Mar 30 2015
I started working on the code to allow the user to interactively allow/deny network connections. This requires deferring operations until the user has a chance to decide. It also meant reviving the WinDebug764 VM so I could step through this code using WinDbg.
20150322 - Chip - Mar 22 2015
I am now able to retrieve the process name from NetMon.sys, which solves one more problem. Once I had it running I noticed a connection to South Africa by my least favorite process SvcHost.exe. I need to have the ThreadID to know which client is responsible.
OK, that took about 30 minutes. I now have the ThreadID. With this extra bit of information I was able to trace the connection to South Africa back to SvcHost.exe > Dnscache Service > dnsrslvr.dll. I can assume this was a DNS lookup that was being referred to a nameserver in South Africa.
20150320 - Chip - Mar 20 2015
I extracted the core code from GeoDB into its own library project, GeoLib. I then dropped GeoDB.cpp from the Control project and replaced the calls with the GeoLib functions. There was a bit of clean up, and I added a new function "GeoDB::Lookup4_Precheck()" that looks for predefined IP4 ranges (ie 192.168.X.X) and shortcuts the lookup. After about 2 hours, NetMon was running again with improved GeoDB info.
The performance is now very stable and not much (a bit) drag when running FireFox or IE. I now feel comfortable running NetMon in the background on my primary development system (VS12).
The stream capture seems to be stable as well. I captured ~300KB while browsing the Amazon site and I was able to step through all the HTTP GET requests. The stream viewer needs a lot of UI work to become usable and useful.
I found that the HTTPS streams are encoded; I had thought I was capturing the
stream outside the encryption. I will need to do some research on this. The relevant
docs are found in the VS13 help at
Windows Driver
Device and Driver Technologies
Network Devices
Windows Filtering Platform Callout Drivers
Developing IPSec-Compatible Callout Drivers
I bumped the VerProd version to 2.2, indicating a new stability milestone.
****
While writing up my notes during the review of the Driver\Log.cpp code, I realized that the while() loop could be replaced by a set of if-then-else statements, making infinite loops impossible. This made me rethink all the circular FIFO code I have written over the years. I wrote a non-loop version of BufRead():
BufRead() can also be rewritten using two if-then-else blocks, one for the header and the second for the body. The performance improvement is nonzero but miniscule, its advantage is rendering infinite loops logically impossible.
I can also drop BufCt completely, relying strictly on BufRd/BufWr. I would replace BufCt with an inline function to calculate it from BufRd and BufWr. BufCt was always redundant.
20150317 - Chip - Mar 17 2015
I noticed that my GeoDB data is over a year old and opened up the GeoDB project to rebuild it. I found that I had neglected to commit the source database to SCM and I never wrote up any notes on the process of obtaining the source data and how to convert it. In addition, maxmind.com has moved to a new data format. So I am working on GeoDB to bring it back to life.
The notes on GeoDB are here: GeoDB
20150316 - Chip - Mar 16 2015
It seems it is the easy things that take the longest...
I wanted to be able to control the sort order in the event list, which looked easy -- just add a call to ConnectSetSort() in response to a double-click message. My first mistake was using the column index directly and the log was sorted on the wrong column. That was an easy fix. I thought it would be a 15 minute task until I noticed that the items started falling out of order after a while. I stared at the sorting code for hours, tried different variations, rewrote it a couple times, added verbose logging, error checking, and debug dumps.
After about six hours, I finally started looking upstream and eventually found the bug in code before the re-sort is called. I was giving the re-sort routine a bad index to the updated item IIF the item was a file access that was newly created.
The event list is now sorting properly.
The driver has been rock-solid for over two days now. I left it installed for autoload, so it is loading up every time VS12 boots. I can now run VS as my normal user and connect to the driver every time, even if the control app crashes. The driver has not caused any problems, even when the hooks are in place without the control app running. In this case, the log buffer is continuously full and cycling.
===*=*===
This is big news, because popping up the UAC admin dialog is always jarring. Being able to run normally dramatically lowers the annoyance factor for daily use.
I accomplished this by adding a couple options the control what happens during shutdown, then checking for an existing driver before trying to install on startup. I do not need to be an admin to open a handle to a driver that is already loaded.
The new options are:
--fastexit: Exit immediately
--uninstall: Completely uninstall the driver on exit.
--reinstall: Completely reinstall the driver on start.
--nounload: Leave the driver in place on exit.
--autoload: Configure driver to load on boot.
--unhook: Unhook file and net hooks on exit.
To install the driver so that it will always be available for use by normal
users, run once as admin:
> NetMonUI --fastexit --reinstall --autoload --nounload --unhook --driver \path\to\NetMon.sys
This configuration loads the NetMon driver during every boot, but
it remains inactive (no hooks) until NetMonUI runs. I can now run
NetMonUI as a normal user anytime.
To install the driver so that it will only be loaded on demand, run
once as admin:
> NetMonUI --fastexit --reinstall --driver \path\to\NetMon.sys
To start the driver on demand, I need to run as admin:
> NetMonUI --fastexit --nounload --unhook
Now I can run NetMon as a normal user during the current session:
> NetMonUI
This configuration does not load the NetMon driver at boot, the
admin must manually start the driver after each boot.
To uninstall the driver so that it never loads, run as admin:
> NetMonUI --fastexit --uninstall
Crash: IpHelper::SendQueryV() crashed on exit. I am assuming the thread was closed before getaddrinfo() had returned. I need to make sure there are no outstanding address queries when closing.
Links:
OSR IFS FAQ
=============
NetMon has been very stable for the past few days, the Log:BufRead bug seems to have been the guilty party.
I had a new BlueScreen pop up a few minutes after closing NetMonUI:
DRIVER_UNLOADED_WITHOUT_CANCELLING_PENDING_OPERATIONS
I was unable to force WinDbg to load the NetMon symbols, so I have no more information than:
NetMon+2948
The most aggrevating thing about WinDbg is that it seems to be a never-ending headache to load the proper symbols, and it spends so freaking much time loading Microsoft symbols that are usually of little or no interest.
I was able to trace the code, from this slim clue, back to DrvIrpPassThrough(). There must have been an orphaned IRP and NetMon was blamed.
Woot!
I finally found the IRQL_NOT_LESS_THAN BlueScreen bug! It naturally had nothing to do with IRQL. This error code can apparently also be caused by a bad pointer reference, (perhaps) if the memory is valid but paged out. The actual cause was a logic bug in how I was reading log entries when the entry header was split by a buffer wrap-around. Once I rewrote Log:BufRead() I was able to run NetMon with both Net and File logging active while I browsed MSN.com using Internet Explorer 11 (a stress test for anything!).
NetMon is now safe to demo!
I found the bug by loading a crash dump from C:\Windows\MiniDump\ into WinDbg, setting the sympath to the directory containing NetMon.pdb, and reloading NetMon.sys. This then pointed the fickle finger of blame directly at the memcpy() in Log:BufRead(). It only took about five minutes for me to then discover the bug, and another half hour to rewrite the function. This time checking every possible error condition.
This is the original function. The bug is that I assume the LOGHDR will be coherent. But in reality, the buffer wrap-around can happen anywhere, including inside the LOGHDR. This opened the door for a buffer overrun to occur when I try to read pEntry->ByteCt, resulting in a bogus EntrySz and a possible memcpy() beyond the end of pBuf[].
Once again, the "Rule of Fast Computing: If it happens once in a million times, it happens every second."
Here is the fixed version. Now I treat the header no differently than the body and only use ByteCt after I have read the entire LOGHDR.
Interesting note: While NetMon was running in Visual Studio, devenv.exe was scanning through every source file on my system -- including the Linux projects, projects that did not involve Visual Studio in any way. I'm sure Redmond this this is a convenience, but I think it is illicit snooping.
The NetMon project has been gathering dust for over a year...
I spent some time creating an improved scrolling text control. The standard edit control repaints the window after every update, which brings the system to a crawl when NetMon is spewing out a line of text for every file access on the system. The new version only repaints once every 500ms or in response to user input (scrollbar, text selection). This dramatically improves system performance while NetMon is running. The code is in OsUser/TextCtrl.cpp
I am now displaying the process name, which makes the event display much more useful.
This link describes how to extract the command line from a remote process.
I started working on the control application.
Added volume drive letter to the reported file names. Still missing the full path for relative files.
I noticed that svchost.exe was opening connections to Zhengzhou China 1.199.19.184. Hmmm...
I am still struggling with the "IE goes away" bug. Part of the problem is that I do not have a Known Good Reference to start from, and the IFS filter is pretty much "all or nothing" -- it is difficult to disable any part without disabling the whole thing.
I rebuilt the CodeProjects.com FsFilter project, hoping it could serve as my KGR. I based my IfsFilter on this project, so the code is very similar. I had to create a VS13 project to build it, but that went quickly since I had created my step-by-step notes. I was able to build, install, and run FsFilter in under an hour.
Good news: FsFilter does not seem to suffer from the same problem. I am able to install FsFilter, then launch IE without any problems. (I did need to decimate the debug output, printing only 1 of every 256 files opened. Otherwise WinDbg slowed the target down to an unbearably slow pace.) So now I have a KGR that works and I can start a careful comparison of FsFilter and NetMon to find the critical difference.
=====
That was quick! I modified FsFilter by zeroing out MajorFunction[IRP_MJ_CLEANUP], and FsFilter immediately demonstrated the same "IE goes away" problem.
SUCCESS! All I had to do was set the IrpMask registry setting to 0xFFFFFFFF and now IE works while NetMon is running! I was worried that the "Crash on exit" bug would reappear, but that must have been cured by one of the other fixes made in the past few days.
The NetMon driver is now feature complete! I can now focus on the NetMonUI, which is much more straight-forward and can be debugged using VS13. I will need to set up User-Mode remote debugging next.
=8^)
Ran into an interesting crash. IRP_MJ_READ where
pIrp->AssociatedIrp.SystemBuffer was zero even though
pStack->Parameters.Read.Length was 0x1000. The MSDN docs imply
this would never happen if Device.Flags has DO_BUFFERED_IO set.
Ahhh... but I forgot to set DO_BUFFERED_IO when creating the Stream device. My bad. Easy catch with WinDbg! :) Hopefully this will fix the Stream code, which always was a bit flaky.
Another crash. This time it blew up in the TCP driver. There are
some hints that might point to mutex usage.
I caught a BugCheck in WinDbg. It appears LogWrite() is sometimes being called at IRQL DISPATCH_LEVEL, which is an illegal condition for calling KeWaitForMutexObject() with a timeout. I believe that if I am running at DISPATCH_LEVEL or above, I can assume I am not interruptible (except by hardware events, which have nothing to do with me). Also, I know that all mutexes are cleared before a kernel thread is rescheduled. So... I am now checking KeGetCurrentIrql() and only calling KeWaitForMutexObject() if less than DISPATCH_LEVEL.
=====
I tried several alternatives and variations on my mutex code. Eventually I created a version that bypassed everything if IRQL was >= DISPATCH_LEVEL (2). This version did not crash, but it became immediately apparent that my driver was being called a lot at DISPATCH_LEVEL. It is illegal to do any scheduling or waiting at dispatch level. So I abandoned the mutex and replaced them with spin locks. And everything seems to work. I don't know whether to laugh or cry. Live and learn.
The browser (both IE and FF) is still hosed when I enable the file filter, but hopefully I have finally put a stake through the heart of the Stream and Log problems. (Fingers crossed...)
=====
Today's links:
CodeProject.com Kernel Mutexes
OSROnline.com: Use spinlocks
OSROnline.com: FastIoDispatch
AntiGhost: Filter driver project
After I shut everything down last night, I was doing a little bedtime reading on my Kindle when I came across this cylus.org blog post. It talks about setting up a VM-to-VM WinDbg environment. I made a mental note to investigate in the morning.
This morning I awoke early and tried it out. I had to make a couple changes, but it worked! I now have a completely self-contained development environment in VS12 which is able to debug device drivers running in DbgWin764! Yippee!
Within moments of debugging, NetMon popped into WinDbg with a page fault. Normally, this would be a BSoD with the faulting instruction deep inside NT.dll. But since I had WinDbg with full and up-to-date symbols, I could look at the stack and within seconds uncovered a very nasty problem. Can you see it?
I had assumed that UNICODE_STRING.Buffer[] would be a valid null-terminated string. It is not! I must use and rely on UNICODE_STRING.Length, as Buffer[] is not required to have a Zbyte. In this case, the end of the string ran off into unpaged memory. This caused StrLenW() to fault.
The good news: WinDbg paid for itself within moments. With a debugger, this was a trivial bug to find. Without a debugger, it could have had me scratching my head about random crashes for a very long time.
The bad news: I am embarrassed that I made such a fundamentally wrong assumption about UNICODE_STRING.
Qualified Yippee! I am now able to use the integrated WinDbg in VS2013 to debug NetMon. However, this version of WinDbg runs much, much slower than the command-line version. The target system takes many times longer to boot. Also, the output from DbgPrint() is not being drawn to the debugger immediate window properly so I cannot see the debug output from NetMon. On the other hand, it sure is nice to pop straight to the source code!
=====
Now that I am able to use WinDbg from within the VS12 VM, I can abandon the staging arena on the Host system. I no longer need to make the source code visible to the Host. All I need from the Host is a place to exchange the binaries between VS12 and the debug target.
I made a lot of progress over these two days, fixed several bugs, and put myself in a position to be much more efficient in the future. I have not seen a BSoD since this morning. The last(?) remaining show stopper is the mysterious "IE goes away" bug.
=====
Go 49ers!!
With the game tied and 3 seconds left the 49ers kick the winning field goal, which passes through the arms of the defender. The kick was only milliseconds away from being blocked.
=====
OSR Forum
Nynaeve.net
Cylus.org
Yesterday was a day of frustration with WinDbg. With renewed vigor I took another stab at it this morning...
A glimmer of success! I was able to connect using WinDbg from the command line on the host (not using VS12). The problem seemed to be a mismatch between the "debugport" entry in the boot manager. First, I used ComPorts inside the target VM to identify the "physical" COM port available to the guest OS. ComPorts reported a single port, COM1. I then used bcdedit from an administrator DOS prompt to set the following entries to match:
In the target VM settings, there should be a single Serial
Port device:
In the Host system (not in a VM), find WinDbg in the WDK and launch it:
This will sit in the "Waiting to reconnect..." state. Reboot the target, making sure to select the correct boot configuration. If the debug port is correct, the target will pause for about ten seconds after the screen goes black and "Starting Windows" is displayed but before the Windows flag appears. This is the one and only chance WinDbg has to connect to the target. WinDbg will sometimes close itself (crash) at this point, and I will have only a couple of seconds to restart WinDbg to catch the target connection. If the flag appears, I have missed the connection. If all the stars align WinDbg will spew some text, probably complain that the symbols are not correct, and the target will freeze until I release it with the "g" command from WinDbg.
NOTE: VMware seems to move the COM ports around when there are multiple VM's running around. After getting this to work with debugport=1, ComPorts later reported only COM2 available -- while WinDbg was connected to COM1. This may be why it was so incredibly difficult to get the settings right -- I may be chasing a moving target for the configuration. If I am having trouble connecting to the target, the first thing to check is that the COM port reported by ComPorts is the same as the debugport entry in BCD. UPDATE: This may be a red herring; it is not clear whether debugport is actually used.
I tried again to run WinDbg from within the VS12 VM, but it never connected. It appears VM-to-VM connections do not work.
=====
The next task is to load up the proper symbols for NetMon.sys and source code debugging.
WinDbg needs to run on the host machine and it needs access to the symbols and source files for the project. I don't want to create a complete development environment on the Host, I want to use a single, portable environment in VS12. I can't share folders out of a VM without creating a network share, which involves the domain controller (which may not always be accessible).
My solution is to share the source code directory on the host into VS12 and run Visual Studio in the VS12 VM, loading the project from Z:NetMonNetMon.sln, which is physically located on the Host system. I compile everything inside VS12, run WinDbg from the Host, and run NetMon from the target VM.
Host directory: C:SrcnlitedHQDevSBChipWin8NetMon
This
directory is shared into VS12 as "NetMon".
VS12 directory:
Z:NetMon
I copy the binaries to the DbgWin764 VM to the C:NetMon directory and run it from there.
When I break into WinDbg, I can load the NetMon symbols:
SUCCESS! I now have source code for NetMon.sys! Within five minutes after loading the symbols, I found two printf() bugs.
=====
The log corruption is definitely due to re-entrant calls to to Log. I re-enabled the mutex code, reread the WDK docs, and found a mistake in the parameters to KeWaitForMutex(). Once I corrected this, the log corruption errors went away and the overall performance improved dramatically. NetMon is now much more stable than it was this morning.
=====
I still see a problem where Internet Explorer hangs after some time. NetMon is still running and there are no error messages. Once IE goes away it never comes back, even after closing and restarting the program. It seems there is a deadlock in the Services Host process.
I did see "NetMon|ERR ERR[-3]: Filter::CbClassify: EXCEPTION!". NetMon continued to operate without any apparent problems, so the exception handler did its job. Of course, now I have no idea what caused the exception. Now that I have WinDbg running, it is better to crash than to handle the exception.
=====
I need to log the FastIO operations. Access to files that are cached is not being recorded.
=====
Today was a very productive day!
20140103 - Chip - Jan 03 2014
Once again attempting to use WinDbg on Win764...
Visual Studio 2013 is already installed, but not the WDK. I copied the WDK environment, which allowed me to rebuild the drivers. However, the "Driver" menu item is not present nor the kernel transport debugger.
I am installing the WDK from ColdStore W:DiskDevMSDNVS2013_WDKwdksetup.exe
The WDK 8.1 installed seemed to go smoothly, but installed to the
stupid path:
C:Program Files (x86)Windows Kits8.1
DbgWin764 appeared in the "Attach to process" window, but never connected.
I eventually tried to "reprovision" the target, but it stalled at this point:
Then it failed:
After several attempts, the deployment completed without errors, but I was still stuck at "Waiting to reconnect..."
I tried to do a deployment to Cree, a (relatively) virgin target. I logged in as the local Administrator, installed the target setup from ColdStore W:DisksDevMSDNVS2013_WDKInstallersWDK Test Target Setup x64-x64_en-us.msi, then provisioned from VS12.
The setup fails on the "Configuring debuggers (x64)" step. I can watch the target reboot during the installation, so I know network communication is happening. I think the problem is that the x64 debugger is not starting up on the target.
I need to disable the firewall between the Installing .NET and Installing VC redist steps, where it prints "Attempting to connect..."
I am now trying to deploy to a Win8/32bit target...
The installation seemed clean, but the same end result: unable to connect.
20140102 - Chip - Jan 02 2014
I found the cause of the "Still calling IFS filter after unloading" crash. I was decrementing FilterCt twice during the removal process. NetMon now unloads cleanly after installing the IFS filters.
I disabled the WFP filters and IE loaded fine. So the problem is on the WFP side, not on the IFS side. IFS logging seems to be working fine now.
The following sequence works:
There is an eventual crash, but it seems that the "IE stalls" problem may possibly be related to IE starting up for the first time. Possibly creating new devices? I am not sure that the previous statement about WFP/IFS responsibility is true, the positive result may have been due to starting IE first and unrelated to disabling the WFP filter.
The combination that seems to work:
Start and close IE first, set registry entries to disable
logging and the IFS filter, start NetMon, start IE, browse.
I tried starting NetMon first, then IE, and it worked fine. Logging and IFS are still disabled. I tried with logging enabled, and it again worked fine. So I am right back to where I was before I adding the IFS filter.
I tried with nothing disabled, and everything ran OK for about five minutes, then crashed. The blue screen usually reports the crash in the vicinity of NetMon+A5E6.
This tends to point the fickle finger of blame at the IFS filter.
20131222 - Chip - Dec 22 2013
I found the cause of the "Crash on Exit" bug: using the default handler for IRP_MJ_CLEANUP. If I leave pDriver->MajorFunction[IRP_MJ_CLEANUP] unset (0), everything works fine.
Netmon now mostly works, capturing Network and File access.
20131217 - Chip - Dec 17 2013
Trying to compile NetMon on Hopi. Configuration problems again. The WDK8.1 does not include the kernel mode libraries and headers. I have a WDK80 that I think I copied from ChipPro.
Success! After some WDK fun (see below) and installing the nlited driver signing certificate, I rebuilt NetMon and ran it on DbgWin764. Capturing network and file accesses.
=====
For some reason, I am seeing this error:
There is some weird discrepency going on here. Hopindis.h is 13950 lines, while ChipProndis.h is 14185 lines. Both are advertised as WDK 8.0 but one works and the other doesn't. This conspires with Microsoft completely fucking up the integration of the DDK into Visual Studio. It is just a bunch of include files and libraries, it doesn't need to be complicated at all!!! Instead, there are hidden registry settings, directories scattered in esoteric locations (Who was the idiot who came up with "C:Program Files (x86)Windows Kits8.1" for fuck's sake!), and all sorts of hidden incompatibilities between versions. The WDK refuses to install or uninstall on Hopi for some unknown reason.
My only recourse is to zip up the WDK directory from a Known Good Reference machine, unzip it on Hopi, and use that. In fact, the WDK should be distributed as a zipfile -- no installer at all!
After unzipping WDK80_Combined.zip to D:CLDDKWindows80 , (almost) everything built. I need to import my nlited signing certificate.
20131216 - Chip - Dec 16 2013
I am working on KOFY today. I rebuilt the NetMon project and noticed that the GeoDB project failed to link, with "can't find OsUser.lib". I had noticed this before, but since I don't need to run GeoDB I ignored it. Today I decided to figure out what was going on. This led to spending about an hour unwinding the build configuration and WDK compiler constants...
The GeoDB problem was because somehow the configuration manager had mixed x32 and x64 builds together. The configuration manager should look like this:
After I straightened this out, I did a batch build for everything. This failed on the 32bit version of the driver, which led to sorting out the compiler constants...
The compiler constants are a tricky thing to get right when building Windows device drivers. This is where the target version of Windows and the WDK are specified, and must be set correctly for both the build environment and the target system. The "normal" method is to load these constants into the command line options for the project ( Driver > Properties > C/C++ > Preprocessor > Definitions ). This can lead to very long strings that differ only slightly between configurations, but since they are stored as strings any difference causes the whole sequence to be stored separately.
My 32bit build problem was caused by a set of constants being repeated in the project properties page. The first set, which was visible in the edit box, set the target OS correctly. The duplicate set, which was scrolled way off to the right, reset the target OS incorrectly. This was very easy to overlook and very difficult to manage since it is buried deep in the project configuration XML files. The 32bit debug settings were:
KERNEL;_WINDOWS;DBGOUT;_DEBUG;NTDDI_VERSION=0x06010000; STD_CALL;CONDITIONAL_HANDLING;NT_UP;NT_INST=0;WIN32=100;_NT1X_=100; WINNT;_WIN32_WINNT=0x0601;WIN32_LEAN_AND_MEAN;DEVL;FPO;_IDWBUILD; _X86_;i686;STD_CALL;CONDITIONAL_HANDLING;NT_UP;NT_INST=0;WIN32=100; _NT1X_=100;WINNT;_WIN32_WINNT=0x0600;WIN32_LEAN_AND_MEAN;DEVL;FPO; _IDWBUILD
A much better method is to move all those constants to a normal header file (NtBuild.h) and manage it just like any other source file. Then the project settings simplify down to Debug/Release and 32bit/64bit.
The key to making this work is to force NtBuild.h to be included first into every source file. This is set in Driver > Properties > C/C++ > Advanced > Forced include File.
I have finally resumed adding the Installable File System filter, after a bit of winding road through converting the driver to DEVICE_OBJECT - centric and fixing the mysterious "Crash on Exit" bug.
I am copying the example from CodeProject.com to start.
SUCCESS! It took about 3 hours to get it working right, coexisting with all the other devices in the driver. I am now able to see all the file opens. Crashes on exit; this is because I had to enable the default IRP handler for all functions for the IfsFilter.
CRASH! The driver would always crash within a few seconds. I noticed this note on the CodeProject page: "A file system filter driver must have the fast-IO dispatch table. If you’ve forgot to set up the fast-IO dispatch table, it will lead to system crash.".
Fortunately, the FastIO.c code dropped into place mostly untouched. Plus, I would still have crashes after a few seconds. The FastIo code as-is assumed the device was created by the IfsFilter. In my project there are four distinct device types all handled by the same driver, and all share the same IRP handler and FastIO vectors. This is a design error with NT, the IRP and FastIO handlers should be a feature of the Device, not the Driver objects. So I had to add some boilerplate to each FastIo function to first determine whether the device was an IfsFilter device before passing it on.
In order to pass all the IFS functions through, I had to switch back to using a default IRP handler for all functions. This rewoke the "Crash on Exit" bug.
Once I did this, NetMon was stable. All this took until about 5pm, seven hours after beginning to add the IFS code.
20131214 - Chip - Dec 14 2013
NetMon.sys now creates three distinct device objects, Control, Stream, and IfsFilter. The original simple approach of creating the DEVICE_OBJECT as part of the DRIVER_OBJECT was causing problems. So I moved all the device code into the individual classes. Turns out the device management code isn't as complex as it originally seemed to be. So now each object (Control, Stream, IfFilter) creates its own DEVICE_OBJECT within its own context.
This was a rewrite of a significant portion of the NetMon.sys driver. I started around 9:30am and had the first draft done by 1pm. Another 45 minutes for the first complete compile and link. Another 2 hours to get it running again. (It crashes on exit.)
I spent about 3 hours trying to find the "Crash on Exit" bug. I found some problems with inappropriate calls to constructors and destructors; because the memory for the device objects is being allocated inside IoCreateDevice(), calling the class constructors and destructors results in a double-alloc or double-free. Even after fixing these, the driver still crashes on exit. I have narrowed it down to CloseHandle() on the Control file, but I cannot see the bug.
FIXED: The problem was setting IoCreateDevice(DeviceType) to a made-up value (0xCD10). I set it to FILE_DEVICE_UNKNOWN and the driver unloaded without any problems. Apparently 0xCD11 just happened to work, and when I changed it to 0xCD10 it crashes.
UNFIXED: Still crashes. DeviceType has nothing to do with it.
REFIXED: I removed the code that assigned a default IRP handler to every possible code, which was a "helpful hint" from the MSDN docs. I am now just assigning CREATE, CLOSE, DEVICE_CONTROL, and READ -- which is the way I originally wrote it. Then the driver unloads properly. I suppose there is a code that is being sent before or after the CLOSE that the default handler is screwing up. Now that I know what is causing the crash I can log all the unhandled IRPs to find which is causing problems. This bug took about six hours to find.
20131212 - Chip - Dec 12 2013
I don't like the idea of using an INF, so I am trying to create an IFS Filter rather than an IFS MiniFilter.
I am working on Kofy today, which introduces another round of "Find the WDK". On Kofy, the WDK81 libraries are located at D:CLWDK81kmx64 . There is no WHL directory, so I had to change (again) the library search path.
Problem: I can't find a attachment point that covers the entire file system; IFS Filters attach to volumes, not systems. This means I need to either enumerate all the volumes, create a device object for each, then attach. Otherwise, I can follow the samples and wait for volume registration callbacks and attach then. Either way, I seems I need to create a separate device object for each volume. This means a rewrite of NetMon.sys that will touch nearly every part, as I will need to create the device objects from a usage perspective, not a driver perspective.
The alternative is to go back to the MiniFilter and resign myself to using an INF file.
20131211 - Chip - Dec 11 2013
The good news: I have not seen a BSoD since the last build.
I started adding a file operation filter to NetMon. It compiles, links, and runs, but fails to install. At least it did not crash. I believe there is something wrong with my REGISTRATION structure, probably in the Contexts list.
I downloaded the MSDN MiniSpy example. I will need to study it for a bit.
The problem is that the registry entries for the minifilter have not been created. The Microsoft Way uses an INF file to install the driver, which also creates the registry entries. I want to programatically install the driver, which MS is declaring taboo. I believe this is part of their strategy to lock down the installation of malware drivers by forcing the installation through the INF parser.
It looks like I will need to install (at least once) with an INF. So tomorrow's task will be to adapt the INF from the MiniSpy project.
Links:
OSR thread
File systems blog
Minifilter project
EaseFilter.com
Tech Kaizen, good list of links
Another stab at the BSoD...
I read this in the MSDN: Using Automatic Synchronization
Almost all of the code in a framework-based driver resides in event callback functions. The framework automatically synchronizes most of a driver's callback functions, as follows:
The framework always synchronizes general device object, functional device object (FDO), and physical device object (PDO) event callback functions with each other so that only one of the callback functions (except EvtDeviceSurpriseRemoval, EvtDeviceQueryRemove, and EvtDeviceQueryStop) can be called at a time for each device. These callback functions support Plug and Play (PnP) and power management events and are called at IRQL = PASSIVE_LEVEL.
This made me think of two things:
I commented out the mutex calls, and it seems to be running... I left it running overnight, and it is still runnng the next morning.
BSoD: NetMon.sys PAGE_FAULT_IN_NONPAGED_AREA. So sad. This was after running all night and interactively this morning for almost an hour. On the bright side, this is probably a bad (or misused) pointer or buffer overrun. It is (probably) not a synchronization problem.
Finally, a dump with a clue!
Pull up the NetMon.map and look up $59A6 in the Rva+Base column (not Address):
NetMon+59a6 is ... Log::BufWrite() [5880+0126]. Now pull up DriverOutWinx64DebugLog.cod and go to Log::BufWrite+0126:
Aha! There is my offending line! So now I know where it crashed but not why.
The bugcheck code is:
Code | 0x50 | PAGE_FAULT_IN_NONPAGED_AREA |
Referenced | 0xFFFFF81`03C8B050 | pSrc |
Mode | 1 | Write |
IP | 0xFFFF880`03DED9A6 | Log::BufWrite+0126 |
Since `3C8B050 appears in the stack twice, I am assuming this is pSrc. Since it is also the referenced bad address, I am assuming it is the bad pointer. Since a '1' appears in the stack I am assuming it is the Type argument to Log:WriteItem(). The only function that calls WriteItem(Type=LOG_TEXTA) is Log::WriteA(). WriteA() passes pSrc as the buffer returned by TmpBufGet(). Since pSrc is nonzero, it can only be a buffer overrun of TmpBuf::pBuf[]. Probably by a single byte.
Just as a quick hack, I added some padding to the byte count requested in TmpBuf::Get(). I ran this for a few minutes on my naked ChipPro system and it did not crash. I need to let it soak for a while. I can believe this is the source of the crash, because it would happen very rarely (when the end of the requested buffer exactly matches the end of TmpBuf::pBuf[] AND StrFormatA() fills the buffer. But this does not explain why the rep movsb apparently failed on the first byte. (But I don't know the value of ECX, so it could have actually been the last byte? I don't think so, the CPU exception catches the actual address, not the value of the registers.) Still a bit of a mystery...
UPDATE 20140105: This crash was most likely due to the UNICODE_STRING bug.
=====
Following up on a hint about SXS spewing debug crap... I created zipfiles containing the sxs dll's from retail installations. I will try replacing the dll's on WinDebug. The zipfile includes a .bat file to copy the files. PendMoves is the SysInternals utility to schedule file moves during reboot. I need to be able to debug my drivers!
SXS7.zip (Windows7 64bit): [FILE 7032]20131129 - Chip - Nov 29 2013
Chasing the BSoD...
It seems to be triggered by calls to StreamWrite().
My OsMutexAcquire() seems to blow up with "driver attempted a wait from a DPC."
Stream capture is not enabled until a connection is selected in NetMonUI and the command "capture in out mask 255.0.0.0". A debug line in CbStream4() causes all streams to be then captured.
If I comment out the BufRead code I am able to capture streams without any problems. I suspect the crash is happening when there is a collision on the mutex.
Next attempt: Replace the mutex with a spin lock.
I think I may have it working; I have recorded over 100MB and it is still going.... Crash! After recording 148MB. The fix was to use KeInitializeMutex(), KeWaitForMutexObject(), KeReleaseMutex().
I made similar changes to Log.cpp and was able to record 120MB before a BSoD. This bsod was unusual, there was a brief bluescreen message saying "Hardware failure" before the standard bluescreen. I am beginning to wonder if there may be some system instability in the VM when I am thrashing both the network and disk constantly. I will need to run it outside a VM.
Well... that didn't go as expected. :( When I ran NetMon on ChipPro/Win81
Firefox could no longer decode any pages. Somehow I am mangling the data?
FIXED: I needed to not set the actionType
and rights unless I want to block the data. Hmmm...
This tells me that some other filter is modifying the data. NetMon is now
working on Windows 8 (8.1).
20131127 - Chip - Nov 27 2013
NetMon Running on Windows 7 Use FWPM transactions: FwpmTransactionBegin0() FwpmTransactionCommit0() Reordered FwpmCalloutAdd and FwpsCalloutRegister() FwpmCalloutAdd0() first to create the callout. FwpsCalloutRegister0() to enable callout. Flags=0 (FWPM_CALLOUT_FLAG_REGISTERED fails) calloutKey uses same GUID as FwpmCalloutAdd0()I found some WFP documents on the wild and wooly web that helped in making NetMon run on Windows 7.
NetMon now runs on Windows 7.
Trying to install VS2013 remote kernel debugger tools on Win764D. I seem to still be stuck at this screen:
The installer never connected. I clicked Cancel.
VS > Debug > Attach to process > Kernel Transport > WinDbg764. Reboot WinDbg. VS then connects, but now WinDbg764 is running so slow that it is impossible to use. Kernel debugging is horribly broken.
20131124 - Chip - Nov 24 2013
It has been another frustrating day with VS13 and Windows81. I am working on ChipPro2 (Surface Pro) trying to debug NetMon running on VMware/Windows81/64bit/Debug. I resurrected the VM that had been thoroughly hosed by attempting to install the VS13 remote debugging tools. (Won't boot with a BSoD INVALID_BOOT_DEVICE).
I was able to bring the VM back to life using the same product key by performing a "Reset" from the emergency boot mode. This is essentially a new installation. However, every 10 seconds the system locks up for about 40 seconds. This is with nothing running, just boot and go to the desktop. I suspect, but do not know for sure, that this is something to do with being a debug build. My next step will be to create a completely new VM, this time using the Release Windows 8.1 installer.
I eventually gave up on W864D and reverted back to my Win832 (32bit) VM. The first time everything worked fine. The second time, the driver silently refused to load. The only clue was a Windows/System event that reported an error, "Unable to find file." I flailed around a while before I finally tried copying NetMon.sys to C:WindowsSystem32NetMon.sys and invoked NetMonUI --driver C:WindowsSystem32NetMon.sys and voila! the driver loaded.
I am trying to find and fix the BSoD that seems to always eventually occur. I added a try/except block around the Classify handler that should output "EXCEPTION" to the debugger if it triggers. I have been using IE10 on Win832 for over twenty minutes witout any problem. After about 30 minutes a BSoD IRQL_NOT_LESS_OR_EQUAL.
I added IrqlChk() calls around the Device file Open, Ioctl, and Read handlers. There are a small number of entry points to NetMon.sys, and I am running out of ideas about where the crash is occurring. This version ran over 5200 seconds before crashing. I started capturing to a file (CAPTURE IN OUT MASK 255.0.0.0) about 10 minutes before the crash. Note that the error changes, this time it is ATTEMPTED_SWITCH_FROM_DPC.
BUG: The IpHelper resolver seems to get stuck after a while. PendingCt ratchets up to the limit of 100.
20131008 - Chip - Oct 10 2013
20131007 - Chip - Oct 7 2013
20131005 - Chip - Oct 5 2013
20131002 - Chip - Oct 2 2013
20130930 - Chip - Sep 30 2013
20130929 - Chip - Sep 29 2013
20130928 - Chip - Sep 28 2013
20130927 - Chip - Sep 27 2013
20130925 - Chip - Sep 25 2013
20130924 - Chip - Sep 24 2013
20130923 - Chip - Sep 23 2013
20130922 - Chip - Sep 22 2013
20130921 - Chip - Sep 21 2013
20130916 - Chip - Sep 16 2013
20130915 - Chip - Sep 15 2013
20130914 - Chip - Sep 14 2013
20130912 - Chip - Sep 12 2013
20130911 - Chip - Sep 11 2013