Manifest Madness
It seems that manifests were supposed to reduce the amount of dll hell that the average Windows developer experiences. Whilst I am sure this is true, finding information about manifests on the web is a difficult task.
I came across a manifest issue recently that I think is worth describing here precisely due to the lack of information elsewhere.
Disclaimer: most of the opinions on OS behaviour are based on non-exhaustive observations of our particular problem, especially in the absence of any particularly obvious documentation. They may not necessarily be correct.
The Technical Problem.
Quite simply, we have a C++/native command line process (an exe and lots of dlls) built under Visual Studio 2005 on XP and Vista. The release build works as expected on the build machines. However, on a machine that does not have VS2005 installed the application fails to work: it fails silently on the command line. We ship the VC runtime redistributable dlls in the same folder as our exe along with their manifests.
Common wisdom on the web, which is basically solving the problem without actually understanding what the problem is, is to “just install the VC8 redistributable runtime on your machine”. Personally, I cannot abide ‘fixing’ something without understanding what I’ve fixed and why it needed fixing. You almost always find that you reuse the knowledge you have gained.
The Non-Technical Problem.
This is a work problem. At my company, applications are automatically distributed to locked down machines that have no admin access. For purely bureaucratic reasons, rolling out the redistributable package is not an option. Also, I only have one test machine, and therefore have to be very careful with what I install on it, otherwise I have to wait a long time for the machine to be rebuilt.
Preamble
It is obvious that installing VS2005, and most likely the redistributable package, would solve the problem. To diagnose the problem rather than hide it, we therefore need to put only a minimal set of tools on the test machine, in this case one running Windows XP.
At this point, I wasn’t aware of what the problem actually was, so the first stop was our good old friend depends.exe (http://www.dependencywalker.com).
Hint 1: I would suggest that anyone working in a corporate or otherwise locked down environment either ship a set of tools with their application or make them available on an internal network drive.
Once I had loaded the application into depends, it appeared that everything was fine. Later, however, it was obvious that this was not the case. Buried in one of the collapsed nodes was a problem that I did not see.
The next step was to fire up the venerable DrWatson. Straight away it was clear that the application was doing the Windows equivalent of a core dump. The dump itself was apparently pointing to an empty (non-pure) virtual function that returned void. At this point I was still under the misguided impression that the problem could be diagnosed by debugging or core dump analysis.
After some head scratching, I decided to get windbg installed on the test machine. This would have a low impact on the machine configuration as it would not install lots of unnecessary junk in order to work. It did, of course, require admin rights: exactly what you don’t need in a corporate environment.
Firing up the Windows debugger and attaching it to our process, it was clear on the very first run that the problem lay after a particular dll load event. In this instance the debugger was also pointing to a different dll and a different part of the code.
Hint 2: ship .map and .pdb files with your release build, and keep a copy of the build available: source code, pdb, objects etc.
I now started to investigate the loading of the dlls. Next stop was Process Monitor (procmon) from the Sysinternals/Microsoft website. Running this against my process confirmed the loading that I saw in the Windows debugger, with a little more interesting detail, including the attempted load of ‘our.dll.2.manifest’. Whilst this would not appear to be an error, it is what first attracted my attention back to manifest files.
Back to depends.exe; I now tried to load ‘our.dll’ into depends.exe and received the message:
“The side-by-side configuration information for ‘our.dll’ contains errors. This application has failed to start because the application configuration is incorrect. Reinstalling the application may fix the problem.”
How is that for a helpful message? At least it is a start. So the cause of the silent failure of our application is now firmly identified as lying within ‘our.dll’. There were also additional system log messages in the event viewer, including the usual ‘the command completed successfully’ in a reported error.
Debugging manifests
The next stop was google. What does the error mean, and how do I debug it for more information? I then found this “classic” (for all the wrong reasons).
The summary is that you can debug the manifest/sxs but only on Vista.
Yes, that’s right, Windows XP has manifests, but you can’t get a dump of the manifest files. If you know otherwise, leave a comment below.
As luck would have it, I have a Vista box. Unfortunately, it has Visual Studio 2005 installed, and therefore the offending application works on that machine. Giving sxstrace a go didn’t reveal anything I didn’t already expect to see. Running it was a little strange: start a console window, and run
sxstrace trace -logfile:sxstrace.etl
In a second console:
myapp.exe
Wait for it to finish, back to console 1, hit enter to stop the trace, then:
sxstrace parse -logfile:sxstrace.etl -outfile:sxstrace.txt
to make it human readable. This then gives you a summary of the manifests that the dll depends on. Since the trace was taken on a machine where the app worked, it gave no new information, so I went back to depends.exe.
Hint 3: In depends.exe expand all nodes, and switch on the full path option for the dlls.
At this point I noticed that part of the dll in question was referencing another dll that was in turn referencing an out of date VC8 runtime.
This is where the manifest madness begins. As far as I am aware, under Windows 2000 onwards, the search behaviour for a dll named ‘widget.dll’ (for example), is to search the local folder first, except in the presence of manifests or redirection. The behaviour with manifests is to load dlls via their manifest descriptions first.
Hint 4: One way to see the manifest of a dll is to just drag it into Visual Studio, and look at the second resource ordinal.
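If Visual Studio is not to hand, a rough alternative is to pull the manifest text straight out of the binary. The following is a minimal Python sketch and not the method used in this investigation. It assumes the embedded manifest is stored as plain (non UTF-16) XML text inside the file, and the function name manifest_dependencies is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by side-by-side manifests.
ASM_NS = "{urn:schemas-microsoft-com:asm.v1}"

# Assumption: the manifest sits in the file as plain XML text, so a byte
# search for the <assembly> element is enough for a quick look.
MANIFEST_RE = re.compile(rb"<assembly\b.*?</assembly>", re.DOTALL)


def manifest_dependencies(binary):
    """Return (name, version) for each dependent assembly found in the bytes."""
    found = []
    for match in MANIFEST_RE.finditer(binary):
        try:
            root = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # looked like a manifest but is not well-formed XML
        for dep in root.iter(ASM_NS + "dependentAssembly"):
            ident = dep.find(ASM_NS + "assemblyIdentity")
            if ident is not None:
                found.append((ident.get("name"), ident.get("version")))
    return found


# Example use (the file name is hypothetical):
# with open("our.dll", "rb") as handle:
#     for name, version in manifest_dependencies(handle.read()):
#         print(name, version)
```

Seeing the same assembly name listed with two different versions is the kind of symptom described in the next paragraph.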
Consequently our app appeared to be loading two versions of the VC80 runtime. There are two workarounds to this. The first is to rebuild the offending dll against a newer version of the VC runtime.
If you cannot do that, for example because it is a third party dll, then you can try to redirect the loading by creating an ‘ourapp.exe.config’ file. Of course, this didn’t work for us, but we were lucky in that we could recompile the dll. However, the extra manifest entry still appeared to be there.
The reason why the app.exe.config doesn’t work is also interesting. As far as I can find there is no documentation that states this explicitly: application config files can only do binding redirects if the dlls they are redirecting are installed in the WinSxS folder. The behaviour I was hoping for (which may not be correct, since this mechanism covers the .NET world as well), is that the redirect occurs on and to a manifest file and dll in the current working directory of your application.
Back to the problem; the recompiled dll still showed that the second version of the vc runtime was in the manifest. At this point depends was telling us that nothing directly depended on the oldest vc runtime.
So the question then was “where is this extra manifest entry coming from?” In the land of Unix:
find . -type f -print0 | xargs -0 grep -l 50608
will find all files that contain the string ‘50608’, which is part of the version number of the offending vc runtime. Without using such a scatter gun approach, we know that the value can only be coming from a limited number of places: the source code, or something that is linked in. We knew it wasn’t the source code, so that left some static libs.
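On a Windows box without Unix tools, the same scatter gun search can be sketched in Python. This is only an illustration under assumptions: the function name files_containing and the default list of extensions are hypothetical, and it assumes the version string is stored in the binaries as ASCII or UTF-16LE text.

```python
import os


def files_containing(root, needle, extensions=(".lib", ".obj", ".dll", ".exe")):
    """Yield paths under root whose raw bytes contain needle."""
    # Look for the string both as plain ASCII and as UTF-16LE text.
    patterns = (needle.encode("ascii"), needle.encode("utf-16-le"))
    for folder, _dirs, names in os.walk(root):
        for name in names:
            if not name.lower().endswith(extensions):
                continue  # only look at binaries that can be linked or loaded
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                data = handle.read()  # whole-file read; very large files would want chunks
            if any(pattern in data for pattern in patterns):
                yield path


# Example use (the folder name is hypothetical):
# for path in files_containing("third_party", "50608"):
#     print(path)
```

Restricting the search to libs, objects and binaries keeps the output short, since those are the only places the extra manifest entry could be coming from once the source code has been ruled out.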
Dragging each one into notepad finally revealed that one of the static libs that we use was referencing the old vc runtime. This meant that that particular static lib had been built at some point by VS2005 RTM, and not by VS2005 SP1, which the rest of the application was being built with. By some good fortune it was our (not my!) source code, and we resurrected it, rebuilt it, and guess what?
Problem solved!
Summary
After this somewhat lengthy article, the summary is brief: I located the problem to be in ‘our.dll’, and that the problem was in turn due to an old lib built by an older compiler that was being linked in.
It is really disappointing that on Windows XP, at least, there does not appear to be a tool for further diagnosing dll problems such as this, other than a random scattering of blog postings, and poor documentation.
At the very least there should be a tool that lists the manifests of each dll, and where they were found. The manifest tool, in verbose mode, does not do this.
Final note: if you choose to leave a comment, please be helpful and polite. If you have questions about manifests, it’s unlikely I’ll be able to help beyond what I learnt whilst writing this article.
Troubleshooting manifests
This is the list of links that I gradually worked through to figure out what on earth was going on:
http://channel9.msdn.com/forums/TechOff/22266-Side-by-side-screwup/?CommentID=272900
http://blogs.msdn.com/jreddy/archive/2005/12/23/troubleshooting-c-c-isolated-applications-and-side-by-side-assemblies-scenario-based-with-solutions.aspx
http://blogs.msdn.com/dsvc/archive/2008/08/07/part-2-troubleshooting-vc-side-by-side-problems.aspx