hacking with the clr: diffing assemblies
26 Apr 2009 13:08:00 EST
I decided to look into exactly what changes when you compile a .NET assembly. For reference, here are the assembly attributes and the program I used:
[assembly: AssemblyTitle("ConsoleApplication1")]
[assembly: AssemblyDescription("")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("ConsoleApplication1")]
[assembly: AssemblyCopyright("Copyright 2009")] 
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")]
[assembly: ComVisible(false)]
[assembly: Guid("34ea927d-9b44-48c0-8d82-38d5a836c447")]
[assembly: AssemblyVersion("1.0.0.0")]
[assembly: AssemblyFileVersion("1.0.0.0")]

static void Main(string[] args)
{
    Console.WriteLine("Hello World");
}

So, easy enough to figure out: .NET embeds a timestamp in the DLL recording when it was built.
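The build timestamp can also be read without a hex viewer. Here is a minimal Python sketch, assuming the timestamp in question is the standard `TimeDateStamp` field of the PE/COFF file header (the function name `pe_timestamp` is made up for illustration):

```python
import struct
from datetime import datetime, timezone


def pe_timestamp(data):
    """Return the COFF header TimeDateStamp of a PE image as a UTC datetime."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The DOS header stores the file offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # COFF header layout after the signature: Machine (2 bytes),
    # NumberOfSections (2 bytes), then the 4-byte TimeDateStamp.
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)
```

Calling it on the bytes of two builds of the same project (for example `pe_timestamp(open(path, "rb").read())`) should show two different times, seconds apart.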

And of course every new build gets a separate GUID. Here are the GUIDs in two different DLLs:
[Image: GUIDs changing between DLLs]

And here is the left GUID in the corresponding PDB file.
[Image: GUID corresponding with PDB]

The assembly gets a GUID every time you build, which is matched in the PDB. So if you throw away the PDB, you can't ever debug that assembly, because no other PDB will ever match it... (except when that's not true; more notes on this later)
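The layout described here (a 16-byte GUID next to a path to the PDB) looks like the CodeView "RSDS" debug record: the ASCII signature `RSDS`, a 16-byte GUID, a 4-byte little-endian "age", and a NUL-terminated PDB path. Treat that identification as an assumption about these particular files. The sketch below also takes a shortcut: it searches for the signature instead of walking the PE debug directory, so it is a heuristic, not a proper parser. The name `find_rsds` is invented for this example.

```python
import struct
import uuid


def find_rsds(data):
    """Return (guid, age, pdb_path) from the first RSDS record, or None."""
    pos = data.find(b"RSDS")
    if pos < 0 or len(data) < pos + 24:
        return None
    # GUIDs are stored with their first three fields little-endian.
    guid = uuid.UUID(bytes_le=data[pos + 4:pos + 20])
    (age,) = struct.unpack_from("<I", data, pos + 20)
    end = data.find(b"\0", pos + 24)
    if end < 0:
        end = len(data)
    pdb_path = data[pos + 24:end].decode("utf-8", errors="replace")
    return guid, age, pdb_path
```

Running this over two builds and comparing the returned GUIDs is the scripted equivalent of eyeballing the highlighted bytes in the diff.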

Right by the GUIDs I found a second timestamp, outlined here in blue:
[Image: Second timestamp in the DLL]

Then I found the ModuleDef. It's an additional GUID that applies only to this assembly, this build. It changes every time, and its purpose is to allow the assembly to be renamed while still keeping track of versioning info. (However, this is advised against.)
[Image: ModuleDef - an always unique GUID]
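The ModuleDef GUID (usually called the MVID) lives in the CLI metadata, in the `#GUID` heap. The following is a rough sketch for pulling it out. It assumes the metadata root can be found by searching for its `BSJB` signature, and that the module's MVID is the first entry in the `#GUID` heap. That is typical for compiler output but is not checked against the Module table here. `read_mvid` is a made-up helper name.

```python
import struct
import uuid


def read_mvid(data):
    """Return the first GUID in the #GUID heap (assumed to be the MVID), or None."""
    root = data.find(b"BSJB")  # metadata root signature
    if root < 0:
        return None
    # Root layout: signature(4) major(2) minor(2) reserved(4) length(4) version(length)
    (version_len,) = struct.unpack_from("<I", data, root + 12)
    pos = root + 16 + version_len
    _flags, stream_count = struct.unpack_from("<HH", data, pos)
    pos += 4
    for _ in range(stream_count):
        offset, size = struct.unpack_from("<II", data, pos)
        name_end = data.index(b"\0", pos + 8)
        name = data[pos + 8:name_end]
        if name == b"#GUID" and size >= 16:
            start = root + offset  # stream offsets are relative to the root
            return uuid.UUID(bytes_le=data[start:start + 16])
        # The name (with its terminator) is padded to a multiple of 4 bytes.
        name_len = name_end - (pos + 8) + 1
        pos += 8 + ((name_len + 3) & ~3)
    return None
```

Two builds of identical source should return different values, which is the "always unique" behaviour seen in the diff.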

The final, most interesting part I found is a single byte shown in earlier photos but indicated here:
[Image: Single byte highlighted]

After playing around a little bit, I discovered that it incremented as I built the assembly multiple times in Visual Studio. It would sometimes reset to 1, however: issuing a Rebuild or a Clean reset it, but closing Visual Studio and reopening it did not. A quick while loop and an rm while holding down F6 confirmed that at least 2 bytes are reserved for this counter, and it seems likely all four bytes (the high two or three being 00s) are reserved for it. (I also switched to a different diff viewer because the first was giving me problems.)
[Image: Single byte rolling over. Probably an entire word dedicated to this]
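If this counter is what the CodeView debug record calls its "age" (an assumption, although its position right after a 16-byte GUID fits), it would be a 32-bit little-endian integer. A tiny sketch of how such a value appears in a hex view as it passes 255, using a hypothetical helper `as_dword_bytes`:

```python
import struct


def as_dword_bytes(counter):
    """Render a counter the way a hex viewer shows a little-endian 32-bit value."""
    return " ".join(f"{b:02X}" for b in struct.pack("<I", counter))


for build in (1, 2, 255, 256, 257):
    print(build, "->", as_dword_bytes(build))
```

The low byte wraps to `00` and the next byte becomes `01`, which is the rollover pattern described above.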

As I was messing around with this, I noticed something odd. Sometimes the PDB-GUID wouldn't change! Here you can see the 01 to 02 increment, and the timestamp, but the 16 bytes right before that single incrementing byte are the same!
[Image: The PDB GUID does not always change!]

I triggered a rebuild, which reset the counter, and the PDB-GUID did indeed change:
[Image: The PDB GUID does change when the counter resets]

I wanted to look into this, so the next step was to take the most complex part of the equation out: Visual Studio. I want to know exactly what Visual Studio is doing when it compiles, and how it differs from build to build. I know the files aren't changing. Let's look at the command it's running:

E:\WINDOWS\Microsoft.NET\Framework\v3.5\Csc.exe /noconfig /nowarn:1701,1702 /errorreport:prompt /warn:4 
/define:DEBUG;TRACE /reference:"E:\Program Files (x86)\Reference 
Assemblies\Microsoft\Framework\v3.5\System.Core.dll" /reference:"E:\Program Files (x86)\Reference 
Assemblies\Microsoft\Framework\v3.5\System.Data.DataSetExtensions.dll" 
/reference:E:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.Data.dll 
/reference:E:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.dll 
/reference:E:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.Xml.dll /reference:"E:\Program Files 
(x86)\Reference Assemblies\Microsoft\Framework\v3.5\System.Xml.Linq.dll" /debug+ /debug:full 
/filealign:512 /optimize- /out:obj\Debug\ConsoleApplication1.exe /target:exe 
Program.cs 
Properties\AssemblyInfo.cs

That's a whole lot. And there's no difference between two consecutive runs. Nothing that indicates the counter should be incremented or anything! Not that I don't trust Visual Studio, but I went and ran this from the command line myself.

I immediately got suspicious because the exe and pdb were being output to a different directory. So VS was definitely doing some shuffling behind my back. And I was right. VS is doing some funny stuff to keep the PDB and Assembly GUIDs the same. Here we can see that two subsequent runs from the command line will produce counters at 01 and different PDB-GUIDs.
[Image: Different GUIDs from the CLI]
Later on, I'm going to have to look into what exactly Visual Studio is doing to keep that PDB the same while incrementing the counter...

Moving on, the strangest part of all of this is that if you edit the source file, VS will not necessarily change the PDB-GUID. Here is a diff showing two different Hello World programs outputting slightly different string literals, but with the same PDB-GUID.
[Image: Two different string literals - same PDB GUID]

The string literal appears right before the strange 16 byte random value.

This is all interesting, but now I want to know how it works when you compile the program on a whole other machine! You can see some path information buried in the DLL already; I want to see if anything else changes.

Because of differing string literal sizes and some shuffling, the AddressOfEntryPoint changed in the PE header, as well as some other information.
[Images: The Address of Entry gets changed; Differing PE Headers; Different Data Table sizes]
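To see which header fields moved without squinting at a hex diff, a few values can be read straight out of the PE optional header. This is only a sketch: the three fields below were picked as plausible candidates and are not a record of exactly what differed in the screenshots, and the function names are invented for illustration.

```python
import struct

# (field name, offset from the start of the optional header)
OPTIONAL_HEADER_FIELDS = [
    ("SizeOfCode", 4),
    ("SizeOfInitializedData", 8),
    ("AddressOfEntryPoint", 16),
]


def optional_header_fields(data):
    """Read a few 32-bit fields from the PE optional header."""
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    opt = pe_offset + 24  # 4-byte signature + 20-byte COFF header
    return {
        name: struct.unpack_from("<I", data, opt + offset)[0]
        for name, offset in OPTIONAL_HEADER_FIELDS
    }


def header_differences(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    fields_a = optional_header_fields(a)
    fields_b = optional_header_fields(b)
    return {k: (fields_a[k], fields_b[k]) for k in fields_a if fields_a[k] != fields_b[k]}
```

An empty result from `header_differences` means the two images line up, at least for these fields.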

At this point I knew too much would be thrown off, and I needed to take string literals out of the equation. I whipped out the handy subst command and made my drive letters and directories line up. With the PE headers in sync, I was in business to see the machine-specific information embedded in these DLLs. To help orient you, I've circled the PDB-GUID in blue, and the timestamp in red.
[Image: Difference between two different machines - timestamp in red, PDB-GUID in blue]
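Once the two files are the same length and aligned, the comparison itself is simple. A minimal sketch of a byte-level diff that reports each contiguous run of differing bytes (this is not the diff viewer used for the screenshots, just an illustration of the idea):

```python
def diff_ranges(a, b):
    """Return a list of (offset, length) for each run of differing bytes."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length for an aligned diff")
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            ranges.append((start, i - start))
            start = None
    if start is not None:
        ranges.append((start, len(a) - start))
    return ranges
```

Each returned range can then be matched against a known structure (timestamp, GUID, counter) until every difference is accounted for.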

The bottom, single-byte change is the difference between a lowercase and capital T. The top change is also easy to figure out: that's the GUID for the assembly, from up top.

[assembly: Guid("34ea927d-9b44-48c0-8d82-38d5a836c447")]

It may be worth noting that if you do not specify this attribute, it is not added for you automatically.

I was almost expecting some sort of compilation or operating system information to be embedded. The original computer was running Windows x64 Pro while the second was running XP Home 32-bit. But every single difference was accounted for!

So every assembly has:

  1. A Timestamp, in two locations
  2. A GUID that matches the PDB
  3. What appears to be a completely random GUID generated every compile.
  4. A counter indicating which build of the assembly this is, incremented only by subsequent Visual Studio builds.

comparing debug and release mode

But I wanted to know more. I wanted to know what the difference between Debug and Release mode is. Part of what release mode does is turn on optimizations, things like unrolling loops, inlining functions, hoisting, tail recursion, and others I'm sure (much of that happens in the JIT rather than in the compiled IL). But a program this simple doesn't have anything to optimize.

A recurring theme in this diff is offset problems. Due to string lengths (/Debug/ vs /Release/ for example) bytes would be offset and we wouldn't get a very telling diff. So I did a little behind-the-scenes futzing with string lengths to help things line up. And ultimately, there's not a whole lot of difference, structure-wise, between Debug and Release for a program this small. Here we have the timestamp, but more importantly we have a difference in the size of the first entire section.
[Image: Timestamp and length difference]

Let's look at the GUID area. Here we can see the ModuleDef GUID at the top in purple, the timestamp in green, the PDB-GUID in red, and the counter in blue. There are a few other bytes hanging around that are different as well.
[Image: GUIDs, timestamp, and counter]

We can identify two of those bytes by diffing the generated IL from ILDasm: the 02 00 and 07 01 are values for the DebuggableAttribute.
[Image: Debug Flag and Attribute]
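Those two byte pairs decode cleanly if they are read as the low bytes of a little-endian `DebuggableAttribute.DebuggingModes` value. A small sketch, using the flag values of the .NET `DebuggingModes` enum (the helper name `decode_debugging_modes` is made up here):

```python
# Flag values from System.Diagnostics.DebuggableAttribute.DebuggingModes
DEBUGGING_MODES = {
    0x001: "Default",
    0x002: "IgnoreSymbolStoreSequencePoints",
    0x004: "EnableEditAndContinue",
    0x100: "DisableOptimizations",
}


def decode_debugging_modes(raw):
    """Decode little-endian attribute bytes into a list of flag names."""
    value = int.from_bytes(raw, "little")
    return [name for bit, name in DEBUGGING_MODES.items() if value & bit]


print("02 00 ->", decode_debugging_modes(bytes([0x02, 0x00])))
print("07 01 ->", decode_debugging_modes(bytes([0x07, 0x01])))
```

Read this way, `02 00` carries only IgnoreSymbolStoreSequencePoints, while `07 01` (0x0107) adds Default, EnableEditAndContinue and DisableOptimizations, which is what you would expect of a Release and a Debug build respectively.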

This is the really cool part: the actual instructions. We can identify two of the missing bytes by looking at the difference in the IL between the Debug (right) and Release (left) builds. The Debug IL puts nops in at braces, so you can place breakpoints on them. The nops are missing in the Release version.
[Image: Missing No-Ops]
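To make the two-byte difference concrete, here is a toy disassembler that knows only the handful of IL opcodes a Hello World needs. The opcode values are the standard ones (`nop` = 0x00, `ldarg.0` = 0x02, `call` = 0x28, `ret` = 0x2A, `ldstr` = 0x72), but the method bodies and metadata tokens below are invented for illustration and are not bytes copied from the screenshots.

```python
import struct

# opcode byte -> (mnemonic, operand size in bytes); deliberately incomplete
OPCODES = {
    0x00: ("nop", 0),
    0x02: ("ldarg.0", 0),
    0x28: ("call", 4),
    0x2A: ("ret", 0),
    0x72: ("ldstr", 4),
}


def disassemble(il):
    """Turn a raw IL byte string into a list of mnemonic strings."""
    out = []
    pos = 0
    while pos < len(il):
        op = il[pos]
        if op not in OPCODES:
            raise ValueError(f"opcode 0x{op:02X} at offset {pos} is not in this tiny table")
        name, size = OPCODES[op]
        pos += 1
        if size:
            (token,) = struct.unpack_from("<I", il, pos)  # little-endian token
            out.append(f"{name} 0x{token:08X}")
            pos += size
        else:
            out.append(name)
    return out


# Hypothetical Main bodies; the tokens are placeholders.
DEBUG_MAIN = bytes.fromhex("00 72 01 00 00 70 28 11 00 00 0A 00 2A")
RELEASE_MAIN = bytes.fromhex("72 01 00 00 70 28 11 00 00 0A 2A")

print(disassemble(DEBUG_MAIN))
print(disassemble(RELEASE_MAIN))
```

The Debug listing has a `nop` after the opening brace and another before `ret`. Dropping both accounts for exactly two bytes.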

So we're basically done, right? Let's list all the diffs:

  1. Timestamp
  2. no-ops
  3. ModuleDef GUID
  4. Debug Attribute
  5. Second Timestamp
  6. PDB-GUID
  7. Directory Difference
  8. Several 4 Byte Offsets (DataDirectory.Debug, SizeOfData, AddressOfRawData, PointerToRawData, DataDirectory.MetaData)

At this point you should be wondering about 4 byte offsets. We only had 2 bytes from nops! But all the offsets are 4 bytes:
[Image: The Four-Byte Offsets]

And now you're going to get mad at me. Right here you can see two extraneous bytes in red. In blue are the no-ops from before.
[Image: Two extraneous bytes (red) with brace-nops in blue]

For all the other diffs above, I deleted those bytes. They were throwing everything off! I couldn't diff properly! I was pretty sure they were no-ops. But to verify that, I needed to figure out the bytes preceding them. I painfully searched each opcode on MSDN and came up with this:
[Image: The op codes translated to IL]

So the verdict I came up with was that they were nops being inserted, probably for the compiler-generated constructor of the Program class.
