For the past year and change I've been working with dkg and Linus Nordberg on Certificate Transparency Gossip. I'll assume you're familiar with Certificate Transparency (you can read more about it here.) The point of CT Gossip is to detect Certificate Transparency logs that have misbehaved (either accidentally, maliciously, or by having been compromised.)
The CT Gossip spec is large, and complicated - perhaps too complicated to be fully implemented! This blog post is not about an overview of the specification, but rather about a nuanced problem we faced during the development - and why we made the decision we made. I'll take this problem largely into the abstract - focusing on the difficulty of providing protections against an intelligent adversary with statistics on their side. I won't reframe the problem or go back to the drawing board here. I imagine someone will want to, and we can have that debate. But right now I want to focus on the problem directly in front of us.
In several points of the Gossip protocol an entity will have a bucket of items. We will call the entity the 'server' for simplicity - this is not always the case, but even when it is the web browser (a client), we can model it as a server. So the server has a bucket of items and a client (who will be our adversary) can request items from the bucket.
The server will respond with a selection of items of its choosing - which items and how many to respond with are choices the server makes. The server also chooses to delete items from the bucket at a time and by a policy of the server's choosing.
What's in the bucket? Well by and large they are innocuous items. But when an adversary performs an attack - evidence of that attack is placed into the bucket. The goal of the adversary is to 'flush' the evidence out of the bucket such that it is not sent to any legitimate clients, and is only sent to the adversary (who will of course delete the evidence of their attack.) Besides requesting items from the bucket, the attacker can place (innocuous) items into the bucket, causing the bucket to require more storage space.
The adversary can create any number of Sybils (or fake identities) - so there's no point in the server trying to track who they send an item to in an effort to send it to a diversity of requestors. We assume this approach will always fail, as the adversary can simply create false identities on different network segments.
Similarly, it's not clear how to distinguish normal client queries from an adversary performing a flushing attack. So we don't make an effort to do so.
Our goal is to define policies for the 'Release' Algorithm (aka 'which items from the bucket do I send') and the 'Deletion' Algorithm (aka 'do I delete this item from the bucket') such that an attacker is unsure about whether or not a particular item (evidence of their attack) actually remains in the bucket - or if they have successfully flushed it.
This problem is tantalizingly close to existing problems that exist in mix networks. Perhaps the best treatment of the flushing attack, and how different mixing algorithms resist it, is From a Trickle to a Flood from 2002.
But as intimated - while the problem is close, it is not the same. In particular, when (most | deployed) mix networks release a message, they remove it from the server. They do not retain it and send a duplicate of it later. Whereas in our situation, that is absolutely the case. This difference is very important.
The second difference is the attacker's goal. With Mix Networks, the attacker's goal is not to censor or discard messages, but instead to track them. In our model, we do want to eliminate messages from the network.
Defining The Attacker
So we have defined the problem: Server has a bucket. Attacker wants to flush an item from the bucket. How can we make the attacker unsure if they've flushed it? But we haven't defined the capabilities of the attacker.
To start with, we assume the attacker knows the algorithm. The server will draw random numbers during it, but the probabilities that actions will be taken are fixed probabilities (or are determined by a known algorithm.)
If we don't place limits on the attacker, we can never win. For example, if the attacker is all-powerful it can just peek inside the bucket. If the attacker can send an infinite number of queries per second - infinity times any small number is still infinity.
So we define the costs and limits. An attacker's cost is time and queries. They need to complete an attack before sufficient clock time (literally meaning hours or days) elapses, and they need to complete the attack using less than a finite number of queries. This number of queries is actually chosen to be a function of clock time - we assume the attacker has infinite bandwidth and is only gated by how quickly they can generate queries. We also assume the attacker is able to control the network of the server for a limited period of time - meaning they can isolate the server from the internet and ensure the only queries it receives are the attacker's. (Not that the server knows this of course.)
The defender's cost is disk space. With infinite disk space, the defender can win - we must design a mechanism that allows the defender to win without using infinite disk space.
An attacker WINS if they can achieve ANY of these three objectives:
- Determine with certainty greater than 50% whether an item remains in the opponent's bucket while sending fewer than M queries to the opponent.
- Determine with certainty greater than 50% whether an item remains in the opponent's bucket before N amount of time has past
- Cause the defender to use more than O bytes of storage.
M is chosen to be a number of queries that we consider feasible for an attacker to do in a set period of time. N is chosen to be long enough that sustaining the attack represents undue political or technical burden on an adversary. O is chosen to be a disk space size large enough that client developers or server operators are scared off of deploying Gossip.
Those numbers are pretty well established, but what about N and O? Basically, we can only make a 'good guess' about these. For example, sustaining a BGP hijack of Twitter or Facebook's routes for more than a short period of time would be both noticeable and potentially damaging politically. TLS MITM attacks have, in the past, been confined to brief period of time. And O? How much disk space is too much? In both cases we'll have to evaluate things in terms of "I know it when I see it."
An Introduction to the Statistics We'll Need
Let's dive into the math and see, if we use the structure above, how we might design a defense that meets our 768-million mark.
It turns out, the statistics of this isn't that hard. We'll use a toy example first.
- When I query the server, it has a 10% chance of returning an object, if it has it - and it performs this 10% test for each item. (You'll note that one of the assumptions we make about the 'Retrieval Algorithm' is that is evaluates each item independently.)
Thanks to the wonder of statistics - if it never sends me the object, then is no way to be certain it does not have it. I could have just gotten really, really unlucky over those umpteen million queries.
But the probability of being that unlucky, of not receiving the object after N queries if the server has it - that can be calculated. I'll call this, colloquially, being 'confident' to a certain degree.
How many queries must I make to be 50% confident the server does not have an object? 75%? 90%?
- Assume the server has the item. The probability of not receiving the item after one query is 90%.
- After two queries: 90% x 90% or 81%. Successive multiplications yield the following:
- ~59% chance of not receiving the item after 5 queries
- ~35% chance of not receiving the item after 10 queries
The equation is a specific instance of the Binomial Probability Formula:
F(n) = nCr * p^r * q^(n-r) nCr is the 'n choose r' equation: n! / (r! * (n-r)!) p is the probability of the event happening (here .1) r is the number of desired outcomes (here it is 0 - we want no item to be returned) q is the probability of the event not happening (here 1 - .1 or .9) n is the number of trials
Our equations can be checked:22 queries to be 90% confident the server does not have the item.
Also worth noting is that equation can be thankfully simplified. Because r is 0, we only need to calculate q^(n) - which matches our initial thought process.
Going Back to the 768 Million
So here's what to do with this math: I can use this method to figure out what the probability of sending an item will need to be, to defend against an attacker using the definition of winning we define above. I want .50 = q^(768million). That is to say, I want, after 768 Million queries, an attacker to have a 50% confidence level that the item does not remain in the bucket.
Now it just so happens that Wolfram Alpha can't solve the 768-millionth root of .5, but it can solve the 76.896 millionth root of .5 so we'll go with that. It's .99999999098591.
That is to say, to achieve the 50% confidence interval the probability of sending an item from the bucket needs to be about .00000009%.
Do you see a problem here? One problem is that I never actually defined the defender having the goal of ever sending an item! At this probability, an item has a 50% of being sent after about 50 million requests. I don't know how long it takes Google to reach the number of visits - but realistically this means the 'evidence of attack' would just never get shared.
So.... Send it more frequently?
This math, sending it so infrequently, would surely represent the end game. In the beginning, surely we would send the item more frequently, and then the more we send it, the less often we would send it. We could imagine it as a graph:
| | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x +-------------------------------------------------------------------------------
But the problem, remember, is not just figuring out when to send the item, but also when to delete it.
Consider Deleting After Sending?
Let's imagine a simple deletion algorithm.
- The server will 'roll for deletion' after sending the item to a client who requests it.
- The likelihood of deletion shall be 1%.
Now recall in the beginning, after an item is newly placed into the bucket, it shall be sent with high probability. Let's fix this probability at a lowly 40%, and say this probability applies for the first 500 times it is sent. What is the probability that an item has been deleted by the 500th response? It is 99%. And how many queries are needed on average by the attacker to have the item returned 500 times at 40% probability of sending? It is (thanks to some trial and error) 1249.
What this means is that an attacker who sends on average 1249 queries in the beginning (right after the evidence of the attack goes into the bucket) can be supremely confident that the item has been deleted.
Then, the attacker sends more queries - but far fewer than the 768-million figure. If the item is not returned in short order, the attacker can be very confident that the item was deleted. This is because at the top of that curve, the likelihood of receiving the item quickly is very good. When the item doesn't appear quickly, it's either because the attacker hit a .000000001% chance of being unlucky - or it's because the item was deleted.
'Rolling for deletion' after an item is sent is a poor strategy - it doesn't work when we want to send the item regurally.
A Deletion Algorithm That May Work
- The server will 'roll for deletion' every hour, and the odds of deleting an item are... we'll say 5%.
We can use the Binomial Probability Formula, again, to calculate how likely we are to delete the item after so many hours. It's 1 minus the probability of the deletion not occurring, which is .95num_hours
- 51% chance of deletion after 14 hours.
- 71% chance of deletion after 24 hours
- 91% chance of deletion after 48 hours
If we use a rough yardstick of 'Two Days' for the attacker's timeframe (with deletion rolls once an hour) to yield a 50% confidence level, the equation becomes .50 = q^48 or a 1.4% chance of deletion.
But What About Uncertainty!
If you're following along closely, you may have realized a flaw with the notion of "1.4% chance of deletion every hour." While it's true that after 2 days the probability an item is deleted is 50%, an attacker will be able to know if it has been deleted or not!
This is because the attacker is sending tons of queries, and we already determined that trying to keep the attacker in the dark about whether an item is 'in the bucket' requires such a low probability of sending the item that it's infeasible. So the attacker will know whether or not the item is in the bucket, and there's a 50% chance (that the attacker cannot influence) of it being deleted after two days.
This not ideal. But it seems to the best tradeoff we can make. The attacker will know whether or not the evidence has been erased, but can do nothing to encourage it to be erased. They merely must wait it out.
But what About Disk Space?
So far what we've determined is:
- A deletion algorithm that is based on how often the server sends the item won't work.
- A deletion algorithm that is based on time seems like it will work...
But we haven't determined how much disk will be used by this algorithm. To calculate this number, we must look at the broader CT and CT Gossip ecosystem.
We store two types of data STHs, and [SCTs+Cert Chains]. These are stored by both a Web Browser and Web Server. STHs and SCTs are multiplied by the number of trusted logs in the ecosystem, which we'll place at '20'. We'll make the following size assumptions:
- The size of a SCT is ~120 bytes.
- The size of a STH is ~250 bytes.
- A certificate chain is 5KB.
- But a disk sector is 4KB, so everything is 4KB, except for the chain which is 8KB. (Note that this is 'naive storage'. It doesn't include any associated counters or metadata which would increase size, nor does it include more efficient storage mechanisms which would decrease size.)
A server's SCT Store will be limited by the number of certificates issued for the domains it is authoritative for multiplied by the number of logs it trusts. Let's be conservative and say 10,000 certs. ((10000 SCTs * 4 Kb * 20 logs) + (10000 Cert Chains * 8kb)) / 1024 Kb/Mb = 860MB. That's a high number but it's not impossible for a server.
A server's STH store could in theory store every active STH out there. We limit Gossip to STHs in the past week, and STHs are issued on average once an hour. This would be (20 logs * 7 days * 24 hours * 4 Kb) / 1024 Kb/Mb = 13.1MB and that's quite reasonable.
On the client side, a client's STH store would be the same: 13.1MB.
Its SCT store is another story though. First, there is no time limit for how long I may store a SCT. Secondly, I store SCTs (and cert chains) for all sites I visit. Let's say the user has visited 10000 sites, each of which have 3 different certificates with 10 SCTs each. That's ((10000 Sites * 3 Cert Chains * 8 Kb) + (10000 Sites * 3 Certificates * 10 SCTs * 4 Kb)) / 1024 Kb/Mb) / 1024 Mb/Gb = 1.4 GB. On a client, that's clearly an unacceptable amount of data.
Deleting Data From the Client
So what we want to solve is the disk-space-on-the-client problem. If we can solve that we may have a workable solution. A client whose SCT Store is filling up can do one, or more, of the following (plus other proposals I haven't enumerated):
- Delete data that's already been sent
- Delete new, incoming data (freeze the state)
- Delete the oldest data
- Delete data randomly
I argue a mix of the the first and last is the best. Let's rule out the middle two right away. These are purely deterministic behavior. If I want to 'hide' a piece of evidence, I could either send it, then fill up the cache to flush it, or flood the cache to fill it up and prevent it being added.
On its face, deleting data at random seems like a surefire recipe for failure - an attacker performs an attack (which places the evidence item in the bucket), then floods the bucket with new items. Once the bucket if full, the probability of the the evidence item being deleted rises with each new item placed in. (With a 30,0000 item cache, the odds of evicting a particular item is 50% after 51,000 queries - 30,000 queries to fill it and 21,000 to have a 50% chance of flushing it.) These numbers are far short of 768-million query figure we wish to protect ourselves against.
Deleting data that's already been sent is a good optimization, but does not solve the problem - if an attacker is flooding a cache, all of the data will be unsent.
We seem to be sunk. In fact - we were unable to come to a generic fix for this attack. The best we can do it make a few recommendations that make the attack slightly more difficult to carry out.
- Aggressively attempt Inclusion Proof Resolution for SCTs in the cache. If the SCT is resolved, discard the SCT and save the STH. If this particular SCT is not resolved, but others are, save this SCT. If all SCT resolution fails, take no special action.
- Prioritize deleting SCTs that have already been sent to the server. If a SCT has been sent to the server, it means it has been sent over a connection that excludes that SCT. If it was a legit SCT, all is well (it's been reported). If it was a malicious SCT - either it's been reported to the legitimate server (and ideally will be identified) or it's been reported to an illegitimate server necessitating a second, illegitimate SCT we have in our cache.
- In the future, it may be possible for servers to supply SCTs with Inclusion Proofs to recent STHs; this would allow clients to discard data more aggressively.
The final recommendation is therefore:
- Servers and Clients will each store valid STHs without bound. The size needed for this is a factor of the number of logs and validity window (which is one week). The final size is manageable, under 20MB with naive storage.
- Servers will store SCTs and Certificate Chains without bound. The size needed for this is a factor of the number of certificates issued for domains the server is authoritative for, and the number of logs. The final size is manageable for most servers (under 1GB with naive storage) and can be reduced by whitelisting certain certificates/SCTs to discard.
- Clients will store SCTs and Certificate Chains in a fixed-size cache of their choosing, employ strategies to make flushing attacks more difficult, but ultimately remain vulnerable to a persistent flushing attack.
Recently I wanted to run a complex query across every certificate in the CT logs. That would obviously take some time to process - but I was more interested in ease-of-execution than I was in making things as fast as possible. I ended up using a few tools, and writing a few tools, to make this happen.
- Catlfish is a CT Log server that's written by a friend (and CT Gossip coauther). I'm not interested in the log server, just the tools - specifically fetchallcerts.py to download the logs.
- fetchallcerts.py requires the log keys, in PEM format. (Not sure why.) Run this tool to download all the logs' keys.
- fetchallcerts.py only works on one log at a time. A quick bash script will run this across all logs.
With these tools you can download all the certificates in all the logs except the two logs who use RSA instead of ECC. (That's CNNIC and Venafi.) They come down in zipfiles and take up about 145 GB.
Now we need to process them! For that you can use findcerts.py. The way this script works is using python's multiprocessing (one process per CPU) to process a zipfile at a time. It uses pyasn1 and pyx509 to parse the certificates. You write the filtering function at the top of the file, you can also choose which certs to process (leaf, intermediate(s), and root). You can limit the filtering to a single zip file (for testing) or to a single log (since logs will often contain duplicates of each other.)
The example criteria I have in there looks for a particular domain name. This is a silly criteria - there are much faster ways to look for certs matching a domain name. But if you want to search for a custom extension or combination of extensions - it makes a lot more sense. You can look at pyx509 to see what types of structures are exposed to you.
A word on minutia - pyasn1 is slow. It's full featured but it's slow. On the normal library it took about 18 minutes to process a zip file. By using cython and various other tweaks and tricks in both it and pyx509, I was able to get that down to about 4 minutes, 1.5 if you only process leaf certs. So I'd recommend using my branches of pyasn1 and pyx509.
All in all, it's definetly not the fastest way to do this - but it was the simplest. I can run a query across one of the google logs in about 18 hours, which is fast enough for my curiosity for most things.
A little bit ago NCC Group North America had an all-hands retreat, and solicited technical talks. I fired off a one-line e-mail: "All About Tor - Everything from the Directory Authorities to the Link Protocol to Pluggable Transports to everything in between." And promptly forgot about it for... a couple months. I ended up building the deck, with a level of detail I thought was about 80% of what I wanted, and gave a dry-run for my 45 minute talk. It ran two full hours.
I cut a bunch of content for the talk, but knew I would need to finish the whole thing and make it available. Which I finally did! The slides are available here, and are released CC Attribution-ShareAlike. The source for the presentation is available in keynote format.
Major thanks to all the folks I bugged to build this, especially Nick Mathewson, and those who gave me feedback on mailing lists.
A little bit ago, while in London at Real World Crypto and hanging out with some browser and SSL folks, I mentioned the thought "Why isn't there a directive in HSTS to require OCSP Stapling?" (Or really just hard fail on revocation, but pragmatically they're the same thing.) They responded, "I don't know, I don't think anyone's proposed it." I had batted the idea around a little bit, and ended up posting about it, but it didn't get much traction. I still think it's a good idea I'll probably revisit sometime soon... but while thinking more about it and in other conversations a more immediate thought came up.
What about requiring Certificate Transparency?
The motivation behind requiring Certificate Transparency for your site is to ensure that any certificate used to authenticate your site is publicly logged. Excepting, of course, locally installed roots which everyone is upset with because of SuperFish. But that's independent of the topic at hand. As a site operator, you can be certain that no one has compromised or coerced a CA, maybe even the CA you chose to pin to, into issuing a certificate for your site behind your back. Instead, you can see that certificate in a Certificate Transparency log!
Deployment, and Risks
I mentioned pinning. Pinning is a very strong security mechanism, but also a very risky one. You can lock your users out of your site by incorrectly pinning to the wrong Intermediate CA or losing a backup key. Requiring CT is also risky.
But, the risk of footgunning yourself is much lower, and similar to requiring OCSP Stapling, is mostly around getting yourself into a situation where your infrastructure can't cash the check your mouth just wrote. But CT has three ways to get a SCT to the user: a TLS Extension, a Certificate Extension, and an OCSP Extension. So (in theory) there are more ways to support your decision.
In reality, not so much. Certificate Authorities are gearing up to support CT, and if you work closely with one, you may even be able to purchase a cert with embedded SCTs. (DigiCert says all you have to do is contact them, same with Comodo.) So depending on your choice of CA, you may be able to leverage this mechanism.
Getting an SCT into an OCSP response is probably trickier. Not only does this require the cooperation of the CA, but because most CAs purchase software and hardware to run their OCSP responders it also likely requires that vendor to do some development.
I'm not aware of any CA that supports this mechanism of delivering SCTs, but I could be wrong. Apparently, Comodo supports the OCSP delivery option! (And it's very easy for them to enable it, which they so very nicely did for ritter.vg. So you can try it out on this site, live.)
Fortunately, the third mechanism is entirely in your control as the server operator. You can deliver the SCTs in a TLS extension if the client requests them. Sounds great right? Heh. Let's go on an adventure. Or, if you want to skip the adventure and just see the screenshots, you can do that too.
Now, to be clear, CT is in its infancy. So things will get easier. In fact right now CT is only deployed to enable Extended Validation indicators in Chrome - and for nothing else. So don't take this blog post as a critique of the system. Rather take it as jumping ahead a couple years and proofing out tomorrow's protection mechanisms, today.
Let's log a cert.
First off, before you can actually deliver SCTs to your clients, you have to get an SCT - so you have to log your cert. There are several functional, operating logs. Submitting a certificate is not done through a web form, but rather an API, as specified in the RFC.
To help with this, I created a tiny python script. You will need to open your site in your browser first, and download your certificate and the entire certificate chain up to a known root. Store them as Base64-encoded .cer or .pem files. They should be ASCII and start with "-----BEGIN CERTIFICATE-----".
Now call submit-cert.py with the certificate chain, in order, starting with the leaf. You can specify a specific log to submit to, or you can just try them all. No harm in that!
./submit-cert.py --cert leaf.cer --cert intermediate1.cer --cert intermediate2.cer --cert root.cer rocketeer Timestamp 1410307847307 Signature BAMARzBFAiEAzNhW8IUPJY1c8vbLDAmufuppc9mYdBLbtSwTHLrnklACID5iG8kafP8pcxny1yKciiewhg8VRybMR4h3wJlTV3s5 Error communicating with izenpen certly Timestamp 1424321473254 Signature BAMARzBFAiA091WNEs3R/SWVjRaAlpwUpY0l/YYgUH3sMYBlI4XB9AIhAPVMyODwhig48IpE0EJgzKpdAi/iorBUIuy1qH4qrO5g Error communicating with digicert aviator Timestamp 1406484668045 Signature BAMARzBFAiEA2TVxYDf30ndQlANozAp+HVQ1IFyfGRjsZPa3TZWeeRcCIFFDpPnHQbxfhXQ7bXtueAFiiGG3HfvWqFnc9L+M/+pt pilot Timestamp 1406495947353 Signature BAMARzBFAiEAvckWLUX2H/p1dPbZmn/kaxeAbAEqehQYsgscJMzrqNYCIGGQaJ0MtG8Z13+nk2sstFAwqN+t8wsAEqNdZZmrL0e0
You can see that we had a few errors: we couldn't submit it to the two CA-run logs: Digicert and Izenpen. I think one of them is restricting IPs and the other may not be accepting the particular root I'm issued off of. No worries though, we got success responses from all 3 Google logs and the other independent log, certly. You'll be able to see your certificates within 24 hours at ctwatch.net which is a monitor and nice front-end for querying the logs. (The only one I'm aware of actually.)
Something else you might notice if you examine things carefully is that your certificate is probably already in the Google logs. (Check ctwatch.net for your cert now.) They pre-seeded their logs with, as far as I can tell, every certificate they've seen on the Internet.
Let's get OpenSSL 1.0.2
Wait, what? Yea, before we can configure Apache to send the TLS extension, we need to make sure our TLS library supports the extension, and in OpenSSL's case, that means 1.0.2. Fortunately, gentoo makes this easy. In general, this entire process is not going to be that difficult if you're already running Apache 2.4 on gentoo - but if you're not... it's probably going to be pretty annoying.
Let's Configure Apache
Okay, now that we're using OpenSSL 1.0.2, we need to configure our webserver to send the TLS extension. This is really where the bleeding edge starts to happen. I'm not aware of any way to do this for nginx or really for anything but Apache.
And for Apache, the code isn't even in a stable release, it's in trunk. (And it's not that well tested.) But it does exist, thanks to the efforts of Jeff Trawick. So if you're willing to compile Apache yourself, you can get it working. You don't have to compile trunk, you can instead patch 2.4 and them compile the module specifically. I discovered that it's pretty easily to automatically add patches to a package in gentoo thanks to /etc/portage/patches so that's even better! (For me anyway.) The two patches you will need for Apache 2.4 I have here and here. These patches (and this process) are based off Jeff's work but be aware his repo's code is out of date in comparison with httpd's trunk, and his patch didn't work for me out of the box. Jeff updated his patch, and it works out-of-the-box (for me). You can find it here.
As of Apache 2.4.20, you no longer need to patch Apache! Thanks Jeff! For more information check out his github repo.
You do need to compile the Apache module. For that, you need to go check out httpd's trunk. After that you need to build the module which is quite simple. There's a sample command right after the checkout instructions, for me it was:
cd modules/ssl apxs -ci -I/usr/include/openssl mod_ssl_ct.c ssl_ct_util.c ssl_ct_sct.c ssl_ct_log_config.c
This even goes so far as to install the module for you! Let's go configure it. I wanted to be able to control it with a startup flag, so where the modules were loaded I specified:
<IfDefine SSL_CT> LoadModule ssl_ct_module modules/mod_ssl_ct.so </IfDefine>
Now if you read the module documentation you discover this module does a lot. It uses the command line tools from the certificate-transparency project to automatically submit your certificates, it handles proxies, it does auditing... It's complicated. Instead, we want the simplest thing that works - so we're going to ignore all that functionality and just configure it statically using SCTs we give it ourselves.
I have two VHosts running (which actually caused a bug, pay attention to that if you have multiple VHosts) - This is fixed! I configured it like this:
<IfDefine SSL_CT> CTSCTStorage /run/ct-scts CTStaticSCTs /etc/apache2/ssl/rittervg-leaf.cer /etc/apache2/ssl/rittervg-scts CTStaticSCTs /etc/apache2/ssl/cryptois-leaf.cer /etc/apache2/ssl/cryptois-scts </IfDefine>
The CTSCTStorage directive is required, it's a working directory where it's going to store some temporary files. The CTStaticSCTs directive tells it look in this directory for files ending in .sct for SCTs for that certificate.
So we need to put our SCTs in that directory - but in what format? It's not really documented, but it's the exact SCT structure that's going to go into the extension. You'd think that structure would be documented in the RFC - in fact you'd probably expect it to be right here, where they say SerializedSCT... but no dice, that's not defined. Instead you can find it here but it's not really apparent. I figured it out mostly by reading Chromium's source code.
To aid you in producing these .sct files, I again wrote a little script. You give it the log name, the timestamp, and the signature that you got from the prior script you ran, and it outputs a small little binary file. (You can output it to a file, or output it in base64 for easy copy/pasting between terminals.)
With these .sct files present, you can fire up Apache (if necessary passing -D SSL_CT, perhaps in /etc/conf.d/apache) and see if it starts up! If it does, visit your site in Chrome (which will send the CT TLS Extension) and see if you can load your site. If you can, look in the Origin Information Bubble and see if you have transparency records:
Hopefully you do, and you can click in to see it:
"From an unknown log" appears because Rocketeer and Certly are still pending approval and inclusion
It's a known issue that you don't get the link or SCT viewer on Mac, but it will say "and is publicly auditable".
Requiring Certificate Transparency
That's all well and good and took way more work than you expected, but this only sends CT information, it doesn't require it. For that we need to go, edit, and compile another giant project: Chromium.
Wait seriously? You're going to go patch Chromium?
Hell yea. Like I said in the beginning: requiring CT, today, is a proof of concept of something from the future. We may get to the day where Chrome requires CT for all certificates - both EV and DV. But not yet, and not for several years at least. Today, we need to patch it. But rather than patching it to require CT for every domain on the internet - that would break, AFAIK, every single domain except my two and Digicert's test site - instead we're making it a directive in HSTS that a site operator can specify.
Building Chromium is not trivial. Unless you're on gentoo in which case it's literally how you install the browser. I worked off Chromium 42.0.2292.0 which is already out of date since I started this project 10 days ago. But whatever. I used the manual ebuild commands to pause between the unpack and compile stages to test my edits - unless you use that exact version, you'll almost certainly not get a clean apply of my patch.
The patch to chromium is over here, and it adds support for the HSTS directive requireCT. If present, it will fail a connection to any site that it has noted this for, unless it supplies SCTs in either the TLS Extension, the Certificate Extension, or an OCSP Staple. (The last two are untested, but probably work.)
And that, I think, is kinda cool.
Disk Encryption is “a litany of difficult tradeoffs and messy compromises” as our good friend and mentor Tom Ptacek put it in his blog post. That sounds depressing, but it’s pretty accurate - trying to encrypt an entire hard drive is riddled with constraints. For example:
- Disk Encryption must be really, really fast. Essentially, if the crypto happens slower than the disk read speed (said another way, if the CPU is a bottleneck) - your solution is untenable to the mass market
- It must support random read and write access - any sector may be read at any time, and any sector may be updated at any time
- You really need to avoid updating multiple sectors for a single write - if power is lost during the operation, the inconsistencies will not be able to be resolved easily, if at all
- People expect hard disks to provide roughly the amount of advertised space. Stealing significant amounts of space for ‘overhead’ is not feasible. (This goes doubly so if encryption is applied after operating system installation - there may not be space to steal!)
The last two constraints mean that the ciphertext must be the exact same size as the plaintext. There’s simply no room to store IVs, nonces, counters, or authentication tags. And without any of those things, there’s no way to provide cryptographic authentication in any of the common ways we know how to provide it. No HMACs over the sector and no room for a GCM tag (or OCB, CCM, or EAX, all of which expand the message). Which brings us to…
Because of the constraints imposed by the disk format, it’s extremely difficult to find a way to correctly authenticate the ciphertext. Instead, disk encryption relies on ‘poor-man’s authentication’.
The best solution is to use poor-man’s authentication: encrypt the data and trust to the fact that changes in the ciphertext do not translate to semantically sensible changes to the plaintext. For example, an attacker can change the ciphertext of an executable, but if the new plaintext is effectively random we can hope that there is a far higher chance that the changes will crash the machine or application rather than doing something the attacker wants.
We are not alone in reaching the conclusion that poor-man’s authentication is the only practical solution to the authentication problem. All other disk-level encryption schemes that we are aware of either provide no authentication at all, or use poor-man’s authentication. To get the best possible poor-man’s authentication we want the BitLocker encryption algorithm to behave like a block cipher with a block size of 512–8192 bytes. This way, if the attacker changes any part of the ciphertext, all of the plaintext for that sector is modified in a random way.
That excerpt comes from an excellent paper by Niels Ferguson of Microsoft in 2006 explaining how BitLocker works. The property of changing a single bit, and it propagating to many more bits, is diffusion and it’s actually a design goal of block ciphers in general. When talking about disk encryption in this post, we’re going to use diffusion to refer to how much changing a single bit (or byte) on an encrypted disk affects the resulting plaintext.
BitLocker in Windows Vista & 7
When BitLocker was first introduced, it operated in AES-CBC with something called the Elephant Diffuser. The BitLocker paper is an excellent reference both on how Elephant works, and why they created it. At its heart, the goal of Elephant is to provide as much diffusion as possible, while still being highly performant.
The paper also includes Microsoft’s Opinion of AES-CBC Mode used by itself. I’m going to just quote:
Any time you want to encrypt data, AES-CBC is a leading candidate. In this case it is not suitable, due to the lack of diffusion in the CBC decryption operation. If the attacker introduces a change d in ciphertext block i, then plaintext block i is randomized, but plaintext block i + 1 is changed by d. In other words, the attacker can flip arbitrary bits in one block at the cost of randomizing the previous block. This can be used to attack executables. You can change the instructions at the start of a function at the cost of damaging whatever data is stored just before the function. With thousands of functions in the code, it should be relatively easy to mount an attack.
The current version of BitLocker [Ed: BitLocker in Vista and Windows 7] implements an option that allows customers to use AES-CBC for the disk encryption. This option is aimed at those few customers that have formal requirements to only use government-approved encryption algorithms. Given the weakness of the poor-man’s authentication in this solution, we do not recommend using it.
BitLocker in Windows 8 & 8.1
BitLocker in Windows 8 and 8.1 uses AES-CBC mode, without the diffuser, by default. It’s actually not even a choice, the option is entirely gone from the Group Policy Editor. (There is a second setting that applies to only “Windows Server 2008, Windows 7, and Windows Vista” that lets you choose Diffuser.) Even using the commandline there’s no way to encrypt a new disk using Diffuser - Manage-BDE says “The encryption methods aes128_Diffuser and aes256_Diffuser are deprecated. Valid volume encryption methods: aes128 and aes256.” However, we can confirm that the code to use Diffuser is still present - disks encrypted under Windows 7 with Diffuser continue to work fine on Windows 8.1.
AES-CBC is the exact mode that Microsoft considered (quoting from above) “unsuitable” in 2006 and “recommended against”. They explicitly said “it should be relatively easy to mount an attack”.
And it is.
As written in the Microsoft paper, the problem comes from the fact that an attacker can modify the ciphertext and perform very fine-grained modification of the resulting plaintext. Flipping a single bit in the ciphertext results reliably scrambles the next plaintext block in an unpredictable way (the rainbow block), and flips the exact same bit in the subsequent plaintext block (the red line):
This type of fine-grained control is exactly what Poor Man’s Authentication is designed to combat. We want any change in the ciphertext to result in entirely unpredictable changes in the plaintext and we want it to affect an extremely large swath of data. This level of fine-grained control allows us to perform targeted scrambling, but more usefully, targeted bitflips.
But what bits do we flip? If the disk is encrypted, don’t we lack any idea of where anything interesting is stored? Yes and no. In our testing, two installations of Windows 8 onto the same format of machine put the system DLLs in identical locations. This behavior is far from guarenteed, but if we do know where a file is expected to be, perhaps through educated guesswork and installing the OS on the same physical hardware, then we will know the location, the ciphertext, and the plaintext. And at that point, we can do more than just flip bits, we can completely rewrite what will be decrypted upon startup. This lets us do much more than what people have suggested around changing a branch condition: we just write arbitrary assembly code. So we did. Below is a short video that shows booting up a Virtual Machine showing a normal unmodified BitLockered disk on Windows 8, shutting it down and modifying the ciphertext on the underlying disk, starting it back up, and achieving arbitrary code execution.
This is possible because we knew the location of a specific file on the disk (and therefore the plaintext), calculated what ciphertext would be necessary to write out desired shellcode, and wrote it onto the disk. (The particular file we chose did move around during installation, so we did ‘cheat’ a little - with more time investment, we could change our target to a system dll that hasn’t been patched in Windows Updates or moved since installation.) Upon decryption, 16 bytes were garbled, but we chose the position and assembly code carefully such that the garbled blocks were always skipped over. To give credit where others have demonstrated similar work, this is actually the same type of attack that Jakob Lell demonstrated against LUKS partitions last year.
The obvious question comes up when discussing disk encryption modes: why not use XTS, a mode specifically designed for disk encryption and standardized and blessed by NIST? XTS is used in LUKS and Truecrypt, and prevents targeted bitflipping attacks. But it’s not perfect. Let’s look at what happens when we flip a single bit in ciphertext encrypted using XTS:
A single bit change completely scrambles the full 16 byte block of the ciphertext, there’s no control over the change. That’s good, right? It’s not bad, but it’s not as good as it could be. Unfortunately, XTS was not considered in the original Elephant paper (it was relatively new in 2006), so we don’t have their thoughts about it in direct comparison to Elephant. But the authors of Elephant evaluated another disk encryption mode that had the same property:
LRW provides some level of poor-man’s authentication, but the relatively small block size of AES (16 bytes) still leaves a lot of freedom for an attacker. For example, there could be a configuration file (or registry entry) with a value that, when set to 0, creates a security hole in the OS. On disk the setting looks something like “enableSomeSecuritySetting=1”. If the start of the value falls on a 16-byte boundary and the attacker randomizes the plaintext value, there is a 2−16 chance that the first two bytes of the plaintext will be 0x30 0x00 which is a string that encodes the ASCII value ’0’.
For BitLocker we want a block cipher whose block size is much larger.
Furthermore, they elaborate upon this in their comments to NIST on XTS, explicitly calling out the small amount of diffusion. A 16-byte scramble is pretty small. It’s only 3-4 assembly instructions. To compare how XTS’ diffusion compares to Elephant’s, we modified a single bit on the disk of a BitLockered Windows 7 installation that corresponded to a file of all zeros. The resulting output shows that 512 bytes (the smallest sector size in use) were modified:
This amount of diffusion is obviously much larger than 16 bytes. It’s also not perfect - a 512 byte scramble, in the right location, could very well result in a security bypass. Remember, this is all ‘Poor Man’s Authentication’ - we know the solution is not particularly strong, we’re just trying to get the best we can. But it’s still a lot harder to pop calc with.
From talking with Microsoft about this issue, one of the driving factors in this change was performance. Indeed, when BitLocker first came out and was documented, the paper spends a considerable amount of focus on evaluating algorithms based on cycles/byte. Back then, there were no AES instructions built into processors - today there are, and it has likely shifted the bulk of the workload for BitLocker onto the Diffuser. And while we think of computers as becoming more powerful since 2006 - tablets, phones, and embedded devices are not the ‘exception’ but a major target market.
Using Full Disk Encryption (including BitLocker in Windows 8) is clearly better than not - as anyone’s who had a laptop stolen from a rental car knows. Ultimately, I’m extremely curious what requirements the new BitLocker design had placed on it. Disk Encryption is hard, and even XTS (standardized by NIST) has significant drawbacks. With more information about real-world design constraints, the cryptographic community can focus on developing something better than Elephant or XTS.
I’d like to thank Kurtis Miller for his help with Windows shellcode, Justin Troutman for finding relevant information, Jakob Lell for beating me to it by a year, DaveG for being DaveG, and the MSRC.
This post originally appeared on the Cryptography Services blog.