Recently I wanted to run a complex query across every certificate in the CT logs. That would obviously take some time to process - but I was more interested in ease-of-execution than I was in making things as fast as possible. I ended up using a few tools, and writing a few tools, to make this happen.
- Catlfish is a CT Log server that's written by a friend (and CT Gossip coauther). I'm not interested in the log server, just the tools - specifically fetchallcerts.py to download the logs.
- fetchallcerts.py requires the log keys, in PEM format. (Not sure why.) Run this tool to download all the logs' keys.
- fetchallcerts.py only works on one log at a time. A quick bash script will run this across all logs.
With these tools you can download all the certificates in all the logs except the two logs who use RSA instead of ECC. (That's CNNIC and Venafi.) They come down in zipfiles and take up about 145 GB.
Now we need to process them! For that you can use findcerts.py. The way this script works is using python's multiprocessing (one process per CPU) to process a zipfile at a time. It uses pyasn1 and pyx509 to parse the certificates. You write the filtering function at the top of the file, you can also choose which certs to process (leaf, intermediate(s), and root). You can limit the filtering to a single zip file (for testing) or to a single log (since logs will often contain duplicates of each other.)
The example criteria I have in there looks for a particular domain name. This is a silly criteria - there are much faster ways to look for certs matching a domain name. But if you want to search for a custom extension or combination of extensions - it makes a lot more sense. You can look at pyx509 to see what types of structures are exposed to you.
A word on minutia - pyasn1 is slow. It's full featured but it's slow. On the normal library it took about 18 minutes to process a zip file. By using cython and various other tweaks and tricks in both it and pyx509, I was able to get that down to about 4 minutes, 1.5 if you only process leaf certs. So I'd recommend using my branches of pyasn1 and pyx509.
All in all, it's definetly not the fastest way to do this - but it was the simplest. I can run a query across one of the google logs in about 18 hours, which is fast enough for my curiosity for most things.