2021-04-20

Certificate Deep Dive - Part 2 - What's in the box?

Continuing our deep dive into certificates, in this part we'll look at the actual certificate files themselves and explore their structure in depth.

What's in a Certificate File?

The short answer is that a Certificate file is a DER-encoded form of an ASN.1 structure which is defined in RFC 2459.  This DER-encoding can optionally be Base-64 encoded as well.

So what's the long story ... Well, the easiest way to see that is to work through an example.

We'll start with a certificate file.  This is a hex dump of a 1996 NASA root certificate that we can work through:

0000  30 82 01 EA 30 82 01 94 02 02 01 2D 30 0D 06 09  0‚.ê0‚.”...-0...
0010  2A 86 48 86 F7 0D 01 01 04 05 00 30 81 80 31 0B  *†H†÷......0.€1.
0020  30 09 06 03 55 04 06 13 02 55 53 31 36 30 34 06  0...U....US1604.
0030  03 55 04 0A 13 2D 4E 61 74 69 6F 6E 61 6C 20 41  .U...-National A
0040  65 72 6F 6E 61 75 74 69 63 73 20 61 6E 64 20 53  eronautics and S
0050  70 61 63 65 20 41 64 6D 69 6E 69 73 74 72 61 74  pace Administrat
0060  69 6F 6E 31 19 30 17 06 03 55 04 0B 13 10 54 65  ion1.0...U....Te
0070  73 74 20 45 6E 76 69 72 6F 6E 6D 65 6E 74 31 1E  st Environment1.
0080  30 1C 06 03 55 04 0B 13 15 4D 44 35 2D 52 53 41  0...U....MD5-RSA
0090  2D 4E 41 53 41 2D 50 69 6C 6F 74 2D 43 41 30 1E  -NASA-Pilot-CA0.
00A0  17 0D 39 36 30 34 33 30 32 32 30 35 30 30 5A 17  ..960430220500Z.
00B0  0D 39 37 30 34 33 30 32 32 30 35 30 30 5A 30 81  .970430220500Z0.
00C0  80 31 0B 30 09 06 03 55 04 06 13 02 55 53 31 36  €1.0...U....US16
00D0  30 34 06 03 55 04 0A 13 2D 4E 61 74 69 6F 6E 61  04..U...-Nationa
00E0  6C 20 41 65 72 6F 6E 61 75 74 69 63 73 20 61 6E  l Aeronautics an
00F0  64 20 53 70 61 63 65 20 41 64 6D 69 6E 69 73 74  d Space Administ
0100  72 61 74 69 6F 6E 31 19 30 17 06 03 55 04 0B 13  ration1.0...U...
0110  10 54 65 73 74 20 45 6E 76 69 72 6F 6E 6D 65 6E  .Test Environmen
0120  74 31 1E 30 1C 06 03 55 04 0B 13 15 4D 44 35 2D  t1.0...U....MD5-
0130  52 53 41 2D 4E 41 53 41 2D 50 69 6C 6F 74 2D 43  RSA-NASA-Pilot-C
0140  41 30 59 30 0A 06 04 55 08 01 01 02 02 02 00 03  A0Y0...U........
0150  4B 00 30 48 02 41 00 B9 A6 5F 9F 86 A8 0B DC AD  K.0H.A.¹¦_Ÿ†¨.Ü.
0160  62 B5 DE B7 C3 AC D5 FD 51 F7 80 ED FF A2 8F 20  bµÞ·Ã¬ÕýQ÷€íÿ¢. 
0170  60 78 54 23 61 4A 03 A4 5A 9A A7 0E 27 CD 00 4B  `xT#aJ.¤Zš§.'Í.K
0180  6C 05 EE 83 CA 4A 91 1C 1A D9 98 7C 60 E0 E1 5E  l.îƒÊJ‘..Ù˜|`àá^
0190  C4 56 BB 61 7F 51 C9 02 03 01 00 01 30 0D 06 09  ÄV»a.QÉ.....0...
01A0  2A 86 48 86 F7 0D 01 01 04 05 00 03 41 00 7F 5A  *†H†÷.......A..Z
01B0  38 10 3E 40 2B 23 C5 78 27 4A A1 F1 D3 88 1C 53  8.>@+#Åx'J¡ñÓˆ.S
01C0  C4 B8 F4 35 54 6F 57 F6 5D 5A 0B 9C 79 48 6F C4  ĸô5ToWö]Z.œyHoÄ
01D0  67 5F 49 39 3B A9 A9 1D 3E 5E B6 2F 5B 2E 48 96  g_I9;©©.>^¶/[.H–
01E0  18 93 4C 27 82 F4 00 9F DA 73 E4 A6 1D 97        .“L'‚ô.ŸÚsä¦.—


We also need to reference RFC 2459 Section 4.1, which tells us the ASN.1 structure of the certificate file.  For now we'll only look at the first couple of fields ...

	Certificate  ::=  SEQUENCE
	{
		tbsCertificate       TBSCertificate,
		signatureAlgorithm   AlgorithmIdentifier,
		signatureValue       BIT STRING
	}

	TBSCertificate  ::=  SEQUENCE
	{
		version         [0]  EXPLICIT Version DEFAULT v1,
		serialNumber         CertificateSerialNumber,
		signature            AlgorithmIdentifier,
		issuer               Name,
		validity             Validity,
		subject              Name,
		subjectPublicKeyInfo SubjectPublicKeyInfo,
		issuerUniqueID  [1]  IMPLICIT UniqueIdentifier OPTIONAL,
		subjectUniqueID [2]  IMPLICIT UniqueIdentifier OPTIONAL,
		extensions      [3]  EXPLICIT Extensions OPTIONAL
	}

Finally, we need to know how to convert an ASN.1 structure into binary using DER encoding.  The Wikipedia article on BER encoding gives us enough information to start reading the data.  If you want more information, the full specification document is called "International Telecommunication Union X.690 Series X: Data Networks and Open System Communications OSI Networking and System Aspects ASN.1 Encoding Rules: Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER)".

BER encoding is reasonably easy to read as it's a standard Type-Length-Value (TLV) encoding.  The Type is stored as one or more bytes called "Identifier Octets", then a length stored as one or more bytes, then the payload value.

I said the file was DER encoded, rather than BER encoded.  DER encoding is a subset of BER encoding which has stricter rules for the encoder, ensuring that a given piece of data can be encoded with only one possible binary outcome.  This is important for signatures and other hash operations which require that one certificate will always have the same binary representation.  For specificity, I'll use the term "DER encoding" for the rest of the article.

[The file extension usually used for binary certificates is ".cer".  I do wonder if it means "certificate" or if it's referring to CER-encoded content (another subset of BER).  Unfortunately I couldn't find anything to confirm or deny this, so it's probably just an abbreviation for certificate.]

Let's get to work:

We expect the first byte in our data to be the first octet of some "Identifier Octets".  The value of the first byte in our example is 0x30, which (according to Wikipedia) is interpreted as: 

  • Tag class 0 (bits 0x80 and 0x40 clear) means it is a standard tag native to ASN.1 (as opposed to a custom tag defined by some other standard).  Almost all of the tags in the structure will be in Tag Class 0.
  • "Constructed" (bit 0x20 set) means that content of this tag is more DER-encoded data (as opposed to some arbitrary payload bytes).
  • Tag 0x10 means "SEQUENCE".

[If the Tag number in the first identifier octet is 31 (0x1F, the highest value that can be stored), then further bytes contain the additional data for the tag number.  For the certificate and depth we are going to explore here, we don't expect to find any Tag numbers greater than 30 (0x1E), so all chunks will have only one Identifier Octet byte each.]

Comparing our breakdown to the RFC on certificates, this all checks out because the "Certificate" structure starts with a SEQUENCE element.
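To make that bit-twiddling concrete, here is a minimal Python sketch of splitting a single identifier octet into its three parts.  The function name decode_identifier and the dictionary keys are my own choices for illustration, and it assumes a single-octet identifier (tag numbers of 30 or less), which is all we need for this certificate.

```python
# The four tag classes, indexed by the top two bits of the identifier octet.
TAG_CLASSES = ["Universal", "Application", "Context-Specific", "Private"]

def decode_identifier(octet: int) -> dict:
    """Split one identifier octet into class, constructed flag and tag number."""
    return {
        "tag_class": TAG_CLASSES[octet >> 6],   # bits 0x80 and 0x40
        "constructed": bool(octet & 0x20),      # bit 0x20
        "tag_number": octet & 0x1F,             # low five bits
    }

print(decode_identifier(0x30))
```

Feeding it 0x30 gives class 0 ("Universal"), Constructed, tag number 0x10, which is the SEQUENCE we worked out by hand.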

After the identifier octets come the Length octets, and there are two different ways these might be encoded:

  • Short Form: If the length to be encoded is 127 (0x7F) or less, the length will be stored directly in a single byte.
  • Long Form: If the length to be encoded is 128 (0x80) or more, then the first byte will be [128 + the number of bytes needed to store the length], with more bytes following to store the actual length itself as a big-endian integer.

[BER-encoding allows for a third, "indefinite" form where the content is delimited by an "end-of-content" marker, but this isn't allowed in DER-encoding so we don't need to worry about it as it'll never appear in a certificate.]

A few examples to cement that idea:
  • 0x25 is a Short Form encoding, meaning a content length of 37 bytes is to follow
  • 0x7F is a Short Form encoding, meaning a content length of 127 bytes is to follow
  • 0x81 0x80 is a Long Form encoding.  The first byte indicates that one byte of length follows, and that length is 0x80, meaning a content length of 128 bytes is to follow.
  • 0x82 0x5F 0x47 is a Long Form encoding.  The first byte indicates that two bytes of length follow (to be read big-endian), and that length is 0x5F47, meaning a content length of 24391 bytes is to follow.
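
The two length forms can be sketched in a few lines of Python.  This is only an illustration under my own naming (decode_length is not from any library), and it does no bounds checking on truncated input.

```python
def decode_length(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (content_length, number_of_length_octets) read at offset."""
    first = data[offset]
    if first < 0x80:
        # Short Form: the byte is the length itself.
        return first, 1
    count = first & 0x7F  # Long Form: how many length bytes follow
    if count == 0:
        raise ValueError("indefinite length (0x80) is not allowed in DER")
    value = int.from_bytes(data[offset + 1 : offset + 1 + count], "big")
    return value, 1 + count

for example in ("25", "7F", "8180", "825F47"):
    print(example, decode_length(bytes.fromhex(example)))
```

The second value in the returned pair is how many bytes the length field itself occupied, which is what you need to find where the content starts.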

We'll look at the length encoding a little more in the Identification section below.

Looking back at our file, the next few bytes, which encode the length of this first tag, are 0x82 0x01 0xEA.  The first 0x82 indicates that we should read the next 2 bytes (big-endian) to get the actual length, which then gives us a length of 0x1EA.  We can check this value because the entire certificate file is 0x1EE long, so the 1 identifier byte, the 3 length bytes, and the 0x1EA bytes of content all add up correctly to 0x1EE.  Fantastic, we've found our opening sequence!

Because this first SEQUENCE tag is marked as a "Constructed" element, the content of this element is to be interpreted as more DER-encoded data (as opposed to "Primitive" which means the content is some other, non-DER data).

Now we know how to read the Identifier and Length bytes, and we know to recurse into the constructed data, we can rinse-and-repeat the above steps to do a breadth-first read of the first few levels of our test certificate:

30 82 01 EA = SEQUENCE (IsConstructed = true), Length = 0x1EA = 490, RFC "Certificate" structure
{
	30 82 01 94 = SEQUENCE (IsConstructed = true), Length = 0x194 = 404, RFC "TBSCertificate" structure
	{
		02 02 = INTEGER (IsConstructed = false), Length = 0x2 = 2, RFC "serialNumber" field
		{
			01 2D
		}
		30 0D = SEQUENCE (IsConstructed = true), Length = 0xD = 13, RFC "signature" field
		{
			06 09 2A 86 48 86 F7 0D 01 01 04 05 00
		}
		30 81 80 = SEQUENCE (IsConstructed = true), Length = 0x80 = 128, RFC "issuer" field
		{
			31 0B 30 09 06 03 55 04 06 13 02 55 53 31 36 30 34 06 
			03 55 04 0A 13 2D 4E 61 74 69 6F 6E 61 6C 20 41 
			65 72 6F 6E 61 75 74 69 63 73 20 61 6E 64 20 53 
			70 61 63 65 20 41 64 6D 69 6E 69 73 74 72 61 74 
			69 6F 6E 31 19 30 17 06 03 55 04 0B 13 10 54 65 
			73 74 20 45 6E 76 69 72 6F 6E 6D 65 6E 74 31 1E 
			30 1C 06 03 55 04 0B 13 15 4D 44 35 2D 52 53 41 
			2D 4E 41 53 41 2D 50 69 6C 6F 74 2D 43 41 
		}
		30 1E = SEQUENCE (IsConstructed = true), Length = 0x1E = 30, RFC "validity" field
		{
			17 0D 39 36 30 34 33 30 32 32 30 35 30 30 5A 17 
			0D 39 37 30 34 33 30 32 32 30 35 30 30 5A
		}
		30 81 80 = SEQUENCE (IsConstructed = true), Length = 0x80 = 128, RFC "subject" field
		{
			31 0B 30 09 06 03 55 04 06 13 02 55 53 31 36 
			30 34 06 03 55 04 0A 13 2D 4E 61 74 69 6F 6E 61 
			6C 20 41 65 72 6F 6E 61 75 74 69 63 73 20 61 6E 
			64 20 53 70 61 63 65 20 41 64 6D 69 6E 69 73 74 
			72 61 74 69 6F 6E 31 19 30 17 06 03 55 04 0B 13 
			10 54 65 73 74 20 45 6E 76 69 72 6F 6E 6D 65 6E 
			74 31 1E 30 1C 06 03 55 04 0B 13 15 4D 44 35 2D 
			52 53 41 2D 4E 41 53 41 2D 50 69 6C 6F 74 2D 43 
			41 
		}
		30 59 = SEQUENCE (IsConstructed = true), Length = 0x59 = 89, RFC "subjectPublicKeyInfo" field
		{
			30 0A 06 04 55 08 01 01 02 02 02 00 03 
			4B 00 30 48 02 41 00 B9 A6 5F 9F 86 A8 0B DC AD 
			62 B5 DE B7 C3 AC D5 FD 51 F7 80 ED FF A2 8F 20 
			60 78 54 23 61 4A 03 A4 5A 9A A7 0E 27 CD 00 4B 
			6C 05 EE 83 CA 4A 91 1C 1A D9 98 7C 60 E0 E1 5E 
			C4 56 BB 61 7F 51 C9 02 03 01 00 01 
		}
	}
	30 0D = SEQUENCE (IsConstructed = true), Length = 0xD = 13, RFC "signatureAlgorithm" field
	{
		06 09 2A 86 48 86 F7 0D 01 01 04 05 00 
	}
	03 41 = BIT STRING (IsConstructed = false), Length = 0x41 = 65, RFC "signatureValue" field
	{
		00 7F 5A 
		38 10 3E 40 2B 23 C5 78 27 4A A1 F1 D3 88 1C 53 
		C4 B8 F4 35 54 6F 57 F6 5D 5A 0B 9C 79 48 6F C4 
		67 5F 49 39 3B A9 A9 1D 3E 5E B6 2F 5B 2E 48 96 
		18 93 4C 27 82 F4 00 9F DA 73 E4 A6 1D 97      
	}
}
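
The rinse-and-repeat procedure used to produce the breakdown above can be sketched as a small recursive reader.  This is a minimal illustration, not a full DER decoder: read_tlv and walk are names I made up, it assumes single-byte identifiers, does no validation, and it recurses depth-first into anything marked Constructed rather than stopping at the first few levels as I did by hand.  The sample bytes are the "signature" field from the certificate.

```python
def read_tlv(data: bytes, offset: int = 0):
    """Read one element; return (identifier, content, offset_of_next_element)."""
    identifier = data[offset]
    first = data[offset + 1]
    if first < 0x80:                      # Short Form length
        length, header = first, 2
    else:                                 # Long Form length
        n = first & 0x7F
        length = int.from_bytes(data[offset + 2 : offset + 2 + n], "big")
        header = 2 + n
    start = offset + header
    return identifier, data[start : start + length], start + length

def walk(data: bytes, depth: int = 0) -> list[str]:
    """Describe every element in data, recursing into Constructed ones."""
    lines = []
    offset = 0
    while offset < len(data):
        identifier, content, offset = read_tlv(data, offset)
        lines.append("  " * depth + f"{identifier:02X} len={len(content)}")
        if identifier & 0x20:             # Constructed: content is more DER
            lines.extend(walk(content, depth + 1))
    return lines

sample = bytes.fromhex("300D06092A864886F70D0101040500")
print("\n".join(walk(sample)))
```

For the sample this lists the outer SEQUENCE of 13 bytes, followed by the two Primitive elements it contains (identifiers 06 and 05).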

We could continue by recursing into all the IsConstructed fields and breaking those down further, but that can be an exercise for another time.  We'll round off this breadth-first interpretation of the certificate file with a hex view, annotated with the different regions we just discovered.


[Yes, I know "octet" is synonymous with "byte" for all modern scenarios where a byte is 8 bits so to include both is a tautology, but I wanted to be clear as the specification uses "octet" whereas I feel people are more familiar with "byte".]

But wait, there are fields missing?

The more astute readers will have spotted that several fields are missing from this certificate according to the RFC specification, in particular:

  • version (EXPLICIT [0] DEFAULT v1),
  • issuerUniqueID (IMPLICIT [1] OPTIONAL),
  • subjectUniqueID (IMPLICIT [2] OPTIONAL),
  • extensions (EXPLICIT [3] OPTIONAL)

All these fields are optional and are numbered in the specification, and to keep things simple our example was chosen as it did not include any of these fields.

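When a certificate includes one of these fields, it is identified by a Tag in the "Context-Specific" class (bit 0x80 set, bit 0x40 clear), where the number in the specification is the Tag Number.  The EXPLICIT fields wrap a complete inner element so they are Constructed (bit 0x20 set), whereas the IMPLICIT fields replace the tag of a BIT STRING, which DER always encodes as Primitive (bit 0x20 clear), therefore:

  • the version field will be included in a tag with Identifier 0xA0,
  • the issuerUniqueID field will be in a tag with Identifier 0x81,
  • the subjectUniqueID field will be in a tag with Identifier 0x82, and
  • the extensions field will be in a tag with Identifier 0xA3.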

Let's look at an excerpt from a more recent, version 3 certificate and see how it differs.

30 82 08 47 = SEQUENCE (IsConstructed = true), Length = 0x847 = 2119, RFC "Certificate" structure
{
	30 82 07 2F = SEQUENCE (IsConstructed = true), Length = 0x72F = 1839, RFC "TBSCertificate" structure
	{
		A0 03 = Context-Specific Tag 0 (IsConstructed = true), Length = 0x3 = 3, RFC "version" field
		{
			02 01 = INTEGER (IsConstructed = false), Length = 0x1 = 1
			{
				02
			}
		}
		02 10 = INTEGER (IsConstructed = false), Length = 0x10 = 16, RFC "serialNumber" field
		{
			0F E2 6A FE B9 45 7D 27 CF D9 9E EE 4F 5F 47 5C 
		}
		... snip ...
	}
	... snip ...
}

We can see this extra tag is interpreted in the same way as the others, and that it contains an integer field, with the value of 2.  The RFC states that a version value of 0 means "v1", 1 means "v2", and 2 means "v3".
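
As a rough sketch of how a reader might handle this optional field, the Python below looks at the first bytes inside the TBSCertificate and either decodes the version or falls back to the default.  The function name read_version is hypothetical, and it only handles the five-byte A0 03 02 01 xx form seen in the excerpt.

```python
# Mapping from the encoded INTEGER to the version name, as stated in the RFC.
VERSION_NAMES = {0: "v1", 1: "v2", 2: "v3"}

def read_version(tbs_content: bytes) -> str:
    """Return the certificate version from the start of the TBSCertificate content."""
    if tbs_content[0] != 0xA0:
        # No [0] tag, so the field is absent and DEFAULT v1 applies.
        return "v1"
    # Expect: A0 03 (explicit tag, 3 bytes) then 02 01 (INTEGER, 1 byte).
    if tbs_content[1:4] != bytes([0x03, 0x02, 0x01]):
        raise ValueError("unexpected version encoding")
    return VERSION_NAMES[tbs_content[4]]

print(read_version(bytes.fromhex("A0030201020210")))  # excerpt above
print(read_version(bytes.fromhex("0202012D")))        # NASA certificate
```

The first call sees the A0 tag and reports v3.  The second starts directly with the serialNumber INTEGER, so it reports v1.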

The extensions field is included in a similar way, with 0xA3 as the only identifier octet byte, followed by a length, which then contains a SEQUENCE of "Extension" structures, as defined in the RFC.

All modern public website certificates will include at least the version and extensions fields, as the CA/Browser Forum Baseline Requirements state that certificates have to be Version 3 and have to include the "Subject Alternative Name" extension.  Therefore most modern TLS certificates can be expected to contain both these fields.

What about Base64 Encoding?

Certificate files, Certificate Signing Requests (CSRs), Private Key files (PVKs), and other related files are commonly found in one of two formats; either stored in raw binary of the DER-encoding, or where that binary has been base64-encoded and stored as a text file with a text header and footer.  Most tools are smart enough to accept files in either form as converting between them is trivial.

The base64-encoded versions are larger (as base-64 encoding increases the data size by around one-third) but are still very useful for transmitting files in text media, such as being sent by email or being pasted into a web form. 

Our test NASA certificate from above, in base64-encoding is this:

-----BEGIN CERTIFICATE-----
MIIB6jCCAZQCAgEtMA0GCSqGSIb3DQEBBAUAMIGAMQswCQYDVQQGEwJVUzE2MDQG
A1UEChMtTmF0aW9uYWwgQWVyb25hdXRpY3MgYW5kIFNwYWNlIEFkbWluaXN0cmF0
aW9uMRkwFwYDVQQLExBUZXN0IEVudmlyb25tZW50MR4wHAYDVQQLExVNRDUtUlNB
LU5BU0EtUGlsb3QtQ0EwHhcNOTYwNDMwMjIwNTAwWhcNOTcwNDMwMjIwNTAwWjCB
gDELMAkGA1UEBhMCVVMxNjA0BgNVBAoTLU5hdGlvbmFsIEFlcm9uYXV0aWNzIGFu
ZCBTcGFjZSBBZG1pbmlzdHJhdGlvbjEZMBcGA1UECxMQVGVzdCBFbnZpcm9ubWVu
dDEeMBwGA1UECxMVTUQ1LVJTQS1OQVNBLVBpbG90LUNBMFkwCgYEVQgBAQICAgAD
SwAwSAJBALmmX5+GqAvcrWK13rfDrNX9UfeA7f+ijyBgeFQjYUoDpFqapw4nzQBL
bAXug8pKkRwa2Zh8YODhXsRWu2F/UckCAwEAATANBgkqhkiG9w0BAQQFAANBAH9a
OBA+QCsjxXgnSqHx04gcU8S49DVUb1f2XVoLnHlIb8RnX0k5O6mpHT5eti9bLkiW
GJNMJ4L0AJ/ac+SmHZc=
-----END CERTIFICATE-----

As you can see, the base64-encoded text has a header and footer, with a specific string to identify the type of data in the payload.  There are equivalent headers for most of the other file types.  One of the best references for this is the OpenSSL source, which includes the strings for many of these file types.
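
Converting between the two forms really is trivial.  Here is a minimal sketch using Python's standard base64 module.  The helper names pem_to_der and der_to_pem are mine, the 64-character line width simply matches the example above, and the parser assumes the text contains a single header/footer pair.

```python
import base64

def pem_to_der(pem_text: str) -> bytes:
    """Strip the header and footer lines and decode the base64 payload."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    payload = [line for line in lines if line and not line.startswith("-----")]
    return base64.b64decode("".join(payload))

def der_to_pem(der: bytes, label: str = "CERTIFICATE") -> str:
    """Base64-encode the binary and wrap it in a header and footer."""
    b64 = base64.b64encode(der).decode("ascii")
    rows = [b64[i : i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([f"-----BEGIN {label}-----", *rows, f"-----END {label}-----"]) + "\n"

# Round-trip the 15-byte "signature" field from the NASA certificate.
der = bytes.fromhex("300D06092A864886F70D0101040500")
text = der_to_pem(der)
print(text)
assert pem_to_der(text) == der
```

The payload line this prints also appears inside the base64 text of the full certificate above, because that field happens to start on a 3-byte boundary in the file.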

I should note that while many people refer to base64-encoded certificate or private key data as a "PEM" file, that isn't accurate.  PEM stands for "Privacy Enhanced Mail" which was created for a secure mail application many years ago, and then hijacked by others to store these different types of base64-encoded data.

How Big is a Certificate?

Before we move on to identifying certificate files, it's helpful to have an idea of how large certificate files might be.

For a lower bound, we can look at the RFC for the certificate.  There are certain required fields that have to be present and structure that has to be included, but an extreme lower bound can be considered at 89 bytes, as follows:

  • 2 bytes for "Certificate" structure header
  • 2 bytes for "TBSCertificate" structure header
  • 2 bytes for "serialNumber" field (we'll assume length of zero, which appears valid for a root certificate)
  • 5 bytes for "signature" field (we'll assume a one-byte OID)
  • 14 bytes for "issuer" field (assuming one element, with a three-byte OID and a one-byte string)
  • 32 bytes for "validity" field (the RFC gives two formats for time, this uses the shorter one)
  • 14 bytes for "subject" field (assuming one element, with a three-byte OID and a one-byte string)
  • 10 bytes for "subjectPublicKeyInfo" field (assuming a one-byte OID and a one-byte key bit string)
  • 5 bytes for "signatureAlgorithm" field (assuming a one-byte OID)
  • 3 bytes for "signatureValue" field (assuming a one-byte bit string)

I'm fairly sure this couldn't actually be used to create a valid certificate for deeper reasons, such as the invalid, one-byte OIDs.
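
For anyone who wants to check my arithmetic, the sum of those fields can be totted up as below.  The dictionary is just the list above restated, with the final entry named after the RFC's signatureValue field.

```python
# Byte counts for each part of the hypothetical smallest certificate.
minimum_fields = {
    "Certificate header": 2,
    "TBSCertificate header": 2,
    "serialNumber": 2,
    "signature": 5,
    "issuer": 14,
    "validity": 32,
    "subject": 14,
    "subjectPublicKeyInfo": 10,
    "signatureAlgorithm": 5,
    "signatureValue": 3,
}

total = sum(minimum_fields.values())
print(total)
```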

What about an upper bound?  Well, DER-encoding itself can actually handle lengths up to ridiculous numbers of bytes.  The largest permitted first length byte is 0xFE (X.690 reserves 0xFF), which would be followed by 126 length bytes, giving a maximum content size of just under 2^1008 bytes (two to the power of 1008) which is unfathomably enormous.

However, more practical limits exist.  Most TLS software will contain some hard limit for the amount of certificate data it will download from the server when trying to establish a secure TLS connection.  This is data that the client computer is storing into memory, so without a limit a nefarious server could just keep sending more and more certificate data until the client software ran out of memory and failed.  Windows SChannel for example will close the connection with an "Alert: Illegal Parameter" if a server attempts to send a certificate that is above its limit.  While I couldn't find this documented, this limit appears (at least for Windows 10) to be between 27KB and 200KB.  From looking at Wireshark (a network packet capture tool), it appears that SChannel does this by reading the length header from the certificate data the server sends, rather than waiting to receive that amount of certificate data.

This can be demonstrated using a website called badssl.com which has tests for many different TLS and certificate scenarios.  It includes two tests where artificially large certificates have been created by adding many Subject Alternative Names (SANs) into the certificate data.  The first of these has 1'000 SANs (producing a certificate of around 27KB) which succeeds, and the second has 10'000 SANs (with a total certificate size of just under 200KB), which triggers SChannel's alert.

It also appears that Microsoft's IIS web server cannot use a TLS certificate that is larger than 16KB.  I created several Certificates of Unusual Size* to empirically test this limit, and it appears that trying to use a certificate larger than 16KB gives a "Value does not fall within the expected range" when trying to bind the certificate using the IIS management tool.  A 15.7KB certificate was fine, but a 16.1KB certificate failed.  More data is more good.

To put those certificate sizes into context with the current certificates (at the time of writing) from a few public websites:

  • facebook.com's certificate has 11 SANs and weighs in at 1.5KB
  • google.com's certificate has 73 SANs and is 2.3KB
  • ebay.co.uk's certificate has 72 SANs and is 2.8KB
  • My test certificate with 450 SANs is 8.6KB.
  • My test certificate with 875 SANs is 15.7KB
  • The badssl.com certificate with 1000 SANs is 27KB.

I know I've been focussing on Subject Alternative Names as the primary means for making certificate files larger, and while there are other fields that can add data to the certificate file size, they all work out having a similarly small impact for sensible values.  Unless you're reading this from the distant future where RSA keys with 512Kbit or more are the standard to protect against the quantum computers that we all carry around in our heads while riding on our hover boards, we can consider these sizes to be typical.

So at least for now, we can confidently say a certificate will be between 89 bytes and 200KB, certificates larger than 16KB would be unusual, and most certificates will be between 1KB and 4KB.

To be super nerdy, we could also observe that a certificate can never be 130 bytes, 259 bytes, or 65540 bytes long.  This is because a certificate which has a payload of 127 bytes will have 2 bytes of header (one for the 0x30 sequence, and one for the length) making 129 bytes in total, but a certificate which has a payload of 128 bytes needs 3 bytes of header (one for the 0x30 sequence, and two for the length) making 131 bytes in total.  There is no combination which produces a certificate which is 130 bytes long.  There are similar gaps where the payload first requires three and four length bytes.  /nerdy
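
If you'd rather not take my word for those gaps, a brute-force check is easy enough.  The sketch below computes the total size for every payload length up to 70'000 bytes and lists the totals that never occur.  The function name der_total_size is my own, and it assumes a single identifier byte and the minimal length encoding that DER requires.

```python
def der_total_size(payload_len: int) -> int:
    """Total element size: 1 identifier byte + length bytes + payload."""
    if payload_len < 0x80:
        length_octets = 1                                   # Short Form
    else:
        # Long Form: one count byte plus the minimal big-endian length.
        length_octets = 1 + (payload_len.bit_length() + 7) // 8
    return 1 + length_octets + payload_len

reachable = {der_total_size(n) for n in range(70000)}
gaps = [size for size in range(2, 66000) if size not in reachable]
print(gaps)
```

The only sizes in that range that are missing are the three mentioned above.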

* It turns out they do exist.

How Do I Identify a Certificate File?

Well, I do love a good file format.

We've talked about the structure of the file above, and talked about the expected overall length of the file, so we'll now look at using that information to identify a certificate file as far as we can.  I'll focus first on binary certificates, and then extend that to look at base64-encoded certificates as well.

While our aim here is to try to find a byte pattern that can identify a certificate file, this isn't very easy as certificate files don't contain a positive file identifier, and it's especially difficult to separate a Certificate file from a Certificate Signing Request (CSR) file as they are very similar.  We'll have to settle for determining if it might be a certificate file.

The start is very easy, if it doesn't start with a 0x30 byte (the identifier byte for the first sequence), it's not a binary certificate.  It's a simple test but it eliminates 255 of 256 possibilities.  However, 0x30 is also the ASCII code for '0', so an ASCII or UTF8 encoded text file starting with a '0' will pass this test.

The second byte is going to be the first length byte for the overall certificate.  Taking our calculation from above, we'll take 89 bytes as the shortest length and 16KB as the longest.  As 2 of those 89 bytes are the outer header itself, the shortest content length is 87 (0x57) bytes, so the second byte is going to be 0x57 to 0x7F, or 0x81, or 0x82.  It can't be less than 0x57 as the certificate would be too short to be valid.  It can't be 0x80 as that is invalid (as it indicates 0 length bytes), and it is very unlikely to be greater than 0x82 as that would indicate a certificate length of greater than 64KB.

For 0x81 or 0x82, we would then expect to have one or two more bytes of length, but we wouldn't be able to make any helpful assumptions about their value.

Following the length bytes, we're going to have another 0x30 byte to indicate the start of the "TBSCertificate" structure, and then another encoded length.

This gives us 6 possible patterns to look for in the first few bytes of a certificate:

  1. 30 xx 30 yy  (where xx is between 0x57 and 0x7F)
  2. 30 81 xx 30 xx
  3. 30 81 xx 30 81 xx
  4. 30 82 xx xx 30 81 xx
  5. 30 82 xx xx 30 82 xx xx
  6. 30 82 xx xx 30 xx

The majority of certificates in the wild will be in category 5, covering certificates from less than 1KB up to 64KB in size.

[You'll notice I include category 6, but I'm not certain if this combination is practical.  This represents a certificate structure of at least 256 bytes, with a TBS certificate of at most 127 bytes (ignoring the header structures).  This would mean that the signatureAlgorithm and signatureValue fields combined would have to be at least 129 bytes long.  We could use SHA2-512 to give us 64 bytes for the signatureValue to get us half-way there, but I don't believe any encoded OIDs are anywhere near the 64(ish) bytes needed from the signatureAlgorithm for the rest.  Is this possible, answers on a postcard please?]

While this is great as a pattern it doesn't identify a certificate specifically, it identifies a DER-encoded ASN.1 file that starts with two sequences.  Below we'll look further through the RFCs to separate the certs from the CSRs, but for most purposes I'd recommend stopping here and using the patterns above to decide that the file is a "Certificate or CSR or ASN.1" and have the handling program decide which when the file is opened.
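
As a sketch of that "might be a certificate" test, the Python below sorts the first few bytes of a file into one of the six categories.  The names header_category and seq_header are hypothetical, and to keep it short it does not enforce the minimum length discussed above for category 1, nor check the length against the actual file size.

```python
from typing import Optional

# (outer header size, inner header size) -> category number from the list above.
CATEGORIES = {(2, 2): 1, (3, 2): 2, (3, 3): 3, (4, 3): 4, (4, 4): 5, (4, 2): 6}

def seq_header(data: bytes, offset: int) -> Optional[int]:
    """Return the size of a SEQUENCE header at offset, or None if there isn't one."""
    if len(data) < offset + 2 or data[offset] != 0x30:
        return None
    first = data[offset + 1]
    if first < 0x80:
        return 2                       # identifier + Short Form length
    if first in (0x81, 0x82):
        return 2 + (first & 0x7F)      # identifier + count byte + length bytes
    return None                        # 0x80 is invalid, 0x83+ is implausibly large

def header_category(data: bytes) -> Optional[int]:
    """Return 1 to 6 if data starts with two nested SEQUENCE headers, else None."""
    outer = seq_header(data, 0)
    if outer is None:
        return None
    inner = seq_header(data, outer)
    if inner is None:
        return None
    return CATEGORIES.get((outer, inner))

print(header_category(bytes.fromhex("308201EA30820194")))  # NASA certificate
print(header_category(b"0 this is just text"))
```

The NASA certificate lands in category 5, while a text file that merely starts with a '0' character is rejected at the second SEQUENCE.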

So, carrying on through the specifications we could possibly check the version number field which is optional in the certificate file, and is mandatory in the CSR.  However, this is only present in version 2 or 3 certificates, so will misidentify any version 1 certificates.  If present, the header bytes above will be followed by the sequence A0 03 02 01 xx.

Looking further, both file types then have an integer field (the serialNumber in a version 1 certificate, and the version field in the CSR), and then a distinguished name sequence field (the issuer name in the certificate, and the subject name in the CSR) neither of which help to tell a certificate from a CSR.

The first reliable difference between the two file types doesn't appear until you hit the Validity field in the certificate, which has no equivalent in the CSR.  There are two possible date formats, both of which have specific lengths; a short date format with a 2-digit year for any dates before 2050, and a long date format with a 4-digit year for dates in 2050 and beyond.  As we can assume the notBefore date will be before the notAfter date, this gives three possibilities for the byte sequence at the beginning of the validity field:

  • 30 1E 17 0D - For a certificate expiring before 2050
  • 30 20 17 0D - For a certificate issued before 2050, and expiring after.
  • 30 22 18 0F - For a certificate issued after 2050.

Of course, these three four-byte sequences could appear in a CSR, so let's consider that for a moment.  The probability of one of them appearing at any given position by random chance (such as in a hash, signature, or serial number) is 3 in 2^32 (around 1 in 1.5 billion) so pretty low.  The sequences are unlikely to appear in text as they all contain bytes which are control characters in ASCII.  They don't make sense as DER length bytes except as encoding dates.  While they could appear as part of an OID encoding, I couldn't find any meaningful matches.  All in all, it is unlikely that one of these will appear in a CSR.

So if you need to quickly separate a certificate from a CSR by pattern matching against the bytes, we can confidently say that a certificate (of any version) will contain one of these three byte sequences somewhere within the file, whereas they are unlikely to appear in a CSR.  Beyond that, implement a DER-decoder and check the ASN.1 object tree against the specification for a certificate.
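
That quick pattern match could look something like the sketch below.  The names VALIDITY_MARKERS and find_validity_marker are my own, and as discussed it is a heuristic rather than a proper parse.  The sample bytes are the validity and signature fields from the NASA certificate.

```python
# The three possible openings of the Validity field.
VALIDITY_MARKERS = (
    bytes.fromhex("301E170D"),  # both dates in the short format
    bytes.fromhex("3020170D"),  # short notBefore, long notAfter
    bytes.fromhex("3022180F"),  # both dates in the long format
)

def find_validity_marker(data: bytes) -> int:
    """Return the offset of the first marker found, or -1 if none is present."""
    hits = [data.find(marker) for marker in VALIDITY_MARKERS]
    hits = [hit for hit in hits if hit >= 0]
    return min(hits) if hits else -1

validity = bytes.fromhex(
    "301E170D3936303433303232303530305A170D3937303433303232303530305A"
)
algorithm = bytes.fromhex("300D06092A864886F70D0101040500")

print(find_validity_marker(algorithm + validity))
print(find_validity_marker(algorithm))
```

A result of -1 means "probably not a certificate"; anything else is the offset where the validity field appears to start.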


Why Do My Certificates All Start with "MII"?

So, this is cool in a very geeky way.  To be honest, this was the question that I wanted to answer first, but only now do we have all the pieces to be able to do the answer justice.

Short answer:  We figured out above that most certificates in our expected size of 1KB to 4KB have the pattern 30 82 xx xx 30 82 xx xx as their first bytes, which in base64-encoding gives "MII___CC" (where an underscore is a wildcard).

As above, we found the first few bytes of the certificate when it is binary encoded fall into one of six categories:

  1. 30 xx 30 yy  (where xx is between 0x57 and 0x7F)
  2. 30 81 xx 30 xx
  3. 30 81 xx 30 81 xx
  4. 30 82 xx xx 30 81 xx
  5. 30 82 xx xx 30 82 xx xx
  6. 30 82 xx xx 30 xx

The most common of these is category 5 which represents certificates from less than 1KB up to 64KB, although to get the answer we expect we actually only want the subset of this category which is up to 16KB long, which means the first length nibble is between 0 and 3, as in 30 82 [0-3]x xx 30 82 xx xx

Base-64 encoding takes 3 bytes and encodes it into 4 ASCII characters.  It does this by taking the 24 bits from the 3 bytes (3 bytes * 8 bits/byte = 24 bits) and dividing it into 4 blocks of 6 bits each (4 * 6 = 24 bits).  These 6 bit blocks are then converted into ASCII characters using a lookup table to produce the 4 characters of output.  A single 6 bit block has 64 combinations, hence the name.
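
Here is that 3-bytes-to-4-characters step done by hand in Python and checked against the standard library.  The name encode_triplet is mine and it only handles exactly three bytes, so there is no padding logic.

```python
import base64
import string

# The standard base64 lookup table: A-Z, a-z, 0-9, then "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_triplet(three_bytes: bytes) -> str:
    """Encode exactly three bytes into four base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    bits = int.from_bytes(three_bytes, "big")  # 24 bits in one integer
    # Peel off four 6-bit blocks, most significant first.
    return "".join(ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))

first_three = bytes.fromhex("308201")  # start of the NASA certificate
print(encode_triplet(first_three))
assert encode_triplet(first_three) == base64.b64encode(first_three).decode("ascii")
```

The output is the first four characters of the base64-encoded NASA certificate shown earlier.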

We can see how the bytes align with the base-64 characters as follows:

   30        82      [0-3]x      xx        30        82
0011 0000 1000 0010 00xx xxxx xxxx xxxx 0011 0000 1000 0010
AAAA AABB BBBB CCCC CCDD DDDD AAAA AABB BBBB CCCC CCDD DDDD
M      I       I      _       _      _       C      C

As you can see from this alignment, this will always produce the familiar "MII___CC" pattern seen in the base-64 encoding of most certificates.   tada.wav

[As mentioned above, this pattern will actually begin any base64-encoded, binary representation of an ASN.1 structure that begins with two sequences and is within the size constraints, so this will match CSRs and other files.]

While that works for certificates between 1KB and 16KB, the other categories have similar patterns, so here they are for completeness.  Here, an underscore is a wildcard, and brackets contain the possible characters for a single position.

  1. M[FGH]_w
  2. MI[EFGH]_M
  3. MI[EFGH]_MI[EFGH]
  4. MI[IJKL]__[DTjz]CB
  5. MI[IJKL]__[DTjz]CC
  6. MI[IJKL]__[DTjz][AB]

What are OIDs?

In a few places in this article I have mentioned OIDs.  These are "Object Identifiers" which are used in certificates to specifically identify fields and other information in the data.  Each OID is a set of decimal numbers separated by periods.  The registry of assigned OIDs is maintained by the "International Telecommunication Union" (ITU) and the "International Organization for Standardization" (ISO), and companies can make requests to register new OID values.

Appendix A of the certificate RFC lists many of the OIDs that are commonly used within certificates.  I have included these below as examples of OIDs, along with links to http://www.oid-info.com/ which is the best online reference I have found for them. 

[I have deliberately referenced an older version of the RFC for certificates throughout these articles to give a grounding in the technology from the most simple version.  However, the OIDs in appendix A are quite out of date because of the development of new algorithms etc. so looking at one of the updated versions is recommended if you intend to implement based on this.]
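
To connect the dotted-decimal form with the bytes we saw in the hex dump, here is a minimal Python sketch of decoding an OID's content bytes.  The encoding rules it relies on (the first byte packs the first two numbers as 40 * first + second, and each later number is stored base-128 with the top bit marking "more bytes follow") are not covered in this article, so treat this as a side note.  The function name decode_oid is mine, and it takes the simplifying assumption that the first two numbers fit in a single byte, which holds for the OIDs in our certificate.

```python
def decode_oid(content: bytes) -> str:
    """Decode the content bytes of an OBJECT IDENTIFIER into dotted decimal."""
    first = content[0]
    parts = [first // 40, first % 40]   # simplification: first two numbers in one byte
    value = 0
    for byte in content[1:]:
        value = (value << 7) | (byte & 0x7F)   # accumulate 7 bits at a time
        if not byte & 0x80:                     # top bit clear: number is complete
            parts.append(value)
            value = 0
    return ".".join(str(part) for part in parts)

# Content bytes taken from the NASA certificate's signature and issuer fields.
print(decode_oid(bytes.fromhex("2A864886F70D010104")))
print(decode_oid(bytes.fromhex("550406")))
```

The resulting strings are what you would paste into an OID reference site to look up the meaning.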

Distinguished Name Fields (Subject and Issuer)

Each of the individual fields that make up the Subject and Issuer data have a separate OID, such as the Common Name, Organisation, Locality, etc.  Most of these will be familiar and can be seen in any website certificate, but some will be less familiar to most people:

Algorithm Identifiers

These are used in a few different places in the certificate, but primarily identify the type of public key that is being carried and the hashing algorithm used for the signature of the certificate.  The RFC states that this list is extensible, so obviously this is not a complete list:


Closing Thoughts

That completes the second part of our deep dive into certificates.  As ever, I hope that you have found something in here useful.

Between these first two articles I have included most of the detail that I wanted to describe.  While I have some further notes and thoughts on a few more topics, it is not quite enough yet to justify a part three, so I'll revisit this at some point in the future.  If you have any questions or suggestions about certificates or other topics, please feel free to leave a comment.


