2017-04-17 AtomicityCodec Introduction

AtomicityCodec is an XML-based grammar designed for declaratively specifying the outline structure of a file.  I have recently been working on it to fill a number of roles within my OS.  This is a long post to explain the what and why, but I'm leaving the actual technical details of the grammar for a future post.

The grammar is designed to achieve the following goals:

  • Define a wide variety of file formats including filesystems.
  • Provide a first-level parsing of files for an associated codec to then use as a reference for further parsing.
  • Provide a means of identifying a file based on its content.
  • Serve as a basis for an "intelligent" hex viewer, which will have applications beyond AtomicityOS.
  • Provide a means for a basic file format, or a minor variant of an existing format, to be added to the system without needing to write any code.
  • Design a binary file parsing grammar that will have uses to a wider audience than just myself.
  • Possibly have the format usable to save out files as well as load them.

Background


I have long been interested in file formats, and have always wanted a means to define, modify, and play with files in an easy and responsive manner.  There are a few hex viewer tools on the market that allow a structure definition to be applied to the data to visualise or extract values, but these tend to be rather limited.  A number of groups have attempted to create a grammar to describe a binary file format, but these projects have either been abandoned, or are commercial with limited usefulness.  So I have defined my own grammar (as with everything else) which is still evolving (I'm considering adding iterative loops) but is sufficiently complete to describe a host of formats in a breadth-first manner.  This grammar allows me to make ad-hoc changes to how the structure of a given file is parsed and to see the results of those changes in real-time.

My intention is that this system will provide a first-level parsing of files for the codec system within my OS.  I could have written file loading code for the plethora of file formats that I want my OS to be able to support in the first release, but I would have spent a very long time writing boring, boiler-plate code that is just different enough between each format to not be sharable.  Not to mention the possibility that the code may then contain buffer overflow vulnerabilities and other nasties that certain larger software houses seem to have great trouble avoiding.  The AtomicityCodec mechanism allows me to complete a great deal of the code quickly, declaratively, portably, and with a host of other benefits.


File Type Identification


As this will ultimately be used within my OS, a goal is for the grammar to define how to identify whether a given file matches a given grammar.  In order to be able to test a given file against the codecs that the system supports, I could:

  • search for a match on the file extension (this is so limited it's stupid),
  • pass the file to every codec in turn so that they can each run a function and decide the file isn't for them (this isn't scalable),
  • have identification code that is separate from the parsing code so that it can be loaded into the generic file identifier (with all the security issues that would involve),
  • have each codec define a string or other blob that can be interpreted by the file identifier to check a file (which would be separate from the actual parsing code),
  • or do something else.

I went with the "do something else" option, and that "something else" is to use the XML grammars and to define within them a series of "allowable values" for any field.  This enables the file identifier to intelligently work out a signature (and a specificity) for each file, and for the file parsing and identification data to be declared together.

Advantages of this approach:

  • The file identifier is free to work out what parts of the file will be relevant.  Most formats can be confirmed or rejected with the first 32 or 64 bytes of data, but some (such as CD file-system ISO9660) require other data.
  • The system can quickly eliminate formats with a more specific signature.  PNG defines an eight-byte signature (giving a 1 in 2**64 chance of a false positive) whereas GZIP only defines a two-byte signature (1 in 65536 chance).  Generic formats such as "text" are especially broad in their definition.
  • The file identifier can work out if there is extra data at the end of the file which can be interpreted as another file format, which is completely valid for some formats (such as the concatenated PNG formats used for icons).
  • The system can also identify the storage of one format within another (such as low-resolution, low-quality lossy JPEG thumbnails embedded inside a lossless image format, EXIF data embedded in multiple different image format types, or various steganography techniques) by having the grammar define where in a format arbitrary data can be stored.

Even file formats which traditionally are not considered to have a signature can have one derived using this mechanism.  For example, a FAT file-system (either on-disk or in an image file) has no explicit signature in the specification to identify it.  However, there are a number of fields within the first few bytes of the file-system that have sufficient restrictions on their values to be viable as a signature through this means.  Specifically, the first byte is a JUMP instruction which the specification defines as having one of two possible values; the "Bytes Per Sector" value is a 16-bit field with only four allowable values; and the "Sectors Per Cluster" value is another byte with eight allowable values.  Working all this out gives us a 1 in 67108864 (= 1 in 2**26) chance that a given non-FAT file will be misidentified as a FAT file.  I like those odds.
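The arithmetic is simple enough to sketch.  A minimal Python check, using the field widths and allowable-value counts described above:

```python
# A quick check of the FAT specificity arithmetic, using the field widths
# and allowable-value counts described above.
from fractions import Fraction

fields = [
    # (field, width in bits, number of allowable values)
    ("JMP instruction",      8, 2),
    ("Bytes Per Sector",    16, 4),
    ("Sectors Per Cluster",  8, 8),
]

# Probability that a run of random bytes passes every constraint by chance.
p = Fraction(1)
for _name, bits, allowed in fields:
    p *= Fraction(allowed, 2 ** bits)

print(p)                           # 1/67108864
print(p == Fraction(1, 2 ** 26))   # True
```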

As I mentioned earlier, CD file-systems are slightly troublesome in that they define the first 32KB of the file-system to be for arbitrary system use, and the file signature only appears after that point.  At some point the file identification subsystem is going to have to trade off the likelihood that the file is a CD image (or other late-signature format) against the cost of loading more data from the file.  At least this mechanism gives it that capability.


ZIP File Format


While I'm on the subject of specific file formats, I'll have an aside rant at the ZIP file format.  To positively identify a ZIP file (or any of its derivatives including DOCX and the Microsoft Office formats, or PK3, or JAR) the APPNOTE specification says that you use the existence of the metadata File Headers and other structures.  It also says that the only one of these that has to be there is the "End Of Central Directory Header" record which will appear at an unspecified point, somewhere near the end of the file.  Oh and there can be multiple of these "End of Central Directory Header" (EoCDH) records in the file, with only the last one being the "correct" one.  It's also the first structure you have to find in order to actually read a ZIP file, but how you actually find the correct "End of Central Directory Header" record is not specified, and different implementors have applied this differently.

Possibly the most "correct" way to find it is to parse the first header structure in the file, work out the offset to the next header, and follow the chain to the end of the file.  This approach has a number of problems: you have to step through the whole file, which will be slow if it contains many entries; each header type has a different way of calculating the offset to the next header, so the format is not extensible (it's not a TLV format); and a single bit error in any of the fields required to calculate the offset will render the chain unfollowable.

Some implementors attempt to improve performance by skipping to near the end of the file and searching forwards to find the required header (I believe Windows Compressed Folders does this, but I need to confirm).  This fails badly for a number of reasons.  I said before that the EoCDH record will appear "near" the end of the file; technically it has to be the last structure in the file, except that it includes a variable-size "ZIP File Comment" field that means the structure's offset from the end of the file cannot be predicted.  More than that, the content of the "ZIP File Comment" has no constraints, so could easily contain a fake (accidental or otherwise) EoCDH record signature.  Also, if the ZIP file contains a ZIP file stored with no compression, it will contain a perfectly valid EoCDH record of its own, but one that doesn't end at the end of the file.
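The false-positive problem is easy to demonstrate.  A small sketch, using the real EoCDH signature bytes (PK\x05\x06) and the standard record layout with the fixed fields zeroed for brevity, shows a backward scan landing on signature bytes inside the comment rather than on the real record:

```python
# Demonstration: a naive backward scan for the EoCDH signature can land on
# signature bytes that merely appear inside the "ZIP File Comment".
import struct

EOCD_SIG = b"PK\x05\x06"

# A comment that happens to contain the signature bytes.
comment = b"harmless text " + EOCD_SIG + b" more text"

# Real EoCDH record: signature, 16 bytes of fixed fields (zeroed here for
# brevity), the two-byte comment length, then the comment itself.
eocd = EOCD_SIG + b"\x00" * 16 + struct.pack("<H", len(comment)) + comment

tail = b"...central directory entries..." + eocd

true_offset = tail.index(eocd)        # where the real record begins
naive_offset = tail.rfind(EOCD_SIG)   # "scan backwards for the signature"

print(naive_offset == true_offset)    # False: the scan found the fake copy
```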

Ultimately, if I wanted to, I could construct a perfectly valid ZIP file that contained three valid-looking EoCDH records that all ended at the end of the file and would trip up any ZIP utility that attempted to scan backwards.  I create a first ZIP file which has a non-zero "ZIP File Comment Length" field, but no "ZIP File Comment" data (which would be invalid, but bear with me).  The value in the "ZIP File Comment Length" would be carefully calculated to be a specific value that I'll explain later.  Store (without compression) this first ZIP file into a second ZIP file.  Give this second ZIP file a "ZIP File Comment" that contains another valid EoCDH record.  This second ZIP file is now complete, and contains three perfectly valid-looking EoCDH records; one in the inner ZIP file (with a comment length that validly includes the rest of the outer ZIP file as its "ZIP File Comment", so that it ends at the end of the outer file), the correct one for the outer ZIP file, and the fake one in the outer ZIP file's comment field.  Any ZIP file utility that attempts to scan backwards from the end of the file to find the EoCDH record will fail miserably.  I accept this is a far-fetched example, but it demonstrates TWO scenarios where such an approach will fail, which shouldn't be necessary or even possible with a well-defined file-format.

To be more constructive, I recommend a specific method of modifying the ZIP file format in a completely compatible way to resolve this and improve ZIP performance everywhere (I may even submit this to PKWare if I get the chance).  Define that a ZIP file should contain, as its first entry, a Local File Header with specific properties.  The file name field would contain a signature such as "PKZIPFILE" (or similar, look at the PNG format signature), it would have a deliberately invalid checksum, it need not contain any compressed data, it would not appear in the Central Directory, and it would contain somewhere within it (perhaps in the date-time fields, in the File name, or in the file data itself) the offset to the EoCDH record.  In order to maintain the feature of ZIP files being capable of being written in a forward-only manner, if the offset for the EoCDH record is not known when the start of the file is written, the value is written as 0xFFFFFFFF (or the 64-bit equivalent), and then post-updated if the medium allows.  If the value cannot be updated, the header still serves as a positive file type identification.
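As a sketch of this proposal: the code below packs a standard Local File Header (layout per the ZIP APPNOTE) with the name "PKZIPFILE", a deliberately invalid CRC, and the EoCDH offset stored as its eight bytes of file data.  Storing the offset in the file data, rather than the date-time or name fields, is just one of the options floated above, not a settled design:

```python
# Hypothetical encoding of the proposed signature entry.  The header layout
# is the standard Local File Header from the ZIP APPNOTE; storing the EoCDH
# offset in the file data is an arbitrary choice among the options above.
import struct

def make_signature_entry(eocdh_offset: int) -> bytes:
    name = b"PKZIPFILE"
    data = struct.pack("<Q", eocdh_offset)   # all-ones if not yet known
    header = struct.pack(
        "<4sHHHHHIIIHH",
        b"PK\x03\x04",   # local file header signature
        20,              # version needed to extract
        0,               # general-purpose bit flags
        0,               # compression method: stored
        0, 0,            # last-mod time and date
        0xDEADBEEF,      # deliberately invalid CRC-32
        len(data),       # compressed size
        len(data),       # uncompressed size
        len(name),       # file name length
        0,               # extra field length
    )
    return header + name + data

entry = make_signature_entry(0xFFFFFFFFFFFFFFFF)  # offset unknown at write time
print(len(entry))   # 30-byte header + 9-byte name + 8 bytes of data = 47
```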

A ZIP file utility that is compliant with this modification will read the filename to positively identify the file as a ZIP file, read the EoCDH record offset, and skip to it directly.  It will still need to perform sanity checks, such as checking the signature and the end point, to ensure that the file has not been updated by a ZIP utility not compliant with this modification, and will fall back to whatever-it-did-before if these tests fail.  When updating a ZIP file, a compliant utility will add or update this file header.

A ZIP utility not compliant with this modification will correctly ignore this entry as it does not appear in the Central Directory.  It will consider it a redundant file that has been deleted from the ZIP by a forward-only utility and either ignore it or remove it, neither being detrimental.  Worst-case scenario is that the utility attempts to extract the file and either fails because of the checksum error, or extracts a file called "PKZIPFILE" with nonsense content.  This latter case means that the utility violates the standard in at least two ways (extracting a file not in the Central Directory and ignoring the checksum) and should be discarded anyway.

Anyway, that was a long aside, where was I?

Usefully Parsing a File


The main reason for writing the grammars is to read a file and to be able to present the data to a program in a form it can use.  The complete solution will contain several parts: a generic file parser library, a collection of grammars and codecs, a collection of "classes" (such as bitmap image, vector image, PCM audio, 3D object, movie, etc.), and the third-party application.

The workflow is similar to that of the Datatypes system from AmigaOS, and will be:

  • The third-party application will prompt the user for a file to open,
  • The third-party application will invoke the generic parser library, sending the filename and a collection of classes the application supports,
  • The generic parser library will use the collection of grammars to attempt to identify the type of file,
  • The generic parser library will then parse the file against the grammar to create a parse tree structure,
  • The generic parser library will then attempt to find a codec that matches both the file format and one of the classes supported by the application,
  • The generic parser will then send the parse tree to the codec to load the file as that class, then return the class object back to the application,
  • If the generic parser cannot find an exact match for the file format and a supported class, it may try to find a partial match (such as only returning the audio track from a movie file) or possibly do something a little more exotic (#spoilers).
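The steps above can be sketched roughly as follows.  Every name here (matches, parse, produces, load, and so on) is hypothetical, since the real API is not described in this post:

```python
# A rough sketch of the dispatch flow.  Grammar and codec objects are
# assumed to expose matches/parse and format/produces/load respectively;
# none of these names come from the actual AtomicityCodec API.

def open_as(filename, supported_classes, grammars, codecs):
    """Identify a file, parse it, and decode it into one of the classes
    the calling application says it supports."""
    with open(filename, "rb") as f:
        data = f.read()

    # 1. Identify: find a grammar whose signature matches the file content.
    grammar = next((g for g in grammars if g.matches(data)), None)
    if grammar is None:
        raise ValueError("unrecognised file format")

    # 2. Parse the file against the grammar into a parse tree.
    tree = grammar.parse(data)

    # 3. Find a codec matching both the format and a supported class.
    for codec in codecs:
        if codec.format == grammar.format and codec.produces in supported_classes:
            # 4. The codec reads the parse tree and builds the class object.
            return codec.load(tree, data)

    # (The real system might fall back to a partial match here.)
    raise ValueError("no codec produces a class the application supports")
```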

I don't intend for any grammar to ever parse every detail of the file, only to read values significant to the structure of the file, and possibly include in the parse tree a few primitive fields that may be beneficial to parse generically, such as audio sample frequency, image width, or number of movie frames.  The parse tree should be an index into the file that the codec can use, and should only ever be a fraction of the file size.  It will not contain the actual content of the file such as compressed file data, sound waveform data, or JPEG-encoded content.  These blobs of data will be represented in the parse tree as a file offset and a length, allowing the codec to jump directly to the required data to read and interpret without fear of encountering a buffer overflow, reading beyond the end of the file, or other issues.
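The offset-and-length idea might look something like this (the class names are illustrative only, not part of any real AtomicityCodec API):

```python
# Illustrative-only classes showing blobs stored as references into the
# file rather than copies of the data.
from dataclasses import dataclass, field

@dataclass
class Blob:
    offset: int   # byte offset of the data within the file
    length: int   # number of bytes of data

@dataclass
class Node:
    name: str
    value: object = None              # a parsed primitive, or a Blob
    children: list = field(default_factory=list)

# e.g. a sound file: the small header fields are parsed into the tree,
# while the waveform stays in the file as an offset and a length.
tree = Node("wave", children=[
    Node("sample_rate", 44100),
    Node("samples", Blob(offset=44, length=1_000_000)),
])

def read_blob(f, blob: Blob) -> bytes:
    """The codec jumps straight to the data it needs; the bounds are known
    up front, so there is no reading past the end of the structure."""
    f.seek(blob.offset)
    return f.read(blob.length)
```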

The relationship between grammar and codec will, in many cases, be one-to-one as the codec needs to know the node names and paths within the parse tree that the grammar will generate in order to read it.  However, some formats are so similar that the differences between them can be represented entirely within the grammars so that a single codec could utilise many grammars.  Think of PCM-encoded sound sample formats such as WAV, 8SVX, AIFF, AU, AUD and SND.  A single codec could support the full range of features of all of these formats (8, 16, or 32-bit samples, uLaw, aLaw, little-endian, big-endian) with each of the grammars then defining values to tell the codec what the specific file was.  Another example is uncompressed archive files such as TAR, Doom's WAD, Command and Conquer's MIX, and dozens of other file distribution archives which are minor variations on a theme to package assets without the overhead of having to decompress or decrypt the content.  All of these could easily be handled by a single codec with multiple grammars.
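As a sketch of the one-codec-many-grammars idea for PCM audio: a single decode routine driven by parameters that, in the real system, would come from the grammar.  The parameter names here are invented:

```python
# One PCM decode routine, parameterised per format.  In the real system the
# parameters would come from the grammar; the names here are invented.
import struct

def decode_pcm(raw: bytes, *, bits: int, endian: str) -> list:
    fmt = {8: "b", 16: "h", 32: "i"}[bits]   # signed sample widths
    prefix = "<" if endian == "little" else ">"
    count = len(raw) // (bits // 8)
    return list(struct.unpack(prefix + fmt * count, raw))

# The same bytes, decoded under two different sets of grammar parameters:
raw = b"\x01\x00\x02\x00"
print(decode_pcm(raw, bits=16, endian="little"))   # [1, 2]     (WAV-style)
print(decode_pcm(raw, bits=16, endian="big"))      # [256, 512] (AIFF-style)
```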

Openness of the Format


Because the grammar doesn't attempt to read every aspect of the format and uses generic XML, it also has the benefit of being reasonably human-readable (given a suitably technically-minded human).  This means that, even without a description of the specifics of the format, many of the grammar files can serve as a way to document the outline of the file format, with comments and other references to allow finding more specifics.  Releasing the full and finished specification for the grammar would allow others to create and share grammar files, and a suitable codec (for whatever operating system) could be written and contributed by someone else.

This allows me to segue into a big feature of this structure.  Just like the Commodore Amiga operating system, one goal of AtomicityOS is to empower the user to tinker with the system.  The AtomicityCodec system allows a user to create grammars for a new file format that is similar to an existing one but with differences.  If they find an uncompressed archive such as an obscure dialect of TAR, or if CompuServe were to release a GIF18a which is similar to GIF89a, a user may be able to create a grammar to cover the format with an existing codec, allowing basic interoperability even if it doesn't allow the full range of features.  Yes, they could change all the grammars in the system and break things, but with great power and all that.

Empowering the user leads on to the next goal of the format: to allow creation of a hex viewer (hexadecimal viewer to be more specific) that allows a grammar to be applied to a hex view of a file to see the parse tree and to highlight the fields within the hex view.  I know that there are a few hex viewers available which offer a similar function, but many of them are limited to simply applying a C-like structure at a user-defined point in the file, rather than defining a structurally complete, conditional grammar to evaluate the skeleton of the file.  The hex viewer would also allow the grammar to be edited in real time, showing the changes directly against an example file of the type, so that a grammar can be created from scratch.


I think there's more I intended to say, but I'll leave that for another time.  I have already written a prototype parser and hex viewer in .NET, and nearly two-dozen grammars for common file formats across a range of types so that I can evaluate the suitability of the features I have developed so far.  There are definitely some shortcomings with the prototype, but I'm working on those.  I'll leave you with a characteristic image of what I'm working with at the moment.

Hope this helps


