Monday, 3 February 2014

Argot Message Format - A self describing binary message format

Today I committed the first implementation of the Argot Message format to the Subversion repository.  This is a feature I've been wanting to add for some time, and a project I'm currently working on has given me the excuse to get it done.  The idea of the Argot Message format is that a binary message carries both the data and its data dictionary with little overhead.

It's easiest to understand with a demonstration.   As part of the Argot test code I've defined a data type called 'demo':

definition demo 1.0:
{
    @short #int16;
    @byte #int8;
    @text #u8utf8;   
};

This rudimentary type contains three fields named 'short', 'byte' and 'text'.  An instance with the values 10, 51 and 'hello' written to a stream would look like:

00 0a 33  05 h  e  l  l  o  

This can be described as:

short - 2 bytes with value 10.
byte - 1 byte with value 51.
text - 6 bytes: one length byte with value 5, followed by the text 'hello'.
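
As a rough sketch, the layout described above can be reproduced by hand in plain Java without the Argot library. The class and method names here are made up for illustration, and the sketch assumes big-endian byte order (as the `00 0a` in the dump suggests) and a one-byte length prefix for the text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Argot: hand-encodes the 'demo' fields.
final class DemoPayloadSketch {

    // Writes the fields in definition order: short, byte, then length-prefixed text.
    static byte[] encode(int shortValue, int byteValue, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 255) {
            throw new IllegalArgumentException("text does not fit a one-byte length prefix");
        }
        // ByteBuffer is big-endian by default, matching the dump.
        ByteBuffer buffer = ByteBuffer.allocate(2 + 1 + 1 + utf8.length);
        buffer.putShort((short) shortValue); // 2 bytes: 00 0a for 10
        buffer.put((byte) byteValue);        // 1 byte:  33 for 51
        buffer.put((byte) utf8.length);      // 1 byte:  05, the text length
        buffer.put(utf8);                    // the text bytes: h e l l o
        return buffer.array();
    }

    // Formats bytes as space-separated lowercase hex.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Calling `DemoPayloadSketch.encode(10, 51, "hello")` yields 9 bytes, and `toHex` shows them as `00 0a 33 05 68 65 6c 6c 6f`, where the last five bytes are the ASCII codes for 'hello'.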

If an application received just these 9 bytes alone it would need to have previously known that the sender was sending this 'demo' type.  However, with the Argot message format the following is sent:

A  13 01 32  20 00 04 d  e  m  o  01 00 1b 0f 03 0e 05 s  h  o  r  t  0d
28 0e 04 b  y  t  e  0d 01 0e 04 t  e  x  t  0d 08 32  00 0a 33  05 h  e  l  l  o 

The message is now 51 bytes, but it contains a full description of the data format along with the actual data.  Breaking the format down:

A  - Magic value indicating this is an Argot message format.
0x13 - Version of the Argot meta dictionary and Argot message format being used (version 1.3).
0x01 - The number of data types defined.  One here but could be thousands.
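
The three header bytes could be read along the following lines. This is only a sketch with hypothetical names: it assumes the version byte packs major and minor numbers into its high and low nibbles (so 0x13 reads as 1.3), and it treats the type count as a single byte, which is all this example shows. A dictionary with thousands of types would need a wider encoding that is not covered here.

```java
// Hypothetical reader for the first three bytes of the example message.
final class HeaderSketch {
    final int major;
    final int minor;
    final int typeCount;

    private HeaderSketch(int major, int minor, int typeCount) {
        this.major = major;
        this.minor = minor;
        this.typeCount = typeCount;
    }

    static HeaderSketch parse(byte[] message) {
        if (message.length < 3) {
            throw new IllegalArgumentException("message too short for a header");
        }
        if (message[0] != 'A') {
            throw new IllegalArgumentException("missing 'A' magic value");
        }
        int version = message[1] & 0xff;
        // Assumption: high nibble is the major version, low nibble the minor.
        int major = version >> 4;
        int minor = version & 0x0f;
        // Assumption: the count fits in one byte, as it does in the example.
        int typeCount = message[2] & 0xff;
        return new HeaderSketch(major, minor, typeCount);
    }
}
```

For the bytes `A 13 01` this gives major 1, minor 3 and a type count of 1.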

Each data type entry consists of a unique identifier, a type name identifier and a type definition.  In this message the single entry contains:

0x32 - The unique identifier for the data type. Integer 50.
20 00 04 d  e  m  o 01 00  - The type location and version.  demo version 1.0.
1b 0f 03 0e 05 s  h  o  r  t  0d 28 0e 04 b  y  t  e  0d 01 0e 04 t  e  x  t  0d 08 - The structure of the data type as defined above.

The actual data follows the data dictionary:

0x32 - The identifier for the data type that follows.  In this case the 'demo' type.
00 0a 33  05 h  e  l  l  o - The actual data.
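
To check the arithmetic, the segments from the breakdown can be stitched back together and counted. The sketch below uses hypothetical names and treats every segment as opaque bytes; it makes no attempt to interpret the dictionary encoding. It parses the same dump notation used in this post, where a two-character token is a hex byte and a one-character token is a literal character.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: rebuilds the example message from its listed segments.
final class MessageLayoutSketch {

    static final String HEADER    = "A 13 01";
    static final String TYPE_ID   = "32";
    static final String TYPE_NAME = "20 00 04 d e m o 01 00";
    static final String STRUCTURE = "1b 0f 03 0e 05 s h o r t 0d 28 "
                                  + "0e 04 b y t e 0d 01 0e 04 t e x t 0d 08";
    static final String DATA_ID   = "32";
    static final String DATA      = "00 0a 33 05 h e l l o";

    // Two-character tokens are hex bytes; one-character tokens are literal characters.
    static byte[] fromDump(String dump) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String token : dump.trim().split("\\s+")) {
            if (token.length() == 2) {
                out.write(Integer.parseInt(token, 16));
            } else if (token.length() == 1) {
                out.write(token.charAt(0));
            } else {
                throw new IllegalArgumentException("unexpected token: " + token);
            }
        }
        return out.toByteArray();
    }

    // Header, dictionary entry, data type identifier, then the data itself.
    static byte[] message() {
        return fromDump(String.join(" ",
                HEADER, TYPE_ID, TYPE_NAME, STRUCTURE, DATA_ID, DATA));
    }

    // Bytes added on top of the bare data.
    static int dictionaryOverhead() {
        return message().length - fromDump(DATA).length;
    }
}
```

`MessageLayoutSketch.message()` is 51 bytes long and `dictionaryOverhead()` returns 42, so in this example the self-description costs 42 bytes on top of the 9 bytes of data.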

The first data type in the message is given identifier 50 because format version 1.3 specifies that the first 49 data types are already known to the recipient.  Those 49 predefined types include all the Argot meta data types and the following base types:

40 - uint16
41 - uint32
42 - uint64
43 - int8
44 - int16
45 - int32
46 - int64
47 - float32
48 - double64
49 - u8boolean
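
A reader could keep the base type identifiers listed above in a simple lookup table. The sketch below is illustrative only, with hypothetical names. It covers just the ten identifiers listed here and does not name the Argot meta data types below 40, which this post does not enumerate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table for the base type identifiers listed in the post.
final class BaseTypeIdsSketch {

    // The example message assigns its first dictionary entry identifier 50.
    static final int FIRST_MESSAGE_DEFINED_ID = 50;

    static final Map<Integer, String> BASE_TYPES = new LinkedHashMap<>();
    static {
        BASE_TYPES.put(40, "uint16");
        BASE_TYPES.put(41, "uint32");
        BASE_TYPES.put(42, "uint64");
        BASE_TYPES.put(43, "int8");
        BASE_TYPES.put(44, "int16");
        BASE_TYPES.put(45, "int32");
        BASE_TYPES.put(46, "int64");
        BASE_TYPES.put(47, "float32");
        BASE_TYPES.put(48, "double64");
        BASE_TYPES.put(49, "u8boolean");
    }

    // Returns the base type name, or a short note for identifiers outside the table.
    static String describe(int id) {
        String name = BASE_TYPES.get(id);
        if (name != null) {
            return name;
        }
        return id >= FIRST_MESSAGE_DEFINED_ID
                ? "defined in the message dictionary"
                : "predefined type not listed here";
    }
}
```

With this table, `describe(44)` returns `int16`, while `describe(50)` reports that the type is defined in the message dictionary, which is where 'demo' lives in the example.
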

The base types allow any other data types to be defined.  The data dictionary in the message could describe a large and complex set of data with every part of the data defined.  Writing the message with Argot was simply:

msg.writeMessage(baos, MixedData.TYPENAME,
           new MixedData( 10, (short) 51, "hello"));

As I work with this new format, I may add additional elements.  One such idea is to include a reference to a known data dictionary provided by a URL.  That way no data dictionary needs to be embedded in the message, yet both sender and recipient will have a reference to the data types used.  This may be beneficial in Internet of Things applications where an additional 50 bytes or so may be considered too large.