Thursday 9 April 2009

Argot Versioning - Part 1 - Meta Data Type Versioning

Data type meta data versioning in communication is one of the more (if not most) complex areas of distributed computing. It's also an area that I've managed to avoid for quite some time. That is, until someone sent me an email asking about versioning in Argot. This kicked off a thought process which has resulted in around six months of investigation, a couple of failed attempts, and the introduction of a long overdue feature in Argot. The final result solves the problem of versioning in a new and unique way.

In the next series of posts I'm going to discuss some of the issues of versioning and how versioning has been implemented in Argot. Be warned that the posts are probably going to be long and complex. Along the way, I'll demonstrate the new Argot meta dictionary which will become the basis for versioning meta (type) data in Argot.

Background

In Argot, before now, I've taken the view that a client and server have a single version of each data type. Each data type has a name, a structure definition and a unique identifier. When an Argot connection is established the client and server compare the names and data structures for each type. If the data structure of any type is different between client and server then the system is unable to communicate that data type.


The image above demonstrates how a client and server create a shared table which defines the set of data types that they can use to communicate. The name and definition of each data type must be the same on both client and server for an entry to be added to the shared table. The client and server each assign their own internal identifier to a type in their own data type tables, and these identifiers may differ; each data type in the shared table has a unique identifier that is agreed between client and server.
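The negotiation described above can be sketched roughly as follows. This is a minimal Python illustration, not Argot's actual implementation: the function name `build_shared_table`, the dict-based type tables, the "person" type and the string form of the definitions are all assumptions made for the example.

```python
def build_shared_table(client_types, server_types):
    """Return the types both sides can use to communicate.

    Each argument maps a type name to (local_id, definition). These shapes
    are hypothetical; Argot's real type tables are not shown in this post.
    """
    shared = {}
    next_shared_id = 0
    for name in sorted(client_types):
        if name not in server_types:
            continue
        client_id, client_def = client_types[name]
        server_id, server_def = server_types[name]
        # Name and definition must both match for the type to be usable.
        if client_def != server_def:
            continue
        shared[name] = {
            "shared_id": next_shared_id,  # agreed between both sides
            "client_id": client_id,       # local to the client
            "server_id": server_id,       # local to the server
        }
        next_shared_id += 1
    return shared


client = {
    "address": (7, "sequence(street, suburb, state)"),
    "person": (8, "sequence(name, address)"),
}
server = {
    "address": (21, "sequence(street, suburb, state)"),
    "person": (22, "sequence(name, age, address)"),
}

shared = build_shared_table(client, server)
print(shared)
```

Here "address" makes it into the shared table, while "person" is dropped because its definition differs between the two sides.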

For my purposes this method of having a single version of each data type has worked fine. In my small environments I can update the client and server at the same time. However, versioning is a requirement for many systems. You can't always upgrade all clients after a server has been updated. This means a single server must be able to support the multiple versions in use by its clients. In a similar way, you can't always upgrade all servers, requiring a client to support multiple versions of data structures. In situations where neither all clients nor all servers can be upgraded, both must have multiple versions of data types. Therefore Argot needs to be modified to allow a data type name to have multiple definitions or versions.

The Issue of Names

The development of Argot to some extent has always been based on a language dictionary. The idea is that each and every data type definition can be taken individually (much like a single word can be found in a dictionary) and used in any data dictionary (schema). The language dictionary is once again the premise for how versioning should be handled by Argot. A standard language dictionary defines various aspects of a word's definition. Each word will have its pronunciation, phonetic spelling, various ways the word is used and possibly its etymology.

Compare this to an example using Argot's (version 1.2) definition:

address:
    meta.sequence([
        meta.reference( #u8ascii, "street"),
        meta.reference( #u8ascii, "suburb"),
        meta.reference( #u8ascii, "state" )
    ]);

Argot provides a very basic format to create definitions. It has two parts: the name ("address" in the example above) and the definition. Internally this is also assigned a unique identifier. This simplicity is a double-edged sword. The consequence is that every definition must have a name. However, there are many cases where a name is not required for Argot definitions.

The other important aspect of Argot is that each statement or definition must stand alone. This is required so that a client and server can compare each part of a type's definition. This means that a single concept may be defined using multiple statements. This is the case for abstract data types. For example:


meta.definition: meta.abstract();
meta.definition#basic: meta.map( #meta.definition, #meta.basic );
meta.definition#map: meta.map( #meta.definition, #meta.map );

In this case meta.definition is defined as an abstract data type. The meta.basic and meta.map are then mapped to the abstract type using separate definitions. This requires that Argot define fake names like "meta.definition#basic" so that each definition can be found in the data type tables.
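The need for these extra names can be seen in a small Python sketch of a name-keyed type table. The `define` and `mapped_types` helpers and the tuple encoding of definitions are assumptions for illustration only, not Argot's API.

```python
type_table = {}


def define(name, definition):
    """Add a stand-alone definition; every entry must have a unique name."""
    if name in type_table:
        raise KeyError(f"{name} is already defined")
    type_table[name] = definition


define("meta.definition", ("abstract",))
# A second define("meta.definition", ...) would collide with the entry above,
# so each mapping needs its own made-up name.
define("meta.definition#basic", ("map", "meta.definition", "meta.basic"))
define("meta.definition#map", ("map", "meta.definition", "meta.map"))


def mapped_types(abstract_name):
    """List the concrete types mapped onto an abstract type."""
    return sorted(
        definition[2]
        for definition in type_table.values()
        if definition[0] == "map" and definition[1] == abstract_name
    )


print(mapped_types("meta.definition"))
```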

Introducing versioning offers an opportunity to modify the way data types are defined to create a model which is closer to a language dictionary.

An interesting aspect of basing versioning on a language dictionary is that each version of a data type may be completely different. A single dictionary might define an address as:


address
    version:"1.0" :
        sequence( [
            reference( #u8ascii, "street" )
            reference( #u8ascii, "suburb state" )
        ] );
    version:"2.0" :
        sequence( [
            sequence( [
                reference( #u8utf8, "street number" )
                reference( #u8utf8, "street name" )
                reference( #u8utf8, "street type" )
            ])
            reference( #street, "street" )
            reference( #u8utf8, "suburb" )
            reference( #u8utf8, "city" )
            reference( #u8utf8, "state" )
        ] );

This is considerably different from how many other object serialization systems work. For instance, in Protocol Buffers a numbered tag is assigned to each field in a definition. New versions consist of adding new optional elements to the definition. In effect this means a definition cannot change radically between versions. It also means that new versions must become a hybrid data structure of both old and new, moving the strict rules about the data structure into the program. The advantage of the language-based model is that versions can be completely different and encode strict definitions at the protocol or file format level.
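To make the contrast concrete, here is a small Python sketch of the language-dictionary model. It is only an illustration: the `ADDRESS_VERSIONS` table, the `decode` helper and the sample values are hypothetical, and the version 2.0 layout is flattened (the nested sequence is omitted) to keep the example short.

```python
# Hypothetical in-memory form of the "address" entry: one name, two versions,
# each with its own unrelated field list (flattened for brevity).
ADDRESS_VERSIONS = {
    "1.0": ["street", "suburb state"],
    "2.0": ["street number", "street name", "street type",
            "suburb", "city", "state"],
}


def decode(version, values):
    """Pair values with the field names of exactly one version.

    The structure is enforced here, at the format level: data that does not
    match the chosen version is rejected rather than patched up by the caller.
    """
    fields = ADDRESS_VERSIONS[version]
    if len(values) != len(fields):
        raise ValueError(f"address {version} expects {len(fields)} fields")
    return dict(zip(fields, values))


old = decode("1.0", ["1 Example St", "Carlton VIC"])
new = decode("2.0", ["1", "Example", "St", "Carlton", "Melbourne", "VIC"])
print(old)
print(new)
```

Neither version needs optional fields to accommodate the other, so the program never sees a half-old, half-new structure.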

Versions in Structure Definitions

One of the first problems to be solved in introducing versioning is how to reference a data type with multiple versions in another definition. In the following example the structure "test" refers to "foo" and "bar" version 1.0.


test:
    version:"1.0":
        sequence( [
            reference(#foo, version:1.0)
            reference(#bar, version:1.0)
        ] );

However, there's a problem. When a data structure definition contains references to specific versions of other data types, it creates a brittle type system that is difficult to maintain. In the "test" definition there's a strict versioning relationship between the structure and each sequence sub-element. If foo were updated to "foo_1.1", the "test" type would also need to be updated. This causes a versioning ripple through every element that uses foo: every element that changes will in turn force changes in the elements that use it.
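The size of that ripple is simply the set of types that reach the changed type through pinned references. A rough Python sketch, assuming a hypothetical `references` map and two invented types ("report" and "archive") that build on "test":

```python
from collections import deque


def ripple(references, changed):
    """Return every type that must be re-versioned when `changed` changes.

    `references` maps a type name to the names it refers to with a pinned
    version. The map and its contents are illustrative only.
    """
    # Invert the graph: for each type, which types refer to it?
    dependents = {}
    for name, refs in references.items():
        for ref in refs:
            dependents.setdefault(ref, set()).add(name)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, ()):
            if user not in affected:
                affected.add(user)
                queue.append(user)
    return affected


references = {
    "foo": [],
    "bar": [],
    "test": ["foo", "bar"],
    "report": ["test"],     # hypothetical type that uses "test"
    "archive": ["report"],  # hypothetical type that uses "report"
}

print(sorted(ripple(references, "foo")))  # ['archive', 'report', 'test']
```

A one-line change to "foo" forces new versions of three other types, even though none of their own fields changed.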

In the following example we try defining "test" using major and minor versions. Each reference can then specify the minimum version that is supported. The problem with this is that every definition requires too much data. The developer and schema designer will get lost in meta data versioning information.


"test" vMajor:1 vMinor:0:
(reference #foo (minVersion major:1 minor:0) (maxVersion major:1 minor:99) );

Looking back at the dictionary model (i.e. a real-world dictionary book) that Argot was built upon, it is clear that a word's definition does not refer to a specific version of each word used to define it. This suggests returning to the original concept of a data structure definition without version information:


"test" version:"1.0":
    sequence( [
        reference( #foo)
        reference( #bar)
    ] );

In many cases the actual version of a referenced field is probably not important when defining the data type. As long as the server and client both agree on which version of a particular type to use, the data can be in any format.

This model has the advantage that any change to "foo" does not cause a ripple through other data types. It also removes any barriers to what version of "foo" a client and server should use to communicate. However, it places an additional burden on the programmer to ensure that all versions of "foo" that can be understood by the software are interchangeable through all parts of the application. Overall, the advantages of not specifying a version outweigh those of the other options, so this approach will be adopted for Argot versioning.
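One way to picture the agreement step is sketched below in Python. The rule used here (pick the highest version both sides support) and the `agree_on_versions` helper are assumptions for illustration; this post does not describe how Argot itself chooses.

```python
def agree_on_versions(client_versions, server_versions):
    """Pick one version per type name that both sides understand.

    Versions are (major, minor) tuples so that max() orders them sensibly.
    Choosing the highest common version is an assumed policy.
    """
    agreed = {}
    for name in client_versions.keys() & server_versions.keys():
        common = set(client_versions[name]) & set(server_versions[name])
        if common:
            agreed[name] = max(common)
    return agreed


client = {"foo": [(1, 0), (1, 1)], "bar": [(1, 0)], "test": [(1, 0)]}
server = {"foo": [(1, 0), (1, 1), (2, 0)], "bar": [(2, 0)], "test": [(1, 0)]}

agreed = agree_on_versions(client, server)
print(agreed)
```

Because "test" refers to "foo" only by name, the two sides can settle on foo 1.1 without "test" needing a new version. "bar" has no version in common, so it is left out.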

Using this model requires the ability to identify both the name and the definition as two separate references in the Argot type system. When data structures are being defined, any reference uses the name identifier. When a data structure is being used in communication it uses a specific definition. This means that the name must have its own identifier and form part of the TypeLibrary.
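A toy Python model of this split is shown below. `ToyTypeLibrary` and its methods are invented for this example and are not Argot's TypeLibrary API; the point is only that a name and each of its definitions occupy separate slots with separate identifiers.

```python
class ToyTypeLibrary:
    """Toy library where names and definitions get separate identifiers."""

    def __init__(self):
        self._entries = []         # list index doubles as the identifier
        self._name_ids = {}        # name -> identifier
        self._definition_ids = {}  # (name, version) -> identifier

    def register_name(self, name):
        """Return the identifier for a name, creating it if needed."""
        if name not in self._name_ids:
            self._entries.append(("name", name))
            self._name_ids[name] = len(self._entries) - 1
        return self._name_ids[name]

    def register_definition(self, name, version, structure):
        """Add one version of a named type and return its identifier."""
        name_id = self.register_name(name)
        key = (name, version)
        if key in self._definition_ids:
            raise ValueError(f"{name} {version} is already defined")
        self._entries.append(("definition", name_id, version, structure))
        self._definition_ids[key] = len(self._entries) - 1
        return self._definition_ids[key]

    def name_id(self, name):
        return self._name_ids[name]


library = ToyTypeLibrary()
foo_name = library.register_name("foo")
foo_v1 = library.register_definition("foo", "1.0", ["u8ascii"])
foo_v2 = library.register_definition("foo", "2.0", ["u8utf8"])

# "test" refers to foo by its name identifier, never by a definition id.
test_v1 = library.register_definition("test", "1.0", [("reference", foo_name)])
print(foo_name, foo_v1, foo_v2, test_v1)
```

Adding foo 2.0 leaves the structure stored for "test" untouched, since it only holds the name identifier.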

Version Information Data

There are a few options as to how to encode the version information for a specified data structure. As far as Argot is concerned each version of a data type is a completely different type. From this point of view having a single integer value to represent the version is easiest. However, from a user point of view the version often consists of major, minor and patch levels.

Version options:

  • Single Integer - Has the advantage that it aligns with the design of Argot. Each version is completely independent of the other. Ordering of versions can be easily maintained.

  • Major, Minor, Patch Integers - Allows each data type to have multiple levels. Ordering of later versions can be easily maintained. The other advantage of this is it allows a mechanism for designers to make "compatible" changes to the protocol as part of minor revisions. A major revision will often signal a departure from previous logical designs and the previous system Object will no longer be a viable representation. A minor revision will signal the addition of a field or other minor change in which case the same system Object can be used. This method also allows a developer to keep track of the version of software in which a definition was introduced.

  • String - A generic string has the advantage that any versioning system that the developer produces can be handled. No ordering can be guaranteed unless an additional ordering function is supplied by the developer and bound to each data type.

  • Abstract Type - An abstract type offers the most flexibility as the user is able to define the version using any method and mapping to the abstract type. This expands the meta dictionary and makes versions more difficult to compare.

To reduce complexity in the initial development a simple string was used. However, as the release version is developed this will migrate to a MAJOR and MINOR mechanism. The major and minor values are small unsigned integers. The use of major and minor releases becomes important to differentiate between newer and older type versions. Later releases may eventually allow multiple tags to be assigned to specific versions, providing a method of performing version control across a group of data types.
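The ordering difference between the string and major/minor options is easy to show in Python. The `parse_version` and `same_major` helpers below are hypothetical, and the "same major means the same system Object can be used" rule is a simplification of the convention described in the list above.

```python
def parse_version(text):
    """Turn a "MAJOR.MINOR" string into a tuple of integers."""
    major, minor = text.split(".")
    return (int(major), int(minor))


def same_major(a, b):
    """Assumed compatibility rule: versions sharing a major number can be
    represented by the same system Object."""
    return parse_version(a)[0] == parse_version(b)[0]


versions = ["1.9", "1.10", "2.0", "1.2"]

# Plain strings sort character by character, which misplaces "1.10".
print(sorted(versions))                     # ['1.10', '1.2', '1.9', '2.0']

# Integer pairs sort numerically without any extra ordering function.
print(sorted(versions, key=parse_version))  # ['1.2', '1.9', '1.10', '2.0']

print(same_major("1.2", "1.10"))            # True
print(same_major("1.10", "2.0"))            # False
```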

This sets a couple of the core concepts of versioning in Argot. In the next post I'll introduce the key concept that makes versioning in Argot a possibility and demonstrate the Argot meta dictionary.
