Now that I have prepared my laboratory, it is time to start the experiments…

The attentive reader might know that this project is essentially about mapping EMF Ecore models to Google Protocol Buffers (ProtoBuf). So, before I start hacking, I should spend some time thinking about how to map Ecore onto ProtoBuf, although I couldn’t resist and have already played with it a bit. This thinking is necessary because the Ecore and ProtoBuf “languages” differ in some major ways.

The problems and challenges of mapping Ecore onto ProtoBuf are best outlined by comparing the features of the two.

At this point I advise readers who are not familiar with ProtoBuf to read its introductory tutorial to get an idea of the format before reading the rest of this article.

Ecore vs. ProtoBuf

Packages

Ecore and ProtoBuf both use the concept of packages. An Ecore package is uniquely identified by its namespace URI. In contrast, ProtoBuf packages have a name format similar to Java packages, with sub-packages separated by a “.”. Here the first problem arises: a namespace URI cannot be mapped directly onto such a name. In most cases, however, the Ecore package names should also be unique, so using them instead of the namespace URI should be fine for the initial implementation.
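To make the naming idea concrete, here is a minimal sketch of how Ecore package names could be joined into a ProtoBuf package name. The class ProtoPackageNameSketch and its method are hypothetical helpers and not part of EMF or ProtoBuf; the replacement of unsupported characters is my own assumption about what a first implementation might do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of EMF or ProtoBuf.
final class ProtoPackageNameSketch {

    // Builds a dotted ProtoBuf package name from Ecore package names,
    // ordered from the root package down to the package of interest.
    static String toProtoPackage(List<String> ecorePackageNames) {
        if (ecorePackageNames.isEmpty()) {
            throw new IllegalArgumentException("at least one package name is required");
        }
        List<String> parts = new ArrayList<>();
        for (String name : ecorePackageNames) {
            // Assumption: characters outside [A-Za-z0-9_] are replaced with '_'.
            // Leading digits and name clashes are not handled in this sketch.
            parts.add(name.replaceAll("[^A-Za-z0-9_]", "_"));
        }
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        System.out.println(toProtoPackage(Arrays.asList("shop", "orders")));
    }
}
```

Two Ecore packages with the same name but different namespace URIs would collide here, which is exactly the case the initial implementation would knowingly ignore.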

Classes

Ecore supports the usual object-oriented concepts such as classes, interfaces and inheritance. ProtoBuf, on the other hand, only supports messages, which cannot inherit from each other and cannot be used polymorphically. This limitation exists to allow implementations for programming languages that are not object-oriented. It is a huge mismatch that the mapping from Ecore onto ProtoBuf has to work around. I will discuss some concepts for this later.

Primitive Attributes

Ecore classes and ProtoBuf messages can both contain attributes of primitive (aka scalar) types, e.g. integer, string, float, … (Note: ProtoBuf uses the term “field” instead of “attribute”.) So mapping these attributes will be straightforward. The only problem might be that Ecore allows “custom” primitive types through EDataTypes, each represented by a Java class. They can be mapped to strings, because EMF generates methods to convert them to and from strings. But sometimes it might be preferable to serialize them as a type other than string. A good example is java.util.Date, which can be stored more efficiently as a long (Unix timestamp) than as a lengthy string. Such optimizations can be neglected in the first implementation.
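The Date optimization can be sketched as follows. DateMappingSketch is a hypothetical helper, and the ISO-8601 text is only a stand-in for whatever string form the EMF conversion would actually produce. Note that java.util.Date counts milliseconds, not seconds, since the Unix epoch.

```java
import java.util.Date;

// Hypothetical helper illustrating two ways to serialize a java.util.Date.
final class DateMappingSketch {

    // Compact form: milliseconds since the Unix epoch, which fits an int64 field.
    static long toTimestamp(Date date) {
        return date.getTime();
    }

    // Restores the Date from the compact form.
    static Date fromTimestamp(long millis) {
        return new Date(millis);
    }

    // Text form for comparison (ISO-8601, UTC); an assumption, not EMF's format.
    static String toText(Date date) {
        return date.toInstant().toString();
    }

    public static void main(String[] args) {
        Date date = new Date(1234567890000L);
        System.out.println(toTimestamp(date));
        System.out.println(toText(date));
    }
}
```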

Containment References

Ecore classes can have attributes referencing other objects. There are two types of such references: containment and non-containment references. Objects referenced through a containment reference are actual children of the object referencing them. This way object trees can be built. ProtoBuf messages can also have attributes whose type is another message type. These message instances are likewise contained in the parent message, so mapping this is straightforward, too. The problem here is related to the issue with polymorphism mentioned before: an attribute in an Ecore class that references objects of a certain class can also reference instances of its subclasses. This is not possible in ProtoBuf. Possible solutions for this will also be discussed later.

Non-Containment References

The second type of Ecore references are non-containment ones (aka cross-references). They allow the construction of arbitrary object graphs. ProtoBuf doesn’t support this kind of reference, so the mapping from Ecore onto ProtoBuf also has to take care of maintaining those references.

Conclusion

All in all, there are two main challenges when mapping Ecore models onto ProtoBuf. The first one is to find a mapping for class hierarchies and polymorphism. The second is to support and maintain non-containment references.

In the following I will present several possibilities to tackle these challenges. To better explain the concepts I will use the following model (by the way, I drew it using cacoo.com; if you give it a try and sign up through this link, it will give me some extra free diagrams).

The OrderCollection class is the root of the object tree and has a containment reference to zero or more instances of subclasses of the abstract Order class. There are two concrete subclasses of Order: BookOrder and CdOrder.

First I will explain my ideas to solve the inheritance and polymorphism problem. The upcoming mapping concepts for class hierarchies are comparable to Class Table Inheritance and Concrete Table Inheritance used for object-relational mapping.

ProtoBuf message per class

In this concept there is a ProtoBuf message for every class, even the abstract ones. A ProtoBuf definition for the model above might look like this:

message OrderCollection {
    repeated Order orders = 1;
}

message Order {
    enum Type {
        BOOK_ORDER = 1;
        CD_ORDER = 2;
    }

    required int32 id = 1;
    required string name = 2;

    required Type type = 3;

    optional BookOrder book_order = 4;
    optional CdOrder cd_order = 5;
}

message BookOrder {
    required string isbn = 1;
}

message CdOrder {
    required string artist = 1;
}

As you can see, there is a message Order representing the abstract Order class. The message contains the attributes id and name defined in the corresponding class. Additionally it has fields for a BookOrder and a CdOrder message. Depending on the actual object (an instance of either BookOrder or CdOrder), one of these fields is filled and the type field identifies the concrete class.

The disadvantage of this approach is that if there are fields referencing only BookOrder or CdOrder messages, the id and name fields cannot be accessed through them. An advantage might be that, when reusing the ProtoBuf definition and iterating over the contents of orders, one can already access the common fields without having to check the type field.
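The advantage can be illustrated with a small sketch. The plain Java classes below are hypothetical stand-ins that mirror the messages of this mapping; they are not the API that the ProtoBuf compiler would generate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the "message per class" definition.
final class MessagePerClassSketch {

    enum Type { BOOK_ORDER, CD_ORDER }

    static final class BookOrder {
        final String isbn;
        BookOrder(String isbn) { this.isbn = isbn; }
    }

    static final class CdOrder {
        final String artist;
        CdOrder(String artist) { this.artist = artist; }
    }

    static final class Order {
        final int id;
        final String name;
        final Type type;
        final BookOrder bookOrder; // null unless type == BOOK_ORDER
        final CdOrder cdOrder;     // null unless type == CD_ORDER

        Order(int id, String name, Type type, BookOrder bookOrder, CdOrder cdOrder) {
            this.id = id;
            this.name = name;
            this.type = type;
            this.bookOrder = bookOrder;
            this.cdOrder = cdOrder;
        }
    }

    // The common fields live in Order itself, so no type check is needed here.
    static List<String> namesOf(List<Order> orders) {
        List<String> names = new ArrayList<>();
        for (Order order : orders) {
            names.add(order.name);
        }
        return names;
    }

    public static void main(String[] args) {
        Order book = new Order(1, "Dune", Type.BOOK_ORDER, new BookOrder("12345"), null);
        Order cd = new Order(2, "Abbey Road", Type.CD_ORDER, null, new CdOrder("The Beatles"));
        System.out.println(namesOf(Arrays.asList(book, cd)));
    }
}
```

The flip side is also visible: a BookOrder on its own carries only the isbn, so anything holding just a BookOrder has no way to reach id or name.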

ProtoBuf message per concrete class

The alternative is to have one ProtoBuf message per concrete (non-abstract) subclass, which contains the class’s own fields plus all fields inherited from its superclasses. This technique is also described as a union type in the ProtoBuf documentation. Here is the ProtoBuf definition for the model above:

message OrderCollection {
    repeated Order orders = 1;
}

message Order {
    enum Type {
        BOOK_ORDER = 1;
        CD_ORDER = 2;
    }

    required Type type = 1;

    optional BookOrder book_order = 2;
    optional CdOrder cd_order = 3;
}

message BookOrder {
    required int32 id = 1;
    required string name = 2;
    required string isbn = 3;
}

message CdOrder {
    required int32 id = 1;
    required string name = 2;
    required string artist = 3;
}

The main difference is that CdOrder and BookOrder both contain the id and name fields. The Order message is left only as a helper to reference either a CdOrder or a BookOrder. The advantage is that fields of type BookOrder or CdOrder can exist without problems. The disadvantage is that the type always has to be checked before accessing any data.
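A short sketch shows what that type check looks like in practice. Again, the classes are hypothetical stand-ins for the messages of this mapping and not generated ProtoBuf code.

```java
// Hypothetical stand-ins for the "message per concrete class" definition.
final class MessagePerConcreteClassSketch {

    enum Type { BOOK_ORDER, CD_ORDER }

    static final class BookOrder {
        final int id;
        final String name;
        final String isbn;
        BookOrder(int id, String name, String isbn) {
            this.id = id;
            this.name = name;
            this.isbn = isbn;
        }
    }

    static final class CdOrder {
        final int id;
        final String name;
        final String artist;
        CdOrder(int id, String name, String artist) {
            this.id = id;
            this.name = name;
            this.artist = artist;
        }
    }

    // Order is only a wrapper: exactly one of the two fields is expected to be set.
    static final class Order {
        final Type type;
        final BookOrder bookOrder;
        final CdOrder cdOrder;
        Order(Type type, BookOrder bookOrder, CdOrder cdOrder) {
            this.type = type;
            this.bookOrder = bookOrder;
            this.cdOrder = cdOrder;
        }
    }

    // Even a common field such as name requires dispatching on the type first.
    static String nameOf(Order order) {
        switch (order.type) {
            case BOOK_ORDER:
                return order.bookOrder.name;
            case CD_ORDER:
                return order.cdOrder.name;
            default:
                throw new IllegalStateException("unknown type: " + order.type);
        }
    }

    public static void main(String[] args) {
        Order order = new Order(Type.CD_ORDER, null, new CdOrder(2, "Abbey Road", "The Beatles"));
        System.out.println(nameOf(order));
    }
}
```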

Lastly I will propose a solution for the problem concerning non-containment references.

ProtoBuf message references

The basic idea is to assign an id to every object and to introduce, for every class, a special reference message that contains only a field holding the id of the actual (referenced) object. The ids can either be generated arbitrarily, or the id feature of Ecore can be reused.
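For the variant with generated ids, a minimal sketch of the serializing side could look like this. InternalIdAssignerSketch is a hypothetical helper; it hands out ids based on object identity, so the same object always gets the same id no matter how often it is referenced.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical helper that assigns a stable internal id to every object it sees.
final class InternalIdAssignerSketch {

    // Keyed by identity (==) rather than equals(), so two equal but distinct
    // objects still receive different ids.
    private final Map<Object, Integer> ids = new IdentityHashMap<>();
    private int nextId = 1;

    // Returns the id of the object, assigning a fresh one on first use.
    int idOf(Object object) {
        Integer id = ids.get(object);
        if (id == null) {
            id = nextId++;
            ids.put(object, id);
        }
        return id;
    }

    public static void main(String[] args) {
        InternalIdAssignerSketch assigner = new InternalIdAssignerSketch();
        Object order = new Object();
        System.out.println(assigner.idOf(order));
        System.out.println(assigner.idOf(order));
    }
}
```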

As an example, assume the OrderCollection class from above has an additional non-containment reference specialOrders to some of the Order objects stored in orders. The updated ProtoBuf definition from the “ProtoBuf message per concrete class” example might look like this:

message OrderCollection {
    repeated Order orders = 1;
    repeated OrderRef specialOrders = 2;
}

message Order {
    enum Type {
        BOOK_ORDER = 1;
        CD_ORDER = 2;
    }

    required Type type = 1;

    optional BookOrder book_order = 2;
    optional CdOrder cd_order = 3;
}

message OrderRef {
    required int32 _internal_id_ref = 1;
}

message BookOrder {
    required int32 _internal_id = 1;
    required int32 id = 2;
    required string name = 3;
    required string isbn = 4;
}

message CdOrder {
    required int32 _internal_id = 1;
    required int32 id = 2;
    required string name = 3;
    required string artist = 4;
}

Notice the new OrderRef message and the _internal_id field in BookOrder and CdOrder. When deserializing such ProtoBuf messages, some kind of object pool has to be maintained and the *Ref messages have to be replaced with the actual objects.
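The deserializing side could be sketched in two passes: first register every contained object in a pool keyed by its internal id, then replace each reference id with the pooled object. The class and method names below are hypothetical, and OrderObject is just a stand-in for a deserialized model object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of resolving *Ref messages against an object pool.
final class ReferenceResolutionSketch {

    // Stand-in for a deserialized model object carrying its _internal_id.
    static final class OrderObject {
        final int internalId;
        final String name;
        OrderObject(int internalId, String name) {
            this.internalId = internalId;
            this.name = name;
        }
    }

    // Pass 1: register every contained object under its internal id.
    static Map<Integer, OrderObject> buildPool(List<OrderObject> containedOrders) {
        Map<Integer, OrderObject> pool = new HashMap<>();
        for (OrderObject order : containedOrders) {
            if (pool.put(order.internalId, order) != null) {
                throw new IllegalStateException("duplicate internal id: " + order.internalId);
            }
        }
        return pool;
    }

    // Pass 2: replace each referenced id with the actual object from the pool.
    static List<OrderObject> resolve(List<Integer> referencedIds, Map<Integer, OrderObject> pool) {
        List<OrderObject> resolved = new ArrayList<>();
        for (Integer id : referencedIds) {
            OrderObject target = pool.get(id);
            if (target == null) {
                throw new IllegalStateException("dangling reference to id: " + id);
            }
            resolved.add(target);
        }
        return resolved;
    }
}
```

Doing the resolution in a second pass means a reference may point to an object that appears later in the serialized data, at the cost of keeping the pool around until the whole message has been read.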

In one of the next posts I will highlight some difficulties the concepts I have shown might introduce. I will also state which of these concepts are going to be implemented during this GSoC.
