Thursday, January 17, 2008

Can't live without it

In the world of pure data modeling, where data modelers develop data models specific to a domain and not to a technology, the Protege editor is a must-have. A pure data model is an object model. This is different from a data model in a database, which is normalized so that certain object attributes are grouped together in tables. For example, the Person object looks like this:
Person
Name
Height
Weight
Nationality(ies)

Instances of this object might look like:
Person
Enoch Moses
6 ft
180lbs
US
Person
Osama Bin Laden
6ft 2inches
160lbs
Saudi
Pakistani
Person
Barry Bonds
6ft 2inches
228lbs
US
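
As a rough illustration, the Person object and its three instances can be written down as a plain object model in Python. This is only a sketch: the class and field names are chosen for illustration, and heights and weights are kept as the free-text strings used above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Domain object; field names are illustrative, not from any tool."""
    name: str
    height: str
    weight: str
    # A person may hold more than one nationality, so this is a list.
    nationalities: List[str] = field(default_factory=list)


people = [
    Person("Enoch Moses", "6 ft", "180lbs", ["US"]),
    Person("Osama Bin Laden", "6ft 2inches", "160lbs", ["Saudi", "Pakistani"]),
    Person("Barry Bonds", "6ft 2inches", "228lbs", ["US"]),
]

for person in people:
    print(person.name, person.nationalities)
```

The multi-valued Nationality attribute stays inside the object here, which is the part a normalized database design would split out.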

However, the Person object might be "shredded" across multiple tables in a database. This is done for performance reasons. In a transactional world, a system might query only certain attributes from various object instances, since those are all it is interested in. This approach is quick and logical; however, data modelers should not model their domain objects to a database, since they might miss some implicit data relationships.
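
To make the "shredding" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (person, person_nationality, person_id) are assumptions made for this example, not a schema from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical normalized layout: the multi-valued Nationality attribute
# is moved to its own table and linked back by person_id.
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    height    TEXT,
    weight    TEXT
);
CREATE TABLE person_nationality (
    person_id   INTEGER NOT NULL REFERENCES person(person_id),
    nationality TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Enoch Moses", "6 ft", "180lbs"),
    (2, "Osama Bin Laden", "6ft 2inches", "160lbs"),
    (3, "Barry Bonds", "6ft 2inches", "228lbs"),
])
conn.executemany("INSERT INTO person_nationality VALUES (?, ?)", [
    (1, "US"), (2, "Saudi"), (2, "Pakistani"), (3, "US"),
])

# A transactional query reads only the attributes it cares about.
names = [row[0] for row in
         conn.execute("SELECT name FROM person ORDER BY person_id")]


def load_person(person_id):
    """Reassemble the whole Person object from both tables."""
    name, height, weight = conn.execute(
        "SELECT name, height, weight FROM person WHERE person_id = ?",
        (person_id,),
    ).fetchone()
    nationalities = [row[0] for row in conn.execute(
        "SELECT nationality FROM person_nationality "
        "WHERE person_id = ? ORDER BY rowid",
        (person_id,),
    )]
    return {"name": name, "height": height, "weight": weight,
            "nationalities": nationalities}


print(names)
print(load_person(2))
```

Reading one column is a single-table query, but seeing the Person as a whole object takes two queries (or a join), which is why the database layout is a poor place to reason about the domain object itself.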

XML editors like Altova's XMLSpy, Stylus Studio and Oxygen provide graphical representations of XML schemas (XSD), which are great for creating object models; however, it is a bit cumbersome to validate the data model with real data. To do so, an XML instance needs to be created with actual data, and then the XML instance has to be validated against its XSD. This can be a painstaking process if the XSDs are quite complex.

For example, let's look at the Person object in an XSD:



For the person called Enoch Moses, the XML instance should look like this:

For the person called Barry Bonds, the XML instance should look like this:

And lastly, for the person called Osama Bin Laden, the XML instance should look like this:


Each XML instance might make sense in its structure and data; however, the instances are separate documents, and validating numerous instances one by one can be a painstaking process. It can be done, but it is not an enterprise data modeling solution.
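
The sketch below gives a feel for the one-document-at-a-time workflow. It is not the original schema or instances from this post: the element names (Person, Name, Height, Weight, Nationality) and the occurrence rules are assumptions, and the hand-written check only stands in for what a real XSD validator would do, since Python's standard library does not validate against XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML instances, one document per person.
INSTANCES = [
    "<Person><Name>Enoch Moses</Name><Height>6 ft</Height>"
    "<Weight>180lbs</Weight><Nationality>US</Nationality></Person>",
    "<Person><Name>Barry Bonds</Name><Height>6ft 2inches</Height>"
    "<Weight>228lbs</Weight><Nationality>US</Nationality></Person>",
    "<Person><Name>Osama Bin Laden</Name><Height>6ft 2inches</Height>"
    "<Weight>160lbs</Weight><Nationality>Saudi</Nationality>"
    "<Nationality>Pakistani</Nationality></Person>",
]

# (element name, minimum occurrences, maximum occurrences or None).
# These rules are assumed for the example.
RULES = [
    ("Name", 1, 1),
    ("Height", 1, 1),
    ("Weight", 1, 1),
    ("Nationality", 1, None),
]


def check_person(xml_text):
    """Return a list of structural problems found in one instance."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "Person":
        problems.append(f"unexpected root element {root.tag}")
    for tag, lowest, highest in RULES:
        count = len(root.findall(tag))
        if count < lowest or (highest is not None and count > highest):
            problems.append(f"{tag}: found {count} occurrence(s)")
    return problems


# Every instance is a separate document and is checked separately.
for text in INSTANCES:
    print(check_person(text) or "ok")
```

With three instances this is manageable; with hundreds of instance documents and a complex schema, the same loop becomes the painstaking part.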

UML is also used to model data; however, UML doesn't provide any way of validating the data model with real data. UML can also be ambiguous.

The image does not have any methods since we are not discussing how to access various attributes in a class.

After creating database schemas, XML schemas and UML diagrams, I found an open source tool called Protege, which was developed at Stanford University. Protege is an ontology editor that lets modelers create OWL documents (OWL is the W3C-approved XML-based language for ontologies) or ontologies in Protege frames. Since I think XML is not the best medium for developing ontologies or data models, I use Protege frames to develop data models. Protege lets the modeler create an ontology and its views, where the modeler can input real data and see if the data model makes sense. This is how I created the Person class and the three instances in Protege.

Here is the overall Person class:


Here is the Name attribute:


Here is the Weight attribute:

Here is the Height attribute:

Here is the Nationality attribute:


Protege allows the modeler to validate the data model with actual instance data. It also allows the modeler to create views of the data model. Here are the three instances we have been working with:

Instance Enoch Moses:


Instance Barry Bonds:


Instance Osama Bin Laden:


Protege is a great tool since it lets you follow the MVV (not MVC) pattern which is:
  • Model - How the data entities are structured and are related to each other. The modeler models the data according to requirements or how he perceives the data (which is an ontology).
  • View - Since data models can be complex, Protege allows the modeler to create various views of the model.
  • Validator - This validates the model with real data. Most tools don't allow you to do this, but it is a critical component of any modeling process. I see this as akin to unit testing in programming.
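
The three roles above can be sketched in a few lines of Python. This is not Protege's API or file format; the slot definitions, the cardinality limits and the helper names (MODEL, view, validate) are all assumptions made for illustration.

```python
# Model: each slot has a value type and a cardinality (assumed limits).
MODEL = {
    "Name": {"type": str, "min": 1, "max": 1},
    "Height": {"type": str, "min": 1, "max": 1},
    "Weight": {"type": str, "min": 1, "max": 1},
    "Nationality": {"type": str, "min": 1, "max": None},  # multi-valued
}


def view(instance, slots):
    """View: show only the chosen slots of an instance."""
    return {slot: instance.get(slot, []) for slot in slots}


def validate(instance, model=MODEL):
    """Validator: check real instance data against the model."""
    errors = []
    for slot, facet in model.items():
        values = instance.get(slot, [])
        if len(values) < facet["min"]:
            errors.append(f"{slot}: needs at least {facet['min']} value(s)")
        if facet["max"] is not None and len(values) > facet["max"]:
            errors.append(f"{slot}: allows at most {facet['max']} value(s)")
        for value in values:
            if not isinstance(value, facet["type"]):
                errors.append(f"{slot}: {value!r} has the wrong type")
    for slot in instance:
        if slot not in model:
            errors.append(f"{slot}: not defined in the model")
    return errors


bin_laden = {
    "Name": ["Osama Bin Laden"],
    "Height": ["6ft 2inches"],
    "Weight": ["160lbs"],
    "Nationality": ["Saudi", "Pakistani"],
}
broken = {"Name": ["A", "B"], "Height": ["6 ft"], "Weight": [180]}

print(view(bin_laden, ["Name", "Nationality"]))
print(validate(bin_laden))
print(validate(broken))
```

Feeding real instances through the validator plays the same role as a unit test: a slot that cannot hold the data (for example, a single-valued Nationality) shows up as soon as the second nationality is entered.
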
In summary,
  • modeling data via a database can lead to an incomplete understanding of the data. It provides only one view (the database view) and an incomplete model, but it does allow the modeler to validate the model in the database view.
  • modeling data via XML schemas is a great way to model; however, it can be extremely process-heavy because XML is verbose and XSD validators vary. One validator may accept a complex instance while another flags the same instance as invalid.
  • modeling data via UML does not allow the modeler to validate the data or create various views of the data. UML is, however, quite useful when developers want to take the UML class diagram and generate skeleton programming classes.
  • modeling data via Protege is quick and easy, and it is free. It has a great modeling UI. It allows modelers to create various views and validate the model and views with data provided by the modeler.
As a modeler, I find it hard to model without Protege since I believe it is a pure data modeling tool.

1 comment:

mjf said...

What is the issue with XSD validators? Is there ambiguity in the spec or just bad implementations of validators? Would be interested to have you write about that sometime.