Monday, January 28, 2008

XML Specification and Its Implementations

Since XML is a specification and not a technology, the technologies which work with XML implement it differently. This is most evident in newer technologies like the Semantic Web and Service Oriented Architecture (SOA), which use XML extensively. For example:

Semantic Web
In November 2007, I attended an ontology conference in Columbia, MD. During the conference, Steven Robertshaw stated that there is no standard ontology editor since each editor renders and validates an OWL document differently. OWL is a W3C standard for ontologies. I agree with Robertshaw's statement because the OWL document which I created in Protege did not validate in Altova's SemanticWorks OWL editor. When I fixed the OWL document for the SemanticWorks tool, it no longer worked in Protege.

Web Services
A WSDL generated by Altova's XMLSpy does not validate well in Cape Clear's WSDL editor, SOAEditor. WSDL is a W3C approved XML specification for web service contracts. Sometimes certain editors look for attributes in a different place. I have used Mindreef's SoapScope to test a WSDL which is approved by all WSDL editors.

XML Schemas
While I was doing in-depth analysis of the Global Justice XML Data Model (GJXDM) for the Department of Homeland Security, I created XML Schemas using XMLSpy. This exposed a major issue in the XMLSpy editor: the graphical interface stated that the xsd was valid, however the xsd was invalid in the text view. After some research, I determined that XMLSpy used multiple xsd validators and that each XMLSpy view used a different xsd validator. This is still a problem. Look at this article.

Going back to my last blog entry, XML based modeling technologies have too many quirks, and this is because of how the technologies have implemented the various XML specifications. This type of confusion is NOT needed by a data modeler. Before I start any modeling exercise, I would say, "Where is my Protege?"

Thursday, January 17, 2008

Can't live without it

In the world of pure data modeling, where data modelers develop data models specific to a domain and not to a technology, the Protege editor is a must-have. A pure data model is an object model. This is different from a data model in a database. The data model in a database is normalized, where certain object attributes are grouped together in a table. For example, the Person object looks like this:
Person
  • Name
  • Height
  • Weight
  • Nationality(ies)

Instances of this object might look like:
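Person
  • Name: Enoch Moses
  • Height: 6 ft
  • Weight: 180lbs
  • Nationality: US
Person
  • Name: Osama Bin Laden
  • Height: 6ft 2inches
  • Weight: 160lbs
  • Nationality: Saudi
  • Nationality: Pakistani
Person
  • Name: Barry Bonds
  • Height: 6ft 2inches
  • Weight: 228lbs
  • Nationality: US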
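
The Person object and its three instances can also be written down as a small object model in code. This is only a sketch in Python, not something from my modeling tools: the class name `Person` and the field names are my own choice for illustration, and the heights and weights are kept as plain text exactly as listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Illustrative object model; field names are assumptions for this sketch."""
    name: str
    height: str
    weight: str
    # A person can hold more than one nationality, so this is a list.
    nationalities: List[str] = field(default_factory=list)


people = [
    Person("Enoch Moses", "6 ft", "180lbs", ["US"]),
    Person("Osama Bin Laden", "6ft 2inches", "160lbs", ["Saudi", "Pakistani"]),
    Person("Barry Bonds", "6ft 2inches", "228lbs", ["US"]),
]

# Every attribute of an instance is reachable from the one object.
multi_national = [p.name for p in people if len(p.nationalities) > 1]
```

The point of the sketch is that the whole object, including the repeating Nationality attribute, stays together in one place.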

However, the Person object might be "shredded" across multiple tables in a database. This is done for performance reasons. In a transactional world, systems might query only certain attributes from various object instances since that is all the system is interested in. This approach is quick and logical; however, data modelers should not model their domain objects to a database since they might miss some implicit data relationships.
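
To make the "shredding" concrete, here is a rough sketch using Python's built-in `sqlite3` module with an in-memory database. The table names (`person`, `person_nationality`), the column names and the helper `load_person` are all hypothetical; a real database design would differ. It shows that the repeating Nationality attribute ends up in its own table and that the object has to be reassembled from two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    height TEXT,
    weight TEXT
);
CREATE TABLE person_nationality (
    person_id   INTEGER NOT NULL REFERENCES person(id),
    nationality TEXT NOT NULL
);
""")

rows = [
    ("Enoch Moses", "6 ft", "180lbs", ["US"]),
    ("Osama Bin Laden", "6ft 2inches", "160lbs", ["Saudi", "Pakistani"]),
    ("Barry Bonds", "6ft 2inches", "228lbs", ["US"]),
]
for name, height, weight, nationalities in rows:
    cur = conn.execute(
        "INSERT INTO person (name, height, weight) VALUES (?, ?, ?)",
        (name, height, weight),
    )
    # The multi-valued attribute is split off into a second table.
    conn.executemany(
        "INSERT INTO person_nationality (person_id, nationality) VALUES (?, ?)",
        [(cur.lastrowid, n) for n in nationalities],
    )


def load_person(conn, name):
    """Reassemble one Person from the two tables (hypothetical helper)."""
    row = conn.execute(
        "SELECT id, name, height, weight FROM person WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    person_id, name, height, weight = row
    nationalities = [
        r[0]
        for r in conn.execute(
            "SELECT nationality FROM person_nationality "
            "WHERE person_id = ? ORDER BY nationality",
            (person_id,),
        )
    ]
    return {"name": name, "height": height, "weight": weight,
            "nationalities": nationalities}
```

A system that only needs names and weights never touches the second table, which is quick, but it also never sees that a person can have several nationalities.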

XML editors like Altova's XMLSpy, Stylus Studio and Oxygen provide graphical representations of XML schemas (xsd), which are great for creating object models; however, it is a bit cumbersome to validate the data model with real data. To validate the data model with real data, an XML instance needs to be created with actual data and then the XML instance has to be validated against its xsd. This can be a painstaking process if the xsd's are quite complex.

For example, let's look at the Person object in an XSD:



For the person called Enoch Moses, the xml instance should look like this:

For the person called Barry Bonds, the xml instance should look like this:

And lastly, for the person called Osama Bin Laden, the xml instance should look like this:


Each xml instance might make sense in its structure and data; however, the xml instances are separate documents, and validating numerous instances one at a time could be a painstaking process. It can be done, however it is not an enterprise data modeling solution.
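
To give a feel for checking many instances one document at a time, here is a small Python sketch. It is NOT xsd validation: the standard library has no xsd validator, so the hand-written rules in `check_person` only stand in for a schema. The element names (`Person`, `Name`, `Height`, `Weight`, `Nationality`) and the rules themselves are assumptions made for this example, not the actual xsd from this post.

```python
import xml.etree.ElementTree as ET

# Assumed rules: these elements must appear exactly once under <Person>.
REQUIRED_ONCE = ("Name", "Height", "Weight")


def check_person(xml_text):
    """Return a list of problems found in one instance (empty list = OK)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != "Person":
        problems.append(f"root element is <{root.tag}>, expected <Person>")
    for tag in REQUIRED_ONCE:
        count = len(root.findall(tag))
        if count != 1:
            problems.append(f"expected exactly one <{tag}>, found {count}")
    # Assumed rule: at least one nationality, possibly several.
    if not root.findall("Nationality"):
        problems.append("expected at least one <Nationality>")
    return problems


instances = {
    "Enoch Moses": "<Person><Name>Enoch Moses</Name><Height>6 ft</Height>"
                   "<Weight>180lbs</Weight><Nationality>US</Nationality></Person>",
    "Osama Bin Laden": "<Person><Name>Osama Bin Laden</Name>"
                       "<Height>6ft 2inches</Height><Weight>160lbs</Weight>"
                       "<Nationality>Saudi</Nationality>"
                       "<Nationality>Pakistani</Nationality></Person>",
    # Deliberately broken instance to show what a failure looks like.
    "missing weight": "<Person><Name>Barry Bonds</Name>"
                      "<Height>6ft 2inches</Height>"
                      "<Nationality>US</Nationality></Person>",
}

# Each instance is a separate document and has to be checked separately.
results = {label: check_person(xml) for label, xml in instances.items()}
```

Even in this toy form, every instance is its own document with its own pass or fail, which is the part that gets painful with complex schemas.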

UML is also used to model data; however, UML doesn't provide any way of validating the data model with real data. UML can also be ambiguous.

The image does not have any methods since we are not discussing how to access various attributes in a class.

After creating database schemas, XML schemas and UML diagrams, I found an open source tool called Protege which was developed at Stanford University. Protege is an ontology editor which lets modelers create OWL documents (the W3C approved XML language for ontologies) or ontologies in Protege frames. Since I think XML is not the best medium to develop ontologies or data models, I use Protege frames to develop data models. Protege lets the modeler create an ontology and its views, where the modeler can input real data and see if the data model makes sense. This is how I created the Person class and the three instances in Protege.

Here is the overall Person class:


Here is the Name attribute:


Here is the Weight attribute:

Here is the Height attribute:

Here is the Nationality attribute:


Protege allows the modeler to validate the data model with actual instance data. It also allows the modeler to create views of the data model. Here are the three instances we have been working with:

Instance Enoch Moses:


Instance Barry Bonds:


Instance Osama Bin Laden:


Protege is a great tool since it lets you follow the MVV (not MVC) pattern which is:
  • Model - How the data entities are structured and are related to each other. The modeler models the data according to requirements or how he perceives the data (which is an ontology).
  • View - Since data models can be complex, it allows the modeler to create various views of the model.
  • Validator - This validates the model with real data. Most tools don't allow you to do this, but it is a critical component of any modeling process. I see this as akin to unit testing in programming.
In summary,
  • modeling data via a database can lead to incomplete understanding of the data. It only provides one view (the database view), an incomplete model but it does allow the modeler to validate the model in the database view.
  • modeling data via xml schemas is a great approach; however, it can be extremely process heavy because xml is verbose and there is variation between xsd validators. One validator may accept a complex instance while another validator flags the same instance as an invalid instance.
  • modeling data via uml does not allow the modeler to validate the data or create various views on the data. UML is however quite useful when developers want to get the UML class diagram and generate skeleton programming classes.
  • modeling data via Protege is quick and easy, and it is free. It has a great modeling UI. It allows modelers to create various views and validate the model and views with data which is provided by the modeler.
As a modeler, I find it hard to model without Protege since I believe it is a pure data modeling tool.

Saturday, January 12, 2008

If Enterprise Level Federated Query is a hoax then what?

A couple of days ago I received an email with a comment on my previous blog entry, Enterprise Federated Query is a hoax!. The comment stated that an Enterprise Federated Query is not impossible. I agree that it is not impossible, but rather improbable. In my previous blog entry, what I meant by an enterprise federated query is a user querying "n" data sources, where the responses from the "n" data sources are combined and a unique set of results is passed to the UI, which then renders it for the end user. This is not possible since there are performance issues, governance issues, security issues and other functional and technical barriers. I believe that with current technology and with smart human resources, it is possible to create a functional federated query with twenty to thirty data sources. Twenty or thirty data sources is nowhere close to 500 or 10,000 data sources.
The question to be asked, then, is "what can be achieved with a SOA enterprise?". I believe that rather than a request/response model, a publish/subscribe model is more robust and extensible. If data sources can publish new data or updated data to a topic, then the topic clients can subscribe to the data sources' data. This way clients can subscribe to the data they want. Service Level Agreements (SLA) should be written and agreed upon, covering how the data can be manipulated or stored. Metadata repositories and registries are essential for any SOA enterprise to be successful with the publish/subscribe model.
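
The publish/subscribe idea can be sketched in a few lines of Python. This is an in-process toy, not a real message broker or ESB: the class `TopicBroker`, the topic name `person.updated` and the record layout are all made up for illustration.

```python
from typing import Callable, Dict, List


class TopicBroker:
    """Toy in-memory broker: data sources publish, clients subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        # A client registers interest in one topic only.
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, record: dict) -> int:
        # Push the new or updated record to every subscriber of the topic
        # and report how many clients received it.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(record)
        return len(callbacks)


broker = TopicBroker()
client_a: List[dict] = []
client_b: List[dict] = []
broker.subscribe("person.updated", client_a.append)
broker.subscribe("person.updated", client_b.append)

# A data source publishes once; it does not need to know who is listening.
delivered = broker.publish("person.updated", {"name": "Barry Bonds", "weight": "228lbs"})
ignored = broker.publish("person.deleted", {"name": "Barry Bonds"})
```

Unlike the federated query, the client never waits on "n" data sources at request time; the data arrives as the sources publish it.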

Wednesday, January 9, 2008

Enterprise Federated Query is a hoax!

Everyone uses the federated query example as a great application of the Service Oriented Architecture (SOA) paradigm. Via the SOA paradigm, application developers can integrate their client application with numerous services, and this will allow the client application's users to search and browse various data sources. This sounds great in theory; however, an enterprise federated query application is not possible. Before we look into why an enterprise federated query application will not be a reality, we need to understand what a federated query application (fqa) is.

An fqa will allow its users to query multiple data sources at the same time, and then the fqa will process the multiple responses into one standardized result set. To the fqa user, it would appear as if he or she is hitting one data source. An fqa would be a killer app if it:
  • was fast - performance is not an issue
  • was very secure - security is not an issue
  • was reliable
  • always returned great results
Unfortunately the reality is not nice. Here are the issues:
  • High or unbounded number of data sources - Middleware like an Enterprise Service Bus (ESB) takes the fqa user's request and replicates the request for each data source. After the middleware does that, it waits until it gets most if not all of the responses. It then aggregates the responses by removing duplicate or corrupted results and forwards the response to the fqa. Imagine the time it is going to take to replicate the requests, wait for the responses, aggregate the responses into one response and then forward it to the fqa. Now imagine that there are five thousand users sending requests at the same time or in a short amount of time. The fqa result screen might take a few minutes to render the results. Caching can be used to alleviate this problem to a certain degree; however, it is not possible with realtime or near realtime data.
  • Security - Now let us imagine that every data source which is accessed by the fqa requires user credentials. Every data source then has to authenticate the user credentials against a data store, which could increase the response time. The response time is the amount of time it takes the fqa user to get the response. A trust based security model would alleviate this issue: the data sources trust that the fqa user has been authenticated by the fqa.
  • Data Source variation - Each data source could be different. It could be a relational database like Oracle, SQL Server, or DB2; or it could be something else like:
    • Flat file
    • Commercial Off-The-Shelf (COTS) product which has service interfaces
    • Object Database
    • etc.
    The data source can have a specific query language and specific roles. It might help if the data source team built services to a common service contract like a Web Services Description Language (WSDL) document; however, data sources have to be optimized to work well with service interfaces. Performance could be an issue.
  • Result set Aggregation - After the responses are collected, the responses need to be processed to remove redundant data. Computer algorithms need to be developed to aggregate the results and display them.
  • Governance - Each of the data sources needs to have appropriate agreements in place before it can be integrated with the fqa. The agreements can include a Memorandum of Understanding (MOU), Service Level Agreement (SLA), Operational Level Agreement (OLA), etc.
As we can see, an enterprise federated query is a large undertaking, and in reality it is not possible because of its various dependencies; it may also not be feasible with respect to performance. I built a simple federated query using Yahoo! Pipes, but this only works with three data sources.
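
The replicate, wait and aggregate steps from the list above can be sketched in Python. This is not how an ESB or Yahoo! Pipes works internally; it is a toy with three hypothetical in-memory "data sources" (`source_a`, `source_b`, `source_c`), and the function `federated_query` is my own name for the illustration. The request is fanned out to every source in parallel, failed sources are noted, and duplicate results are removed.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_query(sources, term):
    """Send `term` to every source, then merge and de-duplicate the results."""
    merged, seen, failed = [], set(), []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Replicate the request: one call per data source, run in parallel.
        futures = [(name, pool.submit(search, term))
                   for name, search in sources.items()]
        # Wait for each response; overall time is bounded by the slowest source.
        for name, future in futures:
            try:
                records = future.result()
            except Exception:
                failed.append(name)  # a source that errors out is skipped
                continue
            # Aggregate: drop results already returned by another source.
            for record in records:
                key = record.lower()
                if key not in seen:
                    seen.add(key)
                    merged.append(record)
    return merged, failed


def make_source(names):
    # Hypothetical data source: a case-insensitive substring search over a list.
    return lambda term: [n for n in names if term.lower() in n.lower()]


def unavailable(term):
    raise RuntimeError("source unavailable")


sources = {
    "source_a": make_source(["Enoch Moses", "Barry Bonds"]),
    "source_b": make_source(["Barry Bonds"]),
    "source_c": unavailable,
}

results, failed_sources = federated_query(sources, "bonds")
```

With three sources this is trivial. With hundreds or thousands, the number of worker threads, the wait on the slowest source and the size of the de-duplication set all grow with "n", which is the performance problem described above.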