Wednesday, January 9, 2008

Enterprise Federated Query is a hoax!

The federated query is often held up as a great application of the Service Oriented Architecture (SOA) paradigm. With SOA, application developers can integrate a client application with numerous services, which lets the client application's users search and browse various data sources. This sounds great in theory; however, an enterprise federated query application is not possible. Before we look into why an enterprise federated query application will not become a reality, we need to understand what a federated query application (fqa) is.

An fqa allows its users to query multiple data sources at the same time and then processes the multiple responses into one standardized result set. To the fqa user, it appears as though he or she is hitting a single data source. An fqa would be a killer app if it:
  • was fast - performance is not an issue
  • was very secure - security is not an issue
  • was reliable
  • always returned great results
Unfortunately, reality is not so kind. Here are the issues:
  • High or unbounded number of data sources - Middleware such as an Enterprise Service Bus (ESB) takes the fqa user's request and replicates it for each data source. It then waits until it gets most, if not all, of the responses, aggregates them by removing duplicate or corrupted results, and forwards the combined response to the fqa. Imagine the time it takes to replicate the requests, wait for the responses, aggregate them into one response and forward it to the fqa. Now imagine five thousand users sending requests at the same time or within a short span of time. The fqa result screen might take a few minutes to render. Caching can be used to alleviate this problem to a certain degree; however, it is not possible with realtime or near-realtime data.
  • Security - Now let us imagine that every data source accessed by the fqa requires user credentials. Every data source then has to authenticate those credentials against a data store, which increases the response time, that is, the amount of time it takes the fqa user to get a response. A trust-based security model would alleviate this issue: the data sources trust that the fqa user has already been authenticated by the fqa.
  • Data source variation - Each data source could be different. It could be a relational database like Oracle, SQL Server, or DB2, or it could be something else, like:
    • Flat file
    • Commercial Off The Shelf (COTS) product which has service interfaces
    • Object Database
    • etc.
    A data source can have its own query language and its own roles. If the data source teams built services to a common service contract, such as one defined in Web Services Description Language (WSDL), this might help; however, the data sources would have to be optimized to work well with service interfaces, and performance could still be an issue.
  • Result set aggregation - After the responses are collected, they need to be processed to remove redundant data. Algorithms need to be developed to aggregate the results and display them.
  • Governance - Each of the data sources needs to have appropriate agreements in place before it can be integrated with the fqa. These agreements can include a Memorandum of Understanding (MOU), a Service Level Agreement (SLA), an Operational Level Agreement (OLA), etc.
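To make the first and fourth issues concrete, here is a minimal Python sketch of the fan-out, wait and aggregate cycle that the middleware performs. It is only an illustration under stated assumptions, not how any particular ESB works: the three source functions, the `federated_query` helper, and the idea that every source returns records sharing an `id` key are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical stand-ins for real data sources; each returns a list of records.
def source_a(query):
    return [{"id": 1, "title": "alpha"}, {"id": 2, "title": "beta"}]

def source_b(query):
    return [{"id": 2, "title": "beta"}, {"id": 3, "title": "gamma"}]

def slow_source(query):
    time.sleep(1.0)  # simulates a source that misses the deadline
    return [{"id": 4, "title": "delta"}]

def federated_query(query, sources, timeout_s):
    """Replicate the query to every source, wait up to timeout_s seconds,
    then merge whatever arrived and drop duplicate records."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = [pool.submit(source, query) for source in sources]
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on sources that are still running

    merged, seen = [], set()
    for future in futures:  # submission order keeps the result stable
        if future not in done or future.exception() is not None:
            continue  # skip sources that were late or that failed
        for record in future.result():
            # Assumption: "id" identifies the same record across sources.
            if record["id"] not in seen:
                seen.add(record["id"])
                merged.append(record)
    return merged, len(not_done)

results, missed = federated_query(
    "any query", [source_a, source_b, slow_source], timeout_s=0.2
)
print(results, "sources that missed the deadline:", missed)
```

Even in this toy version the user waits as long as the slowest source that is allowed to answer, and de-duplication only works because the sources happen to agree on a key. Real data sources rarely do, which is where the custom aggregation algorithms come in.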
As we can see, an enterprise federated query is a large undertaking. In reality it is not possible because of its various dependencies, and it may not be feasible with respect to performance. I built a simple federated query using Yahoo! Pipes, but it only works with three data sources.
