Data Warehouse Access Architecture

Data Warehouse Access Architecture

Background

On March 25, 2009, there was a meeting to discuss architecture for the campus data warehouse.  Dick Warg presented the current plan for providing data from the warehouse both to "canned" applications (the first of which to be delivered within two months) and as raw data.  Problems to be addressed included:

  • cumbersome authorization process
  • security
  • difficulty of building queries
  • logging for audit and capacity planning
  • shared accounts
  • insufficient data classification

Currently, users write SQL queries against tables and views to retrieve data from the data warehouse.  A variety of client tools are used to accomplish this.

Dick's proposal is to create "packages" that return data from the warehouse, based on parameters that are passed to the packages. Each package represents a grouping of data by use and security classification, so authorization is facilitated, at least for low security classifications.  The packages would be written in PL/SQL and accessed over the network via JDBC. Overall, there would be a three-tier architecture, with the database at the bottom, applications at the top, and the packages in the middle.

Authentication of end-users would be implemented in the packages and based on CAS. The details of how to use CAS with JDBC aren't real clear, but Curtis believes it should be possible to what Dick has in mind.  Unfortunately, we can expect this to work only for the "canned" applications that are written here.  Access from off-the-shelf applications like Excel, however, would be unlikely to be able to use the CAS method.

Dick pointed out that this architecture would enable access via packages to external operational systems from the data warehouse.

Associated with this architecture, the data warehouse group would create a library of user-contributed packages that would be available to all authorized users of the data warehouse.

Analysis

The "package" concept makes a lot of sense.  It is, essentially, the concept of a service or resource in a Service Oriented or RESTful Architecture.  Proper selection and design of packages should reduce the need for the construction of special-purpose queries, and they can can address authentication, access control, and logging in a common manner.

In the short term, JDBC as the access protocol can be adopted more quickly by existing users and tools; JDBC is already in use by these people. There will, however, be limitations for authentication from the start for many of these tools, as they will not all be able to be integrated with CAS. 

In the long run, web services will predominate for most off-the-shelf tools; support has been available for Microsoft Office since about 2003.  Also, web services will be more readily integrated with campus authentication standards, particularly SAML, as used by Shibboleth. Use of SAML will also be important, if access by people outside of UC Davis is desired at some time in the future.

JDBC is, however, the natural access protocol for applications written in PL/SQL for the Oracle environment. Probably the most appropriate way to implement web services interfaces would be to invoke the JDBC calls within SOAP wrappers.

Recommendations

  • For the initial application in two months, continue with Dick's proposed architecture. Care should be taken, however, to educate the user community that JDBC will not be the preferred protocol in the medium-term future.
  • Integration with CAS is problematic, as it won't be supported by all tools. A better approach (for JDBC interfaces) would be to integrate the database's native authentication and user provisioning with the new identity management system.  Unfortunately, manual user provisioning will be required until the new identity management system is in production with support for the database.
  • A Roadmap for Kuali Rice at UC Davis shows the KSB service bus going into production for use by non-Kuali applications in Q1, 2010, although this may be changing somewhat. A roadmap for data warehouse services should be developed to move to exposing SOAP interfaces for existing and future "packages" within this time frame. The SOAP interfaces should be designed to accept SAML assertions to identify the end-users of the package. Also, the application invoking the package should be authenticated via TLS.
  • The TLS authentication of applications implies identity management for applications. This should be considered as a general-purpose service of the Middleware group.
  • The notion of the data warehouse as a generic access vehicle to data from multiple external systems is attractive, but may be difficult to scale up for a large number of external systems.  Thought should be put to the data warehouse's "mission," and whether that includes facilitating all data access, or whether the emphasis should be on business intelligence.
  • It appears that many of the requirements addressed here have been communicated by IT professionals who are supporting end-users. It may be appropriate to put some thought toward collecting end-user needs more directly. We might find that there is more commonality of business needs  than would be apparent from looking at technical needs.