Thursday, October 27, 2011

A tale of Mondrian and real time analytics

Back in April 2008, Julian Hyde, founder of Mondrian, extends a basic API to control the contents of member caches within the Mondrian OLAP engine. Its name, quite unsurprisingly, is CacheControl.There is a basic implementation available, yet that feature remains mostly unknown to most of the community.

Jump to December 2010. We are brainstorming on Mondrian 3.3 (today this feels like ages ago). We come up with those crazy ideas about enterprise integration, real time analytics and cool APIs / SPIs. One of these crazy ideas is: Wouldn't it be sweet to update Mondrian's member cache and do real time OLAP?

That's when we remember the old CacheControl API.

 A year and a half had passed since the release of the API. After some PR effort, a critical mass of integrators started using it. This gave us incentive to push for that feature and take it over the final mile. We spent some time testing, fixing what we've found

In the process, Will Gorman, longtime Mondrian contributor and VP Engineering at Pentaho, writes an example Pentaho XAction script to demonstrate how someone can use the API, in Javascript, run by Rhino, within a JVM.

It starts by a Mondrian schema.
<resources>
    <catalog>
      <url>
        <location>solution:steel-wheels/analysis/steelwheels.mondrian.xml</location> 
        <mime-type>text/xml</mime-type>
      </url>
    </catalog>
  </resources>
 To which we add two simple steps, the first of which establishes an OLAP connection.
<action-definition>
      <component-name>MDXLookupRule</component-name>
      <action-type>OLAP Connection</action-type>
      <action-resources>
        <catalog type="resource"/>
      </action-resources>
      <action-outputs>
        <prepared_component type="mdx-connection" mapping="shared_olap_connection"/>
      </action-outputs>
      <component-definition>
        <jndi><![CDATA[SampleData]]></jndi> 
        <location><![CDATA[mondrian]]></location>
      </component-definition>
    </action-definition>
 The second contains the juicy part. This is where the magic happens. We use the Javascript step to update Mondrian's caches.
<action-definition>
      <component-name>JavascriptRule</component-name>
      <action-type>JavaScript to clear portions of the cache</action-type>
      <action-inputs>
        <shared_olap_connection type="mdx-connection"/>
      </action-inputs>
      <action-outputs>
        <rule-result type="string"/>
      </action-outputs>
      <component-definition>
        <script>
            <![CDATA[
                function cacheControlImpl() {
                    // get access to the necessary cache control API objects
                    var conn = shared_olap_connection.shareConnection().getConnection();
                    var salesCube = conn.getSchema().lookupCube("SteelWheelsSales", true);
                    var schemaReader = salesCube.getSchemaReader(null).withLocus();
                    var parent = schemaReader.getMemberByUniqueName(Packages.mondrian.olap.Id.Segment.toList("Product", "All Products", "Classic Cars", "Autoart Studio Design"), true);
                    var hierarchy = parent.getHierarchy();
                    var member = hierarchy.createMember(
                      parent,
                      parent.getLevel().getChildLevel(),
                      "1927 Ford Model T",
                      null);
                    var cacheControl = conn.getCacheControl(null);
                    // must have mondrian.rolap.EnableRolapCubeMemberCache=false set in
                    // pentaho-solutions/system/mondrian/mondrian.properties
                    // We could also set it explicitly here by calling:
                    //     Packages.mondrian.olap.MondrianProperties.instance().EnableRolapCubeMemberCache.set(false);
                    var addCommand = cacheControl.createAddCommand(member);
                    cacheControl.execute(addCommand);
                    return "Flushed: " + parent;
                }
                cacheControlImpl();
            ]]>
        </script>
      </component-definition>
    </action-definition>
We're done. The complete script is available for download.

CacheControl contains methods to perform changes on the member cache, but the cell cache as well. Using CacheControl to modify the members caches updates both the member caches and flushes the cell data's cached regions as well. (This also affects any pluggable SegmentCache you are using by flushing out the regions touched by the members you modify.)

As mentioned in the script above, you must turn Mondrian's ROLAP member cache off. This means Mondrian must wrap objects as it resolves your query in a thin wrapper. That operation itself takes some time to perform, but allows Mondrian to optimize the members at runtime. Usually, we only wrap them once and cache them, but this leaves out the updates performed by CacheControl. This is a technical debt we pay, but fear not, Mondrian 4.0 will address that issue and remove that layer altogether.

The future

Our next step will be to expose that functionality through olap4j and make it easily available to ETL tools out there. There is, at the time this article is written, someone working on adding olap4j connections to the Kettle project. I can't specify any release date, but trust me, this opens a whole lot of possibilities. (A Kettle step for a Mondrian cache update would be pretty cool. Contributions are welcome, as always, wink wink.)

Our final step (i mean long term here...) will be to bubble up those events to the olap4j driver, using the CellSetListener. This will allow user interfaces to get notified when new data arrives.

Full circle.

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. We are using Mondrian Community Edition, and has some issues with cache flushing - i.e. the cache is not flused until the JVM GC deletes the memory. Is there some new/better solutions to flush different parts of Mondrian cache? Have anyone finished the oalp4j connections to Kettle?

    ReplyDelete