Thursday, December 15, 2011

Robot 02

Last month, I built my second robot, which I cleverly nicknamed Robot 02. The name stands for... well, the name says it all, really.


Pictured above, Robot 02 is my second attempt at building an autonomous robot capable of navigating around a room. I wanted it to go around obstacles and detect when it gets stuck so that it can act accordingly.


Thursday, October 27, 2011

A tale of Mondrian and real time analytics

Back in April 2008, Julian Hyde, founder of Mondrian, extends a basic API to control the contents of member caches within the Mondrian OLAP engine. Its name, quite unsurprisingly, is CacheControl. There is a basic implementation available, yet the feature remains mostly unknown to the community.

Jump to December 2010. We are brainstorming on Mondrian 3.3 (today this feels like ages ago). We come up with those crazy ideas about enterprise integration, real time analytics and cool APIs / SPIs. One of these crazy ideas is: Wouldn't it be sweet to update Mondrian's member cache and do real time OLAP?

That's when we remember the old CacheControl API.

Tuesday, April 12, 2011

olap4j 1.0 is here

Dear olap4j community,

It is with immense pride that we deliver to you today version 1.0 of olap4j. Yup. And you should be proud as well. Thanks to you and to the concerted efforts of countless collaborators, advisers, testers, integrators and developers worldwide, we were able to reach this historic milestone.

If you would allow me to exaggerate blatantly for a second, olap4j 1.0 is, I believe, the greatest and bestest (yes, you read that right, bestest) version ever to have been granted to humanity.

At this point, I encourage you to go read the official press release instead of reading my futile efforts at proper journalism. Our fine folks really know how to best express what olap4j represents, why it is important and why it *might* be the bestest thing on Earth. I've also resisted the urge to copy / paste their fabulous text. That said, I'll do my part and get into the more technical details.

You will first notice that the distribution files have changed. The XML/A driver is now separated from the core API. This is a good thing for the project, as it will allow us to fix issues and release more often. Those of you who are using the JDK 1.4 compatible distribution will still get the XML/A driver in the same library. The four binary libraries are available on our Maven repository and SourceForge project.

Among other changes, the top levels of the connection's metadata have also received some upgrades. A new metadata element was introduced at the top: the Database. We have also removed everything that was marked for deprecation as of 1.0. We therefore encourage people to test carefully while upgrading.

The reference implementation, Mondrian, is compatible with olap4j 1.0, but only from version 3.3.0.14192 upwards. This means that if you are using both Mondrian and olap4j, the only compatible build is not certified nor tested any more than any regular CI release. On production systems, you should abstain from upgrading to Mondrian 3.3.X or risk the consequences, although the 3.3.X series is very similar to what we released with its 3.2.X cousin. Anyhow, the bottom line is:
  • If you're only using olap4j and the XML/A driver, upgrade at will! 
  • If you're using Mondrian too, we suggest waiting for Mondrian 3.3 official.
  • If you like to live on the edge, grab olap4j 1.0.0.445 and Mondrian 3.3.0.14192. They are both available in our Maven repository.
That's it for now. As usual, drop by our forums or mailing list if there is anything we can help with. And again, congratulations to everyone for the hard work.
olap4j's Maven repository cheat sheet
url:
    http://repository.pentaho.org/artifactory/
group:
    org.olap4j
artifacts:
    olap4j
    olap4j-xmla
    olap4j-jdk14
    olap4j-tck
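For Maven users, the cheat sheet above translates into a POM fragment along these lines. This is a hypothetical sketch, not an official snippet: the version number is simply the 1.0.0.445 build mentioned earlier, so adjust it to whatever release you actually need.

```xml
<!-- Illustrative POM fragment; the version shown is the 1.0.0.445 build
     referenced above and may not be the latest available. -->
<repositories>
  <repository>
    <id>pentaho</id>
    <url>http://repository.pentaho.org/artifactory/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.olap4j</groupId>
    <artifactId>olap4j</artifactId>
    <version>1.0.0.445</version>
  </dependency>
  <!-- Add olap4j-xmla as well if you use the XML/A driver. -->
</dependencies>
```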

Sunday, February 13, 2011

Welcome!

I was searching the internet for some technical documents and stumbled on a website where the documents were behind a password prompt. What struck me was the weird prompt, which was not the standard Firefox BASIC authentication dialog. Using curl, I discovered this gem:
<SCRIPT language="JavaScript">
<!--hide
var password;
var pass="[redacted]";

password=prompt('Enter password:',' ');

if (password==pass)
  alert('Welcome!');
else
  window.location="errorframe.asp";
   
//-->
</SCRIPT>
Welcome indeed.

Thursday, February 3, 2011

Mondrian SPI SegmentCache

Fellow Mondrian developers and users,

One month has already passed since the new year festivities, and while most of you have been trying to renew your gym memberships or hold on to your new year's resolutions as best you could, so has the Mondrian team. Our resolutions, although not requiring personal sacrifices, are nonetheless starting to bear fruit.

For you see, our resolution for the year was to provide Mondrian developers and integrators with the means to achieve better understanding, scalability and control. We have many ideas on how to reach those goals. Some of them are still in their infancy, yet some have already been committed to the source. Last month, we worked on the first phase. We added the means for system architects to externalize and share a pluggable segment cache. What does this mean exactly? Let's take a step back in order to better understand.

Internally, Mondrian splits tuples into segments. A typical segment could be described as a measure crossjoined by a series of predicates. As an example, a textual representation of a segment's contents could be:
Measure = [ Sales ]
Predicates = {
    [ Products = * ],
    [ State = California ],
    [ Gender = Male ] }
Data = [ 1346.34, 234.00, ... ]
In the case above, the segment represents the Sales data of all males in California, for all products. It is a lot more efficient to deal with these data structures. If Mondrian were to represent each data cell individually, the unique identifier of that cell would be larger than the data itself, creating a whole lot of problems in terms of data efficiency. This is why Mondrian deals with groups of cells, which it loads in batches, rather than individually.

There is a lot of voodoo magic and heuristics in the background trying to figure out how best to group those segments and how to reduce the number of segments to load, ultimately reducing the number of SQL queries to be executed. Mondrian will group all segments that share the same predicates but have a different measure into a segment group. Mondrian will also tend to remove as many predicates as it possibly can in order to optimize the data payload. Let's say a segment covers all products except a single one; Mondrian will still include that product in the segment but filter it out when a specific query requires it.
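To make the grouping idea concrete, here is a tiny, self-contained sketch. The `Segment` record and `group` method are hypothetical stand-ins for illustration only, not Mondrian's actual classes; they merely show how segments sharing predicates but differing in measure can be batched into one group, and hence one SQL pass.

```java
import java.util.*;

// Hypothetical sketch: segments with identical predicates but different
// measures are grouped together, so one SQL query can populate them all.
// The real Mondrian classes (in mondrian.rolap.agg) are far more involved.
public class SegmentGrouping {
    record Segment(String measure, Map<String, String> predicates) {}

    static Map<Map<String, String>, List<Segment>> group(List<Segment> segments) {
        Map<Map<String, String>, List<Segment>> groups = new LinkedHashMap<>();
        for (Segment s : segments) {
            // Segments keyed by their predicate set form a "segment group".
            groups.computeIfAbsent(s.predicates(), k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> preds = Map.of("State", "California", "Gender", "Male");
        List<Segment> segs = List.of(
            new Segment("Sales", preds),
            new Segment("Cost", preds),
            new Segment("Sales", Map.of("State", "Oregon")));
        // Sales and Cost over California males share predicates, so they
        // land in the same group; the Oregon segment stands alone.
        System.out.println(group(segs).size()); // prints 2
    }
}
```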

Once those segments are populated, Mondrian keeps them in a collection of weak references in local memory. All required segment references are pinned down while a particular query is resolved, but as soon as the query is done executing, the references are returned to their weak state, ready to be garbage collected if needed. This simple mechanism allows Mondrian to answer just about any query, as long as the allocated memory is big enough for that particular query. In fact, this works really well, since in most small deployments the maximum amount of memory is never reached. And if it ever gets filled, old segments are evicted to make room for the new ones.

Now, there are obvious gotchas. First off, what if it takes a long time for the RDBMS to populate a segment? If that segment ever gets picked up by the garbage collector, the next MDX query sent to Mondrian *might* take much longer to execute, depending on whether the segment was still cached or not. This is not acceptable, simply because it makes all performance predictions impossible.

This is where the SegmentCache SPI comes in. It is essentially a pluggable cache for segments. The algorithm behind the segment loader becomes this:
  • Look up segments in the local cache and pin those required.
  • Optimize / group the segments.
  • Look up the remaining segments in the SPI cache.
  • Load the segments found in the SPI cache.
  • Populate the remaining unloaded segments from the RDBMS.
  • Put the segments that came from the RDBMS into the SPI cache.
  • Pin all loaded segments.
  • Resolve the query.
  • Unpin all segments in the local cache.
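As a rough illustration of the cache-then-RDBMS portion of those steps, here is a minimal sketch where the SPI cache is reduced to a shared map and the database round trip to a stub. All names here are hypothetical; the real logic lives in Mondrian's aggregation manager.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch only: the SPI cache is a plain shared map and the
// RDBMS is a stub method, just to show the lookup / populate / publish flow.
public class SegmentLoaderSketch {
    static Map<String, double[]> spiCache = new ConcurrentHashMap<>();

    static double[] loadFromRdbms(String header) {
        // Stand-in for the SQL round trip that populates a segment.
        return new double[] {1346.34, 234.00};
    }

    static double[] loadSegment(String header) {
        // 1. Look the segment up in the SPI cache first.
        double[] body = spiCache.get(header);
        if (body == null) {
            // 2. Miss: populate it from the RDBMS...
            body = loadFromRdbms(header);
            // 3. ...and publish it so other Mondrian instances can reuse it.
            spiCache.put(header, body);
        }
        return body;
    }

    public static void main(String[] args) {
        double[] first = loadSegment("[Sales]/California/Male");
        double[] second = loadSegment("[Sales]/California/Male");
        // The second lookup is served straight from the shared cache.
        System.out.println(first == second); // prints true
    }
}
```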
But wait! There is more! The SegmentCache SPI is trivial to implement.
public interface SegmentCache {
    Future<Boolean> contains(SegmentHeader header);
    Future<SegmentBody> get(SegmentHeader header);
    Future<List<SegmentHeader>> getSegmentHeaders();
    Future<Boolean> put(
        SegmentHeader header,
        SegmentBody body);
    void tearDown();
}
Figure 1. Mondrian Segment Loader Architecture


There are two assumptions made about implementations. The first, obvious one is that the cache must assume that many Mondrian instances might access it concurrently, from different threads. We therefore recommend using the Actor Pattern or something similar in order to enforce thread safety. The second is that SegmentCache implementations will be instantiated very often. We therefore recommend using a facade object which relays calls to the actual segment cache code. Update: This was redesigned so that a singleton is created and used throughout Mondrian's internals.
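To sketch the actor recommendation, one option is to let a single-threaded executor own all cache state, so concurrent callers are serialized through its task queue. The `SegmentHeader` and `SegmentBody` records below are stand-ins for illustration, not the real `mondrian.spi` classes.

```java
import java.util.*;
import java.util.concurrent.*;

// Actor-pattern sketch: all cache state is owned by a single-threaded
// executor, so concurrent Mondrian instances never touch the map at the
// same time. The header/body records are placeholders, not mondrian.spi.
public class ActorSegmentCache {
    record SegmentHeader(String id) {}
    record SegmentBody(double[] data) {}

    private final ExecutorService actor = Executors.newSingleThreadExecutor();
    private final Map<SegmentHeader, SegmentBody> cache = new HashMap<>();

    public Future<Boolean> contains(SegmentHeader h) {
        return actor.submit(() -> cache.containsKey(h));
    }
    public Future<SegmentBody> get(SegmentHeader h) {
        return actor.submit(() -> cache.get(h));
    }
    public Future<Boolean> put(SegmentHeader h, SegmentBody b) {
        return actor.submit(() -> { cache.put(h, b); return true; });
    }
    public Future<List<SegmentHeader>> getSegmentHeaders() {
        return actor.submit(() -> List.copyOf(cache.keySet()));
    }
    public void tearDown() {
        actor.shutdown();
    }
}
```

The Future-based signatures fall out naturally here: each call simply returns the Future of the task submitted to the actor's queue.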

As for the storage of the SegmentHeader and SegmentBody objects, we tried to make it as simple and flexible as possible. Both objects are fully serializable and are immutable. They are also specially crafted to use dense arrays of primitive data types. We also tried to make extensive use of Java native functions when copying the data to / from the cache within Mondrian internals.

The bottom line is that from now on the Mondrian community will be free to implement segment caches to fit their needs. We will be rolling out a few default implementations and examples, obviously. One neat implementation could page the segments to a super fast array of SSD drives. Another could store the segments in Terracotta, Ehcache or Infinispan, or just about any scalable caching system out there. So if any of you are interested in implementing this SPI for your business and would like to either share your experiences or contribute those implementations, don't hesitate to contact us. Or me directly.

There is more goodness to come, but that's it for now. Stay tuned!