Thursday, December 15, 2011

Robot 02

Last month, I built my second robot, which I cleverly nicknamed Robot 02. Its name stands for... well the name says it all, really.


Pictured above, Robot 02 is my second attempt at building an autonomous robot capable of navigating around a room. I wanted it to go around obstacles and detect when it gets stuck so that it can act accordingly.


Thursday, October 27, 2011

A tale of Mondrian and real time analytics

Back in April 2008, Julian Hyde, founder of Mondrian, extends a basic API to control the contents of member caches within the Mondrian OLAP engine. Its name, quite unsurprisingly, is CacheControl. There is a basic implementation available, yet the feature remains unknown to most of the community.

Jump to December 2010. We are brainstorming on Mondrian 3.3 (today this feels like ages ago). We come up with those crazy ideas about enterprise integration, real time analytics and cool APIs / SPIs. One of these crazy ideas is: Wouldn't it be sweet to update Mondrian's member cache and do real time OLAP?

That's when we remember the old CacheControl API.

Tuesday, April 12, 2011

olap4j 1.0 is here

Dear olap4j community,

It is with immense pride that we deliver to you today version 1.0 of olap4j. Yup. And you should be proud as well. Thanks to you and to the concerted efforts of countless collaborators, advisers, testers, integrators and developers worldwide, we were able to reach this historic milestone.

If you would allow me to exaggerate blatantly for a second, olap4j 1.0 is what I believe to be the greatest and bestest (yes, you've read it right, bestest) version to have ever been granted to humanity.

At this point, I encourage you to go read the official press release instead of reading my futile efforts at proper journalism. Our fine folks really know how to best express what olap4j represents, why it is important and why it *might* be the bestest thing on Earth. I've also resisted the urge to copy / paste their fabulous text. That said, I'll do my part and get into the more technical details.

You will first notice that the distribution files have changed. The XML/A driver is now separated from the core API. This is a good thing for the project, as it will allow us to fix issues and release more often. Those of you who are using the JDK 1.4 compatible distribution will still get the XML/A driver in the same library. The four binary libraries are available on our Maven repository and SourceForge project.

Among other changes, the top levels of the connection's metadata have also received some upgrades. A new metadata element was introduced at the top: the Database. We have also removed everything that was marked for deprecation as of 1.0. We therefore encourage people to test carefully while upgrading.

The reference implementation, Mondrian, is compatible with olap4j 1.0, but only for version 3.3.0.14192 and upwards. This means that if you are using both Mondrian and olap4j, the only compatible build is neither certified nor tested any more than a regular CI release. On production systems, you should abstain from upgrading to Mondrian 3.3.X or risk the consequences, although the 3.3.X series is very similar to what we released with its 3.2.X cousin. Anyhow, bottom line is:
  • If you're only using olap4j and the XML/A driver, upgrade at will! 
  • If you're using Mondrian too, we suggest waiting for the official Mondrian 3.3 release.
  • If you like to live on the edge, grab olap4j 1.0.0.445 and Mondrian 3.3.0.14192. They are both available in our Maven repository.
That's it for now. As usual, drop by our forums or mailing list if there is anything we can help with. And again, congratulations to everyone for the hard work.
olap4j's Maven repository cheat sheet
url:
    http://repository.pentaho.org/artifactory/
group:
    org.olap4j
artifacts:
    olap4j
    olap4j-xmla
    olap4j-jdk14
    olap4j-tck

Tuesday, March 1, 2011

Mondrian High Cardinality Dimensions Demystified

Performance is the buzzword in Analysis these days. Real time, MapReduce, scalability. Mondrian is no exception to this trend. As the worldwide pool of data grows, our analysis software must follow in its wake. I often get questions related to Mondrian performance, and most of the time, people expect a magic answer to their performance issues. The answer to solve them all.

Mondrian, like other ROLAP engines, is notoriously hard to fine tune for performance. There are many pieces of technology collaborating together, and the final performance will be that of the weakest link. In Mondrian's case, the performance must therefore be taken into account at all levels:
  • We must perform efficient internal processing
  • We must use the leanest internal data structures
  • We must generate 'scalable' SQL queries
  • Many many other things...
The first two points are easy enough to understand. Good software is lean and efficient. There is no black magic here. The third, on the other hand, sounds like nothing more than a marketing buzzword, eh? Well... there is more to it. As I said earlier, the total performance will be that of the weakest link. It follows that Mondrian, being a ROLAP engine, must at all times issue the best SQL queries possible to the underlying RDBMS, or suffer the consequences. Why 'scalable', and what does it mean here?

When designing a data warehouse, the number of records in your fact and dimension tables will greatly influence the overall system performance. Again, no voodoo dolls here, just plain common sense. Some dimensions are huge. We're talking millions of members. We call those High Cardinality Dimensions (HCD). They are in fact a common occurrence. A popular website's data warehouse is very likely to have a users dimension table containing millions of rows. A reseller will have a products dimension table just as big, or bigger. A common way to safeguard the performance of those tables is to use table partitioning. In short, this means your RDBMS will split the table into logical sections. When queried, the RDBMS will only scan the partitions relevant to the query, therefore leaving most of the data alone. This all sounds good in theory, but there is a very important implied requirement here. The SQL queries must take this into account. Issuing a SELECT COUNT(*) on the table will require the RDBMS to scan all partitions, thus eliminating the advantage gained by partitioning the table.
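
To make the partition argument a bit more concrete, here is a minimal Python sketch. It is not tied to any particular RDBMS and only models the idea. It assumes a hypothetical customers table that is range-partitioned on its id, with 1,000 ids per partition; the partition size, the table size and the function names are all invented for illustration.

```python
# Toy model of partition pruning. Everything here is hypothetical:
# a customers table range-partitioned on id, 1,000 ids per partition.
PARTITION_SIZE = 1_000
PARTITION_COUNT = 5_000  # 5 million customers in total, ids start at 1


def partition_of(customer_id):
    """Return the index of the partition that stores this customer id."""
    return (customer_id - 1) // PARTITION_SIZE


def partitions_scanned(customer_ids=None):
    """Return the set of partitions a query has to scan.

    With no predicate on the partition key (customer_ids is None), as in
    SELECT COUNT(*), every partition must be scanned.
    """
    if customer_ids is None:
        return set(range(PARTITION_COUNT))
    return {partition_of(cid) for cid in customer_ids}


full_scan = partitions_scanned()                  # no predicate at all
pruned = partitions_scanned([1, 2, 3, 4, 5])      # ids in one partition
boundary = partitions_scanned([999, 1000, 1001])  # ids across a boundary

print(len(full_scan))  # 5000
print(len(pruned))     # 1
print(len(boundary))   # 2
```

A query without a predicate on the partition key pays for every partition, while a query constrained to a handful of ids touches one partition, or two when the ids happen to straddle a boundary.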

So. Today's article will cover Mondrian's means to leverage the table partitioning of the underlying RDBMS. We will only cover the case of HCD. Not the fact table. I might write something later, but in the meantime, there are some very good articles out there on the subject.

Paradigm shift. We are a user. We are seated in front of an OLAP exploration tool. It is connected to a Mondrian instance. We drag the customers dimension onto the grid along with the regions dimension and the sales measure. We press the execute button. The query takes 10 seconds to return. We are frustrated. We blame the guy responsible for the Mondrian instance.

Paradigm shift. We are the guy who administers the Mondrian instance. Looking at the SQL logs, we see the following:
select
  r.region_name, c.customer_name, sum(f.sales)
from
  sales_fact f, regions r, customers c
where
  f.region_id = r.id and f.customer_id = c.id
group by
  r.region_name, c.customer_name
Remember my previous post about segments and internal cell data representation within Mondrian? The query above is due to Mondrian's decision to create a single segment based on the query the user issued. This query looks great, right? No. Well, it depends. It's fine IF none of the tables here are partitioned. If they were, the query would scan all partitions. This is where Mondrian's HCD capabilities come in. Mondrian schemas have an attribute on the CubeDimension element, highCardinality, to instruct the engine to switch to HCD mode when dealing with that particular dimension. It is detailed here. The end result is that the single segment is split into smaller segments, each covering a single tuple. The SQL generated becomes:

select
  r.region_name, c.customer_name, sum(f.sales)
from
  sales_fact f, regions r, customers c
where
  f.region_id = r.id and f.customer_id = c.id
  and r.id = 42 and c.id = 1
group by
  r.region_name, c.customer_name

select
  r.region_name, c.customer_name, sum(f.sales)
from
  sales_fact f, regions r, customers c
where
  f.region_id = r.id and f.customer_id = c.id
  and r.id = 42 and c.id = 2
group by
  r.region_name, c.customer_name

(...)

select
  r.region_name, c.customer_name, sum(f.sales)
from
  sales_fact f, regions r, customers c
where
  f.region_id = r.id and f.customer_id = c.id
  and r.id = 42 and c.id = N
group by
  r.region_name, c.customer_name
We are now assured that each query touches one and only one partition. It might seem less than optimal to issue so many queries, but when you get down to benchmarking it, the performance is quite good. Not only do small queries take very little time to execute on the RDBMS side, their results are also lightweight and can be handled by Mondrian extremely efficiently. As a matter of fact, in most cases where I tested this, it was more efficient to issue all of those small queries than the big whopping cross-partition query.

But we can do better. By changing the value of the mondrian.result.highCardChunkSize property, we can batch X tuples per segment instead of a single one. Setting it to 5 would result in queries looking similar to this:
select
  r.region_name, c.customer_name, sum(f.sales)
from
  sales_fact f, regions r, customers c
where
  f.region_id = r.id and f.customer_id = c.id
  and r.id = 42
  and c.id in (1,2,3,4,5)
group by
  r.region_name, c.customer_name
The chances that the customer ID predicate spans two different partitions are quite small. Even if that were to happen, the query would still touch no more than two partitions, compared to all of them previously. There are more ways to optimize Mondrian for HCDs, but that's it for today.
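
For the curious, the batching idea can be sketched in a few lines of Python. This is not Mondrian's code, just an illustration of splitting a list of customer ids into chunks and building one SQL statement per chunk. The function names are hypothetical, and the predicates use the id columns that appear in the join conditions of the queries above. A chunk size of 1 gives the one-query-per-tuple behaviour, and a chunk size of 5 gives the IN predicate.

```python
# Sketch of chunked query generation. This is NOT Mondrian's code; the
# function names are hypothetical and the SQL only mimics the post's queries.
def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_queries(region_id, customer_ids, chunk_size):
    """Build one SQL statement per chunk of (integer) customer ids."""
    queries = []
    for chunk in chunks(customer_ids, chunk_size):
        if len(chunk) == 1:
            predicate = f"c.id = {chunk[0]}"
        else:
            predicate = "c.id in (" + ",".join(str(c) for c in chunk) + ")"
        queries.append(
            "select r.region_name, c.customer_name, sum(f.sales) "
            "from sales_fact f, regions r, customers c "
            "where f.region_id = r.id and f.customer_id = c.id "
            f"and r.id = {region_id} and {predicate} "
            "group by r.region_name, c.customer_name"
        )
    return queries


customer_ids = list(range(1, 13))  # twelve customers, ids 1 to 12
one_per_tuple = build_queries(42, customer_ids, chunk_size=1)
batched = build_queries(42, customer_ids, chunk_size=5)

print(len(one_per_tuple))  # 12
print(len(batched))        # 3
```

With twelve customers, a chunk size of 5 brings the number of round trips down from twelve to three, while each statement still constrains the customer id and so stays within one or two partitions in the common case.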

Sunday, February 13, 2011

Welcome!

I was searching the internet for some technical documents and stumbled on this website where the documents were behind a password prompt. What struck me was the weird password prompt, which was not the standard Firefox Basic authentication prompt. Using curl, I discovered this gem:
<SCRIPT language="JavaScript">
<!--hide
var password;
var pass="[redacted]";

password=prompt('Enter password:',' ');

if (password==pass)
  alert('Welcome!');
else
  window.location="errorframe.asp";
   
//-->
</SCRIPT>
Welcome indeed.