The Dev Donkey Blog
something something software
Thursday, December 15, 2011
Robot 02
Last month, I built my second robot, which I cleverly nicknamed Robot 02. Its name stands for... well the name says it all, really.
Pictured above, Robot 02 is my second attempt at building an autonomous robot capable of navigating around a room. I wanted it to go around obstacles and detect when it gets stuck so that it can act accordingly.
Thursday, October 27, 2011
A tale of Mondrian and real time analytics
Back in April 2008, Julian Hyde, founder of Mondrian, extends a basic API to control the contents of member caches within the Mondrian OLAP engine. Its name, quite unsurprisingly, is CacheControl. There is a basic implementation available, yet that feature remains largely unknown to most of the community.
Jump to December 2010. We are brainstorming on Mondrian 3.3 (today this feels like ages ago). We come up with those crazy ideas about enterprise integration, real time analytics and cool APIs / SPIs. One of these crazy ideas is: Wouldn't it be sweet to update Mondrian's member cache and do real time OLAP?
That's when we remember the old CacheControl API.
Labels:
bi,
business intelligence,
Mondrian,
olap
Tuesday, April 12, 2011
olap4j 1.0 is here
Dear olap4j community,
It is with immense pride that we deliver to you today version 1.0 of olap4j. Yup. And you should be proud as well. Thanks to you and to the concerted efforts of countless collaborators, advisers, testers, integrators and developers worldwide, we were able to reach this historic milestone.
If you would allow me to exaggerate blatantly for a second, olap4j 1.0 is what I believe to be the greatest and bestest (yes, you read that right, bestest) version ever granted to humanity.
At this point, I encourage you to go read the official press release instead of reading my futile efforts at proper journalism. Our fine folks really know how to best express what olap4j represents, why it is important and why it *might* be the bestest thing on Earth. I've also resisted the urge to copy / paste their fabulous text. That said, I'll do my part and get into the more technical details.
You will first notice that the distribution files have changed. The XML/A driver is now separated from the core API. This is a good thing for the project, as it will allow us to fix issues and release more often. Those of you who are using the JDK 1.4 compatible distribution will still get the XML/A driver in the same library. The four binary libraries are available on our Maven repository and SourceForge project.
Among other changes, the top levels of the connection's metadata have also received some upgrades. A new metadata element was introduced at the top: the Database. We have also removed everything that was marked for deprecation as of 1.0. We therefore encourage people to test carefully while upgrading.
The reference implementation, Mondrian, is compatible with olap4j 1.0, but only for version 3.3.0.14192 and upwards. This means that if you are using both Mondrian and olap4j, the only compatible builds are neither certified nor tested any more than a regular CI release. On production systems, you should abstain from upgrading to Mondrian 3.3.X or risk the consequences, although the 3.3.X series is very similar to what we released with its 3.2.X cousin. Anyhow, the bottom line is:
- If you're only using olap4j and the XML/A driver, upgrade at will!
- If you're using Mondrian too, we suggest waiting for Mondrian 3.3 official.
- If you like to live on the edge, grab olap4j 1.0.0.445 and Mondrian 3.3.0.14192. They are both available in our Maven repository.
olap4j's Maven repository cheat sheet
url: http://repository.pentaho.org/artifactory/
group: org.olap4j
artifacts: olap4j, olap4j-xmla, olap4j-jdk14, olap4j-tck
Labels:
business intelligence,
olap,
olap4j
Tuesday, March 1, 2011
Mondrian High Cardinality Dimensions Demystified
Performance is the buzz word in Analysis these days. Real time, MapReduce, scalability. Mondrian is no exception in this trend. As the worldwide pool of data grows, our analysis software must follow in its wake. I often get questions related to Mondrian performance, and most of the time, people expect a magic answer to their performance issues. The answer to solve them all.
Mondrian, like other ROLAP engines, is notoriously hard to fine-tune for performance. There are many pieces of technology collaborating together, and the final performance will be that of the weakest link. In Mondrian's case, the performance must therefore be taken into account at all levels:
- We must perform efficient internal processing
- We must use the leanest internal data structures
- We must generate 'scalable' SQL queries
- Many many other things...
When designing a data warehouse, the number of records in your fact and dimension tables will greatly influence the overall system performance. Again, no voodoo dolls here, just plain common sense. Some dimensions are huge. We're talking millions of members. We call those High Cardinality Dimensions (HCD). They are in fact a common occurrence. A popular website's data warehouse is very likely to have a users dimension table containing millions of rows. A reseller will have a products dimension table just as big, or bigger.
A common way to safeguard the performance of those tables is to use table partitioning. In short, this means your RDBMS will split the table into logical sections. When queried, the RDBMS will only scan the partitions relevant to the query, leaving most of the data alone. This all sounds good in theory, but there is a very important implied requirement here: the SQL queries must take this into account. Issuing a SELECT COUNT(*) on the table will require the RDBMS to scan all partitions, thus eliminating the advantage gained by partitioning the table.
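To make the partitioning argument a bit more concrete, here is a toy model in Python. It is only a sketch: real RDBMS partitioning is configured in DDL and pruning is decided by the query planner. The partition size, the number of partitions and the function names below are all made up for illustration. It shows that a query with no predicate on the partitioning key has to visit every partition, while a predicate on a handful of ids visits one or two.

```python
# Toy model of range partitioning: rows land in a partition based on their id.
# PARTITION_SIZE, NUM_PARTITIONS and the id-based layout are assumptions made
# for illustration only; they do not describe any particular RDBMS.
PARTITION_SIZE = 1_000_000  # hypothetical: ids 1..1,000,000 live in partition 0
NUM_PARTITIONS = 8


def partition_of(customer_id):
    """Return the index of the partition holding this id."""
    return (customer_id - 1) // PARTITION_SIZE


def partitions_touched(customer_ids=None):
    """Return the set of partitions a query has to scan.

    With no predicate on the partitioning key (customer_ids is None),
    every partition is scanned, as with a bare SELECT COUNT(*).
    """
    if customer_ids is None:
        return set(range(NUM_PARTITIONS))
    return {partition_of(cid) for cid in customer_ids}


full_scan = partitions_touched()                 # no predicate at all
pruned = partitions_touched([1, 2, 3, 4, 5])     # ids within one range
straddling = partitions_touched([999_999, 1_000_000, 1_000_001])

print(len(full_scan), len(pruned), len(straddling))
```

The last case is the unlucky one, where the ids sit on a partition boundary. Even then, only two partitions are involved instead of all eight.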
So. Today's article will cover Mondrian's means to leverage the table partitioning of the underlying RDBMS. We will only cover the case of HCD. Not the fact table. I might write something later, but in the meantime, there are some very good articles out there on the subject.
Paradigm shift. We are a user. We are seated in front of an OLAP exploration tool. It is connected to a Mondrian instance. We drag the customers dimension into the grid along with the regions and the sales measure. We press the execute button. The query takes 10 seconds to return. We are frustrated. We blame the guy responsible for the Mondrian instance.
Paradigm shift. We are the guy who administers the Mondrian instance. Looking at the SQL logs, we see the following:
select
r.region_name, c.customer_name, sum(f.sales)
from
sales_fact f, regions r, customers c
where
f.region_id = r.id and f.customer_id = c.id
group by
r.region_name, c.customer_name
Remember my previous post about segments and internal cell data representation within Mondrian? The query above is due to Mondrian's decision to create a single segment based on the query the user issued. This query looks great, right? No. Well, it depends. It's fine IF none of the tables here are partitioned. If they were, the query would scan all partitions. This is where Mondrian HCD capabilities come in. Mondrian schemas have an attribute on the CubeDimension element class to instruct it to switch to HCD mode when dealing with that particular dimension. It is detailed here. The end result is that the single segment is split in smaller segments, each covering a single tuple. The SQL generated becomes:
select
r.region_name, c.customer_name, sum(f.sales)
from
sales_fact f, regions r, customers c
where
f.region_id = r.id and f.customer_id = c.id
and r.region_id = 42 and c.customer_id = 1
group by
r.region_name, c.customer_name
select
r.region_name, c.customer_name, sum(f.sales)
from
sales_fact f, regions r, customers c
where
f.region_id = r.id and f.customer_id = c.id
and r.region_id = 42 and c.customer_id = 2
group by
r.region_name, c.customer_name
(...)
select
r.region_name, c.customer_name, sum(f.sales)
from
sales_fact f, regions r, customers c
where
f.region_id = r.id and f.customer_id = c.id
and r.region_id = 42 and c.customer_id = N
group by
r.region_name, c.customer_name
We are now assured that each query touches one and only one partition. It might seem less than optimal to issue so many queries, but when you get down to benchmarking it, the performance is quite good. Not only do small queries take a very small amount of time to execute on the RDBMS side, they are also very lightweight and can be handled by Mondrian extremely efficiently. As a matter of fact, in most cases where I tested this, it was more efficient to issue all of those small queries than the big whopping cross-partition query.
But we can do better. By changing the value of the mondrian.result.highCardChunkSize property, we can batch X tuples per segment, compared to a single one. Setting it to 5 would result in queries looking similar to this:
select
r.region_name, c.customer_name, sum(f.sales)
from
sales_fact f, regions r, customers c
where
f.region_id = r.id and f.customer_id = c.id
and r.region_id = 42
and c.customer_id in (1,2,3,4,5)
group by
r.region_name, c.customer_name
The chances that the customer_id predicate spans two different partitions are quite small. Even if that were to happen, the query would still touch no more than two partitions, as compared to all of them previously. There are more ways to optimize Mondrian for HCDs, but that's it for today.
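If you want to convince yourself that batching does not change the answer, here is a small Python sketch using an in-memory SQLite database. To be clear, this is not how Mondrian does it internally. The schema is a stripped-down, made-up version of the tables in the queries above (it only has `id` columns, so the predicates are written against `r.id` and `c.id`), the data is invented, and `CHUNK_SIZE` merely plays the role of the chunk size property. It runs one big query, then rebuilds the same result from batches of five customers.

```python
import sqlite3


def chunks(ids, size):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical miniature schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table regions (id integer primary key, region_name text);
    create table customers (id integer primary key, customer_name text);
    create table sales_fact (region_id integer, customer_id integer, sales real);
""")
conn.execute("insert into regions values (42, 'North')")
customer_ids = list(range(1, 13))  # 12 made-up customers
conn.executemany("insert into customers values (?, ?)",
                 [(cid, f"customer {cid}") for cid in customer_ids])
conn.executemany("insert into sales_fact values (42, ?, ?)",
                 [(cid, 10.0 * cid) for cid in customer_ids])

BASE = """
    select r.region_name, c.customer_name, sum(f.sales)
    from sales_fact f, regions r, customers c
    where f.region_id = r.id and f.customer_id = c.id
      and r.id = 42 {extra}
    group by r.region_name, c.customer_name
"""

# One big query with no predicate on the customer key.
big = set(conn.execute(BASE.format(extra="")).fetchall())

# The same result, assembled from batches of CHUNK_SIZE customers each.
CHUNK_SIZE = 5
batched = set()
statements = 0
for chunk in chunks(customer_ids, CHUNK_SIZE):
    placeholders = ",".join("?" * len(chunk))
    sql = BASE.format(extra=f"and c.id in ({placeholders})")
    batched.update(conn.execute(sql, chunk).fetchall())
    statements += 1

print(statements, batched == big)
```

With 12 customers and a chunk size of 5, three statements are issued (5, 5 and 2 customers), and the union of their rows matches the single big query. The sketch says nothing about speed. That part only shows up on a real partitioned table.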
Sunday, February 13, 2011
Welcome!
I was searching the internet for some technical documents and stumbled on a website where the documents were behind a password prompt. What struck me was the weird password prompt, which was not the standard Firefox Basic authentication prompt. Using curl, I discovered this gem:
<SCRIPT language="JavaScript">
<!--hide
var password;
var pass="[redacted]";
password=prompt('Enter password:',' ');
if (password==pass)
alert('Welcome!');
else
window.location="errorframe.asp";
//-->
</SCRIPT>
Welcome indeed.
