For the last 6 months, I’ve been buried deep in the world of Business Intelligence (or BI for short) and the mind-numbing field that is Data Warehousing and Data Mining. Given that I’m not really a database person, and I only use MySQL with my PHP development, this was a whole new experience for me.
I’m writing this post because when I first laid my eyes on Pentaho, I was confused as hell. What I didn’t know was that Pentaho is not a single product, but a collection of open-source tools that have been meshed together to form a suite for BI. All the documentation and writeup at the website on the different aspects of Pentaho just didn’t gel together, leading to a very steep learning curve.
Hence, I’m just going to briefly cover the various aspects that makes up Pentaho, so that if you are just starting to use Pentaho, this would hopefully give you a good overview.
Pentaho is available in two flavors, a paid commerical version and an open-source community version. Obviously, if you opt for the paid version, you get all the expert help and support so things should be straightforward. But, if you are unfornate enough like myself to be stuck with the community version, you have to try to make sense of everything on your own.
If you visit the Pentaho Community version’s homepage, you should see there are essentially five projects that makes up the Pentaho Platform.
What’s confusing is that the community version website shows one naming scheme which is slightly different from what folders in SourceForge presents, or the link points you to an older version. All this confusion is really a sign of the work in progress by the Pentaho team to try to integrate the tools into a cohesive suite.
The best representation of Pentaho is to look at how the files at SourceForge is organized.
Data Integration (PDI)
In the world of Data Warehousing, there’s a process called ETL which stands for Extraction, Transformation and Loading. This is really the process of compiling data from various sources, cleaning it up and stuffing it into the Data Warehouse (which is just another database system).
The Pentaho Data Integration (PDI) tool (which is listed as Kettle project at the community version website), primarily contains the Spoon application, that provides a graphical model-driven approach to ETL. You would use it to craft your own ETL using the library of transformation processes.
Here’s an example of a simple process to load records from one MySQL table into another.
In my case, although I was quite impressed by how the tool works, I still ended up creating my own ETL process using cron and command-line PHP. It was really a time issue that led me to use PHP for the ETL process. I just didn’t have the time to learn how the 100+ transformation work and master it for the project.
Report Designer (PRD)
The Pentho Report Designer (PRD) is front-end application for building and publishing reports to the Pentaho platform. It publishes in a new file format .prpt which the Pentaho server recognizes and knows how to intepret it to display.
The tool works very much like other WYSIWYG-based reporting tools like Microsoft Access reporting or Crystal Reports. You define your database connection using JDBC/JDNI, and drag reporting elements like text field etc onto the report, and tie them to the corresponding database field.
Business Intelligence Server
This is the core of Pentaho. It is made up of 3 parts – the core platform itself, the BI components and the presentation layer.
The core platform is where all the low-level functions occurs, such as database connection management, solution repository, user authentication, task scheduling, logging etc.
The BI component part is where the various BI engines are plugged into, such as the reporting engine, the OLAP engine, the ETL engine and the Data Mining engine. This is where all the churning of the data occurs and spits it out to presentation layer.
The presentation layer is where the Pentaho users will interact with platform, to view reports, dashboard, perform analysis etc.
The Pentaho Design Studio (PDS) is essentially an Eclipse IDE with a Pentaho plugin that is used to create and maintain action sequences that work within the Pentaho platform. Action sequences is a predefined set of actions which can be triggered by a user’s action, a schedule or any other action sequence. Action sequence can be as simple as “display the report” to as complex “find all overdue orders and send out a reminder email”.
Mondrian or Pentaho Analysis Service (PAS)
PAS or Mondrian provides OLAP capabilities to the Pentaho platform. When we talk about BI and Data Warehousing, the most common term you hear is ‘slice-and-dice’, well, this is essentially what OLAP enables you to do within Pentaho. By representing your data in a multi-dimensional cube, you can easily drill-down, drill-up, slice or dice the cube to gain insights.
Once you have defined your cube, you can publish it Pentaho server, and it will be available under the ‘New Analysis View’ option.
I know this post doesn’t cover much in detail and I’ve skipped a few components like Pentaho Metadata and Weka. What I’ve listed is really what I’ve used. Nonetheless, I hope this overview of Pentaho will be useful and helpful to someone at least. Feel free to comment (or correct anything I might have gotten wrong).
Lastly, Pentaho is by no means the only open-source BI tool available, here’s a list of other BI tools.
If you would like to get yourself a book, I would recommend Pentaho Solutions by Roland Bouman and Jos van Dongen. It would get you started in the right direction.