Pentaho: A beginner’s perspective

For the last 6 months, I’ve been buried deep in the world of Business Intelligence (or BI for short) and the mind-numbing field that is Data Warehousing and Data Mining. Given that I’m not really a database person, and I only use MySQL with my PHP development, this was a whole new experience for me.

I’m writing this post because when I first laid eyes on Pentaho, I was confused as hell. What I didn’t know was that Pentaho is not a single product, but a collection of open-source tools that have been meshed together to form a BI suite. The documentation and write-ups on the website covering the different parts of Pentaho just didn’t gel, which made for a very steep learning curve.

Hence, I’m going to briefly cover the various components that make up Pentaho, so that if you are just starting out with it, this will hopefully give you a good overview.


Pentaho is available in two flavors: a paid commercial version and an open-source community version. Obviously, if you opt for the paid version, you get expert help and support, so things should be straightforward. But if you are unfortunate enough, like me, to be stuck with the community version, you have to make sense of everything on your own.

If you visit the Pentaho community version’s homepage, you should see that there are essentially five projects that make up the Pentaho platform.

What’s confusing is that the community website uses a naming scheme that is slightly different from the folder names on SourceForge, and some links point you to an older version. All this confusion is really a sign of the Pentaho team’s work in progress to integrate the tools into a cohesive suite.

The best way to get a picture of Pentaho is to look at how the files on SourceForge are organized.

Data Integration (PDI)
In the world of Data Warehousing, there’s a process called ETL which stands for Extraction, Transformation and Loading. This is really the process of compiling data from various sources, cleaning it up and stuffing it into the Data Warehouse (which is just another database system).
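To make the three ETL steps concrete, here is a minimal sketch in Python. It is not how PDI does it, just an illustration of the idea. The table names (`customers`, `dim_customer`), the columns and the cleaning rules are all made up for the example, and two in-memory SQLite databases stand in for a real source system and a real warehouse.

```python
import sqlite3


def run_etl(source, warehouse):
    """Copy cleaned customer rows from a source database into a warehouse table."""
    # Extract: pull the raw rows out of the source system.
    rows = source.execute("SELECT id, name, country FROM customers").fetchall()

    # Transform: trim whitespace, upper-case country codes, drop rows with no name.
    cleaned = [
        (row_id, name.strip(), country.strip().upper())
        for row_id, name, country in rows
        if name and name.strip()
    ]

    # Load: write the cleaned rows into the warehouse table.
    warehouse.executemany(
        "INSERT INTO dim_customer (id, name, country) VALUES (?, ?, ?)", cleaned
    )
    warehouse.commit()
    return len(cleaned)


# Hypothetical sample data: in-memory databases stand in for the real systems.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "  Alice ", "sg"), (2, "", "my"), (3, "Bob", " us ")],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, country TEXT)")

loaded = run_etl(source, warehouse)
result = warehouse.execute("SELECT * FROM dim_customer ORDER BY id").fetchall()
print(loaded, result)
```

A real ETL job would add things like incremental loads and error handling, but the extract, transform and load steps keep the same shape.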

The Pentaho Data Integration (PDI) tool (listed as the Kettle project on the community website) is built around the Spoon application, which provides a graphical, model-driven approach to ETL. You use it to craft your own ETL process from its library of transformation steps.

Here’s an example of a simple process to load records from one MySQL table into another.

In my case, although I was quite impressed by how the tool works, I still ended up building my own ETL process using cron and command-line PHP. It was really a time issue: I just didn’t have the time to learn how the 100+ transformations work and master them for the project.

Report Designer (PRD)
The Pentaho Report Designer (PRD) is a front-end application for building and publishing reports to the Pentaho platform. It publishes in a new file format, .prpt, which the Pentaho server recognizes and knows how to interpret and display.

The tool works very much like other WYSIWYG reporting tools such as Microsoft Access reports or Crystal Reports. You define your database connection using JDBC/JNDI, drag reporting elements such as text fields onto the report, and tie them to the corresponding database fields.

Business Intelligence Server
This is the core of Pentaho. It is made up of 3 parts – the core platform itself, the BI components and the presentation layer.

The core platform is where all the low-level functions occur, such as database connection management, the solution repository, user authentication, task scheduling, logging etc.

The BI components part is where the various BI engines are plugged in, such as the reporting engine, the OLAP engine, the ETL engine and the Data Mining engine. This is where all the churning of the data happens before the results are spat out to the presentation layer.

The presentation layer is where Pentaho users interact with the platform to view reports and dashboards, perform analysis etc.

Design Studio
The Pentaho Design Studio (PDS) is essentially an Eclipse IDE with a Pentaho plugin that is used to create and maintain action sequences that work within the Pentaho platform. An action sequence is a predefined set of actions which can be triggered by a user’s action, a schedule or another action sequence. It can be as simple as “display the report” or as complex as “find all overdue orders and send out a reminder email”.

Mondrian or Pentaho Analysis Service (PAS)
PAS, or Mondrian, provides OLAP capabilities to the Pentaho platform. When we talk about BI and Data Warehousing, the most common term you hear is ‘slice-and-dice’, and this is essentially what OLAP enables you to do within Pentaho. By representing your data as a multi-dimensional cube, you can easily drill down, drill up, slice or dice the cube to gain insights.
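If the cube operations sound abstract, here is a rough sketch in plain Python of what they mean. This is only an illustration and has nothing to do with how Mondrian works internally (there you define the cube in a schema and query it with MDX). The sales data, the three dimensions and the helper names `aggregate`, `slice_cube` and `dice_cube` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
facts = [
    (2009, "Asia", "Widget", 100),
    (2009, "Asia", "Gadget", 150),
    (2009, "Europe", "Widget", 200),
    (2010, "Asia", "Widget", 120),
    (2010, "Europe", "Gadget", 80),
]
DIMENSIONS = ("year", "region", "product")


def aggregate(rows, group_by):
    """Sum sales per combination of the given dimensions."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose dimension values fall within the allowed sets."""
    return [
        row
        for row in rows
        if all(row[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    ]


# Drill up: total sales per year only.
by_year = aggregate(facts, ["year"])
# Drill down: break each year out by region.
by_year_region = aggregate(facts, ["year", "region"])
# Slice on region = Asia, then total by product.
asia_by_product = aggregate(slice_cube(facts, "region", "Asia"), ["product"])
# Dice: restrict two dimensions at once.
diced = dice_cube(facts, year={2009}, product={"Widget"})
```

Drilling down just means grouping by more dimensions, and drilling up means grouping by fewer. An OLAP engine does the same kind of thing against much larger data.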

Once you have defined your cube, you can publish it to the Pentaho server, and it will be available under the ‘New Analysis View’ option.

I know this post doesn’t cover much in detail and I’ve skipped a few components like Pentaho Metadata and Weka. What I’ve listed is really what I’ve used. Nonetheless, I hope this overview of Pentaho will be useful and helpful to someone at least. Feel free to comment (or correct anything I might have gotten wrong).

Lastly, Pentaho is by no means the only open-source BI tool available; here’s a list of other BI tools.

If you would like to get yourself a book, I would recommend Pentaho Solutions by Roland Bouman and Jos van Dongen. It would get you started in the right direction.

Posted on January 7, 2010 at 5:28 pm by Eldee · Permalink
In: Pentaho

13 Responses

  1. Written by Kevin on January 8, 2010 at 1:52 am

    Do you know any reason why pentaho decided to not use birt as their report designer any more? What’s the different with this new report designer and birt?

  2. Written by cease on January 8, 2010 at 2:20 am

    Pentaho is a beast , the good thing is that now you can charge others consulting fees to help others figure it out, but hopefully you will continue to write about it :) . good stuff

  3. Written by Arturo Tena on January 8, 2010 at 4:40 am

    Great article! Thanks for posting. I was confused too. Now all makes sense.

  4. Written by webmaster on January 8, 2010 at 11:06 am

    Hi folks, thanks for the comments. Yes I hope to push out more Pentaho posts and tutorials.

    @Kevin, I don’t know the reason, but I would think the new PRD is aimed at allowing a more knowledgeable end-user to create reports, without the need for a technical person. I could be wrong.

  5. Written by Tanmay on June 23, 2010 at 4:25 am

    Nice Article

  6. Written by Mark on August 26, 2010 at 5:34 am

    This must have been a lot of work to post article, I appreciate your time and effort in making this understandable for the rest of us.

  7. Written by Gagan on October 5, 2011 at 4:47 am

    I just got into Pentaho DI and DWH and BI. Good starter overview/article for me about PDI.

  8. Written by Moniza Choudhury on October 20, 2011 at 4:10 am

    Great article! Thanks for giving us great high-level overveiw.

  9. Written by Moniza Choudhury on October 20, 2011 at 4:11 am

    Also, want to check if you can give me some idea about the weakness of the Pentaho.

  10. [...] October 2011] Resources: A brief description of the software projects in the BI suite. Introduction to the Pentaho BI suite [pdf] Like [...]

  11. Written by James F on November 28, 2011 at 11:20 pm

    Great High Level breakdown of the Pentaho tools!

    Thanks for doing this.

  12. Written by Josie on October 24, 2012 at 1:40 pm

    Thanks for that. The Pentaho documentation is just so confusing and had me downloading a lot of stuff I dont need. Got confused between paid version doc and open source doc. And downloads pointers kept pointing at download for the integration module which is horrendously large and not necessary for what I want atm At least I think it is optional. Toook ages to download and unpack just to find it was a waste of time.

  13. Written by Josie on October 24, 2012 at 1:45 pm

    Sure enough, I checked again (followed your download link this time) and it looks like the way to get all necessary components is to take this link:

    Looking for the latest version? Download pdi-ce-4.3.0-stable.zip (359.3 MB)

    Which is just plain wrong!!! Grrrrr
