Sunday, January 04, 2009

MySQL Stored Procedure Result Sets in Pentaho Data Integration

Quick tip - suppose you need the result set of a MySQL stored procedure in your Pentaho Data Integration (a.k.a. Kettle) Transformation, what do you do?

A Call DB Procedure sounds promising, but as it turns out, you can't use it to retrieve any result sets. Rather, this type of step is meant have an input stream from another source:

  • drive stored procedure execution presumably for some useful side-effect

  • invoke a database stored function and obtain the scalar result



So, what can we do? The answer is simpler than might be expected.

Just use an ordinary Table input step. Everybody that has used PDI before knows that the Table input step retrieves data based on an SQL statement. However, the SQL statement is not required to be a SELECT statement. Any statement that returns a result set will do, including SHOW statements, and CALL.

Stay tuned for more tips like these - more will come as Jos and I are completing our "Building Pentaho Solutions" Book.

29 comments:

Anonymous said...

Hi Roland,

I've recently encountered this problem with another reporting application.

Do you know why calling stored procedures directly is not supported?

Regards,
Shlomi

rpbouman said...

Hi Shlomi!

"Do you know why calling stored procedures directly is not supported?"

In case of your reporting application - I can only guess...What product did you experience this with? I think this functionality was recently added to Pentaho Reporting (JFreeReport).

I do know that, at least in the MySQL case, the application must use the prepared statement protocol when executing the procedure, else you won't be able to capture the result set.

Note that in the case of pentaho data integration, it is in fact supported to call the stored procedure directly by explicitly typing the CALL statement in the "Table input" step. The only thing that is odd here is the name of the step, "Table input", because it does not literally read rows from a table, but an SQL statement in general.

This leaves the question: "Why doesn't the existing Call DB Procedure step support this", or "Why isn't there a step especially for getting result sets out of stored procedures (or table functions for that matter)".

I can only guess but I think the answer is that in case of the "Call DB Procedure", the design considerations were simply to allow a pdi input stream to drive stored procedure execution and to obtain the return value, and possibly OUT parameter values - no more no less.

I think that there is no technical reason not to support it - the JDBC API allows for resultsets to be returned from prepared statements, so if the Call DB Procedure uses that it should be able to capture a result set (or even multiple, although that would open a whole can of worms on its own).

That said, for a data integration tool like pdi, I do not think it is terribly important to be able to capture SP result sets, more of a nice to have thing. But maybe I'm too short-sighted here. Please feel free to continue this discussion, and maybe some of the PDI developers will chime on to provide their point of view.

Anonymous said...

Roland,

Great news on the book !!

Congratulations, keep the informative blog posts coming !!

--
Tom Hanlon

Anonymous said...

Roland,

Thanks for your elaborate response.
I'm ashamed to say I don;t remember the name of the application. I had it on a customer's site.

I had a good reason to use a stored procedure instead of a basic query. There was a problematic, long running LEFT JOIN query, which we managed to significantly reduce using two INNER JOINs UNION, and two truncated temporary tables. This process required a stored procedure. All worked well till we realized the reporting tool did not support SPs. Eventually, we used another solution, which was not as fast as the SP one.

As a Java programmer, I know just how easy it is to get results from a SP: almost as easy as running a query. You just need to know in advance that it is indeed a SP you're calling. And if you're not even passing parameters, it's even easier.

Regards,
Shlomi

rpbouman said...

Hi Tom!

thanks for the kind words. I am pretty sure you'll hear more about it. BTW - looking forward to meeting you again at the MySQL UC. I'm sure we can find some illegal roulette racket somewhere to spill some chips ;)

rpbouman said...

Hi Shlomi!

Don't be ashamed - maybe gigs, many tools - you can't devote everything to memory.

Maybe you can checkout Pentaho Reporting in that case (aka JFreeReport) - Like I said I think they added it. I believe BIRT also supports it.

That said, what you are describing sounds like a pretty complex thing to do, just to run a report. My gut feeling says that it would probably be better to set up an environment that is designed to cater to queries, and that uses some after hours to pre-process things so you know you can always efficiently get the answers whenever you need them.

I realize It might not always be possible and some customers are truly scared by the term "Data Warehouse" but really, I am feeling more and more inclined to advise against one-by-one copies of the source systems as "reporting environments" in favor of a proper, star schema - based data warehouse. This would in almost all cases eliminate the need for explicit temporary tables and outer joins at all

Kyle said...

FYI - I have not had luck in all "SHOW" statements working with PDI - specifically "show slave status" - it returns values but for some reason PDI doesn't recognize them. Not sure the reason.

Anonymous said...

can anyone plz tell me how to obtain the result set produced by stored procedure in java ??
I am using MySQL 5.0

Anonymous said...

Roland,

After some searching, I came across your recommendation. The problem I'm finding is that PDI does not detect the column names returned (as a resultset) from the call, so I am getting some rather dire errors. I'm sending the results of the stored procedure call to a stream lookup, which may be my problem. Not sure if you've encountered this, but it's a potential downside to this solution.

rpbouman said...

Hi kabel!

yes, this is indeed a problem at least when designing the transformation. However, you can work around it by manually specifying the fields in the lookup step.

Ugly, yes. But work it will.

I checked myself just now, and I can successfully use the stream look up step on the result of a stored procedure.

I haven't looked into it very deeply, but it may be a good idea to use a Select Values step behind the table input to explicitly set the data types as well if you want to use the data to do lookups.

HTH, let me know if you find out more.

kinnd regards,
Roland

Anonymous said...

Roland,

You must have a lot better luck than I do. I'm returning a dataset out of my stored procedure using SELECT, sending that to a Select Values, where I try to set the data types of the two columns being returned from the SP, and sending that to the Stream Lookup, but am getting some null pointer errors from the Stream Lookup step It's not a deal-breaker, as I have a workaround in place, it's just frustrating.

Thanks for looking at it, though

kabel

Anonymous said...

Wow, ignore that, I was making a very stupid mistake. I seem to have the same luck.

Using a Select Values did the trick. It still gives errors if I try to "get fields" or "get input/output fields" anywhere downstream from the Stream Lookup, but it appears that the Stream Lookup is working perfectly.

Thanks again!

rpbouman said...

Hi kabel,

I checked, with and without Select Values, and it 'just works'.

Would you like me to send you my transformation? Just mail me at Roland.Bouman@google.com so I know where to send it.

Thanks!

rpbouman said...

Kabel, gah...
should be

Roland.Bouman@gmail.com

Anonymous said...

Hi,

When using Table input step to get the value from stored function, everything is fine except the returned value is rounded to 1 decimal place although the function itself RETURNS DECIMAL(14,4)

Any idea?

rpbouman said...

Hi Anonymous!

can you please explain a little bit more about how you set this up? I should like to try and reproduce it before I attempt to explain.

kind regards,

Roland

Anonymous said...

Hey Roland, just got your book...congrats and a good read! I have a transformation I'd like your take on from a "best practice" perspective. I'm thinking there's multiple ways of doing this, so I'd like to understand what the most efficient is.

Scenario: table input with a field that I need to make consistent. Lets say the values I can receive from this field are as follows:
1. example - A101
2. example - A202
3. example - B402

I'd like to create a transformation that would catch 1 and 2 and output as "Jane Smith". 3 would output as "John Doe".

What is the "Pentaho way" of doing this? SQL case statement? Javascript step? built in transformation step?

Thanks much for your time.
Regards,
UserError

rpbouman said...

Hi Anonymous,

thanks for your interest and support! I always enjoy hearing the book is useful to someone.

Regarding your scenario: Although you could solve it in the SQL statement in the input step using a CASE...WHEN expression, I would advise against that. Best practice is to make all transformations to the data as explicit as possible, and writing SQL expressions hides the logic.

In pentaho data integration, I would most probably use a "Value Mapper" step for this case. The value mapper operates on a field in the input stream and then looks if it matches one of the listed "source" values and then returns the mapped "target" value if there is a match. In this particular case, you would have the following source/target pairs: {A101, Jane Smith}, {A202, Jane Smith}, {B402, John Doe}. You can find the value mapper in the "Transform" category. It is first discussed in the book on page 273, "The Value Mapper Step".

I should point out that the value mapper is the way to go if you "just" want to map a few values. If the mapping is actually stored in a database table, you should use a database lookup step. Especially if you need to manage the mapping over time (and you want to keep track of the histoy of your mapping practice) you could start to explictly store the mappings in your staging arera in a lookup table, and use the lookup step to do it.

Now about your suggestions:

You could do it with a javascript step too, but then again: you can do everything with a script step. Like writing SQL expressions in the input stetp, a java script step obscures the logic, and you should only use that step as a last resort. Also, the javascript step is quite slow.

There is also a switch/case step (in the "Flow" category") but you would use that in case you actually wanted to apply a different treatment on the rows depending on a field value.

kind regards, and good luck

Roland Bouman

Anonymous said...

How to pass parameters from Pentaho Report Designer to PDI tranfromation file. I already defined a parameter in PDI transformation file. Let me know how to pass parameters from PRD to PDI.

I tried creating a parameter in PDI but, I do not find anything to specify that the created parameter will be passed to PDI.

Let me know if there is any solution to achieve this.

Thanks,
kal

Anonymous said...

If you need to use stored procedures to result data sets, and include paramaters for ad hoc reporting, you can use a product called JEvolution from www.mcmsoftware.com see the following video: http://www.mcmsoftware.com/docs/jevolfromstp.htm

good luck

Akhil Gupta said...

Visual Studio and SQL Server management Studio are two different entities. So it takes a lot of time when we try an interaction with the database from Visual Studio then it takes a lot of time.

When we are using desktop applications then we do not realize this time delay, but while using web applications the time delay becomes an important factor because the database is on the remote server so a huge time delay is observed.

Normally, the query that we write in Visual Studio is first compiled and then executed in Sql Server which then returns the results. So, this is why a time delay is observed.

The answer to this is Stored Procedure usage. The Stored Procedure is stored in compiled format in SQL server so when we call it from Visual Studio application then it is just executed. Thus saving a lot of time.

An illustration of proper usage of STORED PROCEDURE has been displayed in the following video. You may see it:

http://www.visiontechno.net/studymats/storeprocedure.html

Anonymous said...

Hello Roland,

I see its been a while since this thread has been active. I'm also trying to execute a SP within Kettle (Spoon 4.1).

I used your tip on using the Table input and calling the SP which works. Although.. My table gets filled with data but the transformation returns an error: 'Couldn't get row from resultset'. Maybe you already explained it in this thread, but I don't know how to make my transformation not returning the error.

Perhaps you know a solution?
P.s. Your book is awesome! I got it as well since last month.

Thanks,

Anonymous said...

Hoi Roland,

Hartelijk gefeliciteerd met je boek. We moeten nog eens afspreken, even bij bieren ;) Woon je nog in de zelfde plaats als Catena?

Groetjes,

Robert Nagtegaal.

rpbouman said...

Hi Robert!

heh, da's lang geleden :) Ja hoor, ik zit nog steeds in Leiden. Alles goed?

Ben Uphoff said...

Roland,

Your posts have been incredibly valuable to me - glad I discovered your blog.

Do you know of a way to select from a stored procedure in MySQL? For example, a hypothetical syntax might be:

SELECT s.* FROM db.sp_my_stored_procedure(@arg1,@arg2) s;

This would allow me to also join the stored procedure result set to other tables - handy. I've worked around this by having my stored procedures create temporary & permanent tables, but that's pretty clunky.

Thanks in advance for your sage counsel!

Ben

rpbouman said...

Hi Ben,

thanks for your kind words, much appreciated :)

unfortunately, this is not possible. There must be a feature request somewhere in bugs.mysql.com, I've seen many people request this feature.

I have a suggestion that might improve the workaround you're using at the moment.

Instead of creating a (temporary) table, you can have your stored procedure create a view. The advantage of that it doesn't cost storage, and typically your generating procedure returns faster because it doesn't have wait until the data is stored to disk. If the query in which you then use the view allows the view to be queried using the MERGE algorithm, then that should be reasonably fast (provided there are good indexes on the underlying tables). If the view is evaluated using the TEMPTABLE algorithm then it could be costly because you won't have any indexes on the set from the view.

kind regards, and good luck,

Roland

Ben Uphoff said...

Hi Roland,

Alas, I suspected that would be the answer. Maybe sometime in the future, then...

The tip on switching to views is much appreciated. We have a pretty small data set (at present), so I don't have performance problems - yet. I'll experiment with this, though.

Thanks!

Ben

Ravuri said...

How to call mysql stored procedure in Pentaho report designer ?

Ravuri said...

how to call stored procedure in pentaho report designer ?

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...