Comments on Roland Bouman's blog: Pentaho Data Integration: Kettle turns data into business

(Comments are listed newest first, so replies generally appear before the questions they answer.)

Marco (2017-01-23 08:50):

Dear Roland,

I have a question about developing a Kettle plugin. I run Spoon in my development environment in my IDE (Eclipse), but my plugin does not show up in the list. Do I have to register the plugin? I did the same as in my enterprise installation: in my plugin folder I created a "step" folder, and inside it a folder named after my plugin, into which I copied the icon and the plugin.xml. In the enterprise installation I also copy my jar there, and it works; but not in my dev environment. :) If you could give me some advice, that would be wonderful.

Thanks,
Marco
rpbouman (2014-03-27 07:35):

Pascal, take a look at the steps in the Statistics category, like Group By (http://wiki.pentaho.com/display/EAI/Group+By) or Analytic Query.
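To illustrate what an average aggregate in a Group By step boils down to, here is a minimal Java sketch (not Kettle code; the group field and values are made-up examples): it keeps a running sum and count per group key and divides at the end.

    import java.util.HashMap;
    import java.util.Map;

    public class GroupByAverage {

        public static void main(String[] args) {
            // Hypothetical input rows: a group field (file name) and a numeric field.
            String[] groups = {"a.txt", "a.txt", "b.txt"};
            double[] values = {10.0, 20.0, 5.0};

            // An average aggregate is just a running sum and count per group key.
            Map<String, double[]> acc = new HashMap<>();
            for (int i = 0; i < groups.length; i++) {
                double[] sumCount = acc.computeIfAbsent(groups[i], k -> new double[2]);
                sumCount[0] += values[i]; // running sum
                sumCount[1] += 1;         // running count
            }
            acc.forEach((group, sc) ->
                System.out.println(group + ": average = " + sc[0] / sc[1]));
        }
    }

If you leave the group fields empty in the actual step, you get a single grand-total row, which matches what Pascal asks for below (the average over all the data).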
pascal (2014-03-27 04:31):

Hi Roland,

Thank you very much for your help. After some tests I can now read the data from all txt files in a directory: I just add the folder path and .*\.txt as the wildcard, the system accepts it and shows all txt files, and on the Fields tab, clicking Preview rows shows the data from all the files. That's good for me at this step. Now I want to calculate the average of all the data; which step should I use? Thank you

Pascal
rpbouman (2014-03-26 04:37):

Pascal,

the Text file input step allows you to specify a list of entries. Each entry is a directory plus a file name pattern (a regular expression), which makes it convenient to specify multiple files.

However, you'll need one Text file input step for each type of file. That's because the step needs to know which fields you want to parse out of the files, and you have to specify that for each file with a particular field layout. I think that makes sense though: you'll typically want to output each type of file to a distinct database table or xls worksheet.
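For a feel of what that wildcard does, here is a small Java sketch of regular-expression file matching (the directory is a made-up example; Kettle does the equivalent internally when it expands the entries):

    import java.io.File;
    import java.util.regex.Pattern;

    public class WildcardDemo {

        public static void main(String[] args) {
            // The wildcard is a regular expression matched against the file name,
            // so .*\.txt matches any name ending in ".txt".
            Pattern wildcard = Pattern.compile(".*\\.txt");

            // Hypothetical input directory, combined with the pattern above.
            File dir = new File("/data/incoming");
            File[] matches = dir.listFiles((d, name) -> wildcard.matcher(name).matches());

            if (matches != null) {
                for (File f : matches) {
                    System.out.println("would read: " + f.getAbsolutePath());
                }
            }
        }
    }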
pascal (2014-03-26 04:19):

Hi Roland,

I'm a beginner with Pentaho PDI. I downloaded Pentaho Spoon 5.0.1 and it's working fine, and I have tried the different sample files shipped with the software. So my question is: is it possible to extract data from multiple .txt files and store it in a database or in another file, such as xls? I tried the Text file input step, but I have to mention the file name, and I don't want to do that because I have too many files in the folder.

Thank you,
Pascal

Anonymous (2014-03-18 15:09):

Thank you so much Roland for all your help. I really appreciate it! In fact, I found that example on the internet; it is not mine.

Kind regards.

rpbouman (2014-03-18 13:44, part 2 of 2; part 1 follows as the next comment):

2. In a star schema, there are only relationships between the fact table and its dimension tables. Both date and product are dimension tables in your example. You don't need any date inside your product table; you only need the fact table to have a column that points to the product dimension, plus any number of relevant columns that point to the date dimension table (for order date, ship date, whatever dates you would like to use to navigate and filter your fact table).

3. The typical loading process for a star schema is: 1) make sure all dimension tables are up to date; 2) load the fact table. While loading the fact table, you typically use the column values from the source system to look up the corresponding keys in the dimension tables. As I pointed out, if you use a logical key for the date dimension, you can conveniently translate dates from the source system directly into date keys.

For product, you will need a lookup. There are several ways to do that, but in all cases it relies on comparing values from the source system with columns in the dimension table to retrieve the dimension key, and that dimension key then ends up in the fact table. Please take a look at the various lookup steps in Kettle, and check the documentation to find out more: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps In particular, look at the Dimension lookup/update step and the Database lookup step.

Now for the sales amount: this is the only fact I've seen in your schema so far (another typical metric could be sales quantity). It looks like you have QuantityOrdered in your line item source table, and the price in your product source table. There are two options: either you load one row for each individual item that was sold on a particular order, or you load one row for each row in the OrderDetail table. In both cases, you will need to pull in the price from the product table to calculate the sales amount.

A few final notes. I noticed you have a store dimension; however, the source system does not seem to provide that kind of data. I also noticed that your star schema does not have a customer dimension. That is odd, since that is typically one of the most important ones.
When designing a star schema there is a very simple checklist to see whether you have all your dimensions: "what, who, where, when, why". In your example, "what" is represented by the product dimension, "where" by the store dimension, and "when" by the date dimension. The "why" is not always needed, but a "marketing campaign" dimension would be a good example of one. What should typically not be missing is one or more "who" dimensions, in particular the customer. Another type of "who" is the salesperson or the order picker.
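As a rough illustration of this lookup-then-load idea, here is a minimal Java sketch (all table, key and price values are made up; in a real transformation the Database lookup or Dimension lookup/update step does this work):

    import java.util.HashMap;
    import java.util.Map;

    public class FactLoadSketch {

        public static void main(String[] args) {
            // Hypothetical product dimension: natural key -> surrogate key.
            // This mapping is what a lookup step resolves for every fact row.
            Map<String, Integer> productKey = new HashMap<>();
            productKey.put("PROD-001", 1);
            productKey.put("PROD-002", 2);

            // Hypothetical prices from the source Product table.
            Map<String, Double> productPrice = new HashMap<>();
            productPrice.put("PROD-001", 9.95);
            productPrice.put("PROD-002", 24.50);

            // One source row from OrderDetail: product, quantity, order date key.
            String productId = "PROD-002";
            int quantityOrdered = 3;
            int orderDateKey = 20140318; // logical date key, yyyyMMdd

            // Resolve the surrogate key and derive the sales amount metric.
            double salesAmount = quantityOrdered * productPrice.get(productId);

            System.out.println("fact row: product_key=" + productKey.get(productId)
                + ", date_key=" + orderDateKey
                + ", quantity=" + quantityOrdered
                + ", sales_amount=" + salesAmount);
        }
    }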
rpbouman (2014-03-18 13:44, part 1 of 2):

Hi Sara,

thanks for your interest. Let me try and answer your questions:

1. Almost every star schema has at least one dimension table whose lowest granularity is the calendar day. Typically it has a hierarchy along year, quarter, month and day, but alternate hierarchies are possible to express fiscal years, semesters or weeks. Some people call this the "Time dimension"; I prefer to call it the "Date dimension".

The date dimension table is typically used multiple times in an OLAP cube: the same physical dimension table can back several logical date dimensions like order date, ship date, payment date and so on (this is called "role-playing").

Like all dimension tables, the date dimension table should have an integer column as its key, and the fact table should have one corresponding column for each relevant type of date. The key column can be filled with a truly artificial generated key, but often it is more convenient to use a format like 20140318 to represent March 18, 2014. This is convenient when you load the fact table, since you can then directly convert the order date, shipment date, etc. into the corresponding date dimension key. If the key of the date table is a truly artificial one, you need an explicit lookup to find the date dimension key for each actual date from the source system.

Typically, people fill their date dimension table initially, independently of the daily DWH loading process. The reason is that you can: you know in advance what the calendar looks like, so you can generate all the data in advance. Of course people don't fill it with every possible date; they simply make sure the starting day is early enough to capture all business transactions, and they typically generate data for 10 or 20 years ahead of time.

In the Kettle samples directory you will find a number of transformations (.ktr files) to generate a date dimension. You typically create the table yourself, and then have Kettle generate and load the data.

Here's an example I wrote years back:
http://rpbouman.blogspot.nl/2007/04/kettle-tip-using-java-locales-for-date.html

Here's another example:
http://type-exit.org/adventures-with-open-source-bi/2010/09/building-a-detailed-date-dimension-with-pentaho-kettle/
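To make the 20140318-style key concrete, here is a small Java sketch that generates rows for such a date dimension, with a deliberately minimal set of attributes (a real date dimension would carry many more columns):

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class DateDimensionSketch {

        public static void main(String[] args) {
            // yyyyMMdd turns a calendar day into the integer key,
            // e.g. 2014-03-18 -> 20140318.
            DateTimeFormatter keyFormat = DateTimeFormatter.ofPattern("yyyyMMdd");

            LocalDate day = LocalDate.of(2010, 1, 1); // earliest business date
            LocalDate end = day.plusYears(20);        // generate 20 years ahead

            while (day.isBefore(end)) {
                int dateKey = Integer.parseInt(day.format(keyFormat));
                System.out.println(dateKey
                    + "\t" + day                                  // full date
                    + "\t" + day.getYear()                        // year level
                    + "\tQ" + ((day.getMonthValue() - 1) / 3 + 1) // quarter level
                    + "\t" + day.getMonthValue()                  // month level
                    + "\t" + day.getDayOfWeek());                 // day attribute
                day = day.plusDays(1);
            }
        }
    }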
Anonymous (2014-03-18 12:53):

Hi Roland,

I'm a beginner in Pentaho Data Integration (Kettle), and I am doing my final year project. I want to do an ETL process from a conventional relational database model to a dimensional database model (loading dimension and fact tables). I am just wondering how to go about doing this! I have read around and looked at examples, but I am having difficulties wrapping my head around how this works.

For example, we have this relational database (MySQL):

Customer(CustomerID, CustomerName, CustomerAddress, CustomerCity, CustomerState, CustomerZip, CustomerContactLName, CustomerContactFName, CustomerContactPhone)

OrderHeader(OrderID, CustomerID, OrderDate, BillToAddress, BillToCity, BillToState, BillToZip, ShipToAddress, ShipToCity, ShipToState, ShipToZip)

Category(CategoryID, CategoryName)

Product(ProductID, CategoryID, ProductName, ProductDescription, ProductPrice)

OrderDetail(OrderID, ProductID, LineItem, QuantityOrdered)

On the other side, for the star model, we have:

ProductDimension(ProductID, ProductCode, ProductName, Category, SubCategory, Brand, Height, Width)

StoreDimension(StoreID, StoreName, ParentChain, Region, Territory, Zone, Address, City, State, Zip)

TimeDimension(TimeID, DayOfWeek, DayOfMonth, DayOfYear, Month, Quarter, Year, Holiday, Weekend)

SalesFact(ProductID, TimeID, StoreID, SalesDollars)

What isn't clear to me is the TimeDimension:

1. I have read that there is an ETL process to create/fill a TimeDimension; how come? Should I create a transformation in ETL, load the result into a table in the database, and after that load that table from the database into a transformation (Kettle), or what?

2. After loading the dimension tables and the fact table: how do I attach the date to a row in the database? In the source database there is no table called "TimeTable". In other words, when we execute a query we are going to interrogate the database, but there is no relationship between the Product table and the TimeTable (which doesn't even exist), for example.

3. In the example above, concerning the fact table: how can I add a row to it, in order to calculate the SalesDollars?

Could you please clarify this for me?

Thanks and regards.

Sara (2014-03-13 17:00):

Thank you very much, Roland, for your response and clarification.

God bless you!

Sara.
rpbouman (2014-03-13 14:02):

Hi Sara,

thanks for your kind words.

A cube is a logical concept, not a physical one. In the OLAP world, a cube is basically a set of metrics (some people say measures, or facts) that can be navigated, filtered and aggregated along a set of dimensions.

Cubes are the backend as far as OLAP client applications are concerned. Typically, the client applications use MDX as the query language to let the user select portions of the data from the cube. The client typically displays that to the user in graphs and pivot tables, or in combinations thereof in a dashboard.

Pentaho uses Mondrian as its OLAP engine. Mondrian is based on the ROLAP principle: Relational OLAP. ROLAP simply means that the cube is implemented using direct SQL queries on an RDBMS. In ROLAP, the physical definition of the cube is just a description of how the tables in the RDBMS map to dimensions and metrics. For Mondrian, this is done in a Mondrian schema file. Such files are in XML format and can define multiple cubes.

Typically, the cube is not the only level of data organization. One usually wants to organize the data warehouse (the database where you store your OLAP data and keep track of its history) in a way that allows you to easily create cubes. This is usually done by creating a set of star schemas.

A star schema is a single fact table containing a set of columns that hold the metrics; usually these are counts, amounts, etc. Apart from the metrics, the fact table also contains foreign keys (which may or may not be enforced at the RDBMS level) that point to its corresponding dimension tables. Typically, all data for a dimension is stored in a single dimension table. (Sometimes a logical dimension is built off of several related dimension tables; this is referred to as snowflaking.)

Creating and loading the fact and dimension tables is where Kettle comes in. In many cases people use Kettle to extract data from one or more source systems, to enrich and validate it, and to organize and load it into the fact and dimension tables.

If you download the Pentaho BA server, Mondrian is shipped as part of the Pentaho web application. So there is no separate mondrian.war; you only have to deal with that if you set up Mondrian manually, independently of Pentaho.

If you want more information, there are a couple of books that are still quite good for learning about this subject:

Pentaho Solutions: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html

Pentaho Kettle Solutions: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html

Here's a presentation you might like:
https://www.youtube.com/watch?v=cSXWTNYn3es

While these are specific to Pentaho and Kettle, I would very strongly advise you to get a copy of Ralph Kimball's "The Data Warehouse Toolkit".

I hope this helps. Kind regards,

Roland.
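To give a feel for what "direct SQL queries on the RDBMS" means in practice, here is a JDBC sketch of the kind of aggregate query a ROLAP engine might generate against a star schema. The connection details and all table and column names are hypothetical; a real Mondrian-generated query is derived from the schema file.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RolapQuerySketch {

        public static void main(String[] args) throws Exception {
            // Hypothetical MySQL data warehouse connection.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/dwh", "etl", "secret");
                 Statement stmt = con.createStatement()) {

                // A request like "sales amount by year" boils down to joining
                // the fact table to the date dimension and grouping; that
                // mapping from cube terms to SQL is what ROLAP does.
                String sql =
                    "SELECT d.year, SUM(f.sales_amount) AS sales " +
                    "FROM sales_fact f " +
                    "JOIN date_dim d ON f.order_date_key = d.date_key " +
                    "GROUP BY d.year " +
                    "ORDER BY d.year";

                try (ResultSet rs = stmt.executeQuery(sql)) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("year") + ": "
                            + rs.getDouble("sales"));
                    }
                }
            }
        }
    }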
Sara (2014-03-13 13:26):

Hi Roland,

This blog is very nice. I am new to the Pentaho DI world, and I need to know how to fill an OLAP cube and publish it on the Pentaho BI Server, but I don't know how to do it using Kettle (transformations/jobs).

I have searched on the net, and I found some solutions that use Mondrian to create the Mondrian cube schema file, which contains the definition of the OLAP cubes, and the Mondrian Schema Workbench to create dimensions, measures and hierarchies, and then publish it all on the Pentaho BI Server. But now the 'mondrian.war' does not exist anymore!

I have also found some articles about loading the dimension tables and then joining all of them into the fact table (using the Merge Join step). Is that the right way to create the OLAP cube?

1. Should I load the table from the database (MySQL) with an input step, then use the Dimension lookup/update step in order to "convert" it into a dimension, and then assemble all of them in the fact table using the Merge Join step?

2. How do I generate/load the OLAP cube in the end: will it be an XML file too, or what? In other words, what will be the output?

3. Finally, how can I integrate/publish it on the Pentaho BI Server?

I hope my questions are clear! Thank you for your help.

Regards,
Sara.
rpbouman (2014-03-13 07:55):

Hi!

I think you should check out the Community Dashboard Framework and the Community Dashboard Editor. These are Pentaho BI server plugins (get the downloads at http://community.pentaho.com/) that allow you to create and run dashboards. You can install them from the Pentaho Marketplace (a sort of plugin manager for the BI server).

One of the CDF components is based on OpenStreetMap; it allows you to plot data on maps. If that doesn't suit your needs, you can always write your own CDF component, or you might be able to repurpose an existing component such as a scatter plot and set an image as the backdrop.

Please go to forums.pentaho.org or the ##pentaho Freenode IRC channel for detailed community support.
Embedded system code (2014-03-13 03:07):

Hello Roland,

your blog is very nice and I have learned a lot. I'm new to Pentaho and hope it's not too late for me. I know how to retrieve data from a file and store it in a DB, but what is the logic for printing data on an image, such as a map? I don't want to use the Google Maps API. Is there a plugin for this type of work?

Thank you,
JAVA
Embedded system code (2014-03-13 03:01):

Hi Roland,

this blog is very nice. I'm new to Pentaho Spoon, but I have already tried some examples, which are working fine. I know how to retrieve data from a file and store it in a DB, but what is the logic for taking data from a DB or a file and printing it on an image, for example a map? I don't want to use Google Maps with Pentaho: even though it's a good API, for security reasons I prefer to use an image. Thank you for your help. Regards
rpbouman (2013-07-19 12:32):

@surya,

in a job, you can create multiple hops going out of a job entry. For each hop you can use the right-click menu to mark whether the hop is to be followed unconditionally, upon success, or upon failure. You can then lead the failure hop back to an earlier point in the stream and thus create a loop.

However, what you don't want is this loop spinning as fast as possible and consuming CPU. You should build in some sleep or wait to allow other processes to run; I'm not sure how to do that.
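Outside Kettle, the retry-until-success loop with a wait in between looks like this minimal Java sketch (fetchFile is a hypothetical stand-in for the FTP "get file" job entry):

    public class RetryUntilSuccess {

        // Hypothetical stand-in for the "get file from FTP" entry;
        // returns false while the file is not available yet.
        static boolean fetchFile() {
            return Math.random() < 0.2; // simulate occasional success
        }

        public static void main(String[] args) throws InterruptedException {
            // Loop back to the fetch on failure, but sleep between attempts
            // so the loop does not burn CPU, and cap the number of retries.
            int maxAttempts = 10;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (fetchFile()) {
                    System.out.println("succeeded on attempt " + attempt);
                    return;
                }
                System.out.println("attempt " + attempt + " failed, waiting...");
                Thread.sleep(5_000); // the "wait" between retries
            }
            System.out.println("giving up after " + maxAttempts + " attempts");
        }
    }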
surya (2013-07-19 07:51):

Hi Roland,

is there an option in PDI to re-execute a job until it succeeds?

Scenario: I am receiving a file over FTP. My job fails when there is no file to get from FTP. On the failure condition I want to use a Wait step and then run the same job (the get-file job) again. My problem arises when the job fails again after the failure condition. So, is there any way to re-execute the job until it succeeds?
Anki Reddy (2013-04-12 14:44):

Hi Roland,

Thank you very much for your help. I will try that.

Thanks and regards,
Anki
rpbouman (2013-04-12 12:58):

Hey Anki,

I helped out this far. Please go to a public forum like forums.pentaho.org, or an IRC channel like ##pentaho on Freenode, to get more help.
Anki Reddy (2013-04-12 11:01):

Hi Roland,

Thank you very much for such a simple solution. I have tried the method you described and was able to create an empty file.

In another situation I am getting data along with a header in a stream (not directly from a file input), with some calculations. My stream is as follows:

filename    maildate  outfilename
C:\ex.txt   22-3-13   D:\out.txt
C:\ex2.txt  23-3-13   D:\out.txt

I need to create an empty file with the same header, but the name of the file must be "outfilename", i.e. D:\out.txt in this case.

Can you please give me some input on how to handle this situation?

Thanks and regards,
Anki
rpbouman (2013-04-11 23:17):

Anki,

I'm not sure what the best way is, but this should work:

1) Create a Text file input step and set the limit to 1.
2) Lead the outgoing stream of the Text file input step into a filter step. Configure a dummy "always true" condition. Lead the "true" output to a Dummy step, and lead the "false" output to a Text file output step.

If you already have a transformation that uses the text file input for processing, you could technically use that; however, the filter step will cost some processing power. So if you only need to create the output file in advance, I would isolate the steps I described in a separate job that you can run whenever you need that empty file.
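Outside Kettle, the same effect (an empty file that still has the header) is just reading one line and writing it out again. A plain-Java sketch, with example paths; in the follow-up question above, the output name would come from the "outfilename" field of the stream:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class EmptyFileWithHeader {

        public static void main(String[] args) throws IOException {
            // Example paths only.
            Path input = Path.of("C:/ex.txt");
            Path output = Path.of("D:/out.txt");

            // Read only the header line of the input file...
            String header;
            try (BufferedReader reader = Files.newBufferedReader(input)) {
                header = reader.readLine();
            }

            // ...and write just that line, producing an "empty" file
            // that still carries the original header.
            Files.writeString(output, header + System.lineSeparator());
        }
    }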
Anki Reddy (2013-04-11 21:57):

Hi Roland,

It would be a mistake on my part not to appreciate your efforts to spread knowledge on Pentaho.

Recently, while working on a problem, I got stuck on an issue. I am reading a file which is not empty (it also has a header), but I need to create an empty file with the same header as the input file.

Can you please suggest how I can do this?

Thanks and regards,
Anki