Roland Bouman's blog: 2007

Monday, December 17, 2007

Calculating the Financial Median in MySQL

I believe I found a new method to calculate the median in MySQL. I would not be surprised if this method has been figured out by somebody else already. However, I can't seem to find any resources on the internet describing this method, so for now I flatter myself by assuming the method is original.

(Please do post your comments to this blog to correct me on that should I be wrong so I have a chance to rectify.)

The method I'm describing is a one-pass, pure SQL method. It does not require subqueries, cursors or user variables. However, it does rely on the MySQL specific functions GROUP_CONCAT() and SUBSTRING_INDEX()

I'll be maintaining a snippet for this method at MySQL Forge.
If you want to know what the median is, and how my snippet works, read on.

Some background

Like the mean and the mode, the median is an important metric to characterize the distribution of values in a collection. If we have a ordered collection of (numerical) values, the median is the value for which the number of entries that has a value that is higher than the median is exactly equal to the number of entries that has a value that is lower than the median. If there is an odd number of entries in the collection, the value of the median corresponds to the value of the entry that lies exactly in the middle of the list. If there is an even number of entries, the median is calculated as the mean of the two middle values.

MySQL offers a number of aggregate functions. Unfortunately, MySQL does not offer a function to calculate the median.

Even though MySQL does not support a MEDIAN() function natively, it is still possible to calculate it. You can:

install one of the numerous UDF's floating around on the internet

Use pure SQL like suggested in Chapter 23 of Joe Celko's SQL for Smarties

The snippet

Here's a snippet that shows how to calculate the median replacement cost for a film in the sakila sample database:


select
        (
            substring_index(                          -- left median: max value in lower half:
                substring_index(
                    group_concat(                     --   list all values in ascending order
                        f.replacement_cost
                        order by f.replacement_cost
                    )
                ,   ','
                ,   ceiling(count(*)/2)               --   left half of the list 
                )
            ,   ','
            ,   -1                                    --   keep only the last value in list
            )
        +   substring_index(                          -- right median: min value in upper half:
                substring_index(
                    group_concat(                     --   list all values in ascending order
                        f.replacement_cost
                        order by f.replacement_cost
                    )
                ,   ','
                ,   -ceiling(count(*)/2)              --   right half of the list 
                )
            ,   ','
            ,   1                                     --   keep only the first value in list
            )
        ) / 2                                         -- average of left and right medians
        as median
from    sakila.film f;

(For the latest version, refer to MySQL Forge)

So, how does this snippet work? In the remainder of this post, I'll explain the inner workings of this method in a top-down fashion.

The mean of the left and right median

The method I'm describing always takes the mean of the 'left' and 'right' median.


select
        (
            left-median(f.replacement_cost)
        +   right-median(f.replacement_cost)
        ) / 2                                         -- average of left and right medians
        as median
from    sakila.film f;

(Note that the usage of left-median() and right-median() is just an explanation of the structure - in reality there are not two distinct functions by that name)

The terms 'left median' and 'right median' are not common so they need an explanation.

Let's visualize the process to determine the median. We can do this by imagining that we have an ordered list of values and that we point our left index finger to the lowest value in the list and our right index finger to the highest value in the list.

Now, we look at our fingers. If there is more than one entry between our right and left finger, we move or left finger one entry to the right and our right finger one entry to the left, and we keep doing that until there are no more entries between our left and right finger. Once we're there, the value of the entry that is pointed to by our left finger is the 'left median' and the value of the entry pointed to by our right finger is the 'right median'.

If we had an even number of entries, then the left and right median each correspond to distinct entries - if there was an odd number of entries then the left and right median correspond to one and the same entry.

At any rate, once we found the left and right median, it is clear that their mean is the true median. If we have an even number of entries, we have to calculate the mean of the two middle values anyway, and if there is an odd number of entries, taking the mean of two identical values results of course in that same value which does thus result in a correct value for the median.

`GROUP_CONCAT`: an ordered list of values

In the example, we use GROUP_CONCAT to generate a list of values in ascending order:


GROUP_CONCAT(                     --   list all values in ascending order
    f.replacement_cost
    ORDER BY f.replacement_cost
)

This gets us a string consisting of concatenated replacement_cost values in ascending order, separated by the default separator, which is a comma (',').

We use the same list in both the calculation of the left and the right median.

Note that the length of the concatenation result returned by GROUP_CONCAT() is limited. By default, it is as small as 1024 bytes. Personally I think this is way too small so I have it configured to be 64K by default. You can set the length at runtime too like this:


    SET group_concat_max_len := 65535

You can specify larger values than 65535 too, and I suspect that the maximum packet size is the practical maximum:


    SET group_concat_max_len := @@max_allowed_packet

To inspect the current value, you can do this:


    SELECT @@group_concat_max_len

Getting the 'left' half of the list

Once we have the list of values, we can split it in two halves with little effort. We do this using the SUBSTRING_INDEX() function.

The SUBSTRING_INDEX() function processes a string argument and gets a substring based on the position of a particular occurrence of another substring.

In this case, the comma ',' is the substring that separates the values in our ordered list. But of all the commas in the list, which occurrence of the comma do we need to find?

Suppose our list contains 4 values. Then, these values are separated by three comma's:


values: '1,2,3,4'
commas:   ^ ^ ^
          1 2 3

If we want to divide this list in two equal halves, then the second comma is the divisor between the left and right halves of our list. With SUBSTRING_INDEX() this expression would get us the left half of the this list:


    SUBSTRING_INDEX('1,2,3,4', ',', 2) -- the substring up to the 2nd occurrence of ','

and the result will be:


    '1,2'

So now we have the first half of the list, and by definition, the last entry in that list, '2' is the left median of the original list '1,2,3,4'.

Now what if we would've had an odd number of entries in our list? Supose our list would've been like this:


values: '1,2,3'
commas:   ^ ^
          1 2

In this case too, we need the second comma to end up with a left substring that has the left median as last entry in the list (which also happens to be the proper median because this is a list with an odd number of entries).

As it turns out, we can conveniently generalize the SUBSTRING_INDEX expression like this:


    SUBSTRING_INDEX(list, separator, CEILING(#entries/2))

In other words, if we divide the number of entries in our list by two, and then round to the nearest higher integer, this gives us the particular occurrence of the separator what we are looking for to halve our list as required. Of course, calculating the number of entries is simply a matter of using the COUNT() aggregate function.

Excising the 'left' median

To actually obtain the left median itself, we just need to excise the last value from the left half of the list of values. We do this by applying SUBSTRING_INDEX() again.

Again, we need to make a substring in terms of the occurrence of the comma that separates the values in our list. This time, we need to get the substring found directly after the last comma in the list. With SUBSTRING_INDEX() we can conveniently express this in the following manner:


    SUBSTRING_INDEX(list, separator, -1)

This means: search the list from right to left and find the first occurrence of the separator. Return the substring that appears after the separator (that is, the substring appearing on the right hand of the separator).

Getting the 'right' median

The process to obtain the right median is a mirror of obtaining the left median: instead of obtaining the last value in the left half of the ordered list of values, we now need to obtain the first value of the right half of the list. This is actually as simple as reversing the sign of the occurrence argument in the SUBSTRING_INDEX() calls:


    left half:   SUBSTRING_INDEX(list, separator,  CEILING(COUNT(*)/2))
    right half:  SUBSTRING_INDEX(list, separator, -CEILING(COUNT(*)/2))

    last entry:  SUBSTRING_INDEX(list, separator, -1)
    first entry: SUBSTRING_INDEX(list, separator,  1)

A few remarks

I think that in many cases, this can be a fair method to calculate the median. The advantage of this method is that it is relatively fast because the query itself is relatively simple.

It would be interesting to see how this method behaves when handling millions of rows. Maybe I will run some benchmarks on that later on.

In the mean while, feel free to post your thoughts, suggstions or critique on this blog.

Thursday, December 13, 2007

Arnold Daniels' lib_mysqludf_xql featured on MySQL devzone

Wohoo! Jon Stephens just published an excellent article on the possibilities for working with XML and MySQL on the MySQL devzone!

I'm very proud that he chose to highlight the lib_mysqludf_xql library of MySQL User Defined functions authored by Arnold Daniels. Some time ago, Arnold joined the MySQL UDF Repository and so far he's been the most active member of the group.

Arnold also took the initiative to set up a new website for the MySQL UDF Repository, and we will soon move all content and libraries from the old location to the new site. (The main reason it hasn't happened yet is all due to me - i need to make time to move the old content to the new site).

If you want to discuss lib_mysqludf_xql or if you want to joint the UDF repository, sponsor our work or contribute in any other way, just go to our google group and post your questions.

Friday, November 09, 2007

Hacking Open Office Writer Lists

Here's a tip I'd like to share. It's about Writer, the OpenOffice.org Word Processor, but all concepts and described functionality are available als in Microsoft Word.

This tip will show you how to create styles to create notes in your documents. Among the more interesting things, the style does not only take care of the formatting, it also takes care of the static text that is repeated for every note. All this achieved without using macro automation.

A Note

Everybody that has written a manual or a business report has at some point dealt with things like this:

Yup, a note. It is essentially a paragraph with a gray background, a black border, and some extra space on all sides. And, oh, yeah...there is a little bold, capitalized caption that reads NOTE:. That's all it is.

Now, I bet that most people won't have much of an issue getting the word processor to format a paragraph to make it look like this. If you do have some trouble with that, hang on - a recipe is coming up.

Of course, it gets a bit annoying when you have to repeat this markup. I have no way to back this up, but it is my hunch that most people solve this simply by copying the paragraph, pasting it, and then editing the text.

(Well? Is it true? Don't be ashamed, post a comment and confess ;-)

Some people might look ahead a bit farther and consolidate the style for reuse. Let's take a look how that works:

Styles and Formatting

First open the styles and formatting toolbox. You can access it via Menu: "Format" > "Styles And Formatting" or by hitting the F11 key on your keyboard.

The Styles and formatting toolbox appears. I usually dock it to the right side of the Writer application window. This makes it very easy to apply the styles to the text at the current cursor position.

Since we are styling a paragraph to look like a note, we need to click the first button on the little toolbar in the top of the Styles and Formatting toolbox to list the available paragraph styles.

In the list of paragraph styles, we will see that the current paragraph style is highlighted. Assuming your cursor was in a plain paragraph, this will be the "Default" style.

Now, we want to create a new paragraph style especially for our Note. To create a new style, right click in the list, and choose "New" in the context menu. This pops up the the paragraph style dialog:

First, we need to give it a proper, sensible Name. Here, we dubbed it "MyNote". By default, Writer assumes it must use the new style also as style for the Next paragraph. Here I deviate from the default, and choose the "Default" style again as Next Style.

One of the most powerful features of styles is that they may be Linked With other styles. The new style is based on the style to which it is linked, effectively inheriting its properties. In this example, it means that changing, say, the font-size of the "Default" style will also change the font-size of the "MyNote" style accordingly.

Of course, "MyNote" may locally change its own font-size, and then that property will be preserved even if the "Default" style changes the font-size. In other words, the new style can override the properties of the style it is linked with.

Finally, a quick note about the style category listbox. By default, a new style is placed in the Custom Styles category. The Hierarchical category is also useful, as it allows you to see which styles are linked to each other:

styles-and-formatting-window-hierarchical

So, this is something to keep in mind when you plan to change a style, as this view allows you to quickly see which other styles will be affected by the change.

Ok - now that we dealt with these logical properties, we set the indents and spacing:

Then, we make sure that all text in one "MyNote" paragraph is not broken up at page breaks:

This is taken care of by the "Do not split paragraph" checkbox.

And we set the background color:

And finally the border style:

The "Merge with next paragraph style" ensures that the border does not separate two consecutive paragraphs that are of the same style. In other words, it allows us to create notes that span more than one paragraph.

When we are all done defining the style, we simply confirm the dialog and close it.

Applying the style

Now that we created the style, we can apply it. First, position the cursor at the desired paragraph:

Then, double click the style in the "Styles and Formatting" toolbox:

Alternatively, you can toggle the Fill Format Mode button. Your mouse pointer then changes into a little bucket. In this mode, single clicking on a paragraph applies the selected style.

I explained that I like to dock the "Styles and Formatting" toolbox to the right side of the Writer Application window so I can easily apply styles where needed.

Are we done yet?

So, are we happy now? Well, actually...no.

The note we created manually has a little bold capital NOTE: caption. Of course, the first time around we simply typed that text in the note paragraph and marked it up in an ad-hoc manner.

Can we do better this time? Well, kind of...

First, lets see if we can consolidate the markup we use for the NOTE: caption. This time, we need to create a Character Style, because this time, we want to markup a stretch of text that appears within a paragraph.

So, we create a character style linked to the "Default" character style. The procedure is completely comparable to creating a paragraph style. We first give it a sensible Name:

Like with paragraph styles, the character style is based on the style it is linked with. However, here, we cannot specify the next style. Instead, applying a character style automatically stays in effect until you switch to another character style.

We then make the font bold:

And finally, we choose a Capitals effect:

We can then confirm and close the dialog.

Now that we created the character style, we can apply it, and type text using this character style. We can either select a piece of text an change the selection to use the character style, or we can toggle the character style and type text in that style.

So, does this do the trick?

Well, it's better now than it was before. At least, now can ensure that all our little NOTE: captions look the same, and that their formatting is updated when we change the style.

However, the problem is that we still have to type this text all the time. It is not only tedious, it is also error prone. Ideally, we would like to somehow have the word processor treat this text as part of the paragraph style instead of it being just text.

As it turns out, there is a clean, reliable way to achieve this. Enter List styles.

List styles are normally used to create bulleted or numbered lists. Each list item is essentially a paragraph that is preceded by a bullet or a number which is generated automatically by the word processor.

This is the key to solving the problem. If we can somehow create a list style that inserts the text NOTE instead of a bullet or a number, we can use it to implement our note caption. So, let's try and create one:

Again, the first step of the process is very similar to creating a paragraph or a character style. We click the list style button, and right click in the "Styles and Formatting" toolbox:

The list style dialog opens, and we provide a name for the list style:

Note however that list styles cannot be linked to other list styles.

We then specify that this list style is a numbering style. This may not really make sense at this point, but we really do need this step. That is because numbering styles allow us to prefix and postfix the list number with a piece of text - this is what we will use to add our Note Caption.

Now that we specified this style as a numbering style rather than a list style, a number of options become available:

Here, we specify that although it is a numbering style, we actually don't want any numbers to be inserted. This is specified by choosing None for Numbering. We also specify that we want the text Note: to appear before the list number. (Note however that we did specify that the list number is to be displayed using "None" - in other words, no number will actually be displayed ;-). Finally, we also specify to use our custom character style for the list number text - it is the character style we just created for our NOTE: captions.

Note that this step is the actual hack that specifies that the static NOTE caption text should be generated by the word processor whenever this list style is applied.

The final step in defining the list style is to undo any automatic indent that is applied to the list 'number'.

Checking the Relative checkbox makes the indent be relative to the current position of this list item in the list hierarchy. But we are not interested in this type of indent at all - we have specified the indent for at the level of the "MyNote" paragraph style. So we want to get rid of any additional indenting. So, we check the checkbox and specify that we want zero indenting relative to the current position in the list hierarchy.

(It might not be clear at this moment, but this step really is essential in order to allow a note to span multiple paragraphs.)

Applying the list style

Applying a list style is different from applying character or paragraph styles. In order to apply a list style, it needs to be attached to a paragraph style. This ensures that the paragraph starts behaving as a list item.

Please observe that attaching a list style to a paragraph style is different from linking a paragraph style to a paragraph style. A link is configured in the Linked With property in the Organizer tab of the paragraph style dialog. Linking causes the paragraph style that links to the other style to inherit its properties. Attaching a list style is done in the Numbering tab of the paragraph style dialog by setting the Numbering style property. Specifying a numbering style makes the paragraph style behave like a list item but does not change the properties of the paragraph style itself.

We might be tempted at this point to simply attach the list style to our "MyNote" paragraph style. This will work: all "MyNote" paragraphs will be starting automatically with the Note: caption.

However, directly attaching the "MyNote" paragraph style to the list style is not a good idea beacause all "MyNote" paragraphs will then start with the Note: caption: even two consecutive paragraphs in the "MyNote" style that together make up one note would each have the NOTE: caption. So, specifying the numbering style in the "MyNote" paragraph style would prevent us from creating Notes with multiple paragraphs.

The solution to that problem is create a new paragraph style called "MyNoteFirstLine", linked to the original "MyNote" style:

To this new "MyNoteFirstLine" style we then attach our list style:

By linking it to the original "MyNote" style, we ensure that there is still a single place that defines all of the "normal" Note formatting. So, if we want to change, say, the background of our notes, we will only need to modify the "MyNote" style, and the changes will cascade through to the "MyNoteFirstLine" style. By Setting the Next Style also to the MyNote style, we ensure that only the first paragraph of our note gets the Note: caption, and that we can continue the note without the static Note caption being generated.

So, here's the final result. We apply the "MyNoteFirstLine" paragraph style to initiate a Note:

Then we type some text. Once we hit Enter, a new "MyNote" paragraph is created:

and the result is a note spanning two paragraphs, with only the first paragraph having the static NOTE: caption.

Obviously, there is now too much white space between the paragraphs within the note. That's because both the "MyNote" as well as the "MyNoteFirstLine" styles have the same settings for "Indents and spacing". However, it is trivial to tweak the settings to achieve a more normal spacing.

To do so, you simply have to modify the "MyNote" style and decrease the spacing that appears above the paragraph. That change will of course cascade through to the "MyNoteFirstLine" paragraph style. But we do want to make sure that the spacing occurs above all "MyNoteFirstLine" paragraphs in order to separate the Note from the normal paragraphs.

So, instead, we need to specify the spacing to appear above the "MyNoteFirstLine" paragraph style. You likely also want to decrease the spacing below the "MyNoteFirstLine" paragraph style in order to make it appear more closely to the remaining paragraphs in the note.

We will leave this as an exercise to te reader.

Summary

This might seem like a long post, but it an be quickly summarized: To create a note like this you:

Create a generic "Note" style
Base a "First" paragraph style off the "Note" style by linking to it, and make sure the Next style is the Note style.
Create a character style for the Note caption.
Create a list style for the Note caption and configure it to use the character style. Then, attach the list style to "First" paragraph style

Random RDBMS and SQL Myths debunked

A few times now, I've been wanting to write this down. I know: a lot of people will go *shrug*. Others may find me pedantic. Some of will say I'm being a smart-ass. Whatever...but I just got to write down a few of these common misconceptions that keep floating around.

None of these misconceptions are really harmful - in most cases, they do not lead to misunderstanding or miscommunication. However, when you are writing about these subjects, you'll often find that a sloppy definition you used in some place will bite you in the tail, and make it harder to explain something later on. So, that is why I from time to time get kind of obsessed with finding just the right words.

I'm not pretending I have the right words though. But there are a few informal ways of saying things that at a glance look right but are in fact wrong. Here's a random list of some of them:

PRIMARY KEY and UNIQUE constraints are unique indexes

Wrong - an index is just a convenient implementation to detect duplicate entries, and this is used by all RDBMS-es I am familiar with to implement PRIMARY KEY and UNIQUE constraints. However, the fact that there is a distinction is evident in for example the Oracle SQL syntax. For example, in ALTER TABLE ... DROP CONSTRAINT you can specify whether the associated index should be kept or also discarded.

Some people argue that it does not make sense to make the distinction in case the RDBMS does not maintain the constraint and index as separate objects. (This is the case in for example MySQL.)

Well, maybe...but I disagree. When I think about constraints, I'm thinking about business rules and guarding them to maintain database integrity. When talking about indexes, I'm thinking about performance and access paths. Quite different things, and in my opinion a shame to throw away the words to express the difference in my opinion.

A table consists of rows and columns

No - there is nothing wrong with an empty table. In other words, it does not consist of rows. It may or may not contain rows, but that is a different story.

A scalar subquery returns one column and one row

Wrong - first of all, a scalar subquery may return zero rows, in which case it evaluates to null, which is perfectly valid. But there is more to it.

Whether something is or is not a subquery is matter of syntaxis. The SQL grammer is defined so that if you encounter a query between parenthesis where a scalar value is appropriate, then that query (including the parentheses) will be parsed as a scalar subquery. In other words, the text satisfies the production rule for the non-terminal symbol "scalar subquery".

The parser will usually be smart enough to verify whether the subquery yields one column, but the number of rows returned is a runtime affair.

Suppose the query that makes up the scalar subquery would in fact return more than one row...would it suddenly not be a scalar subquery anymore? Of course not. It is still a scalar subquery - it just happens to be impossible to execute it. In other words, it violates the semantics of a scalar subquery and is therefore invalid. But the mere fact that we can conlcude that must imply that it is a scalar subquery.

A subquery is a SELECT statement that appears as a part of another SELECT statement

Wrong - For the same reasons as the previous issue. A statement is a syntactical construct. It has to do with discovering a pattern in a piece of text so that it satisfies a particular rule in the SQL grammer. That grammar does not have a rule that allows statements to be nested - not in pure SQL anyway (Of course, in stored procedures, one can have statement blocks like BEGIN...END, IF...END IF etc that really can contain other statements)

Of course, if we would take the SELECT that makes up the subquery and run it in isolation, it would be a SELECT-statement. Bit that is exactly the heart of the matter: because we are regarging it as part of another statement, it cannnot be a statement itself. This is simply a matter of definition of course - most people will immediately understand what is meant.

What would be better to say though is that a subquery is a query or query expression that appears as part of another SQL statement. However, this is also not correct: CREATE VIEW for example does contain a query expression, but this would most likely not be called a subquery. For this particular case, you can argue that there is nothing sub-ish about the query expression, because it is simply an essential part of the CREATE VIEW statement.

But what to think of CREATE TABLE...AS SELECT... and INSERT INTO...SELECT? The query expression is certainly not an essential part of CREATE TABLE and INSERT INTO, and in that sense, the query does look like it is subordinate to the statement it is part of.

You could argue that a query is a subquery if it appears inside another query. That seems sound, but what to think of UPDATE ... SET = (SELECT ...)? Personally I am reluctant to call an UPDATE statement a query - I tend to think of a query as a SELECT statement or sometimes a query expression.

I can think of only one thing that really is a defining characteristic of a subquery though - that is that the query expression must appear within parentheses. So, again, a matter of syntax more than a matter of semantics. I must admit I'm still not very satisfied with this though...What do you think?

NULL is the absence of a value

Variants of this statement go like "NULL is a missing value" or "NULL is not a value".

With slight doubt, I say: wrong. It certainly is true that many people use NULL to convey that something is not there or that something is not applicable. But this is a matter of choice, it does not change the meaning of NULL itself. If we use the same line of reasoning as we used for the subquery myth, we must conclude that NULL is certainly a valid value expression. It can legally appear anywhere where we can put a value. It is IMO also perfectly ok to say things like "...that expression evaluates to NULL".

So what does the SQL standard say? Well, here's a quote:

...the null value is neither equal to any other value nor not equal to any other value — it is unknown
whether or not it is equal to any given value....

So, I'm in that camp too: NULL is a value, and if we have a NULL in say, the integer domain, we just don't know which of all possible integers it is.

Foreign keys must reference a primary key

Wrong - a Unique constraint is mostly just as acceptable.

In MySQLs InnoDB it is even more relaxed - the foreign key only needs to reference the prefix of an index in the parent table, although this is so exotic, it should probably be ignored.

This table is not normalized - it still contains redundancy

Wrong - a table is normalized when it is in the first Normal form. There are a few different opinions what that means exactly, but it usually works to say that a table is not normalized when it contains repeating groups.

A slightly stronger statement is to say that a table is not normalized when it contains data that is not atomic. This is stronger because it does not cover only repeating groups, but also columns that, for a single row, do not contain a single value. For example, a first name/last name combination in one column is not atomic, and therefore, a table that contains such values is not normalized. (There are opinions that require even more than this, but for practical purposes the sense of atomic values works pretty well.)

The source of confusion is in what happens beyond the first normal form. Although a table maybe normalized, it can still contain redundancy. By removing redundancy, you can progressively achieve a higher normal form. In many cases, one would require at least third normal form or the Boyce-Codd normal form for building database schemas. Many people say "normalized" when they actually mean "in at least the third normal form".

So - what do you think? Pedantic? Have other myths? Maybe you have a good, satisfactory definition for subqueries? Or maybe you find an error in my debunkings?

Just drop me a comment on this post - thanks in advance.

Wednesday, October 24, 2007

US Patents: Can this be real?! How?!!

I guess many people learned through slashdot that yesterday, Amazon was awarded US patent 7278042, titled Search engine system supporting inclusion of unformatted search string after domain name portion of URL. In case you are still wondering what this patent entails, read the abstract:

A web site system provides functionality for searching a repository of information, such as the World Wide Web, by including a search string at the end of a URL without any special formatting. In one embodiment, when the system receives a request for a URL of the form www.domain_name/char_string, where char_string is a character string that may include spaces and non-alphabetic characters, the system initially determines whether the character string includes a prefix that identifies the URL as a non-search-request URL. If no such prefix is present, the character string is used in its entirely as a search string to execute a search, and the results of the search are returned to the user.

So, basically, this describes an application that would process URLs of the form http://<somedomain>/<search string> by feeding the text value of <search string> into a search engine and return the search results. The flowchart describes the process:

The patent claim includes a few variations:

As will be recognized, various modifications can be made to the method shown in FIG. 1. One such modification is to treat all URLs of the format www.domain_name/char_string as search request URLs. If this approach is used, a subdomain (e.g., www.pages.a9.com) may be provided for accessing other content of the web site. Another variation is to treat certain URLs that lack the full prefix as search requests if such URLs are otherwise invalid. For example, assuming "-/" is used as the only valid prefix, a request for www.a9.com//text may be treated as a request to search for "text" if the URL does not point to a valid subdirectory or object.

Another variation is to use a suffix, rather than a prefix, to identify non-search-request URLs. For instance, a URL of the format www.domain_name/char_string may be treated as a non-search-request URL if and only if char_string ends with a predefined suffix such as "/-" or "/-/"

The patent was filed in 2004. Personally, I have a hard time believing Amazon was the first one to think of this, but even then...I never realized this patent system really is...what will happen next? Is there a patent to interpret subdomains as search strings? Should I file one, just for fun?

Monday, September 24, 2007

Creating MySQL UDFs with Microsoft Visual C++ Express

Some time ago, I announced the MySQL UDF Repository. In short, the MySQL UDF Repository tries to be a one stop place to obtain high quality LGPL licensed libraries containing MySQL UDFs, including documentation and binaries. Since the announcement, our Google Group has grown to a 22 members (including a number of MySQL employees and prominent community members), and we've gained a few interesting new UDF libraries:

lib_mysqludf_preg: A library authored by Rich Waters providing PERL compatible regular expressions.
lib_mysqludf_xql: A library authored by Arnold Daniels with many useful functions to map and export relational data from MySQL to XML.

Another thing that we see happening now is that people are starting to ask for windows binaries. Although we intend to provide binaries for major operating systems, we haven't really got round to it yet for Windows in particular.

I want to make a start now by explaining how to create and run MySQL UDFs on Windows using the Express Edition of the popular Microsoft Visual C++ IDE. I hope this information will be useful to the many MS oriented developers out there that have trouble getting started with MySQL UDFs.

Preparation

Before we can actually start, we need to install and configure some software and obtain a few resources.

Installing Visual C++ 2005 Express Edition

First, you'll need to download and install Microsoft Visual C++ 2005 Express Edition. At present, this is the latest stable release of the free (As in Beer) version of the popular Visual Studio IDE, set up to create, compile and debug C++ programs.

The installation procedure can take a little while, but is otherwise pretty straightforward.

If you already have a paid-for version of Visual Studio 2005, you should not install the express edition (you are at risk of messing up the existing installation if you do). In that case, you should use your paid-for version or alternatively, download the upcoming release (Microsoft Visual C++ 2008 Express Edition, now in Beta) and give that a spin.

Installing the Microsoft Platform SDK

Apart from Visual Studio, you also need to have the Microsoft Platform SDK installed. Although this SDK is officially entitled "Microsoft ® Windows Server® 2003 R2 Platform SDK", it includes the resources for many flavours of Windows, including Windows 2000, Windows 2003 and Windows XP.

Installing the MySQL Development resources

Source files for UDFs contain references to header files supplied by MySQL. The easiest way to obtain them is by installing them using the Setup.exe installer program that you use to install the MySQL Server.

If you are installing a new server, you can ensure that the files are installed by choosing "custom" in the "Setup Type" step of the wizard started by the Setup.exe:

After that, you will be able to choose which features you want to install. You need to ensure that the "C Include files/Lib Files" under the "Developer Components" is selected:

If you are not installing a new server, you should first check to see if you have an include directory immediately beneath the MySQL base direction. If so, you probably don't need to do anything right now.

If you don't have the include directory, it probably means you did not choose to install the "C Include files/Lib Files" when installing the server (by default, they are not installed). Running the Setup.exe program again will offer you the possibility to add new components to the installation:

And from here, you will be led to the step where you can choose to install the include files and library files.

Setting up a VC++EE Project for MySQL UDFs

Once you fulfilled all necessary prerequisites, the next step is to create a Visual Studio Project. In this context, a project is a container for source files, resources, references to existing libraries, as well as a number of options to compile the source files.

In this article we will set up a VS project to compile the (existing) source of lib_mysqludf_udf. This is a library that demonstrates the basic features of the MySQL UDF interface, such as processing arguments, returning values and handling errors. It is great to get started programming MySQL UDFs. In order to walk through this example, you only need to download the C source file. (Tip: right click the link and choose an option from the context menu to download the file to your local file system.)

Creating a new Project

To actually create the project, we can use the File/New/Project... menu or else the "create project" hyperlink on the the Visual Studio Startpage:

This opens a dialog where we must enter a few details about our project:

For the Visual C++ Express edition, it works best to choose a General/Empty Project. (The paid-for edition of Visual Studio provides templates for projects to create dynamically linked libraries a.k.a. DLLs but as we shall see later on we have to configure this manually.)

We are also required to provide a name for the project. In this case, we use a name that corresponds directly to the source file: lib_mysqludf_udf.

Visual Studio Solution

In Visual Studio, a project is always part of a Solution, which is basically a container for a number of related projects. Because we just started a new project, we are implicitly creating a new solution too, so we have to specify a few things about that as well:

In this case, we create a separate directory for the solution itself, and we use the same name for the solution as for the project. It is important to realize that there can be multiple projects per solution, in which case it probably makes more sense to choose a distinct name for the solution as a whole.

After confirming the dialog, a number of directories and files are created:

Adding a source file

Now it is time to add the source file to our project, so if you didn't download the lib_mysqludf_udf C source file yet, you should do so now. Be sure to copy the lib_mysqludf_udf.c source file to the lib_mysqludf_udf project directory beneath the lib_mysqludf_udf solution directory:

Copying the source file there is just a matter of convenience - I like to keep things that belong together in one place. If you don't keep the file in the project directory, things might may (and probably will) go wrong if you move the source file or the project to another location later on.

Copying the file to the directory still does not formally add the file to the project. To actually add the file to the project, you can right-click the "Source Files" folder beneath the project folder in the Visual Studio Solution Explorer window and use the context menu to add the existing item:

As an alternative to using the menu you can also add the source file to your project by simply dragging it into the Solution Explorer and dropping it into the "Source Files" folder of the project.

If all went well, the source file is now part of the project and can be opened from the solution explorer:

Project Configuration

Although we already defined the structure of the project, we need to configure it in order to compile it. The configuration can be edited through a single dialog which can be accessed by clicking the "Properties" item in the project folder's context menu:

General Properties

We first need to take care of the general configuration. A Visual Studio project can have several configurations - something which is very useful if you want to create different builds (debug or release) from the same project. However, it is a good idea to first configure all project properties that are the same for all configurations. To do that, we choose "All configurations" in the top left listbox of the configuration dialog. (For this example, we do not separately configure for debug and release builds.)

The rest of the configuration process is a matter of editing individual properties. Related properties are organized in property pages, each of which covers a particular aspect of the project. By default, the "General" property page is selected and it makes sense to start editing properties there right away.

In the "General" property page we need to set the "Configuration Type" property to "Dynamic library (.dll)" as we need to be able to load the UDF library dynamically into the MySQL Server.

Configuring the Include path

MySQL UDFs refer to types and contants defined in C header files provided by MySQL. In turn, these refer to header files from the Microsoft Platform SDK. The project does not know automatically where to locate these header files, so we need to configure the project and point it to the location(s) manually.

To specify the location of the header files, we need to activate the "C/C++" property page and edit the "Additional Include Directories" property. You can either directly type the paths in the text box, or otherwise click the elipsis buttons (...) to browse for them.

For this example, we need to specify two locations:

The location of the "include" directory beneath the MySQL installation directory.

The location of the "include" directory beneath the Microsoft Platform SDK installation directory.

If you can't find these directorie, you most likely need to revisit the "preparation" section of this article.

Adding the `HAVE_DLOPEN` macro

The lib_mysqludf_udf.c source file was created using the udf_example.c source file from the MySQL source distribution as an example. The structure of that code uses conditional compilation according to wheter HAVE_DLOPEN is defined:


#ifdef HAVE_DLOPEN

...code goes here...

#endif /* HAVE_DLOPEN */

And this is also used in lib_mysqludf_udf.

I admit that I don't understand why that is there, or what it is supposed to achieve, and I would very much like someone to comment on this blog entry to explain it. Anyway, for Visual C++ it means we have to explicitly define it using a Preprocessor definition:

Configuring the library path

We configured the project to compile a Dynamic-Link Library. For the compiler, this means it cannot just compile the code and package it in a file: the dll target file needs to adhere to a certain specification. In order to make that happen, it needs to link to existing libraries from the platform SDK.

Just like we did for the include path, we need to tell Visual C++ where it can find the libraries it must link to. This can be configured by editing the "Additional Library Directories" property in the "Linker" property page:

In this case, we only need to specify the path of the "Lib" directory find immediately beneath the Platform SDK installation directory.

Compiling the UDFs

At this point, we are ready to compile the project and/or solution. In most cases, you will want to choose the build configuration to choose between a debug or a release build. This can be done by clicking the "Configuration Manager" item in the build menu to invoke the Configuration Manager dialog:

Actually building the project is done using the "Build Solution" or " Build Project" item in the "Build" menu:

The result of building the solution should be as indicated in the screenshot. If the final line does not read

1>lib_mysqludf_udf - 0 error(s), 0 warning(s)

you might want to read the remainder of this section to figure out what the problem is.

Common problems

Of course, no programming task is complete without running into trouble. In this section, a few common problems compiling the project are listed, as well as their solutions.

fatal error C1083: Cannot open include file: 'filename.h'

The output of the build process might look something like this:

1>Compiling...
1>lib_mysqludf_udf.c
1>..\..\..\temp\lib_mysqludf_udf.c(41) : fatal error C1083: Cannot open include file: 'my_global.h': No such file or directory
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

This indicates that you did not configure the include path properly. You should revisit the section on configuring the include path and ensure that the path does in fact point to the include directory that appears under the mysql installation directory.

The build output might look something like this:

1>Compiling...
1>lib_mysqludf_udf.c
1>C:\MySQL Server 5.1.21\include\config-win.h(30) : fatal error C1083: Cannot open include file: 'winsock2.h': No such file or directory
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

This is a similar problem. It occurs when you did include the "include" directory beneath the MySQL installation directory but forgot the one beneath the Microsoft Platform SDK installation directory. Because the latter is referenced by the former, both have to be added to the include path.

fatal error LNK1104: cannot open file 'file.lib'

Your build output might look like this:

1>------ Build started: Project: lib_mysqludf_udf, Configuration: Release Win32 ------
1>Linking...
1>LINK : fatal error LNK1104: cannot open file 'uuid.lib'
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

This indicates that you did not properly configure the path where Visual Studio looks for Additional Libraries. You should revisit the relevant section in this article and ensure the configured path does in fact contain the specified missing library.

Installing the UDFs

If you managed to succesfully compile the project you are ready to install the UDFs in your MySQL Server. Depending on the whether you chose to do a "Release" or a "Debug" build, you will find the lib_mysqludf_udf.dll in the "Release" or "Debug" directory directly beneath the Solution directory respectively:

The .ddl needs to be copied to a location that is accessible to the MySQL Server. For MySQL versions lower than 5.1.19, the bin and/or lib directories right beneath the MySQL installation directory should work. For MySQL version 5.1.19 and beyond, you are required to copy the dll to the plugin_dir. The plugin_dir can be determined by running the following query:

mysql> show variables like 'plugin_dir';
+---------------+-----------------------------+
| Variable_name | Value                       |
+---------------+-----------------------------+
| plugin_dir    | C:\MySQL Server 5.1.21\lib/ |
+---------------+-----------------------------+
1 row in set (0.00 sec)

When the dll is in place, the UDFs can be installed using the CREATE FUNCTION syntax. The following script will install all functions in lib_mysqludf_udf.dll:

CREATE FUNCTION lib_mysqludf_udf_info 
RETURNS STRING 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_count 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_type 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_value 
RETURNS STRING 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_value_length 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_maybe_null 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_attribute 
RETURNS STRING 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_attribute_length 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_init_error 
RETURNS INTEGER 
SONAME 'lib_mysqludf_udf.dll'
;

Common Problems

Even if you successfully compiled the solution, you still might run into a few problems at this stage. A few common ones are described in the remainder of this section.

ERROR 1126 (HY000): Can't open shared library 'file.dll'

If you encounter this error, it means that MySQL cannot find the library you are referring to in the SONAME clause of the CREATE FUNCTION statement. You may have made a typo in your statement, or the MySQL may be looking in another location for the libarary than you might think it does. Verify that you typed the correct location. For MySQL 5.1.18 and earlier, ensure that the dll is copied to either the bin and/or lib directory beneath the MySQL installation directory. For MySQL 5.1.19 and beyond, ensure that the file is located in the plugin_dir.

ERROR 1127 (HY000): Can't find symbol 'functionname' in library

If you encounter this error, a few things might be the matter. You might have made a typo in the function identifier in the CREATE FUNCTION statement. Another possibility is that you forgot to add the HAVE_DLOPEN macro to the preprocessor definitions. If needed, revisit that section in this article.

ERROR 1046 (3D000): No database selected

This error occurs when you did not set the default database. The workaround is to set any database as default database using the USE statement:

USE test;

. Arguably this is a bug in the MySQL Server: it somehow thinks we are trying to create a stored function which is bound to a database. It's as if MySQL cannot distinguish between a UDF and a stored function at this point.

ERROR 1044 (42000): Access denied for user 'user'@'host' to database 'mysql'

This error occurs when the user that is trying to create the function is not privileged to write to the mysql system database. To the best of my knowledge, UDFs are written only the mysql.func table so I would expect that granting privileges on that table would be enough to be allowed to create UDFs. It turns out that this is not the case. Granting all privileges on the mysql database does allow a user to install UDFs, but I don't know if that is indeed the minimal set of privileges required to install UDFs.

Using UDFs

If you succesfully installed the UDFs, any user will be able to use them, regardless of the setting for the default database . Here is a quick set of examples for lib_mysqludf_udf. The examples in itself are not very useful in itself, but you they can be useful tools for general UDF development when debugging or testing. Also, you can learn from the code how to work with the UDF interface)

mysql> -- show the lib_mysqludf_udf version
mysql> select lib_mysqludf_udf_info();
+--------------------------------+
| lib_mysqludf_udf_info()        |
+--------------------------------+
| lib_mysqludf_sys version 0.0.2 |
+--------------------------------+
1 row in set (0.00 sec)

mysql> -- returns the number of passed arguments
mysql> select udf_arg_count(1,2,3,4);
+------------------------+
| udf_arg_count(1,2,3,4) |
+------------------------+
|                      4 |
+------------------------+
1 row in set (0.00 sec)

mysql> -- returns the type of the passed argument
mysql> select udf_arg_type(1), udf_arg_type('string');
+-----------------+------------------------+
| udf_arg_type(1) | udf_arg_type('string') |
+-----------------+------------------------+
|               2 |                      0 |
+-----------------+------------------------+
1 row in set (0.00 sec)

mysql> -- returns the (string)value of the passed argument
mysql> select udf_arg_value(1);
+------------------+
| udf_arg_value(1) |
+------------------+
| 1                |
+------------------+
1 row in set (0.00 sec)

mysql> -- returns the string-length of the passed argument
mysql> select udf_arg_value_length(123);
+---------------------------+
| udf_arg_value_length(123) |
+---------------------------+
|                         3 |
+---------------------------+
1 row in set (0.00 sec)

mysql> -- returns 1 or 0 according to whether the passed argument might be null
mysql> select udf_arg_maybe_null(null), udf_arg_maybe_null(1);
+--------------------------+-----------------------+
| udf_arg_maybe_null(null) | udf_arg_maybe_null(1) |
+--------------------------+-----------------------+
|                        1 |                     0 |
+--------------------------+-----------------------+
1 row in set (0.00 sec)

mysql> -- returns the expression text or its alias
mysql> select udf_arg_attribute(1+1), udf_arg_attribute(1+1 as alias);
+------------------------+---------------------------------+
| udf_arg_attribute(1+1) | udf_arg_attribute(1+1 as alias) |
+------------------------+---------------------------------+
| 1+1                    | alias                           |
+------------------------+---------------------------------+
1 row in set (0.00 sec)

mysql> -- returns the length of the expression text or its alias
mysql> select udf_arg_attribute_length(1+1);
+-------------------------------+
| udf_arg_attribute_length(1+1) |
+-------------------------------+
|                             3 |
+-------------------------------+
1 row in set (0.00 sec)

mysql> -- throws an error
mysql> select udf_init_error('Whoops!');
ERROR:
Whoops!

Tuesday, September 04, 2007

Kettle Tip: Using java locales for a Date Dimension

The Date dimension is a well known construct in general data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script.

Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons:

When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.

You won't need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.

In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn't go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the java platform and the way you can put that to use in Kettle transformations.

Prerequisites

In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)

Overview

The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:

First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).

Step-by-Step

The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.

Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.

MySQL JDBC driver: setting the characterEncoding property to UTF8

You need to create a (JDBC) connection to MySQL in the usual, straightforward way:

In addition, you need to set the characterEncoding property of the JDBC driver:

This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like SET NAMES utf8 to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the "Using character sets and unicode" section of the Connector/J documentation for more information on this topic.

Creating the date dimension table

In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the "Execute SQL Script" step shown below:

The "Execute SQL Script" step executes the following script to create the date dimension table:

DROP TABLE IF EXISTS dim_date
;
CREATE TABLE IF NOT EXISTS dim_date (
  date_key                smallint unsigned NOT NULL,
  date                    date              NOT NULL,
  date_short              char(12)          NOT NULL,
  date_medium             char(16)          NOT NULL,
  date_long               char(24)          NOT NULL,
  date_full               char(32)          NOT NULL,
  day_in_year             smallint unsigned NOT NULL,
  day_in_month            tinyint  unsigned NOT NULL,
  is_first_day_in_month   char(10)          NOT NULL,
  is_last_day_in_month    char(10)          NOT NULL,
  day_abbreviation        char(3)           NOT NULL,
  day_name                char(12)          NOT NULL,
  week_in_year            tinyint  unsigned NOT NULL,
  week_in_month           tinyint  unsigned NOT NULL,
  is_first_day_in_week    char(10)          NOT NULL,
  is_last_day_in_week     char(10)          NOT NULL,
  month_number            tinyint  unsigned NOT NULL,
  month_abbreviation      char(3)           NOT NULL,
  month_name              char(12)          NOT NULL,
  year2                   char(2)           NOT NULL,
  year4                   year              NOT NULL,
  quarter_name            char(2)           NOT NULL,
  quarter_number          tinyint           NOT NULL,
  year_quarter            char(7)           NOT NULL,
  year_month_number       char(7)           NOT NULL,
  year_month_abbreviation char(8)           NOT NULL,
  PRIMARY KEY(date_key),
  UNIQUE(date)
) 
ENGINE=MyISAM 
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_unicode_ci

This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.

Generating 10 years worth of days

The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the "Generate Rows" is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:

In the example, this step is also used to provide parameters to generate the date dimension data. As we'll see in a moment, the inital_date field effectively specifies the first date that goes into the date dimension. The language_code and country_code fields are used to localize the textual attributes of the date dimension, and the local_yes and local_no fields are used for boolean attributes.

There are other ways to get these parameters into our transformation. For example, we could have used an "Add Constants" step with a similar result. Another possibility would be to get this data from the environment using a "Get Variables" step, and this would allow the parameters to be specified at transformation run-time.

Counting the days

Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to 'count' the generated rows. We do this by adding a "Sequence" step:

In this case, we use the "Add sequence" step to generate an incrementing number within the scope of the transformation. As we'll see later on, we can add this to our initial date to get a series of consecutive dates.

Calculating date dimension Attributes

The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a "Modified Java Script Value" step:

Kettle comes with an embedded Rhino javascript engine. The "Modified Java Script Value" step lets you use it to run javascript code to as part of the transformation.

The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.

One of the fortunate characteristics of the Rhino engine is that it lets us use java classes inside the javascript code. Let's take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.

Initialization

The first thing we do in the javascript code is to get data from the current input row. In the "Generate Rows" step, we added the language_code and country_code fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a java.util.Locale object:

//Create a Locale according to the specified language code
var locale = new java.util.Locale(
    language_code.getString()    //get the ISO639 language_code from the input row
,   country_code.getString()     //get the ISO3166 country_code from the input row
);

The java.util.Locale class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output.

We will be using the locale on a number of occasions, but first, we use our it to initialize a java.util.Calendar object:


//Create a calendar, use the specified locale
var calendar = new java.util.GregorianCalendar(locale);

(Note that the java platform currently only provides one concrete Calender Class: the java.util.GregorianCalendar. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).

We require the calendar object to obtain an instance of the java.util.Date Class that represents the date corresponding to the current row. To do that, we first set the calendar's current date using the initial_date field that was specified in the "Generate Rows" step:

//Set the initial date
calendar.setTime(initial_date.getDate());

We need this to add the number of days generated by our "Add Sequence" step:

//set the calendar to the current date by adding DaySequence days
calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);

(Note that we substract 1 from the DaySequence value. This is because our sequence starts at 1, and we want the specified initial date to be included in our date dimension).

We conclude the initialization of the script by retrieving a java.util.Date object that represents the date for the current row.

//get the calendar date
var date = new java.util.Date(calendar.getTimeInMillis());

This java.util.Date instance is assigned to the date variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the date column of the date dimension table. We will also be using the date variable later on in this script to derive the value of other date dimension attributes.

Getting Text representations of full dates

Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats: date_short, date_medium, date_long and date_full. These are all generated using the java.text.DateFormat class.

To do that, we first need to create an appropriate DateFormat instance using the static getDateInstance() method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the java.util.Date object for which we want to obtain the textual representation to the format method of the newly created java.text.DateFormat instance:

//en-us example: 9/3/07
var date_short  = java.text.DateFormat.getDateInstance(
                      java.text.DateFormat.SHORT
                  ,   locale
                  ).format(date);
//en-us example: Sep 3, 2007 
var date_medium = java.text.DateFormat.getDateInstance(
                      java.text.DateFormat.MEDIUM
                  ,   locale
                  ).format(date);
//en-us example: September 3, 2007
var date_long   = java.text.DateFormat.getDateInstance(
                      java.text.DateFormat.LONG
                  ,   locale
                  ).format(date);
//en-us example: Monday, September 3, 2007
var date_full   = java.text.DateFormat.getDateInstance(
                      java.text.DateFormat.FULL
                  ,   locale
                  ).format(date);

Formatting date parts

Extracting and formatting different date parts is most easily done by applying the format function on a subclass of java.text.Dateformat, the java.text.SimpleDateFormat class. The java.text.SimpleDateFormat class allows formatting of dates based on date and time patterns:

//day in year: 1..366
var simpleDateFormat = java.text.SimpleDateFormat("D",locale);
var day_in_year              = simpleDateFormat.format(date);

In this example, we pass both the locale and a date pattern to the constructor to create an instance of the java.text.SimpleDateFormat class. The pattern is passed as the string "D", specifying a day-in-year format.

Once we created the java.text.SimpleDateFormat instance, we can apply a new pattern to it using the applyPattern() method. Calling the format method again, we obtain the date in the desired format:

//day in month: 1..31
simpleDateFormat.applyPattern("d");
var day_in_month       = simpleDateFormat.format(date);
//en-us example: "Monday"
simpleDateFormat.applyPattern("EEEE");
var day_name           = simpleDateFormat.format(date);
//en-us example: "Mon"
simpleDateFormat.applyPattern("E");
var day_abbreviation   = simpleDateFormat.format(date);
//week in year, 1..53
simpleDateFormat.applyPattern("ww");
var week_in_year       = simpleDateFormat.format(date);
//week in month, 1..5
simpleDateFormat.applyPattern("W");
var week_in_month      = simpleDateFormat.format(date);
//month number in year, 1..12
simpleDateFormat.applyPattern("MM");
var month_number       = simpleDateFormat.format(date);
//en-us example: "September"
simpleDateFormat.applyPattern("MMMM");
var month_name         = simpleDateFormat.format(date);
//en-us example: "Sep"
simpleDateFormat.applyPattern("MMM");
var month_abbreviation = simpleDateFormat.format(date);
//2 digit representation of the year, example: "07" for 2007
simpleDateFormat.applyPattern("y");
var year2              = simpleDateFormat.format(date);
//4 digit representation of the year, example:  2007
simpleDateFormat.applyPattern("yyyy");
var year4              = simpleDateFormat.format(date);

Dealing with Quarters

Although the java.text.SimpleDateFormat class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually:

//handling Quarters is a DIY
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
    case 1: case 2: case 3: quarter_number = "1"; break;
    case 4: case 5: case 6: quarter_number = "2"; break;
    case 7: case 8: case 9: quarter_number = "3"; break;
    case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;

Although this will do for now, this solution doesn't really cut it because it does not produce localized output. Anyway, it is better than nothing so we'll just have to make do with it.

Period demarcation flags

Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual "yes" and "no" values to be specified by the user in the "Generate Rows" step. We retrieve them with the following piece of code:

//get the local yes/no values
var yes = local_yes.getString();
var no = local_no.getString();

We can now use them these to flag the start and end of week and month periods.

The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the getFirstDayOfWeek() method of the java.util.Calendar class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week:

//initialize for week calculations
var first_day_of_week = calendar.getFirstDayOfWeek();
var day_of_week = java.util.Calendar.DAY_OF_WEEK;

//find out if this is the first day of the week
var is_first_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
    is_first_day_in_week = yes;
} else {
    is_first_day_in_week = no;
}

Note that we obtain the current day of the week by passing the value of the DAY_OF_WEEK constant to the get method of the java.util.Calendar object that we initialized at the start of the script.

In order to set the value for the is_last_day_in_week attribute of the date dimension, we simply find out if the next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week:

//calculate the next day
calendar.add(calendar.DAY_OF_MONTH,1);

//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());

//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
    is_last_day_in_week = yes;
} else {
    is_last_day_in_week = no;
}

(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)

We can use similar logic to calculate the values for the is_first_day_in_month and is_last_day_in_month indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one:

//find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
    is_first_day_in_month = yes;
} else {
    is_first_day_in_month = no;
}

//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
    is_last_day_in_month = yes;
} else {
    is_last_day_in_month = no;
}

A few more date attributes

We conclude the computation of the date attributes by adding a few more useful labels:

//a few useful labels
var year_quarter            = year4 + "-" + quarter_name;
var year_month_number       = year4 + "-" + month_number;
var year_month_abbreviation = year4 + "-" + month_abbreviation;

Like when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don't really have a better way to deal with it right now.

Defining the step outputs

We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.

Every variable declared in the javascript (using the var keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the "Get Variables" button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:

By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the date output. Inside the script, it is an instance of a java.util.Date class, and we must set the type to "Date" in the output too. Otherwise, the (java) string representation of the java.util.Date object will be sent as output, and this is not automatically recognized as a date by MySQL.

Discarding Fields

We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a "Select Values" step to do that:

We use the "Get Fields To Select" button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the DaySequence field to date_key to map it to the date_key column in our date dimension table.

Inserting data into the table

In the final step, we add the generated data to the dim_date table we created in the very first step of the transformation:

We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.

We could have used the "Insert / Update" step, or even the "Execute SQL Script" step too to write the data to the dimension table, but that would require a little bit of extra work.

Running the transformation

After building the transformation, you can run it by hitting the "running man" icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit "Launch" button there and after that, the transformation will be executed:

Closing Notes

I hope you enjoyed this tip. If you want to, you can download the kettle transformation here, and use it as you see fit.

If you are interested in open source data warehousing, register for the MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)