Tuesday, March 30, 2010

MySQL: Partition-wise backups with mysqldump

To whom it may concern,

In response to a query from André Simões (also known as ITXpander), I slapped together a MySQL script that outputs mysqldump commands for backing up individual partitions of the tables in the current schema. The script was originally maintained as a snippet at MySQL Forge and now lives in a gist on github.

How it works

The script works by querying the information_schema.PARTITIONS system view to generate an appropriate expression for mysqldump's --where option. The generated command also redirects the output to a file with this name pattern:
<schema>.<table>.<partition-name>.sql
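The actual script lives in the gist; purely as an illustration, here is a minimal sketch of the idea, restricted to RANGE partitioning, showing how the pieces of such a command can be assembled from information_schema.PARTITIONS (the real script also handles the other supported methods and adds the lower bound taken from the preceding partition):

-- minimal sketch, RANGE partitions only: emit one mysqldump command per partition
SELECT CONCAT(
           'mysqldump --user=username --password=password --no-create-info'
       ,   ' --where="', PARTITION_EXPRESSION, ' < '
       ,   CASE PARTITION_DESCRIPTION
               WHEN 'MAXVALUE' THEN '18446744073709551615'
               ELSE PARTITION_DESCRIPTION
           END
       ,   '" ', TABLE_SCHEMA, ' ', TABLE_NAME
       ,   ' > ', TABLE_SCHEMA, '.', TABLE_NAME, '.', PARTITION_NAME, '.sql'
       ) AS cmd
FROM   information_schema.PARTITIONS
WHERE  TABLE_SCHEMA = SCHEMA()
AND    PARTITION_METHOD = 'RANGE'
ORDER BY TABLE_NAME, PARTITION_ORDINAL_POSITION;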
For example, for this table (taken from the MySQL reference manual):
CREATE TABLE members (
    firstname VARCHAR(25) NOT NULL,
    lastname VARCHAR(25) NOT NULL,
    username VARCHAR(16) NOT NULL,
    email VARCHAR(35),
    joined DATE NOT NULL
)
PARTITION BY RANGE( YEAR(joined) ) (
    PARTITION p0 VALUES LESS THAN (1960),
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);
the script generates the following commands:
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) < 1960" test members > test.members.p0.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1960 and YEAR(joined) < 1970" test members > test.members.p1.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1970 and YEAR(joined) < 1980" test members > test.members.p2.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1980 and YEAR(joined) < 1990" test members > test.members.p3.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1990 and YEAR(joined) < 18446744073709551615" test members > test.members.p4.sql
Tip: in order to obtain directly executable output from the mysql command line tool, run the script with the --skip-column-names (or -N) option.
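For example, assuming the generator script has been saved locally as partition_dumps.sql (a hypothetical file name), something along these lines turns its output into a ready-to-run backup script:

mysql --skip-column-names --user=username --password=password test < partition_dumps.sql > backup_commands.sh  # both file names are hypothetical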

Features

Currently, the script supports the following partitioning methods:

  • RANGE
  • LIST
  • HASH

Limitations

The LINEAR HASH method is currently not supported, but I may implement that in the future.

Currently I do not have plans to implement the KEY and LINEAR KEY partitioning methods, but I may reconsider if and when I have more information about the storage-engine specific partitioning functions used by these methods.

Finally, I should point out that querying the information_schema.PARTITIONS table is dog-slow. This may not be too big of an issue for a backup script, but it is pretty annoying nonetheless. If anybody has tips to speed this up, please let me know.

Acknowledgements

Thanks to André for posing the problem. I had a fun hour of procrastination to implement this, and it made me read part of the MySQL reference manual on partitioning.

I would also like to thank Giuseppe Maxia (the Datacharmer) for providing valuable feedback. If you're interested in either partitioning or the mysql command line, you should visit his tutorials at the MySQL conference, April 12-15, 2010.

Friday, March 19, 2010

Greatest N per group: top 3 with GROUP_CONCAT()

In my opinion, one of the best things that has happened to Planet MySQL lately is Explain Extended, a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow).

I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products.

Alex' most recent post is about aggregation, and finding a top 3 based on the aggregate:

In MySQL I have a table called meanings with three columns: word, meaning, person. word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2. There will be thousands of persons. I need to find the top three meanings for each of the 16 words, with their frequencies. Is it possible to solve this with a single MySQL query?
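For reference, here is a minimal sketch of a table matching that description; the table name t_meaning and the column names are taken from the query further down, while the data types are merely my assumption (Alex' post generates its own test data):

CREATE TABLE t_meaning (
    word INT NOT NULL,    -- 16 distinct words
    meaning INT NOT NULL, -- 26 distinct meanings
    person INT NOT NULL   -- thousands of persons
);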


Alex presents a solution that uses GROUP_CONCAT basically as a poor man's windowing function, a technique I have described on several occasions in the past for ranking, median and percentile solutions in MySQL.
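The core trick is that SUBSTRING_INDEX(SUBSTRING_INDEX(list, ',', n), ',', -1) extracts the n-th element from a comma-separated list, so applying it to a GROUP_CONCAT that is ordered by the aggregate behaves like a ranking window function. A tiny standalone illustration (my own example, not taken from Alex' post):

SELECT SUBSTRING_INDEX(
           SUBSTRING_INDEX('apple,banana,cherry', ',', 2)
       ,   ',', -1
       ) AS second_element;
-- returns 'banana'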

Now, Alex' solution is very clever, and some of its elements are very creative. That said, I think it can still be improved. Normally I wouldn't write a blog post about this and would simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present my alternative here:

SELECT word
,      CONCAT(
           SUBSTRING_INDEX(
               GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1
           )
       ,   ' ('
       ,   SUBSTRING_INDEX(
               GROUP_CONCAT(num ORDER BY num DESC), ',', 1
           ) / SUM(num) * 100
       ,   '%)'
       ) rank1
,      CONCAT(
           SUBSTRING_INDEX(
               SUBSTRING_INDEX(
                   GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2
               ), ',', -1
           )
       ,   ' ('
       ,   SUBSTRING_INDEX(
               SUBSTRING_INDEX(
                   GROUP_CONCAT(num ORDER BY num DESC), ',', 2
               ), ',', -1
           ) / SUM(num) * 100
       ,   '%)'
       ) rank2
,      CONCAT(
           SUBSTRING_INDEX(
               SUBSTRING_INDEX(
                   GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3
               ), ',', -1
           )
       ,   ' ('
       ,   SUBSTRING_INDEX(
               SUBSTRING_INDEX(
                   GROUP_CONCAT(num ORDER BY num DESC), ',', 3
               ), ',', -1
           ) / SUM(num) * 100
       ,   '%)'
       ) rank3
FROM   (
           SELECT word, meaning, COUNT(*) num
           FROM   t_meaning m
           GROUP BY word, meaning
       ) a
GROUP BY word

This gives me output like this:

+------+--------------+--------------+--------------+
| word | rank1 | rank2 | rank3 |
+------+--------------+--------------+--------------+
| 1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |
| 2 | 9 (3.9792%) | 10 (3.9632%) | 20 (3.9328%) |
| 3 | 20 (3.9744%) | 13 (3.968%) | 1 (3.9648%) |
| 4 | 26 (3.952%) | 7 (3.9456%) | 17 (3.9424%) |
| 5 | 9 (4.008%) | 21 (3.9824%) | 20 (3.936%) |
| 6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |
| 7 | 23 (4.0464%) | 12 (3.976%) | 19 (3.9648%) |
| 8 | 23 (4.0112%) | 3 (4.0096%) | 8 (3.9328%) |
| 9 | 10 (4.016%) | 19 (3.984%) | 15 (3.9616%) |
| 10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |
| 11 | 16 (3.9584%) | 6 (3.9296%) | 19 (3.9232%) |
| 12 | 7 (3.9968%) | 1 (3.9392%) | 26 (3.9264%) |
| 13 | 8 (4.048%) | 25 (3.9712%) | 23 (3.9616%) |
| 14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%) |
| 15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%) |
| 16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%) |
+------+--------------+--------------+--------------+
16 rows in set (0.63 sec)

On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.

Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)

Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.

Sunday, March 14, 2010

Writing another book: Pentaho Kettle Solutions

Last year, around this time, I was deep in the process of writing the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have nothing but happy memories of the collaboration with my co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Thomas Morgner, several of the Pentaho developers, and, last but not least, the team at Wiley, in particular Robert Elliott and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have received a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews in places like Amazon and elsewhere on the web. I'd like to use this opportunity to thank everybody who took the time to review the book: thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For "Pentaho Solutions", we had heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, with the memories of what it took to write "Pentaho Solutions" still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief of data integration at Pentaho and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever busy schedule to help us write the new book. Needless to say, he made us both very happy when he said he rather liked the idea and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing ever since, focusing on the forthcoming Kettle version, Kettle 4.0. The tentative title of the book is "Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration". It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed, but it may still change, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, though, and I have no reservations about releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallelization and Partitioning

    • Dynamic Clustering in the Cloud

    • Real-time and Streaming Data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.
