Here's a common scenario: you have some CSVs containing data and you need to perform some analysis. Sort, filter, create summaries, perhaps even cross two CSVs together in a more complex query. Your first choice would be a spreadsheet. It will be very easy to import your data and do some sorting and filtering. However, when you need some more in depth analysis, this becomes much more difficult.
In this article I will present a simple alternative: import the data to H2 database and manipulate it. You need some basic knowledge of SQL and not much more (you don't need to be a programmer). This article starts from scratch and gets you there quickly. There's no complex installation or heavy-duty server required. You can be up and crunching your data in about 10 minutes.
H2 is an awesome little database engine written in Java. I use it for my Eclipse plugin, nWire, and I wrote about it in the past. It is small, light, simple to install, fully functional and surprisingly quick.
Step 1: Download H2
Go to the H2 Download page. You can select the latest beta, it is fairly stable and offers better performance. Go for the "platform independent ZIP" (works on Windows, Mac OS X, Linux/Unix), I prefer it over the installer which is Windows only (well, I'm on a Mac, maybe that's the reason). However, the installer is simpler to use, so if you're on Windows, you can get it.
Step 2: Install H2
If you downloaded the installer, run it. Otherwise, just pick a folder and unzip the downloaded archive. Installed. You don't even have to restart your machine.
Step 3: Fire up the H2 Console
On Windows: in the H2 folder, locate the bin folder. Inside, you will find "h2.bat". Double click it.
On Mac/Unix: open a terminal, cd to your H2 directory, cd to bin and execute "h2.sh" by typing "./h2.sh".
In both cases, this will open a link in your browser which brings up the H2 console. The console is by default at https://localhost:8082/.
Step 4: Open a fresh database
Th only setting you should modify is the URL. If should look like this: jdbc:h2:file:<path><databaseName>. The database uses several files, so it is best to pick an empty directory (it will be created if it does not exist).
On windows: jdbc:h2:file:C:/data/mydatabase (creates it in C:/data)
On Mac/Unix: jdbc:h2:file:/data/mydatabase (creates it in /data, use ~ to denote your home folder)
Click on "Connect". Upon the first connection, the database will be created and the console will open.
It looks like this:
Step 5: Import your CSV
So, dust off your SQL skills, time to do some light hacking. H2 has a useful function called CSVREAD. Start with a simple command: (Specify full path for your CSV)
This assumes that your CSV is comma separated (isn't that what CSV means?) and that the first line contains the column titles. It uses the default system encoding. You can customize it to fit your case:
CALL CSVREAD('c:/data/test.csv', 'ID|NAME', 'UTF-8', '|')
In this case, we define two columns, ID and Name, UTF-8 encoding and | as value separator. There are many customization options which you might need e.g. for importing dates at a specific format. See this page for more details.
In order to manipulate your data, you want to read it into a table. First, create the table:
CREATE TABLE DATA(ID INT, NAME VARCHAR);
Next, import the data to your table:
INSERT INTO DATA (SELECT * FROM CSVREAD('c:/data/test.csv'));
If you have several CSV files, you can import those into different tables.
Step 6: Crunch your data
Now, it's time for you to get creative. H2 supports standard SQL. The H2 Console has some nice features like completion and a useful display of the database elements. You'll get the hang of it in no time. Inside the H2 folder you will find a doc folder which contains a PDF with full reference to H2 functions.
Step 7 (Optional): Export your data to CSV
You can copy the data straight from the query results in the console and paste it into a spreadsheet. Make sure you copy all the results: you may need to set the "Max rows" to all (in a drop-down above the command pane in the console).
Alternatively, you can use the CSVWRITE function to dump the results to a CSV.
CALL CSVWRITE('c:/data/output.csv', 'SELECT * FROM DATA WHERE ID > 100');
Yes, you could do this with most databases out there. However, this is just simpler, which makes it a viable companion to spreadsheets. The meaning of "I need a database" doesn't have to be days and weeks of hard labor. H2 can do much more, this is just a taste. Whatever you do, H2 always keeps it simple and light.
One final tip: the console can connect to any JDBC compliant database, provided the JDBC drivers are in the classpath. Of course it can work as a remote server which makes it a simple alternative to tools like phpMyAdmin (which is MySQL specific). That's a good topic for a separate post.