Recent Projects: Democrat & Chronicle

One of my new projects this summer was proposed by the Democrat & Chronicle. It involved a selection of emails sent to the Greece school district in the wake of the YouTube video involving several of their students. The idea was that we would get a dump of email bodies and try to glean some information out of them.

The first hurdle, unfortunately, was getting at the information. Shortly after we were approached with the request, I received an Access file containing around 5000 email bodies to go through. Being what we are, most of the resources in the FOSSBox are oriented around Linux, and no one around had a copy of Access installed to get the data into a friendlier format.

In theory, there are ODBC drivers for Access databases, just like there are for any other database system. In practice, however, they seem to exist only for Windows machines, which was disappointing, if not surprising. This led me to dig out an old VirtualBox VM with Windows XP on it, install the Access drivers from Microsoft, and throw LibreOffice on it, too. There are other ways I could have gotten this done without LibreOffice, but at that point I was still hoping this would be a simple job.

LibreOffice Base eventually got into the Access file, but then the troubles started again. It initially prompted me to save a LibreOffice database file, which sounded great to me: it could export the data immediately, then I could copy the file over and finish the task in Linux. Unfortunately, all this file did was create a small wrapper around the Access file, telling LibreOffice where the file was located and what was needed to open it. So now I was back to trying to export the data. LibreOffice, though, was not willing to play along. I admit I am less than familiar with the Base component of LibreOffice, but some exploration and more searching online led me to believe I could not do a simple translation of data from one format to another from within Base.

Instead, I needed to select the table I was interested in (the only table in the database), tell LibreOffice to copy the table, then open a new spreadsheet in LibreOffice Calc, paste it in, and save the data that way. While this makes some sense to me (Base being simply for basic interaction with databases, Calc for manipulating raw data), I was dismayed that I could not find some way to export a single table to a common data format, like CSV, without having to go through yet another step. In any case, once I dumped the data into Calc, I could easily save it to CSV, drop that onto my real computer, stop the VM, and get to work for real.
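For anyone picking up from the CSV stage, here is a minimal sketch of reading such an export back into Python with the standard csv module. The function name and the "body" column header are assumptions made for illustration; the real export may well use a different header.

```python
import csv


def load_email_bodies(csv_path, body_column="body"):
    """Return the text of one column from a CSV export, one string per row.

    `body_column` is a hypothetical header name; adjust it to match
    whatever header Calc actually wrote.
    """
    bodies = []
    # newline="" lets the csv module handle line breaks inside quoted
    # fields, which matters because email bodies usually span several lines.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = (row.get(body_column) or "").strip()
            if text:  # skip rows with an empty or missing body
                bodies.append(text)
    return bodies
```

Opening the file with newline="" is the detail most worth keeping: without it, multi-line bodies can come back with mangled line endings.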

The end result is this. I'm not sure it will ever be of any particular use to anyone other than myself, as a reminder of how to use the Python NLTK module (whose documentation seems to be geared more towards researchers than those already familiar with Python). It is also hardcoded to certain facets of the data I was given. Still, it does manage to do a few things, and at each step it dumps the state of the data to a file so I can inspect the process and consider possible improvements.
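The dump-at-each-step idea can be sketched roughly as follows. This is not the actual script: the stage names, the tiny stopword set, and the choice of word counting as the analysis are all assumptions picked for illustration. It uses NLTK's RegexpTokenizer when NLTK is installed and falls back to an equivalent plain regular expression otherwise.

```python
import json
import re
from collections import Counter
from pathlib import Path

try:
    # RegexpTokenizer needs no extra NLTK data downloads.
    from nltk.tokenize import RegexpTokenizer
    _tokenize = RegexpTokenizer(r"[A-Za-z']+").tokenize
except ImportError:
    # Fallback with the same pattern when NLTK is not installed.
    _tokenize = re.compile(r"[A-Za-z']+").findall

# Hypothetical, deliberately tiny stopword list for the sketch.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "it", "i"}


def dump(stage, data, out_dir):
    """Write the current state of the data to <out_dir>/<stage>.json."""
    path = Path(out_dir) / f"{stage}.json"
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return path


def process(bodies, out_dir="."):
    """Tokenize, filter, and count words, dumping state after each step."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    # Step 1: lowercase word tokens, one list per email body.
    tokens = [[t.lower() for t in _tokenize(body)] for body in bodies]
    dump("01_tokens", tokens, out_dir)

    # Step 2: drop stopwords.
    filtered = [[t for t in doc if t not in STOPWORDS] for doc in tokens]
    dump("02_filtered", filtered, out_dir)

    # Step 3: count words across all bodies, most common first.
    counts = Counter(t for doc in filtered for t in doc)
    dump("03_counts", counts.most_common(), out_dir)
    return counts
```

Calling process(bodies, "out") on a list of email body strings leaves one JSON file per stage in the out directory, so any intermediate step can be opened and checked by hand.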

