Automated bank statement filing with Hazel

7 Sep 2012

In a continuing effort to utilize automated tools to eliminate paper from my life and tidy up my filing system, I’ve been working on some additional Hazel recipes to manage the monthly statements my bank accounts produce. This one got a little tricky because I wanted a fully automated solution that would extract the proper date from the statement without any intervention on my part. If I had a digital assistant that could log in to my bank web sites and download the forms for me, I’d be in hog heaven. Sadly, that part still requires my participation.

Here’s the rough sequence of events associated with this Hazel recipe:

  1. I log in and download my financial statements. This is a manual process, but I don’t have to pay any attention to file names. The statements are downloaded to my Downloads folder, and Hazel takes it from there.
  2. A Hazel recipe monitors the Downloads folder looking for statements from my bank. When it finds one, the PDF documents get moved to the Action folder in my Dropbox where I centralize all of my Hazel filing. Why not run the filing rules from within the Downloads folder? If the filing recipe fails to run, I want the downloaded statement to sit in the Action folder where I’ll be more likely to notice the problem. I already have a Hazel rule monitoring my Downloads folder deleting downloads that are older than five days. I might not notice a statement that gets stuck in Downloads before it’s automatically deleted.
  3. The Hazel filing recipe scans through the downloaded PDF statements and identifies bank statements by looking for keywords associated with my bank’s name and the relevant account number.
  4. The recipe continues by using a couple geeky tools to identify the statement date.
  5. The statement PDF get renamed with the statement date and bank name and is filed in the appropriate folder.

Here are the detailed steps starting with #3 above. Step #2 is left as an exercise for the reader, and I’m not going to give you my bank logins to let you do step #1. Nice try.

Identifying document keywords

Most PDF documents generated by commercial vendors are fully searchable. That is, you can use Hazel’s “Contents contains…” parameter to identify specific text strings in the document such as the name of your bank or account number. This is not quite a gimme, however, because it appears that most bank statement PDFs are generated by a scanning process with an OCR (optical character recognition) process applied after the fact. In other words, the bank uses the file which could eventually become a printed statement and converts it to a searchable PDF later. I’ve found a few instances where the OCR process wasn’t perfect and missed my account number match. I had to file those statements manually.

A note about OCR: the ability to convert a scanned image of a document into searchable text is essential to a paperless workflow. You’re going to need some OCR software if you want to do anything interesting with Hazel. The full version of Adobe’s Acrobat software does OCR, but it’s expensive. I use PDFpen which is much less expensive and does a fine job of OCR plus PDF document editing. If you have a decent scanner you may be able to use the scanning software that was bundled with it for OCR. In my office environment I often use the built-in scanning function of our office copier which has a handy one-touch button that sends the document to my email address. After downloading the attachment I use PDFpen to make it searchable. PDF documents which are created by exporting directly from the software used to create the content (e.g., Microsoft Word) don’t typically require OCR. You should be able to search them reliably.

This screenshot shows the “If” part of the hazel recipe (with some judicious blurring of my account number). The first two parameters would probably be sufficient. If the document is a PDF and it contains a string that matches my account number, it’s probably safe to assume that it’s my bank statement. Being the belt and suspenders kind of guy that I am, I wanted to add one more check to make sure. The AppleScript calls an external utility that queries the metadata attached to the file. The mdls utility is an Apple-specific command line tool that lists all of the file metadata that the system-wide search utility, Spotlight, uses to locate files. The PDF statements downloaded from my bank have a metadata field named kMDItemTitle which has the value “U.S. Bank Internet Banking”. I use the mdls utility as an external script from within AppleScript to check for the presense of that value as a final check that what Hazel is looking at is indeed a downloaded bank statement from U.S. Bank. Hopefully you can see how to modify the script shown above to suit your own needs. Doing the command from a Terminal window:

$ mdls [filename] (insert the name of the file)

will print the whole list of metadata fields and values. Look for something that uniquely identifies your files.

It bears repeating that the mdls trick isn’t necessarily required. If you can identify the files using simple “Contents contains…” searches you’ll be just fine.

Extracting the statement date

I wanted my Hazel recipe to search the PDF for something that looks like a date string and determine the month and year of the statement on its own. If you downloaded the statement from your bank’s web site promptly every month then this would be a lot easier. You could simply use the “current date” field in the “Rename” part of the recipe and call it a day. That’s not how I roll. I usually download my statements in batches every few months. I didn’t want to have to open each one and rename it manually. Here’s where things get a little geeky.

Again, we’re calling on a couple external utilities from AppleScript to extract the date: pdftotext and awk.

The pdftotext utility isn’t part of the standard OS X installation, but rather a component of a larger, open source PDF viewer called Xpdf. Its function is to convert a PDF document into plaintext which, despite losing all of its formatting, makes is easily searchable and accessible to other standard command line utilities. Fortunately, Carsten Blüm has packaged up pdftotext in a handy OS X installer which makes it available to scripts within Hazel.

Unix gurus have been using the awk utility for years to process text files and print summary reports on the contents. For the purpose of our Hazel recipe, awk will locate the date string buried in the PDF and make the year and month components of the date available as custom tokens for use later in the recipe.

Both of these utilities come together in the second line of the AppleScript shown above and condensed to the command line equivalent here:

/usr/local/bin/pdftext [path to file] - | awk 'NR < 2 {print $1, $8}'

where the AppleScript inserts the actual file being examined for [path to file] when the script runs.

Renaming and filing

Once the statement date has been determined and the year and month have been saved as custom token, renaming the file to something that makes more sense is easy. Here’s how it looks when those custom tokens are applied in the renaming process.

Once it’s renamed, Hazel proceeds to move the file to the folder where I store my bank statements. The final bit of magic involves automatically sorting the statement into the folder named for the year by using Hazel’s sorting capability. Choose “Sort into subfolder” from the list of available actions and add the custom token for the statement year. With that in place Hazel will automatically move the statement to the subfolder for the correct year and even create a new “2013” folder next January when it processes my statement from that month.

Once you’ve built one of these types of recipes you’ll be able to duplicate it to address statements from other institutions. The trickiest part by far is extracting the date from the statement. No two statements are alike in that regard, and you’ll probably end up Googling awk recipes for a while. You’ll also need to modify your recipe if that institution makes changes to the format of the statement itself since it’s likely that your awk search will no longer work.

If you’ve made it this far and successfully created your own version of this recipe, congratulations! You now know enough about working with recipes in Hazel to be dangerous.

2 responses to Automated bank statement filing with Hazel

  1. Tim, Your are right about knowing enough to be dangerous. I use a Scansnap S1500M and Hazel in my paperless workflow. I was already using automated renaming, sorting and tagging of my files so I was excited to read what you wrote here. I would like to use an extension of it to change the “creation date” of my files so that they will sort appropriately by date when I browse for them in the finder or . I thought I could use the “touch” command in a shell script to apply the change of date and would like to include the day as well as month and year. What do you think? Do you see this working?

    • Bob, am I understanding correctly that you want to sort your files according to something other than alphabetically by filename? I put the date at the beginning of my files so they sort chronologically. It looks like ‘touch’ will allow you to set different modification and access times, but I’m not sure how that will help you.