Catching data errors with traps

You can use Cascading traps with Cascalog to capture tuples whose processing fails. To store those tuples in a sink tap (for example a local file via lfs-textline, or an HDFS file via hfs-textline), add the :trap keyword with an error sink to the query:

(def errors (lfs-textline "file:///tmp/people.bad_records" :sinkmode :replace)) 
;; or (stdout) or (hfs-textline "hdfs:///tmp/...") if running on Hadoop

(<- [?name ?age]
    (people ?name ?age)
    (:trap errors)
    (< ?age 40))


You can use the functions and macros from the midje.cascalog namespace together with clojure.test to test your queries. See Cascalog’s own tests for examples.

Those tests use, for example, fact?- to execute a query and compare its output with the expected tuples, or a form such as (facts query => (produces [[3 10] [1 5] [5 11]])), where query is defined with (def query (<- ...)). Read Sam Ritchie’s blog post Cascalog Testing 2.0 for more details and examples of midje-cascalog 0.4.0.
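As a rough illustration of the two styles mentioned above, here is a minimal sketch. It assumes Cascalog and midje-cascalog 0.4.0 are on the classpath; the namespace example.people-test, the people data and the young-people query are hypothetical names made up for this example, not taken from Cascalog's own tests.

```clojure
(ns example.people-test               ; hypothetical namespace
  (:use cascalog.api
        [midje sweet cascalog]))

;; Hypothetical in-memory source: a plain vector of tuples.
(def people [["ben" 21] ["jim" 42]])

;; Hypothetical query under test: people younger than 40.
(def young-people
  (<- [?name ?age]
      (people ?name ?age)
      (< ?age 40)))

;; fact?- executes the query and compares its output
;; with the expected tuples.
(fact?- [["ben" 21]] young-people)

;; The same check written with Midje's arrow syntax
;; and the produces checker.
(fact young-people => (produces [["ben" 21]]))
```

Both forms run the query when the test namespace is loaded, so they behave like any other Midje fact.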

Live coding

There are certain features that support live, interactive coding:

  • Use simple Clojure collections as data sources, for example (def people [["ben" 21] ["jim" 42]]).
  • During development you can easily turn parts of your Cascalog code into standard Clojure functions and call them from the REPL, for example by replacing defaggregateop with defn in the definition of a custom operator.
  • Queries can, of course, be executed from the REPL.
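The defaggregateop-to-defn swap from the list above might look like the following sketch. The operator name sum-agg and its body are hypothetical, chosen only for illustration; the Cascalog form is left in a comment so that the snippet runs in a plain Clojure REPL without Cascalog loaded.

```clojure
;; Original Cascalog form (requires cascalog.api):
;; (defaggregateop sum-agg
;;   ([] 0)                 ; initial state
;;   ([acc x] (+ acc x))    ; fold one value into the state
;;   ([acc] [acc]))         ; emit the result tuple

;; The same body as an ordinary function for REPL experiments.
(defn sum-agg
  ([] 0)
  ([acc x] (+ acc x))
  ([acc] [acc]))

;; Drive it by hand: start from the initial state, fold in
;; some values, then call the one-argument arity to finish.
(sum-agg (reduce sum-agg (sum-agg) [1 2 3]))
```

Once the logic behaves as expected, changing defn back to defaggregateop restores the operator for use inside queries.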

Help improve this site

Let us know what was unclear or what has not been covered. Maybe you do not like the guide's style or grammar, or you have found spelling mistakes. Reader feedback is key to making the documentation better.

This documentation site is open source and we welcome pull requests.