If your cluster features Hadoop Security your queries may run into exceptions like this one:
org.apache.hadoop.ipc.RemoteException: token (...) can't be found in cache
That exception fails the second step in any multi-step Cascalog (or Cascading for that regard) query. Reason is, the Kerberos token gets cancelled after the first step succeeded.
A solution to this is to configure JobConf with mapreduce.job.complete.cancel.delegation.tokens
set to false
, like so:
(with-job-conf {"mapreduce.job.complete.cancel.delegation.tokens" false}
...)
Or add it to your job-conf.clj
.
Also, if you happen to schedule your Cascalog jobs via Oozie, you may want to google for HADOOP_TOKEN_FILE_LOCATION and mapreduce.job.credentials.binary and set your jobconf accordingly.
Let us know what was unclear or what has not been covered. Maybe you do not like the guide style or grammar or discover spelling mistakes. Reader feedback is key to making the documentation better.
This documentation site is open source and we welcome pull requests.