Guide list

Cascalog documentation is organized into a number of guides covering a range of topics.

We recommend that you read these guides, if possible, in this order:

Getting started

An overview of Cascalog with a quick tutorial that helps you get started. It should take about 30 minutes to read and try the provided code examples.

Understanding Cascalog

Operations

Joins

  • Inner
  • Outer
  • Cross

Running on a cluster

  • Developing and deploying a Cascalog query on a Hadoop cluster

Misc.

Testing and Debugging

  • Testing Cascalog with Midje, part 1
  • Testing Cascalog with Midje, part 2
  • Troubleshooting

Upgrading from 1.x to 2.x

Cascalog for the Impatient

  • This guide is a set of progressive coding examples that start with a simple file copy and build up to a MapReduce implementation of the TF-IDF algorithm.

Real Code Examples

  • Cascading plus City of Palo Alto open data
  • Forest Monitoring for Action project
  • CDEC Open Health Data Platform

Blog posts from around the web

  • Why Yieldbot chose Cascalog over Pig for Hadoop processing
  • Next-gen sequencing variation statistics with Hadoop using Cascalog
  • Cascalog made easier
  • Using Cascalog for ETL
  • Hardcore Cascalog: Dynamic Queries

List of companies using Cascalog

Help improve this site

Let us know what was unclear or what has not been covered. Perhaps you dislike the guide style, or you have found grammar or spelling mistakes. Reader feedback is key to making the documentation better.

This documentation site is open source and we welcome pull requests.