|<img src="http://www.factual.com/assets/press/large_color_logo_horizontal.png" width=250px />||
Factual is constantly aggregating and processing growing sets of data, and we find ourselves relying more and more on the Hadoop stack of technologies. Cascalog lets us abstract away the details of data sources (with taps, as in Cascading). More specifically, we use Cascalog to run our machine learning algorithms on billions of web pages and user-contributed data to aggregate factual data present in multiple sources. We also benefit from the ad-hoc nature of Cascalog when generating statistics across our datasets, verifying map-reduce job outputs, tracing the history of data through our processing pipeline, and running experimental data manipulations and transformations.
We're also benefiting from the availability of Clojure in Cascalog. Clojure is a natural fit when doing custom data manipulations, and it's also quite useful to use the REPL to experiment. Being able to "call out" to pure Clojure from our Cascalog queries has been a big win.
|<img src="http://www.premus2007.org/images/supporters/hsph.jpg" width=250px />||
At Harvard School of Public Health we use Cascalog to query large datasets generated by next-generation sequencing. We need an approach that facilitates rapid iterations of coding and testing for algorithm development work, and then scales to handle increasingly large data volumes. As a small group that works on many projects simultaneously, we need to be as efficient as possible since any development code could potentially become part of processing pipelines.
Cascalog makes coding for Hadoop much easier. This allows us to focus on the queries and data interpretation. It additionally increases the understandability of the code, which is essential for reproducibility and transparency. A detailed writeup of some of our work with Cascalog is available here.
|<img src="http://static.sl.lumosity.com/compiled/brochure_ware/pages/about/press/logos/images/lumosity_logo-0729e61868aef87b41cbacb2c4cf289f.png" width=250px />||
At Lumosity, we are committed to pioneering the understanding and enhancement of the human brain to give each person the power to unlock their full potential. Data analysis is an important part of our business, whether we are conducting new scientific studies to learn more about the human brain or analyzing user behavior on our site to optimize Lumosity and the training experience. Cascalog allows our Research & Development team to efficiently analyze our database of human cognitive performance (the largest in the world, with over 450 million data points) to gain new insights on cognitive training.
|<img src="http://si0.twimg.com/a/1310175040/images/logos/logo_twitter_withbird_1000_allblue.png" width=250px />||
Cascalog is at the core of Twitter's tools for publisher partners. A batch workflow written using Clojure and Cascalog updates a variety of ElephantDB views a few times a day. These views include time series aggregations, influence analysis, follower distribution analysis, and more. Additionally, Cascalog is used to vertically partition the dataset of more than 40TB in a few different ways to allow for efficient querying later on. Cascalog's conciseness and expressive power greatly reduce the complexity of our batch processing.
Cascalog is also used for ad-hoc querying and exploratory work, taking advantage of the ease of defining and running queries from the REPL. When a major event happens, we extract relevant tweets from the master datastore to a local computer where they can be analyzed in a quick iterative fashion.
|<img src="http://www.crunchbase.com/assets/images/resized/0000/1336/1336v1-max-250x250.png" width=250px />|
|<img src="http://www.crunchbase.com/assets/images/resized/0012/4488/124488v2-max-250x250.png" width=250px />||
Cascalog forms the core of Yieldbot's intent modeling and matching technology stack. Publisher data is fed at regular intervals through a batch workflow that performs a wide array of tasks, such as predictive modeling, text processing, and metrics aggregation.
Cascalog and Clojure allow us to develop, deploy, explore and iterate on our workflows with extreme speed and minimal effort. You can read about our experience migrating from Apache Pig to Cascalog here: Why Yieldbot Chose Cascalog over Pig for Hadoop Processing
REDD Metrics uses Cascalog at the heart of our large-scale deforestation monitoring system, currently housed at the Center for Global Development in Washington. We process hundreds of gigabytes of NASA satellite data down into concrete predictions on the likelihood that some piece of land will be deforested in the next month. Cascalog allows us to generate time series and perform analysis at a scale unimaginable with current "state of the art" practices. We look forward to open sourcing our work in the coming months. For updates, take a look at our blog.
uSwitch uses high-level data to make business decisions and drills down to the microscopic level to deliver a personalised experience to each of our customers. Cascalog sits at the heart of our modular data pipeline, transforming immutable event data to clean and extract customer features for the rest of the business. Furthermore, the logical and functional nature of Cascalog enables our small data team to build simple, composable data processing workflows at scale.
Let us know what was unclear or what has not been covered. Maybe you dislike the guide's style or grammar, or have spotted spelling mistakes. Reader feedback is key to making the documentation better.
This documentation site is open source and we welcome pull requests.