Cascalog provides various helper operators in the cascalog.ops namespace. It is common to add (:require [cascalog.ops :as c]) to your namespace declaration, so you will frequently see references such as c/sum in sample code and documentation.
c/all is a filter equivalent to the boolean AND operator: every function passed to it must return true for the tuple to be kept. It can take multiple functions and multiple input fields.
(<- [!a !b] (nums !a !b) ((c/all #'even? #'big?) !a !b))
c/any is a filter equivalent to the boolean OR operator: the tuple is kept if any of the functions passed to it returns true.
(<- [!a !b] (nums !a !b) ((c/any #'even? #'big?) !a !b))
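As a rough mental model of the two filters above, the following plain-Clojure sketch combines predicates with AND and with OR. It does not use Cascalog at all: the helpers all-of and any-of, the predicates both-even? and big?, and the sample data are hypothetical, and the sketch assumes that every function is called with all of the input fields.

```clojure
;; Plain-Clojure sketch of the AND / OR filter semantics (not Cascalog code).
;; Assumption: every predicate receives all input fields as arguments.
(defn all-of [& preds]
  (fn [& fields]
    (every? #(boolean (apply % fields)) preds)))

(defn any-of [& preds]
  (fn [& fields]
    (boolean (some #(apply % fields) preds))))

;; Hypothetical predicates over two fields.
(defn both-even? [a b] (and (even? a) (even? b)))
(defn big? [a b] (> (+ a b) 100))

;; Hypothetical two-field tuples.
(def nums [[2 4] [60 80] [3 200] [1 5]])

;; Tuples for which every predicate holds.
(def kept-by-all (filter #(apply (all-of both-even? big?) %) nums))

;; Tuples for which at least one predicate holds.
(def kept-by-any (filter #(apply (any-of both-even? big?) %) nums))
```

Here kept-by-all keeps only [60 80], while kept-by-any keeps every tuple except [1 5].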
c/avg computes the average of the input variable.
(<- [?avg] (src ?user ?cnt) (c/avg ?cnt :> ?avg))
c/comp composes functions. The functions are applied from right to left.
(<- [!y] (nums !x) ((c/comp #'double #'exp) !x :> !y))
This is equivalent to:
(<- [!y] (nums !x) (#'exp !x :> !x1) (#'double !x1 :> !y))
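The same right-to-left order applies to clojure.core/comp, which makes it easy to check the ordering at a REPL without a Cascalog query. The sketch below is plain Clojure; twice and square are hypothetical stand-ins for the operations used in the example above.

```clojure
;; Plain-Clojure sketch: clojure.core/comp also applies right to left.
;; twice and square are hypothetical stand-ins, not Cascalog operations.
(defn twice [x] (* 2 x))
(defn square [x] (* x x))

(def composed (comp twice square))  ; square runs first, then twice

(def one-step (composed 3))         ; same as (twice (square 3))
(def two-step (twice (square 3)))   ; the expanded, step-by-step form
(def reversed ((comp square twice) 3)) ; twice runs first, then square
```

Swapping the argument order changes the result, which is why the expanded query applies the rightmost function first.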
c/!count takes one input variable and counts the number of non-null values for that variable. Null values are interpreted as 0 and non-null values as 1, and !count returns the sum of those interpreted values.
(<- [?count] (source !val) (c/!count !val :> ?count))
c/count is like !count, but counts values regardless of whether they are null.
(<- [?count] (source !val) (c/count !val :> ?count))
c/distinct-count is like count, but counts only distinct items. Null values are counted as one item.
(<- [?count] (source !val) (c/distinct-count !val :> ?count))
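To make the difference between the three counting operators concrete, here is a plain-Clojure sketch over a small in-memory sequence. It only mirrors the semantics described above and is not how Cascalog computes the aggregates; vals-seen and the other names are hypothetical.

```clojure
;; Plain-Clojure sketch of the three counting semantics (not Cascalog code).
;; vals-seen is hypothetical sample data containing nulls and a duplicate.
(def vals-seen [1 nil 2 2 nil])

;; Like !count: only non-null values contribute.
(def non-null-count (count (remove nil? vals-seen)))

;; Like count: every value contributes, null or not.
(def total-count (count vals-seen))

;; Like distinct-count: duplicates collapse, and nil counts as one item.
(def distinct-total (count (distinct vals-seen)))
```

For this input the three results are 3, 5, and 3 respectively.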
c/each applies the specified function to each of the input variables. If the function returns a value, the number of output variables must equal the number of inputs; otherwise there are no output variables.
((c/each #'double) ?a ?b ?c :> ?x ?y ?z)
This applies the function to each input separately, mapping ?a :> ?x, ?b :> ?y, and ?c :> ?z.
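The one-input-to-one-output pairing above behaves much like mapping a function over a fixed list of values. A minimal plain-Clojure sketch, with a hypothetical twice function and hypothetical sample values standing in for ?a, ?b, and ?c:

```clojure
;; Plain-Clojure sketch of the pairing (not Cascalog code).
;; twice is a hypothetical stand-in for the function applied to each input.
(defn twice [x] (* 2 x))

(def inputs [1 2 3])                 ; stand-ins for ?a ?b ?c
(def outputs (mapv twice inputs))    ; stand-ins for ?x ?y ?z

;; Each output lines up with the input in the same position.
(def pairs (map vector inputs outputs))
```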
c/first-n returns a subquery that yields the first n elements of another subquery. It accepts sorting arguments.
Say wordcount-tap is a subquery with fields [?word ?count] and we want to pull the top 100 words by count. Here’s how we do that with c/first-n:
(defn top-100 [file-path]
  (c/first-n (wordcount-tap file-path) 100
             :sort ["?count"]
             :reverse true))

(defmain Top100 [tuple-path results-path]
  (?- (hfs-textline results-path)
      (top-100 tuple-path)))
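The effect of sorting by count in reverse and keeping the first n rows can be pictured on an in-memory collection. The sketch below is plain Clojure rather than a Cascalog subquery; word-counts and first-n-by-count are hypothetical names introduced only for illustration.

```clojure
;; Plain-Clojure sketch of "sort by count, descending, keep the first n".
;; word-counts is hypothetical [word count] data, not a Cascalog tap.
(def word-counts [["the" 120] ["cascalog" 7] ["a" 95] ["query" 12]])

(defn first-n-by-count
  "Returns the n rows with the highest count, largest first."
  [rows n]
  (take n (sort-by second > rows)))

(def top-2 (first-n-by-count word-counts 2))
```

With this data, top-2 contains the rows for "the" and "a", in that order.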
c/limit is an efficient buffer that does most of its work in the mappers to return the top N tuples.
Some examples using the playground:
Get the top 3 integers:
(?<- (stdout) [?n-out] (integer ?n) (:sort ?n) (:reverse true) ((c/limit 3) ?n :> ?n-out))
Get at most one friend for each person:
(?<- (stdout) [?p ?f-out] (follows ?p ?p2) ((c/limit 1) ?p2 :> ?f-out))
Get 5 follows relationships:
(?<- (stdout) [?p-out ?p2-out] (follows ?p ?p2) ((c/limit 5) ?p ?p2 :> ?p-out ?p2-out))
c/limit-rank is like limit, but also emits the “rank” of each item (useful when sorting):
Get the top 3 integers with rank:
(?<- (stdout) [?n-out ?r] (integer ?n) (:sort ?n) (:reverse true) ((c/limit-rank 3) ?n :> ?n-out ?r))
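For intuition about what the sorted top-N queries above produce, the following plain-Clojure sketch takes the top three integers from a small collection and attaches a rank to each. It is not Cascalog code; the names are hypothetical, and the sketch assumes that ranks start at 1 for the first item in sort order.

```clojure
;; Plain-Clojure sketch of "top N" and "top N with rank" (not Cascalog code).
;; integers is hypothetical sample data.
(def integers [7 3 9 1 5])

(defn top-n
  "Returns the n largest values, largest first."
  [n xs]
  (take n (sort > xs)))

(defn top-n-with-rank
  "Pairs each of the n largest values with its rank.
   Assumption: ranks start at 1."
  [n xs]
  (map vector (top-n n xs) (iterate inc 1)))

(def top-3 (top-n 3 integers))
(def top-3-ranked (top-n-with-rank 3 integers))
```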