In practice:
trait CountingSketcher[S] {
  def emptySketch: S
  def update(a: S, s: String): S
  def union(a: S, b: S): S
  def getEstimate(a: S): Double
  def serialize(a: S): Array[Byte]
  def deserialize(xs: Array[Byte]): S
}
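These operations compose nicely: sketches can be built independently over chunks of data, merged with union, and estimated once at the end. The helper below is purely illustrative (its name and the chunking are not part of any library).

// Illustrative only: fold each chunk into its own sketch, merge, estimate.
def approxDistinct[S](sk: CountingSketcher[S])(chunks: Seq[Seq[String]]): Double = {
  val perChunk = chunks.map(c => c.foldLeft(sk.emptySketch)((acc, s) => sk.update(acc, s)))
  val merged = perChunk.foldLeft(sk.emptySketch)((a, b) => sk.union(a, b))
  sk.getEstimate(merged)
}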
In order to build unique counts with a sketching algorithm, we proceed in two stages.
Daily aggregation: take web browsing events of the form (timestamp, user id, country, device type, url tokens), aggregate them, and store one row per (date, country, url tokens, sketch of user ids).
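As a rough illustration only (the field names below are assumptions, not the real schema), those two shapes correspond to record types like:

import java.sql.{Date, Timestamp}

case class BrowsingEvent(ts: Timestamp, userId: String, country: String,
                         deviceType: String, urlTokens: String)

case class DailyUrlSketch(dt: Date, country: String, urlTokens: String,
                          userSketch: Array[Byte]) // serialized sketch of user ids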
Further, on-demand aggregation: given a keyword such as laptop, filter the stored rows on url tokens, union the matching sketches and read off the number of unique users.
There are approx_count_distinct() methods in many databases: Postgres, Apache Spark, Redis...
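For instance, Spark ships approx_count_distinct() (HyperLogLog++ based) as a built-in aggregate; applied to a DataFrame of the raw events (called events, as later in this post) it might look like the snippet below. Note that it returns a plain number, not a mergeable sketch, which is the limitation the stored sketch column avoids.

import org.apache.spark.sql.functions.{approx_count_distinct, to_date}

// Built-in alternative: per-group approximate distinct counts.
// The result cannot be re-aggregated across groups afterwards.
val dailyCounts = events
  .groupBy(to_date($"ts") as "dt", $"country", $"url_tokens")
  .agg(approx_count_distinct($"user_id") as "unique_users")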
A concrete buffer type for our own sketch:
case class Theta(theta: Double, hashes: Set[Long])
Theta acts as an instance of S.
hashes stores a few of the smallest hash values.
theta starts at 1.0 and lowers as we keep adding values to the sketch. It acts as a threshold: hash values at or above it are not added to hashes.
class Sketcher extends CountingSketcher[Theta] {
  val emptySketch = Theta(1.0, Set.empty[Long])

  def update(a: Theta, s: String): Theta = {
    // A manipulation on the hash bits of s that decides whether:
    // 1. The value of a.theta needs to be lowered.
    // 2. A new element needs to be added to a.hashes.
    // 3. Hashes at or above a.theta are discarded.
    ??? // actual bit manipulation elided
  }

  def union(a: Theta, b: Theta): Theta = {
    val th = math.min(a.theta, b.theta)
    val hs = (a.hashes ++ b.hashes).filter(_ < th)
    Theta(th, hs)
  }

  def getEstimate(a: Theta): Double = a.hashes.size.toDouble / a.theta

  // Byte-level encoding elided here.
  def serialize(a: Theta): Array[Byte] = ???
  def deserialize(xs: Array[Byte]): Theta = ???
}
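To make the update step more concrete, here is a small self-contained KMV-style variant of the same idea. Everything in it (KmvSketch, k, hash01) is hypothetical and simplified; it is not the exact algorithm of production libraries such as Apache DataSketches.

import scala.util.hashing.MurmurHash3

case class KmvSketch(theta: Double, hashes: Set[Double])

object KmvSketch {
  val k = 1024                              // nominal number of retained hashes
  val empty = KmvSketch(1.0, Set.empty[Double])

  // Map a string to a pseudo-uniform value in [0, 1].
  def hash01(s: String): Double =
    (MurmurHash3.stringHash(s) & 0x7fffffff).toDouble / Int.MaxValue

  def update(a: KmvSketch, s: String): KmvSketch = {
    val h = hash01(s)
    if (h >= a.theta) a                     // at or above the threshold: ignore
    else {
      val hs = a.hashes + h
      if (hs.size <= k) a.copy(hashes = hs) // still room: theta stays put
      else {
        // Too many retained values: lower theta to the largest retained hash
        // and drop everything at or above the new threshold.
        val th = hs.toSeq.sorted.apply(k)
        KmvSketch(th, hs.filter(_ < th))
      }
    }
  }

  def estimate(a: KmvSketch): Double = a.hashes.size / a.theta
}

The estimator has the same shape as getEstimate above: the number of retained hashes divided by the fraction of the hash space that was kept.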
Spark provides a base class org.apache.spark.sql.expressions.Aggregator
that, for practical purposes, looks as follows:
import org.apache.spark.sql.{Encoder, Encoders}

abstract class Aggregator[-IN, BUF, OUT] {
  def zero: BUF
  def reduce(buffer: BUF, value: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
  // Will leave these out of the discussion
  def bufferEncoder: Encoder[BUF]
  def outputEncoder: Encoder[OUT]
}
Now we can define an aggregator that turns a column of string values into a BinaryType column of serialized sketches:
import org.apache.spark.sql.expressions.Aggregator

val sk = new Sketcher

object SketchPreaggregator
    extends Aggregator[String, Theta, Array[Byte]] {
  def zero: Theta = sk.emptySketch
  def reduce(b: Theta, a: String): Theta = sk.update(b, a)
  def merge(b1: Theta, b2: Theta): Theta = sk.union(b1, b2)
  def finish(reduction: Theta): Array[Byte] = sk.serialize(reduction)
}
Next, we can further aggregate a column of sketches via:
object SketchAggregator
    extends Aggregator[Array[Byte], Theta, Array[Byte]] {
  def zero: Theta = sk.emptySketch
  def reduce(b: Theta, a: Array[Byte]): Theta = sk.union(b, sk.deserialize(a))
  def merge(b1: Theta, b2: Theta): Theta = sk.union(b1, b2)
  def finish(reduction: Theta): Array[Byte] = sk.serialize(reduction)
}
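Both objects also need the two encoders that were left out of the Aggregator excerpt above, otherwise they will not compile. One minimal option (Kryo for the Theta buffer is just one possible choice), using the Encoder and Encoders imported earlier, is to add:

// Inside both SketchPreaggregator and SketchAggregator:
def bufferEncoder: Encoder[Theta] = Encoders.kryo[Theta]
def outputEncoder: Encoder[Array[Byte]] = Encoders.BINARY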
Finally, we can register some UDFs and UDAFs based on the above and use them on DataFrames.
import org.apache.spark.sql.functions.{to_date, udaf, udf}
import spark.implicits._ // for the $"..." column syntax, assuming a SparkSession named spark

val stringsToSketch = udaf(SketchPreaggregator)
val aggSketches = udaf(SketchAggregator)
val getEstimate = udf((xs: Array[Byte]) => sk.getEstimate(sk.deserialize(xs)))
val urlSketches = events
  .groupBy(to_date($"ts") as "dt", $"country", $"url_tokens")
  .agg(stringsToSketch($"user_id") as "sketch")

val estimates = urlSketches
  .filter($"url_tokens" like "%laptop%")
  .agg(aggSketches($"sketch") as "sketch")
  .withColumn("unique_users", getEstimate($"sketch"))
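If the same functionality is also needed from Spark SQL, the functions can be registered by name (the names below are arbitrary, and spark is again the active SparkSession):

spark.udf.register("strings_to_sketch", stringsToSketch)
spark.udf.register("agg_sketches", aggSketches)
spark.udf.register("sketch_estimate", getEstimate)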
Previous approach:
Current development:
The goal is to report the number of users in common for each pair of segments defined on our Platform, on a daily basis.
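Using only the operations already defined on CountingSketcher, one simple way to approximate the overlap of two segment sketches is inclusion-exclusion; this is an illustration only, since real sketch libraries also offer dedicated intersection operations.

// |A ∩ B| ≈ estimate(A) + estimate(B) - estimate(A ∪ B)
def approxOverlap[S](sk: CountingSketcher[S])(a: S, b: S): Double =
  sk.getEstimate(a) + sk.getEstimate(b) - sk.getEstimate(sk.union(a, b))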
A few links: