Apache Spark – A Deep Dive – Series 5 of N – Using Broadcasting
Data note · Why Broadcasting? To understand broadcasting, let's first do an exercise. Exercise: find the list of the most popular movies from the MovieLens data…
Data note · Why FlatMap instead of Map? The map function is a one-to-one mapping between the existing RDD and the new RDD, e.g.…
Data note · Notes: The data set for this exercise is from the National Centers for Environmental Information (NCEI) at http://www.ncdc.noaa.gov/data-access/quick-links. Click the link Global Historical Climatology…
Data note · Key-value based RDDs: In series 1 of N we processed an RDD which had only one value - movie rating -…
Data note · Windows: Install the JDK (Java Development Kit). Visit the Java site - http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. Select your environment (Windows x86 or x64). Accept the license and download…
Data note · What is Apache Spark: a data processing engine, much faster than MapReduce, that uses DAGs (Directed Acyclic Graphs) to optimize workflows. How does Apache…
Data note · The list is long, but a few are worth mentioning: Impala: Cloudera's alternative to Hortonworks' Hive; faster than Hive…
Data note · Apache NiFi: It is a data streaming and transformation tool. It has a nice web-based UI where we can configure the…
Ops note · As you can see, even this site on which I write my blog is powered by WordPress. It gives…
Data note · Why Flink: more scalable than Storm, up to 1000s of nodes (massive scale); more fault tolerant than Storm; maintains "state snapshots"…