
Module C20
The “Data Science in R and Python on Spark” course trains future analysts, or “data scientists”, i.e. specialists in the analysis of large volumes of data. It introduces trainees to algorithms and techniques that can be put into practice quickly on the Spark platform.
This course is intended for people with a technical background (computer science, mathematics, physics, economics or any other field) who have at least some development experience in any programming language and are comfortable with mathematics at the level of the final year of the French scientific secondary-school track (Terminale S): vectors, matrices, probabilities, etc.
With very few prerequisites, it is a good way to enter the world of data science with the tools that statisticians favor.
Program
Day 1: learning R
- Install RStudio
- The data structures and instructions of the R language, with hands-on exercises
Days 2 and 3: Spark and the SparkR library
- Why Spark
- Install Spark
- The Spark shell
- Resilient Distributed Datasets (RDDs)
- Spark API (transformations, actions), with hands-on exercises in SparkR
- The notion of a DataFrame, with a hands-on exercise
- Transform a DataFrame into an SQL table and query it, with a hands-on exercise
- View the results of queries with RStudio
- Configure and optimize a Spark job
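
As a rough illustration of the "transformations, actions" item in the program for days 2 and 3, the sketch below imitates the idea in plain Python: transformations only record work to be done, and nothing is computed until an action asks for a result. This is not the Spark or SparkR API. The class LazyDataset, its methods and the helper traced_square are hypothetical names invented for this example, and the real platform additionally distributes the work across a cluster, which this toy does not attempt.

```python
# Plain-Python stand-in for the transformation/action split.
# NOT the Spark API: LazyDataset and traced_square are hypothetical names.


class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # pending transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Chain the recorded steps into one lazy iterator over the source.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a concrete result.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every element actually processed


def traced_square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(traced_square).filter(lambda v: v % 2 == 1)

calls_before_action = list(calls)  # still empty: no work has been done yet
result = pipeline.collect()        # the action runs the whole pipeline
print(calls_before_action, result)
```

Building the pipeline leaves calls empty, and only collect() causes traced_square to run. Deferring work in this way is what allows an engine to look at the whole chain of steps before executing it, which is one reason the distinction matters when configuring and optimizing a job.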