Apache Spark

Big Data Çağrı ŞİŞMAN 27.2.2017

In this article, I will explain what Apache Spark is and how you can use it.

Apache Spark is a distributed framework for processing big data. It is commonly used for batch and real-time analytics on large datasets. In addition, Spark supports streaming analytics, machine learning, and graph processing. The image below shows the Spark components.


SPARK SQL: Enables users to query data from multiple sources using the common SQL language.

SPARK STREAMING: Enables users to process and analyze live data streams in near real time.

MACHINE LEARNING: Enables users to build applications using algorithms such as Naive Bayes, decision trees, and ALS (alternating least squares), which are very useful for recommendation and prediction engines.

GraphX: Provides an API for expressing graph computation.

SPARK CORE: Provides distributed task dispatching, scheduling, and basic I/O functionality.
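
To make the idea of task dispatching more concrete, here is a small sketch that uses only the Python standard library. It is not Spark's API, and it runs on a single machine. The function names (split_into_partitions, count_words, word_count) are made up for illustration. It only mimics the pattern that Spark Core applies across a cluster: split the data into partitions, run one task per partition, and merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Split a list into roughly equal chunks, one chunk per task."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """The task that runs on one partition: count the words in its lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # Dispatch one task per partition to a pool of workers.
    # (Spark would send these tasks to executors on different machines.)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Merge the per-partition results into one final result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


lines = ["spark runs on yarn", "spark runs on mesos", "spark streaming"]
result = word_count(lines)
print(result.most_common(3))
```

In real Spark, the scheduler also decides which machine runs each task and retries failed tasks. This sketch leaves all of that out.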

Spark is written in Scala, but it also supports Java and Python, so you can easily write applications in any of these languages. Spark can run in its own standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.
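
In practice, you choose the cluster manager through the "master" URL that you give to your application. The helper below is a hypothetical function of my own, not part of Spark. It only shows the URL format each cluster manager expects. The host name is a placeholder, and the ports are the usual defaults (7077 for a standalone master, 5050 for Mesos), so check them against your own cluster.

```python
def master_url(manager, host=None, port=None):
    """Build the Spark master URL for a cluster manager (hypothetical helper)."""
    if manager == "standalone":
        # Spark's own cluster manager; 7077 is the usual default port.
        return "spark://{}:{}".format(host, port or 7077)
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration,
        # so no host or port appears in the URL.
        return "yarn"
    if manager == "mesos":
        # 5050 is the usual default port of a Mesos master.
        return "mesos://{}:{}".format(host, port or 5050)
    raise ValueError("unknown cluster manager: {}".format(manager))


# "master.example.com" is a placeholder host name.
for manager in ("standalone", "yarn", "mesos"):
    print(manager, "->", master_url(manager, host="master.example.com"))
```

The resulting string is what you pass to the --master option of spark-submit, or to SparkConf.setMaster in your application code.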

You can write ETL applications using Spark SQL. You can analyze real-time data using Spark Streaming. If you want to develop a recommendation engine or predictive services, you can use the machine learning library (MLlib) to train models on your big data.