Apache Spark

Apache Spark is an open-source, in-memory distributed computing framework. You can use Spark for ETL, data analysis, machine learning, and graph processing over massive datasets. Spark supports both batch and stream processing, and it offers rich, practical high-level APIs for Scala, Python, Java, R, and SQL.

You could also describe Spark as a distributed data processing engine for batch and streaming modes, featuring SQL queries, graph processing, and machine learning.

In contrast to Hadoop’s two-stage disk-based MapReduce processing engine, Spark’s multi-stage in-memory computing engine allows for running most computations in memory, and hence very often provides better performance (there are reports of it being up to 100 times faster - read Spark officially sets a new record in large-scale sorting!) for certain applications, e.g. iterative algorithms or interactive data mining.
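
Much of that speed-up for iterative workloads comes from explicit caching. Below is a minimal sketch using the Scala RDD API (the data and iteration count are made up for illustration): cache() keeps the dataset in memory after the first pass, so later passes avoid recomputation and disk I/O.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

    // cache() keeps the RDD in memory after it is first computed,
    // so each later iteration reads from RAM instead of rebuilding
    // the RDD from its lineage
    val numbers = sc.parallelize(1 to 1000000).cache()

    var threshold = 0
    for (i <- 1 to 10) {
      // every pass over `numbers` after the first hits the in-memory cache
      val above = numbers.filter(_ > threshold).count()
      println(s"iteration $i: $above values above $threshold")
      threshold += 100000
    }

    sc.stop()
  }
}
```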

Spark aims at speed, ease of use, and interactive analytics.
Spark is often called a cluster computing engine or simply an execution engine.
Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called the Resilient Distributed Dataset (RDD).

Through its application frameworks, Spark simplifies access to machine learning and predictive analytics at scale.

Spark is mainly written in Scala, but supports other languages, namely Java, Python, and R.
If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is an alternative.
Spark can access any data type across any data source, and it addresses the huge demand for storage and data processing at scale.
The Apache Spark project is an umbrella for SQL (with DataFrames), streaming, machine learning (pipelines) and graph processing engines built atop Spark Core. You can run them all in a single application using a consistent API.
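
As a minimal illustration of that consistent API, here is a sketch assuming Spark 2.x or later, where SparkSession is the unified entry point (the sample data is made up): a single application can mix the DataFrame API with plain SQL over the same data.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    // One SparkSession is the entry point to SQL, DataFrames,
    // streaming, and the MLlib Pipelines API alike
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The DataFrame API and a SQL query operate on the same data
    val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```
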
Spark runs locally as well as in clusters, on-premises or in the cloud. It runs on top of Hadoop YARN or Apache Mesos, standalone, or in the cloud (Amazon EC2 or IBM Bluemix).
Spark can access data from many data sources.
Apache Spark’s Streaming and SQL programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.

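Which cluster manager to use is largely a configuration concern. As a minimal sketch (in practice the master URL is usually supplied via spark-submit's --master flag rather than hard-coded), the same application can target any of the supported deployment modes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploymentDemo {
  def main(args: Array[String]): Unit = {
    // Common master URLs:
    //   "local[*]"          - run locally, one worker thread per core
    //   "spark://host:7077" - a Spark standalone cluster
    //   "yarn"              - Hadoop YARN (requires HADOOP_CONF_DIR)
    //   "mesos://host:5050" - an Apache Mesos cluster
    val conf = new SparkConf()
      .setAppName("DeploymentDemo")
      .setMaster("local[*]") // swap for any of the URLs above

    val sc = new SparkContext(conf)
    println(s"Running on master: ${sc.master}")
    sc.stop()
  }
}
```
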
At a high level, any Spark application creates RDDs out of some input, runs (lazy) transformations of these RDDs into some other form (shape), and finally performs actions to collect or store data. Not much, huh?
You can look at Spark from a programmer’s, a data engineer’s, and an administrator’s point of view. And to be honest, all three types of people will spend quite a lot of their time with Spark before they reach the point where they exploit all the available features. Programmers use language-specific APIs (and work at the level of RDDs using transformations and actions), data engineers use higher-level abstractions like the DataFrames or Pipelines APIs or external tools (that connect to Spark), and all of it can run only because administrators set up Spark clusters to deploy Spark applications to.
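
The classic word count walks through that whole create-transform-act lifecycle in a few lines. This is a minimal, self-contained sketch using the Scala RDD API; the input is a hard-coded sequence purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    // 1. Create an RDD out of some input
    val lines = sc.parallelize(Seq("spark is fast", "spark is easy"))

    // 2. Lazy transformations: nothing executes here,
    //    Spark only records the lineage of the computation
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 3. An action: collect() triggers the actual job
    counts.collect().foreach(println)

    sc.stop()
  }
}
```

Note that nothing is computed until collect() is called; flatMap, map, and reduceByKey only build up the plan.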

It is Spark’s goal to be a general-purpose computing platform with various specialized application frameworks on top of a single unified engine.

You are welcome to follow sunbiaobiao's WeChat official account (微信公众号).