Spark's "One Stack to Rule Them All" philosophy shows clearly in Spark SQL. In a traditional Hadoop-based stack, you would have to install Pig or Hive separately to run SQL-like ad-hoc queries; with Spark, the capability is built in.
This article uses the interactive Spark Shell to take a quick tour of Spark's SQL-like query capability.
Uploading the Data to HDFS
First, upload the test data to HDFS. The test data used here is the people.txt file that ships with the Spark distribution, located at spark-1.2.0-bin-hadoop2.4\examples\src\main\resources\people.txt. Its contents are:
Michael, 29
Andy, 30
Justin, 19
Upload people.txt to HDFS with the following command (people.txt has already been copied to the current directory):
hdfs dfs -put people.txt /user/hadoop
Working in the Spark Shell
1. Create a SQLContext object
scala> val cxt = new org.apache.spark.sql.SQLContext(sc)
cxt: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@ab552b0
2. Import the implicit conversions used to turn an RDD into a SchemaRDD
scala> import cxt._
import cxt._
3. Define a Person case class
scala> case class Person(name: String, age: Int)
defined class Person
4. Read the data from HDFS and map each record to a Person
scala> val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0),p(1).trim.toInt))
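The per-record parsing done inside the map above can be checked outside Spark. A minimal sketch (the `parse` helper is hypothetical; the Person case class mirrors the one defined in step 3):

```scala
// Standalone check of the line-parsing logic: split on the comma,
// then trim the leading space before converting the age to Int.
case class Person(name: String, age: Int)

def parse(line: String): Person = {
  val p = line.split(",")        // "Michael, 29" -> Array("Michael", " 29")
  Person(p(0), p(1).trim.toInt)  // trim removes the space before the age
}

val people = Seq("Michael, 29", "Andy, 30", "Justin, 19").map(parse)
// people: List(Person(Michael,29), Person(Andy,30), Person(Justin,19))
```

Without the `.trim`, `" 29".toInt` would throw a NumberFormatException, which is why the original map calls `p(1).trim.toInt`.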
5. Inspect the lineage of the people RDD
scala> people.toDebugString
15/01/03 06:25:17 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String =
(1) MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []
6. Register the people RDD as a virtual table named People
scala> people.registerAsTable("People")
Inspecting the lineage of people again at this point gives the same result as step 5:
scala> people.toDebugString
res2: String =
(1) MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []
7. Query the People table and inspect the query plan and physical plan
scala> val teenagers = cxt.sql("select name from People where age < 20 and age > 10")
teenagers: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 Filter ((age#1 < 20) && (age#1 > 10))
  PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36

scala> teenagers.toDebugString
res3: String =
(1) SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 Filter ((age#1 < 20) && (age#1 > 10))
  PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36 []
 |  MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:43 []
 |  MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:58 []
 |  MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36 []
 |  MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []
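The Project-over-Filter plan above has the same semantics as an ordinary filter-then-map over the parsed records. A minimal Spark-free sketch, assuming the same Person case class and data:

```scala
// Plain-Scala equivalent of:
//   select name from People where age < 20 and age > 10
case class Person(name: String, age: Int)

val people = Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))

val teenagers = people
  .filter(p => p.age < 20 && p.age > 10)  // the Filter operator in the plan
  .map(_.name)                            // the Project operator in the plan
// teenagers: List(Justin)
```

In the physical plan, Filter runs before Project so rows are discarded before the name column is extracted, mirroring the order of the two collection operations here.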
8. Submit the query job and print the result
scala> teenagers.map(t => "Name:" + t(0)).collect().foreach(println)
Name:Justin
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started