Setting a Hive table's rawDataSize = 1 (so Spark SQL will broadcast it)
14: jdbc:hive2://ark3:9994> alter table app_uuid_info_test set tblproperties(rawDataSize=1);
Statistics for HBase-backed tables are never refreshed automatically, so the value has to be set by hand.
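To confirm what the metastore now reports, the property can be read back from a Spark session (a minimal sketch, assuming a Spark 2.x SparkSession named spark with Hive support):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Spark's size estimate for a Hive table falls back to the totalSize /
// rawDataSize table properties, so this should now show rawDataSize = 1.
spark.sql("SHOW TBLPROPERTIES app_uuid_info_test").show(truncate = false)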
The decision is governed by this threshold (in bytes):
set spark.sql.autoBroadcastJoinThreshold=100000000;
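The same setting from code (a one-line sketch against a live SparkSession; 100000000 bytes is roughly 95 MB):

// -1 disables auto-broadcast entirely; any positive value is a size cap in bytes.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100000000")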
The query then joins the fact table against the HBase-backed table directly (left outer join app_uuid_info_test b):
select a.tmp_id, a.uuid, a.eguan_id, a.device_id, b.row_key, b.value.int_20
from tmp.ods_app_hour a
left outer join app_uuid_info_test b on a.tmp_id = b.row_key
where a.app_id = '6069'
  and a.day = '20180303'
  and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7'
limit 100;
During execution this does come out as a map join (broadcast join).
The logic that decides whether a broadcast join is used lives in org.apache.spark.sql.execution.SparkStrategies:
object CanBroadcast {
  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
    // An explicit broadcast hint always wins.
    case BroadcastHint(p) => Some(p)
    // Otherwise broadcast when the estimated plan size is within the threshold.
    case p if sqlContext.conf.autoBroadcastJoinThreshold > 0 &&
      p.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold =>
      Some(p)
    case _ => None
  }
}
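To see the estimate the planner compares against the threshold, read the statistics off the optimized plan (a sketch, assuming Spark 2.1/2.2 where LogicalPlan exposes statistics; in 2.3+ the field is named stats):

val df = spark.sql("select * from app_uuid_info_test")
// sizeInBytes is derived from the table statistics; after forcing
// rawDataSize = 1 it should fall below autoBroadcastJoinThreshold.
println(df.queryExecution.optimizedPlan.statistics.sizeInBytes)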
== Parsed Logical Plan ==
'GlobalLimit 100
+- 'LocalLimit 100
+- 'Project ['a.tmp_id, 'a.uuid, 'a.eguan_id, 'a.device_id, 'b.row_key, 'b.value.int_20]
+- 'Filter ((('a.app_id = 6069) && ('a.day = 20180303)) && ('a.tmp_id = 1a16d393e7042213384f994394b763d37121d7))
+- 'Join LeftOuter, ('a.tmp_id = 'b.row_key)
:- 'UnresolvedRelation `tmp`.`ods_app_hour`, a
+- 'UnresolvedRelation `app_uuid_info_test`, b
== Analyzed Logical Plan ==
tmp_id: string, uuid: bigint, eguan_id: string, device_id: string, row_key: string, int_20: string
GlobalLimit 100
+- LocalLimit 100
+- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
+- Filter (((app_id#22014 = 6069) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
+- Join LeftOuter, (tmp_id#22017 = row_key#22195)
:- SubqueryAlias a
: +- SubqueryAlias ods_app_hour
: +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
+- SubqueryAlias b
+- MetastoreRelation default, app_uuid_info_test
== Optimized Logical Plan ==
GlobalLimit 100
+- LocalLimit 100
+- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
+- Join LeftOuter, (tmp_id#22017 = row_key#22195)
:- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
: +- Filter (((((isnotnull(app_id#22014) && isnotnull(day#22015)) && isnotnull(tmp_id#22017)) && (app_id#22014 = 6069)) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
: +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
+- MetastoreRelation default, app_uuid_info_test
== Physical Plan ==
CollectLimit 100
+- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
+- *BroadcastHashJoin [tmp_id#22017], [row_key#22195], LeftOuter, BuildRight
:- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
: +- *Filter (isnotnull(tmp_id#22017) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
: +- *FileScan parquet tmp.ods_app_hour[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,app_id#22014,day#22015,hour#22016] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#22014), isnotnull(day#22015), (app_id#22014 = 6069), (day#22015 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- HiveTableScan [row_key#22195, value#22196], MetastoreRelation default, app_uuid_info_test
2018-03-04 19:45:47,554 INFO [dispatcher-event-loop-10] storage.BlockManagerInfo: Removed broadcast_173_piece0 on 192.168.220.171:38696 in memory (size: 33.7 KB, free: 334.5 MB)
Dimension-table rewrite: pre-filter the dimension table in a subquery so that only the needed rows are broadcast.
select a.tmp_id, a.uuid, a.eguan_id, a.device_id, b.row_key, b.int_20
from tmp.ods_app_hour a
left outer join (
  select row_key, value.int_20 as int_20
  from app_uuid_info_test
  where row_key in (01, 02)
) b on a.tmp_id = b.row_key
where a.app_id = '6069'
  and a.day = '20180303'
  and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7'
limit 100;
== Physical Plan ==
CollectLimit 100
+- *BroadcastHashJoin [tmp_id#23309], [row_key#23487], LeftOuter, BuildRight
:- *LocalLimit 100
: +- *Project [tmp_id#23309, uuid#23310L, eguan_id#23311, device_id#23312]
: +- *Filter (isnotnull(tmp_id#23309) && (tmp_id#23309 = 1a16d393e7042213384f994394b763d37121d7))
: +- *FileScan parquet tmp.ods_app_hour[tmp_id#23309,uuid#23310L,eguan_id#23311,device_id#23312,app_id#23306,day#23307,hour#23308] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#23306), isnotnull(day#23307), (app_id#23306 = 6069), (day#23307 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *Project [row_key#23487, value#23488[int_20] AS int_20#22932]
+- *Filter row_key#23487 IN (1,2)
+- HiveTableScan [row_key#23487, value#23488], MetastoreRelation default, app_uuid_info_test
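The same rewrite through the DataFrame API with an explicit broadcast hint (a hedged sketch; quoting the keys as strings avoids the implicit cast to int that the plan above shows as IN (1,2)):

import org.apache.spark.sql.functions.broadcast

// Pre-filter and project the dimension table so only the needed rows
// and columns are shipped to the executors.
val dim = spark.table("app_uuid_info_test")
  .where("row_key in ('01', '02')")
  .selectExpr("row_key", "value.int_20 as int_20")

val fact = spark.table("tmp.ods_app_hour")
  .where("app_id = '6069' and day = '20180303' and tmp_id = '1a16d393e7042213384f994394b763d37121d7'")
  .select("tmp_id", "uuid", "eguan_id", "device_id")

// broadcast() forces the build side regardless of the size estimate.
fact.join(broadcast(dim), fact("tmp_id") === dim("row_key"), "left_outer")
  .limit(100)
  .show()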