
Spark Use Case: Titanic Data Analysis [Translation]

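Everyone knows about the Titanic disaster of April 14, 1912, in which human error led to catastrophe. Many questions remain open, for example how a 46,000-ton ship sank to a depth of 13,000 feet in just three hours.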

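This article does not analyze why the ship sank. Instead, it works with a publicly available dataset whose fields are described below.

With this dataset we run a few analyses, such as the average age of the men and women who died and the number of men and women who died in each passenger class.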

Dataset description

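Column 1: PassengerId
Column 2: Survived (0 = survived, 1 = died, as described in this article; the widely distributed Kaggle version of the dataset uses the opposite encoding, 0 = died and 1 = survived, so check your copy)
Column 3: Pclass
Column 4: Name
Column 5: Sex
Column 6: Age
Column 7: SibSp
Column 8: Parch
Column 9: Ticket
Column 10: Fare
Column 11: Cabin
Column 12: Embarked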
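As a minimal sketch of how the column positions above map to zero-based array indices once a line is split on commas, the snippet below parses one record into a small case class. The `Passenger` class, the `parse` helper, and the sample line are hypothetical illustrations and are not part of the original article or the dataset. The sketch also assumes that names contain no embedded commas.

```scala
// Hypothetical record type covering only the columns used in this article.
final case class Passenger(survived: String, pclass: String, sex: String, age: String)

object ParseSketch {
  // Column N in the list above lives at index N - 1 after split(",").
  def parse(line: String): Option[Passenger] = {
    val cols = line.split(",")
    if (cols.length >= 6) Some(Passenger(cols(1), cols(2), cols(4), cols(5)))
    else None // too few fields: skip instead of throwing ArrayIndexOutOfBoundsException
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line in the 12-column layout; not taken from the real data.
    val sample = "1,1,3,John Doe,male,22,1,0,A/5 21171,7.25,,S"
    println(parse(sample))
    println(parse("1,1,3"))
  }
}
```

Returning an `Option` keeps malformed lines out of later steps without an explicit length check at every use.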

Download the dataset here.

Problem statement 1

Find the average age of the passengers who died, grouped by sex.

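// Create an RDD from the data file on HDFS
val textFile = sc.textFile("hdfs://localhost:9000/TitanicData.txt")

// Keep only lines with at least 6 comma-separated fields, so that indices up to x(5)
// are safe and no ArrayIndexOutOfBoundsException is thrown, then split each line into fields
val split = textFile.map(_.split(",")).filter(_.length >= 6)

// Filter on two columns. Column 2 tells whether the passenger died (0 = survived, 1 = died).
// Column 6 must be a whole number, which is checked with the regular expression \d+
// (the backslash has to be escaped inside a Scala string literal).
// For matching rows, emit column 5 (sex) as the key and column 6 (age) converted to Int as the value.
val key_value = split.filter(x => x(1) == "1" && x(5).matches("\\d+")).map(x => (x(4), x(5).toInt))

// Compute the averages: pair each age with a count of 1, use reduceByKey to add up
// the (sum, count) pairs per key, then use mapValues to divide the sum by the count.
key_value.mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues { case (sum, count) => sum.toDouble / count }.collectAsMap()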

Output:

male -> 28.78409090909091, female -> 29.11855670103093
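The same filter and sum/count pattern can be tried without a cluster. The sketch below mirrors the steps above on an in-memory Scala collection instead of an RDD. The rows are made-up values rather than real passenger records, and `averageAgeBySex` is a hypothetical helper name. It also illustrates that `\d+` keeps only whole-number ages, so rows with a fractional or missing age are dropped before averaging.

```scala
object AverageAgeSketch {
  // Rows are (survived flag, sex, age) as strings, mimicking columns 2, 5 and 6.
  def averageAgeBySex(rows: Seq[(String, String, String)]): Map[String, Double] =
    rows
      .filter { case (flag, _, age) => flag == "1" && age.matches("\\d+") }
      .map { case (_, sex, age) => (sex, (age.toInt, 1)) } // seed each row as (sum, count)
      .groupBy(_._1)
      .map { case (sex, pairs) =>
        // Add up the (sum, count) pairs for this key, as reduceByKey does per key.
        val (sum, count) = pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        sex -> sum.toDouble / count
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, not taken from the dataset.
    val rows = Seq(
      ("1", "male", "20"), ("1", "male", "30"),
      ("1", "female", "40"), ("0", "female", "10"),
      ("1", "female", "45.5"), ("1", "male", "")
    )
    println(averageAgeBySex(rows))
  }
}
```

Carrying a (sum, count) pair through the reduction is what lets the average be computed in a single pass; averaging partial averages would give the wrong result when groups differ in size.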

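Problem statement 2

The Titanic had three classes of cabins. This time we count the passengers who died or survived in each class, broken down by sex and age.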

// Create an RDD from the data file on HDFS
val textFile = sc.textFile("hdfs://localhost:9000/TitanicData.txt")

// Keep only lines with at least 6 comma-separated fields, then split each line into fields
val split = textFile.map(_.split(",")).filter(_.length >= 6)

// Build a key from columns 2, 5, 6 and 3 (survival flag, sex, age, passenger class),
// pair it with 1, and sum the counts per key
val count = split.map(x => (x(1) + " " + x(4) + " " + x(5) + " " + x(2), 1)).reduceByKey(_ + _).collect

Output:

count: Array[(String, Int)] = Array((1 female 31 1,2), (0 male 45.5 3,1), (0 male 40 3,4), (0 male 20 3,10), (1 male 32 3,5), (1 female 15 3,3), (1 female 45 2,2), (0 male 52 2,2), (0 male 51 1,1), (1 male 29 3,3), (1 male 56 1,1), (1 female 41 2,1), (1 female 26 1,1), (1 male 27 1,3), (0 male 33 3,7), (1 female 24 3,3), (1 male 18 3,1), (0 female 21 3,3), (0 female 45 3,3), (0 female 26 2,1), (0 male 15 3,1), (0 male 17 3,6), (0 male 32 2,2), (0 male 28 1,2), (0 male 70.5 3,1), (0 male 62 1,2), (1 female 51 1,1), (1 female 36 2,3), (1 female 34 2,4), (0 female 47 3,1), (0 male 23 2,6), (0 female 30 3,2), (1 male 45 3,1), (1 female 19 1,3), (0 male 30 2,5), (0 male 25 2,5), (1 female 44 1,2), (0 female 29 3,2), (0 male 21 2,3), (0 male 55 1,1), (0 male 28 3,10), (1 male 21 3,1), (0 male...
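Because the key packs four fields into one string, a per-class total needs one more aggregation. The sketch below assumes the article's encoding (1 = died) and uses a hypothetical helper, `deathsByClass`, that splits each key back into its parts and sums the counts by class. The input is only the first five entries of the output above, so the totals are partial and serve to illustrate the mechanics, not to report real figures.

```scala
object ClassRollupSketch {
  // Keys look like "survived sex age pclass", e.g. "1 female 31 1".
  def deathsByClass(counts: Seq[(String, Int)]): Map[String, Int] =
    counts
      .map { case (key, n) => (key.split(" "), n) }
      // Keep only well-formed four-part keys whose survival flag is "1" (died).
      .collect { case (Array(flag, _, _, pclass), n) if flag == "1" => (pclass, n) }
      .groupBy(_._1)
      .map { case (pclass, group) => pclass -> group.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // First five entries of the output above; a partial sample only.
    val partial = Seq(
      ("1 female 31 1", 2), ("0 male 45.5 3", 1), ("0 male 40 3", 4),
      ("0 male 20 3", 10), ("1 male 32 3", 5)
    )
    println(deathsByClass(partial))
  }
}
```

In a Spark job it would usually be simpler to key directly on the class, for example on `(x(1), x(2))`, rather than to parse a concatenated string afterwards.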

Original article

Spark Use Case – Titanic Data Analysis
