Spark Use Case: Olympics Data Analysis

In this post we analyze an Olympics dataset. Using this public data, we compute statistics such as the number of medals each country won in swimming.

Dataset description

Athlete: Name of the athlete
Age: Age of the athlete
Country: The name of the country participating in Olympics
Year: The year in which Olympics is conducted
Closing Date: Closing date of Olympics
Sport: Sports name
Gold Medals: No. of gold medals
Silver Medals: No. of silver medals
Bronze Medals: No. of bronze medals
Total Medals: Total no. of medals
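The field order above determines the column indices used in the snippets that follow (for example, index 2 is Country, index 5 is Sport and index 9 is Total Medals). As a minimal sketch, assuming the file is tab-separated with exactly these ten columns in this order, the mapping could be written down as a record type. The names `MedalRecord` and `MedalRecordParser` are hypothetical and do not come from the original article.

```scala
import scala.util.Try

// Hypothetical record type; field names follow the dataset description.
final case class MedalRecord(
  athlete: String,
  age: String, // kept as a String because the value may be missing
  country: String,
  year: String,
  closingDate: String,
  sport: String,
  gold: Int,
  silver: Int,
  bronze: Int,
  total: Int
)

object MedalRecordParser {
  // Parses one tab-separated line. Returns None for rows with fewer than
  // 10 fields or with medal columns that are not plain digit strings.
  def parse(line: String): Option[MedalRecord] = {
    val f = line.split("\t")
    if (f.length < 10 || !f.slice(6, 10).forall(_.matches("\\d+"))) None
    else
      // Try guards against digit strings too large to fit in an Int.
      Try(
        MedalRecord(f(0), f(1), f(2), f(3), f(4), f(5),
          f(6).toInt, f(7).toInt, f(8).toInt, f(9).toInt)
      ).toOption
  }
}
```

In Spark, a parser like this could probably be applied with something like `textFile.flatMap(line => MedalRecordParser.parse(line))`, so that malformed rows are dropped in one step. The snippets in this post index into the split array directly instead.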

Click here to download the data

Find the number of medals each country won in swimming

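// RDD is needed for the type annotation below
import org.apache.spark.rdd.RDD

// Load the file as an RDD of lines
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")

// Split each line on tabs and keep only rows with at least 10 fields.
// Note: the raw file had garbled characters. I did not find a way to handle that in Spark, so I converted the encoding in a text editor before uploading the file.
val counts = textFile.map(line => line.split("\t")).filter(fields => fields.length >= 10)

// Keep swimming rows whose Total Medals column (index 9) is numeric
val fil = counts.filter(x => x(5).equalsIgnoreCase("swimming") && x(9).matches("\\d+"))

// Build an RDD of (country, total medals) pairs
val pairs: RDD[(String, Int)] = fil.map(x => (x(2), x(9).toInt))

// Sum the medals per country and collect the result to the driver
val cnt = pairs.reduceByKey(_ + _).collect()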
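To make the filter, map and `reduceByKey` steps above more concrete, here is a small sketch that runs the same logic on a local `Seq` with plain Scala collections, so it can be tried without a cluster. The sample rows are made up for illustration and are not taken from the dataset. `SwimmingMedalsSketch` and `medalsByCountry` are hypothetical names.

```scala
object SwimmingMedalsSketch {
  // Made-up rows in the dataset's tab-separated layout (not real data).
  val sampleLines: Seq[String] = Seq(
    "Athlete A\t23\tCountryX\t2008\t8/24/2008\tSwimming\t1\t0\t1\t2",
    "Athlete B\t27\tCountryY\t2008\t8/24/2008\tSwimming\t0\t1\t0\t1",
    "Athlete C\t21\tCountryX\t2012\t8/12/2012\tSwimming\t2\t0\t1\t3",
    "Athlete D\t30\tCountryX\t2012\t8/12/2012\tRowing\t1\t0\t0\t1",
    "malformed row"
  )

  // Same steps as the RDD pipeline: split, filter, map to (country, total),
  // then sum per key. groupBy plus sum plays the role of reduceByKey(_ + _).
  def medalsByCountry(lines: Seq[String], sport: String): Map[String, Int] =
    lines
      .map(line => line.split("\t"))
      .filter(f => f.length >= 10 && f(5).equalsIgnoreCase(sport) && f(9).matches("\\d+"))
      .map(f => (f(2), f(9).toInt))
      .groupBy { case (country, _) => country }
      .map { case (country, pairs) => country -> pairs.map(_._2).sum }
}
```

Calling `SwimmingMedalsSketch.medalsByCountry(SwimmingMedalsSketch.sampleLines, "swimming")` sums only the swimming rows per country. The rowing row and the malformed row are filtered out before the aggregation.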

Find the number of medals India won each year

// RDD is needed for the type annotation below
import org.apache.spark.rdd.RDD

val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")

// Split each line on tabs and keep only rows with at least 10 fields
val counts = textFile.map(line => line.split("\t")).filter(fields => fields.length >= 10)

// Keep rows for India whose Total Medals column (index 9) is numeric
val fil = counts.filter(x => x(2).equalsIgnoreCase("india") && x(9).matches("\\d+"))

// Build (year, total medals) pairs
val pairs: RDD[(String, Int)] = fil.map(x => (x(3), x(9).toInt))

// Sum the medals per year
val cnt = pairs.reduceByKey(_ + _).collect()
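The snippet above sums column 9, which is Total Medals. If only gold medals are of interest, the same pipeline would sum column 6 instead. The sketch below makes the medal column a parameter and runs on a local `Seq`. It assumes the column order given in the dataset description. The rows in the usage note are made up, and `MedalsByYearSketch` and `sumByYear` are hypothetical names.

```scala
object MedalsByYearSketch {
  // 0-based column positions implied by the dataset description.
  val CountryCol = 2
  val YearCol = 3
  val GoldCol = 6
  val TotalCol = 9

  // Sums the chosen medal column per year for one country.
  def sumByYear(rows: Seq[Array[String]], country: String, medalCol: Int): Map[String, Int] =
    rows
      .filter(f =>
        f.length >= 10 &&
          f(CountryCol).equalsIgnoreCase(country) &&
          f(medalCol).matches("\\d+"))
      .groupBy(f => f(YearCol))
      .map { case (year, rs) => year -> rs.map(f => f(medalCol).toInt).sum }
}
```

For example, with rows such as `Array("A", "20", "CountryX", "2008", "8/24/2008", "Shooting", "1", "0", "0", "1")`, passing `MedalsByYearSketch.GoldCol` counts only gold medals and passing `MedalsByYearSketch.TotalCol` reproduces the totals. In the Spark version, the equivalent change would be replacing `x(9)` with `x(6)` in both the filter and the map.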

Find the total number of medals won by each country

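// RDD is needed for the type annotation below
import org.apache.spark.rdd.RDD

val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")

// Split each line on tabs and keep only rows with at least 10 fields
val counts = textFile.map(line => line.split("\t")).filter(fields => fields.length >= 10)

// Keep rows whose Total Medals column (index 9) is numeric
val fil = counts.filter(x => x(9).matches("\\d+"))

// Build (country, total medals) pairs
val pairs: RDD[(String, Int)] = fil.map(x => (x(2), x(9).toInt))

// Sum the medals per country
val cnt = pairs.reduceByKey(_ + _).collect()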
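The collected result is an unordered `Array[(String, Int)]`, so a natural follow-up is to rank the countries by medal count. The helper below is a minimal sketch of that step on the local array. `TopCountriesSketch` and `topN` are hypothetical names that are not part of the original article.

```scala
object TopCountriesSketch {
  // Sorts (country, medals) pairs by medal count, highest first,
  // and keeps the first n entries.
  def topN(totals: Array[(String, Int)], n: Int): Array[(String, Int)] =
    totals.sortBy { case (_, medals) => -medals }.take(n)
}
```

With the result above, `TopCountriesSketch.topN(cnt, 10).foreach(println)` would print the ten highest totals. For larger results, the ranking could instead be done before collecting, for example with the RDD's own `sortBy` followed by `take`.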

Original article

https://acadgild.com/blog/spark-use-case-olympics-data-analysis/
