How to use Mahout canopy clustering


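This article explains how to use Mahout canopy clustering. Many people run into questions about it in day-to-day work, so the 丸趣 TV editor went through various references and put together a simple, practical procedure. Hopefully it helps answer those questions.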

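Canopy is one implementation of a clustering algorithm.
Canopy is a simple and fast, but inexact, clustering method.
It is a small and elegant method, and the algorithm works as follows:
1. Let the sample set be S, and choose two thresholds t1 and t2, where t1 > t2.
2. Pick any sample point p from S as a canopy, denoted C, and remove p from S.
3. Compute the distance dist from p to every point remaining in S.
4. If dist < t1, add that point to C.
5. If dist < t2, remove that point from S.
Repeat steps 2 - 5 until S is empty.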
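The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mahout's code: the names `canopy` and `euclidean` are made up for the example, the distance is plain Euclidean, and step 2 takes the first remaining point instead of a random one so that the run is deterministic.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, dist=euclidean):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)           # the sample set S
    canopies = []
    while remaining:                   # repeat until S is empty
        center = remaining.pop(0)      # step 2: take p and remove it from S
        members = [center]
        still_in_s = []
        for q in remaining:
            d = dist(center, q)        # step 3: distance from p to q
            if d < t1:                 # step 4: q joins canopy C
                members.append(q)
            if d >= t2:                # step 5: only points with d < t2 leave S
                still_in_s.append(q)
        remaining = still_in_s
        canopies.append((center, members))
    return canopies

# Hypothetical 1-D sample data, chosen only for illustration.
points = [(0.0,), (0.5,), (2.0,), (5.0,), (5.2,)]
result = canopy(points, t1=3.0, t2=1.0)
for center, members in result:
    print(center, members)
```

In this run the point (2.0,) is a member of the first canopy and also the center of the second one, which shows that canopies may overlap.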
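The T1 and T2 parameters
When T1 is too large, many points belong to several canopies at once. The canopy centers may then end up close to each other, and the clusters are no longer clearly separated.
When T2 is too large, more points are strongly marked (removed from S, so they can no longer become canopy centers), which reduces the number of clusters. When T2 is too small, the number of clusters increases, and so does the computation time.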
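A small experiment makes these effects visible. The sketch below is an assumption-laden toy, not Mahout code: `canopies_1d` and `summary` are hypothetical helpers, the data are ten evenly spaced 1-D values, and the first remaining value is always taken as the next center.

```python
def canopies_1d(xs, t1, t2):
    """Canopy pass over 1-D values; returns one member list per canopy."""
    remaining, out = list(xs), []
    while remaining:
        c = remaining.pop(0)                                    # next canopy center
        out.append([c] + [x for x in remaining if abs(x - c) < t1])
        remaining = [x for x in remaining if abs(x - c) >= t2]  # drop points within t2
    return out

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def summary(t1, t2):
    """Return (number of canopies, total memberships over all canopies)."""
    cs = canopies_1d(xs, t1, t2)
    return len(cs), sum(len(c) for c in cs)

print(summary(3, 1.5))   # baseline
print(summary(6, 1.5))   # larger T1: same canopy count, more overlap
print(summary(6, 4))     # larger T2: fewer canopies
print(summary(3, 0.5))   # small T2: every point becomes a center
```

With ten points, a total membership count above ten means some points sit in more than one canopy, so the second value returned by `summary` is a rough measure of overlap.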
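Mahout's implementation of canopy clustering is fairly clever: the whole clustering process takes just two map operations and one reduce operation.
Canopy construction can be summarized as follows. Iterate over the given point set S with two thresholds t1 and t2, where t1 > t2. Pick a point and use a cheap distance measure to compute its distance to the existing canopy centers.
If the distance is less than t1, add the point to that canopy. If it is less than t2, the point can no longer become the center of a canopy. Repeat the whole process until S is empty.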
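Distance implementations
The org.apache.mahout.common.distance.DistanceMeasure interface
CosineDistanceMeasure: cosine distance
SquaredEuclideanDistanceMeasure: squared Euclidean distance
EuclideanDistanceMeasure: Euclidean distance
ManhattanDistanceMeasure: Manhattan (city-block) distance, used fairly often in image processing
TanimotoDistanceMeasure: Tanimoto distance, based on Jaccard similarity
Weighted versions of the Euclidean and Manhattan distances are also available.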
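For reference, the formulas behind these measures can be written out in plain Python. This is a sketch of the textbook definitions with illustrative function names. It is not Mahout's implementation, which may treat edge cases such as zero vectors differently.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences (city-block distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); undefined for zero vectors.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    # 1 - a.b / (|a|^2 + |b|^2 - a.b), the vector form of Jaccard similarity.
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

a, b = [1.0, 0.0], [0.0, 1.0]
for f in (squared_euclidean, euclidean, manhattan, cosine_distance, tanimoto_distance):
    print(f.__name__, f(a, b))
```

Because the squared Euclidean distance skips the square root, thresholds chosen for it are on a squared scale, which matters when picking t1 and t2.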
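Points to watch when using canopy
1. First, the choice of the cheap distance measure. Whether it uses an attribute from the model or some other, external attribute strongly affects how the canopies are distributed.
2. The values of T1 and T2 affect the overlap rate F and the granularity of the canopies.
3. Canopy can help eliminate outliers, which k-means cannot do. After the canopies are built, those containing only a few points can be deleted, since they often consist of outliers.
4. Setting a suitable number of points per canopy and using the resulting canopies to decide the number of cluster centers k tends to give good results.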
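Points 3 and 4 can be sketched together: drop canopies with too few members, then use what is left to choose k and the initial centers for k-means. The helper names, the `min_points` parameter, and the sample canopies below are hypothetical and serve only as an illustration.

```python
def prune_canopies(canopies, min_points):
    """Keep canopies with at least min_points members (each canopy is a list of points)."""
    return [c for c in canopies if len(c) >= min_points]

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Hypothetical canopies; the last one holds a single far-away point.
canopies = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(5.0, 5.0), (5.5, 4.5)],
    [(40.0, -30.0)],
]

kept = prune_canopies(canopies, min_points=2)
k = len(kept)                         # number of clusters to request from k-means
seeds = [centroid(c) for c in kept]   # candidate initial centers
print(k, seeds)
```

A suitable `min_points` depends on the data set. A value that is too high would also discard small but genuine clusters.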
[root@localhost bin]# hadoop fs -mkdir /20140824
[root@localhost data]# vi test-data.csv
1 -0.213 -0.956 -0.003 0.056 0.091 0.017 -0.024 1
1 3.147 2.129 -0.006 -0.056 -0.063 -0.002 0.109 0
1 -2.165 -2.718 -0.008 0.043 -0.103 -0.156 -0.024 1
1 -4.337 -2.686 -0.012 0.122 0.082 -0.021 -0.042 1
[root@localhost data]# hadoop fs -put test-data.csv /20140824
[root@localhost mahout-distribution-0.7]# hadoop jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job -i /20140824/test-data.csv -o /20140824 -t1 10 -t2 1
16/12/05 05:37:09 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/05 05:37:13 INFO input.FileInputFormat: Total input paths to process : 1
16/12/05 05:37:14 INFO mapreduce.JobSubmitter: number of splits:1
16/12/05 05:37:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480730026445_0005
16/12/05 05:37:17 INFO impl.YarnClientImpl: Submitted application application_1480730026445_0005
16/12/05 05:37:17 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1480730026445_0005/
16/12/05 05:37:17 INFO mapreduce.Job: Running job: job_1480730026445_0005
16/12/05 05:38:26 INFO mapreduce.Job: Job job_1480730026445_0005 running in uber mode : false
16/12/05 05:38:27 INFO mapreduce.Job: map 0% reduce 0%
16/12/05 05:39:25 INFO mapreduce.Job: map 100% reduce 0%
16/12/05 05:39:28 INFO mapreduce.Job: Job job_1480730026445_0005 completed successfully
16/12/05 05:39:30 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=105369
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=339
HDFS: Number of bytes written=457
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=51412
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=51412
Total vcore-seconds taken by all map tasks=51412
Total megabyte-seconds taken by all map tasks=52645888
Map-Reduce Framework
Map input records=4
Map output records=4
Input split bytes=108
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=140
CPU time spent (ms)=1620
Physical memory (bytes) snapshot=87416832
Virtual memory (bytes) snapshot=841273344
Total committed heap usage (bytes)=15597568
File Input Format Counters
Bytes Read=231
File Output Format Counters
Bytes Written=457
16/12/05 05:39:31 INFO canopy.CanopyDriver: Build Clusters Input: /20140824/data Out: /20140824 Measure: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@79b0cd8f t1: 10.0 t2: 1.0
16/12/05 05:39:32 INFO client.RMProxy: Connecting to ResourceManager at hadoop02/127.0.0.1:8032
16/12/05 05:39:33 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/05 05:39:37 INFO input.FileInputFormat: Total input paths to process : 1
16/12/05 05:39:38 INFO mapreduce.JobSubmitter: number of splits:1
16/12/05 05:39:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480730026445_0006
16/12/05 05:39:38 INFO impl.YarnClientImpl: Submitted application application_1480730026445_0006
16/12/05 05:39:39 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1480730026445_0006/
16/12/05 05:39:39 INFO mapreduce.Job: Running job: job_1480730026445_0006
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=105814
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1970
HDFS: Number of bytes written=527
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=26957
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=26957
Total vcore-seconds taken by all map tasks=26957
Total megabyte-seconds taken by all map tasks=27603968
Map-Reduce Framework
Map input records=4
Map output records=4
Input split bytes=112
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=134
CPU time spent (ms)=1880
Physical memory (bytes) snapshot=96550912
Virtual memory (bytes) snapshot=841433088
Total committed heap usage (bytes)=15597568
File Input Format Counters
Bytes Read=457
File Output Format Counters
Bytes Written=527
C-0{n=2 c=[1.000, -3.794, -2.694, -0.011, 0.102, 0.036, -0.055, -0.038, 1.000] r=[1:0.543, 2:0.008, 3:0.001, 4:0.020, 5:0.046, 6:0.034, 7:0.004]}
Weight : [props - optional]: Point:
1.0: [1.000, -4.337, -2.686, -0.012, 0.122, 0.082, -0.021, -0.042, 1.000]
C-1{n=2 c=[1.000, -2.220, -2.270, -0.008, 0.066, -0.008, -0.079, -0.029, 1.000] r=[1:1.031, 2:0.433, 3:0.002, 4:0.016, 5:0.002, 6:0.010, 7:0.005]}
Weight : [props - optional]: Point:
1.0: [1.000, -2.165, -2.718, -0.008, 0.043, -0.103, -0.156, -0.024, 1.000]
C-2{n=1 c=[0:1.000, 1:3.147, 2:2.129, 3:-0.006, 4:-0.056, 5:-0.063, 6:-0.002, 7:0.109] r=[]}
Weight : [props - optional]: Point:
1.0: [0:1.000, 1:3.147, 2:2.129, 3:-0.006, 4:-0.056, 5:-0.063, 6:-0.002, 7:0.109]
C-3{n=1 c=[1.000, -1.189, -1.837, -0.006, 0.050, -0.006, -0.070, -0.024, 1.000] r=[]}
Weight : [props - optional]: Point:
1.0: [1.000, -0.213, -0.956, -0.003, 0.056, 0.091, 0.017, -0.024, 1.000]
16/12/05 05:43:59 INFO clustering.ClusterDumper: Wrote 4 clusters
16/12/05 05:55:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
drwxr-xr-x - root supergroup 0 2016-12-05 05:43 /20140824/clusteredPoints
drwxr-xr-x - root supergroup 0 2016-12-05 05:42 /20140824/clusters-0-final
drwxr-xr-x - root supergroup 0 2016-12-05 05:39 /20140824/data
-rw-r--r-- 1 root supergroup 231 2016-12-05 05:21 /20140824/test-data.csv
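
The cluster dump above can be checked by hand. The Python sketch below is not Mahout's code. It is a reconstruction under several assumptions: a single input split, a canopy that keeps its first vector as the comparison center during a pass and only averages its members at the end, map-stage centroids reaching the reduce stage in reverse order (an ordering chosen because it reproduces the dump), and a final step that assigns each point to its nearest center. The thresholds and the squared Euclidean measure are taken from the log line that reports t1: 10.0 and t2: 1.0.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def canopy_pass(vectors, t1, t2):
    """One canopy pass; returns (centers, sizes)."""
    canopies = []                          # list of (first_vector, members)
    for v in vectors:
        strongly_bound = False
        for first, members in canopies:
            d = sq_dist(first, v)
            if d < t1:                     # within t1: v joins this canopy
                members.append(v)
            strongly_bound = strongly_bound or d < t2
        if not strongly_bound:             # not within t2 of any canopy: new canopy
            canopies.append((v, [v]))
    return [mean(m) for _, m in canopies], [len(m) for _, m in canopies]

# The four rows of test-data.csv.
points = [
    [1, -0.213, -0.956, -0.003, 0.056, 0.091, 0.017, -0.024, 1],
    [1, 3.147, 2.129, -0.006, -0.056, -0.063, -0.002, 0.109, 0],
    [1, -2.165, -2.718, -0.008, 0.043, -0.103, -0.156, -0.024, 1],
    [1, -4.337, -2.686, -0.012, 0.122, 0.082, -0.021, -0.042, 1],
]
T1, T2 = 10.0, 1.0

# First stage: canopies over the raw points.
map_centroids, _ = canopy_pass(points, T1, T2)
# Second stage: canopies over the first-stage centroids.
# ASSUMPTION: reversed arrival order, picked because it matches the dump.
centers, sizes = canopy_pass(map_centroids[::-1], T1, T2)
# Final stage: each point goes to its nearest center.
assignment = [min(range(len(centers)), key=lambda i: sq_dist(centers[i], p))
              for p in points]

print(sizes)
print([[round(x, 3) for x in c] for c in centers])
print(assignment)
```

Under these assumptions the sketch yields canopy sizes 2, 2, 1, 1, centers that agree with C-0 to C-3 to about three decimals, and the same point-to-cluster assignment as the dump. No pair of input points is closer than the squared distance of 1.0 set by t2, so every point starts a canopy of its own, which is why four points produce four clusters.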