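Mahout is an open-source project of the Apache Software Foundation.
Mahout provides many implementations, including clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by using the Apache Hadoop libraries,
Mahout can scale effectively to the cloud.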
Run the kmeans example that ships with Mahout; this also verifies that Mahout itself runs correctly.
Put the data file under the $MAHOUT_HOME directory: synthetic_con
[hdfs@cloudra ~]$ hadoop fs -mkdir  testdata
[hdfs@cloudra root]$ hadoop fs -mkdir  /output
[hdfs@cloudra ~]$ hadoop fs -put testdata
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
export JAVA_HOME=/usr/java/jdk1.7.0_79
[hdfs@cloudra ~]$ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Alternatively: hadoop jar mahout-distribution-0.7/mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

[root@localhost mahout-distribution-0.9]# hadoop fs -mkdir /user/root/testdata
16/11/23 05:28:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mkdir: `/user/root/testdata': No such file or directory
[root@localhost mahout-distribution-0.9]# hadoop fs -mkdir -p /user/root/testdata
16/11/23 05:28:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@localhost mahout-distribution-0.9]# ls
bin  lib  mahout-examples-0.9.jar  NOTICE.txt
conf  LICENSE.txt  mahout-examples-0.9-job.jar  README.txt
docs  mahout-core-0.9.jar  mahout-integration-0.9.jar
examples  mahout-core-0.9-job.jar  mahout-math-0.9.jar
[root@localhost mahout-distribution-0.9]# cd ..
[root@localhost soft]# cd ..
[root@localhost ~]# cd -
[root@localhost soft]# ls
data  hadoop-2.6.0  jdk1.7.0_79  mahout-distribution-0.9
[root@localhost soft]# cd data
[root@localhost data]# ls
[root@localhost data]# hadoop fs -put /user/root/testdata
16/11/23 05:29:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `/user/root/testdata': No such file or directory
[root@localhost data]# hadoop fs -put  /user/root/testdata
16/11/23 05:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@localhost data]# ls
[root@localhost data]# cd ..
[root@localhost soft]# ls
data  hadoop-2.6.0  jdk1.7.0_79  mahout-distribution-0.9
[root@localhost soft]# cd mahout-distribution-0.9/
[root@localhost mahout-distribution-0.9]# ls
bin  lib  mahout-examples-0.9.jar  NOTICE.txt
conf  LICENSE.txt  mahout-examples-0.9-job.jar  README.txt
docs  mahout-core-0.9.jar  mahout-integration-0.9.jar
examples  mahout-core-0.9-job.jar  mahout-math-0.9.jar
[root@localhost mahout-distribution-0.9]# hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
16/11/23 05:30:30 INFO kmeans.Job: Running with default arguments
16/11/23 05:30:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/23 05:30:40 INFO kmeans.Job: Preparing Input
16/11/23 05:30:41 INFO client.RMProxy: Connecting to ResourceManager at hadoop02/
16/11/23 05:30:42 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/23 05:30:46 INFO input.FileInputFormat: Total input paths to process : 1
16/11/23 05:30:46 INFO mapreduce.JobSubmitter: number of splits:1
16/11/23 05:30:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479907436985_0002
16/11/23 05:30:49 INFO impl.YarnClientImpl: Submitted application application_1479907436985_0002
16/11/23 05:30:49 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1479907436985_0002/
16/11/23 05:30:49 INFO mapreduce.Job: Running job: job_1479907436985_0002
16/11/23 05:31:40 INFO mapreduce.Job: Job job_1479907436985_0002 running in uber mode : false
16/11/23 05:31:40 INFO mapreduce.Job:  map 0% reduce 0%
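mahout seqdumper: converts a SequenceFile into readable text; the corresponding source file is [missing]. Converts a vector file into [missing].
mahout clusterdump: analyzes the final clustering output; the corresponding source file is [missing].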
[root@localhost bin]# mahout seqdumper -s output/clusters-5/part-r-00000 -o ~/
mahout clusterdump --seqFileDir /user/root/output/clusters-10-final --pointsDir /user/root/output/clusteredPoints --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
Mahout covers three main areas: clustering, collaborative filtering (item-based and user-based recommendation), and classification (Bayes).
Clustering algorithms: Canopy clustering, k-means clustering, fuzzy k-means, EM clustering (expectation maximization), mean shift clustering, hierarchical clustering, Dirichlet process clustering, and latent Dirichlet allocation (LDA).
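To make the k-means entry in the list above concrete, here is a minimal single-machine sketch in plain Python. It is not Mahout's MapReduce implementation; the function name `kmeans`, the initial centers, and the sample points are assumptions chosen only for illustration.

```python
import math


def kmeans(points, centers, max_iter=10):
    """Lloyd's k-means on a list of equal-length tuples.

    `centers` holds the initial centers. Returns the final centers and
    the cluster index assigned to each point.
    """
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, assignment


# Hypothetical toy data: two obvious groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, assignment = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)     # [(1.0, 1.5), (9.5, 9.0)]
print(assignment)  # [0, 0, 1, 1]
```

The two steps in the loop (assign, then recompute centers) are the same idea a distributed version splits across map and reduce phases, but that mapping is only a rough analogy here.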
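Classification: label objects according to some criterion, then tell them apart and group them by label.
In classification the categories are defined in advance and their number is fixed. For example, soybeans and mung beans can be told apart by color and size.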
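The soybean versus mung bean example can be sketched as a nearest-centroid classifier. This is a deliberately simple stand-in, not one of Mahout's classifiers; the feature encoding (a greenness score and a diameter in millimetres) and all sample values are hypothetical.

```python
import math

# Hypothetical labeled samples: (greenness in [0, 1], diameter in mm).
# The two labels are fixed in advance, which is what makes this
# classification rather than clustering.
TRAINING = {
    "soybean": [(0.10, 7.0), (0.20, 8.0)],
    "mung bean": [(0.90, 4.0), (0.80, 3.0)],
}


def centroid(samples):
    """Component-wise mean of a list of feature tuples."""
    return tuple(sum(col) / len(samples) for col in zip(*samples))


def classify(sample, training=TRAINING):
    """Return the predefined label whose centroid is closest to `sample`."""
    centroids = {label: centroid(rows) for label, rows in training.items()}
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


print(classify((0.15, 7.5)))  # soybean
print(classify((0.85, 3.5)))  # mung bean
```

Because the features are on different scales, diameter dominates the distance in this toy setup; a real pipeline would normally rescale the features first.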
Classification algorithms: logistic regression, Bayesian classification, support vector machines, perceptron and Winnow, neural networks, random forests, and restricted Boltzmann machines.
Recommendation / collaborative filtering: non-distributed recommenders (Taste: user-based CF, item-based CF, Slope One) and distributed recommenders (item-based CF).
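Of the recommenders named above, Slope One is the easiest to show in a few lines. Below is a minimal sketch of weighted Slope One in plain Python. It is not Mahout's Taste implementation; the function name `slope_one_predict` and the ratings are invented for illustration.

```python
from collections import defaultdict


def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating for `target` with weighted Slope One.

    `ratings` maps user -> {item: rating}. Returns None when nobody has
    co-rated `target` with an item that `user` has rated.
    """
    diff_sum = defaultdict(float)  # item j -> sum of (r_target - r_j)
    count = defaultdict(int)       # item j -> number of users who rated both
    for other in ratings.values():
        if target not in other:
            continue
        for item, value in other.items():
            if item != target:
                diff_sum[item] += other[target] - value
                count[item] += 1

    numerator = 0.0
    denominator = 0
    for item, value in ratings[user].items():
        n = count.get(item, 0)
        if item != target and n > 0:
            average_diff = diff_sum[item] / n
            # Each item's vote is weighted by how many users support it.
            numerator += (average_diff + value) * n
            denominator += n
    return numerator / denominator if denominator else None


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 2},
    "bob": {"A": 3, "B": 4},
    "carol": {"B": 2, "C": 5},
}
print(slope_one_predict(ratings, "bob", "C"))  # roughly 3.33
```

Here item A contributes an average difference of -3 from one user and item B contributes +1 from two users, so the prediction for bob is (0 * 1 + 5 * 2) / 3.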
Vector similarity: RowSimilarityJob computes pairwise similarity between the rows of a matrix; VectorDistanceJob computes distances between vectors.
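For intuition about what "similarity" and "distance" mean here, the sketch below computes cosine similarity and Euclidean distance in plain Python. These are generic textbook measures, shown as an assumption about the kind of computation involved; the sketch does not call Mahout and says nothing about which measure either job uses by default.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical matrix rows; compare every pair of rows.
rows = [(1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        print(i, j,
              cosine_similarity(rows[i], rows[j]),
              euclidean_distance(rows[i], rows[j]))
```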
Non-MapReduce algorithms: hidden Markov models.
Collection extensions: Collections, which extends Java's Collections classes.
Association rule mining: Parallel FP-Growth algorithm.
Regression: locally weighted linear regression.
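Locally weighted linear regression fits a separate weighted least-squares line for every query point, giving nearby training points more influence. A minimal one-dimensional sketch in plain Python follows; the Gaussian kernel, the bandwidth parameter `tau`, and the function name `lwlr_predict` are assumptions for illustration and not Mahout's API.

```python
import math


def lwlr_predict(x0, xs, ys, tau=1.0):
    """Predict y at `x0` from 1-D data with a Gaussian-weighted line fit."""
    # Points close to x0 get weights near 1, distant points near 0.
    weights = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate case: fall back to the weighted mean
    # Closed-form weighted least squares for y = intercept + slope * x.
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
print(lwlr_predict(2.5, xs, ys))
```

On exactly linear data the local fit recovers the line regardless of `tau`; the bandwidth only starts to matter when the data curve.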
Dimensionality reduction: stochastic singular value decomposition, principal component analysis, independent component analysis, Gaussian discriminative analysis.
Evolutionary algorithms: parallelized Watchmaker framework.

