{"id":480,"date":"2014-09-13T20:36:00","date_gmt":"2014-09-13T12:36:00","guid":{"rendered":"http:\/\/note.systw.net\/note\/?p=480"},"modified":"2023-11-02T20:38:36","modified_gmt":"2023-11-02T12:38:36","slug":"mahout-clustering","status":"publish","type":"post","link":"https:\/\/systw.net\/note\/archives\/480","title":{"rendered":"Mahout Clustering"},"content":{"rendered":"\n<p>Canopy clustering algorithm<br><strong>#mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job<br><\/strong>Common parameters:<br>-i: Input vector&#8217;s directory<br>-o: Output working directory<br>-dm: Distance measure<br>\u3000Common dm options:<br>\u3000org.apache.mahout.common.distance.CosineDistanceMeasure<br>\u3000org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure (default)<br>-t1: T1 threshold<br>-t2: T2 threshold<br>-ow: Overwrite output directory if present<\/p>\n\n\n\n<p><strong>Canopy steps<\/strong><br>1. Choose two distance thresholds, T1 (outer circle) and T2 (inner circle), where T1 &gt; T2<br>2. Vectorize the dataset and put the vectors in a list<br>3. Pick any point P from the list<br>If no canopy exists yet, P becomes a canopy<br>If P is within T1 (outer circle) of a canopy, P joins that canopy<br>If P is within T2 (inner circle) of a canopy, P joins that canopy and is removed from the list<br>4. 
Repeat step 3 until the list is empty<br>Notes:<br>t1: outer circle; mainly affects how many points fall in each cluster<br>t2: inner circle; mainly affects the number of clusters. Too large a value yields a single cluster; too small a value yields too many clusters<br>Note:<br>Canopy is better suited to determining the number of clusters<\/p>\n\n\n\n<p>&#8230;&#8230;.<\/p>\n\n\n\n<p>K-means clustering algorithm<br><strong>#mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job<\/strong><br>Common parameters:<br>-i &lt; hadoop input path &gt; : Path to job input directory.<br>-o : The directory pathname for output.<br>-dm &lt; distanceMeasure &gt;: The classname of the DistanceMeasure.<br>\u3000Default is SquaredEuclidean<br>-k &lt; number of clusters&gt; : The number of clusters to create<br>-t1 &lt; threshold value&gt; : T1 threshold value<br>-t2 &lt; threshold value&gt; : T2 threshold value<br>-x &lt; number of iterations&gt; : The maximum number of iterations.<br>-ow : If present, overwrite the output directory before running job<br>-h: Print out help<\/p>\n\n\n\n<p>&#8230;.<\/p>\n\n\n\n<p><strong>#mahout clusterdump<\/strong><br>Common parameters:<br>-i &lt; hadoop input path &gt; : Path to job input directory.<br>-o : The directory pathname for output.<br>-p &lt; points Dir&gt; : The directory containing points sequence files mapping input vectors to their cluster. 
If specified, then the program will output the points associated with a cluster<\/p>\n\n\n\n<p>Sample output<br>VL-8{n=3 c=[5.667, 5.667] r=[0.471, 0.471]}<br>Weight : [props - optional]: Point:<br>1.0: [5.000, 6.000]<br>1.0: [6.000, 5.000]<br>1.0: [6.000, 6.000]<br>Where:<br>VL-8: Cluster name<br>n: the number of points in this cluster<br>c: the center of the cluster<br>r: the radius of the cluster<\/p>\n\n\n\n<p><br>&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;.<\/p>\n\n\n\n<p>demo<\/p>\n\n\n\n<p>test dataset<br><strong>#vi clustering.data<\/strong><br>0 1<br>1 0<br>1 1<br>2 1<br>1 2<br>2 2<br>5 6<br>6 5<br>6 6<br>9 9<\/p>\n\n\n\n<p><strong># hadoop fs -mkdir testdata<br># hadoop fs -put clustering.data testdata<br># hadoop fs -ls -R testdata<\/strong><br>-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata\/clustering.data<\/p>\n\n\n\n<p>&#8230;<\/p>\n\n\n\n<p>Canopy clustering<strong><br>#mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output<br><\/strong>&#8230;omit&#8230;<br>14\/09\/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters<br>14\/09\/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)<\/p>\n\n\n\n<p><strong>#hadoop fs -ls -R output<\/strong><br>drwxr-xr-x - root hdfs 0 2014-09-08 01:31 output\/clusteredPoints<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 01:31 output\/clusteredPoints\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 512 2014-09-08 01:31 output\/clusteredPoints\/part-m-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 01:30 output\/clusters-0-final<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 01:30 output\/clusters-0-final\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 202 2014-09-08 01:30 
output\/clusters-0-final\/_policy<br>-rw-r--r-- 3 root hdfs 649 2014-09-08 01:30 output\/clusters-0-final\/part-r-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 01:29 output\/data<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 01:29 output\/data\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 382 2014-09-08 01:29 output\/data\/part-m-00000<\/p>\n\n\n\n<p><strong>#mahout clusterdump --input output\/clusters-0-final --pointsDir output\/clusteredPoints<\/strong><br>C-0{n=1 c=[9.000, 9.000] r=[]}<br>Weight : [props - optional]: Point:<br>1.0: [9.000, 9.000]<br>C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]}<br>Weight : [props - optional]: Point:<br>1.0: [5.000, 6.000]<br>1.0: [6.000, 5.000]<br>1.0: [6.000, 6.000]<br>C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]}<br>Weight : [props - optional]: Point:<br>1.0: [1:1.000]<br>1.0: [0:1.000]<br>1.0: [1.000, 1.000]<br>1.0: [2.000, 1.000]<br>1.0: [1.000, 2.000]<br>1.0: [2.000, 2.000]<\/p>\n\n\n\n<p><br>&#8230;&#8230;&#8230;&#8230;<\/p>\n\n\n\n<p>Run K-means with 3 clusters<br><strong>#mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -k 3 -t1 3 -t2 2 -x 10 -i testdata -o output<\/strong><br>&#8230;omit&#8230;<br>14\/09\/08 00:47:53 INFO clustering.ClusterDumper: Wrote 3 clusters<br>14\/09\/08 00:47:53 INFO driver.MahoutDriver: Program took 194271 ms (Minutes: 3.23785)<\/p>\n\n\n\n<p><strong>#hadoop fs -ls -R output<\/strong><br>-rw-r--r-- 3 root hdfs 194 2014-09-08 00:47 output\/_policy<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:47 output\/clusteredPoints<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 00:47 output\/clusteredPoints\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 512 2014-09-08 00:47 output\/clusteredPoints\/part-m-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output\/clusters-0<br>-rw-r--r-- 3 root hdfs 194 2014-09-08 00:45 output\/clusters-0\/_policy<br>-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 
output\/clusters-0\/part-00000<br>-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 output\/clusters-0\/part-00001<br>-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 output\/clusters-0\/part-00002<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:46 output\/clusters-1<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 00:46 output\/clusters-1\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 194 2014-09-08 00:46 output\/clusters-1\/_policy<br>-rw-r--r-- 3 root hdfs 689 2014-09-08 00:46 output\/clusters-1\/part-r-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:46 output\/clusters-2<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 00:46 output\/clusters-2\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 194 2014-09-08 00:46 output\/clusters-2\/_policy<br>-rw-r--r-- 3 root hdfs 698 2014-09-08 00:46 output\/clusters-2\/part-r-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:47 output\/clusters-3-final<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 00:47 output\/clusters-3-final\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 194 2014-09-08 00:47 output\/clusters-3-final\/_policy<br>-rw-r--r-- 3 root hdfs 698 2014-09-08 00:47 output\/clusters-3-final\/part-r-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output\/data<br>-rw-r--r-- 3 root hdfs 0 2014-09-08 00:45 output\/data\/_SUCCESS<br>-rw-r--r-- 3 root hdfs 382 2014-09-08 00:45 output\/data\/part-m-00000<br>drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output\/random-seeds<br>-rw-r--r-- 3 root hdfs 551 2014-09-08 00:45 output\/random-seeds\/part-randomSeed<\/p>\n\n\n\n<p><strong>#mahout clusterdump --input output\/clusters-3-final --pointsDir output\/clusteredPoints<br><\/strong>VL-8{n=3 c=[5.667, 5.667] r=[0.471, 0.471]}<br>Weight : [props - optional]: Point:<br>1.0: [5.000, 6.000]<br>1.0: [6.000, 5.000]<br>1.0: [6.000, 6.000]<br>VL-9{n=1 c=[9.000, 9.000] r=[]}<br>Weight : [props - optional]: Point:<br>1.0: [9.000, 
9.000]<br>VL-6{n=6 c=[1.167, 1.167] r=[0.687, 0.687]}<br>Weight : [props - optional]: Point:<br>1.0: [1:1.000]<br>1.0: [0:1.000]<br>1.0: [1.000, 1.000]<br>1.0: [2.000, 1.000]<br>1.0: [1.000, 2.000]<br>1.0: [2.000, 2.000]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Canopy clustering algorithm #mahout org.apache. &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"","fifu_image_alt":"","_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[13],"tags":[],"class_list":["post-480","post","type-post","status-publish","format-standard","hentry","category-dataanalysis"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/posts\/480","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/comments?post=480"}],"version-history":[{"count":0,"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/posts\/480\/revisions"}],"wp:attachment":[{"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/media?parent=480"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/categories?post=480"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/systw.net\/note\/wp-json\/wp\/v2\/tags?post=480"
}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}