Hadoop 2.2 wordcount

How to run the bundled wordcount example

1. On the Linux system, create a directory named input under /home/kcm:

[ubuntu@701~]$ mkdir input

2. In the input directory, create two text files, file1.txt and file2.txt. file1.txt contains "hello word"; file2.txt contains two lines, "hello hadoop" and "hello mapreduce".

[ubuntu@701~]$ cd input
[ubuntu@701~]$ vi file1.txt    (edit and save the file)
[ubuntu@701~]$ vi file2.txt    (edit and save the file)
[ubuntu@701~]$ ls -l /home/kcm/input
file1.txt file2.txt

To display the file contents:

[ubuntu@701~]$ cat /home/kcm/input/file1.txt
hello word
[ubuntu@701~]$ cat /home/kcm/input/file2.txt
hello mapreduce
hello hadoop

3. Create the input directory wc_input on HDFS and upload the two text files from the local input directory to it:

[ubuntu@701~]$ hadoop fs -mkdir wc_input
[ubuntu@701~]$ hadoop fs -put /home/kcm/input/file* wc_input

List the files in wc_input:

[ubuntu@701~]$ hadoop fs -ls wc_input
Found 2 items
-rw-r--r-- 1 root supergroup 11 2014-03-13 01:19 /user/hadoop/wc_input/file1.txt
-rw-r--r-- 1 root supergroup 29 2014-03-13 01:19 /user/hadoop/wc_input/file2.txt

4. First, package wordcount on Windows as wordcount.jar. Then copy wordcount.jar to the Linux system, into any directory you like.

Here we put it in /home/kcm.

5. Change to the directory that holds the jar and run it:

[ubuntu@701~]$ hadoop jar wordcount.jar /user/hadoop/wc_input /user/hadoop/output
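As a quick sanity check on what the job should produce, the counting that wordcount performs over the two sample files can be sketched in plain Python. This is only a local illustration, not the Hadoop job itself; the function name count_words and the inlined file contents are assumptions that mirror file1.txt and file2.txt from the steps above.

```python
from collections import Counter


def count_words(texts):
    """Count whitespace-separated words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Contents mirroring file1.txt and file2.txt (assumed, copied from the steps above).
file1 = "hello word\n"
file2 = "hello hadoop\nhello mapreduce\n"

counts = count_words([file1, file2])
for word in sorted(counts):
    # Tab-delimited, one word per line.
    print("%s\t%d" % (word, counts[word]))
```

If the output of the real job differs from these counts, the uploaded input files are the first thing to check.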

How to write wordcount and run it on Hadoop 2.7.1

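1. Create the local sample data files: go to Home > hadoop > hadoop-1.2.1 and create a folder named file to hold the local raw data.

In this folder, create two files named myTest1.txt and myTest2.txt, or any other names you prefer.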

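Enter a few sample sentences in each of the two files.

2. Create the input directory on HDFS. Open a terminal and enter:

bin/hadoop fs -mkdir hdfsInput

This command may fail with a message about safe mode. If it does, leave safe mode with:

bin/hadoop dfsadmin -safemode leave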

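While the distributed file system is in safe mode, its contents can be neither modified nor deleted until safe mode ends.

Safe mode exists mainly so that, at startup, the system can check the validity of the data blocks on each DataNode and replicate or delete blocks as its policy requires.

Safe mode can also be entered at runtime with a command.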

The mkdir command creates an input directory on HDFS. Files must be uploaded to this directory before the job can process them.
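The safe-mode commands can also be driven from a script. The sketch below only builds the argument list for hadoop dfsadmin -safemode and runs it on request; the function names and the default path bin/hadoop are assumptions for illustration, and actually running the command requires a working Hadoop installation.

```python
import subprocess

# Actions accepted by `hadoop dfsadmin -safemode`.
VALID_ACTIONS = ("enter", "leave", "get", "wait")


def safemode_command(action, hadoop="bin/hadoop"):
    """Build the argument list for `hadoop dfsadmin -safemode <action>`."""
    if action not in VALID_ACTIONS:
        raise ValueError("unknown safemode action: %r" % (action,))
    return [hadoop, "dfsadmin", "-safemode", action]


def run_safemode(action, hadoop="bin/hadoop"):
    """Run the command and return its output as text (needs a Hadoop install)."""
    return subprocess.check_output(safemode_command(action, hadoop)).decode()


# Show the command that would query the current safe-mode state.
print(" ".join(safemode_command("get")))
```

Querying with get before calling leave avoids forcing the NameNode out of safe mode while it is still checking blocks.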

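3. Upload the files in the local file folder to the hdfsInput directory on the cluster. In the terminal, enter:

cd hadoop-1.2.1
bin/hadoop fs -put file/myTest*.txt hdfsInput

4. Run the example. In the terminal, enter:

bin/hadoop jar hadoop-examples-1.2.1.jar wordcount hdfsInput hdfsOutput

The examples jar here is version 1.2.1. If your installation has a different version, replace the version number with the * wildcard:

bin/hadoop jar hadoop-examples-*.jar wordcount hdfsInput hdfsOutput

The job log should then appear. The hadoop command starts a JVM to run this MapReduce program, picks up the Hadoop configuration automatically, and adds the Hadoop libraries (and their dependencies) to the classpath.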

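The Hadoop job's run log shows that the job was assigned the ID job_201202292213_0002 and that there were two input files (Total input paths to process : 2). It also reports the map input and output records (record counts and bytes) and the reduce input and output records.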

To view the contents of the hdfsOutput directory on HDFS, enter:

bin/hadoop fs -ls hdfsOutput

The listing shows that three files were generated; the results are in "part-r-00000".

Use the following command to view the contents of the result file:

bin/hadoop fs -cat hdfsOutput/part-r-00000
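The result file holds one word and its count per line, separated by a tab. A minimal sketch of reading that format back into a dictionary follows; the function name parse_wordcount_output is hypothetical, and the sample text is made up for illustration rather than taken from a real job.

```python
def parse_wordcount_output(text):
    """Parse 'word<TAB>count' lines into a dict of word -> int."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


# Made-up sample in the part-r-00000 layout, not real job output.
sample = "hadoop\t1\nhello\t3\nmapreduce\t1\n"
print(parse_wordcount_output(sample))
```

In practice the text would come from the output of the fs -cat command rather than from an inline string.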

Running the Hadoop wordcount example: on reaching $ bin/hadoop jar hadoop

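This is probably because you ran the format operation several times, which left stale files in the temporary directory.

Steps:

1. bin/stop-all.sh
2. rm -Rf /tmp/hadoop-yourusername/*
3. bin/hadoop namenode -format

In step 2, replace yourusername with your current user name. Deleting these temporary files is usually enough.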
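Before deleting anything, it can help to see what step 2 would remove. The sketch below is a dry run that only lists entries and deletes nothing; it assumes the temporary directory follows the /tmp/hadoop-<user> pattern used in the command above, and both function names are hypothetical.

```python
import getpass
import os


def hadoop_tmp_dir(user=None, tmp_root="/tmp"):
    """Return the temp directory path in the form <tmp_root>/hadoop-<user>."""
    if user is None:
        user = getpass.getuser()
    return os.path.join(tmp_root, "hadoop-" + user)


def list_stale_entries(user=None, tmp_root="/tmp"):
    """List the entries inside the temp directory; [] if it does not exist."""
    path = hadoop_tmp_dir(user, tmp_root)
    if not os.path.isdir(path):
        return []
    return sorted(os.path.join(path, name) for name in os.listdir(path))


# Print the directory that the rm command in step 2 targets.
print(hadoop_tmp_dir("yourusername"))
```

Note that the listing includes hidden files, which the shell glob in step 2 does not match.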

Running the wordcount example on a fully distributed Hadoop cluster: why do all tasks run on...

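In this example I will show how to write a simple MapReduce program for Hadoop in Python.

Although the Hadoop framework is written in Java, Hadoop programs do not have to be written in Java; they can also be written in languages such as C++ or Python.

The example program on the official Hadoop website is written in Jython and packaged as a jar file, which is clearly inconvenient. It does not have to be done that way: we can program against Hadoop directly in Python. Take a look at the example in /src/examples/python/WordCount.py and you will see what I mean.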

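What we want to do

We will write a simple MapReduce program in C-Python, rather than a Jython program packaged as a jar.

Our example mimics WordCount in Python: it reads text files and counts how often each word occurs.

The result is also written as text. Each line contains a word and its count, separated by a tab.

Prerequisites

Before writing this program you need a working Hadoop cluster, so that you are not stuck later on.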

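If you have not set one up yet, the following tutorials explain how to do so on Ubuntu Linux (they also apply to other Linux distributions and to Unix):

How to set up a single-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)
How to set up a multi-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)

Python MapReduce code

The trick to writing MapReduce code in Python is that we use HadoopStreaming to pass data between the Map and Reduce steps via STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; HadoopStreaming takes care of everything else.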

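That really is all there is to it.

Map: mapper.py

Save the following code in /home/hadoop/mapper.py. It reads data from STDIN, splits it into words, and writes to STDOUT one line per word that maps the word to its (intermediate) count.

Note: make sure the script is executable (chmod +x /home/hadoop/mapper.py).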

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

This script does not compute the total number of occurrences of each word. It emits "<word> 1" immediately for every occurrence, even if the word appears several times in the input; the summing is left to the later Reduce step (or program).

You can of course adapt the coding style to your own habits.
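The mapper logic can also be written as a function over any iterable of lines, which makes it testable without a pipe. This is a sketch of one possible restructuring, not the tutorial's own code; the names map_words and format_pairs are hypothetical.

```python
def map_words(lines):
    """Yield (word, 1) for every whitespace-separated word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


def format_pairs(pairs):
    """Render pairs as the tab-delimited lines a streaming mapper would print."""
    return ["%s\t%s" % pair for pair in pairs]


# Same sample sentence as the manual tests in this tutorial.
for out in format_pairs(map_words(["foo foo quux labs foo bar quux"])):
    print(out)
```

In a real streaming script the function would be fed sys.stdin, which is itself an iterable of lines.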

Reduce: reducer.py

Save the following code in /home/hadoop/reducer.py. The script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.

Again, make sure the script is executable: chmod +x /home/hadoop/reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

Test your code (cat data | map | sort | reduce)

I recommend testing mapper.py and reducer.py by hand before using them in a MapReduce job; otherwise the job may finish without returning any results. Here are some ideas for testing the Map and Reduce steps:

# very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1

hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar	1
foo	3
labs	1
quux	2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The	1
Project	1
Gutenberg	1
EBook	1
of	1
[...]
(you get the idea)
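Because the pipeline sorts the mapper output before the reducer sees it, all lines for one word arrive together. A reducer can exploit this and sum one group at a time instead of keeping every word in a dictionary. The following is a sketch of that alternative, not the reducer from this tutorial; the name reduce_sorted is hypothetical, and the pipeline is simulated in-process with the same sample sentence.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts per word from sorted 'word<TAB>count' lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, so the input must be sorted.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Simulate `mapper | sort | reducer` for the sample sentence.
mapper_output = ["%s\t1" % w for w in "foo foo quux labs foo bar quux".split()]
result = list(reduce_sorted(sorted(mapper_output)))
for word, total in result:
    print("%s\t%d" % (word, total))
```

The trade-off is that this version gives wrong totals on unsorted input, whereas the dictionary-based reducer does not depend on ordering.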

