pythonwordcount

1. How to write a simple MapReduce program for Hadoop in Python

In this example I will show how to write a simple MapReduce program for Hadoop in Python.

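Although the Hadoop framework is written in Java, programs for Hadoop do not have to be written in Java; they can also be developed in other languages such as Python or C++. The sample program on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient. It does not have to be done that way: Python can work with Hadoop directly. Take a look at the example in /src/examples/python/WordCount.py and you will see what I mean.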

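What do we want to do? We will write a simple MapReduce program in CPython, rather than in Jython packaged into a jar file. Our example mimics WordCount and is implemented in Python: it reads text files and counts how often each word occurs.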

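The output is also plain text: each line holds a word and its number of occurrences, separated by a tab.

Prerequisites

Before writing this program you need a working Hadoop cluster, so that you do not get stuck later on. If you have not set one up yet, the following short tutorials explain how to do so on Ubuntu Linux (they also apply to other Linux distributions and Unix systems):

How to set up a single-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)
How to set up a multi-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)

Python MapReduce code

The trick behind writing MapReduce code in Python is that we use HadoopStreaming to pass data between the Map and Reduce steps via STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; HadoopStreaming takes care of everything else. That really is all there is to it!

Map: mapper.py

Save the following code as /home/hadoop/mapper.py and make sure the script is executable (chmod +x /home/hadoop/mapper.py). It reads data from STDIN, splits each line into words, and writes one line per word to STDOUT that maps the word to its (intermediate) count: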

    #!/usr/bin/env python
    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

This script does not compute the total number of occurrences of a word. It outputs "<word> 1" immediately, even though the word may occur several times in the input; the summing is left to the subsequent Reduce step (or program). Of course, you can change the coding style to suit your own habits.
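The map step described above can also be written as a small function that is easy to try without Hadoop. This is only a minimal sketch: the helper name `map_words` is hypothetical and not part of the original script, and the input lines are made up for illustration. It emits one tab-separated record per word occurrence and leaves all summing to the reducer.

```python
def map_words(lines):
    """Yield one 'word<TAB>1' record per word occurrence (no summing here)."""
    for line in lines:
        # split() with no arguments splits on any run of whitespace
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


# Illustrative input only; in a real streaming job the lines come from sys.stdin.
for record in map_words(["foo foo quux", "labs foo"]):
    print(record)
```

Keeping the logic in a function makes it possible to check the mapper on in-memory data before wiring it to STDIN.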

Reduce: reducer.py

Save the following code as /home/hadoop/reducer.py. The script reads the results of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT. Again, make sure the script is executable: chmod +x /home/hadoop/reducer.py

    #!/usr/bin/env python
    from operator import itemgetter
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        try:
            # parse the input we got from mapper.py
            word, count = line.split('\t', 1)
            # convert count (currently a string) to int
            count = int(count)
        except ValueError:
            # the line had no tab or count was not a number,
            # so silently ignore/discard this line
            continue

        word2count[word] = word2count.get(word, 0) + count

    # sort the words lexicographically;
    #
    # this step is NOT required, we just do it so that our
    # final output will look more like the official Hadoop
    # word count examples
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

    # write the results to STDOUT (standard output)
    for word, count in sorted_word2count:
        print('%s\t%s' % (word, count))

Test your code (cat data | map | sort | reduce)

I recommend testing mapper.py and reducer.py by hand before using them in a MapReduce job; otherwise the job may run and simply return no results. Here are some ideas for testing the Map and Reduce functionality:

    # very basic test
    hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
    foo     1
    foo     1
    quux    1
    labs    1
    foo     1
    bar     1
    quux    1

    hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
    bar     1
    foo     3
    labs    1
    quux    2

    # using one of the ebooks as example input
    # (see below on where to get the ebooks)
    hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
    The     1
    Project 1
    Gutenberg       1
    EBook   1
    of      1
    [...]
    (you get the idea)
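The `cat data | map | sort | reduce` pipeline can also be imitated in a single Python process, which is handy for quick experiments. The following is a minimal sketch under stated assumptions: the function names `mapper`, `reducer` and `run_local` are hypothetical, `sorted()` merely stands in for the sort step between the two scripts, and the reducer uses `itertools.groupby` rather than the dictionary from reducer.py.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word occurrence."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reducer(sorted_pairs):
    """Sum the counts per word; the input must already be sorted by word."""
    # groupby only merges adjacent equal keys, hence the sorting requirement
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def run_local(lines):
    """Stand-in for `cat data | mapper | sort | reducer` on in-memory lines."""
    return list(reducer(sorted(mapper(lines))))


for word, count in run_local(["foo foo quux labs foo bar quux"]):
    print('%s\t%s' % (word, count))
```

Because the pairs arrive sorted, this reducer only needs to hold one word's counts at a time, whereas the dictionary-based version keeps every word in memory.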

2. Urgent

    # read a word and count how often each vowel occurs in it
    my_word = input("Please enter a word: ")
    a_num = my_word.count("a")
    e_num = my_word.count("e")
    i_num = my_word.count("i")
    o_num = my_word.count("o")
    u_num = my_word.count("u")
    print("Your input contains", a_num, "a,", e_num, "e,", i_num, "i,", o_num, "o,", u_num, "u!")
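The five separate counters above can be folded into one dictionary. This is only a sketch: the function name `count_vowels` is hypothetical, and lowercasing the input first is an assumption that goes beyond the snippet above, which counts lowercase vowels only.

```python
def count_vowels(text):
    """Return a dict mapping each vowel to how often it occurs in text."""
    lowered = text.lower()  # assumption: 'A' and 'a' count as the same vowel
    return {vowel: lowered.count(vowel) for vowel in "aeiou"}


print(count_vowels("Education"))
```

Returning a dictionary keeps the counting separate from the printing, so the same function can be reused with any input source.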

3. How to count the words in an English poem with Python

First read in the poem; here a small file stands in for it.

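Then use a function to count the words on each line and add up the counts for all lines. The code is as follows:

    def get_line_word(line):
        in_word = False  # True while the current run of characters belongs to a word
        count = 0
        for ch in line:
            if ch != " ":
                in_word = True
            elif in_word:
                # a space ends the current word
                count += 1
                in_word = False
        if in_word:
            # the line ended in the middle of a word
            count += 1
        return count

    len_word = 0
    with open('poem.txt') as fin:
        for line in fin:
            word_line = line.strip()
            len_word += get_line_word(word_line)
    print('The total num is ', len_word)

I tested this on a short poem of 26 words and it works.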
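The same total can be obtained more compactly with `str.split()`. This is a minimal sketch, not the original poster's method: the function name `count_words` and the sample lines are made up for illustration. Note one difference: `split()` without arguments treats tabs and other whitespace as separators too, while the hand-written loop above only looks at spaces.

```python
def count_words(lines):
    """Total number of whitespace-separated words across all lines."""
    return sum(len(line.split()) for line in lines)


# Illustrative two-line text; any iterable of strings works, including an open file.
sample = ["Twinkle twinkle  little star", "how I wonder what you are"]
print('The total num is', count_words(sample))
```

Passing an open file object instead of the list counts the words of a whole file in one call.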
