Table of Contents
  1. Introduction to Map/Reduce
  2. Project Setup
  3. Running the Program

Introduction to Map/Reduce

Hadoop relies mainly on the Map/Reduce framework for fast data processing: files uploaded to the Hadoop cluster are split into blocks and stored on HDFS (64 MB per block by default in Hadoop 1.x, 128 MB in Hadoop 2.x). The Map phase pre-processes each split, and its output is handed to the Reduce phase, which produces the final result, as shown in the (simplified) diagram below:
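To make the flow concrete, here is a small walk-through with a made-up two-line input (illustrative only, not taken from the original post): each mapper emits a count of 1 per word, and the reducer sums the counts for each word.

Input lines          Map output                  Reduce output
"hello world"    ->  (hello,1) (world,1)    ->   hadoop  1
"hello hadoop"   ->  (hello,1) (hadoop,1)        hello   2
                                                 world   1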

Project Setup

Create a Maven project in IDEA; the pom.xml is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>HadoopTest</groupId>
    <artifactId>HadoopTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <hadoop.version>2.8.0</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

</project>
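With this pom.xml in place the project follows the standard Maven layout, so the org.myorg package used below lives under src/main/java (the directory tree here is just the standard convention, sketched for orientation):

HadoopTest/
├── pom.xml
└── src/
    └── main/
        └── java/
            └── org/
                └── myorg/
                    └── WordCount.java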

Next, create WordCount.java and compile it into a jar file.

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: splits each input line into words and emits a <word, 1> pair per word.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits <word, total>.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

This code implements the map/reduce processing flow; the files uploaded with the -put command in the previous section have been distributed across the datanodes.

public void map() breaks each line of the file into words and emits them as key/value pairs.

public void reduce() sums the values passed on from map to produce the count for each word.

Finally, the job is configured in the main function.

Running the Program

The steps above produce HadoopTest-1.0-SNAPSHOT.jar.
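If you build from the command line rather than from IDEA, a plain Maven package run produces the same jar under target/ (a minimal sketch, assuming the standard Maven layout shown earlier):

mvn clean package
# the jar ends up at target/HadoopTest-1.0-SNAPSHOT.jar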

Run the following command; the output will appear under /user/liuce/output:

hadoop jar ./HadoopTest-1.0-SNAPSHOT.jar org.myorg.WordCount /user/liuce/input /user/liuce/output
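After the job completes, the result can be read straight from HDFS; with the single default reducer used here the counts end up in part-00000 (the file name may differ if more reducers are configured):

hadoop fs -cat /user/liuce/output/part-00000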

There is an Eclipse plugin that can connect to the Hadoop server remotely, so you don't have to repeat these upload and delete steps every time. The plugin on GitHub only goes up to version 2.6, although newer versions can be compiled yourself. It works fine on Linux and Windows, but on macOS it keeps throwing errors, repeatedly complaining that the job cannot be found, which seems related to the macOS directory layout.
