# WordCount Example in Cloudera Quickstart VM

WordCount is the Hadoop equivalent of the “Hello World” program. When you first start learning a new language or framework, you want to run and look into some “Hello World” example to get a feel for the new development environment. Your first few programs in a new language or framework are probably extended from such basic “Hello World” examples.

Most Hadoop tutorials are heavy on text but offer little guidance for practical, hands-on experiments such as this one. Although many of them are good and thorough, new Hadoop users can get lost midway through the walls of text.

The purpose of this post is to help new users dive into Hadoop more easily. After reading this, you should be able to:

1. Get started with a simple, local Hadoop sandbox for hands-on experiments.
2. Perform some simple tasks in HDFS.
3. Run the most basic example program WordCount, using your own input data.

Nowadays, several companies provide Hadoop sandboxes for learning purposes, such as Cloudera and Hortonworks. In this post, I use the Cloudera Quickstart VM. Download the VM and start it up in VirtualBox or VMware Fusion.

### Working with HDFS

Before running the WordCount example, we need to create an input text file and move it to HDFS. First, create a test input file in your local file system.
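For example, a small two-line test file (the name input.txt and its contents are arbitrary placeholders):

```shell
# Create a small test file in the local (Linux) file system.
printf 'Hello World Bye World\nHello Hadoop Goodbye Hadoop\n' > input.txt

# Verify the local file.
cat input.txt
```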

Next, we need to move this file into HDFS. The following are the most basic HDFS commands for managing files. In order of appearance below, we create a folder, copy the input file from the local file system to HDFS, and list the folder’s contents on HDFS.
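A minimal sequence might look like this (the relative path `input` resolves to `/user/cloudera/input` for the default `cloudera` user, and `input.txt` is assumed to be the local test file you just created):

```shell
# Create a folder in HDFS under the current user's home directory.
hdfs dfs -mkdir input

# Copy the local file into the new HDFS folder.
hdfs dfs -put input.txt input

# List the contents of the HDFS folder.
hdfs dfs -ls input
```

These commands need a running HDFS, so run them inside the sandbox VM.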

It should be noted that on a fresh Cloudera VM there is a “/user” folder in HDFS but not in the local file system. This illustrates that the local file system and HDFS are separate: Linux’s “ls” and HDFS’s “-ls” operate on them independently.
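You can see the separation directly on the VM (the exact listing under /user depends on the VM image):

```shell
# The local Linux file system has no /user folder on a fresh VM...
ls /user             # fails: No such file or directory

# ...but HDFS does:
hdfs dfs -ls /user   # lists HDFS home folders such as /user/cloudera
```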

To see the contents of a file on HDFS, use the -cat subcommand:
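For example, to print the file copied into the `input` folder above (the path assumes the folder and file names used earlier in this post):

```shell
# Print an HDFS file's contents to the terminal.
hdfs dfs -cat input/input.txt
```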

For large files, if you want to view just the first or last parts, there is a -tail subcommand but no -more or -head (the -head subcommand only appeared in newer Hadoop releases). Alternatively, pipe the output of the -cat subcommand through your local shell’s more, head, or tail. For example: hdfs dfs -cat /user/cloudera/output/* | more.

For more HDFS commands, check out links in References section below.

### Running the WordCount example

Next, we want to run a MapReduce example, such as WordCount. The WordCount example is commonly used to illustrate how MapReduce works: it returns a list of all the words that appear in a text file, together with a count of how many times each word appears. The output shows each word found and its count, one per line.

We need to locate the example programs on the sandbox VM. On the Cloudera Quickstart VM, they are packaged in the jar file “hadoop-mapreduce-examples.jar”. Running that jar file without any arguments prints a list of the available examples.
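For example (the jar path below assumes the default CDH layout on the Quickstart VM; if it differs on your image, locate the jar with `find / -name "hadoop-mapreduce-examples.jar" 2>/dev/null`):

```shell
# Run the examples jar with no arguments to list the bundled examples,
# including wordcount.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
```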

To run the WordCount example using the input file that we just moved to HDFS, use the following command:
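A sketch of the command, assuming the HDFS input folder is named `input` and using the jar path from the default CDH layout (the first argument after `wordcount` is the HDFS input, the second is the HDFS output folder, which must not already exist):

```shell
# Submit the WordCount job; Hadoop creates the output folder itself.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount input /user/cloudera/output
```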

The output folder is specified as “/user/cloudera/output” in the above command. Finally, check the output of the WordCount example in that folder.
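For example (the `part-r-00000` name assumes a single reducer, which is the default for this small job; each output line is a word and its count separated by a tab):

```shell
# List the output folder; a successful job writes a _SUCCESS marker
# plus one part-r-* file per reducer.
hdfs dfs -ls /user/cloudera/output

# Print the word counts.
hdfs dfs -cat /user/cloudera/output/part-r-00000
```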

Congratulations!! You just finished the first step of your journey into Hadoop.