![alt text](https://i1.wp.com/www.genericclass.com/wp-content/uploads/2018/08/hadoop-hdfs.png?fit=320%2C151 "")

---

## HDFS

#### HDFS provides a set of commands that roughly correspond to the ones available on any UNIX system. During this session we will show you how to use them from a Jupyter Notebook. Keep in mind that all of these commands are also available directly from a terminal: the Notebook simply passes them to the shell for execution.

#### Do not forget: even though these commands resemble those of a UNIX filesystem, HDFS does not behave like a regular POSIX filesystem. It works more like an object store: files are written once and their existing content cannot be modified, because HDFS does not support random writes. The only change allowed on an existing file is appending data to it.
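
As a minimal sketch of the append-only rule, the helper below builds the command line for `hadoop fs -appendToFile`, the shell command that appends a local file to an existing HDFS file. The function names and the two paths are hypothetical, chosen only for illustration; the command is executed only if a `hadoop` binary is found on the PATH.

```python
import shutil
import subprocess


def append_cmd(local_src, hdfs_dest):
    """Build the argv that appends a local file to an existing HDFS file."""
    return ["hadoop", "fs", "-appendToFile", local_src, hdfs_dest]


def run_if_available(argv):
    """Run argv only when its executable is on PATH; return the exit code or None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# Hypothetical paths, for illustration only.
cmd = append_cmd("more_lines.txt", "/user/alice/log.txt")
print(" ".join(cmd))
```

There is no equivalent command to overwrite bytes in the middle of a file: to change existing content you would have to write a new file.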

|Commands|Description|
|-|-|
|ls [-R] <path>|Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.|
|du [-h] [-s] <path>|Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.|
|mv <src> <dest>|Moves the file or directory indicated by src to dest, within HDFS.|
|cp <src> <dest>|Copies the file or directory identified by src to dest, within HDFS.|
|rm [-R] <path>|Removes the file identified by path. With -R, removes a directory and all of its contents recursively.|
|put <localSrc> <dest>|Copies the file or directory from the local file system identified by localSrc to dest within the DFS.|
|get [-crc] <src> <localDest>|Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.|
|getmerge <src> <localDest>|Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.|
|cat <filename>|Displays the contents of filename on stdout.|
|mkdir <path>|Creates a directory named path in HDFS.|
|setrep [-R] [-w] rep <path>|Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)|
|stat [format] <path>|Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).|
|tail [-f] <filename>|Shows the last 1KB of file on stdout.|
|chmod [-R] mode,mode,... <path>...|Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a (all) if no scope is specified and does not apply an umask.|
|chown [-R] [owner][:[group]] <path>...|Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.|
|chgrp [-R] group <path>...|Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.|

## Quickstart

Every line you write in a cell that starts with '!' is interpreted as a shell command.
Let's take a look at who we are and where we are...

In [None]:
!whoami
!pwd
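
To make the role of '!' more concrete, here is a rough Python equivalent of what happens to such a line: the text after '!' is handed to a shell and its output is displayed. This is only a sketch, assuming `sh` is available; the notebook's real mechanism differs in its details, and `run_shell` is a hypothetical helper name.

```python
import subprocess


def run_shell(line):
    """Hand one command line to sh and return what it printed on stdout."""
    result = subprocess.run(["sh", "-c", line], capture_output=True, text=True)
    return result.stdout


# Roughly what `!pwd` does in a notebook cell.
print(run_shell("pwd"), end="")
```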

## Building the material required to test HDFS

The following set of commands will be executed locally. As shown above, your working directory is your home. This will thus create a simple set of files to play with in the directory ./TestingHDFS

In [None]:
!mkdir ./TestingHDFS
!echo "Content of File1" > "./TestingHDFS/File1.txt"
!echo "File './TestingHDFS/File1.txt' ready !"

!echo "Content of File2" > "./TestingHDFS/File2.txt"
!echo "File './TestingHDFS/File2.txt' ready !"
!echo "========================================="

!mkdir ./TestingHDFS/newdir
!echo "Content of File1 in newdir" > "./TestingHDFS/newdir/newDirFile1.txt"
!echo "File './TestingHDFS/newdir/newDirFile1.txt' ready !"

!echo "Content of File2 in newdir" > "./TestingHDFS/newdir/newDirFile2.txt"
!echo "File './TestingHDFS/newdir/newDirFile2.txt' ready !"

!echo "DONE! Proceed with the exercises!"
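
If you prefer to stay in Python, the same local layout can be sketched with `pathlib`. The version below builds the tree inside a temporary directory rather than in your home, so it does not interfere with the files created above; `build_test_tree` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def build_test_tree(base):
    """Create the TestingHDFS layout under base and return the created files."""
    root = Path(base) / "TestingHDFS"
    contents = {
        root / "File1.txt": "Content of File1\n",
        root / "File2.txt": "Content of File2\n",
        root / "newdir" / "newDirFile1.txt": "Content of File1 in newdir\n",
        root / "newdir" / "newDirFile2.txt": "Content of File2 in newdir\n",
    }
    for path, text in contents.items():
        # Create missing parent directories, then write the file.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
    return sorted(contents, key=str)


with tempfile.TemporaryDirectory() as tmp:
    created = build_test_tree(tmp)
    names = [p.relative_to(tmp).as_posix() for p in created]
    first_content = created[0].read_text()

print("\n".join(names))
```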


## Using the HDFS commands

In the following cells, we will take a look at the different commands you may use on HDFS. You will see that they all closely resemble the ones you would use on your local filesystem. Remember, however, that HDFS behaves more like an object store than like a regular filesystem.

### ls

In [None]:
!echo "================ Simple listing of the files present in my home directory in HDFS"
!hadoop fs -ls

!echo "================ Equivalent of ls alone but while specifying the actual path"
!hadoop fs -ls /user/$USER

!echo "================ Recursive listing of all the files including the ones in the subdirectories"
!hadoop fs -ls -R

### mkdir
The listing above is expected to be empty at this point. Let's create a directory on HDFS.

In [None]:
!hadoop fs -mkdir /user/$USER/newHDFSDirectory
!hadoop fs -ls

### put

Great, we have it. Note that your home on the local file system is /home/USERNAME, but your home on HDFS is /user/USERNAME. Keep that in mind.
Now, let's put some files into it.

In [None]:
!hadoop fs -put /home/$USER/TestingHDFS/*.txt /user/$USER/newHDFSDirectory
!hadoop fs -ls /user/$USER/newHDFSDirectory
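
The two homes are easy to mix up, so here is a small sketch that spells out the mapping. It assumes the layout used in this notebook (/home/USERNAME locally, /user/USERNAME on HDFS); the helper names and the user `alice` are hypothetical, and other clusters may be configured differently.

```python
import posixpath


def local_home(user):
    """Home directory on the local file system, as assumed in this notebook."""
    return posixpath.join("/home", user)


def hdfs_home(user):
    """Home directory on HDFS, as assumed in this notebook."""
    return posixpath.join("/user", user)


def hdfs_target(user, *parts):
    """Build an absolute HDFS path below the user's HDFS home."""
    return posixpath.join(hdfs_home(user), *parts)


print(local_home("alice"))
print(hdfs_target("alice", "newHDFSDirectory", "File1.txt"))
```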

### cat

The cat command is often useful to take a peek at a file or to pipe its content to another command. It is also available on HDFS.

In [None]:
!hadoop fs -cat /user/$USER/newHDFSDirectory/File1.txt

### get

To copy a file from HDFS to a local directory, you may use the 'get' command. The file remains on HDFS.

In [None]:
!hadoop fs -get /user/$USER/newHDFSDirectory/File1.txt /home/$USER/TestingHDFS/File1RetrievedFromHDFS.txt
!ls -lh ./TestingHDFS/

### du

'du' is available as well, for instance to get summarized information about the size of an HDFS directory. The '-h' option prints sizes in a human-readable form.

In [None]:
!hadoop fs -du -h /user/$USER
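
To illustrate what a human-readable size means, the sketch below converts a byte count into units of 1024. It is only an approximation written for this notebook: `human_size` is a hypothetical helper, and the exact formatting printed by Hadoop may differ.

```python
def human_size(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "K", "M", "G", "T", "P"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} B"
            return f"{size:.1f} {unit}"
        size /= 1024


for n in (17, 1536, 128 * 1024 ** 2):
    print(n, "->", human_size(n))
```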

### mv

You may also move files around within HDFS.

In [None]:
!hadoop fs -mv /user/$USER/newHDFSDirectory/File2.txt /user/$USER/
!hadoop fs -ls -R

### rm

Let's remove the directory we just created as well as the files in it.

In [None]:
!hadoop fs -rm -r /user/$USER/newHDFSDirectory/
!hadoop fs -ls -h

So, 'rm' accepts options just like its UNIX counterpart, but the syntax differs a bit: here the recursive option is written '-r' after the '-rm' command name. The same goes for 'ls' and all the other commands.

Wait a minute, what is this .Trash directory?
The idea is that you may remove a file inadvertently at some point. As you are on HDFS, this file may be quite large, so, as a safety measure, 'rm' does not delete it right away: every file you 'rm' is moved to your own .Trash directory. So, in case of a mistake, you have another chance. You may override this behavior by adding the '-skipTrash' option to your 'rm' command (before the path); but in that case, you're on your own.
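
The sketch below shows where the options go when building an 'rm' command line: they come after '-rm' and before the path. `rm_cmd` is a hypothetical helper and the path is made up for illustration; nothing is executed here.

```python
def rm_cmd(path, recursive=False, skip_trash=False):
    """Build the argv for `hadoop fs -rm`, with options placed before the path."""
    argv = ["hadoop", "fs", "-rm"]
    if recursive:
        argv.append("-r")
    if skip_trash:
        # Bypasses the trash: the data cannot be recovered afterwards.
        argv.append("-skipTrash")
    argv.append(path)
    return argv


# Hypothetical path, for illustration only.
print(" ".join(rm_cmd("/user/alice/newHDFSDirectory", recursive=True)))
print(" ".join(rm_cmd("/user/alice/File2.txt", skip_trash=True)))
```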

### setrep

On HDFS, all files are automatically replicated multiple times. The default replication factor is 3, but this is adjustable at the cluster level.
However, you may ask for more or fewer replicas of specific files using the 'setrep' command.

In [None]:
!hadoop fs -setrep -R 1 /user/$USER/File2.txt
!hadoop fs -ls -h

Ok, wait. How do I know what replication factor is set for a given file?
Well, remember the second column of the 'ls' output, right after the permissions, with a weird integer? That's the replication factor.
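
As a sketch, the replication factor can be pulled out of a listing by splitting each line into fields and reading the second one. The sample listing below is made up for illustration (user, group, dates and sizes are hypothetical); directories show '-' in that column, which the hypothetical helper `replication_of` reports as `None`.

```python
def replication_of(ls_line):
    """Return the replication factor from one `hadoop fs -ls` line, or None."""
    fields = ls_line.split()
    if len(fields) < 8:
        # Not an entry line, e.g. the "Found N items" header.
        return None
    rep = fields[1]
    # Directories have no replication factor and show "-" instead.
    return int(rep) if rep.isdigit() else None


# Made-up listing, for illustration only.
sample = """Found 2 items
drwxr-xr-x   - alice supergroup          0 2018-08-01 10:00 /user/alice/.Trash
-rw-r--r--   1 alice supergroup         17 2018-08-01 10:05 /user/alice/File2.txt"""

factors = [replication_of(line) for line in sample.splitlines()]
print(factors)
```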

### getmerge

A common practice when using HDFS is to split huge files into pieces that roughly match the HDFS block size (128 MB by default since Hadoop 2; 64 MB in Hadoop 1). It is therefore sometimes useful to merge all the files contained in a given directory into a single one when bringing them back to your local filesystem. 'getmerge' allows you to do so in one go.

In [None]:
!hadoop fs -put /home/$USER/TestingHDFS/File1.txt /user/$USER
!hadoop fs -getmerge /user/$USER/*.txt /home/$USER/TestingHDFS/FileMerged.txt
!cat ./TestingHDFS/FileMerged.txt
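
To make the idea of the merge concrete, here is a purely local simulation that concatenates every file matching a pattern into a single output file. It is a sketch under the assumption that the pieces are joined in file-name order; it does not call Hadoop, and `merge_files` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def merge_files(src_dir, pattern, dest):
    """Concatenate the files of src_dir matching pattern, in name order, into dest."""
    pieces = sorted(Path(src_dir).glob(pattern), key=lambda p: p.name)
    with open(dest, "wb") as out:
        for piece in pieces:
            out.write(piece.read_bytes())
    return [p.name for p in pieces]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "File1.txt").write_text("Content of File1\n")
    (Path(tmp) / "File2.txt").write_text("Content of File2\n")
    merged_path = Path(tmp) / "merged.out"
    order = merge_files(tmp, "*.txt", merged_path)
    merged = merged_path.read_text()

print(merged, end="")
```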

### chmod / chown

Now for permissions. The usual 'chmod'/'chown' combination is usable on HDFS. Typically, when you copy a file to HDFS, it arrives with the default permissions set for HDFS. Never trust the defaults, always check them.
The most common pitfall is that an executable copied to HDFS arrives as non-executable, so you will need to make it executable again for Python Streaming, MapReduce or other tools.

In [None]:
!hadoop fs -chmod -R u+x /user/$USER/*.txt
!hadoop fs -ls -h /user/$USER
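
If the symbolic mode 'u+x' looks cryptic, the following sketch shows what it does to the permission bits, using Python's standard `stat` module. The starting mode 644 is an assumed example, not necessarily the default on your cluster, and `add_user_exec` is a hypothetical helper name.

```python
import stat


def add_user_exec(mode):
    """Apply the equivalent of `u+x`: set the owner's execute bit."""
    return mode | stat.S_IXUSR


before = 0o644  # assumed starting permissions: rw-r--r--
after = add_user_exec(before)

print(oct(before), "->", oct(after))
# Render the result the way `ls` would for a regular file.
print(stat.filemode(stat.S_IFREG | after))
```

Only the owner's execute bit changes; the group and other permissions stay as they were.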

## To go further

We have seen a small set of the most useful commands to interact with a filesystem and their equivalents on HDFS.
Of course, others exist: you may consult the documentation to learn more, or just ask Hadoop for a quick summary.

In [None]:
!hadoop fs -help
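
The full help output is long. `hadoop fs -help` also accepts a command name to show the help for that command only; the sketch below builds such a command line. `help_cmd` is a hypothetical helper and nothing is executed here.

```python
def help_cmd(command=None):
    """Build the argv for `hadoop fs -help`, optionally for a single command."""
    argv = ["hadoop", "fs", "-help"]
    if command is not None:
        argv.append(command)
    return argv


print(" ".join(help_cmd()))
print(" ".join(help_cmd("setrep")))
```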