Have you ever wondered how the big search engines (Google, Yahoo) and the big social networks (Facebook, MySpace, etc.) manage to index and process the huge amounts of data they store on their servers every day?
Hadoop is one of the most widely used frameworks for this purpose. Written in Java, it was designed for processing large amounts of data in distributed applications.
To get an idea of the scale we are talking about, Cloudera has published figures on the amount of data processed daily with this open-source framework.
Thanks to my Concurrent Programming exam (which I hope to take as soon as possible :P), I got interested in this framework, so I decided to write a small tutorial on configuring and using it on Mac and Linux systems.
1. SSH Configuration
First of all, let's generate our SSH keys without a passphrase. Hadoop's control scripts use SSH to start and stop the daemons, so a passwordless login to localhost is required:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
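If ssh localhost still prompts for a password, the usual culprit is the permissions on ~/.ssh; a quick fix (not from the original steps, but a common remedy):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys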
2. Download Hadoop
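The stable release used here is 0.20.2. Assuming you fetch the tarball from the Apache archive (the exact mirror URL is an assumption, any Apache mirror will do), then unpack it:

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz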
$ tar -zxvf hadoop-0.20.2.tar.gz
3. Hadoop Configuration
Now let's move on to the actual configuration of Hadoop. Enter the conf folder of hadoop-0.20.2 (the stable version) and open
conf/hadoop-env.sh
and uncomment the JAVA_HOME variable, setting it to the Java path of your system.
E.g. Linux
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
JAVA_HOME=/usr/java/version_of_java
E.g. Mac OS X
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
JAVA_HOME=/Library/Java/Home
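If you are unsure where Java is installed, these commands usually reveal it (paths vary by system, so treat them as a hint):

$ /usr/libexec/java_home            # Mac OS X
$ readlink -f $(which java)         # Linux; strip the trailing /bin/java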
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml (replication is set to 1 because we are running on a single node):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
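Note: by default Hadoop stores its HDFS data under /tmp (the format log below confirms this), which may not survive a reboot. If you want the data to persist, you can also set hadoop.tmp.dir in core-site.xml, next to fs.default.name; the path below is just an assumption, pick one that suits your system:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/your_user/hadoop-data</value>
  </property>
</configuration>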
4. Formatting and Running Hadoop
4.1 Formatting
Before launching our first program, we have to format our NameNode:
$ hadoop-*/bin/hadoop namenode -format
This command will produce more or less the following output:
unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop namenode -format
10/10/29 23:11:58 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = imac81-di-apple-utente.local/192.168.1.131
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/10/29 23:11:59 INFO namenode.FSNamesystem: fsOwner=unicondor,staff,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,com.apple.sharepoint.group.1,com.apple.access_screensharing
10/10/29 23:11:59 INFO namenode.FSNamesystem: supergroup=supergroup
10/10/29 23:11:59 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/10/29 23:11:59 INFO common.Storage: Image file of size 99 saved in 0 seconds.
10/10/29 23:11:59 INFO common.Storage: Storage directory /tmp/hadoop-unicondor/dfs/name has been successfully formatted.
10/10/29 23:11:59 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at imac81-di-apple-utente.local/192.168.1.131
************************************************************/
4.2 Execution
We are finally ready to run the examples that ship with Hadoop. Before launching them, we have to start the HDFS and MapReduce daemons (NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker):
unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/start-all.sh
starting namenode, logging to /Users/unicondor/Documents/hadoop-0.20.2/bin/../logs/hadoop-unicondor-namenode-imac81-di-apple-utente.local.out
localhost: starting datanode, logging to /Users/unicondor/Documents/hadoop-0.20.2/bin/../logs/hadoop-unicondor-datanode-imac81-di-apple-utente.local.out
localhost: starting secondarynamenode, logging to /Users/unicondor/Documents/hadoop-0.20.2/bin/../logs/hadoop-unicondor-secondarynamenode-imac81-di-apple-utente.local.out
starting jobtracker, logging to /Users/unicondor/Documents/hadoop-0.20.2/bin/../logs/hadoop-unicondor-jobtracker-imac81-di-apple-utente.local.out
localhost: starting tasktracker, logging to /Users/unicondor/Documents/hadoop-0.20.2/bin/../logs/hadoop-unicondor-tasktracker-imac81-di-apple-utente.local.out
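You can check that all five daemons are actually running with jps, which ships with the JDK. The put command below also assumes that a local input/ folder with some text files exists; here is a minimal sketch (file names and contents are made up for illustration):

$ jps
$ mkdir input
$ echo "ciao Flavio" > input/file1.txt
$ echo "salve Flavio, salve utente Flavio" > input/file2.txt

Now copy the folder into HDFS: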
unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop fs -put input/ input
Finally, let's launch our first program: WordCount
unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
10/10/29 23:25:24 INFO input.FileInputFormat: Total input paths to process : 2
10/10/29 23:25:24 INFO mapred.JobClient: Running job: job_201010292318_0001
10/10/29 23:25:25 INFO mapred.JobClient:  map 0% reduce 0%
10/10/29 23:25:32 INFO mapred.JobClient:  map 100% reduce 0%
10/10/29 23:25:44 INFO mapred.JobClient:  map 100% reduce 100%
10/10/29 23:25:46 INFO mapred.JobClient: Job complete: job_201010292318_0001
10/10/29 23:25:46 INFO mapred.JobClient: Counters: 17
10/10/29 23:25:46 INFO mapred.JobClient:   Job Counters
10/10/29 23:25:46 INFO mapred.JobClient:     Launched reduce tasks=1
10/10/29 23:25:46 INFO mapred.JobClient:     Launched map tasks=2
10/10/29 23:25:46 INFO mapred.JobClient:     Data-local map tasks=2
10/10/29 23:25:46 INFO mapred.JobClient:   FileSystemCounters
10/10/29 23:25:46 INFO mapred.JobClient:     FILE_BYTES_READ=1286
10/10/29 23:25:46 INFO mapred.JobClient:     HDFS_BYTES_READ=780
10/10/29 23:25:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2642
10/10/29 23:25:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=472
10/10/29 23:25:46 INFO mapred.JobClient:   Map-Reduce Framework
10/10/29 23:25:46 INFO mapred.JobClient:     Reduce input groups=42
10/10/29 23:25:46 INFO mapred.JobClient:     Combine output records=84
10/10/29 23:25:46 INFO mapred.JobClient:     Map input records=4
10/10/29 23:25:46 INFO mapred.JobClient:     Reduce shuffle bytes=1292
10/10/29 23:25:46 INFO mapred.JobClient:     Reduce output records=42
10/10/29 23:25:46 INFO mapred.JobClient:     Spilled Records=168
10/10/29 23:25:46 INFO mapred.JobClient:     Map output bytes=1112
10/10/29 23:25:46 INFO mapred.JobClient:     Combine input records=84
10/10/29 23:25:46 INFO mapred.JobClient:     Map output records=84
10/10/29 23:25:46 INFO mapred.JobClient:     Reduce input records=84
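For the curious, the program inside the examples jar boils down to the classic WordCount. The sketch below uses Hadoop 0.20's org.apache.hadoop.mapreduce API; it is a condensed illustration, not a verbatim copy of the bundled source:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner pre-aggregates on the map side, which explains
    // the Combine input/output counters in the job log above
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}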
There are two ways to read the results. The first is to print the output directly from HDFS:

unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop fs -cat output/*
Flavio    3
salve     2
ciao      1
utente    1
The second method is to fetch the output folder from the distributed filesystem and copy it locally:
unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop fs -get output output_wordcount/
unicondor@imac81-di-apple-utente:hadoop-0.20.2> cat output_wordcount/*
Flavio    3
salve     2
ciao      1
utente    1
WordCount is just one of the bundled examples; running the jar without arguments lists them all:

unicondor@imac81-di-apple-utente:hadoop-0.20.2> bin/hadoop jar hadoop-0.20.2-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
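When you are done experimenting, stop all the daemons with the companion script:

$ bin/stop-all.sh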
Awesome!!
I spent a whole day going through guides, but yours is the only one that worked for me 😀
Well, at least one article was useful to someone 😀 thanks for the positive comment, Francesco..
Hi, I'm having problems with wordcount on a 5-node cluster, could anyone give me a hand?
Hi Giuseppe, sorry for the late reply... tell me what problems you're having and let's try to solve them together 😉