Remote debugging Hadoop using Cloudera VM instead of Amazon’s EMR

Abstract

Enables you to debug the map-reduce process on your local machine by setting up Hadoop in a VM, instead of debugging through logs in Amazon's EMR.

Instructions

  1. Download the Cloudera QuickStart VM; follow the instructions here for the prerequisites and the download link. You can choose the VM type (VMware, VirtualBox, etc.)
  2. Extract the file and start the VM
    1. username: cloudera
    2. password: cloudera
  3. The VM should start with a Firefox browser open
    1. On the home page (or in the bookmarks) click Cloudera Manager and wait for it to initialize
    2. Once it is running you can issue Hadoop commands in the terminal
    3. Open a terminal and issue "hadoop fs -ls", which should work but return nothing
  4. Make sure you have a connection between the host machine and the Cloudera guest VM, as sketched below
    1. Issue an ifconfig command in the VM terminal to find its IP
    2. Try to ping that IP from the host
    3. If that does not work you will have to configure the VMware Player/VirtualBox network settings
      1. In VirtualBox, go to Network and select
        1. Attached to: Bridged Adapter
        2. Promiscuous Mode: Allow VMs
      2. Apply the new settings and wait for the network to come back up
      3. Issue another ifconfig command to find your new IP
      4. Try to ping it again from the host
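    For example, the connectivity check amounts to something like this (the IP address below is purely illustrative; use whatever ifconfig reports on your VM):

    # on the Cloudera guest VM: find the VM's IP address
    ifconfig

    # on the host: verify the guest is reachable (replace with your VM's IP)
    ping 192.168.1.42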
  5. On Windows only: do not use the "Shared Folders" feature or restart the VM; this will cause Cloudera services to stop functioning because of IP conflicts!
  6. Transfer your Hadoop jar to the VM, along with the following files, into the same folder (for example via scp, as sketched below)
    1. Run.sh
    2. The input file for Hadoop to process
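    One way to copy everything over is scp from the host; the file names here are placeholders for your own jar and input:

    # from the host: copy the job jar, the run script and the input file to the VM
    scp myjob.jar Run.sh input.txt cloudera@<the VM IP>:/home/cloudera/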
  7. Issue the following commands in the terminal to move the input file into Hadoop (if the target directory is missing, see the note below)
    1. hadoop fs -put <the location of your data file> /user/cloudera/input
    2. Check that the file is there: hadoop fs -ls /user/cloudera/input
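    If the put fails because the target directory does not exist yet, create it first (a sketch, assuming the Hadoop 2.x fs shell that ships with the QuickStart VM):

    # create the HDFS input directory if it does not exist yet
    hadoop fs -mkdir -p /user/cloudera/input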
  8. Make sure your Run.sh file has the right content (an illustrative expansion of the last line follows the script):
    #remote debug option enabled
    export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005 -Dlocal=true"
    
    #remote debug option disabled
    #export HADOOP_OPTS="-Dlocal=true"
    
    #sanity check: print the options the hadoop command will pick up
    echo $HADOOP_OPTS
    
    #your jar file
    export JARP=<the hadoop jar name you want to execute>.jar
    
    #hadoop output
    export OUTPUT=/user/cloudera/out/
    
    #delete the output dir if it exists
    hadoop fs -rm -r $OUTPUT
    
    ##########################regular run#############################
    hadoop jar $JARP <jar parameters go here, including the output folder we set above>
    ##################################################################
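    For illustration only, with a hypothetical driver class com.example.MyJob and the input directory from step 7, the regular-run line might expand to:

    hadoop jar $JARP com.example.MyJob /user/cloudera/input $OUTPUT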
    
    
  9. Execute Hadoop: "./Run.sh"
  10. You will now see in the terminal that the debugger awaits a session, so connect to it from an IntelliJ session on the host that holds your Hadoop jar's code
    1. Open IntelliJ on your host machine
    2. Open "Run/Debug Configurations"
    3. Create a new Remote configuration (set Host to the VM's IP and Port to 5005, the address configured in HADOOP_OPTS)
    4. Start the debug session

      i. Note that you may need to start the debug session twice if a command that runs before the job in Run.sh fails; the VM terminal will state that it is waiting for the debugger to attach, as shown below
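      When the JVM is ready it prints the standard JDWP line shown below in the VM terminal; optionally you can also probe the debug port from the host with nc, if it is installed:

      # printed in the VM terminal once HADOOP_OPTS takes effect (suspend=y pauses the JVM)
      Listening for transport dt_socket at address: 5005

      # optional, from the host: check the debug port is reachable (replace with your VM's IP)
      nc -vz <the VM IP> 5005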
