Confused Coders is a place where we share lessons and thoughts with you. Feel free to fire your doubts straight at us, and we will try our best to come back to you with clarifications. We also have a few PDFs which might be helpful for your interview preparations.

     Book shelf: Feel free to download and share. Cheers \m/


Have Fun !

How to write gzip-compressed JSON from a Spark DataFrame

A compressed output format can be specified in Spark via the Hadoop job configuration. Note that the codec property takes a codec class name (GzipCodec here), not "true":

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

The same can be provided to spark-shell as:

$> spark-shell --conf spark.hadoop.mapred.output.compress=true --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec --conf spark.hadoop.mapred.output.compression.type=BLOCK

The code for writing the JSON/text is the same as usual-

case class C(key: String, value: String)
val list = List(C("a", "b"), C("c", "d"), C("e", "f"))
val rdd = sc.makeRDD(list)
import sqlContext.implicits._
val df = rdd.toDF
df.write.mode("append").json("s3://work/data/tests/json")

That's it. We should now have compressed .gz files as output.
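As a tiny self-contained illustration (the helper and dict names are mine, not a Spark API), the same compression settings can be kept in one place and rendered as --conf flags for spark-shell or spark-submit:

```python
# Corrected compression settings from the post: the codec property
# takes a codec class name, not "true".
gzip_conf = {
    "spark.hadoop.mapred.output.compress": "true",
    "spark.hadoop.mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
    "spark.hadoop.mapred.output.compression.type": "BLOCK",
}

def to_conf_flags(conf):
    """Render a settings dict as spark-shell/spark-submit --conf arguments."""
    return " ".join("--conf %s=%s" % (k, v) for k, v in sorted(conf.items()))

print(to_conf_flags(gzip_conf))
```

Keeping the settings in a dict makes it easy to reuse the exact same values whether you set them on a SparkConf in code or pass them on the command line.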

Spark SQL job executing very slowly – Performance tuning

I have been facing trouble with a basic Spark SQL job which was unable to process tens of gigs in hours. That's when I demystified 'spark.sql.shuffle.partitions', which, left at its default, tends to slow the job down insanely. Adding the changes below to the Spark SQL code fixed the issue for me. Magic.

# For handling a large number of smaller files
events = sqlContext.createDataFrame(rows).coalesce(400)
events.registerTempTable("input_events")
# For overriding the default value of 200
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql(sql_query)
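The 400/10 values above are specific to that workload. As a rough illustrative heuristic (my own, not from the post and not a Spark API), you can tie the partition count to input size, aiming for roughly one partition per ~128 MB of data:

```python
import math

def suggested_shuffle_partitions(total_input_bytes,
                                 target_partition_mb=128,
                                 floor=10):
    """Rough heuristic: one shuffle partition per target_partition_mb
    of input, never going below a small floor."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(floor, math.ceil(total_input_bytes / target_bytes))

# ~10 GB of input -> 80 partitions; tiny inputs stay at the floor of 10.
print(suggested_shuffle_partitions(10 * 1024 ** 3))
```

The computed value would then go into the same `SET spark.sql.shuffle.partitions=N` statement shown above, instead of a hard-coded 10.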

Indexing csv data in Solr via Python – PySolr

Here is a crisp post to index data in Solr using Python.

1. Install pre-requisites
- pip
- PySolr

2. Python script

#!/usr/bin/python
import sys, getopt
import pysolr
import csv, json

#SOLR_URL=

def main(args):
    solrurl = ''
    inputfile = ''
    try:
        opts, args = getopt.getopt(args, "hi:u:")
    except getopt.GetoptError:
        print 'usage: -i <inputfile> -u <solrurl>'
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print 'usage: -i <inputfile> -u <solrurl>'
            sys.exit()
        elif opt in ("-i"):
            inputfile = arg
        elif opt in ("-u"):
            solrurl = arg
    # create a connection to a solr server
    s = pysolr.Solr(solrurl, timeout=10)
    keys = ("rank", "pogid", "cat", "subcat", "question_bucketid", "brand", "discount", "age_grp", "gender", "inventory", "last_updated")
    record_count = 0
    for line in open(inputfile, 'r').readlines():
        splits = line.split(',')
        […]
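The loop is truncated above, but it presumably zips each CSV row with the key tuple to build Solr documents. A standalone sketch of that step (field names are from the post; the csv module and the synthetic `id` field are my additions, and Python 3 is assumed):

```python
import csv
import io

keys = ("rank", "pogid", "cat", "subcat", "question_bucketid", "brand",
        "discount", "age_grp", "gender", "inventory", "last_updated")

def rows_to_docs(csv_text):
    """Turn raw CSV lines into Solr-ready dicts keyed by the schema fields."""
    docs = []
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        doc = dict(zip(keys, row))
        doc["id"] = str(i)  # Solr needs its mandatory uniqueKey field
        docs.append(doc)
    return docs

sample = "1,101,5,12,3,acme,0.2,adult,m,44,2014-11-07\n"
docs = rows_to_docs(sample)
# With a live server, these would then be posted in batches: s.add(docs)
```

Using csv.reader instead of a bare `line.split(',')` also handles quoted fields containing commas.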

How to get Pig Logical plan (Execution DAG) from Pig Latin script

TLDR; A Pig logical plan is the plan DAG that is used to execute the chain of jobs on Hadoop. Here is the code snippet for obtaining a Pig Latin logical plan DAG from a Pig script-

PySolr : How to boost a field for Solr document

Adding a quick note – PySolr: how to boost a field for a Solr document.

Index-time boosting:
conn.add(docs, boost={'author': '2.0'})

Query-time boosting:
qf=title^5 content^2 comments^0.5

Read:

SolrJ Exception – Exception in thread "main" org.apache.solr.common.SolrException: Bad Request

Exception in thread "main" org.apache.solr.common.SolrException: Bad Request

Solution: Check the Solr logs.

INFO – 2014-11-07 07:04:42.985; org.apache.solr.update.processor.LogUpdateProcessor; [feeddata] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
ERROR – 2014-11-07 07:04:42.985; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

Here it is: Document is missing mandatory uniqueKey field: id

Another instance:

INFO – 2014-11-07 07:13:21.684; org.apache.solr.update.processor.LogUpdateProcessor; [feeddata] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
ERROR – 2014-11-07 07:13:21.685; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=0] unknown field 'win_hour'

Takeaway: Logs are very helpful. Do have a look before searching elsewhere.
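Both failures above (missing uniqueKey, unknown field) can be caught client-side before posting. An illustrative pre-flight check — my own helper, not part of pysolr or SolrJ — that mimics the two log messages:

```python
def preflight(doc, schema_fields, unique_key="id"):
    """Return Solr-style error strings for a doc before sending it:
    a missing uniqueKey, and any fields not in the schema."""
    errors = []
    if unique_key not in doc:
        errors.append(
            "Document is missing mandatory uniqueKey field: %s" % unique_key)
    for field in doc:
        if field not in schema_fields:
            errors.append("unknown field '%s'" % field)
    return errors

# Hypothetical schema fields for illustration:
schema = {"id", "rank", "brand", "discount"}
print(preflight({"rank": 1, "win_hour": 7}, schema))
```

Running this over a batch before `add()` turns a vague "Bad Request" into a precise list of offending docs and fields.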

Indexing CSV data file in Solr – Using annotated java pojo’s

1. Java POJO: Add the Java POJO with the required fields-

import org.apache.solr.client.solrj.beans.Field;

/**
 * Created by yash on 18/11/14.
 */
public class ProductBean {
    @Field
    private int id;
    @Field("rank")
    private int rank;
    @Field("prodid")
    private long prodid;
    @Field("cat")
    private int cat;
    @Field("subcat")
    private int subcat;

    public ProductBean() {} // Required by Solr to initialize the bean.

    public ProductBean(int id, int rank, long prodid, int cat, int subcat) {
        this.id = id;
        this.rank = rank;
        this.prodid = prodid;
        this.cat = cat;
        this.subcat = subcat;
    }

    public int getRank() { return rank; }
    public void setRank(int rank) { this.rank = rank; }
    public long getprodid() { return prodid; }
    public void setprodid(long prodid) { […]

Mahout Exception : java.lang.NoSuchMethodError: org.apache.hadoop.util.ProgramDriver.driver

Another annoying Mahout error on running the Mahout jobs. Well, this is caused by the reason already discussed: Mahout is not built for Hadoop 2 by default. So all it needs is a small rebuild of Mahout:

mvn clean install -Dhadoop2 -Dhadoop2.version=2.2.0 -DskipTests=true

That's it. Mahout should now work just fine. Drop a note in case you get stuck anywhere.

Tunnel all cluster ports on local port – via browser

A couple of steps for tunneling the Hadoop box's ports to local ports-

1. SSH to any box in the cluster, opening a SOCKS proxy on a local port:
ssh -D 9999 dk2567@

2. Add proxy settings in a secondary browser (Firefox here): Edit > Preferences > Advanced > Network > Connection > Settings > Manual proxy configuration > add localhost and port (9999) in the SOCKS settings.

3. Visit the application in the browser, using the destination IP and destination port.
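The two knobs above (the local SOCKS port, and the browser proxy that points at it) can be captured in a tiny illustrative helper — the function name and the gateway hostname are placeholders of mine, not from the post:

```python
def socks_tunnel_cmd(user, host, local_port=9999):
    """ssh -D opens a local SOCKS proxy; browser traffic sent through it
    is forwarded via the cluster box, so internal UIs become reachable."""
    return "ssh -D %d %s@%s" % (local_port, user, host)

# Matching Firefox 'Manual proxy configuration' values (SOCKS section):
firefox_socks = {"SOCKS Host": "localhost", "Port": 9999}

print(socks_tunnel_cmd("dk2567", "gateway.example.com"))
```

The key point is that the port in the ssh -D flag and the port in the browser's SOCKS settings must match.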

Local Cassandra cluster via Cassandra CCM – Cluster Manager

Came across this cool little utility to launch a local Cassandra cluster and test your apps around Cassandra. Check it out.

1. Link
2. Install instructions.

Bring up a 5-node Cassandra cluster with Cassandra version 2.1.3. Minimal and simplistic-

cd work/git/ccm
ccm status
ccm create test -v 2.1.3 -n 5 -s
ccm start
ccm status

3. Other commands

Bring a selected node down:
ccm node4 stop

Use cqlsh on a node:
ccm node1 cqlsh
USE demo;
SELECT * from testcf;
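For scripting the same flow (e.g. from a test harness), here is an illustrative, dry-run-able wrapper around the exact ccm commands above; the wrapper itself is mine, and actually executing it requires ccm on the PATH:

```python
import subprocess

def ccm(*args, dry_run=True):
    """Build (and optionally run) a ccm command line."""
    cmd = ["ccm"] + list(args)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs ccm installed
    return " ".join(cmd)

# The post's 5-node, Cassandra 2.1.3 cluster, as a scripted sequence:
steps = [
    ccm("create", "test", "-v", "2.1.3", "-n", "5", "-s"),
    ccm("start"),
    ccm("status"),
]
for step in steps:
    print(step)
```

A harness can flip `dry_run=False` to actually bring the cluster up before running its Cassandra tests, and call `ccm("node4", "stop")` to simulate a node failure.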