Valid CCA175 Dumps shared by NewPassLeader for Helping Passing CCA175 Exam! NewPassLeader now offer the newest CCA175 exam dumps, the NewPassLeader CCA175 exam questions have been updated and answers have been corrected get the newest NewPassLeader CCA175 dumps with Test Engine here: http://https://www.newpassleader.com/Cloudera/CCA175-exam-preparation-materials.html (96 Q&As Dumps, 30%OFF Special Discount: 30free )
NEW QUESTION NO: 7
CORRECT TEXT
Problem Scenario 29 : Please accomplish the following exercises using HDFS command line options.
1. Create a directory in hdfs named hdfs_commands.
2. Create a file in hdfs named data.txt in hdfs_commands.
3. Now copy this data.txt file on local filesystem, however while copying file please make sure file properties are not changed e.g. file permissions.
4. Now create a file in local directory named data_local.txt and move this file to hdfs in hdfs_commands directory.
5. Create a file data_hdfs.txt in hdfs_commands directory and copy it to local file system.
6. Create a file in local filesystem named file1.txt and put it to hdfs
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create directory
hdfs dfs -mkdir hdfs_commands
Step 2 : Create a file in hdfs named data.txt in hdfs_commands. hdfs dfs -touchz hdfs_commands/data.txt
Step 3 : Now copy this data.txt file on local filesystem, however while copying file please make sure file properties are not changed e.g. file permissions.
hdfs dfs -copyToLocal -p hdfs_commands/data.txt/home/cloudera/Desktop/HadoopExam
Step 4 : Now create a file in local directory named data_local.txt and move this file to hdfs in hdfs_commands directory.
touch data_local.txt
hdfs dfs -moveFromLocal /home/cloudera/Desktop/HadoopExam/dataJocal.txt hdfs_commands/
Step 5 : Create a file data_hdfs.txt in hdfs_commands directory and copy it to local file system.
hdfs dfs -touchz hdfscommands/data hdfs.txt
hdfs dfs -getfrdfs_commands/data_hdfs.txt /home/cloudera/Desktop/HadoopExam/
Step 6 : Create a file in local filesystem named filel .txt and put it to hdfs touch filel.txt hdfs dfs -put/home/cloudera/Desktop/HadoopExam/file1.txt hdfs_commands/
NEW QUESTION NO: 8
CORRECT TEXT
Problem Scenario 18 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Now accomplish following activities.
1. Create mysql table as below.
mysql --user=retail_dba -password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive02(id int, department_name
varchar(45), avg_salary int);
show tables;
2. Now export data from hive table departments_hive01 in departments_hive02. While exporting, please note following. wherever there is a empty string it should be loaded as a null value in mysql.
wherever there is -999 value for int field, it should be created as null value.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create table in mysql db as well.
mysql ~user=retail_dba -password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive02(id int, department_name
varchar(45), avg_salary int);
show tables;
Step 2 : Now export data from hive table to mysql table as per the requirement.
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db \
-username retaildba \
-password cloudera \
--table departments_hive02 \
-export-dir /user/hive/warehouse/departments_hive01 \
-input-fields-terminated-by '\001' \
--input-Iines-terminated-by '\n' \
--num-mappers 1 \
-batch \
-Input-null-string "" \
-input-null-non-string -999
step 3 : Now validate the data,select * from departments_hive02;
NEW QUESTION NO: 9
CORRECT TEXT
Problem Scenario 15 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. In mysql departments table please insert following record. Insert into departments values(9999, '"Data Science"1);
2. Now there is a downstream system which will process dumps of this file. However, system is designed the way that it can process only files if fields are enlcosed in(') single quote and separate of the field should be (-} and line needs to be terminated by : (colon).
3. If data itself contains the " (double quote } than it should be escaped by \.
4. Please import the departments table in a directory called departments_enclosedby and file should be able to process by downstream system.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Connect to mysql database.
mysql --user=retail_dba -password=cloudera
show databases; use retail_db; show tables;
Insert record
Insert into departments values(9999, '"Data Science"');
select" from departments;
Step 2 : Import data as per requirement.
sqoop import \
-connect jdbc:mysql;//quickstart:3306/retail_db \
~ username=retail_dba \
--password=cloudera \
-table departments \
-target-dir /user/cloudera/departments_enclosedby \
-enclosed-by V -escaped-by \\ -fields-terminated-by--' -lines-terminated-by :
Step 3 : Check the result.
hdfs dfs -cat/user/cloudera/departments_enclosedby/part"
NEW QUESTION NO: 10
CORRECT TEXT
Problem Scenario 42 : You have been given a file (sparklO/sales.txt), with the content as given in below.
spark10/sales.txt
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
And want to produce the output as a csv with group by Department,Designation,State with additional columns with sum(costToCompany) and TotalEmployeeCountt
Should get result like
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales.Lead.LA.3.96000
Sales,Lead,TN,2,64000
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
step 1 : Create a file first using Hue in hdfs.
Step 2 : Load tile as an RDD
val rawlines = sc.textFile("spark10/sales.txt")
Step 3 : Create a case class, which can represent its column fileds. case class
Employee(dep: String, des: String, cost: Double, state: String)
Step 4 : Split the data and create RDD of all Employee objects.
val employees = rawlines.map(_.split(",")).map(row=>Employee(row(0), row{1), row{2).toDouble, row{3)))
Step 5 : Create a row as we needed. All group by fields as a key and value as a count for each employee as well as its cost, val keyVals = employees.map( em => ((em.dep, em.des, em.state), (1 , em.cost)))
Step 6 : Group by all the records using reduceByKey method as we want summation as well. For number of employees and their total cost, val results = keyVals.reduceByKey{
(a,b) => (a._1 + b._1, a._2 + b._2)} // (a.count + b.count, a.cost + b.cost)}
Step 7 : Save the results in a text file as below.
results.repartition(1).saveAsTextFile("spark10/group.txt")
NEW QUESTION NO: 11
CORRECT TEXT
Problem Scenario 58 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2) val b = a.keyBy(_.length) operation1
Write a correct code snippet for operationl which will produce desired output, shown below.
Array[(lnt, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)),
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle}}}
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key-component of each pair will automatically be presented to the partitioner.
Listing Variants
def groupByKeyQ: RDD[(K, lterable[V]}]
def groupByKey(numPartittons: Int): RDD[(K, lterable[V] )]
def groupByKey(partitioner: Partitioner): RDD[(K, lterable[V])]
NEW QUESTION NO: 12
CORRECT TEXT
Problem Scenario 9 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following.
1. Import departments table in a directory.
2. Again import departments table same directory (However, directory already exist hence it should not overrride and append the results)
3. Also make sure your results fields are terminated by '|' and lines terminated by '\n\
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solutions :
Step 1 : Clean the hdfs file system, if they exists clean out.
hadoop fs -rm -R departments
hadoop fs -rm -R categories
hadoop fs -rm -R products
hadoop fs -rm -R orders
hadoop fs -rm -R order_items
hadoop fs -rm -R customers
Step 2 : Now import the department table as per requirement.
sqoop import \
-connect jdbc:mysql://quickstart:330G/retaiI_db \
--username=retail_dba \
-password=cloudera \
-table departments \
-target-dir=departments \
-fields-terminated-by '|' \
-lines-terminated-by '\n' \
-ml
Step 3 : Check imported data.
hdfs dfs -Is departments
hdfs dfs -cat departments/part-m-00000
Step 4 : Now again import data and needs to appended.
sqoop import \
-connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
-password=cloudera \
-table departments \
-target-dir departments \
-append \
-tields-terminated-by '|' \
-lines-termtnated-by '\n' \
-ml
Step 5 : Again Check the results
hdfs dfs -Is departments
hdfs dfs -cat departments/part-m-00001
NEW QUESTION NO: 13
CORRECT TEXT
Problem Scenario 73 : You have been given data in json format as below.
{"first_name":"Ankit", "last_name":"Jain"}
{"first_name":"Amir", "last_name":"Khan"}
{"first_name":"Rajesh", "last_name":"Khanna"}
{"first_name":"Priynka", "last_name":"Chopra"}
{"first_name":"Kareena", "last_name":"Kapoor"}
{"first_name":"Lokesh", "last_name":"Yadav"}
Do the following activity
1 . create employee.json file locally.
2 . Load this file on hdfs
3 . Register this data as a temp table in Spark using Python.
4 . Write select query and print this data.
5 . Now save back this selected data in json format.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : create employee.json tile locally.
vi employee.json (press insert) past the content.
Step 2 : Upload this tile to hdfs, default location hadoop fs -put employee.json
Step 3 : Write spark script
#lmport SQLContext
from pyspark import SQLContext
# Create instance of SQLContext sqIContext = SQLContext(sc)
# Load json file
employee = sqlContext.jsonFile("employee.json")
# Register RDD as a temp table employee.registerTempTablef'EmployeeTab"}
# Select data from Employee table
employeelnfo = sqlContext.sql("select * from EmployeeTab"}
#lterate data and print
for row in employeelnfo.collect():
print(row)
Step 4 : Write dataas a Text file employeelnfo.toJSON().saveAsTextFile("employeeJson1")
Step 5: Check whether data has been created or not hadoop fs -cat employeeJsonl/part"
NEW QUESTION NO: 14
CORRECT TEXT
Problem Scenario 20 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Write a Sqoop Job which will import "retaildb.categories" table to hdfs, in a directory name "categories_targetJob".
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Connecting to existing MySQL Database mysql -user=retail_dba -- password=cloudera retail_db
Step 2 : Show all the available tables show tables;
Step 3 : Below is the command to create Sqoop Job (Please note that - import space is mandatory) sqoop job -create sqoopjob \ -- import \
-connect "jdbc:mysql://quickstart:3306/retail_db" \
-username=retail_dba \
-password=cloudera \
-table categories \
-target-dir categories_targetJob \
-fields-terminated-by '|' \
-lines-terminated-by '\n'
Step 4 : List all the Sqoop Jobs sqoop job --list
Step 5 : Show details of the Sqoop Job sqoop job --show sqoopjob
Step 6 : Execute the sqoopjob sqoopjob --exec sqoopjob
Step 7 : Check the output of import job
hdfs dfs -Is categories_target_job
hdfs dfs -cat categories_target_job/part*
NEW QUESTION NO: 15
CORRECT TEXT
Problem Scenario 77 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of order table : (orderid , order_date , order_customer_id, order_status)
Columns of ordeMtems table : (order_item_id , order_item_order_ld ,
order_item_product_id, order_item_quantity,order_item_subtotal,order_
item_product_price)
Please accomplish following activities.
1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory p92_orders and p92 order items .
2 . Join these data using orderid in Spark and Python
3 . Calculate total revenue perday and per order
4. Calculate total and average revenue for each date. - combineByKey
-aggregateByKey
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Import Single table .
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db -username=retail_dba - password=cloudera -table=orders --target-dir=p92_orders -m 1 sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba - password=cloudera -table=order_items --target-dir=p92_order_items -m1
Note : Please check you dont have space between before or after '=' sign. Sqoop uses the
MapReduce framework to copy data from RDBMS to hdfs
Step 2 : Read the data from one of the partition, created using above command, hadoop fs
-cat p92_orders/part-m-00000 hadoop fs -cat p92_order_items/part-m-00000
Step 3 : Load these above two directory as RDD using Spark and Python (Open pyspark terminal and do following). orders = sc.textFile("p92_orders") orderltems = sc.textFile("p92_order_items")
Step 4 : Convert RDD into key value as (orderjd as a key and rest of the values as a value)
# First value is orderjd
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
# Second value as an Orderjd
orderltemsKeyValue = orderltems.map(lambda line: (int(line.split(",")[1]), line))
Step 5 : Join both the RDD using orderjd
joinedData = orderltemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect():
print(line)
Format of joinedData as below.
[Orderld, 'All columns from orderltemsKeyValue', 'All columns from orders Key Value']
Step 6 : Now fetch selected values Orderld, Order date and amount collected on this order.
//Retruned row will contain ((order_date,order_id),amout_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(M,M)[1],row[0]}, float(row[1][0].split(",")[4])))
#print the result
for line in revenuePerDayPerOrder.collect():
print(line)
Step 7 : Now calculate total revenue perday and per order
A. Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda
runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
#Generate data as (date, amount_collected) (Ignore ordeMd)
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1])) for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8 : Calculate total amount collected for each day. And also calculate number of days.
# Generate output as (Date, Total Revenue for date, total_number_of_dates)
# Line 1 : it will generate tuple (revenue, 1)
# Line 2 : Here, we will do summation for all revenues at the same time another counter to maintain number of records.
#Line 3 : Final function to merge all the combiner
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
lambda revenue: (revenue, 1), \
lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1]
+ 1), \
lambda tuplel, tuple2: (round(tuple1[0] + tuple2[0], 2}, tuple1[1] + tuple2[1]) \ for line in totalRevenueAndTotalCount.collect(): print(line)
Step 9 : Now calculate average for each date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements:
(threeElements[0], threeElements[1][0]/threeElements[1][1]}}
for line in averageRevenuePerDate.collect(): print(line)
Step 10 : Using aggregateByKey
#line 1 : (Initialize both the value, revenue and count)
#line 2 : runningRevenueSumTuple (Its a tuple for total revenue and total record count for each date)
# line 3 : Summing all partitions revenue and count
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
(0,0), \
lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \ lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount:
(tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0],
tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 11 : Calculate the average revenue per date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements:
(threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
NEW QUESTION NO: 16
CORRECT TEXT
Problem Scenario 28 : You need to implement near real time solutions for collecting information when submitted in file with below
Data
echo "IBM,100,20160104" >> /tmp/spooldir2/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir2/.bb.txt
mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt
After few mins
echo "IBM,100.2,20160104" >> /tmp/spooldir2/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir2/.dr.txt
mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt
You have been given below directory location (if not available than create it) /tmp/spooldir2
.
As soon as file committed in this directory that needs to be available in hdfs in
/tmp/flume/primary as well as /tmp/flume/secondary location.
However, note that/tmp/flume/secondary is optional, if transaction failed which writes in this directory need not to be rollback.
Write a flume configuration file named flumeS.conf and use it to load data in hdfs with following additional properties .
1 . Spool /tmp/spooldir2 directory
2 . File prefix in hdfs sholuld be events
3 . File suffix should be .log
4 . If file is not committed and in use than it should have _ as prefix.
5 . Data should be written as text to hdfs
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create directory mkdir /tmp/spooldir2
Step 2 : Create flume configuration file, with below configuration for source, sink and channel and save it in flume8.conf.
agent1 .sources = source1
agent1.sinks = sink1a sink1bagent1.channels = channel1a channel1b
agent1.sources.source1.channels = channel1a channel1b
agent1.sources.source1.selector.type = replicating
agent1.sources.source1.selector.optional = channel1b
agent1.sinks.sink1a.channel = channel1a
agent1 .sinks.sink1b.channel = channel1b
agent1.sources.source1.type = spooldir
agent1 .sources.sourcel.spoolDir = /tmp/spooldir2
agent1.sinks.sink1a.type = hdfs
agent1 .sinks, sink1a.hdfs. path = /tmp/flume/primary
agent1 .sinks.sink1a.hdfs.tilePrefix = events
agent1 .sinks.sink1a.hdfs.fileSuffix = .log
agent1 .sinks.sink1a.hdfs.fileType = Data Stream
agent1 .sinks.sink1b.type = hdfs
agent1 .sinks.sink1b.hdfs.path = /tmp/flume/secondary
agent1 .sinks.sink1b.hdfs.filePrefix = events
agent1.sinks.sink1b.hdfs.fileSuffix = .log
agent1 .sinks.sink1b.hdfs.fileType = Data Stream
agent1.channels.channel1a.type = file
agent1.channels.channel1b.type = memory
step 4 : Run below command which will use this configuration file and append data in hdfs.
Start flume service:
flume-ng agent -conf /home/cloudera/flumeconf -conf-file
/home/cloudera/flumeconf/flume8.conf --name age
Step 5 : Open another terminal and create a file in /tmp/spooldir2/
echo "IBM,100,20160104" > /tmp/spooldir2/.bb.txt
echo "IBM,103,20160105" > /tmp/spooldir2/.bb.txt mv /tmp/spooldir2/.bb.txt
/tmp/spooldir2/bb.txt
After few mins
echo "IBM.100.2,20160104" >/tmp/spooldir2/.dr.txt
echo "IBM,103.1,20160105" > /tmp/spooldir2/.dr.txt mv /tmp/spooldir2/.dr.txt
/tmp/spooldir2/dr.txt
NEW QUESTION NO: 17
CORRECT TEXT
Problem Scenario 56 : You have been given below code snippet.
val a = sc.parallelize(l to 100. 3)
operation1
Write a correct code snippet for operationl which will produce desired output, shown below.
Array [Array [I nt]] = Array(Array(1, 2, 3,4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33),
Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
5 6, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66),
Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
8 9, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution : a.glom.collect
glom
Assembles an array that contains all elements of the partition and embeds it in an RDD.
Each returned array contains the contents of one panition