====== BE0M33BDT – Big Data Technologies ======

==== Schedule ====

Because the English version of this course is held for only a few students, the format will be different:

  * self-study of literature and presentations
  * two whole-day (or four half-day) workshops with training in technologies:
    * 5. 11. 9.00--16.00
    * 4. 12. 9.00--16.00
  * homework
  * test
  * oral exam

==== Prerequisites ====

  * registration in [[https://www.metacentrum.cz/en/Sluzby/Hadoop/index.html|Metacentrum]] (group CVUT:FEL:B0M33BDT or CVUT:FEL:A4M33BDT)
  * basic Linux skills (file and directory management)
  * basic SQL skills (creating a table, simple SELECT, GROUP BY, JOIN)
  * basic Python skills (list, tuple, dict, string manipulation and functions, basic regexp)
  * general skills in programming/scripting and in using the console and shell

==== Contents ====

=== Theory ===

  * {{ :courses:be0m33bdt:big-data-technologies-what-you-need-to-know.pdf |Syllabus in questions}}
  * Big Data & Hadoop basics ({{ :courses:be0m33bdt:be0m33bdt-hadoop-basics.pdf |presentation in PDF}})
  * Storage & Hive ({{ :courses:be0m33bdt:be0m33bdt-storage-hive.pdf |presentation in PDF}})
  * MapReduce ({{ :courses:be0m33bdt:be0m33bdt-mapreduce.pdf |presentation in PDF}})
  * Spark RDD ({{ :courses:be0m33bdt:be0m33bdt-spark-rdd.pdf |presentation in PDF}}, {{ :courses:be0m33bdt:spark-rdd-examples.py |examples in Python}})
  * Spark SQL ({{ :courses:be0m33bdt:be0m33bdt-spark-sql.pdf |presentation in PDF}}, {{ :courses:be0m33bdt:spark-sql-examples.py |examples in Python}})

=== Practice, Hands-on training ===

  * Linux & HDFS ({{ :courses:be0m33bdt:training-linux-hdfs.pdf |tasklist in PDF}})
  * Hive ({{ :courses:be0m33bdt:training-hive.pdf |tasklist in PDF}})
  * Spark RDD ({{ :courses:be0m33bdt:training-spark-rdd.pdf |tasklist in PDF}})
  * Spark SQL ({{ :courses:be0m33bdt:training-spark-sql-elementary.pdf |tasklist in PDF}})

=== Homework ===

{{ :courses:be0m33bdt:homework.pdf |PDF}}

==== Assessment and exam requirements ====

  * for assessment: at least 25 points (out of 50 possible) earned from tests and homework; the more points you earn, the better your position for the exam
  * for exam: a short interview on theoretical topics; the final mark is the "sum" of assessment points and exam performance

==== Useful links ====

  * [[https://wiki.metacentrum.cz/wiki/Hadoop|Metacentrum Hadoop reference page]]
  * [[https://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/FileSystemShell.html|HDFS DFS commands]]
  * [[https://learnxinyminutes.com/docs/python3/|Learn Python in Y minutes]]
  * [[https://docs.python.org/3/|Official Python documentation]]
  * [[https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-basics.php|Regular expressions at Ryan's tutorials]]
  * [[https://cwiki.apache.org/confluence/display/Hive/LanguageManual|Hive language manual]]
  * [[https://spark.apache.org/docs/1.6.0/|Apache Spark manual]]
  * [[http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html|PySpark SQL manual]]
  * [[https://github.com/databricks/spark-csv|CSV files import/export]]

==== Contact ====

Course coordinator: [[mailto:jan.hucin@profinit.eu|Jan Hučín]]

==== Literature ====

Hadoop: The Definitive Guide, 4th Edition, by Tom White