====== BE0M33BDT – Big Data Technologies ======

==== Schedule ====
Because the English version of this course is taught to only a few students, its format differs from the regular course:

  * self-study of literature and presentations
  * two whole-day (or four half-day) workshops with training in technologies:
    * 5. 11. 9.00--16.00
    * 4. 12. 9.00--16.00
  * homework
  * test
  * oral exam

==== Prerequisites ====

  * registration in [[https://www.metacentrum.cz/en/Sluzby/Hadoop/index.html|Metacentrum]] (group CVUT:FEL:B0M33BDT or CVUT:FEL:A4M33BDT)
  * basic Linux skills (file and directory management)
  * basic SQL skills (creating tables, simple SELECT, GROUP BY, JOIN)
  * basic Python skills (lists, tuples, dicts, string manipulation and functions, basic regular expressions)
  * general programming/scripting skills, including use of the console and shell

==== Contents ====

=== Theory ===

  * {{ :courses:be0m33bdt:big-data-technologies-what-you-need-to-know.pdf |Syllabus in questions}}
  * Big Data & Hadoop basics ({{ :courses:be0m33bdt:be0m33bdt-hadoop-basics.pdf |presentation in PDF}})
  * Storage & Hive ({{ :courses:be0m33bdt:be0m33bdt-storage-hive.pdf |presentation in PDF}})
  * MapReduce ({{ :courses:be0m33bdt:be0m33bdt-mapreduce.pdf |presentation in PDF}})
  * Spark RDD ({{ :courses:be0m33bdt:be0m33bdt-spark-rdd.pdf |presentation in PDF}}, {{ :courses:be0m33bdt:spark-rdd-examples.py |examples in Python}})
  * Spark SQL ({{ :courses:be0m33bdt:be0m33bdt-spark-sql.pdf |presentation in PDF}}, {{ :courses:be0m33bdt:spark-sql-examples.py |examples in Python}})
    
=== Practice, Hands-on training ===

  * Linux & HDFS ({{ :courses:be0m33bdt:training-linux-hdfs.pdf |tasklist in PDF}})
  * Hive ({{ :courses:be0m33bdt:training-hive.pdf |tasklist in PDF}})
  * Spark RDD ({{ :courses:be0m33bdt:training-spark-rdd.pdf |tasklist in PDF}})
  * Spark SQL ({{ :courses:be0m33bdt:training-spark-sql-elementary.pdf |tasklist in PDF}})

=== Homework ===
{{ :courses:be0m33bdt:homework.pdf |PDF}}

==== Assessment and exam requirements ====

  * for assessment: at least 25 of the 50 possible points from the tests and homework; the more points you earn, the better your starting position for the exam
  * for exam: a short interview on theoretical topics; the final mark is based on both the assessment points and the exam performance

==== Useful links ====

  * [[https://wiki.metacentrum.cz/wiki/Hadoop|Metacentrum Hadoop reference page]]
  * [[https://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/FileSystemShell.html|HDFS DFS commands]]
  * [[https://learnxinyminutes.com/docs/python3/|Learn Python in Y minutes]]
  * [[https://docs.python.org/3/|Official Python documentation]]
  * [[https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-basics.php|Regular expressions at Ryan's tutorials]]
  * [[https://cwiki.apache.org/confluence/display/Hive/LanguageManual|Hive language manual]]
  * [[https://spark.apache.org/docs/1.6.0/|Apache Spark manual]]
  * [[http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html|PySpark SQL manual]]
  * [[https://github.com/databricks/spark-csv|CSV files import/export]]


==== Contact ====
Course coordinator: [[mailto:jan.hucin@profinit.eu|Jan Hučín]]

==== Literature ====
Hadoop: The Definitive Guide, 4th Edition, by Tom White