Big Data
Fourny, Ghislain
This course gives an overview of database technologies and of the most important database design principles that lay the foundations of the Big Data universe. We take the monolithic, one-machine relational stack from the 1970s, smash it down and rebuild it on top of large clusters: starting with distributed storage, and all the way up to syntax, models, validation, processing, indexing, and querying. A broad range of aspects is covered, with a focus on how they all fit together in the big picture of the Big Data ecosystem.
No data is harmed during this course; however, please be psychologically prepared that our data may not always be in third normal form.
- physical storage: distributed file systems (HDFS), object storage (S3), key-value stores
- logical storage: document stores (MongoDB), column stores (HBase), graph databases (Neo4j), data warehouses (ROLAP)
- data formats and syntaxes (XML, JSON, RDF, Turtle, CSV, XBRL, YAML, Protocol Buffers, Avro)
- data shapes and models (tables, trees, graphs, cubes)
- type systems and schemas: atomic types, structured types (arrays, maps), set-based type systems (?, *, +)
- an overview of functional, declarative programming languages across data shapes (SQL, XQuery, JSONiq, Cypher, MDX)
- the most important query paradigms (selection, projection, joining, grouping, ordering, windowing)
- paradigms for parallel processing: two-stage (MapReduce) and DAG-based (Spark)
- resource management (YARN)
- what a data center is made of and why it matters (racks, nodes, ...)
- underlying architectures (internal machinery of HDFS, HBase, Spark, Neo4j)
- optimization techniques (functional and declarative paradigms, query plans, rewrites, indexing)
- applications.
Large-scale analytics and machine learning are outside the scope of this course.
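To give a flavor of the two-stage parallel processing paradigm (MapReduce) listed among the topics, here is a minimal single-machine sketch in Python. It is an illustration only, not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to any Hadoop API, and on a real cluster the map and reduce stages would run in parallel on many machines, with the framework handling the shuffle in between.

```python
from collections import defaultdict


def map_phase(documents):
    """Map stage: emit one (word, 1) pair per word occurrence."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce stage: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Chain the stages: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(documents)))


# Hypothetical toy input, made up for this illustration.
docs = ["big data is big", "data is not always normalized"]
counts = word_count(docs)
```

The point of the sketch is the shape of the computation: the map stage looks at each record independently, and the reduce stage looks at each key independently, which is what allows both stages to be spread over a cluster.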