The confusion around terms such as NoSQL, Big Data, Data Science, Spark, SQL, and Data Lakes often creates more fog than clarity. However, clarity about the underlying technologies is crucial to designing the best technical solution in any field that relies on huge amounts of data. This includes data science and machine learning, but also more traditional analytical systems such as data integration, data warehousing, reporting, and OLAP.
In my presentation, I will show that at least three dimensions are often cluttered and confused in discussions about data management: first, buzzwords (labels and terms like "big data", "AI", "data lake"); second, data design patterns (principles and best practices like selection push-down, materialization, indexing); and third, software platforms (concrete implementations and frameworks like Python, DBMS, Spark, and NoSQL systems).
Only by keeping these three dimensions apart is it possible to create technically sound architectures in the field of big data analytics.
I will show concrete examples that, through a simple redesign and a wise choice of tools and technologies, run up to 1000 times faster. This in turn triggers tremendous savings in terms of development time, hardware costs, and maintenance effort.
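
To make the distinction between a design pattern and a platform concrete, here is a minimal sketch of selection push-down, one of the patterns mentioned above. It is not one of the examples from the presentation. The table `events`, its columns, and all function names are hypothetical and chosen only for illustration, and SQLite merely stands in for any DBMS. Both functions compute the same result; the difference is where the filter is evaluated.

```python
import sqlite3


def build_db(n_rows=10_000):
    """Create an in-memory database with a hypothetical `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    # Every 100th row belongs to country 'DE', all others to 'US'.
    rows = [(i, "DE" if i % 100 == 0 else "US", float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # An index lets the DBMS avoid scanning rows that cannot match.
    conn.execute("CREATE INDEX idx_events_country ON events (country)")
    conn.commit()
    return conn


def total_without_pushdown(conn, country):
    """Fetch every row into the application, then filter and sum there."""
    rows = conn.execute("SELECT country, amount FROM events").fetchall()
    return sum(amount for c, amount in rows if c == country)


def total_with_pushdown(conn, country):
    """Push the selection (and the aggregation) down into the DBMS."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM events WHERE country = ?",
        (country,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_db()
    print(total_without_pushdown(conn, "DE"))
    print(total_with_pushdown(conn, "DE"))
```

The pattern itself is independent of the platform: the same idea applies whether the filter is pushed into a relational DBMS, a Spark data source, or a NoSQL store, which is why it helps to discuss the pattern separately from the tool.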