You just deployed a new version of an application or micro-service; how do you know everything works as expected? Before deploying, you ran your comprehensive test suite to verify functional correctness for known scenarios, along with performance tests. But is your application really working right now, or is it just responding to every incoming request with an error message?
I’m part of the team that runs a huge infrastructure for SAP HANA development. This infrastructure is vital for nearly all development and testing activities of SAP HANA. As it is powered by multiple applications developed in-house, we want to know immediately when an application starts to fail, and we need to be able to diagnose the cause of the failure quickly.
This talk will give you an overview of how we monitor our full stack, from the 2,000 physical machines up to the 10,000 Python application processes, micro-service instances and batch processing jobs running in parallel. It includes a review of the tools we use, bad and good examples of instrumentation in Python code, the resulting visualisation and an outlook on upcoming improvements.