Representing text data from semi-structured log records is a challenging problem, and the quality of anomaly detection engines depends on it. In this presentation, I will show a pipeline that creates vector embeddings and normalization rules for semi-structured text data for use in anomaly detection.
Semi-structured data such as server logs or system activity metadata is key to detecting cybersecurity threats and security breaches. At F-Secure, we apply a variety of machine learning methods to detect anomalies in the stream of semi-structured, text-based events to protect our customers. However, many advanced techniques require a numerical representation of text data (file paths, program names, command-line arguments, registry records). The most popular methods (one-hot encoding and simple embeddings) do not capture the specific context and semantics of log data. Typically, the vocabulary of log data is much larger than that of a natural language. Moreover, we need to identify and normalize randomly generated paths, temporary files, software versions and command-line arguments.
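To make the normalization step concrete, here is a minimal sketch in Python. It is not the rule set used at F-Secure: the regular expressions, the placeholder tokens (`<guid>`, `<version>`, `<hex>`, `<tmpfile>`) and the function names `normalize` and `tokenize` are all assumptions chosen for illustration. The idea is to collapse randomly generated or versioned fragments into a shared placeholder so that the vocabulary stays manageable.

```python
import re

# Hypothetical normalization rules: the patterns and placeholder names are
# illustrative assumptions. Rules are applied in order to lowercased text.
RULES = [
    # GUIDs such as {3f2504e0-4f89-11d3-9a0c-0305e82c3301}
    (re.compile(r"\{?[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\}?"), "<guid>"),
    # Dotted version numbers such as 2.14.3
    (re.compile(r"\b\d+(?:\.\d+){1,3}\b"), "<version>"),
    # Long hexadecimal runs, often hashes or random identifiers
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<hex>"),
    # Temporary files such as tmp3f9a.tmp
    (re.compile(r"\btmp\w*\.tmp\b"), "<tmpfile>"),
]


def normalize(text):
    """Lowercase the text and replace volatile fragments with placeholders."""
    text = text.lower()
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


def tokenize(text):
    """Normalize, then split on path separators and whitespace."""
    return [tok for tok in re.split(r"[\\/\s]+", normalize(text)) if tok]


if __name__ == "__main__":
    print(tokenize(r"C:\Users\alice\AppData\Local\Temp\tmp3F9A.tmp"))
    print(tokenize(r"C:\Program Files\App\2.14.3\app.exe"))
```

With rules like these, two temporary files with different random names map to the same token sequence, which is what lets an embedding model treat them as the same kind of event.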
I will present a pipeline that creates vector embeddings and normalization rules for semi-structured data using Word2Vec, a popular natural language processing (NLP) model. At the end, I will show a simple anomaly detection engine that uses the embeddings to find potentially malicious activity. If you are interested in cybersecurity, NLP or log processing, you should find it appealing.
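One possible way to turn token embeddings into an anomaly score is sketched below. It is an assumption for illustration, not the engine shown in the talk. Each event is represented by the average of its token vectors and scored by its cosine distance from the centroid of known-good events. The two-dimensional vectors in `EMBEDDINGS` are hand-made placeholders standing in for the output of a trained Word2Vec model, and all function names are hypothetical.

```python
import math


def average(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def event_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    return average(known) if known else None


def anomaly_score(tokens, embeddings, centroid):
    """Distance of an event from the baseline centroid (higher = more unusual)."""
    vector = event_vector(tokens, embeddings)
    if vector is None:
        return 1.0  # assumption: an event with no known tokens is maximally unusual
    return cosine_distance(vector, centroid)


# Placeholder embeddings (NOT real model output), for illustration only.
EMBEDDINGS = {
    "explorer.exe": [0.9, 0.1],
    "chrome.exe": [0.8, 0.2],
    "<tmpfile>": [0.7, 0.3],
    "powershell.exe": [0.1, 0.9],
    "-encodedcommand": [0.0, 1.0],
}

# Events assumed to be benign; their centroid defines "normal".
BASELINE = [["explorer.exe", "chrome.exe"], ["chrome.exe", "<tmpfile>"]]
CENTROID = average([event_vector(event, EMBEDDINGS) for event in BASELINE])

if __name__ == "__main__":
    for event in (["chrome.exe"], ["powershell.exe", "-encodedcommand"]):
        print(event, round(anomaly_score(event, EMBEDDINGS, CENTROID), 3))
```

Averaging token vectors discards token order, so this is only a baseline. A threshold on the score would have to be tuned on real data.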