Since the emergence of Elasticsearch, common information retrieval tasks such as indexing, scoring, and retrieving documents in a search engine have never been easier. However, unique challenges still exist when indexing large sets of data from databases. At Jopwell, we need to ensure that the data in our database is kept in constant sync with the data in our search index.
Initially, you need to take data from a traditional SQL database and flatten it for indexing in Elasticsearch. Because indexing this data can be memory-intensive, Celery is useful for indexing large sets of data in a distributed, memory-conservative manner. Once all your documents are in your Elasticsearch index, you need to retrieve data from your database related to a user’s search results.
In this talk, I’ll show the basics of creating a search engine in Python, keeping it synced with another data store, and keeping your index running smoothly.
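The flattening and batching steps described above can be sketched roughly as follows. This is a minimal illustration, not the talk's actual code: the `users` and `skills` rows, and the helper names `flatten_record` and `batched`, are hypothetical. In a real pipeline each batch would presumably be handed to a Celery task that sends it to Elasticsearch; here the batches are just collected in a list so the example stays self-contained.

```python
from itertools import islice


def flatten_record(user, skills):
    """Merge one parent row and its related child rows into a flat document."""
    return {
        "id": user["id"],
        "name": user["name"],
        # Child rows become a simple list field on the document.
        "skills": sorted(s["name"] for s in skills if s["user_id"] == user["id"]),
    }


def batched(iterable, size):
    """Yield lists of at most `size` items so only one batch is built at a time."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical rows, standing in for the results of SQL queries.
users = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Alan"},
]
skills = [
    {"user_id": 1, "name": "python"},
    {"user_id": 1, "name": "sql"},
    {"user_id": 2, "name": "cobol"},
]

# A generator keeps documents lazy; each batch is materialized only when needed.
documents = (flatten_record(user, skills) for user in users)

# Assumption: in production, each batch would be dispatched to a worker
# (for example a Celery task) instead of being collected in memory.
batches = list(batched(documents, 2))
```

Keeping the documents in a generator and cutting them into fixed-size batches bounds how much data any single worker has to hold at once.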
- Introduction to the problem (2 min)
- Building your document indexer (7 min)
  - Flattening database data into a search document
  - Using Celery to index documents efficiently
- Scoring and search results retrieval (7 min)
  - Scoring algorithms
  - Retrieving matching results from the database
- Strategies for syncing data from (7 min)
  - Traditional SQL database
  - Elasticsearch index
- Future work (2 min)
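As a rough illustration of the "retrieving matching results from the database" step, the sketch below assumes the search engine returns a ranked list of document IDs and that full rows then need to be loaded from the database in that same order. The `users` table, the in-memory SQLite database, and the function `fetch_in_ranked_order` are all hypothetical stand-ins, chosen only so the example runs on its own.

```python
import sqlite3


def fetch_in_ranked_order(conn, ranked_ids):
    """Load rows for the given ids and return them in search-rank order."""
    if not ranked_ids:
        return []
    placeholders = ",".join("?" for _ in ranked_ids)
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})",
        ranked_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL does not preserve the order of the IN list, so reorder by rank.
    # Ids missing from the database (stale index entries) are skipped.
    return [by_id[doc_id] for doc_id in ranked_ids if doc_id in by_id]


# Hypothetical database standing in for the application's SQL store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Alan")],
)

# Assumption: these ids came back from a search query, best match first.
# Id 99 simulates a document that is still indexed but was deleted from SQL.
ranked_ids = [3, 1, 99]
results = fetch_in_ranked_order(conn, ranked_ids)
```

Skipping IDs that are no longer in the database is one simple way to tolerate an index that has drifted slightly out of sync with the data store.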