PyData London 2016
TBC, but along the lines of: What do you do when your data, whilst formally meeting common requirements, is, to put it mildly, an edge case? And what if you don't have the time to write a bespoke tool for your analysis? I will discuss this scenario and explain why my data is so difficult to work with in the first place.
TBC, but along the lines of: What has 5 times more DNA than a human, hugely repetitive regions, is a mixture of three plants, and can differ significantly between varieties? It also supplies 20% of the calories consumed by humanity, and 35% of us depend on it for survival, so there is a strong motivation to understand it. Wheat is not a 'model organism', and some of its features mean that assumptions generally made by the writers of bioinformatics software don't hold, which makes it hard to work with. In fact, tools which document themselves as suitable for data in the formats wheat researchers use often fall over in this use case. Ideally, we would have time to rewrite them for our use case, but this doesn't always happen.
This may either be a discussion or a talk. So far I've moved hardware twice, abandoned tools, and rewritten parts of tools for my use case (I have a simple Python example of this). My current task is to understand in detail why a particular piece of software is segfaulting, so I will have a lot more insight very shortly.
I will aim to keep the talk more about how to deal with data which differs significantly from the standards in a given field than about the particulars of wheat. There are some neat bioinformatics algorithms, though, so I will explain one to demonstrate why we need to store so much in RAM.