Friday, January 30, 2015

Following a comment from Scott on the last entry here, I looked into making a Wikipedia-based poetry generator. Scraping random Wikipedia pages isn't too difficult, but the wikipedia Python library makes this project almost trivial. There's another library called Pywikibot; I may have to try that one next.

In the meantime, I have a script for producing short poems from a set of Wikipedia entries. It's only as good as the entries it's retrieved so far. I'd like to turn this into more of a machine learning project such that acceptable output can be distinguished from messy nonsense, but for now it's just a text corpus and a set of Markov chains.
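A minimal sketch of the "text corpus and a set of Markov chains" idea, assuming a word-level chain and a plain string standing in for the retrieved Wikipedia text. The function names and parameters here are illustrative, not the ones in my actual script, and fetching the entries is left out entirely.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def make_line(chain, max_words=6, rng=random):
    """Walk the chain from a random starting state to produce one line."""
    state = rng.choice(list(chain))
    line = list(state)
    while len(line) < max_words:
        followers = chain.get(state)
        if not followers:  # dead end: nothing ever followed this state
            break
        line.append(rng.choice(followers))
        state = tuple(line[-len(state):])
    return " ".join(line)


def make_poem(chain, n_lines=4, max_words=6, rng=random):
    """Join several independently generated lines into a short poem."""
    return "\n".join(make_line(chain, max_words, rng) for _ in range(n_lines))


if __name__ == "__main__":
    # Stand-in corpus; the real one would be text pulled from Wikipedia.
    corpus = "the reservoir was the second largest in the township"
    print(make_poem(build_chain(corpus)))
```

Lines end early whenever the walk hits a word that nothing follows in the corpus, which is one reason the output tends to trail off mid-thought.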

Written language has one benefit here: when it's nonsense, it looks like nonsense to the human eye, unlike, say, a protein sequence. It would be nice to leverage that advantage.

Here's a sample of what I get:

Callinicos, 1989). His daughter Princess Catherine
Clair Township - west Israel Township,
In November 
Interment in Blue
Smith died suddenly
Richard Powell Sharkey, Helen Derr's 
In January 2010, Obadeyi went
They have also been recorded
3 was purely due to 
He even
As soon as look at you,
The reservoir was the second  

Wednesday, January 28, 2015

Today I learned about the existence of the DBpedia project. It essentially turns Wikipedia into a very large database, enabling the information within it to be searched in more specific ways than usual.

I found out about the project through this recent arXiv paper about computational fact checking. They used DBpedia to make a set of triples along the lines of "Cats are mammals." I suspect their project could be useful for more than just fact-checking claims of reptilian cats. Perhaps it could be mashed up with something like this timeline project to have a constantly updated set of facts most likely to need verifying today.
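To make the triple idea concrete, here is a toy sketch, assuming a handful of hand-written (subject, predicate, object) triples rather than anything actually pulled from DBpedia. The `is_a` predicate and the `is_supported` function are hypothetical names, and the reachability check is just one simple way to test a claim, not the method used in the paper.

```python
from collections import deque

# Hypothetical hand-written triples; real ones would come from DBpedia.
TRIPLES = {
    ("Cat", "is_a", "Mammal"),
    ("Mammal", "is_a", "Animal"),
    ("Lizard", "is_a", "Reptile"),
    ("Reptile", "is_a", "Animal"),
}


def is_supported(subject, obj, triples=TRIPLES):
    """Return True if `obj` is reachable from `subject` via is_a links."""
    parents = {}
    for s, p, o in triples:
        if p == "is_a":
            parents.setdefault(s, set()).add(o)
    # Breadth-first search up the is_a hierarchy.
    queue, seen = deque([subject]), {subject}
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent == obj:
                return True
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


if __name__ == "__main__":
    print(is_supported("Cat", "Mammal"))   # direct triple
    print(is_supported("Cat", "Reptile"))  # the reptilian cat claim
```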

Monday, January 26, 2015

Gotta have that Mechismo

Staphylococcus aureus alpha hemolysin. Not terribly relevant here but it looks nice. Structural view c/o Dcrjsr and Vincent Chen on Wikimedia Commons.
I found out about Mechismo today - it's a server-based tool for exploring the potential impacts of changes in protein sequences and structures. It will accept a list of potential variants of a protein (even just phosphorylation of a particular site, for example) and then return lists and networks of the potential impact. It's nicely presented and appears to be quite flexible. Here's my favorite part, though: unlike some other tools, it's not species-limited and looks like it should work with virtually any protein with an accession number. It will be worth trying out a bit more with some real data.

Here's the citation for the original paper (not quite sure how I missed it as it's been online since this past November, but it's hard to catch everything):
Betts, M. J. et al. Mechismo: predicting the mechanistic impact of mutations and modifications on molecular interactions. Nucleic Acids Res. 43, e10– (2014).

Friday, January 23, 2015

I've been enjoying Low Light Mixes today. Yes, it tends to be more of the relaxed, ambient material I post here regularly, but these mixes are lovingly curated and well worth listening to.

Tuesday, January 20, 2015

Milk-borne viruses and toasty bacteria

Today I learned about two interesting items:

  • Human mammary tumor virus (HMTV) exists. There's a retrovirus in mice with the straightforward name of Mouse mammary tumor virus (MMTV) and, as the name implies, it's linked with mammary tumors in mice. Mother mice carrying the virus pass it to their pups through milk. HMTV is the human-specific equivalent of MMTV. That's about all I can say without venturing into the realm of contentious debate: we can't really say whether either virus directly causes breast cancer in humans. Even so, HMTV tends to show up frequently in women who've had breast cancer and it's not the only type of retrovirus found in milk.
  • A paper from this past June (citation below) claims to have improved the heat tolerance of E. coli by expressing a heat shock protein from C. elegans in them. It's unusual to express C. elegans proteins in E. coli as they aren't even in the same domain of life (that being said, we regularly express viral proteins in our cells of choice in the lab and no one thinks twice about it). The authors claim that this heterologous expression allowed the experimental E. coli to grow at temperatures up to 50 degrees C and even briefly at 58 degrees. That's really hot - I've never been able to grow it above 42 degrees myself. I'm skeptical, not least because there's some substantial speculation going on in this paper, but the results can't be that difficult to replicate, right?
Citation for that second item: Ezemaduka, A. N. et al. A small heat shock protein enables Escherichia coli to grow at a lethal temperature of 50°C conceivably by maintaining cell envelope integrity. J Bacteriol 196, 2004–2011 (2014).

Friday, January 16, 2015

Here's the music for today:
Ben Lukas Boysen - Gravity
This is spaceflight music! I'm not flying anywhere today, space or otherwise, yet gravity is still a factor.

Wednesday, January 14, 2015

The usual suspects: databases of virulence factors in Bacteria

Bacteria have earned much of their reputation. Despite their close relationship with humanity, whether in our microbiomes or in our food and beverages, various bacterial species still pose a threat to our society. Luckily, genomics allows us to quickly compare virulent bacterial strains with more temperate ones. The result is more data, and where there's data, there's eventually a database.

Here are a few on the subject:

PATRIC - The Pathosystems Resource Integration Center
Most recently, PATRIC has added a set of virulence factors, manually curated from the literature and from other databases, for six big-name bacterial pathogen genera, including Mycobacterium and Listeria. They now specifically detail virulence factors for ten different pathogen species. Many of these are virulence factor homologs identified by protein BLAST.

PATRIC doesn't make it easy to just get virulence factors, but if you're on an organism page, hit the Specialty Genes tab and then click the Virulence Factor checkbox on the left to filter.
Listeria. Always clean your vegetables.
A recent citation: Mao, C. et al. Curation, integration and visualization of bacterial virulence factors in PATRIC. Bioinformatics 31, 252–8 (2015).

VFDB
Oh, wait, this one's included in PATRIC now as far as I can tell. Its last major release was in 2012.

PAIDB
This database is a bit different: it contains pathogenicity islands rather than just virulence factors. Pathogenicity islands are rich in virulence factors and all manner of other novel genes, so they're important to consider in any discussion of virulence factors. PAIDB appears to be actively updated and even contains a pathogenicity island search function (PAI Finder), though it's just based on BLASTing against known islands and virulence genes.

Their logo is perfect.

A recent citation: Yoon, S. H., Park, Y.-K. & Kim, J. F. PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands. Nucleic Acids Res gku985– (2014). doi:10.1093/nar/gku985

MvirDB
A database with some overlap with PATRIC, MvirDB doesn't look like much and I'm not sure if it's actively maintained. That being said, it's essentially just a few tables, accessible through the download links on the left of the page. MvirDB unifies some smaller databases so it may contain completely unique sets of annotations.

UniProt
OK, it's not explicitly a database of virulence factors. It's easy to filter it by GO terms, though, like the term for pathogenesis. That GO term has a matching keyword in UniProt, and you can search for it by dropping this in the search box: keyword:"Virulence [KW-0843]". The results can be filtered down to the reviewed Swiss-Prot set for a subset of reliable virulence factors.

Try it yourself!
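For anyone who would rather script the UniProt search than use the search box, here is a rough sketch that only builds the request URL and never sends it. The endpoint, the `keyword:KW-0843` and `reviewed:true` query fields, and the `format` and `size` parameters are my assumptions about UniProt's current REST service (the search-box syntax above is the older form), so check the UniProt API documentation before relying on them. The function name is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint for UniProt's REST search service; verify before use.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def virulence_query_url(reviewed_only=True, fmt="tsv", size=25):
    """Build (but do not send) a search URL for the Virulence keyword."""
    query = "keyword:KW-0843"
    if reviewed_only:
        query += " AND reviewed:true"  # restrict to the Swiss-Prot set
    params = {"query": query, "format": fmt, "size": size}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(virulence_query_url())
```

The URL can then be pasted into a browser or handed to whatever HTTP client you prefer.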

The NCBI databases don't make it quite this easy, but linking through the Biosystems database seems to help. Here's an example.





Wednesday, January 07, 2015

I read this Forbes article about 23andMe on the bus this afternoon. It details a perfect instance of the customer as commodity. I'm excited about the possibilities regarding personal genomics, especially as whole-human-genome sequencing gets cheaper and easier to interpret. Control over the resulting data will need to remain in the hands of the consumers, though. Those who volunteer their genetic material for analysis need access to the results.

Opener and opener

Here's another decontextualized bit from Emotional Intelligence 2.0: "Be Open and Be Curious".* I think this may be one of the more useful strategies presented in the book, though that makes sense as it's strategy #1 on their list of Relationship Management techniques. The strategy could be reworded in this way: provide context for yourself and others.

It's easy to interact with people while making assumptions about them. We really don't have any other choice as it's just more efficient that way. If someone's behaving angrily, we tend to assume that something irritated them and leave it at that. If they're acting in a way we find strange or illogical, we tend to assume that's just how they are. These tendencies rub me the wrong way. There's a reason for everything, so why not pursue additional detail? A lack of context really may be one of the most pervasive communication issues of the information age.

The whole strategy of "be open and curious" is vague, of course. It's something we all have to do to some extent but also something we might avoid when it's most necessary. As it's worded, the strategy also doesn't really provide any meaningful way to get beyond small talk. That being said, I'd like to think it's applicable to more than just in-person interactions. It may be most effective in spaces where context is at a premium (i.e., social networks) and the culture of anonymity clashes with the demand for content. I'll try it out and see how it works.

*Alternate lesson: tell everyone you were in the Marines at every possible convenience unless that's actually true.

Tuesday, January 06, 2015

Ambience

The music for today is...well, it's this playlist. It's my carefully curated list of background noise, ideal for working. I've kept vocals to a minimum and it's all ambient but not necessarily peaceful. Give it a try if you can. The embedded playlist here only contains 200 tracks but the full version contains exactly 1000.

Monday, January 05, 2015

The wonders of combustion. Photo by Stewart Butterfield.
Happy new year to you, reader!

The keyword at this time of year is usually resolutions*, so I'll share one with you: I will read more papers from my field. It's usually not too difficult for me to skim a paper or two each day, at least, but they frequently aren't directly relevant to what I'm doing. There's just so much interesting material out there! Keeping a wide scope when reading research papers certainly aids with perspective but it provides less material to implement immediately.

Luckily, fresh research has never been easier to find. I'll find at least one directly-relevant paper each week and briefly discuss it on this blog. "Directly-relevant" could include computational microbiology, evolution, bacteriophage biology, or even just novel methods and software.

Today's material is a short report by Marc del Grande and Gabriel Moreno-Hagelsieb in BMC Research Notes. It's relevant because it deals with a concept I've explored lately: gene conservation across numerous bacterial species. A group led by Moreno-Hagelsieb already developed a tool for generating a set of non-redundant genomes**, but here they use it to generate a set of prokaryote*** genomes to analyze co-conservation of their transcription factors (TFs). Why transcription factors? Bacterial genomes contain many transcription factors - even more if we count predicted ones - but it isn't clear if they're broadly conserved in pathways or if they usually show up through rapid evolutionary processes like horizontal gene transfer.**** The set of genomes in this paper doesn't include genomes smaller than 2.5 Mb, presumably to avoid the bias of minimal genomes, but it would have been nice to see them.

As with similar studies, this one was limited by the available data. It's difficult to make predictions about TF interactor conservation when we're not sure what these TFs interact with in the first place. In the end, the authors had to predict the existence of TFs across 857 genomes, then examine the TFs and their interactors in their chosen models of E. coli and B. subtilis. Their conclusion: compared to other protein-coding genes, those coding for TFs have fewer conserved potential interactions across Bacteria. The emphasis on "potential" is mine, as these are predicted interactions based on conservation. That's fine unless the TF has some other reason for its conservation. It's an interesting comparison so I'm curious to see how TF interactions look across bacterial protein interactomes (spoiler alert: available data sources are still a limiting factor).
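As a rough illustration of what "co-conservation" can mean in practice, here is a toy sketch that compares presence/absence profiles of genes across genomes using a Jaccard index. The gene names and profiles are invented, and the Jaccard index is simply one easy scoring choice; I'm not claiming it is the measure used in the paper.

```python
def jaccard(profile_a, profile_b):
    """Fraction of genomes carrying either gene that carry both."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical presence (1) / absence (0) calls across five genomes.
profiles = {
    "tf_gene":     [1, 1, 0, 1, 0],
    "target_gene": [1, 1, 0, 0, 0],
    "other_gene":  [0, 0, 1, 0, 1],
}

if __name__ == "__main__":
    for name in ("target_gene", "other_gene"):
        score = jaccard(profiles["tf_gene"], profiles[name])
        print(f"tf_gene vs {name}: {score:.2f}")
```

A pair of genes that tend to be kept or lost together scores near 1, while a pair that never co-occurs scores 0.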

Citation:
Del Grande, M. & Moreno-Hagelsieb, G. The loose evolutionary relationships between transcription factors and other gene products across prokaryotes. BMC Res. Notes 7, 928 (2014).


*I'm also resolving to use more images in my blog posts.

**Their report promised a web interface but that doesn't seem to have happened.

***The term "prokaryote" is archaic, isn't it? This paper is really only talking about Bacteria.

****These two models aren't mutually exclusive.