Wednesday, March 04, 2015

Monday, March 02, 2015

My paper about protein complexes in bacteria is officially published - check it out in PLoS Computational Biology.

Essentially a picture of how many species are not E. coli (except those at the bottom - many of them are E. coli). Click to zoom in or get more context in the linked paper.
The one-sentence summary is as follows: bacteria are different from each other. OK, so that isn't really a mind-blowing observation. Microbiologists tend to think of conserved features in a monolithic way when they're working with multiple species, though, to the point that a ribosome from E. coli is considered more or less equivalent to a ribosome from a Bacillus species or even something exotic like Thermus thermophilus.

Our point here was to show that E. coli protein complexes, on average, don't appear to provide a solid model of those complexes present in other bacterial species. The really well conserved protein complexes tend to be those critical to Life As We Know It (i.e., RNA polymerases), but even among some complexes crucial to one species' existence, a closely-related species may lack one piece of the larger complex.* There's a broader biocomplexity issue here, too: we can't work under the assumption that protein complex components will be consistently co-conserved. Two species may appear to contain a complex of similar function, but one species may put the complex together in a different way than the other.

In some cases, complex parts may even be replaced with those more appropriate to the needs of the species. That's a broader discussion than we attempt in this manuscript but it's an ongoing course of study.

*Lack is a strong word here. If a strain or species doesn't need a certain protein (that is, it's not essential) and it doesn't have any obvious inability to maintain growth in its own environmental niche, does it really lack the protein?

Tuesday, February 24, 2015

Academic job searches in two dimensions

More than a few faculty members have reacted sympathetically when I've mentioned the problem of dual academic careers to them. It's hard enough for one person with extensive education to find a job: they may be exceedingly qualified for a set of jobs but only a few of those positions may be open at a time, whether we include non-academic career options or not. Adding another experienced scholar into that mix complicates the situation further. There's a reason this is called the two-body problem and not the two-body paradise of unending joy and tranquility.

How can we make it easier for couples to find jobs when they have very specific requirements?
Location is a major limiting factor. The availability of job information is likely an even larger factor: that majority of available jobs aren't posted publicly. Even among educational institutions, many of which must announce job openings, faculty participate in distinct but invisible hiring networks governed by status and prestige.

We'll have to work with what we have.
    Pairs in places. Some job search sites offer a geographic view.
As far as job searches go, I've found the following sites:

  • The HERC Dual Career search tool. I think I've mentioned this one on the blog before. It only covers the US but includes a variety of institutions and job types.
  • HigherEdJobs dual career search.  Like the HERC site, this one primarily searches the US, but includes Canada and potentially even beyond, though I haven't found any postings there yet. The listings appear to include a number of older posts - I saw some dating from 2012 - which could be a useful feature if you're trying to find out who could be hiring rather than just who is hiring right now.
  • Inside Higher Ed Careers dual career search. This search tool claims to search anywhere in the world but that may be an exaggeration.
  • EURAXESS. Not exclusively a career search site (it does have a search tool for research jobs in Europe, though the selection may be limited depending upon the field) but also a guide for academics who might like to work in Europe.
  • Many universities have dual-career assistance programs, sometimes through HERC. The programs may be referred to using other terms, i.e. Partner Opportunities. Most of them offer job search assistance to the partners of prospective hires, though one of you needs to get that far first! 
A variety of projects dedicated to this task exist (i.e., TANDEM) but they don't appear to have been organized with job seekers in mind. 

One of my recent side projects involves extending the dual-career search concept to jobs worldwide. I'm trying to make it into a meta-search by using publicly available job posts, though the major issue with this approach is variability in post format. Many of them don't even explicitly list geographic locations and a subset even list false-positive locations (a job may be headquartered in London but work may occur somewhere else). There will be some guesswork to do.

Saturday, February 21, 2015

Communication across the tree of life

A squid, the traditional model of quorum sensing. It glows because the Vibrio inside it glow. They have to stay coordinated to glow. Quorum sensing lets that happen. More details and photo source here.
I was looking for a recent review of quorum sensing this morning - specifically in the context of bacterial communication across domains of life - and found this minireview in Frontiers in Plant Science. It isn't about plants. Rather, it uses Pseudomonas aeruginosa interactions with human cells as an example of inter-kingdom interactions* and the resulting behavior changes. It's not the most comprehensive review of inter-kingdom quorum sensing out there (this 2008 review by Hughes and Sperandio comes to mind, plus I'm sure there's more recent reviews I haven't found yet**) but it covers some important developments from the past few years.

As usual, here's my question: where do phages fit into the picture? If we conclude, as Holm and Vikström do, that this particular bacteria-host relationship would be quite different without quorum sensing molecules, the next logical step is to examine factors in the environment with other impacts on quorum sensing (and AHL production, specifically). Those factors certainly include other bacterial cells but they also include phages. There is clearly a precedent for studying phage-host interactions as they occur in quorum sensing: E. coli has been found to reduce the number of potential lambda phage receptors on its surface in response to AHL levels and quorum sensing genes have even been found in a phage infecting C. difficile. It's time to include the viral component of the microbiome in our quorum sensing models.

*It's surveillance rather than a two-way conversation, unless you consider phagocytosis to be a good metaphor for small talk.
**I should probably mention this 2009 Ng and Bassler review, too, because of Bonnie Bassler.

Thursday, February 19, 2015

I was getting ready to write something* to replace Biopython's SwissProt module today as it can't parse Uniprot entries yet. Just as I was starting to do so, I found BioServices, which appears to plug some of the holes in Biopython with regards to services like Uniprot and KEGG. These services all have REST APIs so they aren't difficult to retrieve individual entries from. Writing a new parser for each service, though, is tedious and I'm glad there's an off-the-shelf solution.

One caveat: the Uniprot search function is billed as "a bit unstable". I haven't been able to get it to complete any searches successfully. BioServices may be more useful if a list of accession IDs is already available, though that defeats the purpose for me.

Additional caveat: the REST API does most of the work already in accessing Uniprot, so if you're willing to get your hands dirty then you can access the database contents easily with BeautifulSoup and the right urls. A simple proteome query may look like this:
"http://www.uniprot.org/proteomes/?query=" + your_query_here + "&fil=reference%3Ayes&sort=score"

*And in the end, I wrote it myself anyway.

Sunday, February 15, 2015

Overcome thoroughly the manner of working like extinguishing fire

Not a catchy slogan but free recipes are always nice.
I read this BBC magazine piece about North Korean propaganda slogans this evening. It triggered two questions:
1. Where's the inflection point between "North Korea as convenient source of surreal entertainment" and "North Korea as source of genuine misery and an extant totalitarian state"?

2. Wouldn't propaganda slogans serve as excellent candidates for Markov tomfoolery?

I only have an answer for the second question. The answer is, of course, Yes.

I started off using the BBC's translated list of NK slogans and generated new ones in a very basic fashion. A few examples follow:

Let the whole nation join efforts to open up a broad avenue to independent reunification in this year that marks the 70th anniversaries of national resurrection and legendary hero of the proclivity to imports and ensure domestic production of coal and cement to the death!

Let us fiercely raise the sports wind across the Party Central Committee headed by the Moranbong Band to bring about a great golden age of construction by thoroughly applying the spirit of the office of the dignified powerful Paektusan nation.

Let us glorify the Kumsusan Palace of the Songun revolution and the 25th anniversary of the Workers' Party of Korea as the buglers of the country prospers; the less books we read, the later the country in all aspects as the major fronts of building a thriving socialist country.

OK, so these are virtually indistinguishable from the original slogans. It isn't a large text sample, but perhaps more importantly, there's noticeable consistency (about a third of the ~300 slogans begin with "Let us").  It isn't surprising: propaganda is intended to be simple, inspirational, and accessible. It doesn't usually vary from the basic idea of inspiration plus action equals solidarity.

With that it mind, I should add additional slogans to the training set. They should ideally include material from a variety of movements and political entities. This means that the final product will make less contextual sense, but hey, the goal here is to generate content in the absence of context.

Friday, February 13, 2015

Here's one more option for generating plots in R: plotrix.
It can create some novel things like nested bar graphs and can search for empty space in plots.
There are a number of examples here. The 3D plots are probably best avoided but many of these examples include features not present in ggplot2.
Continuing the music adventure from yesterday, today's genre is Crossover Jazz.
Spotify doesn't use this genre annotation for any specific artist but Wikipedia tells me the term could apply to everyone from Al Jarreau to Spyro Gyra. In some ways, it's an early or more moderate form of Smooth Jazz (as in Kenny G, though he's a crossover jazz musician as well).

I'm going to start off with Gerald Albright and Chuck Loeb, then go from there.

How is it making me feel?
Most of the time I feel like I'm on the Weather Channel during a calm day. Even when it's improvisational, this music tends to feel restrained and commercial, like it's trying to avoid being obtrusive. This effect leaves me feeling restless.

What was memorable?
Gerald Albright - Slam Dunk was one of the few tracks I've heard so far today which sounded like the Wikipedia genre description.

Thursday, February 12, 2015

It's easy to build habits when it comes to media consumption. How can we expand our horizons when we become increasingly content with the same thing? This quandary is one reason why I've enjoyed the emergence of streaming music services. Musical horizon-broadening is frictionless.

So, how do we start? Randomly, of course. I wrote a small script to provide lists of artists from randomly-chosen music genres available on Spotify. The spotipy library was very helpful. Today, I'll delve into the history of a single genre, depending upon what the script tells me to listen to.

That genre is: Merengue!

Peaks. Yes, I know what the difference is. From Tamorlan on Wikimedia Commons.
Spotify tells me I should start with Juan Luis Guerra, so he's first. Then it's Frank Reyes and Antony Santos. After them, it'll be some Toño Rosario and, in an effort to find some historical context, Wilfredo Vargas.

Observations:
How did I feel?
Fairly happy - even the more downtempo songs have a steady pace (which makes sense as this is dance music, after all). It's easy to work with it in the background as I know just enough Spanish to understand the lyrics but not enough to follow the lyrical context. It usually seems to be something about dancing or grandmothers.

What was memorable?
El Baile del Perrito.
Antony Santos - Me Quiero Morir (With that title, I guess it's actually merengue bachata)

Wednesday, February 11, 2015

This project is exactly the kind of thing most of my projects seem to turn into*, though much more competently implemented.

It's an automation bot for Tinder. It builds an average face out of a set of liked/disliked faces, then makes choices about newly provided Tinder profiles based on eigenfaces made using the average face. After that point, it's a chatbot: it carries on conversations with the users behind matching profiles, continuing the conversation if it gets positive responses.

But does it work? That is, does it accomplish the same goal as having a human operator? According to its creator:
What do girls think of the bot? I've gone on at least 10 dates with the help of the bot and I've shown my partners the bot in its entirety. One date literally didn't believe me and thought I was pulling her leg. Another person thought it was really cool and wanted the full tour. All were in agreement that it is not creepy, though some felt it was borderline. Kind of nice considering it's not something you'd come across everyday.
The obvious question here is how well it works for other people.

* I'm not trying to automate dating, but I am trying to automate tedious processes among systems with emergent properties.

Monday, February 09, 2015

A multitude of R plot examples

You don't have to do it all by hand anymore. Note: I don't work with this kind of data.

Here's what happens more often than it should when I sit down to make some figures:

  1. I start with R and remember seeing an example similar to what I'm trying to make
  2. Searching for the example leads me back to the helpful but limited ggplot2 docs or a Stackoverflow question
  3. I piece together what I need from what I've found, knowing all the while that a better example was out there somewhere

This kind of thing drives me crazy, so here's a short list of places where decent R graph examples can be found. Hopefully these sites help others find what they need. One caveat: just because R can spit out a particular figure doesn't mean that figure is appropriate or represents the data well. Your mileage may vary, etc.


  • R Graph Catalog. Intended to complement a guidebook, this set of examples covers a wide variety of presentations and audiences. It filters graph types based on whether they're recommended or not (but hey, you can still use the examples). All code is included right on the site and on Github.
  • Quick-R Graphs. This site covers the basics and includes some useful figures like a plotting symbol chart.  
  • Cookbook for R Graphs. Another book accompaniment.
  • Plotly R Library. Here's where things start to get exotic. Plotly isn't R specific but supposedly plays nice with ggplot2. It could be worth using for interactive charts.
  • Gvis cookbook. A ggplot2 alternative. It also allows for interactive graphs.
  • Wikibooks R Programming Graphics. A few more examples, including some in 3D (those should probably be avoided, honestly).
  • R-bloggers. Not a list of examples as much as a source for examples, especially those of the bleeding-edge kind.
If you're getting tired of finding example graphs, there's also GrapheR, a GUI for producing R graphics.

Edit: wanted to add a postscript about some hacky ways to make ggplot2 display patterns.

Edit 2: I found one more resource in the ZevRoss ggplot2 cheatsheet. It's primary a guide to themes in plots, one of the more fiddly aspects of plotting in R.


Friday, February 06, 2015

The Wikipedia-based poetry project is working differently now. I've taught it to combine Markov chain results with raw bits of related WP pages. As a result, I'm getting farther away from a purely generative model and closer to one reflecting human writing. It still has some distance to cover. Here's a recent example:
he was
arrested at his home
in bloomington, minnesota. the
eastern half
he once entertained the possibility that any
could pass 32nd state on may 11, 1858, created from the
braddock takes
maggie with him after that moved to the term article is also used loosely by some
to include the determiner some. use of the prize
is not used at
present. the hurricanes' aim to
she also included the policy
of isolationism; the
historian manfred the worst wreck in the prussian minister of
war also used loosely
by
some to include the determiner
It's always going to sound like an encycopedia, but eventually it will sound like a self-aware encyclopedia wracked with dread.

Thursday, February 05, 2015

Music: Bersarin Quartett

There was a site where graphics like this could be produced easily but I don't recall its name. It was something difficult to search for, like "Flame" or "Ember".

Today's music is the lovely Bersarin Quartett. Try this track:

Friday, January 30, 2015

Following a comment from Scott on the last entry here, I looked into making a Wikipedia-based poetry generator. Scraping random WP pages isn't too difficult but the wikipedia Python library makes this project almost trivial. There's another library called Pywikibot - I may have to try that one next.

In the meantime, I have a script for producing short poems from a set of Wikipedia entries. It's only as good as the entries it's retrieved so far. I'd like to turn this into more of a machine learning project such that acceptable output can be distinguished from messy nonsense, but for now it's just a text corpus and a set of Markov chains.

Written language has the benefit of looking like nonsense to the human eye when we can't make sense of it, unlike, say, a protein sequence. It would be nice to leverage that advantage.

In the meantime, this is what I get:

Callinicos, 1989). His daughter Princess Catherine
Clair Township - west Israel Township,
In November 
Interment in Blue
Smith died suddenly
Richard Powell Sharkey, Helen Derr's 
In January 2010, Obadeyi went
They have also been recorded
3 was purely due to 
He even
As soon as look at you,
The reservoir was the second  

Wednesday, January 28, 2015

Today I learned about the existence of the dbpedia project. It essentially turns Wikipedia into a very large database, enabling the information within it to be searched in more specific ways than usual.

I found out about the project through this recent arXiv paper about computational fact checking. They used dbpedia to make a set of triples along the lines of "Cats are mammals." I suspect their project could be useful for more than just fact-checking claims of reptilian cats. Perhaps it could be mashed up with something like this timeline project to have a constantly-updated set of facts most likely to need verifying today.

Monday, January 26, 2015

Gotta have that Mechismo

Staphylococcus aureus alpha hemolysin. Not terribly relevant here but it looks nice. Structural view c/o Dcrjsr and Vincent Chen on Wikimedia Commons.
I found out about Mechismo today - it's a server-based tool for exploring the potential impacts of changes in protein sequences and structures. It will accept a list of potential variants of a protein (even just phosphorylation of a particular site, for example) and then return lists and networks of the potential impact. It's nicely presented and appears to be quite flexible. Here's my favorite part, though: unlike some other tools, it's not species-limited and looks like it should work with virtually any protein with an accession number. It will be worth trying out a bit more with some real data.

Here's the citation for the original paper (not quite sure how I missed it as it's been online since this past November, but it's hard to catch everything):
Betts, M. J. et al. Mechismo: predicting the mechanistic impact of mutations and modifications on molecular interactions. Nucleic Acids Res. 43, e10– (2014).

Friday, January 23, 2015

I've been enjoying Low Light Mixes today. Yes, it tends to be more of the relaxed, ambient material I post here regularly, but these mixes are lovingly curated and well worth listening to.

Tuesday, January 20, 2015

Milk-borne viruses and toasty bacteria

Today I learned about two interesting items:

  • Human mammary tumor virus (HMTV) exists. There's a retrovirus in mice with the straightforward name of Mouse mammary tumor virus (MMTV) and, as the name implies, it's linked with mammary tumors in mice. Mother mice carrying the virus pass it to their pups through milk. HMTV is the human-specific equivalent of MMTV. That's about all I can say without venturing into the realm of contentious debate: we can't really say whether either virus directly causes breast cancer in humans. Even so, it tends to show up frequently in women who've had breast cancer and it's not the only type of retrovirus found in milk.
  • A paper from this past June (citation below) claims to have improved the heat tolerance of E. coli by expressing a heat shock protein from C. elegans in them. It's unusual to express C. elegans proteins in E. coli as they aren't even in the same domain of life (that being said, we regularly express viral proteins in our cells of choice in the lab and no one thinks twice about it). The authors claim that this heterologous expression allowed the experimental E. coli to grow at temps up to 50 degrees C and even briefly at 58 degrees. That's really hot - I've never been able to grow it above 42 degrees myself. I'm skeptical, not the least of which because there's some substantial speculation going on in this paper, but the results can't be that difficult to replicate, right?
Citation for that second item: Ezemaduka, A. N. et al. A small heat shock protein enables Escherichia coli to grow at a lethal temperature of 50°C conceivably by maintaining cell envelope integrity. J Bacteriol 196, 2004–2011 (2014).

Friday, January 16, 2015

Here's the music for today:
Ben Lukas Boysen - Gravity
This is spaceflight music! I'm not flying anywhere today, space or otherwise, yet gravity is still a factor.

Wednesday, January 14, 2015

The usual suspects: databases of virulence factors in Bacteria

Bacteria have earned much of their reputation. Despite their close relationship with humanity, whether in our microbiomes or in our food and beverages, various bacterial species still pose a threat to our society. Luckily, genomics allows us to quickly compare virulent bacterial strains with more temperate ones. The result is more data, and where there's data, there's eventually a database.

Here are a few on the subject:

PATRIC - The Pathosystems Resource Integration Center
Most recently, PATRIC has added a manually literature and database-curated set of virulence factors from six big-name bacterial pathogen genera, including Mycobacterium and Listeria. They now specifically detail virulence factors for ten different pathogen species. Many of these are virulence factor homologs identified by protein BLAST.

PATRIC doesn't make it easy to just get virulence factors, but if you're on an organism page, hit the Specialty Genes tab and then click the Virulence Factor checkbox on the left to filter.
Listeria. Always clean your vegetables.
A recent citation: Mao, C. et al. Curation, integration and visualization of bacterial virulence factors in PATRIC. Bioinformatics 31, 252–8 (2015).

VFDB
Oh, wait, this one's included in PATRIC now as far as I can tell. Its last major release was in 2012.

PAI DB
This database is a bit different: it contains pathogenicity islands rather than just virulence factors. Pathogenicity islands are rich in virulence factors and all manner of other novel genes, so they're important to consider in any discussion of virulence factors. PAI DB appears to be actively updated and even contains a pathogenicity island search function (PAI Finder) though it's just based on BLASTing against known islands and virulence genes.

Their logo is perfect.

A recent citation: Yoon, S. H., Park, Y.-K. & Kim, J. F. PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands. Nucleic Acids Res gku985– (2014). doi:10.1093/nar/gku985

MVirDB
A database with some overlap between PATRIC, MvirDB doesn't look like much and I'm not sure if it's actively maintained. That being said, it's essentially just a few tables, accessible through the download links on the left of the page. MVirDB unifies some smaller databases so it may contain completely unique sets of annotations.

Uniprot
OK, it's not explicitly a database of virulence factors. It's easy to sort it by GO terms, though, like the term for pathogenesis. That GO term has a keyword in Uniprot and you can search for it by dropping this in the search box: "keyword:"Virulence [KW-0843]"". The results can be filtered down to the reviewed Swiss-Prot set for a subset of reliable virulence factors.

Try it yourself!

The NCBI databases don't make it quite this easy, but linking through the Biosystems database seems to help. Here's an example.





Wednesday, January 07, 2015

I read this Forbes article about 23andMe on the bus this afternoon. It details a perfect instance of the customer as commodity. I'm excited about the possibilities regarding personal genomics, especially as whole-human-genome sequencing gets cheaper and easier to interpret. Control over the resulting data will need to remain in the hands of the consumers, though. Those who volunteer their genetic material for analysis need access to the results.

Opener and opener

Here's another decontextualized bit from Emotional Intelligence 2.0: "Be Open and Be Curious".* I think this may be one of the more useful strategies presented in the book, though that makes sense as it's strategy #1 on their list of Relationship Management techniques. The strategy could be reworded in this way: provide context for yourself and others.

It's easy to interact with people while making assumptions about them. We really don't have any other choice as it's just more efficient that way. If someone's behaving angrily, we tend to assume that something irritated them and leave it at that. If they're acting in a way we find strange or illogical, we tend to assume that's just how they are. These tendencies rub me the wrong way. There's a reason for everything, so why not pursue additional detail? A lack of context really may be one of the most pervasive communication issues of the information age.

The whole strategy of "be open and curious" is vague, of course. It's something we all have to do to some extent but also something we might avoid when it's most necessary. As it's worded, the strategy also doesn't really provide any meaningful way to get beyond small talk. That being said, I'd like to think it's applicable to more than just in-person interactions. It may be most effective in spaces where context is at a premium (i.e., social networks) and the culture of anonymity clashes with the demand for content. I'll try it out and see how it works.

*Alternate lesson: tell everyone you were in the Marines at every possible convenience unless that's actually true.

Tuesday, January 06, 2015

Ambience

The music for today is...well, it's this playlist. It's my carefully curated list of background noise, ideal for working. I've kept vocals to a minimum and it's all ambient but not necessarily peaceful. Give it a try if you can. The embedded playlist here only contains 200 tracks but the full version contains exactly 1000.

Monday, January 05, 2015

The wonders of combustion. Photo by Stewart Butterfield.
Happy new year to you, reader!

The keyword at this time of year is usually resolutions*, so I'll share one with you: I will read more papers from my field. It's usually not to difficult for me to skim a paper or two each day, at least, but they frequently aren't directly relevant to what I'm doing. There's just so much interesting material out there! Keeping a wide scope when reading research papers certainly aids with perspective but it provides less material to implement immediately.

Luckily, fresh research has never been easier to find. I'll find at least one directly-relevant paper each week and briefly discuss it on this blog. "Directly-relevant" could include computational microbiology, evolution, bacteriophage biology, or even just novel methods and software.

Today's material is a short report by Marc del Grande and Gabriel Moreno-Hagelsieb in BMC Research Notes. It's relevant because it deals with a concept I've explored lately: gene conservation across numerous bacterial species. A group led by Moreno-Hagelsieb already developed a tool for generating a set of non-redundant genomes**, but here they use it to generate a set of prokaryote*** genomes to analyze co-conservation of their transcription factors (TFs). Why transcription factors? Bacterial genomes contain many transcription factors - even more if we count predicted ones - but it isn't clear if they're broadly conserved in pathways or if they usually show up through rapid evolutionary processes like horizontal gene transfer.**** The set of genomes in this paper doesn't include genomes smaller than 2.5 Mb, presumably to avoid the bias of minimal genomes, but it would have been nice to see them.

As with similar studies, this one was limited by the available data. It's difficult to make predictions about TF interactor conservation when we're not sure what these TFs interact with in the first place. In the end, the authors had to predict the existence of TFs across 857 genomes, then examine the TFs and their interactors in their chosen models of E. coli and B. subtilis. Their conclusion: compared to other protein-coding genes, those coding for TFs have fewer conserved potential interactions across Bacteria. The emphasis is mine as these are predicted interactions based on conservation. That's fine unless the TF has some other reason for its conservation. It's an interesting comparison so I'm curious to see how TF interactions look across bacterial protein interactomes (spoiler alert: available data sources are still a limiting factor).  

Citation:
Del Grande, M. & Moreno-Hagelsieb, G. The loose evolutionary relationships between transcription factors and other gene products across prokaryotes. BMC Res. Notes 7, 928 (2014).


*I'm also resolving to use more images in my blog posts.

**Their report promised a web interface but that doesn't seem to have happened.

***The term "prokaryote" is archaic, isn't it? This paper is really only talking about Bacteria.

****These two models aren't mutually exclusive.