Scientific Feed Aggregation

As I mentioned, I was at Northern Voice last weekend and I was able to talk about some of the work I’m spearheading at the BCGSC with feed aggregation. Let me say up front that I don’t presume that what we’re doing is directly related to education, or that it’s useful in that space. My main objective in talking about my work at NV was to get people thinking about these things, and it seems that I’ve succeeded! :)

Scott Leslie pointed me here: Thoughts on the mythical school aggregator

Wow, great conversation! I can’t really talk about the specifics of what and how we’re aggregating at the GSC, but I can offer up some insight and observations that may be useful for folks trying to build their own custom aggregator that does more than basic feed aggregation. I’m just going to type these in as they come to me, in random order, and they could be technical points or not.

  • Our goal is to tease correlations out of what would appear on the surface to be a large set of very loosely related data. Some of this data is produced by machines, some by humans. Sometimes the results of “intelligent” aggregation might be worth more than the sum of the parts. It’s glorious when that happens, but it’s not easy, even within a specific problem domain. And we still have a long way to go. Solving it generically means solving the semantic web problem, and I’m not ready to bite off that one just yet. :)
  • We tried to centralize the creation of this sort of stuff with a single CMS / blogging / publishing / do-everything infrastructure from the heavens. It didn’t work. Not even for simple blogging. We had to let the research groups go their own way with their own projects. Trying to predict their needs ahead of the game was pointless. In the end it was best to stay out of the way as much as possible.
  • Loosely coupled, loosely coupled, loosely coupled. Say it over and over as you fall off to sleep each night.
  • Getting participation is easy when the payoff is immediately obvious. If there’s no obvious payoff, why am I asking someone to do it?
  • On a related note, always feed the egos.
  • It’s fine to target machine consumption in your pipeline code, but human writers should be targeting a human audience. Don’t let your parsing desires get in their way.
  • The syndication approach really is remarkably flexible and useful for all sorts of things outside of traditional blogging. We were already seeing the usefulness of syndication for internal blogging, then wikis, then in our primary CMS. As we started treating “raw data” more like “web content” the idea of treating pipeline output as a feed came naturally.
  • Creating feeds from nearly anything is easier when your toolkits and design are smart enough to allow you to generate multiple “views” of your core data structures. The primary goal of these programs isn’t to create RSS. It’s to get results. To do science. Don’t forget that.
  • Demand a lot from your tools and programming environment, and don’t settle for tools that aren’t well matched to the task. Don’t be afraid to try a lot of different stuff out to get a feel for what will work and what won’t. Expect programming languages to evolve rapidly in this space.
  • Premature optimization is the root of all evil.
  • Related to above: avoid any effort to “standardize” on a language to enable system-wide APIs. In fact, as soon as someone starts talking about standardizing on anything you should be nervous. :)
  • Atom in particular is fairly easy for developers to get their heads around and start doing useful things with very quickly. And it’s flexible enough that we often just don’t need to produce custom XML DTDs.
  • We haven’t given up on RDF. Some of our most useful feeds come in RDF format. We have a few systems like DiscoverySpace that actually do their own data collection and mining and spit out nicely filtered RDF on the server backend. The Java app consumes that RDF to present a rich GUI, but it can also be picked up by a feed reader.
  • “Let’s make it generic”. The words of the devil. See standardization above.
  • Microformats can be very useful in the right places. Doubly so when your systems don’t have to interact with the Net at large. :)
  • WikiFormatting approaches for inferring “body” content seem to be wildly successful. Take advantage of this wherever possible. Most fields have lots of domain-specific vocabulary that’s pretty easy to parse and you don’t need a formal microformat.
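
To make the “multiple views” point above a bit more concrete, here is a minimal sketch in Python, using only the standard library. It is not what we run at the GSC. The `PipelineResult` class, its fields, and the `urn:example:` identifiers are all made up for illustration. The idea is simply that the core data structure exists to hold results, and the Atom entry is one more view of it, sitting next to a plain-text view.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def _tag(name: str) -> str:
    """Return an Atom-namespaced tag name in ElementTree's {uri}name form."""
    return f"{{{ATOM_NS}}}{name}"


@dataclass
class PipelineResult:
    """Hypothetical core record; the fields are assumptions for this sketch."""
    run_id: str
    summary: str
    finished: datetime  # should be timezone-aware

    def as_text(self) -> str:
        # View 1: a tab-separated line for logs or other scripts.
        return f"{self.run_id}\t{self.finished.isoformat()}\t{self.summary}"

    def as_atom_entry(self) -> ET.Element:
        # View 2: the same record as an Atom entry.
        entry = ET.Element(_tag("entry"))
        ET.SubElement(entry, _tag("id")).text = f"urn:example:run:{self.run_id}"
        ET.SubElement(entry, _tag("title")).text = f"Run {self.run_id}"
        ET.SubElement(entry, _tag("updated")).text = self.finished.isoformat()
        ET.SubElement(entry, _tag("summary")).text = self.summary
        return entry


def build_feed(title: str, results: list) -> str:
    """Wrap the Atom views of all results in a single feed document."""
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    feed = ET.Element(_tag("feed"))
    ET.SubElement(feed, _tag("id")).text = "urn:example:feed:pipeline"
    ET.SubElement(feed, _tag("title")).text = title
    author = ET.SubElement(feed, _tag("author"))
    ET.SubElement(author, _tag("name")).text = "pipeline"
    # The feed's timestamp is the newest result, or "now" if there are none.
    newest = max((r.finished for r in results),
                 default=datetime.now(timezone.utc))
    ET.SubElement(feed, _tag("updated")).text = newest.isoformat()
    for result in results:
        feed.append(result.as_atom_entry())
    return ET.tostring(feed, encoding="unicode")
```

The program that produces `PipelineResult` objects never has to know about Atom. Adding another view later (RDF, say) means adding a method, not rewriting the pipeline.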

Hmm. What else. Some geeky bits I think I can let out of the bag:

  • You might have to aggregate more than RSS/Atom. It might be a SOAP interface. Or a directory with KEGG files in it. Or something else entirely. Expect it. Embrace it. Plan your tooling and architecture accordingly.
  • Don’t expect the traditional LAMP stack to do this sort of stuff. At a smaller scale it may be fine, but as soon as you want to start building rich semantic analysis of masses of data (domain specific or not) you may find that relational databases and web-focused languages aren’t at all suitable. Don’t be afraid to explore data formats and structures “outside the box”.
  • Plone’s Archetypes are pretty fantastic, even if the stack as a whole is a little “heavy”. Same goes for good chunks of the Java web framework space. You might not want to use it, but there are some beautiful lessons to be learned in that space.
  • Object Relational Mappers can be tremendously useful if you do sit on top of a relational database, but I think this quote from SQLAlchemy bears repeating: “SQL databases behave less and less like object collections the more size and performance start to matter; object collections behave less and less like tables and rows the more abstraction starts to matter.”
  • A large point of this web aggregation stuff is to decouple you from the programming implementation of a particular data generator. Enjoy that freedom! It’s awesome!
  • NetNewsWire totally rocks. I haven’t found a desktop aggregator on any platform that comes close. It may be the single greatest reason to use a Mac.
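
For the “aggregate more than RSS/Atom” and “decouple” points above, here is one rough way to think about it, again as a standard-library Python sketch and not a description of our actual system. Every name here (`Entry`, `atom_source`, `directory_source`, `aggregate`) is hypothetical. Each source gets its own small adapter that yields the same simple record, and the aggregator only ever sees those records. The directory adapter uses file names and modification times as a stand-in; a real one would parse whatever format lives in that directory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Iterator
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


@dataclass(frozen=True)
class Entry:
    """The one shape every adapter produces (fields are assumptions)."""
    source: str
    title: str
    updated: datetime  # timezone-aware, so entries can be compared


def atom_source(name: str, xml_text: str) -> Iterator[Entry]:
    """Adapter for an Atom document that has already been fetched."""
    root = ET.fromstring(xml_text)
    for node in root.iter(f"{ATOM}entry"):
        stamp = node.findtext(f"{ATOM}updated", default="")
        # Older Pythons' fromisoformat() rejects a trailing "Z", so rewrite it.
        updated = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        yield Entry(name, node.findtext(f"{ATOM}title", default=""), updated)


def directory_source(name: str, directory: Path) -> Iterator[Entry]:
    """Adapter for a directory of data files: one entry per file."""
    for path in sorted(directory.iterdir()):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            yield Entry(name, path.name, mtime)


def aggregate(sources: Iterable[Iterable[Entry]]) -> list:
    """Merge entries from every source, newest first."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda e: e.updated, reverse=True)
```

A SOAP adapter would be one more function that yields `Entry` objects. Nothing in `aggregate` cares how, or in what language, the data on the other end was generated.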

Ok. Wow, that was a lot of rambling targeted at no particular audience. I’ll try to post more on some of this stuff when time permits. In the meantime I’ve got a bunch more feeds to watch and learn from in the EduBlogging space and I’ll pop into those conversations when I can.

