I spent an enjoyable couple of days last week at the beautiful Wellcome Trust building, taking part in a data/text-mining workshop run by The ContentMine.
The idea behind their project is to develop tools that help scientists and other interested sorts pull data from published articles, potentially on a large scale. If you want to learn more, all their presentations are online, as are the materials from the Wellcome workshop.
The second day was a hackday. For this, my team wanted to build on the ContentMine tools to create something that helps you explore the scientific literature and find connections between papers you might not otherwise have found. We thought it would work something like this:
- You search for something interesting like ‘BRCA1’ or ‘robotic exoskeletons’ and download a bunch of relevant (open access) articles using ContentMine apps
- Our tool counts up the frequency of words in each article to produce a list of keywords
- It uses these keywords to find links between all the papers and then
- … produces a cool visualisation of these links
- … tells you about less-common but possibly-still-interesting keywords, which you can use to go off and search for more papers.
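The keyword-linking idea in the list above can be sketched in a few lines of Python. (The real tools are ContentMine apps plus R scripts; the sample texts, the tiny stopword list, and the top-3 cutoff below are all invented for illustration.)

```python
from collections import Counter

STOPWORDS = frozenset({"the", "of", "and", "in", "a", "to"})  # toy list for the sketch

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens in a text."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return {word for word, _ in Counter(tokens).most_common(n)}

def shared_links(articles, n=3):
    """Link every pair of articles that share at least one top keyword."""
    keywords = {name: top_keywords(text, n) for name, text in articles.items()}
    names = sorted(keywords)
    links = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = keywords[a] & keywords[b]
            if common:
                links.append((a, b, sorted(common)))
    return links

articles = {
    "p1": "brca1 brca1 repair repair dna",
    "p2": "brca1 brca1 brca1 tumour tumour dna",
}
links = shared_links(articles)
```

With these two toy "papers", `links` contains a single edge joining them via the keywords they share.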
After six or so hours of hacking, we’d come up with something rudimentary but useful. Over the past week, I’ve been playing with the idea some more and have put together two tools that more or less do what we set out to hack. The tool works in two stages: summarising and visualising. But first you’ll need to find some articles.
Searching and mining
Before the summariser can summarise anything, you need to give it some articles. At the moment I do it like this:
- Use getpapers to search for and download articles via Europe PubMed Central. This will download XML or HTML versions of the articles onto your computer.
- Run norma to normalise the articles into what ContentMine calls ‘scholarly HTML’.
- Run AMI to extract data from your scholarly HTML. For the time being, I’ve only played with word frequencies and species lists but there’s lots more you can pull out from papers.
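The three steps above look something like this on the command line. This is a sketch: the flags are from my reading of the ContentMine docs and may have changed, so check each tool's `--help` before running; ‘brca1’ is just an example query and project name.

```shell
# 1. Search Europe PubMed Central and download XML full texts into ./brca1
getpapers --query 'brca1' --outdir brca1 -x

# 2. Normalise the downloaded XML into scholarly HTML
norma --project brca1 -i fulltext.xml -o scholarly.html --transform nlm2html

# 3. Extract word frequencies and binomial species names with the AMI plugins
ami2-word    --project brca1 -i scholarly.html --w.words wordFrequencies
ami2-species --project brca1 -i scholarly.html --sp.species binomial
```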
You’ll now have a directory on your computer containing a sub-folder for each article. After running the species extractor and the word frequency analyser, it’ll look something like this:
Summarising

The summariser is available from GitHub as a set of R scripts. Point it at the base of your content-mined directory containing the papers and let it run. It’ll output the following JSON files:
- ‘words.json’ — the top X most frequent words and the articles in which they appear (you pick X when you run the script)
- ‘words_tfidf.json’ — same as above but calculated using term frequency-inverse document frequency (TF-IDF)
- ‘species.json’ — occurrences of binomial species names and the articles in which they appear
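For anyone unfamiliar with TF-IDF: it down-weights words that appear in many documents, so that words distinctive to a particular paper score highest. A minimal Python sketch of one common formulation (the summariser is written in R and may use a different variant):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {name: list of tokens}. Returns {name: {word: tf-idf score}}."""
    n = len(docs)
    # Document frequency: how many documents each word appears in
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens)
        # tf-idf = (term frequency) * log(N / document frequency)
        scores[name] = {w: (c / total) * math.log(n / df[w]) for w, c in tf.items()}
    return scores

docs = {"a": ["brca1", "brca1", "dna"], "b": ["dna", "tumour"]}
scores = tf_idf(docs)
```

Here ‘dna’ appears in every document, so its score collapses to zero, while ‘brca1’ (unique to one paper) keeps a positive score.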
Visualising

The visualiser’s code is heavily inspired by Jim Vallandingham’s Interactive Network Visualizations tutorial, and I strongly recommend you have a look at it if you want to understand what’s going on behind the scenes.
Put the three JSON files produced by the summariser into the visualiser’s data/ folder then launch the site. You’ll see something like this:
This is a network of your downloaded articles alongside their summarised keywords. The blue circles are keywords (or species) and the orange ones are articles. Hover over each to see more details. You can also zoom with the scrollwheel or pan with click+drag if you want to focus on smaller parts of the network. If you click on an orange circle, your web browser will take you to the article.
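Under the hood, a network like this is just two lists: nodes (articles and keywords) and links between them. Here's a Python sketch of turning keyword-to-article data into the nodes/links shape that force-directed layouts such as D3's expect. The input shape is an assumption on my part, not the actual ‘words.json’ schema.

```python
def build_network(keyword_articles):
    """keyword_articles: {keyword: [article ids]} (an assumed shape).
    Returns {"nodes": [...], "links": [...]} for a force-directed graph."""
    nodes, index = [], {}

    def add_node(name, group):
        # Add a node once and remember its position in the nodes list
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name, "group": group})
        return index[name]

    links = []
    for keyword, articles in keyword_articles.items():
        k = add_node(keyword, "keyword")   # blue circles
        for article in articles:
            a = add_node(article, "article")  # orange circles
            links.append({"source": k, "target": a})
    return {"nodes": nodes, "links": links}

net = build_network({"brca1": ["PMC0000001", "PMC0000002"]})
```

One keyword shared by two articles gives three nodes and two links, which is exactly the hub-and-spoke pattern you see in the visualiser.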
The two tools still need work but I hope this will prove useful to someone, somewhere. If you have comments/issues/suggestions, please get in touch by one of the ways on my homepage. (Best not to leave comments here because I rarely check them.)