The World Service archive
One dataset we are looking at within this project is the World Service archive. This archive is isolated from other programme data sources at the BBC, such as BBC Programmes or the Genome Project, and the programme data associated with it is very sparse. It would therefore benefit greatly from being automatically interlinked with further data sources, which makes it a particularly interesting use case. The archive is also very large: it covers many decades and amounts to about two and a half years of continuous, high-quality audio content.
Automated semantic tagging of speech audio
One way of dealing with such a large programme archive with patchy metadata but high-quality content is to use the content itself to find links with related data sources. For example, if a programme frequently mentions ‘London’, ‘Olympics’ and ‘1948’, then there is a high chance it is about the 1948 Summer Olympics. Using the structured data available in Wikipedia, we can then draw a link between a recent programme on the 2012 London Olympics and that archive programme, and use that link to provide further historical context.
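To make the idea concrete, here is a minimal sketch in Python of how term mentions in a transcript could be used to rank candidate topics. It is an illustration under stated assumptions, not a description of the system used on the archive: the `CANDIDATES` table, its hand-picked terms, and the `rank_topics` function are all hypothetical, and the scoring is a plain sum of term counts. In practice the candidate topics and their labels would be derived from structured data such as that available in Wikipedia, and the transcript would come from automated speech recognition.

```python
import re
from collections import Counter

# Hypothetical candidate topics, each described by a few hand-picked terms.
# These are assumptions for illustration only; real candidates would be
# derived from a structured data source rather than written by hand.
CANDIDATES = {
    "1948 Summer Olympics": {"london", "olympics", "1948"},
    "2012 Summer Olympics": {"london", "olympics", "2012"},
    "Great Fire of London": {"london", "fire", "1666"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_topics(transcript, candidates=CANDIDATES):
    """Rank candidate topics by how often their terms occur in the transcript.

    The score of a topic is the total number of times any of its terms
    is mentioned. Returns (topic, score) pairs, highest score first.
    """
    counts = Counter(tokenize(transcript))
    scores = {
        topic: sum(counts[term] for term in terms)
        for topic, terms in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_topics` on a transcript that repeatedly mentions London, the Olympics and 1948 places the 1948 Summer Olympics first, because all three of its terms contribute to the score, whereas the other candidates match only some of them. A raw count like this favours long programmes and common words, so a more careful version would probably weight terms by how distinctive they are.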