House of Commons Hansard ‘Big Data’

Professor Lesley Jeffries uses ‘corpus linguistics’ to bring Government information to the public for ‘Big Data’ debrief project

A University of Huddersfield professor will play a key role in the development of pioneering computer software that will enable historians and linguistics experts to research and interpret huge bodies of text, such as the 2.3 billion words spoken in parliamentary debates and recorded by Hansard since the early 1800s.

The project is a component of an ambitious £4.6 million scheme named Digital Transformations in the Arts and Humanities, announced by the Minister for Universities and Science, David Willetts.  Funded by the Arts and Humanities Research Council and backed by the Economic and Social Research Council, it is part of a drive to strengthen the UK’s competitive advantage in the field known as ‘Big Data’.

The digital age has meant that vast quantities of data can now be electronically stored and are easily searchable.  But the English language presents a major problem, because 60 per cent of its word forms have more than one meaning – some have almost 200.  As a result, searches of large bodies of text are severely handicapped and can be rendered meaningless.

A more sophisticated method is needed and it is being developed by experts who have devised a computer system that automatically annotates words in text with their precise meanings according to the context, removing any ambiguity.  This will result in much more precise searches.
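The idea of tagging a word with a meaning chosen from its context can be illustrated with a toy sketch.  This is not the SAMUELS method, which the article does not describe in detail; it is a simplified overlap heuristic in the spirit of the Lesk algorithm.  The sense inventory, cue words and function name below are all invented for illustration.

```python
import re

# Hypothetical mini sense inventory (an assumption, not Historical Thesaurus data):
# each sense of an ambiguous word is listed with a few cue words.
SENSES = {
    "union": {
        "trade union": {"strike", "workers", "wages", "members"},
        "political union": {"treaty", "states", "kingdom", "parliament"},
    }
}

def annotate(sentence, inventory=SENSES):
    """Tag each token with the sense whose cue words overlap most with the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    context = set(tokens)
    tagged = []
    for tok in tokens:
        senses = inventory.get(tok)
        if senses is None:
            # Word is not in the inventory, so it is left unannotated.
            tagged.append((tok, None))
            continue
        # Pick the sense sharing the most cue words with the context;
        # on a tie, the first sense listed wins.
        best = max(senses, key=lambda s: len(senses[s] & context))
        tagged.append((tok, best))
    return tagged

first = dict(annotate("The union called a strike over wages"))
second = dict(annotate("The union of the two states was sealed by treaty"))
```

Here `first["union"]` is tagged as the trade union sense and `second["union"]` as the political sense, so a search could then be restricted to one meaning only.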

The project is named SAMUELS, which stands for Semantic Annotation and Mark Up for Enhancing Lexical Searches.  It is headed by Dr Marc Alexander of the University of Glasgow, who is drawing on the vast database of the Historical Thesaurus of English, which contains 797,000 word forms arranged into 236,000 semantic categories.

The collocations of Hansard

SAMUELS has earned major funding from the Arts and Humanities Research Council’s Digital Transformations in the Arts and Humanities scheme, and the University of Huddersfield’s Professor Lesley Jeffries will receive £68,000 to conduct one of four projects designed to test the new system.  Hansard, with its 2.3 billion words, will be her area of investigation.  Joined by a specially appointed research assistant, Professor Jeffries will use SAMUELS to track the language used in the House of Commons to describe trades unions.

She will pay particular attention to ‘collocations’, where words are habitually and significantly used alongside each other.  The phrase “union barons” – usually a hostile term – is an example, and Professor Jeffries has named her project ‘Is there a Baron in the Commons?’
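A minimal sketch of how collocates of a word might be counted is given here.  It simply tallies the words that appear immediately next to a chosen word; the sample sentences and the function name are invented for illustration and are not Hansard text or part of SAMUELS.  Raw counts capture only the ‘habitual’ side of a collocation; corpus studies usually add a statistical association measure to judge significance.

```python
from collections import Counter
import re

def collocates(text, node, window=1):
    """Count the words found within `window` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts

# Invented sample sentences (an assumption for the sketch).
sample = (
    "The union barons have called another strike. "
    "Members opposite defend the union barons. "
    "The union leaders met ministers yesterday."
)
counts = collocates(sample, "union")
```

In this sample, “barons” follows “union” twice and “leaders” once, while the very frequent “the” also scores highly, which is why a significance measure is normally needed to filter out common function words.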

It will be a follow-up to a previous linguistics project in which Professor Jeffries researched the language used in the Press to describe the unions.

“I imagine that language used in Parliament will be less populist than the Press,” says Professor Jeffries.  “That is where the collocations could give us a clue, particularly as parliamentarians will often be more sophisticated in their usage, but we can still see the idea behind their views if you look at the other language that surrounds it.”

She will be appraising her findings, using SAMUELS, according to which party was in power and the extent of industrial unrest at the time that words were spoken in the Commons.

During her research into the language used by the Press to describe unions, she found that unions were continually represented in terms of industrial strife, with more positive representations few and far between.  Now she will assess whether the same is true of Commons debate.

Corpus linguistics

Professor Jeffries says that SAMUELS has the potential to be an “amazing” resource for researchers.  The field known as corpus linguistics would increasingly enable historians, political scientists and sociologists to learn more about the attitudes to be found within large bodies of data.

She also hopes that the findings will feed into wider political debate, enabling voters to critique the language used by the media and politicians.  

Digital Transformations in the Arts and Humanities consists of 21 wide-ranging research projects.  When he announced the £4.6 million scheme, Minister for Universities and Science David Willetts said: “Getting quality data out of the hands of a few and into the public domain is an important goal for this Government.  This funding will help to overcome the challenge of making vast amounts of rich data more accessible and easier to interpret by a lay audience.  These 21 projects promise to come up with innovative long-lasting solutions.”

 

 
