An effective text and data mining (TDM) project is like a really smart, well-connected researcher. Imagine what she brings to each new project:
- She regularly monitors professional journals, conference proceedings, books, videos, webinars, reports, patents and other material in her field.
- She participates in professional conferences, where she meets other researchers in related fields and learns about their current projects.
- She collaborates with colleagues to publish her findings in peer-reviewed journals.
- She monitors organizations and funding sources in her field and analyzes grant patterns.
Because she is familiar with information from a wide range of sources, she can see trends and relationships among concepts that would not be obvious to the casual observer. Perhaps she knows to watch for new developments from South Korea, based on a conference presentation she heard and a recent uptick she noticed in grants to universities there. She probably has an internal taxonomy of all the topics she follows, so she intuitively sees connections between related concepts.
Now imagine exponentially expanding that researcher’s perspective to include all the information available in her field. And imagine a similar superhuman researcher for every imaginable field of inquiry. That is what text and data mining initiatives offer to an organization and, as with any information or knowledge management project, information professionals can play a key role.
As researchers bring more data analytics skills to the table, and as more information—both free and in subscription services—is available, there is a greater need for information professionals who understand how to find, enhance, manage and preserve information, particularly in the arena of text and data mining.
The value of TDM depends on knowing what sources to include, what kinds of connections to monitor and what types of metadata are necessary for a particular project. Info pros bring the ability to ask the right questions, which enables them to see the larger context and identify the specific sets of information that would provide the richest insights. Info pros know which resources to use, weighing the limitations, restrictions and cost of each source. They understand how researchers use information—their approach to a problem, their information-seeking behavior, and what they do with the information next. Info pros know that their clients aren’t interested in which specialized search terms to use or how to harmonize data from multiple sources; info pros build portals and APIs to help their clients get from question to insight as quickly as possible.
Info pros know which data sources to look at: government agency data sets and open-data initiatives; collaborative repositories of data underlying scientific publications, such as Dryad and ICPSR; and commercial services.
One underappreciated skill of info pros is what used to be called the “reference interview” and is now more properly called an information-needs interview. Before an info pro can connect a researcher with the right TDM tools, data sets and approaches for finding meaningful insights, the info pro has to understand the researcher’s underlying needs, including aspects the researcher may not even think to ask about. Info pros are also accustomed to dealing with questions that don’t have easy answers. They know that solving a client’s problem often means pulling material from a variety of sources, collaborating with other groups and figuring out how else they could get to the answer.
Scott Attenborough, TDM industry observer and owner of Content Capital LLC, commented “Sure, info pros have the skills to create the right queries or build the hierarchies, but the real fun is learning the business of the person you are working with. Info pros’ clients often don’t even know what questions to ask, so our job is to understand each client’s use case and then create the right tool to help them understand something important to them—who's working on what molecule, or how this company is working on that disease.”
However, info pros’ familiarity with a wide range of information sources can sometimes get in their way. They are accustomed to searching bibliographic databases, combing through millions of articles and conducting more and more focused searches until they retrieve a manageable number of articles for their researcher. TDM projects, on the other hand, involve searching for patterns and unexpected insights while looking at how pieces of information fit together. Searchers do not necessarily know what they will find when they start their research, and the “answer” is as likely to be a series of graphics as a collection of articles.
A problem familiar to any online searcher is the difficulty of finding relevant material on a topic that is not consistently indexed. A pharmaceutical compound may be referenced differently based on the writer’s country, language or custom; on top of that, the subject indexing may not include all of its component parts. A disease may be known by different names—what is called amyotrophic lateral sclerosis or Lou Gehrig’s disease in the U.S. is known as Motor Neuron Disease in the U.K. The same word may have different meanings depending on the context—hearing aids and AIDS (Acquired Immunodeficiency Syndrome), for example. Articles about specific cancers—retinoblastoma or Kahler’s disease—may not mention the word cancer. When multiple datasets are being searched simultaneously, the problem with inconsistent terminology becomes even greater.
TDM projects can address this problem head-on by bringing in authoritative datasets that make disparate information findable by linking all versions of a concept to a single authoritative entry. Take DBpedia, for example. DBpedia is a crowd-sourced open data project that is creating a semantic knowledge graph based on trusted information. Structured data is extracted from Wikipedia and published as a dataset in a consistent, searchable format. Content providers as varied as Eurostat and the BBC can integrate backlinks from DBpedia to their content, increasing its discoverability and enabling researchers to identify new insights from their data.
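To make the idea of linking every version of a concept to a single authoritative entry more concrete, here is a minimal Python sketch. It is an illustration only: the concept identifiers, the `AUTHORITY` table and the `tag_concepts` function are hypothetical, and a real project would load its variants from a curated ontology or knowledge graph rather than a hand-typed dictionary.

```python
import re

# Hypothetical authority file: each canonical concept lists its known surface
# forms. The identifiers ("concept:als", "concept:cancer") are made up for
# this illustration and do not come from any real vocabulary.
AUTHORITY = {
    "concept:als": [
        "amyotrophic lateral sclerosis",
        "Lou Gehrig's disease",
        "motor neuron disease",
    ],
    # Specific cancers are mapped to the broader concept so that an article
    # can be found even if it never uses the word "cancer".
    "concept:cancer": ["cancer", "retinoblastoma", "Kahler's disease"],
}


def normalize(text):
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    text = text.replace("\u2019", "'").lower()
    return " ".join(text.split())


# Reverse index: normalized surface form -> canonical concept id.
LOOKUP = {
    normalize(form): concept_id
    for concept_id, forms in AUTHORITY.items()
    for form in forms
}


def tag_concepts(text):
    """Return the ids of every concept whose surface form appears in the text."""
    haystack = normalize(text)
    found = set()
    for form, concept_id in LOOKUP.items():
        # Word boundaries keep a form from matching inside a longer word.
        if re.search(r"\b" + re.escape(form) + r"\b", haystack):
            found.add(concept_id)
    return found


print(tag_concepts("A new trial for Motor Neuron Disease began in the U.K."))
print(tag_concepts("Retinoblastoma outcomes in children"))
```

Both a U.K. article on Motor Neuron Disease and a U.S. article on Lou Gehrig’s disease end up tagged with the same identifier, so a single query retrieves them together. Simple string matching of this kind would not tell “hearing aids” apart from AIDS; that distinction depends on context, which is where richer ontologies come in.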
As an information scientist at a large pharmaceutical company noted, “with TDM we are able to find more reliably and precisely numeric/quantitative information (e.g., dosages) and can even extract them as metadata. The same is true for the extraction of other parameters. Ontologies are of great use to discriminate the context for results/extracted values.”
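As a rough illustration of extracting quantitative values such as dosages as metadata, the following Python sketch pulls number-and-unit pairs out of free text. The pattern, the short list of units and the `extract_dosages` function are assumptions made for this example; they do not describe the tooling used by the scientist quoted above, and a production system would typically draw its units and context rules from an ontology.

```python
import re

# Assumed pattern: a number followed by one of a few common dose units.
# The unit list is deliberately short and hard-coded for illustration.
DOSAGE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|ml)\b",
    re.IGNORECASE,
)


def extract_dosages(text):
    """Return each dosage mention as a small metadata record."""
    return [
        {"value": float(match.group("value")), "unit": match.group("unit").lower()}
        for match in DOSAGE.finditer(text)
    ]


sentence = "Patients received 20 mg daily, increased to 37.5 mg after two weeks."
print(extract_dosages(sentence))
```

Once values are captured as structured records like these, they can be filtered, compared and aggregated across documents instead of being read one article at a time.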