
The Evolving Data Lifecycle

on Sat, 10/19/2013 - 01:16

This essay is based on my presentation at the eResearch Conference, Brisbane, Australia, 10/21/2013.

The spotlight is on Data

Data within the research process has taken center stage. The amount of data ranges from the enormous quantities produced by large planned science missions to the smaller amounts produced by individual researchers, the so-called long tail of science. While the current focus is on data, it is important to look at data in the context of the research process itself -- the data life cycle.

Looking at the Data Life Cycle

A scientific research process can be represented as a data life cycle consisting of a series of stages through which data passes during its lifetime. These stages include data processing, archiving, discovery, and finally use. Use itself encompasses several sub-stages: access, integration, visualization, analysis, and sharing. These stages may vary slightly across different science domains and applications but in general remain consistent. The goal of informatics researchers is to make this process efficient for researchers, address existing gaps and hurdles, seamlessly integrate new and evolving technology, and enable new types of research capabilities.
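The stages above can be sketched as an ordered pipeline. A minimal sketch, where the stage handlers are illustrative placeholders rather than real processing components:

```python
# Data lifecycle stages as listed in the text; "use" has its own sub-stages.
LIFECYCLE_STAGES = ["processing", "archiving", "discovery", "use"]
USE_SUBSTAGES = ["access", "integration", "visualization", "analysis", "sharing"]

def run_lifecycle(data, handlers):
    """Pass a data record through each stage handler in order."""
    for stage in LIFECYCLE_STAGES:
        data = handlers.get(stage, lambda d: d)(data)
    return data

# Toy handlers: each one just tags the record with the stage it passed through.
handlers = {s: (lambda d, stage=s: d + [stage]) for s in LIFECYCLE_STAGES}
print(run_lifecycle([], handlers))  # ['processing', 'archiving', 'discovery', 'use']
```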

Factors Impacting the Data Life Cycle

The data life cycle is dynamic, constantly evolving, driven by several factors. These factors drive changes at both the micro and the macro level. At the micro level, individual steps within the cycle change, whereas at the macro level, the set of steps that constitute the cycle may itself be modified. While these factors may overlap, they can be categorized from four different perspectives:

1. Data Perspective

Explosion in Data Volume – The data production capability of the research community has seen explosive growth, which has resulted in increased needs for processing and storage. In addition, data systems require automation to handle large volumes and need to scale (in both computation and storage) while minimizing data movement across networks.

Increase in Data Complexity – The instruments, models, and science algorithms producing data are becoming more complex, which in turn yields more complex data. Data systems now need to support well-structured, rich information models along with the data itself to ensure easy discovery, correct data usage, and long-term preservation.

Need for real-time access and processing – As more data from different observations and models are used in decision support systems, real-time access and processing capability has become one of the key focus areas of application science.

2. Data User Perspective

Need for verticalization of tools/user experience – Verticalization refers to the customization of a tool for a specific science use or domain application. Different domains and sub-domains within science have specific needs, and consequently most tools require customization of both user interface and functionality. The process of "verticalization" forces integration of domain information and needs, and helps provide an intuitive user experience for data discovery, access, and analysis.

Changing End User Expectations – Google, social networking sites, and mobile apps have all changed users' expectations regarding the ease of use of science tools. For example, one of the most common complaints against existing data search tools is that they should be more like Google. Users expect disparate data sets to integrate seamlessly to support their research objectives. The processes of discovery, access, and exploration, and the movement between these phases, are expected to be easy and intuitive. In addition, tools for data analysis need to be simpler and more intuitive to allow ready adoption by the user community.

Growing interdisciplinary research needs – As more and more research crosses disciplinary boundaries, the need grows for CI (cyberinfrastructure) to support cross-discipline data discovery and integration. Most interdisciplinary users are not proficient in the "vocabulary" of a specific sub-discipline and require data systems to provide "cross-walks" between vocabularies to enable discovery of data across domains. In addition, these users require rich information models that capture syntactic, content, semantic, quality, and provenance elements to support the use of these datasets in their existing tools.
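A vocabulary "cross-walk" is, at its core, a mapping between the terms of two domain vocabularies. A minimal sketch, with entirely hypothetical domain names and term mappings (not drawn from any real controlled vocabulary):

```python
# Hypothetical cross-walk: (source_domain, target_domain) -> term mapping.
CROSSWALK = {
    ("hydrology", "atmospheric_science"): {
        "precipitation": "rainfall rate",
        "evapotranspiration": "latent heat flux",
    },
}

def translate(term, source_domain, target_domain):
    """Return the target-domain equivalent of a term, or the term unchanged
    when no mapping is known (so a search can still fall back to exact match)."""
    mapping = CROSSWALK.get((source_domain, target_domain), {})
    return mapping.get(term, term)

print(translate("precipitation", "hydrology", "atmospheric_science"))  # rainfall rate
```

A real system would back such a table with a community-maintained ontology rather than a hand-written dictionary.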

3. Technology Perspective

Looking at data systems/tools as an ecosystem – There is a prevailing view that instead of forcing a top-down integration of existing disparate data systems into systems of systems or federated systems, we should allow systems and services to integrate organically, bottom-up, based on the community needs of different disciplines and domains, using community-accepted standards or best practices. This ecosystem approach can support need-based solutions for a variety of data, information products, and services. The challenge is to allow easy aggregation and integration of these disparate systems and tools for different themes, foci, and applications. Software and middleware components need to conform to community-accepted standards, conventions, and best practices that promote seamless interoperability of data, applications, and technology. Community-driven standards, protocols, and best practices serve as the glue for an ecosystem-based approach. Organizations such as ESIP and RDA are crucial in providing social platforms to promote and enable such community efforts.

Scientific Collaboration – Sharing of knowledge is at the heart of science, and yet it is challenging for researchers to share information and tools with each other efficiently. Sharing of information is primarily limited to publishing a paper or presenting at a conference. Sharing science resources such as code and data can accelerate knowledge transfer, reduce redundant effort among researchers, and improve productivity. One technological challenge is to develop seamless collaboration solutions that foster knowledge exchange. Solutions should support sharing of all scientific resources among individual researchers, both within a team and with the larger scientific community. These solutions should be compatible with how researchers currently do scientific analysis rather than expecting them to learn and use a new tool or site.

Adopting the next new technology hype – Every few years, a new technology enters the so-called hype cycle. It generates enormous interest within both the science and CI communities and promises to radically improve processes within the data life cycle. The current focus is on "Big Data" technologies supporting scalable server-side analysis and querying of vast amounts of data, potentially removing the need for researchers to download data at all. These big data technologies leverage the wide adoption of shared-nothing architectures and distributed file systems, and have industry-supported open source middleware for reliable processing. The challenge for us is to evaluate new technologies by asking the right questions. What does this technology enable that is new and different? Can we quantify the returns with respect to the adoption costs? Nothing comes for free: adopting a new technology often entails hidden costs that may outweigh the benefits. Some technologies even require formulating new business models, especially if there are substantial operational costs associated with providing the new functionality.

4. Policy Perspective

Reproducibility and long-term contextual understanding – New requirements mandated by agencies can drive changes in data life cycles. Current examples include ensuring no or minimal loss of information as data bits move across systems and over time, readability of datasets over time, long-term understandability, and repeatability of previously obtained results. Provenance research is integral to enabling these capabilities. While it is easy to design brand-new data production systems in which both provenance capture and production are integral components, the challenge is how to incorporate provenance components into legacy systems without impacting them. In addition, while provenance can be a central thread for both reproducibility and contextual understanding, ensuring full reproducibility might entail logging every bit of information generated during the data production process. In some instances, this may make the metadata much larger than the dataset itself.

Archiving long-tail research – More and more agencies now require individual researchers to have a data management plan. This plan must address issues such as a strategy for the long-term archival, preservation, and sharing of data collected and created during the lifetime of a project. However, new infrastructures with adequate data management functionality are required to support these individual data producers. Who runs these infrastructures, and who pays for this functionality? Can you trust that the organization providing the infrastructure will not use your data inappropriately? Will it have financial support for long-term preservation and archiving? These questions remain unanswered, and the research community is still seeking solutions. Getting data producers to provide rich metadata is another challenge. Experience has shown that any tool requiring individual researchers to fill out many forms before depositing their data has met with limited adoption. Automated or semi-automated methods of metadata creation are needed that can autofill most of the metadata from existing publications, reports, etc., and only require the data producer to verify the autogenerated content.
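Semi-automated metadata creation of the kind described might prefill fields from existing text and leave only verification to the producer. A hedged sketch, where the field names and the keyword-matching rule are illustrative assumptions rather than any real deposit tool's schema:

```python
import re

def autofill_metadata(abstract, known_keywords):
    """Prefill a metadata record from free text (e.g. a publication abstract).
    The producer is expected to review the record before it is accepted."""
    found = [kw for kw in known_keywords
             if re.search(r"\b" + re.escape(kw) + r"\b", abstract, re.IGNORECASE)]
    return {"keywords": found, "abstract": abstract.strip(), "verified": False}

record = autofill_metadata(
    "We analyze AMSR brightness temperature retrievals over the ocean.",
    ["AMSR", "brightness temperature", "sea ice"],
)
print(record["keywords"])  # ['AMSR', 'brightness temperature']
```

In practice the candidate keyword list would come from a domain vocabulary, and richer fields (authors, instruments, spatial/temporal coverage) would be extracted from reports and publications in the same spirit.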

Growing cybersecurity concerns – Cybersecurity has become a major concern for all organizations, especially large ones such as federal agencies. Security policies are now dynamic and constantly changing. The challenge for future data systems will be to find the fine balance between security and openness.

Verticalization Use Case within Data Search

Problem – Some science research requires analyzing data from multiple sensors and instruments around an event. For example, case study analysis is a common research methodology in atmospheric science in which the analysis focuses on studying an event in detail using data from multiple sources. The objective of such research is to document in detail the processes behind an event using multiple data sources and models. For researchers using this methodology, data gathering becomes the bottleneck: collecting all the useful data and information required is both time consuming and labor intensive. The data needed to support these studies are typically stored in distributed data systems, each with its own user interface, vocabulary, functionality, and features. Unless the researcher has some a priori knowledge about where to search, they will have difficulty finding all the relevant data. In addition, supplementary information can enrich the analyses in these studies. Supplementary resources (such as official and news reports, and photos and videos found online) allow the researcher to construct a better contextual picture of the event under study.

Approach – Design and develop an aggregation tool that autogenerates "Data Albums". A Data Album is a compiled collection of data, information, and tools around an event or theme to support scientific research. While the distributed resources searched are curated based on domain expertise, filtering is still required to find the right data and information. This filtering of the aggregation results is performed using an application ontology. The ontology is used both for query expansion, to limit false hits on resources that provide programmatic APIs, and by a relevancy-ranking text mining algorithm that filters data sets using all the available metadata. Aggregated results are presented as visual infographics, giving researchers interfaces to search for and find specific events, get a bird's-eye view of the data and information available for an event, and access additional information needed for research.
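The query-expansion step can be illustrated with a toy application ontology. The terms below are hypothetical stand-ins for the real ontology content:

```python
# Toy application ontology: event term -> related terms used to expand queries
# sent to each resource's programmatic API.
ONTOLOGY = {
    "hurricane": ["tropical cyclone", "typhoon"],
    "tropical cyclone": ["hurricane", "typhoon"],
}

def expand_query(term):
    """Return the original term plus its ontology-linked related terms."""
    return [term] + ONTOLOGY.get(term.lower(), [])

print(expand_query("hurricane"))  # ['hurricane', 'tropical cyclone', 'typhoon']
```

The relevancy-ranking side would then score each returned record by how strongly its metadata matches the expanded term set, discarding low-scoring hits.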

Value Added – This tool uses existing data systems and does not require them to provide new information such as RDF triples. In the age of information overload, the tool provides value-added auto-curation, reducing the time it takes researchers to find data and information while supporting interdisciplinary research.

Policy Requirement Use Case: Retrofitting Provenance

Problem – Errors in data can arise from many different sources, even for data sets produced by highly engineered data production processes. Sources of error include use of incorrect input files, modifications to the algorithm, changes in the instrument requiring recalibration, etc. Furthermore, many of these data sets have large inter-dependencies with other data sets; these dependencies form a network that can be represented as a graph. One such example is the set of data products from the AMSR instrument. A single AMSR Level 2 brightness temperature product is used to derive 14 products, with dependencies on one another, using different science algorithms. For a researcher using any of these data products, it is important to understand how the product was generated. Agencies such as NASA now have data preservation requirements for capturing metadata to allow repeatability and support long-term understanding. The current data processing systems used to generate these products do not log the processing information in any systematic manner. In addition, metadata about the data itself, which exists at two levels (collection vs. file), and metadata about the science algorithms used for processing are captured in separate documents that are not linked.

Approach – We built an instrumentation library for the existing data processing system such that it had minimal impact on existing processing capability. The instrumentation library is written for Perl-based processing systems and logs "consumed", "invoked", and "produced" events within the processing stream. The captured provenance serves as a core component around which we compile all the other relevant distributed metadata. The compiled information now describes not only how the data product was created and what is in the data, but also other key elements such as quality and science algorithm details. In addition, we designed a provenance browser to disseminate this information. The browser links quick-look (browse) imagery to this rich contextual information along with the data lineage and processing history.
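The instrumentation library itself is Perl-based; the following is a minimal Python analogue of the same idea, logging "consumed", "invoked", and "produced" events from within a processing stream. Class, algorithm, and file names here are illustrative, not the actual library's API:

```python
import json
import time

class ProvenanceLogger:
    """Record provenance events emitted by an instrumented processing stream."""

    EVENT_TYPES = ("consumed", "invoked", "produced")

    def __init__(self):
        self.events = []

    def log(self, event_type, name):
        if event_type not in self.EVENT_TYPES:
            raise ValueError("unknown provenance event type: %s" % event_type)
        self.events.append({"type": event_type, "name": name, "time": time.time()})

    def to_json(self):
        """Serialize the event stream for later compilation with other metadata."""
        return json.dumps(self.events)

prov = ProvenanceLogger()
prov.log("consumed", "amsr_l1_brightness.h5")  # input file read
prov.log("invoked", "calibrate_v2.1")          # science algorithm run
prov.log("produced", "amsr_l2_tb.h5")          # output file written
print([e["type"] for e in prov.events])  # ['consumed', 'invoked', 'produced']
```

The logged stream becomes the spine onto which collection-level, file-level, and algorithm metadata can later be linked.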

Implications – The approach presented is required to retrofit existing or legacy data production systems with minimal impact. There are limitations on what can be done, especially if the system is heterogeneous and there is no direct control over all the software components to allow instrumentation. Hopefully, future data production systems will include both provenance collection and aggregation of contextual information from the beginning of system design. To date, the provenance browser has been very well received by different science teams.

Adopting the new technology hype: Big Data Analytics

Problem – It is hard to find precise definitions for either Big Data or analytics, and the definitions commonly used tend to be based on particular perspectives. One underlying assumption made with "Big Data" is that the way we currently handle data processing and analysis is sooner or later going to break down.

One can define Big Data from the perspective of the analysis process itself, starting with the basic definitions presented in Singpurwalla's paper [1], which take a statistician's perspective on data analysis. Data is defined as something that is directly observable and therefore measurable. Knowledge is a statement about a hypothesis, and the hypothesis is tested using data. Information captures the measure of uncertainty about a hypothesis; the role of data is to change this amount of information (increase or decrease the entropy). Singpurwalla asserts that "data by itself cannot formulate a hypothesis, rather it changes the odds in favor of or against a hypothesis". One can argue that this assertion is restrictive, as data exploration can lead to the formulation of new hypotheses (granted, a new hypothesis may then require testing via the traditional path).
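Singpurwalla's framing, that data shifts the odds for or against a hypothesis rather than creating one, is essentially Bayes' rule in odds form. A minimal worked sketch with illustrative numbers:

```python
def update_odds(prior_odds, likelihood_ratio):
    """Posterior odds = prior odds x likelihood ratio (Bayes' rule in odds form).
    The likelihood ratio is P(data | H) / P(data | not H)."""
    return prior_odds * likelihood_ratio

# A hypothesis starts at even odds (1:1). An observation that is 3x more
# likely under the hypothesis than under its alternative triples the odds
# in its favor; the data did not create the hypothesis, only shifted it.
posterior = update_odds(1.0, 3.0)
print(posterior)  # 3.0
```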

Big Data and the associated analytic technologies have created new pathways and changed existing pathways within the scientific process. The challenge is how to build infrastructure and tools to support these new pathways. There are many technology-based analytic solutions, such as MapReduce-based technologies, array-based databases, MPI-based systems, etc., all of which address big data. Each approach has advantages and weaknesses. What are the appropriate technology solutions for a given science domain or problem? Equally important, what is the best business model for providing these as long-term capabilities to the science community? Most current infrastructures keep data archives and computing resources in separate locations. Should there be a movement to merge the two by creating new centers?

Should science look at the use of data marts in the commercial world and find analogous metaphors? Computing resources could serve as these data marts, where high-value data are "on-boarded" or pre-staged to support analytics. Should analytic tools be built only for power users or for the general scientific community? And what kinds of queries should they support?

Approach – There are no easy answers to these questions, but it is important to address the challenges based on each domain's individual needs. We have scoped the problem to focus on "event analytics" within any large satellite or model output. An event is defined as any transient anomaly of interest in space and time. This approach may have applicability in other domains as well. The focus of event analytics is four-fold: first, enable a researcher to interactively discover interesting events using any arbitrary heuristic query; second, use the data to characterize the event and address questions of where, when, and how often; third, support discovery of correlations between events (new correlations can also lead to the formulation of new hypotheses); and finally, allow researchers to experiment with new heuristic rules to best detect the onset of these events. These heuristic rules can in turn be used for prediction and forecasting.
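The first step, discovering events via an arbitrary heuristic query, can be illustrated with a toy heuristic over a one-dimensional series. The threshold, run length, and data values are illustrative, not taken from any real analysis:

```python
def detect_events(series, threshold, min_len=2):
    """Return (start, end) index pairs for runs where the series exceeds a
    threshold for at least `min_len` consecutive samples - a simple stand-in
    for a heuristic event query over satellite or model output."""
    events, start = [], None
    for i, v in enumerate(series):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))
            start = None
    if start is not None and len(series) - start >= min_len:
        events.append((start, len(series)))
    return events

print(detect_events([0, 5, 6, 1, 7, 7, 7, 0], threshold=4))  # [(1, 3), (4, 7)]
```

In a real system the same pattern would run server-side across the full multi-dimensional dataset, with the heuristic (threshold, duration, spatial extent) supplied interactively by the researcher.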

We are investigating two approaches: Polaris and SciDB. Polaris is a custom-built system designed to enable interactive exploration (or data prospecting) of data with a limited set of querying capabilities. The motivation for building this prototype was to explore the utility of offering researchers interactive exploration of full data sets with very limited queries. Our experience has shown that even with simple queries and visualizations, such tools hold tremendous value for researchers, allowing them to play with full datasets. The second approach uses SciDB as the underlying framework to provide general-purpose analytic solutions supporting a wide range of analytic capabilities. These capabilities can be added as Python libraries, allowing a researcher to create analysis workflows in Python on a local machine while the actual computation occurs on distributed computing servers with data locality in mind. While rapid interactivity is lost, this approach enables researchers to perform detailed analysis without dealing with the traditional data-related issues of access, storage, and transformation. We have also validated this approach by taking an existing paper focused on analysis of the Somali Jet and replicating its results. The published analysis covered only one year's worth of data; with the SciDB-based approach we can easily extend it to a 40-year period and, in addition, relate the results to other climate patterns.

Implications – There will be many different solutions enabling the new analysis pathways dictated by big data. Each technology offers different advantages and disadvantages, and it is highly unlikely that a single solution will address the needs of the entire scientific community. Selection of technologies will have to be domain- or problem-specific. No matter the selection, integrating such solutions into the existing CI will have an impact on both research and education.

For example, in research one can envision these capabilities enabling "data expeditions": theme- or topic-based expeditions using specific datasets that are staged on HPC resources for short periods, such as half a day or a full day. A "data expedition" team consisting of researchers, science programmers, and students can then explore the data together in real time and put together a publication draft by the end of the expedition. Each member of the team would play a specific role: researchers could ask questions based on past research results, science programmers could convert those questions into queries and visualize the results, and students could capture the free-flowing discussions as notes. Such synchronous collaborative data explorations can yield new insights from collective knowledge and expertise. These "data expeditions" would be analogous to the "hackathons" used in the software community.

The paradigm of "data expeditions" can also be introduced into classrooms to teach important threshold concepts through a data-driven methodology. Instructors could guide students as they explore a particular data set and then ask them to develop conceptual or mental models of the results using material taught in earlier lessons. In another variation, students could simply be asked to explore the data, see whether they find interesting results, and develop their own insights to explain those results based on their a priori knowledge. Such approaches are needed to train the next generation of researchers who can link theory to real-world observations. These students will be ready to step into the big data age, perform research, explore data, and build new mental models and hypotheses to further scientific research.

 

Enabling Scientific Collaboration

Problem – There are significant untapped resources for information and knowledge creation within the science community in the form of data, algorithms, services, analysis workflows or scripts, and the related knowledge about these resources. Despite the huge growth in social networking and collaboration platforms, these resources often reside on an investigator's workstation or in a laboratory and are rarely shared. Part of the reason is social: researchers themselves are unwilling to share. The factors contributing to this unwillingness are described by Antunes [2] as the 3Rs: Recognition, Reputation, and Reward. Recognition is the visible summation of contribution; Reputation is the value added or given to a user for a contribution; and Reward is the combined outcome of Recognition and Reputation. Reputation is social capital that emerges from the actions of members within the social network. Rewards result from reputation and are benefits that accrue to the individual, such as status within the community. The other part of the problem is technological: we have been unable to infuse these technologies into researchers' existing tools. Very few scientific tools support collaboration via sharing, and those that exist typically require learning a new set of analysis tools and paradigms. As a result, adoption of such tools within science research is often inhibited by the high cost to an individual researcher of switching from his or her familiar environment and tool set to a new one.

Approach – Design an Earth science Collaborative Workbench (CWB) that augments a scientist's current research environment and tool set to allow him or her to easily share diverse data and algorithms. The CWB leverages evolving technologies such as the cloud and open source collaboration frameworks to provide an architecture for scalable and controlled collaboration. The implementation of the CWB is based on the robust open source Eclipse framework and is designed to be compatible with widely used scientific analysis tools such as IDL. The data management components include a myScience Catalog, which captures and tracks metadata and provenance about data and algorithms non-intrusively and with minimal overhead, and a Community Catalog, which tracks the use of shared science artifacts and manages collaborations among researchers. Interfaces to cloud services enable sharing of algorithms, data, and analysis results, as well as access to storage and compute resources.

Implications – The software is being designed and built to fit how researchers currently do scientific analysis in order to facilitate its adoption. To that end, the project is implementing a bottom-up rather than a top-down approach. We also want to conduct studies to quantify the return to individual researchers, in terms of improved productivity, if they adopt these technologies.

Conclusions

Change is inevitable, and there will always be factors influencing and evolving the data life cycle. These changes present opportunities for informatics to innovate: building new tools and approaches, developing new infrastructure, designing methodologies, and coalescing best practices – with the ultimate goal of accelerating knowledge transfer, reducing redundancy, improving productivity, and enabling new science.

References

[1] N. D. Singpurwalla, "Knowledge management and information superiority (a taxonomy)," J. Stat. Plan. Inference, vol. 115, no. 2, pp. 361–364, Aug. 2003.

[2] A. Antunes, "Encouraging good science on the Web," Phys. Today, Am. Inst. Phys., 2009.