Try Out New (and Free) Software for Variable Recoding and Harmonization: QuickCharmStats 1.1 and CharmStats Pro 1.0

Guest post by Dr. Kristi Winters, CharmStats Project Manager, GESIS Data Archive for the Social Sciences

Do you have variables to harmonize for statistical analysis, or as part of a large-scale study’s data preparation and documentation? There are new, free, and open-source software solutions available that can speed up variable harmonization and preserve variable-, question-, and study-level metadata for you. These programs generate harmonization syntax in multiple statistical languages and produce reports or even codebooks so you can quickly publish your work in a readable format.

The first product is designed for small-scale research (under 100 variables to be harmonized), such as publishing an article or producing a report. QuickCharmStats (QCS) was designed to reduce the time and effort researchers spend harmonizing and recoding variables in preparation for statistical analysis. The second product, CharmStats Pro, was designed for larger research projects and for national and international research teams that want to centralize, document, and manage the harmonization process.

QuickCharmStats

QuickCharmStats 1.1 (QCS) was designed for researchers who want to quickly and easily create recoding syntax for use in statistical analysis. It allows you to import the necessary metadata from SPSS and Stat/Transfer. Once the metadata are imported, you can search for the variables you need, import them into your project, and quickly produce the syntax needed to harmonize variables in SPSS and Stata.
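The syntax CharmStats generates is SPSS or Stata code. As a rough illustration of what such a harmonization recoding does (a toy sketch in Python, not CharmStats output; all variable names and category codes are invented), two surveys that code education differently can be mapped onto one harmonized scheme:

```python
# Illustrative only: two surveys code respondents' education differently;
# both are recoded onto one harmonized three-category scheme.
# All variable names and category codes are invented for this sketch.

# Survey A: 1=primary, 2=secondary, 3=tertiary
map_a = {1: "low", 2: "medium", 3: "high"}
# Survey B: 10=no degree, 20=school degree, 30=vocational, 40=university
map_b = {10: "low", 20: "medium", 30: "medium", 40: "high"}

def harmonize(value, mapping):
    """Recode one source value into the harmonized scheme; None if unmapped."""
    return mapping.get(value)

survey_a_responses = [1, 3, 2, 2]
survey_b_responses = [40, 10, 30]

harmonized = [harmonize(v, map_a) for v in survey_a_responses] + \
             [harmonize(v, map_b) for v in survey_b_responses]
print(harmonized)  # ['low', 'high', 'medium', 'medium', 'high', 'low', 'medium']
```

The value of a tool like CharmStats is that mappings of this kind are documented together with the question- and study-level metadata, rather than living only in ad-hoc syntax files.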

QCS 1.1 has two further features: Reports and Graphs. The report feature lets you choose from templates and instantly create an .html file based on the information and metadata in your project. It saves your documentation and bibliographical information, your notes on coding decisions, and the harmonization syntax. You can share your report with others at any time, anywhere, or post it online as a reference. The graph feature provides an image of your harmonization. These images can be saved as .jpeg files and used in presentations or included as part of your documentation.

CharmStats Pro

CharmStats Pro is for large-study researchers or survey teams who want to quickly and easily create and preserve recoding or harmonization syntax as part of a codebook. CharmStats Pro imports the necessary metadata to document the harmonizations. Once the metadata are imported, researchers can search for the variables they need, import them into a project, and quickly produce syntax to harmonize variables in SPSS and Stata. This version also allows research teams to add reference information for any sources consulted. CharmStats Pro auto-generates the same reports and graphs found in QuickCharmStats 1.1, for all your documentation needs.

CharmStats Pro is unique because it allows for a shared database that connects all the members of a study team. Those who work on large-scale studies or research projects with several staff can now combine and coordinate their variable harmonization and documentation work. To facilitate a cooperative digital environment, CharmStats Pro has a communications suite featuring an internal email and task manager system to help teams organize their work.

We believe that after investing a short amount of time learning how to create projects, you and your team will save valuable time and effort by digitizing your harmonization work. Learn how to get your original documentation work published for citation by reading this open access article on harmonization documentation standards and reporting: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147795.

To contact us or download the software, please visit http://www.gesis.org/en/services/data-analysis/data-harmonization/.

Building a German infrastructure for RDM – RfII Recommendations

In June 2016 the RfII published a position paper on the structure, processes, and funding of research data management (RDM) in Germany. It analyses the current RDM landscape in Germany and makes recommendations on how to foster an infrastructure for data management, preservation, and sharing.

The German Council for Scientific Information Infrastructures (RfII – Rat für Informationsinfrastrukturen) consists of 24 members from the scientific community. It was established in 2014 by the state and federal governments to support “the strategic development of a contemporary infrastructure for access to scientific information” (RfII Webpage).

The German national RDM landscape is marked by “an overall absence of coordination, and . . . parallel, project-based initiatives” (RfII 2016, p. 5). The establishment of RDM services, procedures, and policies is currently driven mainly by individual academic institutions, research organisations, and a small number of learned societies (e.g. in psychology). Research funders, especially the Federal Ministry of Education and Research (BMBF) and the German Research Foundation (DFG), promote RDM and data sharing and foster the development of infrastructures for data management. However, there is no general mandate to manage and share data from funded research projects.

Thus the current situation in Germany presents itself as follows:

  • RDM infrastructure is built in a largely uncoordinated, bottom-up process (p. 5; p. 22);
  • there is a lack of a suitable political framework and of a nation-wide, coordinated information infrastructure, where the latter is understood as comprehensive services and procedures to support the management and use of data in all phases of the data lifecycle (p. 12);
  • there is a great number of position papers, problem analyses, and recommendations for action on the one hand, but an “implementation deficit” on the other (p. 15, our translation).

Contributing to this situation is the fact that the German constitution limits the possibilities for cooperation between the federal states (“Länder”) and the national government (“Bund”) in the field of higher education and research. Universities are funded by the Länder, and funding efforts spanning all or several federal states are limited (also see p. 22).

Another important factor that has to be taken into account, especially when thinking about RDM-related policies and mandates, is the freedom of science guaranteed by the German constitution and specified in the Framework Act for Higher Education. This freedom gives researchers control over how they conduct their research and publish its results.

These limiting factors must be acknowledged when discussing the German RDM landscape. As RfII observes, it will not be possible to develop and implement measures in a top-down fashion. However, “top-down impulses” are required to start processes leading towards a “dynamic integration of distributed knowledge” (p. 34, our translation).

The core recommendations made in the position paper to this end are:

  1. implement long-term funding mechanisms for RDM infrastructures;
  2. establish a national research data infrastructure (NFDI);
  3. foster “responsible data culture”;
  4. invest in the development in human resources for data management;
  5. strengthen international cooperation;
  6. establish mechanisms for actively steering the transition process.

In the following, we focus on the suggested NFDI, which is interesting also in light of the report on the European Open Science Cloud published just a few days before the RfII position paper.

A National Research Data Infrastructure

The RfII position paper introduces the vision of a National Research Data Infrastructure as follows:

“Many aspects of research data management are of a generic and hence transferable nature . . . . With an eye to cost and efficiency, generic services can and should be established and offered in a shared manner. The RfII suggests the establishment of a consortium which bundles existing competences and provides basic storage infrastructure and services and ensures a fast transfer of competences within the science system. This National Research Data Infrastructure (NFDI) should take the form of a network spanning disciplines and communities. It should include existing big information infrastructures, the national level of ESFRI projects as well as those repositories serving sufficiently homogeneous user groups”. (p. 40, our translation)

The envisioned infrastructure has a strong focus not only on storage but also on (enabling) access to and use of the data. Thus the recommendations paint the picture of an infrastructure that fosters interdisciplinary research by enabling the combination and analysis of data from different communities and fields (p. 41). Services to support this include an access portal with access rights management, services for data registration and publication, search engines supporting semantic search with automatic translation of terms into community-specific terminology, and support for data analysis and visualization (see appendix D.3).

To achieve this, the following challenges need to be addressed:

  • overarching minimal standards for quality management in data description and storage;
  • the development of generic procedures of data analysis;
  • the development of generic data services and data storage;
  • (continuing) professional education (p. 41).

Among the many questions that will have to be answered to realize the vision of the NFDI, two seem especially relevant:

1. On the national level, what will be the role of discipline-specific infrastructures and services within the NFDI?

The RfII position paper suggests that the NFDI will consist of different layers: it will integrate smaller archives, libraries or computing centers as well as bigger, community-specific infrastructures in addition to the cross-disciplinary, generic services. It is crucial to define the distribution of responsibilities between these different players.

For example, the RfII paper states that the NFDI should be responsible for digital preservation. This would be an important step towards greater sustainability in this area, although many aspects of digital preservation beyond storage and backup will still have to be dealt with by discipline- and community-specific infrastructures – especially description with metadata, format migrations, and support for users in working with and understanding the data. At the same time, however, the RfII recommends that infrastructure nodes should compete for (financial) resources and users (p. 43). This risks re-introducing, through the back door, uncertainty about long-term funding – which is inherently problematic for digital preservation services.

2. How do we ensure compatibility and interoperability of the German NFDI with the European Open Science Cloud?

On June 20, 2016 the Commission High Level Expert Group on the European Open Science Cloud (EOSC) published its first report, “A Cloud on the 2020 Horizon”, which “aims to lay out a high level, living roadmap for the realisation of the European Open Science Cloud” (no pag.).

The EOSC report calls for a “complex eco-system of infrastructures” in response to the fact that

“the challenges of ever bigger data can no longer be solved only by ever bigger infrastructure. . . . With the growth of data in more and more disciplines outpacing the increase of transfer speed, many comprehensive datasets are simply too big to move efficiently from one location to another. Moreover, data are in many cases so privacy sensitive that legislation effectively precludes their moving outside the environment in which they have been collected. Therefore, relatively lightweight workflows (e.g. process virtual machines) . . . increasingly visit data where they reside, with supporting reference data and transporting only conclusions outside the safe data vault. . . . Centralised supercomputing locations that are crucial for solving high capacity HPC scientific challenges alone will not adequately support this irreversible trend. Complementary infrastructures are needed”. (no pag.)

It appears that the EOSC report and the RfII position paper adhere to a similar infrastructure “paradigm”. Both argue for a decentralized network of interoperable nodes offering generic or specialized services, embedded in a framework of suitable policies and drawing on community- and discipline-specific knowledge and practices.

The EOSC report puts a rather strong emphasis on the importance of (technical) protocols, comparable to those on which the Internet builds, to achieve interoperability between the complementary infrastructures, whereas the RfII paper recommends focusing on standards for data description and generic procedures for data analysis as challenges to be addressed with high priority. It is crucial that the German national development is closely coordinated with developments on the European level, but given that it is still early days for both the EOSC and the NFDI, this does not seem an insurmountable task.

DINI/nestor Workshop: #RDM Tools

On June 17, 2016 the fifth DINI/nestor workshop on research data and data management took place in Kiel, Germany. It addressed tools for research data and their integration into the research and data management process.

Tools for data management and data handling are important for (at least) two reasons: 1) They help to standardize research data management and its procedures. 2) They can help to foster the adoption of research data management practices by reducing the effort and time researchers have to invest when implementing RDM measures.

The biggest challenge in this regard, however, is to develop tools that are actually used – as presenters pointed out repeatedly, tools have to integrate with the research process as seamlessly as possible to stand a chance of being adopted. A tool that requires researchers to go out of their way and does not produce an immediate, tangible benefit will end up not being used.

The workshop’s presentations and breakout sessions introduced participants to an assortment of different data management tools already available or currently being developed, ranging from Virtual Research Environments to tools for the creation and publication of metadata, and packaging tools for the submission of data and documentation to a repository.

Among the questions discussed during the sessions was that of generic vs. discipline- (or methods-) specific tools. While it makes sense to offer certain services – such as secure storage, document sharing, project management, and communication – centrally, many aspects of data management are very specific to the different disciplines and methods used in the research. This includes, for example, how and with which metadata the research process and data are best documented, the degree of automation of measurement and analysis processes, or typical ways of collaborating.
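One common way to reconcile generic and discipline-specific needs is to combine a generic metadata core with discipline-specific extension fields. The following toy sketch illustrates the idea; all field names and values are invented for this example and are not taken from any of the tools presented at the workshop:

```python
# Illustrative sketch only: a generic metadata "core" shared across
# disciplines, extended with discipline-specific fields.
# All field names and values are invented for this example.

generic_core = {
    "title": "Survey on commuting behaviour",
    "creator": "Example, Erika",
    "date": "2016-05-01",
    "license": "CC-BY-4.0",
}

# A social-science extension might add method- and instrument-level detail
social_science_ext = {
    "mode_of_collection": "CAPI",
    "sampling_procedure": "probability sample",
}

def describe(core, extension):
    """Merge generic and discipline-specific metadata into one record."""
    record = dict(core)       # copy so the generic core stays untouched
    record.update(extension)  # layer the discipline-specific fields on top
    return record

record = describe(generic_core, social_science_ext)
print(sorted(record))
```

A central service would then index the generic core for cross-disciplinary search, while discipline-specific tools interpret the extension fields.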

Below is an overview and short description of the tools presented during the workshop. All presentations (in German) are available for download from the workshop page.

Overview of tools presented

Tool | Discipline | Institutional | Collaboration | Storage | Metadata | Data handling | Publication
DataWiz | Psychology | o | o | o | x | o | x
LZA Lite | generic | x | (x) | x | x | o | (x)
MASI | generic, Applied Sciences | x | o | x | x | o | x
Replay DH | Digital Humanities | o | x | o | x | x | o
VRE GEOMAR | Marine science | x | x | x | x | x | x
VRE U Kiel | generic | x | x | x | x | x | x
ZBW Journal Data Archive | Economics | o | o | (x) | x | o | x

DataWiz is a data management tool currently being developed at the ZPID, Leibniz Center for Psychology Information. It supports data management planning and implementation in the field of psychology and the documentation of the research process and data with the help of metadata. It will be possible to submit the data and the documentation created with the tool directly to the ZPID research data center PsychData. In the future, the tool will also support the pre-registration of research.

GEOMAR data management portal: This integrated data management system for marine research is a collaborative effort of several large-scale marine research projects begun in 2009. The objective was to create a common working platform and common research data management rather than addressing the associated challenges in each project separately. Today the platform incorporates tools for data collection, archiving, and publication, as well as for information exchange and collaboration.

LZA Lite is a cooperation between three German universities. It is a Fedora-based platform supporting the secure storage of both administrative records and research data and their enrichment with metadata. It is planned to expand the platform with solutions for collaborative work and for long-term preservation. The productive system will be launched in 2017.

MASI – Metadata Management for Applied Sciences is a tool currently being developed at the TU Dresden in cooperation with several other HEIs. Intended as a research data repository for “living data”, it will integrate functions and services for the (automated) description of data with metadata, data storage, and the publication and re-use of data.

Replay DH: This is a project to build a git-based versioning tool for research data in the digital humanities, carried out at the universities of Ulm and Stuttgart. A GUI and fields for standardized metadata description will be created for git, and DOI registration (also for not-yet-final versions of the data) will be implemented.

Virtual Research Environment at the University of Kiel: Based on existing infrastructure, the project is currently establishing a generic VRE that combines tools for data storage, collaboration, and publication with central services such as identity management and with discipline-specific tools. The VRE is embedded in an organizational setting dedicated to fostering RDM practices at the University of Kiel and offering (face-to-face) support for researchers, among other things.

ZBW Journal Data Archive: This portal was developed at the ZBW – German National Library of Economics as part of the EDaWaX project to support the replicability of research in economics. Based on CKAN, it allows data underlying empirical research articles to be described with metadata and published in accordance with journal data policies. The data is securely stored at the SOEP Research Data Center while the metadata is managed by the ZBW.

CESSDA Training goes GEBF

As a part of the annual conference of the Gesellschaft für Empirische Bildungsforschung (GEBF) at the beginning of March 2016 in Berlin, CESSDA Training held a workshop on research data management.

Data management and the re-usability of research data are becoming increasingly important in empirical social and educational research in Germany. The German Federal Ministry of Education and Research has thus begun to make data management, data archiving, and data re-usability a precondition for funding in the field of educational research. In response, three German research institutes – the Deutsches Institut für Internationale Pädagogische Forschung (DIPF), the Institut für Qualitätsentwicklung im Bildungswesen (IQB), and GESIS – established the Verbund Forschungsdaten Bildung (VFDB).

This research infrastructure supports researchers in managing and archiving their data from the field of educational research. The VFDB web portal (in German) offers best practice guidelines and templates on all relevant topics of research data management in empirical social and educational research. This includes legal and ethical issues, data documentation, data security, and data archiving among other things.

To support the VFDB’s primary objective, CESSDA Training and the IQB jointly held the GEBF workshop to introduce participants to the overall field of research data management. During the two-hour session, we talked about the relevance of the long-term accessibility of research data and discussed the elements of good data management. Due to the limited time available, this workshop could of course only provide a first introduction to the topic and increase participants’ awareness of issues that might compromise data quality and re-usability. However, this is an important first step on our way to making research data management and sharing a matter of course for researchers in the empirical social and educational science community.

2nd SERISS Advisory Board Meeting

On March 15 the Advisory Board of the EU-funded project Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) met in London. The project brings together cross-national European survey programs and research infrastructures, including the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the Consortium of European Social Science Data Archives (CESSDA AS).

The major themes addressed by SERISS are

  • Challenges of cross-national data collection
  • Breaking down barriers between infrastructures
  • Embracing the future of the social sciences (specifically, new forms of data, their collection and analysis) (see http://seriss.eu/about-seriss/project-overview/).

SERISS seeks to foster quantitative social research in Europe by providing solutions to challenges along the entire research data lifecycle with a focus on data preparation and data dissemination.

During the second SERISS meeting in London, the work package leaders gave updates on the current status of activities, discussing the ongoing work as well as further opportunities for collaboration.

CESSDA AS is involved in various tasks, e.g. the development of interactive tools for cross-national surveys – a project management platform in particular – as well as training and the dissemination of SERISS outputs. CESSDA Training and its partners will design and deliver face-to-face training on research data management and on the software applications currently being developed within SERISS. In this way, we aim to improve research data management as well as data dissemination for secondary use, and to spread the benefits of the project within the research community.

KE Workshop on RDM training and skills

Co-authored with Laurence Horton, London School of Economics and Political Science

Knowledge Exchange (KE) is an international collaboration to enable open scholarship, featuring infrastructure bodies from Germany, Denmark, the Netherlands, Finland, and the UK.

With a focus on Research Data Management (RDM) training, this London event on 9–10 February 2016 combined presentations with group discussions of broader themes. The results of the workshop, together with the results of a survey KE carried out among RDM training providers before the workshop, will be published as a Knowledge Exchange report this spring.

Presentations

Ellen Verbakel (TU Delft) introduced the Essentials 4 Data Support course, targeted at those who support researchers in data management. The course is organized along the research data life cycle and can be taken online for information, or for certification through blended learning including assignments and face-to-face teaching. One finding was that the website needed short text, images, and video.

Jonas Recker (GESIS) talked about CESSDA Training’s introductory workshops on RDM. Evaluations show participants are happy with the opportunity for questions and discussion. However, demand exists for practical examples plus guidance on informed consent, anonymisation, and data protection.

UK Data Service’s Libby Bishop also reported on the value of using “real” data in workshops. Again, the argument was less PowerPoint, more exercises. Libby also mentioned challenges that make training harder, including “cloud” storage, encryption, “big data”, and non-academic sources using data for research.

An institutional focus came from Gareth Knight (LSHTM), who identified developing areas of demand for RDM support, such as training on mobile devices for data capture and advanced anonymisation and encryption training.

Stéphane Goldstein (InformAll) described the KE survey of RDM training. It found that training audiences are almost exclusively PhD students and post-docs. The data also suggested discrepancies between the learning aims and the impact of the training.

Reporting on their train-the-trainers project, Joy Davidson (DCC) identified groups missing from RDM training and explained why reaching them is important. A wide range of institutional support staff – including archivists, finance staff, and legal officers – touch on RDM through their roles but are not getting support on how their jobs fit into enabling data reuse and preservation.

Two presentations from Denmark showed how a smaller nation is tackling RDM. Henrik Pedersen (University of Southern Denmark) outlined the use of a national forum to ground local expertise in national coordination and Karsten Kryger (Aalborg University Library) sketched their flexible training master plan, while emphasizing that training was a minor but important part of an overall strategy.

Christian Jämsen of the National Institute for Health and Welfare in Finland covered how the institute manages extensive, sensitive patient record data and makes it available for research. Part of the challenge for the institute was knowing what data it holds and what it can use.

Presentations are available from this dropbox link.

Discussions

Broader theme breakout sessions focused on lessons learnt, challenges in training, cross-national insights, and success criteria for RDM. Summaries of discussions will be available from KE.

A number of recurring themes were discussed throughout the workshop, including:

Level of training and scalability

It appears that most of the training delivered by workshop participants was at the introductory level. This could be because it is still “early days” in integrating RDM procedures and RDM training into the research routine. However, as a community we should start thinking about more advanced training for certain target groups.

Linked to this is the question of how to make RDM training scale. The KE survey revealed that in 2014 over 30% of the training offered reached fewer than 50 participants, and another 19% reached fewer than 99 participants. One reason may be the small number of dedicated staff, but in-depth training that goes beyond very general introductory remarks on RDM also appears to require smaller groups and room for discussion of subject- and project-specific questions.

Impact and success

The question of impact and success is twofold: how do we measure the impact of RDM training – i.e. how do we know training has actually improved something – and how do we measure the success of RDM itself?

Metadata and repositories for training resources

It became clear in the discussions that there is demand to improve the discoverability and accessibility of RDM training materials. As Laura Molloy and Kevin Ashley rightly pointed out, suitable repositories and metadata already exist. We need to raise awareness that these tools are already out there, and possibly discuss how to make them a good fit for a broad spectrum of training resources.

DINI/nestor workshop: Appraisal and selection of research data

On November 17th, 2015 the fourth part of the DINI/nestor workshop series dedicated to the management and preservation of research data was hosted by the University of Duisburg-Essen (program in German; presentation slides will be published here as well).

The workshop looked at the selection and appraisal of research data from the perspective of different (predominantly German) actors in the research data infrastructure: data-holding institutions such as universities and university libraries, “traditional” archives, discipline-specific services, and the German National Library.

Introduction

Jens Ludwig (Staatsbibliothek Berlin – Preußischer Kulturbesitz) started off the day with a general introduction to the selection and appraisal of research data. In his presentation he argued against the defeatist stance that research data management and preservation are pointless because ‘we will never have the resources to adequately manage, appraise, and preserve all the data generated’. While it is true that the means and measures at our disposal are not adequate to preserve everything, using this as an excuse to remain inactive means throwing out the figurative baby with the bath water. If we cannot preserve everything forever, we must develop strategies that help us identify more reliably what should be kept, and for how long.

One approach is to define selection processes and criteria, for example based on the question “what for?”: for which purposes do we want to manage and keep data? Possible answers include:

  • because doing so is an important element of good research practice and enables the replication of research results;
  • or because we assume the data will be useful to other researchers in the future and can help the scientific community to gain new insights.

Preserving data is not an end in itself. It serves a purpose, and selecting data / defining selection criteria for preservation must take this purpose into account.
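To make this concrete, a purpose-driven selection policy can be expressed as simple rules that map each preservation purpose to a retention decision. The following is a toy sketch, not something presented at the workshop; all rule names and retention periods are invented:

```python
# Toy sketch of purpose-driven appraisal rules (all values invented).
# Each rule maps a preservation purpose to a retention decision.

RULES = {
    # good research practice: keep for a fixed replication window
    "replication": {"keep": True, "years": 10},
    # anticipated re-use by other researchers: keep indefinitely
    "reuse": {"keep": True, "years": None},  # None = indefinitely
    # no identified purpose: do not archive
    None: {"keep": False, "years": 0},
}

def appraise(purpose):
    """Return the retention decision for a preservation purpose;
    unknown purposes fall back to the 'do not archive' rule."""
    return RULES.get(purpose, RULES[None])

print(appraise("replication"))
print(appraise("reuse"))
print(appraise("data-with-no-stated-purpose"))
```

The point of the sketch is only that a stated purpose comes first and the retention decision follows from it, rather than the other way around.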

Paper sessions

The introduction was followed by presentations shedding light on practical aspects of data appraisal, including examples of how data is selected for preservation and dissemination in bigger organizations: the GESIS Data Archive for the Social Sciences (Natascha Schumann), the German National Library (Sabine Schrimpf), and the Landesarchiv Baden-Württemberg (Christian Keitel). The approaches of these three organizations differ considerably as a result of their different mandates, both in the types of selection criteria employed (e.g. formal vs. content-based, or a combination of the two) and in the “rigor” of the selection process, reflected among other things in the “acceptance rate” for submitted objects.

A second set of presentations focused on universities and on the “producer” side of the process of selecting research data for preservation and sharing. Kerstin Helbig discussed support services for research data management at Humboldt University Berlin. Specifically, she looked at information needs of researchers who prepare to submit data to an institutional repository or discipline-specific archive. For them it can be difficult to select the “correct” subset of all the data and information they collected and created throughout the entire research process for preservation and/or dissemination. In the context of research data management planning it is important to know and understand the selection criteria employed by the repository or archive.

Documenting and analyzing the questions that researchers have about this process can help archives and repositories to better understand producers’ needs and to communicate with them more efficiently. The same purpose can be served by the results of a survey carried out among researchers in Austrian universities (Paolo Budroni, Barbara Sánchez Solís), presented at the end of the paper sessions.

Break-out sessions and final discussion

The paper sessions were followed by break-out sessions for practical exercises on data appraisal and for discussions focusing on the different stakeholders involved in the process of selection and appraisal and the general structure of an appraisal process for research data. These nicely complemented the presentations and offered a good opportunity for in-depth discussions and for hands-on learning about the appraisal and selection process.

The workshop concluded with an open discussion which highlighted a perspective which unfortunately was not strongly represented in the previous presentations: that of (smaller) institutional, multi-disciplinary repositories in universities. Often, these have a mandate to (or are expected to) accept everything created by the researchers of the university. They perform a kind of “catch basin” function for the “long tail” of research data that needs to be retained for a certain period, e.g. for replication purposes.

It is obvious that selection and appraisal processes in such repositories differ from those of a bigger, discipline-specific service. However, even when “everything” has to be accepted, criteria are required – for example, to determine if and how the data will be curated and disseminated (e.g. legal, ethical aspects), or whether a given dataset should be offered to a subject-specific archive or repository.

One envisioned output of the workshop is a German-language guideline on the selection and appraisal of research data, to be created collaboratively over the next year and building on the presentations and discussions of the workshop. One challenge for this guideline will certainly be to include the perspectives of both bigger, discipline-specific services and smaller institutional repositories.