Clean and Organize


Most data analysis and storage requires digital data. Some tools make it easy to create born-digital data sets; for more information, see Prepare and Create Data. On occasion, however, you will need to migrate data from an analog to a digital format. For more information about digitization, see the following guidelines:


Some digital data will require translation or transcription. Transcription is the process of creating a text document of the contents of a file (such as closed captioning for an audio or video file), while translation is the process of reproducing the data in another language. Transcription can mean using Optical Character Recognition (OCR) to create a raw text file from a scanned page, or creating a text file from an audio file. Translation, by comparison, might mean creating an English text file of a letter originally written in Spanish. More accurate translations and transcriptions will require human intervention, but many machine translation tools are available for free.

Examples of tools that do one or both:

  • Otter is an AI transcription tool that includes 600 free minutes per month.
  • Google Docs--Voice Typing is a free transcription tool that works well with a variety of languages.
  • Google Translate is a free tool that will transcribe and translate foreign languages. It has a mobile app that will translate texts in real time. It will also transliterate from non-Roman scripts into Roman scripts and vice versa.
  • TraveLang Translating Dictionaries is a free tool that allows cross-searching of multiple translation dictionaries.

Quality Control/Edit

Quality control is the process of assessing the consistency and accuracy of your data and revising it as necessary. The specific checks your data needs will depend heavily on the type of data you are creating. For more information on quality control, see "file organization", "version control", and "documentation and metadata" on the Data Storage page of our libguide.
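
As one illustration of a consistency check, the following is a minimal sketch in Python that flags blank required fields and exact duplicate rows in tabular data. The function name `check_rows`, the column names, and the sample data are all hypothetical; the checks your own data needs may be quite different.

```python
import csv
import io


def check_rows(rows, required_fields):
    """Return a list of (row_number, problem) tuples for a list of dict rows."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        # Flag required fields that are missing or blank.
        for field in required_fields:
            if not (row.get(field) or "").strip():
                problems.append((number, f"missing {field}"))
        # Flag rows that exactly repeat an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((number, "duplicate row"))
        seen.add(key)
    return problems


# Hypothetical sample data; in practice, open your own CSV file instead.
sample = io.StringIO("id,site,count\n1,A,10\n2,,7\n1,A,10\n")
rows = list(csv.DictReader(sample))
report = check_rows(rows, ["id", "site", "count"])
for row_number, problem in report:
    print(f"row {row_number}: {problem}")
```

A report like this only points to rows that need attention. Deciding how to revise them still requires someone who knows the data.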


Anonymising your data to protect the privacy of study participants is an important step to take before sharing your data publicly. The following are open-source tools that support anonymisation at various levels, depending on your use case:

  • ARX not only anonymises your data but also analyses the output's utility and privacy risks.
  • NLM-Scrubber is a HIPAA-compliant clinical text de-identification tool.
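
To give a sense of what the simplest level of this work looks like, the sketch below drops direct identifiers from a record and replaces the participant ID with a keyed hash. This is pseudonymisation written with Python's standard library, not the method used by ARX or NLM-Scrubber, and it does not by itself guarantee anonymity. The function name `pseudonymise` and the field names `participant_id`, `name`, and `email` are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymise(record, secret_key, id_field="participant_id",
                 drop_fields=("name", "email")):
    """Return a copy of record without direct identifiers and with a hashed ID."""
    # Remove fields that identify a person directly.
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    # Replace the ID with a keyed hash so the same participant always maps
    # to the same pseudonym, but the original ID is not exposed.
    digest = hmac.new(secret_key, str(record[id_field]).encode("utf-8"),
                      hashlib.sha256).hexdigest()
    cleaned[id_field] = digest[:12]
    return cleaned


# Hypothetical record and key; keep the real key private and out of shared files.
record = {"participant_id": 17, "name": "Jane Doe",
          "email": "jane@example.org", "age_group": "30-39"}
safe_record = pseudonymise(record, secret_key=b"replace-with-a-private-key")
print(safe_record)
```

Remaining fields such as age group or location can still identify someone when combined, which is the kind of risk that dedicated tools are designed to assess.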


The description of data is often called creating "metadata" or "data about data."

Many of the tools listed above in Quality Control can also be used to store this data in a separate file, such as within the same folder as your primary data. You can also embed metadata directly into your files with tools like Adobe Bridge. Adobe Bridge and ExifTool can harvest metadata that many files contain by default, such as the file size, file type, creation date, and GIS information. The DMPTool is also helpful in assessing what metadata should be included.
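
For a rough idea of what harvesting basic file metadata into a separate file involves, here is a minimal Python sketch using only the standard library. It does not reproduce what Adobe Bridge or ExifTool do. The function names `describe_file` and `write_sidecar` and the `.metadata.json` suffix are assumptions, and the sketch records the last-modified time because creation time is not reported consistently across operating systems.

```python
import json
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Collect basic file-system metadata for one file."""
    path = Path(path)
    info = path.stat()
    # Guess the media type from the file extension only.
    media_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        "size_bytes": info.st_size,
        "media_type": media_type or "unknown",
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc).isoformat(),
    }


def write_sidecar(path):
    """Write the metadata to a JSON file in the same folder as the data file."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".metadata.json")
    sidecar.write_text(json.dumps(describe_file(path), indent=2),
                       encoding="utf-8")
    return sidecar
```

Calling `write_sidecar("interview01.txt")` on a hypothetical file of that name would create `interview01.txt.metadata.json` beside it.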

For more information about metadata for research data, see Documentation and metadata.

Questions? Email