Congratulations, your experiments are complete and you have a large amount of data. Now, it’s time to analyze these data. But where do you begin? How can you gain the most meaning from your results and publish a clear, compelling, and, most importantly, accurate story of your research? Here, we provide 5 tips to help you tackle your data analysis.
- Define your research goals and familiarize yourself with the data.
Always, always, always define your goals BEFORE you begin your analysis. Exploratory research is great, but, if possible, avoid using data analysis as a technique to generate a question or hypothesis; this can waste a significant amount of time. You should already have a clear research question or questions to guide your analysis.
Think about what questions you want to answer and the types of data and analyses that will help answer those questions. To do this, you must also understand what type(s) of data you have in hand. What data do you have, and what derived data will you need? What is the quality of your data, and how was it generated? The mantra “garbage in, garbage out” is all too true. Any flaw in your experimental design will carry into the downstream analysis, so make sure you understand how the samples were processed; if you do not know this, obtain that information. Tools like FastQC and fastp are useful for examining initial sequence read quality.
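As a hedged sketch of what that first quality check might look like (the file names `sample_R1.fastq.gz`/`sample_R2.fastq.gz` are placeholders for your own paired-end reads, and both tools must be installed and on your PATH):

```shell
# Illustrative QC sketch; sample_R1/R2.fastq.gz are placeholder file names.
mkdir -p qc_reports

# Per-file quality reports (per-base quality, adapter content, duplication, etc.)
fastqc sample_R1.fastq.gz sample_R2.fastq.gz --outdir qc_reports

# Adapter/quality trimming with a combined before-vs-after HTML report
fastp --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
      --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
      --html qc_reports/fastp_report.html --json qc_reports/fastp_report.json

ls qc_reports   # the reports land here
```

Skim the HTML reports before doing anything else; if the raw reads look poor, no downstream analysis will rescue them.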
- Dive into the literature. Learn more about the analytical methods you wish to employ.
Your research was not conducted in isolation. There is likely a strong foundation of supporting scientific literature from which to gain valuable insight. Dive into the literature. What types of tools and/or workflows did others use to tackle similar problems? Establish a research plan based on your experiments and what others have done. Document the advantages and disadvantages of various methods and how these may impact the interpretation of results. Consider attending workshops and conferences to learn about relevant techniques and tools for data analysis. Also, use these as an opportunity to network and connect with others in the field. You can use these connections to gain expert advice or seek collaborations.
- Consult with experts.
If you haven’t already, consider consulting with data analysis experts, including bioinformaticians and statisticians. While it is best to consult with these experts during the experimental design phase of any project, it is not too late to seek advice. Individuals within the Center for Cancer Research can contact the CCR Collaborative Bioinformatics Resource (CCBR) for free bioinformatics support. Bioinformatics and statistical support are also available to all NCI researchers from the Advanced Biomedical Computational Science (ABCS) group, which provides subject matter expertise in genomics, proteomics, and imaging. Support can be requested by submitting a project request at https://abcs-amp.nih.gov/project/request/ABCS/.
- Establish a data management and sharing plan.
Before you begin your analysis, you should establish a data management plan to organize data, results, and additional output such as generated reports, manuscripts, or presentations. Under the 2023 NIH Data Management and Sharing Policy, all intramural scientists must have a data management and sharing plan in place before conducting scientific research. Organization is crucial for keeping track of any analysis. You will want to know where raw and derived data and outputs are stored, how they were generated, and what worked and did not work.
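One simple convention (the directory names below are suggestions, not a standard) is to separate raw data, derived data, code, and outputs from the very start:

```shell
# Hypothetical project skeleton; directory names are illustrative suggestions.
mkdir -p my_project/data/raw        # original data, treated as read-only
mkdir -p my_project/data/processed  # derived and intermediate files
mkdir -p my_project/scripts         # analysis code
mkdir -p my_project/results         # figures, tables, reports
mkdir -p my_project/docs            # notes, manuscripts, presentations

ls -R my_project
```

Keeping raw data untouched in its own directory makes it easy to regenerate everything downstream and to show exactly what your results were derived from.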
Consider maintaining an electronic lab notebook (e.g., a Jupyter notebook, Quarto notebook, or GitHub repository) to document bioinformatic methods, code, and software versions. Use version control to track any changes made to your scripts. This will not only make your research more reproducible but will also make your life easier when the time comes to prep your manuscript for publication. You won’t have to remember whether you used script_final.sh or script_final_1.sh for your final analysis.
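A minimal git workflow for tracking an analysis script might look like the sketch below (the project and script names are hypothetical):

```shell
# Minimal git sketch; project and script names are hypothetical.
mkdir -p my_rnaseq_project && cd my_rnaseq_project
git init -q
git config user.name  "Your Name"        # identity recorded with each commit
git config user.email "you@example.com"

echo '#!/bin/bash' > align_reads.sh      # placeholder analysis script
git add align_reads.sh
git commit -q -m "Add alignment script"

# Later: see what changed, when, and why
git log --oneline
```

With every change committed under a descriptive message, `git log` and `git diff` replace guesswork about which version of a script produced which result.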
- Leverage existing tools and resources.
Lastly, remember, there is no need to start your analysis from scratch. Make use of pre-existing pipelines (e.g., Nextflow, Snakemake, ccbrpipeliner) and well-documented workflows whenever available. If you do not understand a tool’s parameters, stick with the defaults (though do try to understand what defaults are being used and why). Use existing tools available on the NIH HPC Biowulf, and check out the associated documentation. You can also use cloud computing systems with integrated workflows such as NIDAP, CGC, or AnVIL.
To get started using these tools and more, explore tutorials and courses specifically designed for wet lab scientists from BTEP, Coursera, edX, YouTube, Bioconductor, and more. For advice and troubleshooting, check out online forums such as Biostars, StackOverflow, and Bioconductor, or email us at BTEP for support and guidance.
There are many additional considerations that go into any bioinformatics analysis. However, these 5 tips will help guide you in the right direction, providing a solid foundation for your data analysis journey. If at any time you need support, guidance, or additional training, email us at ncibtep@nih.gov.
– Alex Emmons