Research Datasets

Lung cancer histopathology and whole slide imaging datasets used in our ICMR-funded research project.

Project Datasets

Primary datasets used in developing our deep learning ensemble model for lung cancer detection

(a) LC25000 Lung Histopathological Images

A curated dataset developed by expert pathologists from the University of South Florida. This project uses the lung subset of LC25000; the full dataset has 25,000 images and also covers colon tissue. The original lung set contained 750 images in three classes: 250 benign lung tissue (LBT), 250 lung adenocarcinoma (LUAD), and 250 lung squamous cell carcinoma (LSCC). Through augmentation, it was expanded to 15,000 images, with 5,000 images per class.
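
The augmentation arithmetic for the lung images can be checked with a few lines of Python. This is only an illustrative sketch: the per-class counts come from the description above, while the dictionary keys and the helper name `augmentation_factor` are hypothetical and not part of the dataset or of any library.

```python
# Per-class counts for the lung images, taken from the description above.
ORIGINAL_PER_CLASS = {"benign": 250, "LUAD": 250, "LSCC": 250}
AUGMENTED_PER_CLASS = 5_000


def augmentation_factor(original: int, target: int) -> int:
    """Return how many images each original must account for to reach target.

    Hypothetical helper for illustration; it assumes every original image
    contributes the same number of images to the expanded set.
    """
    if original <= 0 or target % original != 0:
        raise ValueError("target must be a positive multiple of original")
    return target // original


factors = {
    name: augmentation_factor(count, AUGMENTED_PER_CLASS)
    for name, count in ORIGINAL_PER_CLASS.items()
}
total_original = sum(ORIGINAL_PER_CLASS.values())
total_augmented = AUGMENTED_PER_CLASS * len(ORIGINAL_PER_CLASS)

print(factors)          # factor per class
print(total_original)   # images before augmentation
print(total_augmented)  # images after augmentation
```

Under these assumptions each class is expanded by the same factor, so the three classes stay balanced after augmentation.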

Images: 15,000
Classes: 3 (Benign, LUAD, LSCC)
Format: JPEG
Resolution: 768 × 768 pixels
Source: Andrew A. Borkowski et al.
Year: 2019
License: CC BY 4.0

(b) CPTAC Whole Slide Images for Lung Cancer

A large collection of LUAD and LSCC Whole Slide Images (WSIs) obtained from The Cancer Imaging Archive (TCIA). Approximately 300 WSIs were downloaded and annotated by certified medical consultants. Expert pathologists marked the Regions of Interest (ROIs), and image tiles were manually extracted using Aperio ImageScope. All tiles were standardized to 512 × 512 pixels.
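
For readers who want to see what a fixed 512 × 512 tiling of an ROI looks like, the following is a minimal Python sketch. It is not the procedure used in this project, where tiles were extracted manually in Aperio ImageScope; the function name `tile_origins` and the example ROI coordinates are hypothetical, and the sketch assumes non-overlapping tiles that must fit entirely inside a rectangular ROI.

```python
from typing import Iterator, Tuple

TILE_SIZE = 512  # tile edge length in pixels, as stated in the text


def tile_origins(
    roi_x: int, roi_y: int, roi_w: int, roi_h: int, tile: int = TILE_SIZE
) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of non-overlapping tiles inside an ROI.

    Hypothetical helper: only tiles that fit completely within the
    rectangle are produced, so partial tiles at the edges are skipped.
    """
    if tile <= 0:
        raise ValueError("tile size must be positive")
    for y in range(roi_y, roi_y + roi_h - tile + 1, tile):
        for x in range(roi_x, roi_x + roi_w - tile + 1, tile):
            yield (x, y)


# Example ROI (made-up coordinates): 1300 px wide, 1024 px tall.
origins = list(tile_origins(1000, 2000, 1300, 1024))
for origin in origins:
    print(origin)
```

Each yielded corner could then be passed to whatever slide reader is in use to crop the corresponding 512 × 512 region. Skipping partial edge tiles is one possible design choice; padding or overlapping tiles are alternatives.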

Images: ~300 WSIs (tiles extracted)
Classes: 2 (LUAD, LSCC)
Format: TIFF / SVS (WSI), PNG / JPEG (tiles)
Resolution: 512 × 512 pixels (tiles)
Source: CPTAC via The Cancer Imaging Archive (TCIA)
Year: 2018
License: Public Research Use

(c) GCRI Lung Carcinoma Microscopic Images

Local histopathology data collected from the Gujarat Cancer Research Institute (GCRI). Approximately 1,500 lung carcinoma cases are included, captured at magnifications of 10× and 40× by trained onco-pathologists. The study is conducted under approved ethical guidelines.

Images: ~1,500
Classes: Lung Carcinoma (various subtypes)
Format: JPEG / PNG
Resolution: Variable (10× and 40× fields)
Source: Gujarat Cancer Research Institute (GCRI)
Year: 2024–2025
License: Institutional Ethical Approval
Ethics: GCRI/GCS Ethics Committee-BHR (Ref: EC/BHR/14/2024)
Access: Institutional access only

Dataset Usage Guidelines

These datasets are used in our ICMR-funded research project for developing deep learning models for lung cancer detection. All datasets comply with their respective licenses and usage terms.

Academic and research use requires proper citation of the original dataset creators. The LC25000 dataset is available under the CC BY 4.0 license. The CPTAC WSI collections are available for public research use. The GCRI data is used under institutional ethical approval.

Research Ethics & Privacy

All medical imaging data used in this project complies with ethical guidelines, patient privacy regulations, and institutional review board (IRB) requirements. The GCRI local cohort data is collected under approved ethical clearance (EC/BHR/14/2024).

Citation for LC25000 Dataset

Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM. Lung and Colon Cancer Histopathological Image Dataset (LC25000). arXiv:1912.12142v1 (2019).

Need Access to Our Datasets?

For research collaboration or dataset access requests, please get in touch with our team. We welcome partnerships with academic institutions and healthcare organizations.
