MapMyCells: Input & Output Files

Learn more about the input file requirements for MapMyCells, how to generate input files, and how to interpret output files.

Input files

MapMyCells can map any matrix in which rows are "cells" and columns are genes. It will attempt to run on anything that “looks like” cell-by-gene data. It works well with single-cell sequencing data and spatial transcriptomics data, but will probably not work well with bulk sequencing data. In some situations it will also work on single-cell epigenetics data that has been summarized to gene-level metrics, but review the results carefully. While we strongly encourage providing matrices in which rows are "cells" and columns are genes, MapMyCells can transpose the data if needed.

MapMyCells has a file size limit of 2 GB. Code for compressing data and for splitting a data set into multiple input files is included below; alternatively, use the code version, which does not have a size restriction.

MapMyCells can accept h5ad files or csv files as input. H5ad files are produced by AnnData, a widely used tool for creating, manipulating, and saving large data matrices, such as for expression data. If your cell-by-gene data is in csv format you can either directly upload to MapMyCells or follow the R or Python guides below to convert to h5ad. Any additional data in the h5ad file will be ignored for mapping, but may be useful for downstream analyses.

Creating csv input files

Most programming languages and -omics software applications provide standard methods for outputting matrices to csv files. For example, the fwrite() function in the data.table library and the savetxt() function in the NumPy library will efficiently output numeric matrices to csv in R and Python, respectively. To decrease file size, gzip compression is encouraged (leading to csv.gz file extensions). This can be done directly as files are written out, or afterwards through third-party applications (like 7-Zip on Windows).

As an example, a well-formatted csv file should look something like this:

,ENSMUSG00001,ENSMUSG00002,ENSMUSG00003,...
cell0,0,1,2,... 
cell1,1,0,0,... 
cell2,3,0,1,...
...
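
The layout above can be produced with Python's standard library alone. The following is a minimal sketch, not an official MapMyCells utility: the function name write_cell_by_gene_csv and the toy gene and cell identifiers are made up for illustration, and the matrix is assumed to be small enough to hold in memory as a list of rows.

```python
import csv
import gzip


def write_cell_by_gene_csv(path, gene_ids, cell_ids, counts):
    """Write a cell-by-gene matrix as a gzip-compressed CSV.

    counts[i][j] is the value for cell_ids[i] and gene_ids[j].
    (Hypothetical helper for illustration only.)
    """
    if len(counts) != len(cell_ids):
        raise ValueError("counts needs exactly one row per cell")
    # "wt" opens the gzip stream in text mode; newline="" lets the csv
    # module control line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Header row: an empty first field, then one gene identifier per column.
        writer.writerow([""] + list(gene_ids))
        for cell_id, row in zip(cell_ids, counts):
            if len(row) != len(gene_ids):
                raise ValueError(f"row for {cell_id} has the wrong number of genes")
            # Data row: the cell identifier, then one value per gene.
            writer.writerow([cell_id] + list(row))


# Toy data matching the example layout shown above.
genes = ["ENSMUSG00001", "ENSMUSG00002", "ENSMUSG00003"]
cells = ["cell0", "cell1", "cell2"]
matrix = [[0, 1, 2], [1, 0, 0], [3, 0, 1]]
```

Calling write_cell_by_gene_csv("example_input.csv.gz", genes, cells, matrix) writes the three toy cells to a compressed file in the current directory. For large matrices, the fwrite() or savetxt() routes mentioned earlier are likely to be faster.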

Creating h5ad input files in R & Python

Using Python

We provide scripts in R and Python for converting files from standard file formats (csv, hdf5, h5ad) into compressed h5ad files ready for upload to MapMyCells. The scripts are broken up into two sections:

  1. Input: how to read your data into R/Python and store it in an AnnData object.
  2. Output: how to write that object to a compressed h5ad file, check the size, and then split the output file into multiple files for upload to MapMyCells if the size exceeds 2 GB.

These scripts require access to R or Python through user-provided means (e.g., installed on a local computer or accessed via cloud computing). R can be installed from CRAN, while Python can be installed through Miniconda. Instructions for how to run the Python script on Google Colaboratory are in development.
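
The input and output steps listed above can be sketched in Python roughly as follows. This is not the official conversion script, only a minimal illustration that assumes the anndata and pandas packages are installed and that the csv file has cells as rows with identifiers in the first column. The function names (row_chunks, csv_to_h5ad, split_if_too_large), the _partN file naming, and the reading of 2 GB as 2 × 1000³ bytes are all assumptions made for this example.

```python
import math
import os

# Assumption: the 2 GB limit is read conservatively as 2 * 1000**3 bytes.
SIZE_LIMIT_BYTES = 2 * 1000**3


def row_chunks(n_rows, n_chunks):
    """Return (start, stop) row ranges covering n_rows in at most n_chunks pieces."""
    step = max(1, math.ceil(n_rows / n_chunks))
    return [(start, min(start + step, n_rows)) for start in range(0, n_rows, step)]


def csv_to_h5ad(csv_path, h5ad_path):
    """Input step: read a cell-by-gene csv and save it as a gzip-compressed h5ad."""
    # Third-party packages are imported here so that row_chunks can be
    # used even where anndata and pandas are not installed.
    import anndata
    import pandas

    table = pandas.read_csv(csv_path, index_col=0)  # rows = cells, columns = genes
    adata = anndata.AnnData(
        X=table.to_numpy(dtype="float32"),
        obs=pandas.DataFrame(index=table.index.astype(str)),    # cell identifiers
        var=pandas.DataFrame(index=table.columns.astype(str)),  # gene identifiers
    )
    adata.write_h5ad(h5ad_path, compression="gzip")
    return adata


def split_if_too_large(adata, h5ad_path, limit=SIZE_LIMIT_BYTES):
    """Output step: if the h5ad file exceeds the limit, write it out in row-wise parts."""
    size = os.path.getsize(h5ad_path)
    if size <= limit:
        return [h5ad_path]
    # One extra chunk as a safety margin, since compressed size is not
    # exactly proportional to the number of cells.
    n_chunks = math.ceil(size / limit) + 1
    stem, ext = os.path.splitext(h5ad_path)
    paths = []
    for i, (start, stop) in enumerate(row_chunks(adata.n_obs, n_chunks)):
        part_path = f"{stem}_part{i}{ext}"
        # .copy() turns the sliced view into a standalone object before writing.
        adata[start:stop].copy().write_h5ad(part_path, compression="gzip")
        paths.append(part_path)
    return paths
```

A typical call would be adata = csv_to_h5ad("counts.csv", "counts.h5ad") followed by split_if_too_large(adata, "counts.h5ad"), where the file names are placeholders. Because the split is only an estimate based on file size, it is worth checking the size of each part before upload.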

Output files

MapMyCells produces two main output files: a “standard” CSV output file and an “extended” JSON output file. These are archived, together with a log file and a summary file, into a single .zip file for download. Most modern operating systems natively support unpacking zip files, usually via a right-click + "Extract all" command. The archive can contain the following files:

  • validation_log.txt: Log of messages produced by the job. Returned even for failed jobs. Useful for debugging. If the mapping failed, this is probably the file you want.
  • my_job.csv: Returned by all algorithms. CSV table of mapping results. If the mapping worked, this is probably the file you want.
  • my_job.json: Only returned by Hierarchical and Flat mapping. More detailed results and metadata stored in a JSON file.
  • my_job_summary_metadata.json: JSON file recording number of cells mapped to cell types and number of genes mapped to Ensembl IDs.

To extract the individual files from the command line, run

 tar -xf path/to/downloaded/file.zip

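at which point the constituent CSV and JSON files should appear in your current working directory. This relies on a tar implementation that can read zip archives, such as the bsdtar shipped with macOS and recent versions of Windows. GNU tar, the default on most Linux distributions, cannot read zip archives, so use unzip there instead.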

Alternatively, run

 tar -xvf my_tar_file.zip --directory my_directory

to unpack the files to an existing directory of your choice, e.g. my_directory.
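
The same unpacking can be done from Python with the standard zipfile module, which behaves the same on every operating system. This is a small sketch; the function name extract_results is hypothetical and the paths in the usage note are placeholders.

```python
import zipfile
from pathlib import Path


def extract_results(zip_path, out_dir):
    """Unpack a downloaded results archive and return the names of its files."""
    out_dir = Path(out_dir)
    # Unlike the tar command above, create the target directory if it is missing.
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(out_dir)
    return names
```

For example, extract_results("path/to/downloaded/file.zip", "my_directory") unpacks the archive into my_directory and returns the list of file names it contained.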

The contents of these files are documented in detail here: https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md

At a high level, apart from a few lines of metadata prefixed with a ‘#’, the CSV file is meant to be read into a dataframe, as in (for Python)

 import pandas
 data_frame = pandas.read_csv('path/to/output.csv', comment='#')

or opened as an Excel spreadsheet. The JSON file is the serialized representation of a dict with more detailed results, for those comfortable with deserializing JSON. The JSON file is also where the metadata associated with the mapping run is saved.
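
As a rough illustration of working with both files using only the Python standard library, the sketch below collects the ‘#’-prefixed metadata lines that a dataframe reader skips and loads the JSON file into a dict. The function names are hypothetical, and nothing is assumed here about which keys the JSON contains; see the linked documentation for the actual structure.

```python
import json


def read_csv_metadata_lines(csv_path):
    """Return the '#'-prefixed metadata lines of the CSV output, without the '#'."""
    with open(csv_path, encoding="utf-8") as handle:
        return [line[1:].strip() for line in handle if line.startswith("#")]


def load_extended_results(json_path):
    """Deserialize the extended JSON output into a Python dict."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

For instance, sorted(load_extended_results("my_job.json")) lists the top-level keys of the extended output, which is a reasonable first step before digging into the detailed results.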