This repository contains code for:
- retrieving gene-expression data from the AllenSDK; and
- processing it into nice structures for further analysis in Matlab.
Requires Matlab and Python. The AllenSDK package for Python must be installed.
If anything is unclear or needs improvement, please send questions by raising an Issue or sending me an email.
This pipeline is based on code developed for Fulcher and Fornito, PNAS (2016), and used for Fulcher et al., PNAS (2019). If you find this code useful, consider citing these papers if relevant to your work, or you can cite this code directly using its DOI.
You first need to get a full list of genes by running AllGenes.py.
This outputs general information about the genes:
- sectionDatasetInfo.csv (all section data)
- geneInfo.csv (gene information: acronym, entrez_id, gene_id, name)
- geneEntrezID.csv (just the list of EntrezIDs)
Retrieve all structure IDs of interest directly by adapting WriteStructureInfo.py to retrieve a custom set of structures.
If you already have structure IDs in Matlab, you can alternatively do this step using WriteStructureIDs.m, which writes structIDs_Oh.csv and structInfo_Oh.csv.
Save a list of gene entrez IDs for the genes you're interested in.
For all genes, you can use the geneEntrezID.csv file produced from AllGenes.py above.
For a subset of genes, you can adapt something like subsetGenes.py.
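As a rough illustration of the subset case, the sketch below filters geneInfo.csv by gene acronym and writes the matching entrez IDs to a new file. It is not the contents of subsetGenes.py: the function name write_entrez_subset is hypothetical, and it assumes that geneInfo.csv has a header row with acronym and entrez_id columns and that the ID file should hold one ID per line with no header. Check both assumptions against what RetrieveGene.py expects before relying on it.

```python
import csv


def write_entrez_subset(gene_info_csv, acronyms, out_csv):
    """Write the entrez IDs of the requested gene acronyms to out_csv.

    Assumes (not confirmed by the repository) that gene_info_csv has a
    header row containing 'acronym' and 'entrez_id' columns.
    """
    wanted = set(acronyms)
    entrez_ids = []
    with open(gene_info_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Skip genes that are not requested or have no entrez ID recorded
            if row["acronym"] in wanted and row["entrez_id"]:
                entrez_ids.append(row["entrez_id"])

    # Assumed output layout: one entrez ID per line, no header
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for entrez_id in entrez_ids:
            writer.writerow([entrez_id])
    return entrez_ids
```

The returned list makes it easy to confirm how many of the requested acronyms were actually found before running the queries.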
Now that you've defined the structures and genes you're interested in, you can run the queries to get expression data for all combinations of brain regions and genes.
This is done using RetrieveGene.py.
Note that several variables need to be set in RetrieveGene.py.
First, the input files need to match the structure IDs and gene entrez IDs saved in the two steps above.
Input files
- structIDSource: name of the .csv file of Allen structure IDs
- entrezSource: name of the .csv file of gene entrez IDs to retrieve
Output filenames
To set:
- structInfoFilename: saves retrieved information for the structure IDs specified.
- allDataFilename: saves detailed expression information out to this file.
Generated:
- expression_energy_AxB: expression energy values for the A structures and B section datasets
- expression_density_AxB: expression density values for the A structures and B section datasets
- dataSetIDs_Columns.csv: dataset IDs representing each column in the above matrices
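Before importing into Matlab, it can help to confirm that the generated matrices have the expected A x B shape. The sketch below is one way to do that, not part of the repository: the function names are hypothetical, and it assumes the matrix files are comma-delimited with no header row, that empty cells mean missing values, and that the structure ID and dataset ID files contain only IDs (no header).

```python
import csv


def read_matrix(path):
    """Read a numeric .csv matrix as a list of rows of floats.

    Empty cells are treated as missing values (NaN); this is an assumption.
    """
    with open(path, newline="") as f:
        return [
            [float(x) if x.strip() else float("nan") for x in row]
            for row in csv.reader(f)
            if row  # skip blank lines
        ]


def count_ids(path):
    """Count non-empty cells, so one-ID-per-line and single-row files both work."""
    with open(path, newline="") as f:
        return sum(1 for row in csv.reader(f) for cell in row if cell.strip())


def check_expression_outputs(matrix_csv, struct_ids_csv, dataset_ids_csv):
    """Return (A, B) if the matrix is A structures by B section datasets."""
    matrix = read_matrix(matrix_csv)
    n_rows = len(matrix)
    row_lengths = {len(row) for row in matrix}
    if len(row_lengths) > 1:
        raise ValueError("matrix rows have differing lengths")
    n_cols = row_lengths.pop() if row_lengths else 0

    n_structs = count_ids(struct_ids_csv)
    n_datasets = count_ids(dataset_ids_csv)
    if n_rows != n_structs:
        raise ValueError(f"{n_rows} rows but {n_structs} structure IDs")
    if n_cols != n_datasets:
        raise ValueError(f"{n_cols} columns but {n_datasets} dataset IDs")
    return n_rows, n_cols
```

The same check can be run on both the expression energy and expression density files, since both are described as A structures by B section datasets.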
Then you can import the resulting data into Matlab as:
[GeneExpData,sectionDatasetInfo,geneInfo,structInfo] = ImportAllenToMatlab();
In this function, you must specify the filenames to read in:
- fileNames.struct: the structure info file specified above (structInfoFilename)
- fileNames.sectionDatasets: full information about all datasets retrieved (allDataFilename)
- fileNames.geneInfo:
- fileNames.energy:
- fileNames.density:
- fileNames.columns:
This outputs a processed .mat file, AllenGeneDataset_X.mat, containing information about X unique genes.
Example pipeline:
First, generate .csv files for the structure IDs and the matching structure info (for interpretation).
E.g., for the Oh et al. 213-region parcellation:
WriteStructureIDs
This generates structIDs_Oh.csv and structInfo_Oh.csv.
In the Python file MakeCCFMasks, these files are listed as inputs, so that MakeCCFMasks generates a mask for these structures, saved as mask_Oh.h5.