abcd_sex_pfn_replication

View the Project on GitHub ashleychari/abcd_sex_pfn_replication

ABCD Sex Differences Replication Project

This project aims to replicate the work done by Shanmugan et al (2022), Sex Differences in the functional topography of association networks in youth, in the ABCD dataset.

Code Documentation:

Part 1: Discovery and Replication sample generation

You’ll need to change the save filename/paths in the script

  1. Run the get_data_for_ridge.R script created by Arielle Keller using the data in the FilesForAdam folder originally from /cbica/projects/abcdfnets/scripts/FilesForAdam/

  2. Next, run the create_discovery_replication_set_siblings_removed.py script to create the final discovery and replication sets with the removal of siblings. These samples are used in subsequent steps.

Part 2: Atlas Generation and Network Parcellation

Will need to change paths to discovery and replication datasets and the paths that you want to save the resulting files

  1. Run the calc_group_average_mat.py script to get the group average matrix for the networks for all of the subjects in the discovery and replication sets combined

  2. Run the create_hard_parcel.py script to get the hard parcellations from networks 3, 4, and 12 to make main figure 1

  3. Run the create_soft_parcel.py script to get the soft parcellations from networks 3, 4, and 12 to make main figure 1

  4. Run the get_subject_parcels.py script to get the hard and soft parcellations for 4 random subjects with 2 subjects being male and 2 being female to make main figure 1

Part 3: Univariate Analysis

  1. Use convert_mat.py to convert the final_UV.mat files into csvs

NOTE: This step can be skipped if you change the script to use readMat from the R.Matlab package instead of base R’s read.csv function. However, you will have to setup a new singularity container with the necessary packages

  1. Create a singularity container or use the one I created stored in ashpfnsexdiffabcd/software/containers/ called sex_differences_replication_0.0.3.sif

  2. Change the filepaths for the results folder and subject data in the abcd_unvariate_analysis.R script for the discovery data, which will run a GAM at each vertex to determine the effect of sex while controlling for age and motion. The script assumes that the subject data has siblings removed. Next, run the wrapper script run_univariate_analysis.sh via SGE using the following command:

  qsub -l h_vmem=25G,s_vmem=25G /cbica/projects/ash_pfn_sex_diff_abcd/dropbox/run_univariate_analysis.sh
  1. Similarly, do the same thing but for the replication data and use abcd_unvariate_analysis_replication.R script and then run the wrapper script run_univariate_replication.sh via SGE using the following command:
  qsub -l h_vmem=25G,s_vmem=25G /cbica/projects/ash_pfn_sex_diff_abcd/dropbox/run_univariate_replication.sh
  1. Once the gams models have been run, create the unthresholded absolute sum map in Figure 2D from the results using the write_effect_map_abs_sum.py script. You’ll need to change the arguments in the two function calls in the main execution block of the script.

  2. Convert the file obtained from the previous step into a CIFTI format to be loaded into workbench to create the brain maps seen in Figure 2 by using the write_effect_map_to_cifti.py.

NOTE: The script uses a dscalar.nii file of the fslr format as a temporary header to make the new CIFTI file. It is only used to structure the file.

The script can be run like the following example command:

  python3 write_effect_map_to_cifti.py '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/multivariate_analysis/svm_072324_run/abs_sum_weight_brain_mat_discovery_haufe_100_runs_072324.npy' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/multivariate_analysis/svm_072324_run/svm_discovery_abs_sum_haufe_weights_072324.dscalar.nii'
  1. Load the resulting dscalar.nii file from the previous step into workbench to get the visualization.

  2. To get the significant veritces barplot, first run the make_barplot_mat.py script to get a matrix of the significant vertices for each network.

  3. Use the create_barplot.R to create the barplot seen in Figure 2A (discovery set) and Figure 2B (replication set) of significant vertices for each network. I ran this in Rstudio.

  4. To create the significant vertices for networks 3, 9, and 12 seen in Figure 2C, use the get_individual_network_mat.py script to get each individual network from the matrix in the previous step. Also used to make the 17 networks seen in Figure S1.

Part 4: Multivariate Analysis

  1. Run create_nonzero_matrix.py using the discovery sample csv and replication sample csv created in Part 1, step 2 to setup the non-zero features matrices for subsequent steps.

  2. Next, run run_2fold_svm_resmulti_times_parallel_w_ROC.py by running the wrapper script svm_resmulti_times_parallel_w_ROC_slurm.sh. The wrapper script submits an array job of size 100 which will run the run_2fold script 100 times (each run as a separate job) via slurm. The svm script runs 2 fold cross validation using the nonzero features matrix to predict sex using SVM. Run the following command to run the script:

  sbatch svm_resmulti_times_parallel_w_ROC_slurm.sh
  1. Create the weight matrix by taking the mean of each of the coefficients from all the fold instances and save this weight matrix using the svm_workbench_visualization_multi_times_matrix.py script.

  2. Use the coefficient matrix created in the previous step in the haufe_transform_weights.py script to do a haufe transformation on the coefficients. Use the haufe_transform_weights.sh to run the script by running the following command:

  qsub -l h_vmem=500G,s_vmem=500G haufe_transform_weights.sh

or if you want to run via slurm, use the haufe_transform_weights_slurm.sh script with the following command after running module load slurm in your terminal window:

  sbatch haufe_transform_weights_slurm.sh
  1. Use the haufe transformed weight matrix from the previous step in the svm_stacked_barplot.R script to get the stacked barplot of feature importance for both Males and Females as seen in Figure 3C. I ran this in Rstudio

  2. Use the haufe transformed weight matrix in the sum_abs_weights_matrix.py to get the absolute value sum of the weights and save this array.

  3. Use write_effect_map_to_cifti_svm.py to convert the matrix in the previous step into a CIFTI file format for visualization/to get Figure 3B.

  4. Use the dscalar.nii file in workbench to get the visualization of the weights.

  5. In order to run the permutation tests, run the svm_permutation_parallel.sh which calls the svm_1000_permutation_scrambled_outcome_parallel.py script which scrambles the sex outcome variable and creates 1000 null models to compare to the accuracy of the 100 models. Use the following command to run the script:

  sbatch svm_permutation_parallel.sh
  1. To generate the permutation inset histogram as seen in Figure 3A, use the get_accuracies.py script to compile the accuracies from the 1000 runs followed by the svm_accuracy_histogram.R to make the histogram plots.

  2. To generate the ROC curve as seen in Figure 3A, run the create_ROC_curve.py with the results from the 100 runs and use the plot_roc.R to make Figure 3A.

  3. To get the support vector machine stats such as average accuracy, average sensitivity, and average specificity use the get_svm_metrics.py script.

Part 5: Genetics

The files generated from the steps below will be used in the enrichement analyses.

  1. Download files from https://github.com/PennLINC/S-A_ArchetypalAxis to be used in the flsr_to_fsaverage5.sh script

  2. Get the FDR corrected absolute sum matrices by using write_effect_map_abs_sum_fdr.py to create the fdr absolute sum matrices that are used for the genetics portion of this project.

  3. Use the fslr_to_fsaverage5.sh script to resample the gams uncorrected abs sum discovery data, gams uncorrected abs sum replication data, gams fdr abs sum discovery data, and gams fdr abs sum replication data from fslr to fsaverage5. Parameters include path to the map (dscalar.nii file) you’re trying to resample to fsaverage5 space and the name that you want to label the resulting files. The file that you’ll want to use in the enrichment analyses will end with _LH.fsaverage5.func.gii.

NOTE: If you are using windows, best to install Git Bash and put an sh at the start of the command to run the script.

Example command for mac users below:

  ./fslr_to_fsaverage5.sh "/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/univariate_analysis/univariate_analysis_results/uncorrected_abs_sum_matrix_discovery.dscalar.nii" gams_uncorrected_discovery
  1. Download Download AHBAProcessed.zip and AHBAData.zip from https://figshare.com/articles/dataset/AHBAdata/6852911

  2. Read annotation file AHBAData/data/genes/parcellations/lh.Schaefer1000_7net.annot from figshare into R and use read_gene_annot_files.R to save the following files

    save column 5 from ct.table as lh_Schaefer1000_7net_cttable.csv

    save L as lh_Schaefer1000_7net_L.csv

  3. Save probe annotation file using ROIxGene_Schaefer1000_INT.mat and create_probe_annotation.py

    Save the filename as GeneSymbol.csv

  4. Grab the rest of the files needed for the scripts below from /cbica/projects/ash_pfn_sex_diff_abcd/dropbox/genetics_files/ which originally came from the abcdpfnsexdiff project folder on cubic

NOTE: If you choose not to use the files in the above folder, you might have to change the filepaths and the column names corresponding to the file when you load the file in.

Chromosomal enrichment analysis

  1. Run the ABCD_wSex_cor_gene_schaefer403_net7_discovery_20242108.R script to get the chromosomal enrichments found in Figure 4A, recommend running this in Rstudio

Make sure to change the filepaths to use the files created above

Also make sure to change the filepaths for saving the intermediate files such as pvalues table and top 20 X chromosome genes

Cell Type enrichment analysis

  1. Run the cell_types_LAKE.R script to get the cell type enrichments found in Figure 4B, recommend running this in Rstudio

Make sure to change the filepaths to use the files created above

Also make sure to change the filepaths for saving the intermediate files such as pvalues table and top 20 tables for astro, ex5b, ex1, and oli

PSP enrichment analysis

  1. Run the PSP_plots.R script to get the PSP enrichments, recommend running this in Rstudio

Make sure to change the filepaths to use the files created above and for the pvalue table

Part 6: Spin tests

  1. Add medial wall back in to all of the data (gams uncorrected discovery abs sum, gams uncorrected replication abs sum, svm discovery abs sum haufe transformed, svm replication abs sum haufe transformed) using add_medial_wall.R. You can find the medialwall.mask.leftcortex.csv and medialwall.mask.rightcortex.csv here https://github.com/PennLINC/S-A_ArchetypalAxis/tree/main/FSLRVertex. Save results into csv that will later be converted into gii files.

  2. Use the convert_to_gifti.py script to convert the data in the previous step into gii files. The following command can be run in terminal (the last argument indicated whether the data is from PNC or not):

  python3 convert_to_gifti.py "medial_wall_maps/gams_discovery_medial_wall_map.csv" "results_gifti/gams_abs_sum_discovery_map.gii" N
  1. Run the spin_test.py script using the converted gii files. The script can be run using the following example command:
  python3 spin_test.py '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/results_gifti/gams_abs_sum_discovery_map.gii' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/results_gifti/gams_abs_sum_replication_map.gii' "fsLR" 'Gams discovery vs replication' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/results'
  1. Once all spin tests have been completed, you can compile all of the results from the different tests as long as they are in the same folder by using the compile_spin_tests.py. The script can be run by calling the name of the script followed by the folder that the spin test results are stored in (might need to be the full path if the script is not in the same directory as the results):
  python3 compile_spin_tests.py spin_test_results_072524

The following instructions apply to the PNC data only

  1. Grab the PNC gams abs sum data from /cbica/projects/abcdpfnsexdiff/funcParcelSexDiff/inputData/spintest/

  2. Convert the csv files of the gams abs sum LH and RH results into gii files using convert_to_gifti.py

  python3 convert_to_gifti.py '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/PNC_data/GamSexAbssum_lh.csv' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/PNC_data/gams_abs_sum_lh.gii' Y
  1. Use the PNC_to_fslr.py to convert the PNC results to flsr space

  2. Run spin_test.py using the following command:

  python3 spin_test.py '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/PNC_data/Gam_abs_sum_fslr_test.gii' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/data/ABCD_data/gams_abs_sum_discovery_uncorrected.gii' "fsLR" 'PNC gams discovery vs ABCD gams discovery fslr' '/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/spin_tests/results'

Part 7: Other plots and figures

  1. To create the hex plots seen in figures 2E, 2F and 3D, use the hex_plots.R absolute sum matrices as arguments along with the data labels like the following command:
  Rscript hex_plots.R "/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/multivariate_analysis/svm_072324_run/abs_sum_weight_brain_mat_replication_haufe_100_runs_072324.npy" "Replication SVM Weights" "/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/univariate_analysis/univariate_analysis_results/uncorrected_abs_sum_matrix_replication.csv" "Replication GAM Loadings" "/Users/ashfrana/Desktop/code/abcd_sex_pfn_replication/finalized_figs/hex_plots/svm_rep_vs_gams_rep_hex_plot.png"
  1. To create the Figure S8 plot, use the data in the intermediate step of the significant vertices or svm weights barplots and run the group_networks.py to group by netColor (or network group) and then use this data as input into figure_s8_barplot.R.

NOTE: The figure_s8_barplot.R script does both svm and gams plotting, but requires users to comment out the code that is not being used (ie if plotting svm, comment out gams code). There are comments that indicate which code belongs to gams vs svm.

Part 8: Network specific models

  1. Create the network specific nonzero matrices using create_network_specific_matrices.py by running the following commands:

NOTE: Make sure that you have a folder called network_specific_matrices in your results folder to hold all of the results from the above script

  1. Use submit_network_specific_models_jobs.py to submit 17 array jobs (one for each network) of 100 runs each by calling network_specific_models_slurm.sh which calls run_network_specific_svm.py. You’ll have to change the path to your results folder in both of the python scripts.

  2. Once you have all of the results from step 2, run get_network_specific_accuracies.py to get the accuracies of the 100 runs for each of the 17 networks.

  3. Lastly, use plot_accuracies.R to make a barplot of the accuracies for all of the 17 networks.

Part 9: Pubertal Analyses

  1. Run make_behavior_dfs.py to make the variation of dataframes that are subsampled from the original siblings removed sample and will be used in the puberty analyses. Use data from \cbica\projects\ash_pfn_sex_diff_abcd\dropbox\hormone_pds_data and the discovery and replication samples created at the beginning of the project.

  2. Run submit_scripts.py, which calls a corresponding shell script found in the puberty_submit_scripts folder that calls either abcd_puberty_hormones.R, abcd_puberty_stage_timing_gams.R, and abcd_puberty_stage_timing_gams_sex_specific.R depending on which dataset is used. NOTE: Make sure all of the results directories are created before running this script. (ie pds_male_female_no_age is a directory in the puberty_analyses_2/discovery directory); Also make sure to make your result directory and change the path in the gams R scripts

  3. Run submit_scripts_oSex.py to run the analyses to look at the effect of sex on loadings controlled for by hormones and pds variables. This script calls abcd_puberty_hormones_oSex.R and abcd_puberty_stage_timing_gams_oSex.R depending on which dataset is being used. NOTE: Make sure all of the results directories are created before running this script. (ie pds_male_female_oSex_no_age is a directory in the puberty_analyses_2/discovery directory); Also make sure to make your result directory and change the path in the gams R scripts

  4. To create the absolute sum brain maps, use write_effect_map_abs_sum.py to get the absolute sum matrices. Once you have these, use the cifti_wrapper.py script to convert these matrices to a dscalar.nii format which can then be plotted in workbench.

  5. To create the barplots of the sex effect controlled for hormones/pds variables, use make_barplot_mat.py to make the matrix for the barplot script and then run create_barplot_wrapper.py, which calls create_barplots.R. Make sure you have a barplots directory and matrices directory created to store all the intermediate values and final results.

  6. To get the individual network maps, use make_individual_mats.py to get the fdr z vector for each of the 17 networks for each of the analyses done. Afterwards, run to_cifti_wrapper.py that will convert the significant matrices to dscalar.nii format. To identify which networks are not all zero/are significant, use identify_sig_networks.py, which will create a sig_networks.txt file which contains an array of the significant networks. Using this information, you can then plot the specified networks in workbench for each of the tests.

Part 10: Demographic Table

  1. Use the samples created at the beginning of the project for the demgraphic table. Also use the hormone data and pds data found in \cbica\projects\ash_pfn_sex_diff_abcd\dropbox\hormone_pds_data. Use demographics.ipynb to create the demographic table for the study. NOTE: make sure to run the cells in order and if you need to rerun a cell, rerun from the start.