Some links to Stata and ML resources

### Conference Articles / Presentations / Stata Journal

- 2023 – Presentation_UK_Stata_Conf_2023 – A review of machine learning commands in Stata: Giovanni Cerulli
- 2022 – Applying Machine Learning Techniques in Stata to Predict Health Outcomes Using HIV-related Data (youtube.com) – use of LASSO in HIV setting using Stata
- 2021 – Cerulli_StataConf2021 : ML using Stata and Python
- 2019 – An Introduction to Machine Learning with Stata – Achim Ahrens
- 2024 – ddml – ddml: Double/debiased machine learning in Stata (sagepub.com)
- 2023 – pystacked – pystacked: Stacking generalization and machine learning in Stata (sagepub.com)
- 2020 – lassopack – lassopack: Model selection and prediction with regularized regression in Stata (sagepub.com)
- 2016 – svmachines – Support Vector Machines (sagepub.com)
- ELASTICREGRESS: Stata module to perform elastic net regression, lasso regression, ridge regression (repec.org)
- LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation (repec.org)
- PDSLASSO: Stata module for post-selection and post-regularization OLS or IV estimation and inference (repec.org)

### Stata Blog Articles from 2020

- The Stata Blog » Stata/Python integration part 1: Setting up Stata to use Python
- The Stata Blog » Stata/Python integration part 2: Three ways to use Python in Stata
- The Stata Blog » Stata/Python integration part 3: How to install Python packages
- The Stata Blog » Stata/Python integration part 4: How to use Python packages
- The Stata Blog » Stata/Python integration part 5: Three-dimensional surface plots of marginal predictions
- The Stata Blog » Stata/Python integration part 6: Working with APIs and JSON data
- The Stata Blog » Stata/Python integration part 7: Machine learning with support vector machines
- The Stata Blog » Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python
- The Stata Blog » Stata/Python integration part 9: Using the Stata Function Interface to copy data from Python to Stata

### Other Resources

- User’s corner: Machine learning | Stata News
- Giovanni Cerulli – Machine Learning in Stata (google.com)
- Towards better clinical prediction models: seven steps for development and an ABCD for validation – PMC (nih.gov)

### Seven Steps in developing a prediction model

**Problem definition and data inspection / research question**

- What is the precise research question?
- How were the patients selected?
- What is already known about the predictors?
- Define the predictors
- Were the predictors reliably and completely measured?
- Define the outcomes of interest

**Coding of predictors**

- Categorical predictors
- Continuous predictors

**Model specification**

- Selection of main effects?
- Assessment of assumptions?
- Overfitting?

**Model estimation**

- Estimate model parameters
- Shrinkage included?
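A minimal sketch of the shrinkage idea, in Python rather than Stata (variable names and data here are illustrative, not from the article): a ridge (L2) penalty pulls estimated coefficients toward zero, which guards against overfitting in small samples.

```python
# Sketch: shrinkage via a ridge (L2) penalty, using scikit-learn on
# synthetic data. All names are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# C is the inverse penalty strength: huge C ~ (almost) no shrinkage,
# small C ~ heavy shrinkage of the coefficients.
unpenalized = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)
shrunk = LogisticRegression(C=0.1, max_iter=2000).fit(X, y)

print(np.abs(unpenalized.coef_).sum(), np.abs(shrunk.coef_).sum())
```

In Stata itself, the lassopack and elasticregress commands listed above provide comparable penalized estimators.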

**Model performance**

- Calibration: calibration plot
  - **A: alpha** – Calibration-in-the-large – Intercept in plot; the agreement between observed endpoints and predictions
  - **B: beta** – Calibration slope – Regression slope in plot; related to shrinkage of regression coefficients
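One common recipe for alpha and beta, sketched in Python on synthetic data (the names are illustrative; calibration-in-the-large is sometimes instead estimated with the slope fixed at 1): regress the observed outcome on the logit of the predicted risks.

```python
# Sketch: calibration intercept (alpha) and slope (beta), estimated by
# logistic regression of the outcome on the logit of the predictions.
# Synthetic, well-calibrated data, so alpha ~ 0 and beta ~ 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, size=2000)      # predicted risks
y = rng.binomial(1, p_hat)                      # outcomes drawn from them

lp = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)  # linear predictor (logit)
fit = LogisticRegression(C=1e6, max_iter=1000).fit(lp, y)

alpha = fit.intercept_[0]  # calibration-in-the-large; ~0 when well calibrated
beta = fit.coef_[0][0]     # calibration slope; ~1 when no shrinkage is needed
print(round(alpha, 2), round(beta, 2))
```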

- Discrimination: the ability of the model to distinguish a patient with the endpoint from a patient without
  - Concordance (*c*-statistic) – ROC curve – For a binary endpoint, c is identical to the area under the receiver operating characteristic (ROC) curve
  - For a time-to-event endpoint, such as survival, the calculation of c may be affected by the amount of incomplete follow-up (censoring)
  - Probability of correct classification for a pair of subjects with and without the endpoint
  - A better discriminating model has more spread between the predictions than a poorly discriminating model
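The pairwise reading of the c-statistic can be checked directly, sketched here in Python on made-up predictions (all values illustrative): compute the fraction of case/non-case pairs in which the case gets the higher prediction, and compare it with the ROC AUC.

```python
# Sketch: c-statistic as pairwise concordance, compared with the ROC AUC.
# Synthetic binary outcomes and predicted risks (illustrative only).
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])

# Probability that a subject with the endpoint receives a higher
# prediction than a subject without (ties count as one half).
pairs = list(product(p[y == 1], p[y == 0]))
c = np.mean([1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs])

print(c, roc_auc_score(y, p))  # for a binary endpoint the two agree
```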

- Clinical usefulness
  - **D: Decision-curve analysis** – Net true-positive classification rate by using a model over a range of thresholds – Net benefit (NB)
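A small Python sketch of the net-benefit quantity behind a decision curve, on synthetic data (names and data are illustrative): true positives minus false positives, with false positives weighted by the odds of the decision threshold.

```python
# Sketch: net benefit NB(t) = TP/n - (FP/n) * t/(1-t), the quantity
# plotted against threshold t in decision-curve analysis.
# Synthetic, well-calibrated predictions (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 1000)   # predicted risks from some model
y = rng.binomial(1, p)        # outcomes

def net_benefit(y, p, t):
    treat = p >= t            # classify as "treat" above threshold t
    tp = np.sum(treat & (y == 1)) / len(y)
    fp = np.sum(treat & (y == 0)) / len(y)
    return tp - fp * t / (1 - t)  # false positives weighted by odds of t

for t in (0.1, 0.3, 0.5):
    print(f"t={t}: NB={net_benefit(y, p, t):.3f}")
```

Plotting NB(t) for the model against the "treat all" and "treat none" strategies gives the decision curve.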

**Model validation**

- Internal validity
- External validity
- Techniques used: split sample, cross-validation, etc.
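The two internal-validation techniques mentioned above can be sketched in Python with scikit-learn on synthetic data (all names illustrative): a split-sample hold-out, and k-fold cross-validation in which every observation serves once as test data.

```python
# Sketch: split-sample vs. 5-fold cross-validation for internal
# validation, on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split-sample: hold out 30% of the data for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("hold-out accuracy:", model.score(X_te, y_te))

# Cross-validation: every observation is used once for testing.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print("5-fold CV AUC:", scores.mean().round(3))
```

Cross-validation uses the data more efficiently than a single split, which matters for the modest sample sizes typical of clinical prediction studies.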

**Model presentation**