Source: *Adapting transformer-based language models for heart disease detection and risk factors extraction* — fine-tuning hyperparameters for each evaluated model.
| Hyperparameter | BERT | RoBERTa | BioBERT | BioClinicalBERT | XLNet |
|---|---|---|---|---|---|
| Hidden size | 768 | 768 | 768 | 768 | 144 |
| Number of layers | 12 | 12 | 12 | 13 | 6 |
| Number of attention heads | 12 | 12 | 12 | 12 | 6 |
| Feed-forward layer hidden size | 128 | 128 | 128 | 128 | 128 |
| Learning rate | \(1\times 10^{-6}\) | \(5\times 10^{-7}\) | \(5\times 10^{-5}\) | \(5\times 10^{-6}\) | \(5\times 10^{-6}\) |
| Batch size | 16 | 16 | 16 | 16 | 16 |
| Dropout | 0.5 | 0.1 | 0.1 | 0.4 | 0.4 |
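As a minimal sketch, the table can be transcribed into a Python configuration dictionary for driving a fine-tuning script. The values below are exactly those reported above; the dictionary layout, key names, and `get_config` helper are illustrative assumptions, not part of the original paper's code.

```python
# Fine-tuning hyperparameters transcribed from the table above.
# Key names (hidden_size, layers, heads, ffn_size, lr, batch_size, dropout)
# are illustrative; values are as reported in the source.
HYPERPARAMS = {
    "BERT":            {"hidden_size": 768, "layers": 12, "heads": 12, "ffn_size": 128, "lr": 1e-6, "batch_size": 16, "dropout": 0.5},
    "RoBERTa":         {"hidden_size": 768, "layers": 12, "heads": 12, "ffn_size": 128, "lr": 5e-7, "batch_size": 16, "dropout": 0.1},
    "BioBERT":         {"hidden_size": 768, "layers": 12, "heads": 12, "ffn_size": 128, "lr": 5e-5, "batch_size": 16, "dropout": 0.1},
    "BioClinicalBERT": {"hidden_size": 768, "layers": 13, "heads": 12, "ffn_size": 128, "lr": 5e-6, "batch_size": 16, "dropout": 0.4},
    "XLNet":           {"hidden_size": 144, "layers": 6,  "heads": 6,  "ffn_size": 128, "lr": 5e-6, "batch_size": 16, "dropout": 0.4},
}

def get_config(model_name: str) -> dict:
    """Look up the reported fine-tuning configuration for a model by name."""
    return HYPERPARAMS[model_name]

print(get_config("BioBERT")["lr"])  # → 5e-05
```

Keeping the per-model settings in one dictionary makes it easy to sweep the five models in a loop while holding the shared settings (batch size 16) in one place.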