Modern text-to-speech (TTS) systems produce synthetic speech that is almost indistinguishable from real speech. At the same time, automatic speech recognition (ASR) systems have become quite good at "understanding" spoken language.
Speech enhancement can profit from both fields by combining components of each: a self-supervised learning model such as wav2vec 2.0 (Baevski et al., 2020) can extract low-dimensional embeddings from spoken language, which can then be fed to the second stage of a TTS system, the vocoder, e.g. HiFi-GAN (Kong et al., 2020), to synthesize the speech waveform.
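To make the interface between the two components more concrete, the following minimal sketch works out the frame arithmetic only; it does not load or run either model. The kernel sizes and strides are those of the wav2vec 2.0 convolutional feature encoder as published. The function names and the vocoder upsampling rates `(8, 5, 4, 2)` are hypothetical, chosen purely so that one embedding frame maps back to 320 output samples; they are not taken from any of the cited papers.

```python
import math

# Kernel sizes and strides of the wav2vec 2.0 convolutional feature encoder
# (Baevski et al., 2020). Input is assumed to be 16 kHz mono audio.
W2V2_KERNELS = (10, 3, 3, 3, 3, 2, 2)
W2V2_STRIDES = (5, 2, 2, 2, 2, 2, 2)

# ASSUMPTION: illustrative vocoder upsampling rates whose product matches the
# encoder stride (320). The published HiFi-GAN V1 configuration uses
# (8, 8, 2, 2), i.e. a factor of 256, because it is driven by mel frames.
HYPOTHETICAL_UPSAMPLE_RATES = (8, 5, 4, 2)


def num_frames(num_samples: int) -> int:
    """Number of embedding frames emitted for an unpadded waveform."""
    length = num_samples
    for kernel, stride in zip(W2V2_KERNELS, W2V2_STRIDES):
        if length < kernel:
            return 0  # input too short for this layer
        length = (length - kernel) // stride + 1
    return length


def encoder_stride() -> int:
    """Hop between consecutive embedding frames, in input samples."""
    return math.prod(W2V2_STRIDES)


def vocoder_output_samples(frames: int, upsample_rates=HYPOTHETICAL_UPSAMPLE_RATES) -> int:
    """Waveform length produced when each frame is upsampled by the given rates."""
    return frames * math.prod(upsample_rates)


if __name__ == "__main__":
    frames = num_frames(16000)  # one second of 16 kHz audio
    print(frames, encoder_stride(), vocoder_output_samples(frames))
```

Under these assumptions, one second of audio yields 49 frames (one every 20 ms), and the vocoder returns slightly fewer samples than went in, so a real pipeline would have to pad or trim to align the enhanced output with the input.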
If you are interested in researching this approach to speech enhancement (Irvin et al., 2022), please get in touch.
Irvin, Bryce, et al. "Self-Supervised Learning for Speech Enhancement through Synthesis." arXiv preprint arXiv:2211.02542 (2022).
Baevski, Alexei, et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." Advances in Neural Information Processing Systems 33 (2020): 12449-12460.
Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.