  • Department of Health & Social Care

Genome UK: 2022 to 2025 implementation plan for England

Published 13 December 2022

Applies to England

© Crown copyright 2022

This publication is licensed under the terms of the Open Government Licence v3.0 except where otherwise stated. To view this licence, visit nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: [email protected] .

Where we have identified any third party copyright information you will need to obtain permission from the copyright holders concerned.

This publication is available at https://www.gov.uk/government/publications/genome-uk-2022-to-2025-implementation-plan-for-england/genome-uk-2022-to-2025-implementation-plan-for-england

Ministerial foreword

Our country leads the world in genomic healthcare and research. This has never been more evident than in the last 2 years, when we were at the forefront of monitoring and tracking variants of the COVID-19 virus. The NHS in England is the first healthcare service in the world to offer whole genome sequencing (WGS) as a part of routine care, and we hold unique research resources such as UK Biobank.

In 2020 we published our Genome UK: the future of healthcare strategy, which outlines how we will become the most advanced genomic healthcare system in the world. Earlier this year, we published our shared commitments for UK-wide implementation 2022 to 2025, setting out an ambitious plan to ensure genomics research and healthcare can flourish across the whole of the UK.

We are now publishing an implementation plan for England, which builds on the shared commitments and outlines the actions we will take in England over the next 3 years. This plan will take us to the halfway mark of the original strategy in 2025, which means that making strong progress in these 3 years will be pivotal to delivering our ambitions.

Our delivery partners will take forward a world-leading programme of innovation in genomics diagnostics and clinical services, evaluation of new genomic tools in prevention and early detection of disease, and cutting-edge genomics research, all enabled by new, large-scale data capabilities.

The delivery of genomic healthcare for patients through the NHS is continuing to develop at pace too. Genomics England and NHS England will work together to bring innovative genomic technologies closer to the patient, with the evaluation of new sequencing technology to improve the accuracy and speed of diagnosis for cancer patients. NHS England will continue the annual review of the national genomic test directory so that more patients are eligible for genomic testing and, through the NHS Genomic Medicine Service (GMS) Alliance transformation projects, will test new diagnostic technologies to ensure that the most innovative and effective genomic technologies are available for use in patient care. Taken together, these actions will deliver real improvements in patient care and health outcomes.

To start this important next phase of our implementation work, the government is announcing the following investments:

  • £105 million for a landmark research programme, led by Genomics England in partnership with the NHS, to study the effectiveness of using WGS to speed up diagnosis and treatment of rare genetic diseases in newborns, potentially leading to life-saving interventions for thousands of babies
  • £22 million for Genomics England to tackle health inequalities in genomic medicine through tailored sequencing of 15,000 to 25,000 participants from diverse backgrounds by 2025, as well as extensive community engagement work to build trusted relationships with traditionally excluded groups of people
  • £26 million for an innovative cancer programme, led by Genomics England in partnership with NHS England and the National Pathology Imaging Co-operative, to evaluate cutting-edge genomic sequencing technology and use artificial intelligence to analyse genomic data alongside digital histopathology and radiology images, improving the accuracy and speed of diagnosis for cancer patients
  • up to £25 million Medical Research Council-led funding for a 4-year functional genomics initiative, working across UK Research and Innovation (UKRI) and other stakeholders to establish an industry-partnered world-class offer on functional genomics, building on already existing infrastructure and UK research expertise

These initiatives will join programmes already under way across the UK, including those within Genomics England and the NHS GMS. Our Future Health plans to genotype the world’s largest population cohort to support the early detection of disease, and UK Biobank is continuing work to maximise the capabilities of its resource – the world’s most characterised and widely used research cohort.

The UK has the opportunity to be a genomic superpower. We are already seeing the results of the innovations in genomic healthcare and research in the UK, revolutionising outcomes for patients and generating valuable new investments. This plan will drive this transformation forward.

Will Quince MP Minister of State (Minister for Health and Secondary Care), Department of Health and Social Care

Nusrat Ghani MP Minister of State (Minister for Industry and Investment Security), Department for Business, Energy and Industrial Strategy

Executive summary

The UK is a global leader in genetics and genomics. To maintain and extend this leadership position, the UK government published Genome UK: the future of healthcare in September 2020 – a 10-year strategy to create the most advanced genomic healthcare system in the world and deliver better health outcomes at lower cost. In March 2022, the UK government and devolved governments published Genome UK: shared commitments for UK-wide implementation 2022 to 2025, setting out how they will work together to implement the Genome UK strategy. Recognising the devolved responsibilities, we agreed that the UK government and the devolved governments would each publish separate, nation-specific plans.

This implementation plan for England lays out specific actions that our genomics delivery partners in England will take during the 2022 to 2025 spending review period to implement the commitments in Genome UK. The plan showcases the outstanding research and policy work that is taking place to develop, evaluate and implement new genomic technologies. The actions are not exhaustive – the UK has a very active genomics research and clinical service community, and it would not be possible to include every project, pilot study or trial. We have focused on the key projects put forward by our delivery partners, while recognising that this only represents some of the excellent work underway across England.

Successful delivery of the implementation actions described here will ensure that by 2025 we will have made significant progress in realising the benefits of genomic healthcare. Genomic technologies that support early detection of disease, enable faster and more accurate diagnoses, and speed up the development of personalised, more effective treatments are important tools that will deliver better health outcomes for patients and support sustainable delivery of our healthcare system. We expect that by 2025, genomic healthcare will play a significant role in enabling healthcare reform, propelled by a growing genomics research sector which in turn will play an important part in generating and supporting economic growth.

For the first time, we will also publish a suite of metrics which will determine the combined impact of the initiatives included in this plan and measure progress against our Genome UK objectives.

Summary of key actions for the next implementation period of Genome UK

We will progress our ambitions for diagnosis and personalised medicine, incorporating the latest genomics advances into routine healthcare to improve the diagnosis, stratification and treatment of illness. Our implementation plan includes the following actions:

  • Genomics England, in partnership with NHS England and the National Pathology Imaging Co-operative, will lead the Cancer 2.0 programme – a £26 million innovative cancer programme to evaluate cutting-edge genomic sequencing technology. The programme will use artificial intelligence (AI) to analyse genomic data alongside digital histopathology and radiology images to improve the accuracy and speed of diagnosis for cancer patients
  • NHS England will continue to introduce new clinical indications for genomic testing through the national genomic test directory so that more patients can benefit from genomic testing. NHS England will also pilot innovative technologies through NHS GMS Alliance transformation projects to build the evidence base for adopting more genomic technologies in the future
  • Genomics England and NHS England will deliver the diagnostic discovery pathway, enabling the discovery of new diagnoses for patients with rare disease using the latest research findings. This provides benefits to individual families and insights to inform future development of the NHS GMS, enabling even more rare disease patients to receive a diagnosis in the future

We will progress our ambitions for effective prevention and early detection, generating evidence about the utility of new genomic tests and using genomic insight to improve health outcomes and transform the delivery of healthcare. Our implementation plan includes the following actions:

  • Genomics England will lead on delivering £105 million investment in a landmark research programme, in partnership with the NHS, to study the effectiveness of using WGS to speed up diagnosis and treatment of rare genetic diseases in newborns, potentially leading to thousands of life-saving interventions
  • UK Biobank will build on its strength as the largest collection of WGS data in the world and develop a secure and scalable genome variant imputation service to enrich data collected by Our Future Health and other resources using lower cost genotyping, further improving UK genomic resources for genomic analysis
  • Our Future Health will recruit up to 5 million participants by the end of 2025, making it the UK’s largest ever health research programme. In partnership with Genomics plc, Our Future Health plans to calculate polygenic risk scores (PRS) and offer participants healthcare insights, should they choose to receive them. These activities and the uniquely large cohort will enable the research community to advance our understanding of disease risk and build the evidence base on the utility of PRS in healthcare

We will progress our ambitions to extend our lead in genomic research and data. We will support the continuous development of cutting-edge genomic technology, move towards a future federated ecosystem by improved data standards and interoperability of genomic datasets for research use, and improve patient access to clinical trials. Our implementation plan includes the following actions:

  • the Medical Research Council (MRC) will provide up to £25 million investment in a 4-year functional genomics initiative, working across UK Research and Innovation and other stakeholders. This initiative will respond to views from UK researchers and industry partners about the priorities for establishing a world-class offer on functional genomics, building on already existing infrastructure and UK research expertise. Open competition funding mechanisms are expected to launch in spring 2023
  • NHS England will continue to develop unrivalled at-scale data infrastructure (secure data environments) to deliver key research and development opportunities, making a variety of data types available in a streamlined, secure and privacy-protected way. This includes working with the NHS to make more clinical test sequences available for research use
  • Our Future Health will use genotype array and other genomic data from its 5 million diverse population cohort to support the development and evaluation of genomic tools in prevention and early disease detection. The genomic data will be combined with questionnaire and health-related data to create a new research platform which is unique in cohort size and supports discovery and translational research at scale. Participants can consent to be recontacted, enabling them to receive health-related insights and be invited to additional studies based on disease risks
  • UK Biobank, one of the world’s most genetically characterised and used research resources, will use £20 million funding from the Wellcome Trust to develop its research analysis platform, further developing cloud-based platform functionality and making the platform even more accessible with a wide range of analytical tools. By the end of 2023, all genomic data will be available to industry partners and approved researchers for analysis alongside over 15 years of follow-up health outcomes data, enabling in-depth research into the genetic determinants of disease
  • UK Biobank will use £30 million funding from the Medical Research Council, Calico Life Sciences and the Chan Zuckerberg Initiative for repeat magnetic resonance imaging of up to 6,000 UK Biobank participants over the next 6 years, creating the world’s largest longitudinal imaging dataset
  • Genomics England will lead a £22 million programme to carry out tailored genomic sequencing of 15,000 to 25,000 research participants from diverse ancestry groups that are currently under-represented in genomic research. This will increase our understanding of genomic diversity and its impact on scientific, clinical and health system outcomes, aiming to reduce health inequalities, and improve patient outcomes across all communities
  • the National Institute for Health and Care Research (NIHR) BioResource will use £40 million of NIHR funding to continue to build and enhance phenotyping of disease cohorts, including rare diseases, and increase the inclusivity and diversity of existing cohorts. This work will contribute to the long-term goal of creating a national infrastructure platform to enable rapid recruitment to clinical trials and studies, including better identification of suitable participants

Introduction

In September 2020, the UK government published Genome UK – the future of healthcare, a 10-year strategy to create the most advanced genomic healthcare system in the world, delivering better patient outcomes at lower cost. It also positions the UK to become the best location globally for genomics research and investment to grow new genomics healthcare companies.

Our aim is to ensure that people and patients across the UK can benefit fully from genomic healthcare through a more preventative approach, faster diagnosis and more personalised treatment leading to improved long-term outcomes. Researchers and industry will be supported in their research and incentivised to secure the UK’s position as an international leader in genomics.

We have adopted a phased approach to implementing the strategy, allowing us to reflect emerging science. In May 2021, we published the Genome UK: 2021 to 2022 implementation plan, setting out priority actions for the financial year 2021 to 2022 in England, with contributions from the Scottish and Welsh governments.

In its recent report Harnessing the UK’s genomics expertise to improve patient outcomes, the Association of the British Pharmaceutical Industry (ABPI) highlighted the importance of working across the 4 nations of the UK to achieve our Genome UK ambitions.

While recognising that the planning and delivery of healthcare, including genomic healthcare, is devolved, we agree that concerted, UK-wide action is crucial. There is a clear benefit in working together to progress our common ambition of improving patient care and growing the sector. This is why, in March 2022, we published – jointly with the devolved governments – Genome UK: shared commitments for UK-wide implementation 2022 to 2025, which sets out our joint commitments for better UK-wide co-ordination and collaboration in genomic research and healthcare.

The 2021 spending review, which set departmental budgets in England and devolved government allocations for 2022 to 2025, provided a timely opportunity to set out how we will progress implementation of the Genome UK vision over the next 3 years in England, working with delivery partners. In 2025 we will be marking the half-way milestone in the Genome UK 10-year timescale, and measurable progress in the next 3 years will be critical to successfully delivering the vision set out in the strategy.

UK genomics landscape

The UK has a vibrant genomics research and healthcare landscape bringing together the NHS, world-leading research assets and a thriving life sciences sector.

Following the completion of the 100,000 Genomes Project in 2018, Genomics England – a Department of Health and Social Care-owned company – holds the largest global research collection of whole genome sequences from patients with cancer and rare diseases. This number is increasing with de-identified genomic data from the NHS GMS being transferred to the Genomics England National Genomic Research Library, a secure national research environment of genomic and health data, with patient consent.

UK Biobank, established in 2006, has sequenced the exomes and whole genomes of its 500,000 participants. It now represents the largest collection of genome sequences anywhere in the world, all of which are linked to participants’ detailed NHS health records. Both UK Biobank and Genomics England are now also linking imaging data to already available clinical and genomic datasets.

In 2019, the government established Our Future Health (formerly known as the Accelerating Detection of Disease Challenge) with £79 million from the Industrial Strategy Challenge Fund, via Innovate UK. With volunteers’ consent, Our Future Health will aim to collect and link multiple sources of health and lifestyle information, including genetic data, across a cohort of 5 million adults that truly reflects the UK population. Data will be held in a secure data store and will be de-identified before being made available for health research in trusted research environments. This will create a unique research platform enabling both discovery and translational research at scale, as well as allowing researchers to re-contact and invite participants to future studies based on their genetic and health information. Our Future Health’s planned research programme has already attracted £150 million of industry investment and is expected to make a significant contribution across all Genome UK objectives, with particular benefits expected in disease detection – an important strategic goal which will result in better outcomes for patients. Our Future Health is a UK-wide organisation which is expected to recruit participants across the 4 nations. The recruitment of volunteers and data collection has already begun in England, and Our Future Health will continue to work with the devolved governments to develop and roll out plans for Scotland, Wales and Northern Ireland, including recruitment initiatives, logistics and infrastructure building.

In October 2022, NHS England published the first ever NHS genomics strategy, Accelerating genomic medicine in the NHS, which sets the strategic direction and priorities of its genomics programme over the next 5 years. The strategy aims to ensure that genomics will be at the heart of a sustainable NHS in the future and the next generation of healthcare in the NHS. It sets out 4 priority areas for this approach:

  • embedding genomics in the NHS through a world leading, innovative service model
  • delivering equitable genomic testing for improved prediction, prevention, diagnosis and precision medicine
  • enabling genomics to be at the forefront of the data and digital revolution
  • evolving the service through cutting-edge science, research and innovation

The life sciences sector offers important and unique opportunities for generating investment and supporting economic growth across the UK. We know that the sector provides a huge boost to the UK economy, generating a turnover of £94.2 billion in 2021, and employing 282,000 people across the UK. Of these jobs, 65% are outside of London and the South East. The value of estimated inward life sciences foreign direct investment in the UK was £1.9 billion in 2021, coming behind only the USA in terms of value.

A range of evidence suggests that the genomics sector in particular offers high growth potential. The BioIndustry Association’s report, Genomics nation 2021, describes the size and growth of the genomics industry in recent years and projections for growth going forward. The report estimates that the sector’s market capitalisation was less than £10 billion in 2021 and forecasts that it will reach over £50 billion by 2040. The latest publication from the Office for Life Sciences (OLS) on the bioscience and health technology sector statistics also monitors trends in the number of businesses conducting genomics-related activity, the number of employees and the turnover generated from these sites. The latest statistics show that employment has seen substantial increases since 2009 with sharp increases year-on-year since 2017. There was a 12% rise in employment at sites with genomics activity between 2020 and 2021.

UK genomics governance

The National Genomics Board, which brings together senior decision makers and representatives from across the genomics sector, including senior officials from the devolved governments, provides strategic oversight and works collaboratively across the UK to harness the benefits of genomic healthcare – ultimately helping to ensure delivery of the vision set out in Genome UK. The National Genomics Board is chaired by the minister responsible for life sciences in the DHSC.

To monitor more detailed progress on Genome UK implementation plans, OLS leads an implementation co-ordination group. The group convenes delivery leads for programmes and projects in England under the pillars and themes set out in the Genome UK strategy and provides a forum for discussion on co-ordination and progress of implementation commitments and actions included in implementation plans. The implementation co-ordination group also provides specific updates on progress relating to implementation actions to the National Genomics Board, as appropriate.

Together with the devolved governments, OLS has also set up a UK shared commitments group, attended by representatives from the health services in the 4 governments and chaired by OLS. The group considers high-level co-ordination and progress in delivery of the UK-wide shared commitments, as well as agenda items for the National Genomics Board.

Measuring the impact of Genome UK implementation

To assess the long-term impact of implementation actions and initiatives, OLS has derived a set of high-level metrics that will quantify long-term changes in the genomics environment and measure progress against Genome UK ambitions. These metrics have been discussed and agreed with delivery partners and are outlined in Annex A, published alongside this plan. OLS will undertake further work to refine or expand the list of metrics based on evolving data availability and feedback received following the publication of the implementation plan.

These metrics relate to England and we are working with the devolved governments to achieve a harmonised approach across the UK where possible.

We will publish further information via a metrics baseline report in due course. This will include:

  • further specification on the metrics, including scope and definitions
  • baseline figures for each metric that quantify the situation at the earliest possible timepoint
  • any additional metrics that are important for measuring the impact of implementation actions

We plan to publish future implementation progress reports and appropriate updates on metrics.

Implementation actions 2022 to 2025

Diagnosis and personalised medicine

As we learn more about the role and function of the genome in disease, the application of genomic technologies in diagnosis and personalised medicine is becoming even more important and impactful. Using genomics, we can provide rare disease and cancer patients with an accurate diagnosis earlier and more quickly, while also supporting the use and development of personalised and stratified treatments. Furthermore, extending the use of pharmacogenomics in the NHS will ensure more patients get the right treatments at the right time and at the right dose, improving outcomes for patients and reducing the number of adverse drug reactions and their impact on the NHS. Finally, using genomics can help us to better understand pathogens and how they are spread, as well as the role of a person’s genome in their response to infectious diseases. This supports scientists in controlling outbreaks, as well as developing new diagnostics and treatments for infectious diseases.

In Genome UK we set out our commitments for the ‘diagnosis and personalised medicine’ pillar to:

  • ensure the NHS is ready to evaluate and implement all clinically relevant, genomic technologies and novel genomic healthcare applications based on the latest, robust evidence from experts at the forefront of their fields across the UK and globally
  • offer all patients with a rare genetic disorder a definitive molecular diagnosis using tests that will support research into their condition wherever possible
  • offer genomic testing to all people with cancer where it would be of clinical benefit
  • support the join-up of the NHS and research community with scalable and secure informatics systems, both for clinical decision support and large-scale data processing and analytics
  • secure the best value per clinical WGS anywhere in the world, and help ensure that new clinically relevant technologies become more widely available at a competitive price
  • have a clear, evidence-based position on whether and how pharmacogenomics should be implemented in the health service at scale
  • sequence pathogens quickly and easily using point of care sequencing technology, helping us control outbreaks and fight antimicrobial resistance
  • understand the role of the genome in differing patient outcomes from infectious disease
  • rapidly utilise advances in sequencing technology to develop and deploy new diagnostics and support better, more integrated surveillance of infectious diseases
  • provide international leadership in supporting the development of best practice in infectious disease genomics and public health, through international projects such as the Global Alliance for Genomic Health and the Public Health Alliance for Genomic Epidemiology

Actions our genomics delivery partners have taken over the past 18 months

The NHS GMS in partnership with Genomics England increased the number of rare disease and cancer patients who can access WGS, which is a world-leading diagnostic test. As of the end of October 2022, around 33,000 whole genome equivalents have been sequenced through this service, with an average diagnostic yield of 32%, rising to up to 61% in some conditions. To ramp up the service, NHS England has made it easier for clinicians to order WGS. In April 2022 they made 20 additional clinical indications available, resulting in a total of 190 clinical indications currently available via WGS (inclusive of pilot initiatives).

NHS England has increased the number of genomic tests available to patients, via the test directory. As of October 2022, the test directory includes 357 rare disease clinical indications, covering around 3,200 rare and inherited diseases, and 203 cancer clinical indications. The test directory supports the NHS GMS to carry out over 680,000 genomic tests in England every year for common, rare and inherited disease, pharmacogenomics, and cancer.

NHS England delivered a national rapid whole exome sequencing service, now a rapid WGS service, for acutely unwell children with a likely monogenic disorder (disorders likely to be caused by a defect in a single gene) in neonatal intensive care units and paediatric intensive care units. There have been around 2,500 referrals to date, with a diagnostic yield of around 40%. A genetic diagnosis can often guide the children’s clinical management and treatment.

The NHS GMS launched a world-leading national foetal exome sequencing service in October 2020, with 250 referrals to date and a diagnosis identified in around 40% of cases. This testing provides results within a rapid turnaround time to provide a diagnosis and urgently inform clinical management of the index pregnancy.

The UK Health Security Agency (UKHSA) expanded COVID-19 viral sequencing capacity within the UK to support national research studies, assessment of vaccine efficacy and evaluation of diagnostic or variant detection testing against genome sequences.

UKHSA began building a public health service infrastructure for pathogen genomics with regional and national sequencing hubs across the UK to support SARS-CoV-2 sequencing and to establish the foundation of a pathogen sequencing framework.

The government provided international leadership in supporting the development of best practice in infectious disease genomics, including through the international pathogen surveillance network global pandemic radar, to strengthen global genomic surveillance.

The DHSC published England’s Rare Diseases Action Plan on 28 February 2022, setting out 16 specific, measurable actions for the next year under the 4 priority areas of the UK rare diseases framework. These include 2 actions which support the commitment in Genome UK to make progress on the rollout of WGS to patients with a suspected rare disease.

NHS England published its genomics strategy, Accelerating genomic medicine in the NHS, to set the strategic direction and priorities of its genomics programme over the next 5 years.

Actions our genomics delivery partners will take over the next 3 years

Genomics England will deliver a £26 million ‘Cancer 2.0’ programme to improve the speed and accuracy of diagnosis for cancer patients. The programme will have 2 main components. Firstly, it will explore the use of novel, long-read WGS within a clinical setting, which has the potential to provide faster, more comprehensive and accurate diagnostic capabilities for certain cancers (compared to the short-read sequencing technology currently used by Genomics England). Secondly, it will work with the National Pathology Imaging Co-operative to combine digital histopathology and radiology images with WGS data – which will be accessible to approved researchers through Genomics England’s research environment. Genomics England will partner with researchers across industry and academia to analyse the multi-modal data at scale using machine learning technology, deriving novel insights in cancer research that enable better predictive models of diagnosis, prognosis and response to treatment. This work builds on the success of the National Pathology Imaging Co-operative, which was created as part of £100 million government investment in 2018 to establish digital pathology and imaging AI centres of excellence across the UK. The work enables new ways of using AI to analyse medical imaging and pathology data, speeding up the diagnosis of diseases (Genome UK commitments 1, 3, 4, 24 and 28).

NHS England is developing a genomics informatics implementation plan to outline how the NHS genomics data infrastructure will support interoperability of data and drive efficiencies across the spectrum from ordering a test through to availability of this data for research at scale. This will enable quicker results for patients and continue to support innovation in genomic healthcare. This commitment was included as one of the 4 priority areas set out in the recently published NHS England strategy (Genome UK commitments 4 and 26).

The NHS GMS will continue to update the test directory annually so that more clinical indications (and therefore, more patients) are eligible for genomic testing. NHS England, supported by a genomics clinical reference group and test evaluation working groups, will review the test directory on an annual basis to keep pace with scientific and technological advances, while delivering value for money for the NHS. A robust and evidence-based process and policy is in place to ensure that genomic testing remains available for all patients for whom it would be of clinical benefit. This is supported by a horizon scanning process and fast stream application system to ensure the test directory can respond quickly to emerging developments (Genome UK commitments 1, 2, 3 and 6).

The NHS GMS is exploring the introduction of innovative genomic sequencing techniques to improve diagnosis and treatment of patients, including cancer patients. The latest technologies are being piloted through a number of NHS GMS Alliance transformation projects to ensure the most innovative and effective genomic technologies – such as RNA sequencing, long-read sequencing and optical mapping, and liquid biopsies (circulating tumour DNA (ctDNA)) – can be commissioned in the NHS, based on the latest evidence. This includes a pilot to assess the use of more comprehensive pharmacogenomic testing in clinical care to reduce the number of adverse drug reactions and improve the efficacy of drugs and patient outcomes (Genome UK commitments 1, 3, 5 and 6). The National Disease Registration Service – including the National Cancer Registration and Analysis Service and the National Congenital Anomaly and Rare Disease Registration Service – will work closely with the NHS GMS to support demand modelling and evaluation of the uptake of genomic testing in eligible patient groups across cancer, congenital anomalies and rare disease.

The NHS Southwest Genomic Laboratory Hub has recently launched a new rapid WGS service for specific conditions to provide actionable WGS results more quickly for patients across England. This service will be provided to acutely unwell children with a likely monogenic disorder. Rapid WGS has the potential to increase the detection of diagnostic variants, offer more individuals a diagnosis, and enable more patients to access life-saving treatments. This could result in fewer days in hospital, fewer invasive procedures including major surgeries, and improvements in the NHS by reducing the need for multiple diagnostic tests (Genome UK commitments 1, 2 and 5).

Over the next 3 years, UKHSA will lead the ongoing COVID-19 response in England, including on new variants and supporting national recovery. They are maintaining COVID-19 horizon scanning and genomic surveillance throughout 2023 to 2024. They are also using research studies as key surveillance tools, including the Office for National Statistics (ONS) COVID-19 infection, antiviral efficacy and healthcare workers exposure studies (Genome UK commitment 9).

UKHSA is also working to reduce the harmful impact of hepatitis B, hepatitis C and HIV. They are establishing a programme of engagement with the NHS to collate HIV WGS data, implement sequencing methodology for HIV and implement HCV WGS in clinical laboratories (Genome UK commitment 8).

UKHSA has committed to enhance the resilience and scalability of national and local public health systems, by introducing standards and frameworks for data and services to facilitate responsiveness and flexible scope. They will strengthen their data and analytics capability by developing specialised analytical platforms for pathogen genome data analysis, beyond COVID-19 (Genome UK commitment 8).

NHS Blood and Transplant (NHSBT) has established a programme to expand the use of genomic testing to deliver more accurate, personalised and rapid donor-recipient matching in blood transfusion, solid organ and stem cell transplantation. This in turn will reduce the risk of allo-immunisation and of transplant rejection for patients (Genome UK commitment 1).

For transfusion, NHSBT is a founding member of an international collaboration between global blood services, industry and academia. In the next 12 months they will deliver validated genotyping technology for the identification of clinically relevant red blood cell antigens. In 2023 to 2024 NHSBT, in collaboration with the NIHR BioResource, will complete the genotyping of 80,000 regular blood donors, supporting research and clinical trials to improve matching of blood products for patients with sickle cell disease and other inherited haemoglobinopathies. This work will inform the future deployment of genotyping technology in the testing and provision of more personalised blood products to benefit patients.

For stem cells and solid organ transplantation, NHSBT has established a 3-year collaboration project with Oxford Nanopore Technologies to develop validated, rapid full HLA gene sequencing to support matching in stem cell and organ transplantation (from financial years 2022 to 2023 to 2025 to 2026). The project aims to develop faster, more accurate and scalable sequencing which can significantly improve pathways to transplantation for patients on the organ transplant waiting list, and for patients needing stem cell transplants to treat cancer or rare diseases.

Our Future Health is working in collaboration with NHSBT to recruit blood donors who may provide consent for Our Future Health to share genotype data with NHSBT. This data will be used as an initial screen for donors who may have rare blood types, which can then be verified by NHSBT to improve matching in blood transfusion services.

Genomics England will enable ongoing discovery of new diagnoses for rare disease patients through the latest research developments. Research performed using data stored in the National Genomic Research Library has identified more than 1,400 new diagnoses as of September 2022. Genomics England, with NHS England, have established the diagnostic discovery pathway to return these findings efficiently to local NHS clinical teams for clinical interpretation and reporting. The findings from diagnostic discovery not only benefit individual families, but also provide insights that can inform future developments in the NHS GMS (Genome UK commitment 2). We will be able to track genomically-confirmed rare disease diagnoses through the National Congenital Anomaly and Rare Disease Registration Service.

Prevention and early detection

Effective prevention, including screening and early detection of disease, has the potential to dramatically improve health outcomes and transform the delivery of healthcare. By generating evidence about the clinical utility of new tests and technology (such as PRS, ctDNA tests and newborn WGS), we can work towards developing new approaches to preventative healthcare.

As genomics research and innovation continues at pace, we are considering the potential applications in national policy and health and care services. The Office for Health Improvement and Disparities (OHID) leads on prevention within DHSC. OHID works closely with the UK National Screening Committee, OLS, NHS England and genomics delivery partners to ensure that the best research and evidence inform the development of policy and potential applications across health and care services to prevent or provide earlier detection of disease.

Our Genome UK strategy commitments are to:

  • enable the NHS to move from a system that primarily detects and treats illnesses to one that utilises genomics to predict and prevent ill health
  • continue to develop a public health and screening system that uses genomics to intensify screening and interventions in those at high risk
  • establish a clear, evidence-based position on whether and how genomic sequencing should be implemented for newborns, and how that genomic data could inform their care later in life
  • formulate a clear, evidence-based position on whether and how PRS can be best utilised at scale in the health service
  • explore how genomic testing can continue to be best used in reproductive medicine to support parents to make informed choices

Genomics England has progressed discussions to inform the development of their newborn sequencing project via expert working groups to ensure successful delivery of the newborn genomes programme. Work has included engagement workshops and an online survey to understand views and inform the choice of conditions to screen for. This involved engaging with NHS clinician specialists who are advising how downstream treatment pathways would work, interviewing parents to understand experiences and attitudes, commissioning an evidence assessment and literature review on ethical issues, and running feasibility studies to assess the optimal approach to sampling and sequencing.

Our Future Health has commenced recruitment, inviting volunteers to participate in the UK’s largest ever health research programme via the NHS and other partners, and delivering appointments in partnership with Boots and the Acacium Group. To progress their PRS commitments, Our Future Health has awarded contracts for biological sample receipt and processing (UK Biocentre, Randox Laboratories Limited), genotype assay design (Illumina) and genotyping service provision (Eurofins). They will collaborate with Genomics plc for imputation and PRS calculations.

The UK National Screening Committee’s evaluative rollout of non-invasive prenatal testing for Down’s syndrome, Edwards’ syndrome and Patau’s syndrome as part of the fetal anomaly screening programme started in June 2021. Three NHS Genomic Laboratory Hubs carried out testing on behalf of the national network. This approach to prenatal testing could reduce the need for invasive tests, which are associated with an elevated risk of miscarriage.

Genomics England will use £105 million of government funding in a landmark research programme, in partnership with the NHS, to study the effectiveness of using WGS to find and treat rare genetic diseases in newborns. The Newborn Genomes Programme will analyse the genetics of newborns to speed up diagnosis of treatable conditions, which could result in thousands of life-saving interventions. The research programme will also explore how babies’ genomic data could be used for discovery research, focusing on developing new treatments and diagnostics for NHS patients. It will also explore the potential benefits and broader implications of storing a baby’s genome over their lifetime. Working with key stakeholders from a range of disciplines and with NHS England, the programme has so far:

  • established the principles for the genes and variants to be included
  • identified a process for asking parents to consent for their newborn to be included in the study
  • researched the best way to take samples from newborns
  • worked with bioinformaticians to establish the strategy for analysis of the data
  • identified the pathways and systems needed to return results to parents and enable the right care and treatment pathway

Further details on which trusts the programme will work with will be posted on Genomics England’s website in early 2023 (Genome UK commitments 11, 12 and 13).

In partnership with Genomics plc, Our Future Health plans to leverage UK Biobank’s WGS data to design and implement a bioinformatics pipeline and calculate polygenic and integrated risk scores. Our Future Health plans to genotype participants’ blood samples using a single nucleotide polymorphism array which is optimised for the UK population and will include single nucleotide polymorphisms that contribute to PRS. The array will also be optimised to predict blood types and assess pharmacogenetic variants, offering the potential for participant healthcare insights, including the return of polygenic or integrated risk scores if they choose to receive them, in partnership with the NHS. Metrics will include participants recruited, samples genotyped and PRS calculated (Genome UK commitments 11 and 14).

UK Biobank will seek to develop a secure and scalable imputation service within its cloud-based research analysis platform to enable detailed information contained within its 500,000 participant genomes to be used to enrich data collected by other resources using lower cost genotyping assays. It is anticipated that Our Future Health, as well as other studies, will be able to use this service as part of an approved research project to enhance the data they collect, increase the impact of future research findings, and support continued innovation within genomic healthcare for the benefit of patients (Genome UK commitment 11).

NHS England and Our Future Health will set up a joint group to support Our Future Health in the return of risk information to participants, generated by PRS, and understand the impact it may have on the NHS (Genome UK commitments 11 and 14).

NHS England has developed a ground-breaking commercial partnership with GRAIL for the testing and use of their Galleri genomic test for cancer, aiming to accelerate the test into widespread usage as rapidly as the evidence allows. The agreement encompasses the trial stage (currently underway, 140,000 participants recruited) and an interim implementation phase, should early results from the trial hit specific benchmarks. As long as the research shows effectiveness, 500,000 tests will be rolled out in the financial years 2024 to 2025 and 2025 to 2026. Continued evaluation will be required as the test is used in more people in the NHS. The results will feed into the UK National Screening Committee’s consideration for future screening programmes. NHS England is working closely with GRAIL to learn from the trial and ensure a smooth roll-out of the 500,000 tests (Genome UK commitments 11 and 12).

NHS England has funded a transformation project through the NHS GMS Alliance, which will explore the implementation of ctDNA tests in the NHS, starting with stage 3 and 4 non-small cell lung cancer patients (Genome UK commitments 11 and 12).

The UK National Screening Committee and NHS England will continue their 3-year evaluative rollout of non-invasive prenatal testing for Down’s syndrome, Edwards’ syndrome and Patau’s syndrome. The rollout will be monitored to ensure any changes to the fetal anomaly screening programme pathway and screening processes can be recommended quickly and confidently by the UK National Screening Committee. Metrics for the programme have been published (Genome UK commitments 11, 12 and 15).

The Genome UK prevention and early detection working group will work with OHID and genomics delivery partners within the group to ensure that they understand and can engage effectively with the relevant policy decision making processes to go from research through to policy and implementation (Genome UK commitments 11 and 12).

Polygenic risk score (PRS)

A PRS is an estimate of an individual’s genetic predisposition to a heritable trait, that is, their risk of developing a disease. The information used to develop the score usually comes from genome-wide association studies which analyse large numbers of common genetic variants (single nucleotide polymorphisms) and their association with disease. A PRS typically sums the effects of many single nucleotide polymorphisms (thousands or millions) across the genome into a single number, which is proportional to the individual’s genetic predisposition for that trait. This can be combined with other information (such as age, sex or blood pressure) to create an integrated risk score. PRS and their application depend heavily on population-scale genomic biobanks such as UK Biobank, which has catalysed PRS research and studies of their clinical validity and utility.
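The summation described above can be illustrated with a minimal sketch. It assumes the common additive form, in which each single nucleotide polymorphism contributes its effect size multiplied by the number of risk alleles a person carries (0, 1 or 2). The function name, variant labels and effect sizes are hypothetical and chosen only for illustration; they do not represent the method used by any programme named in this plan.

```python
from typing import Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float],
    effect_sizes: Mapping[str, float],
) -> float:
    """Sum effect size x risk-allele dosage over variants present in both inputs.

    Assumption: a simple additive model. Variants missing from the
    individual's data are skipped rather than imputed.
    """
    return sum(
        effect_sizes[snp] * dosages[snp]
        for snp in effect_sizes
        if snp in dosages
    )


# Hypothetical effect sizes, for example log odds ratios from a
# genome-wide association study (illustrative values only).
example_effect_sizes = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}

# Hypothetical individual: number of risk alleles carried at each variant.
example_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

example_score = polygenic_risk_score(example_dosages, example_effect_sizes)
print(round(example_score, 2))
```

In practice a raw score like this is usually interpreted relative to the distribution of scores in a reference population, and an integrated risk score would add non-genetic factors such as age, sex or blood pressure in a separate model.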

Potential uses for PRS include:

  • disease risk prediction – increasing the performance of disease risk models, or identifying individuals at significant genetic risk of disease but with few or no conventional risk factors
  • screening – by predicting risk of developing disease and thereby enabling preventative interventions, or to allow risk stratification for existing screening programmes
  • informing diagnosis and prognosis – improving diagnostic accuracy by predicting the subtype or severity of a disease
  • management – helping to predict the response to a drug in order to guide treatment
  • prompting risk reducing behaviours, such as increased exercise, weight loss, health system engagement and adherence to screening

The PHG Foundation has produced a series of reports which set out the potential applications of polygenic scores for risk prediction in different contexts, as well as outlining current gaps in the scientific evidence. So far there has been a lot of research on evaluating the performance of polygenic score models – that is, the standalone or incremental predictive value of a PRS on top of other established risk factors. To demonstrate the clinical utility of PRS, further implementation and translational research is needed into how PRS can be delivered within care pathways for specific conditions and target populations, as well as assessment of the outcomes.

Examples of programmes contributing to evidence around PRS

Our Future Health plans to calculate PRS for research and offer these, in a responsible manner, to participants who wish to receive them.

UK Biobank genotyping data on 500,000 participants has enabled genomics research worldwide on an unprecedented scale that led to the concept of PRS being developed. Researchers found that about 5% of the UK Biobank population had a PRS which identified them as having a similar risk of developing heart disease to someone with familial hypercholesterolaemia. UK Biobank has since made PRS available for its 500,000 participants based on models developed by the wider research community for over 50 disease areas, together with tools to evaluate PRS in a testing subset of the UK Biobank. When access to primary care data becomes available, researchers will be able to develop further PRS across a wider range of conditions and contribute to the growing evidence base.

Genomics plc is a provider of PRS to the UK Biobank and Our Future Health programmes. In partnership with the NHS and GPs in the north of England, they recently ran a study called HEART which added PRS testing to the cardiovascular risk assessments – health checks – carried out in routine primary care in more than 800 participants. HEART evaluated the genomics integrated risk tool which combined a cardiovascular disease PRS with the currently used QRISK© method. GPs reported that the new tool was straightforward to incorporate into day-to-day practice, and that they and their patients found the test results offered helpful information. The results also found that 24% of all participants had clinically significant changes to their risk when genetics was added, leading GPs to report that they would change their management of 13% of study participants. The PRS models used were developed using UK Biobank data on 500,000 participants. These PRS models have been incorporated back into the UK Biobank resource together with evaluation tools to allow researchers worldwide to validate and develop further PRS models for an increasing range of diseases.

Investigators at many universities and NIHR biomedical research centres are conducting research into multiple aspects of PRS, including the development of analytic methodologies, score optimisation, open translational resources and tools, as well as PRS implementation and delivery for particular diseases.

There are international PRS efforts which focus on ethnic diversity, health equity and implementation, such as the:

  • National Institutes of Health PRIMED Consortium (USA)
  • National Institutes of Health eMERGE Network (USA)
  • National Institutes of Health All of Us Programme (USA)
  • INTERVENE Consortium (EU Horizon 2020)

Research and data

The UK is already a leader in genomic research. In Genome UK, we said that we would work to extend that lead by developing an ecosystem of world-leading, secure genomics datasets to drive research and support translation of research findings. These datasets will need to include more data from diverse ancestry groups, currently under-represented in genomic research. They should support studies that will improve our understanding of disease and help us develop new, more precise therapies. Over the next 3 years, we will move towards a future federated ecosystem by improving data standards and the interoperability of genomic datasets for research use, including working through commercial principles for data access. The datasets will also be used to improve clinical trial recruitment so that patients will be able to benefit from improved access to genomically-informed trials. The action on our commitment to create a world-class offer on functional genomics will further help to attract investment and grow the life sciences sector, in turn supporting economic growth.

Given the interdependency between genomic research and analysis of genomic data, the Genome UK cross-cutting theme for data has been integrated into the research pillar.

  • ensure that clinical genomic testing and genomics research contribute to powerful national data resources
  • co-ordinate the UK’s existing and future genomics ecosystem, enabling ground-breaking research at scale for the benefit of patients
  • enable and empower genomics research, providing capabilities at a unique scale
  • achieve greater diversity within our reference genomes, and ensure future genome-wide association studies reflect the UK’s diverse populations
  • incentivise the genomics research community to prioritise areas of high NHS unmet need
  • support hypothesis-driven identification, recruitment, phenotyping and biosampling of uniquely informative cohorts of patients
  • develop consent and data standards that support innovation for the benefit of patients and the NHS, while maintaining trust in the safe, appropriate and responsible use of data
  • work at a UK level to ensure there is equitable access to opportunities to participate in clinical trials informed by genomic data

From the data cross-cutting theme:

  • through the use of machine learning and AI, understand how genomically-informed healthcare and prevention could be improved and implemented in the NHS, embedding potentially lifesaving technologies quickly and efficiently in the NHS
  • establish a clear set of standards for genomic and health data
  • develop systems to enable federated access to data for research use to enable comparisons across multiple datasets
  • track the usage of our datasets and maintain an upward trajectory of both numbers and user experience
  • learn from the growing number of AI-based businesses in the UK on how to turn these applications into healthcare interventions

UK Biobank completed WGS of all 500,000 participant samples at the end of 2021 as part of the world’s largest sequencing project. Genome data for 200,000 genomes were released to approved researchers in November 2021, extending the genetic characterisation beyond existing cohort-wide exome and genotyping data, which will lead to exciting new insights. In collaboration with the funding industry parties, new data engineering methods have been developed to enable ‘at scale’ processing and curation of WGS data.

UK Biobank has continued to develop its UK cloud-based research analysis platform in partnership with DNAnexus. The platform was made available to all researchers in September 2021 and is now being used by over 2,500 researchers. Leveraging technological methods developed with UK Biobank, DNAnexus has been selected to provide the underlying platform for the Our Future Health trusted research environments to provide access to its research data when these become available. UK Biobank continues to release new platform functionality to enable researchers to easily analyse the vast, multi-modal data held within UK Biobank’s biomedical database.

The NHS GMS Research Collaborative is a partnership between NHS England, Genomics England, the NIHR and the NHS GMS that aims to support genomic research on a national scale. The collaborative has published the process for submitting applications for the NHS GMS to support genomic research or pilot new technologies. They have to date received 10 proposals from across the NHS, academia and industry. Five proposals have also been received through the early feedback review service that enables researchers to use the expertise of individuals in the NHS GMS to inform their proposals. Early feedback requests are reviewed in an average of 18 working days. First draft capacity and capability statements have been developed by all NHS GMS Alliances to start to build a picture of the genomic research already underway across England. The collaborative has also formed a consent sub-group to develop a patient choice framework that makes consented non-WGS genomic data available for research.

Following industry feedback about the importance of functional genomics capabilities for UK competitiveness, generating disease insights and improving drug discovery, the Medical Research Council, working with OLS, has engaged with academic and industry stakeholders to develop a functional genomics initiative with government funding secured.

Genomics England has successfully concluded the first year of their diverse data project, delivering 5,000 samples from diverse communities. The programme has initiated engagement activities with representatives from potential future cohorts and other genomic institutions looking to develop initiatives in diversity.

Genomics England has partnered with Lifebit to develop a new, cloud-based secure data environment which will provide improved functionality and usability for authorised researchers. This is being run alongside Genomics England’s original research environment to ensure Genomics England can provide the right service to the broadest spectrum of use cases.

NHS England’s data for research and development programme was announced in March 2022. It includes up to £18 million, subject to government approval, to support data and research commitments in Genome UK to improve the use of genomics and related health data for research and innovation.

DHSC’s data saves lives strategy, published in June 2022, included the commitment to bring together genomics data and work with NHS England to ensure that genomic data generated through clinical care is fed back into patients’ records.

Development work has continued in the Global Alliance for Genomics and Health, including approval for consent clauses for large scale projects, using work from 14 countries. They are also involved in a series of outreach activities with major genomics centres to promote adoption of standards.

UK Biobank will be making its genome data available to industry partners and approved researchers, releasing the final genomes of their 500,000 total to approved researchers in the last quarter of 2023. This data will be available to analyse alongside over 15 years of follow-up health outcomes data, enabling researchers to further understand the genetic determinants of a wide range of diseases. Metabolomic and proteomics data are also being added to the UK Biobank resource, increasing characterisation of biological pathways and underlying disease mechanisms. UK Biobank will continue adding long-read sequencing and methylation assay data for the study of epigenetics to the resource. With £20 million funding from the Wellcome Trust, UK Biobank will also develop its research analysis platform, improving cloud-based platform functionality and going beyond genomic analyses to include more traditional analysis tools, such as imaging analytics and interpretation of health record data. This expansion of UK Biobank data availability will ensure that the resource, which is already one of the world’s most genetically characterised and used research resources, becomes even more accessible and provides the necessary analytical tools to deepen our understanding of disease (Genome UK commitments 16 and 18).

UK Biobank has secured funding for the world’s largest longitudinal imaging dataset with £30 million committed by the Medical Research Council, Calico Life Sciences and Chan Zuckerberg Initiative. UK Biobank has already captured magnetic resonance imaging data from the brain, heart and abdomen, together with bone density and ultrasound scans of the carotid arteries, from over 60,000 participants. They aim to collect this data on up to 100,000 participants over the coming years. This additional funding will allow repeat imaging to commence in the first quarter of 2023 on 60,000 participants, 2 to 7 years after their initial scan. When combined with the extensive phenotypic and deeply characterised genetic data already available in UK Biobank’s database, repeat imaging data will advance understanding of the progression of a wide range of chronic diseases of mid-to-later life. It will also lead to improvements in diagnosis before symptoms even occur and enable early intervention with potential therapies. UK Biobank is exploring linkages to existing digital pathology data collected within the NHS on its participants, with a pilot starting to link to digital histopathology slides for colorectal cancer patients in Leeds and Oxford (Genome UK commitments 16 and 18).

In addition, UK Biobank has started a pilot with the Wellcome Sanger Institute to undertake single-cell RNA sequencing on an initial 5,000 participants to enable functional genomics research in individual cells. The pilot will conclude in early 2023 and, subject to funding, will be extended to the 60,000 participants undergoing repeat imaging measures between now and 2028.

Genomics England will receive £22.4 million government funding to carry out world-leading research, in collaboration with academic and commercial partners, to improve our understanding of genomic diversity and its impact on scientific, clinical and health system outcomes. This 3-year programme will increase the volume, depth and breadth of genomic data available from individuals belonging to ancestry groups that are currently under-represented in genomic research. This will be driven by tailored sequencing of 15,000 to 25,000 participants from diverse backgrounds by 2025, as well as community engagement work. Clinicians, analysts, researchers, patients and community groups will work together to develop new tools, processes and approaches for changing research, service-delivery practices, recruitment and care. The processes will be more equitable and the tools will be openly available to support international efforts, highlighting the UK’s global leadership in genomics research. The diverse data programme aims to reduce health inequalities and improve patient outcomes within genomic medicine, improve genomics research with diverse populations, and earn trust of under-represented groups in genomics-informed personalised medicine (Genome UK commitments 16, 19 and 20).

Our Future Health will aim to generate genotype array and genome-wide imputation data on up to 5 million participants. This will be combined with questionnaire and health-related linked data to create a research platform that enables discovery and translational research. Participants’ consent includes the ability to re-contact them, enabling them to receive health-related insights and be invited to additional studies based on disease risks. Our Future Health is aiming to recruit a cohort that reflects the UK population by age, ethnicity and socio-economic status, using census 2021/22 data as the comparison (Genome UK commitments 16, 19 and 21).

The Medical Research Council has set up a new advisory group to support the scoping and design of an up to £25 million investment by UKRI - MRC in a 4-year functional genomics initiative, working across UKRI and other stakeholders. This initiative will respond to views from UK researchers and industry partners about the priorities for establishing a world-class offer on functional genomics, and build on existing infrastructure and UK research expertise. Open competition funding mechanisms are expected to launch in spring 2023 (Genome UK commitments 17, 18 and 21).

NHS England will continue to work towards expanding the ability for researchers to access a range of genomics datasets through linkage of sources by scoping and testing Global Alliance for Genomics and Health interoperability modules with delivery partners. They will work with the NHS GMS on their interoperability programme to improve genomic test request processes and identify where processes could also improve research uses. They will also work with Genomics England on commercial principles around data access. This will be funded through the spending review allocation (Genome UK commitments 20 and 23).

NHS England will continue to develop unrivalled at-scale data infrastructure (secure data environments) to deliver key research and development opportunities, making a variety of data types available in a streamlined, secure and privacy-protected way. This includes work to scale up the NHS GMS’ WGS capacity, enabling more clinical test sequences to be made available for research use (Genome UK commitments 16 and 18).

NHS England is supporting faster, more effective and diverse data-enabled clinical trials by developing a service called Find, Recruit and Follow-Up. The service will use data and digital tools to speed up the identification and recruitment of patients potentially eligible for specific clinical studies and enable follow-up. It will give a wider, more diverse cohort of the UK population the opportunity to take part in clinical research. Find, Recruit and Follow-Up aims to address some of the challenges trialists face when they conduct studies in the UK, to reverse the decline in the number of studies taking place in the UK, and to enhance the quality of service. This service is one step towards creating a globally competitive, digitised, holistic and data-enabled clinical research process in the UK (Genome UK commitments 18, 21 and 23).

The NHS will drive equity in access to clinical trials by aligning clinical trial targets with standard of care NHS testing. In appropriate circumstances this will involve partnering with clinical trial units and industry to identify eligible patients. This will require a mechanism to systematically horizon scan upcoming clinical trials to ensure the correct targets are added to the test directory, while also having the data sharing infrastructure in place to share genomic data safely where appropriate and with the necessary patient consent (Genome UK commitments 21 and 23).

Over the next 2 years, NIHR BioResource will progress their long‑term goal of creating a national infrastructure platform to enable the rapid recruitment to clinical trials and studies. Progress will be monitored through annual reporting, where metrics such as number of participants recruited and number of studies where NIHR BioResource has been used to support recall to studies are captured (Genome UK commitments 18, 19 and 21). They will use £40 million of NIHR funding to:

  • build and enhance phenotyping of disease cohorts, including rare diseases
  • increase inclusivity and diversity of existing cohorts
  • establish a Young People’s BioResource
  • enhance and promote the offer to industry
  • develop new approaches to patient and public involvement

The NHS GMS Research Collaborative will continue to use the NHS GMS infrastructure to facilitate a full spectrum of research and innovation, from discovery to translation, adoption and diffusion across the NHS. As part of the evolving NHS GMS Alliance infrastructure, the NHS will establish NHS genomic networks of excellence. These will bring together the NHS GMS, academia, universities, industry and other partners in networks to deliver genomic research from discovery to adoption and spread, in specific priority areas designated by NHS England and aligned to NHS priorities (Genome UK commitments 17 and 19).

In partnership with Genomics England, patients and clinicians, NHS England have developed a national patient choice framework that supports clinicians, regardless of clinical specialty, to discuss the implications and impact of having WGS and whether a patient would consent to their genomic data being accessible for research via the National Genomic Research Library. To date, approximately 93% of patients undergoing WGS in the NHS who have been offered the opportunity to participate in research have given their consent. NHS England is working with partners to put in place mechanisms for enabling consent and collation of NHS genomic sequencing data for research and innovation purposes at a national and regional level (Genome UK commitment 22).

As set out in the Life sciences competitiveness indicators 2022: life science ecosystem, health data facilitates medical research and diagnostics. This can enable the development of treatments and earlier detection of disease. A rich supply of health data can allow for analysis of key health indicators, including genomics, to diagnose disease earlier when it is easier and less expensive to treat. High-quality data and associated architecture can bring together datasets to allow more detailed research and development of AI and health technologies. There are currently no metrics available for the information environment. OLS is therefore considering how the UK data environment can be measured against other countries (Genome UK commitment 27).

NIHR, the Medical Research Council and the Wellcome Trust have provided funding to the Global Alliance for Genomics and Health to develop standards and policies for sharing genomic and related health data. The Global Alliance for Genomics and Health aims to develop secure technical standards and frameworks to promote responsible use of genomic data for the benefit of human health, and drive uptake of standards through effective communications, dissemination and engagement. In the UK, Global Alliance for Genomics and Health standards are already being actively deployed within Genomics England, enabling better communication with Genomic Laboratory Hubs, the General Medical Council, and the broader NHS (Genome UK commitments 22 and 25).

Case study: Genes and Health

Genes and Health is a long-term, population health resource of adults, combining genetic data and lifetime multisource NHS health record data (primary care, hospital, and national NHS Digital) with the ability to invite volunteers to return with consent for more detailed research studies.

Genes and Health is researching British-Bangladeshi and British-Pakistani ethnic minority groups who have marked health inequalities (such as the highest rates of type 2 diabetes and early heart disease in the UK) and who are poorly represented in other large genetic research studies to date. Without such resources, modern genomic medicine and precision medicine (such as disease risk prediction) might not benefit communities with the greatest need.

The resource is open to international scientific researchers. Currently, over 85 groups of academic and industry researchers working across multiple disease and basic science fields are approved to analyse Genes and Health data via a UK cloud-based secure data environment. Genetic data includes chip genotyping and exome sequencing on all volunteers.

The first East London Genes and Health volunteer took part in 2015, with Bradford Genes and Health opening in 2019 and Manchester Genes and Health in 2022. There are now over 53,000 Genes and Health volunteers, and a target of 100,000 by 2024. Genes and Health is embedded in the local communities it is studying, with a wide-reaching and authentic programme of engagement activities. Their community advisory group works closely with the Genes and Health executive to prioritise research topics and build acceptance and long-term support.

Recall studies to date include:

  • 32 volunteers at very high risk of heart attack were given a diagnosis of low-density lipoprotein receptor familial hypercholesterolemia, with appropriate preventative treatment (none of them were previously aware of their genetic diagnosis and risk level)
  • laboratory studies on an individual lacking the HAO1 protein provided key safety information and biological insights for a new drug, lumasiran
  • over 1,000 volunteers identified for a blood ‘cell atlas’ sequencing project at the Wellcome Sanger Institute

Case study: NHS DigiTrials

NHS England has funded NHS DigiTrials (a delivery partner of Find, Recruit and Follow-Up) which offers data services to support high-priority, large-scale clinical trials. DigiTrials reduces the time, effort and cost of developing new drugs, treatments and services, bringing benefits to patients, the public and the NHS. NHS DigiTrials’ services can be used to accelerate the recruitment of diverse trial participants and increase the number of people identified as potentially eligible to participate in trials.

The NHS-Galleri trial is studying the clinical and economic performance of the Galleri test using healthy NHS volunteers. The cohort needed for the study was challenging to reach via normal clinical settings and had to align with specific demographic and cancer risk factors. NHS DigiTrials has supported recruitment by identifying eligible participants using routinely collected NHS Digital data. By July 2022, the service had recruited 140,000 volunteers across 8 areas in England in just 10 months, making it one of the fastest recruited large-scale randomised trials.

NHS DigiTrials is also supporting recruitment to the Our Future Health research programme. Millions of adults from all backgrounds will be invited to take part in the programme by providing a blood sample, information about their health and lifestyle, and their consent to link their NHS records. This will be used to create a detailed picture that represents the whole of the UK, helping researchers to discover more effective ways to predict, detect and treat common diseases. NHS DigiTrials will support recruitment by using data to identify people who are eligible and inviting them to join. Letters are sent directly to eligible participants to see if they want to take part in the programme. This means that no patient data leaves NHS Digital or is shared with the research programme.

Cross-cutting themes

Engagement and dialogue with the public, patients and our healthcare workforce

As we move forward in implementing our vision for genomic healthcare, it will be essential to bring patients and the public with us through continued engagement activities. Patient and public engagement is built into the governance of the major organisations that deliver Genome UK, such as Genomics England, NHS England and Our Future Health. We are now considering how patient engagement should be approached by Genome UK’s governance structures to ensure that their voice is embedded into our decision making.

  • ensure that patients, the public and the NHS workforce have an increased awareness and understanding of the potential benefits of genomic healthcare by increasing its visibility and committing to open, honest engagement about what is involved
  • set out clearly how patient data can be used to advance research, and inform the public about research that has successfully used their data to improve diagnosis, understanding or treatment of patients in the UK
  • ensure that there are appropriate measures to protect patient privacy and confidentiality, so that patient data is used in ways that are acceptable to the public

The following commitment from the ethics theme is also relevant here:

  • keep an open dialogue and continue to openly engage with relevant patient and participant groups, continuing to involve the public and building on the engagement through the 100,000 Genomes Project

NHS England’s NHS GMS people and communities forum has held regular meetings, with topics such as the NHS Genomic Strategy, test directory, clinical genomics service specification, WGS and consent, and the NHS GMS Research Collaborative discussed.

Genomics England has continued to regularly engage their participant panel in addition to engagement around the newborn project. Engagement for the newborn project has included:

  • running an online survey that received over 600 responses
  • holding workshops with members of the public, people living with rare genetic conditions, and healthcare professionals
  • running a series of sessions with genetic counsellors and regional meetings with clinical and other specialists to explain the draft principles

In July 2021, Genomics England published the results of a public dialogue on the use of WGS in newborn screening , finding that members of the public were broadly supportive as long as the right safeguards and resources are in place.

UK Biobank has completed a consultation with the UK’s public engagement charity, Involve, to identify opportunities for greater participant involvement and engagement as part of study governance and future enhancements. In addition, its ethical advisory committee has started to bring together a focus group to assess participant views on extending linkages to health-related records and, specifically, access to participant tissue samples that may have been collected within the NHS.

Our Future Health has held regular meetings of their public advisory board, feeding into aspects of the programme including consent revisions, trusted research environment plans and pilot evaluation. Public representatives have also joined other Our Future Health advisory boards, including their ethics advisory board and technology advisory board.

The NHS GMS will continue to drive the proactive involvement of patients and the public from our diverse communities, nationally through the NHS GMS people and communities forum and regionally throughout the NHS GMS infrastructure (Genome UK commitments 29 and 42).

NIHR BioResource will increase patient and public involvement in the review of applications, support participant recruitment through their participant portals, work with under‐served communities and support patients to develop their research ideas into research projects. Each bioresource cohort uses innovative ways to recruit participants. For example, the Young People’s BioResource has developed a young ambassador programme to help a group of young people promote the aims of the Young People’s BioResource to their peers and to support recruitment. Recruitment to this programme started in September 2022 (Genome UK commitments 29 and 42).

OLS is leading work on how best to engage patients in its Genome UK implementation co-ordination group and associated working groups (Genome UK commitment 42).

UK Biobank will further expand its patient and public involvement activities. Following a review it commissioned with the UK’s public participation charity, Involve, it is exploring additional ways it can inform and involve its 500,000 participants as part of the ongoing study and future enhancements (Genome UK commitments 29 and 42).

Our Future Health will continue to grow their public and participant involvement by having representatives in their wider governance structures. They will co-develop and co-design policies and procedures with their public and participant representatives, as well as involving members of the public in regular user testing of participant-facing materials. Our Future Health will engage with communities and partners to increase awareness of the programme in order to maximise participation, particularly from minority populations that have historically been under-represented in large-scale, population-based studies (Genome UK commitments 29 and 42).

The national data advisory group has been established and is now meeting routinely. Membership draws from across expert external health and care stakeholders, as well as patients and regional system representatives. The responsibilities of the group include:

  • testing approaches and thinking for national programmes and policy areas, including what topics should be engaged on and how
  • providing advice on national strategic products, including on the engagement standard for public engagement
  • considering how national strategic work can support local and regional teams on data issues
  • national strategic communications advice, including tone and focus
  • advising on national strategic stakeholder engagement and co-design work (Genome UK commitments 30 and 31)

Workforce development and engagement with genomics through training, education and new standards of care

The genomics workforce spans both the health service and industry. It includes laboratory-based staff such as clinical scientists, genomic technologists and bioinformaticians, specialist clinical staff such as clinical geneticists and genetic counsellors, and members of the mainstream workforce who encounter genomics in their role, such as doctors (including general practitioners), pharmacists, nurses and midwives. Each of these professions plays a vital role in the genomic healthcare ecosystem, and continued proactive efforts are required to ensure that they have the support and resources needed to deliver genomic advances to patients now and in the long term.

NHS England, working in partnership with Health Education England (HEE) and the DHSC, are currently developing a long-term workforce plan for the NHS, as commissioned by the government earlier this year.

  • ensure that all new graduating doctors, nurses, midwives, pharmacists, allied health professionals, dental and relevant nonclinical staff have a level of awareness and knowledge of genomics that is relevant to their role
  • ensure that the healthcare science workforce continues to have advanced genomic training and education within their programmes
  • put in place continuing professional development programmes to ensure all relevant staff maintain an up-to-date and role-appropriate understanding of genomics
  • use workforce modelling data to inform investment decisions for training numbers across all professions and support workforce growth to meet the needs of the NHS GMS, particularly in specialist scientific and medical workforce areas
  • establish and invest in training pipelines for in-demand occupations such as bioinformatics to build capacity within the health service and the wider sector
  • redevelop clinical pathways and standards of care so that they fully incorporate the latest genomic testing and results
  • support the NHS workforce by providing simple, practical, informatics solutions for training, genomic analysis and decision-support

NHS England and HEE undertook a joint workforce data capture exercise for the NHS Genomic Laboratory Hub workforce between July and September 2021. This data will be used to inform supply and demand modelling.

HEE surveys aimed at the pharmacy and the nursing and midwifery workforces were launched. The findings will help HEE to understand the levels of interaction with genomics in practice, and gaps in knowledge can then be addressed through the Genomic Education Programme’s strategic approach to workforce development.

HEE has developed their clinical pathway initiative in collaboration with NHS England and the Academy of Medical Royal Colleges. The clinical pathway initiative outlines a stepwise approach to multi-professional clinical pathways, identifying the workforce associated with each touchpoint along the pathway and the education and training interventions required where there are gaps in knowledge or competency. The clinical pathway initiative provides a platform for sharing workforce education and training needs across different clinical pathways to support the workforce and avoid duplication across the system.

A joint HEE and NHS England spending review bid resulted in funding for additional genomics-related scientist training programme and higher specialist scientist training places, an increase in practice educators, and the establishment of a genomics training academy.

An NHS England and HEE pharmacy genomics workforce group has been set up to provide a forum for national collaboration and co-ordination of pharmacy workforce planning and education activity related to genomics. A pharmacy genomics roundtable, hosted by HEE’s Genomic Education Programme and NHS England, was held in November 2022.

To ensure that genomics is represented in the undergraduate curricula of the mainstream workforce, HEE will:

  • scope the existing undergraduate curricula for medicine, nursing, midwifery and dentistry
  • work with higher education institutions to update curricula to integrate genomic medicine
  • provide ‘off the shelf’ packages to support the delivery of genomic education and training in the undergraduate setting (Genome UK commitment 32)

To ensure that the specialist genomic workforce have access to the continuing professional development required for their roles, HEE and NHS England are establishing a genomic training academy. Over the next 3 years this will involve developing education and training resources mapped to profession-specific curricula. Metrics will include:

  • establishment of the genomic training academy and infrastructure
  • numbers of modules and teaching events developed and delivered
  • numbers of workforce who have benefitted from the genomic training academy training
  • evaluation of the genomic training academy and resources (Genome UK commitments 33 and 34)

HEE will utilise NHS England and Genomic Education Programme workforce modelling data by increasing scientist training programme numbers across the specialist genomics workforce (laboratory and clinical) and developing and delivering new models of genomic education and training provision to ensure the offering is relevant to different specialties and professions. The increase in scientist training programme numbers will be funded through the spending review 2021 allocation. HEE will also investigate how to retain bioinformaticians through new models of working. Metrics will include:

  • numbers of clinical scientists in post
  • numbers of WGS cases going through the laboratories
  • retention of current staff (Genome UK commitments 33, 34, 35 and 36)

HEE will support clinicians to use the test directory by developing 2 massive open online courses aligned to the rare disease and cancer genomic pathways and continuing the development of their GeNotes resource. GeNotes consists of 2 tiers. Tier 1 is mapped to the test directory and supports the clinician to choose the right genomic test for the right patient at the right time and navigate the test directory and its supporting resources. Tier 2 is ‘the knowledge hub’, providing an extended learning opportunity for clinicians engaging with the resource and content. Metrics will include the number of people accessing the course, as well as evaluation of the course (Genome UK commitments 37 and 38).

HEE will actively work to bridge the clinical-research gap through monthly blogs where they discuss a particular aspect of genomic research and its clinical impact, through the establishment of a new ‘expert webinar’ series and through the development, funding and collaborative delivery of the masters in genomic medicine framework (Genome UK commitment 34).

HEE will also be horizon scanning to determine where new research findings may impact on clinical practice and using this to inform workforce modelling and clinical pathways. HEE will collaborate with Royal Colleges and the Academy of Medical Royal Colleges to ensure that new advances are prospectively represented in curricula (Genome UK commitments 35 and 37).

The NIHR clinical research network’s medical directorate is committed to helping ensure the NHS workforce is competent in delivering genomic research. They are working with specialties and NIHR Learn, as well as with the academy, to develop training curricula and materials for busy clinicians (Genome UK commitments 33 and 34).

OLS and industry continue to explore how the skills value chain approach could support the adoption of emerging skills in the sector. This work will encompass current areas of shortage, such as bioinformatics, data analytics and computational biology (Genome UK commitments 35 and 36).

A strategy outlining the approach to supporting educational and training needs for the pharmacy workforce will be published in early 2023 (Genome UK commitments 34 and 35).

The national nursing and midwifery genomic transformation programme, led by the NHS England genomics unit, has been commissioned to provide a 2-year programme of activity from 2022 to 2024. The programme will engage with hundreds of nurses and midwives to support the development of their knowledge and skills in genomics, building their confidence and capability to lead, deliver and co-ordinate genomic practice in everyday care. The programme will also support nursing and midwifery leaders to define exemplar genomic pathways and accelerate the adoption of standardised practice at appropriate clinical touchpoints to increase equity of access or reduce unwarranted variation (Genome UK commitments 34 and 37).

Case study: apprenticeships

The BioIndustry Association’s 2022 Genomics nation report included a ‘spotlight on skills’ which reported that 70% of UK genomics small and medium-sized enterprises (SMEs) relying on the full range of skilled professionals say that it is particularly difficult to recruit for computational or data skills. Apprenticeships offer development of highly sought-after informatics skills in existing biotech talent within the industry context.

Cranfield University has been offering a bioinformatics masters-level apprenticeship since 2019 and postgraduate training in bioinformatics since 2002. The course features a strong focus on genomics and genetics as well as computational skills, providing life science companies with the opportunity to address these skills gaps within their organisations.

Freeline – a clinical-stage biotechnology SME – identified a high achieving research assistant with an aptitude for computational sciences who wanted an opportunity to develop their skillset. The employee was keen to bridge the gap between biology and computer skills, with the aim of being able to aid the Freeline team in discovering and developing new ways to deliver gene therapies for patients. The employee and their line manager therefore actively sought out Cranfield’s apprenticeship programme, with the employee joining as a part-time master’s student.

Freeline have found that the bioinformatics apprenticeship provides a structured forum for the employee to develop their knowledge, technical and computational skills. Employees and their line managers meet with the course organisers to discuss progress every 3 months. The employer also works with the course organisers to develop a bioinformatics research project that is relevant to the company and forms part of the evaluation of the course. An advantage of the course is that the exercises carried out by the apprentice are grounded in real-world data, with Freeline’s employee working to develop tools that can be practically used in the company’s pipeline.

Case study: BioIndustry Association’s Manufacturing Advisory Committee Leadership Programme

The leadership programme supports the development and training of managers in the biopharmaceutical and cell and gene therapy industries through cross-sector learning and peer networks, helping deliver future leaders.

Two key aims of this initiative are to:

  • promote cross-sector learning by offering an overview of the work of other companies across biopharma, vaccines, and cell and gene therapies by seeing them in action
  • develop a network with peers to share best practice and develop relationships to encourage possible future collaborations

The pilot programme was launched in January 2017 and completed in January 2019. An alumni group was set up afterwards to support networking. There are currently 92 participants from 36 member companies benefitting from the programme.

Supporting industrial growth in the UK

The UK has a thriving genomics industry and, as set out in Genome UK, we are committed to making the UK the best location globally to start and scale new genomics healthcare companies and innovations. To deliver this, it is essential that engagement with companies across the sector is embedded in genomic healthcare across diagnosis and personalised medicine, prevention and early detection, and research and data, and that the delivery partners for Genome UK have a strong relationship with sector leaders to continue to support growth.

We have held workshops with representatives of the genomics industry to identify top priorities, resulting in clear actions for delivery partners, and an increased understanding of how best to support our world-leading genomics industry. We will continue to take this approach in future years.

  • develop integrated data resources, biosampling capabilities and collaborative academic and clinical expertise that will make the UK the most attractive location globally for genomic healthcare start-ups
  • help to increase life science industry research and development spend in the UK by identifying new opportunities for innovative and cutting-edge industry partnerships
  • work to improve the availability of capital, including through the Life Sciences Investment Programme, which will deliver around £600 million of investment – both public and private – with a significant focus on UK life sciences companies over the next 10 to 15 years

Actions our genomics delivery partners will take over the next 3 years

As set out in the shared commitments for implementation published in March 2022, OLS and its industry delivery partners, including those based in the devolved governments, committed to holding a joint workshop in partnership with the trade associations, ABPI and the BioIndustry Association. Two workshops have already taken place during 2022, with further workshops planned to gather industry feedback on Genome UK implementation and better understand industry priorities.

OLS will continue to evaluate their bioscience and health technology sector statistics, which publish information on the shape and size of the genomics sector in the UK, to ensure they are meeting user needs and evolving alongside the changing genomics landscape. OLS will work with users and delivery partners to collect feedback and ensure the statistics provide the most accurate reflection of the sector and continue to measure how the sector is changing.

OLS, through the bioscience and health technology sector statistics series and the BioIndustry Association’s Genomics nation reports, have published statistics monitoring the activity of genomics companies in the UK. There are some differences in the definitions of genomics companies in these published statistics. The BioIndustry Association and OLS will work together to understand these differences and establish a consistent approach to measuring and reporting on the size, makeup and growth of the sector to inform policy development. More information on what companies are included in the OLS statistics can be found in the accompanying 2020 user guide and within the subsectors chapter of the BioIndustry Association’s Genomics nation report.

The Life Sciences Scale-up Taskforce, co-ordinated by OLS and supported by the BioIndustry Association, produced tangible actions the government and industry could take forward to increase the availability of capital for life sciences, including genomics companies. This included exploring mechanisms that bring together institutional investors and specialist venture capital firms to attract more private investment towards specialist funds supporting innovation, to strengthen the UK’s investment ecosystem, and to help growing companies scale in the UK (Genome UK commitment 41).

Delivery partners for Genome UK will continue to identify new industry partnership opportunities. This will include aiming for equitable opportunities for SMEs, to enhance UK attractiveness and continue to support growth. We will assess the process for SME access to data resources, bio sampling capabilities and expertise. We will also take action to simplify the application process for small businesses which want to supply to government and to increase visibility of subcontracting opportunities (Genome UK commitment 40).

Case study: Genomics England working with SMEs

Genomics England has a mandate from our participants, expressed clearly by the participant panel and Access Review Committee, to provide access to the National Genomic Research Library for the biopharma industry on the basis of fair economic return. For this reason, Genomics England charges an access fee for our commercial partners, and 8 of the top 10 pharmaceutical companies in the world pay to access our data. We also recognise the huge contribution to research and development made by the SME industry, including start-up companies. For this reason, SMEs are charged one sixth of the rate of large biopharma partners for access. Furthermore, Genomics England has partnered with 3 accelerators that focus on early-stage start-ups with a specific genomics focus in cancer and rare disease. These accelerators provide an effective filter for identifying some of the best start-ups. For these start-up and pre-revenue companies, Genomics England provides zero-rated access for the first year, which can be extended at Genomics England’s discretion beyond one year.

For example, Nostos Genomics is an AI-driven start-up that is looking to use sequencing data to cut rare disease diagnosis time by 99% by automating the identification of disease-causing mutations in patients’ DNA. This could cut costs for labs, free up resources and, crucially, improve patient care. Nostos is using Genomics England’s rare disease dataset as a training and validation set. The company is at an early stage of clinical validation, but the results look promising. As a result, Genomics England has extended zero-rated access for a second year with Nostos Genomics. We are proud to support start-ups like this to build companies based on Genomics England data, and to focus on improving the lives of participants that we represent.

Maintaining trust through strong ethical frameworks

In implementing genomic healthcare, we want to harness the power of genomic and genetic information combined with other health data to be able to provide more timely, improved diagnosis and offer better, equitable and more personalised treatments and access to clinical trials. To enable these advances, it is important that the public and patients can be reassured that ethical questions including those regarding consent, confidentiality and the handling of genomic data in research have been considered in a comprehensive way, with public and patient participation, and that these questions are addressed with robust data governance and secure data protocols.

Together with the devolved governments, we agreed to work together on embedding ethical considerations in both genomic healthcare policy development and programme planning and implementation. To achieve this, we said we would hold a series of workshops later in 2022, in collaboration with the Nuffield Council on Bioethics.

  • establish a gold standard UK model for how to apply strong and consistent ethical and regulatory standards. We will share these standards and expertise globally and help partners across the world develop and implement their own frameworks
  • ensure that our regulatory and ethical frameworks support rapid healthcare innovation, while reflecting legal frameworks and retaining public and professional trust. We will keep under review the balance between regulation and innovation

The Nuffield Council has led work on gathering case studies and convening stakeholder workshops, which took place in December 2022. The workshops explored the feasibility and development of a UK model for how to apply strong and consistent ethical standards in genomic healthcare and research.

In July 2022, the Nuffield Council on Bioethics issued a call for case studies from the past 10 years describing how organisations have dealt with ethical issues they have encountered during their work on genomics initiatives. The case studies cover examples from across the NHS, government, academia, and the charity and private sectors. They were analysed for common themes and examples of best practice, and were discussed at UK-wide workshops in December 2022. A report of the outcomes and an illustrative selection of the case studies will be published on the Nuffield Council on Bioethics website in early 2023. This work is aimed at scoping the feasibility and development of a UK model for how to apply ethical standards in genomics initiatives and will lead to a better understanding of ethical issues encountered in genomics research and how to overcome them (Genome UK commitments 43 and 44).

Genomics England has expanded its ethics team and is developing its approach to embedding ethics more substantively in its research activities. This includes refreshing the Access Review Committee for the National Genomic Research Library, co-producing policies on issues such as recontact with input from the participant panel, and streamlining processes for internal ethics advice in collaboration with the legal and information governance teams.

The Newborn Genomes Programme at Genomics England is embedding ethics operationally in the study to ensure an ethical study design, implementation and evaluation. The programme has an ongoing independent newborn ethics working group to help identify ethical considerations and support the development of new policies. The programme is also commissioning research, including a literature review on the ethics of sequencing newborns and the regulatory and governance aspects of the lifetime genome, to inform future thinking and delivery of the programme. Other supporting public dialogue will be undertaken on specific areas – for example, to explore public views on the acceptability and scope of the discovery research potential for this cohort.

  • Open access
  • Published: 16 March 2023

Genetic association analysis of 77,539 genomes reveals rare disease etiologies

  • Daniel Greene 1 , 2 ,
  • Genomics England Research Consortium ,
  • Daniela Pirri 3 ,
  • Karen Frudd 3 , 4 ,
  • Ege Sackey 5 ,
  • Mohammed Al-Owain 6 ,
  • Arnaud P. J. Giese   ORCID: orcid.org/0000-0001-7228-9542 7 ,
  • Khushnooda Ramzan 8 ,
  • Sehar Riaz 7 , 9 ,
  • Itaru Yamanaka   ORCID: orcid.org/0000-0003-0293-8070 10 ,
  • Nele Boeckx 11 ,
  • Chantal Thys 12 ,
  • Bruce D. Gelb   ORCID: orcid.org/0000-0001-8527-5027 2 , 13 , 14 ,
  • Paul Brennan 15 ,
  • Verity Hartill 16 , 17 ,
  • Julie Harvengt 18 ,
  • Tomoki Kosho   ORCID: orcid.org/0000-0002-8344-7507 19 , 20 ,
  • Sahar Mansour 5 , 21 ,
  • Mitsuo Masuno 22 ,
  • Takako Ohata 23 ,
  • Helen Stewart 24 ,
  • Khalid Taibah 25 ,
  • Claire L. S. Turner 26 ,
  • Faiqa Imtiaz 8 ,
  • Saima Riazuddin   ORCID: orcid.org/0000-0002-8645-4761 7 , 9 ,
  • Takayuki Morisaki 10 , 27 ,
  • Pia Ostergaard   ORCID: orcid.org/0000-0002-2190-1356 5 ,
  • Bart L. Loeys   ORCID: orcid.org/0000-0003-3703-9518 11 , 28 ,
  • Hiroko Morisaki 10 , 29 ,
  • Zubair M. Ahmed   ORCID: orcid.org/0000-0003-2914-4502 7 , 9 ,
  • Graeme M. Birdsey   ORCID: orcid.org/0000-0002-0981-8672 3 ,
  • Kathleen Freson 12 ,
  • Andrew Mumford 30 , 31 &
  • Ernest Turro   ORCID: orcid.org/0000-0002-1820-6563 2 , 14 , 32 , 33  

Nature Medicine volume 29, pages 679–688 (2023)

  • Computational platforms and environments
  • Disease genetics
  • Genetics research
  • Next-generation sequencing

The genetic etiologies of more than half of rare diseases remain unknown. Standardized genome sequencing and phenotyping of large patient cohorts provide an opportunity for discovering the unknown etiologies, but this depends on efficient and powerful analytical methods. We built a compact database, the ‘Rareservoir’, containing the rare variant genotypes and phenotypes of 77,539 participants sequenced by the 100,000 Genomes Project. We then used the Bayesian genetic association method BeviMed to infer associations between genes and each of 269 rare disease classes assigned by clinicians to the participants. We identified 241 known and 19 previously unidentified associations. We validated associations with ERG, PMEPA1 and GPR156 by searching for pedigrees in other cohorts and using bioinformatic and experimental approaches. We provide evidence that (1) loss-of-function variants in the Erythroblast Transformation Specific (ETS)-family transcription factor encoding gene ERG lead to primary lymphoedema, (2) truncating variants in the last exon of transforming growth factor-β regulator PMEPA1 result in Loeys–Dietz syndrome and (3) loss-of-function variants in GPR156 give rise to recessive congenital hearing impairment. The Rareservoir provides a lightweight, flexible and portable system for synthesizing the genetic and phenotypic data required to study rare disease cohorts with tens of thousands of participants.

Collectively, rare diseases affect 1 in 20 people 1 , but fewer than half of the approximately 10,000 cataloged rare diseases have a resolved genetic etiology 2 . Standardized genome sequencing (GS) of large, phenotypically diverse collections of patients with rare diseases enables etiological discovery across a wide range of pathologies 3 , 4 , 5 while boosting genetic diagnostic rates for patients. The 100,000 Genomes Project (100KGP), the largest GS study of patients with rare diseases to date, sequenced 34,523 UK National Health Service patients with rare diseases and 43,016 of their unaffected relatives. The linked genetic and phenotypic data of 100KGP participants were then made available to researchers through a web portal called the Genomics England Research Environment. The scale and complexity of such large GS datasets and the hierarchical nature of patient phenotype coding 6 induce numerous bioinformatic and statistical challenges. Most importantly, the full genotype data from GS studies of tens of thousands of individuals are typically stored in unmodifiable files many terabytes in size, leading to high storage and processing costs. Recently developed frameworks, such as Hail 7 and OpenCGA 8 , afford greater flexibility. However, they are designed to capture genotypes for variants across the full minor allele frequency (MAF) spectrum, from rare (MAF < 0.1%) to common (MAF > 5%) variants. To accommodate large numbers of genotypes, they depend on distributed storage systems and require numerous software packages, hindering deployment. We developed a database schema, the ‘Rareservoir’, for working with rare variant genotypes and patient phenotypes flexibly and efficiently. We deployed a Rareservoir only 5.5 GB in size of 100KGP data and applied the Bayesian statistical method BeviMed 9 to identify genetic associations between coding genes and each of the 269 rare disease classes assigned to patients by clinicians. 
Of the previously unknown associations that we identified, we followed up the most plausible subset in confirmatory analytical and experimental work.

The Rareservoir

Relational databases (RDBs) provide a unified, centralized structure for storing, querying and modifying data of multiple underlying types. In principle, an RDB could provide a convenient foundation for the analysis of genotypes, variants, genes, participants and statistical results, but it cannot accommodate tables of the scale required to store exome or genome-wide genotypes in a moderately sized cohort. An RDB can, however, accommodate a sparse representation of genotypes corresponding to rare variants only, which encompass almost all variants having a large effect on rare disease risk. We developed an RDB schema, the Rareservoir, and a complementary build procedure for the analysis of rare diseases, which, by default, stores genotypes corresponding to variants for which all population-specific MAFs are likely to be <0.1%. This reduces the number of stored genotypes in a large study by about 99% (Extended Data Fig. 1). The Rareservoir encodes variants as 64-bit integers (‘RSVR IDs’) (Extended Data Fig. 2), which can represent 99.3% of variants encountered in practice without loss of information. RSVR IDs occupy a single column and increase numerically with respect to genomic position, allowing fast location-based queries within a simple database structure. To support the build process of a Rareservoir, we developed a complementary software package called ‘rsvr’ (Extended Data Figs. 2 and 3). The package includes tools to annotate variants with MAF information from control databases (for example, gnomAD 10), pathogenicity scores (for example, combined annotation-dependent depletion (CADD) scores 11) and predicted Sequence Ontology (SO) 12 consequences with respect to a set of transcripts. We use a 64-bit integer (‘CSQ ID’) to record the consequences for interacting variant/transcript pairs, where each bit encodes one of the possible consequences, ordered by severity. 
Encoding the consequences in this way is efficient and enables succinct queries that threshold or sort based on severity of impact. The Rareservoir also contains a table with genetically derived data for each sample (including ancestry, sex and membership of a maximal set of unrelated participants) and a table of ‘case sets’ storing the rare disease classes assigned to each participant.
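The two integer encodings described above can be illustrated with a minimal Python sketch. The bit layout (7 bits for chromosome, 32 for position, 24 for a variant-specific payload) and the four-term consequence list are assumptions chosen for illustration; the actual RSVR ID and CSQ ID layouts are those defined by the authors in the Extended Data and are not reproduced here. The function names are likewise hypothetical.

```python
# Hypothetical bit layout, for illustration only (not the published RSVR layout):
# [ chromosome: 7 bits | position: 32 bits | payload: 24 bits ] = 63 bits,
# so every ID fits in a signed 64-bit integer column.
CHROM_BITS, POS_BITS, PAYLOAD_BITS = 7, 32, 24


def pack_variant(chrom: int, pos: int, payload: int = 0) -> int:
    """Pack a variant into one integer that sorts by (chromosome, position)."""
    if not 0 <= chrom < (1 << CHROM_BITS):
        raise ValueError("chromosome index out of range")
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("position out of range")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload out of range")
    return (chrom << (POS_BITS + PAYLOAD_BITS)) | (pos << PAYLOAD_BITS) | payload


def region_bounds(chrom: int, start: int, end: int) -> tuple:
    """Inclusive ID range covering every variant located in chrom:start-end."""
    low = pack_variant(chrom, start, 0)
    high = pack_variant(chrom, end, (1 << PAYLOAD_BITS) - 1)
    return low, high


# Illustrative subset of Sequence Ontology terms, ordered least to most severe.
CONSEQUENCES = [
    "synonymous_variant",
    "missense_variant",
    "frameshift_variant",
    "stop_gained",
]
BIT = {term: 1 << i for i, term in enumerate(CONSEQUENCES)}


def encode_consequences(terms) -> int:
    """Set one bit per consequence; more severe terms occupy higher bits."""
    mask = 0
    for term in terms:
        mask |= BIT[term]
    return mask


def at_least_as_severe(mask: int, term: str) -> bool:
    """True if the mask holds any consequence at or above the given severity."""
    # Any set bit at or above BIT[term] makes the whole integer >= BIT[term].
    return mask >= BIT[term]
```

Because position occupies higher bits than the payload, a location-based query reduces to a single range comparison on one indexed column (`low <= id <= high`), and a severity threshold reduces to a single integer comparison on the consequence mask.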

BeviMed infers 241 known and 19 unknown genetic associations

We built a Rareservoir, 5.5 GB in size, containing 11.9 million rare exonic and splicing single-nucleotide variants (SNVs) and short insertions or deletions (indels) affecting canonical transcripts of protein-coding genes in Ensembl v.104 (ref. 13 ) from a merged variant call format file (VCF) containing genotype calls for 77,539 participants, including 29,741 probands, in the Rare Diseases Main Programme of the 100KGP (Data Release v.13) (Extended Data Fig. 4 ). During enrollment to the 100KGP, expert clinicians used the clinical characteristics of each affected participant to assign them to one or more of 220 ‘Specific Diseases’. The Specific Diseases are hierarchically arranged into 88 ‘Disease Sub Groups’, each of which belongs to 1 of 20 ‘Disease Groups’. Whereas the eligibility criteria for many Specific Diseases aligned to the same or closely related rare diseases, for others such as ‘Intellectual disability’, the criteria were broader and encompassed diverse genetic etiologies. We generated 269 analytical case sets corresponding to all distinct Specific Diseases and Disease Sub Groups, ranging in size from 5,809 to one proband, and stored them in the Rareservoir (Fig. 1a and Extended Data Figs. 5 and 6 ). We included these two levels of the phenotyping hierarchy to account for heterogeneity in presentation or diagnosis among cases sharing the same genetic etiology, with the aim of boosting power to identify statistical genetic associations.

figure 1

a, Bars showing the size of each case set used for the genetic association analyses grouped by Disease Group and colored by type (Disease Sub Group or Specific Disease). Case sets smaller than five are shown as having size 4 to comply with the 100KGP policy on limiting participant identifiability. The names and sizes of the case sets for an exemplar Disease Sub Group, ‘Cardiovascular disorders’, are shown. b, BeviMed PPAs > 0.95 arranged by Disease Group. Only the strongest association for each gene within a Disease Group is shown. Associations are colored by their PanelApp evidence level (green, amber or red). Associations that were mapped to PanelApp by manual review, rather than using our automatic matching algorithm, are marked with an asterisk (Source Data Fig. 1). Previously unidentified associations are shown in gray. The shape of the points shows whether the association was with a Disease Sub Group (squares) or Specific Disease (circles).

Using the Bayesian statistical method BeviMed 9 , we obtained a posterior probability of association (PPA) between each of the 19,663 protein-coding genes and each of the 269 rare disease classes. BeviMed computes posterior probabilities over a baseline model of no association and competing association models, each of which assumes a particular mode of inheritance (MOI; dominant or recessive) and consequence class of etiological variant (in this study, high impact, moderate impact or 5′ untranslated region (UTR)). The PPA is obtained by summing the posterior probabilities over all association models. The association model with the greatest posterior probability (the modal model) determines the inferred MOI and class of etiological variant. Conditional on an association model, BeviMed models the pathogenicity of each included rare variant. In the model, participants with at least one pathogenic allele (under a dominant MOI) or at least as many pathogenic alleles as the ploidy (under a recessive MOI) have a pathogenic configuration of alleles, which determines their risk of case status. For each rare disease class, we selected a set of unrelated cases based on pedigree information provided by the 100KGP and compared them with participants not in the case set who belonged to different pedigrees and to a maximal set of unrelated participants, also provided by the 100KGP. To account for correlation between case sets, we only recorded the association for each gene having the highest PPA within a given Disease Group. Using a significance threshold of PPA > 0.95, we identified 260 significant associations, 241 of which were documented by the PanelApp gene panel database 14 , an expert-curated and annotated resource containing gene lists with high, medium or low levels of prior supporting evidence of causality for rare diseases (Fig. 1b ). Of the 241 known associations that we identified, 43 (17.8%) were with Disease Sub Groups. 
For example, within each of the nine known genes associated with the Disease Sub Group ‘Posterior segment abnormalities’, the set of cases explained by variants with a conditional posterior probability of pathogenicity > 0.8 comprised participants encompassing multiple Specific Diseases (Extended Data Fig. 7 ). This demonstrates that participants with different Specific Diseases belonging to the same Disease Sub Group sometimes share defects in the same gene, which confirms that treating Disease Sub Groups, not just Specific Diseases, as case sets boosts statistical power.
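The way a PPA and a modal model follow from per-model posterior probabilities, and the rule for a pathogenic configuration of alleles, can be sketched in a few lines of Python. This does not call or reimplement BeviMed; the posterior values below are invented for illustration, and the function names and model labels are hypothetical.

```python
def posterior_summary(posteriors: dict, baseline: str = "no_association"):
    """Return (PPA, modal association model) from per-model posteriors.

    `posteriors` maps model labels to posterior probabilities that sum to 1
    and must include the baseline model of no association.
    """
    total = sum(posteriors.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    association_models = {m: p for m, p in posteriors.items() if m != baseline}
    # The PPA is the total posterior mass on all association models.
    ppa = sum(association_models.values())
    # The modal model fixes the inferred MOI and class of etiological variant.
    modal = max(association_models, key=association_models.get)
    return ppa, modal


def has_pathogenic_configuration(n_pathogenic_alleles: int, moi: str,
                                 ploidy: int = 2) -> bool:
    """Apply the dominant / recessive rule described in the text."""
    if moi == "dominant":
        return n_pathogenic_alleles >= 1
    if moi == "recessive":
        return n_pathogenic_alleles >= ploidy
    raise ValueError("moi must be 'dominant' or 'recessive'")


# Invented posteriors for a single gene and rare disease class.
example = {
    "no_association": 0.02,
    "dominant/high": 0.90,
    "dominant/moderate": 0.05,
    "recessive/high": 0.02,
    "recessive/moderate": 0.01,
}

ppa, modal = posterior_summary(example)
print(f"PPA = {ppa:.2f}, modal model = {modal}, significant = {ppa > 0.95}")
```

Passing `ploidy=1` covers hemizygous loci, where a single pathogenic allele already satisfies the recessive rule.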

Of the 241 associations identified as previously known according to PanelApp, 237 (98.3%) had an inferred MOI that was consistent with the MOIs listed for the relevant gene. Of these, the consistent MOI was found in the matched panel (223 associations), in the notes for the matched panel (5 associations) or in the MOIs listed for an alternative relevant panel (9 associations) in PanelApp (Source Data Fig. 1 ). This provided independent evidence that the genetic associations we labeled as known (without reference to MOI information) are genuinely supported by evidence in the literature, further demonstrating the accuracy of BeviMed’s inference. Of the four known associations with an inferred MOI that was incongruous with PanelApp, two had supporting evidence for the inferred MOI in the literature that was absent from PanelApp: EDA with dominant ‘Ectodermal dysplasia without a known gene mutation’ 15 and AICDA with dominant ‘Primary immunodeficiency’ 16 . The two associations with an MOI that was unsupported in the literature were between UCHL1 and dominant ‘Inherited optic neuropathies’ and between SLC39A8 and dominant ‘Intellectual disability’.

Among 5,253 of the probands included in our analysis, the table of clinically reported variants available from the 100KGP Rare Diseases Main Programme at the time of this study comprised 4,907 distinct variants that had been classified as pathogenic or likely pathogenic in 1,863 genes. For 855 of these genes, etiological variants had been reported for only one family, suggesting that many genes that are etiological in the 100KGP are not identifiable by statistical association. Nevertheless, across the 260 associations identified by BeviMed, 2,536 distinct rare variants had a posterior probability of pathogenicity > 0.8 conditional on the modal model and were observed as part of a pathogenic configuration of alleles in a case (Source Data Fig. 1 ). Interestingly, among the subset of 2,485 variants contributing to the 241 known associations, only 1,604 featured in the table of clinically reported variants.

We found 19 previously unidentified genetic associations. To select a shortlist for further investigation, we assigned a plausibility score (range 0–3) based on three sources of additional evidence (Table 1 ). First, we considered evidence of purifying selection from gnomAD v.2.1.1. Any dominant associations with high-impact variants in a gene having a probability of loss-of-function intolerance (pLI) >0.9 or with moderate-impact variants in a gene having a Z score >2 were considered to be supported by population genetic metrics of purifying selection. To avoid disadvantaging recessive associations, which are unlikely to leave a detectable signature of purifying selection in gnomAD even if genuine, they were considered to be supported by default. Second, we considered cosegregation data: any association for which variants having a posterior probability of pathogenicity conditional on the modal model >0.8 tracked with case status in at least three additional family members and for which no affected relatives lacked the pertinent variants were considered to be supported by cosegregation. Third, we performed a comprehensive review of the literature for each gene and made a subjective assessment of whether an association was supported by biological function or previously known disease associations for related genes. In total, three genetic associations had a plausibility score of three and were, therefore, investigated further by gathering additional experimental evidence and looking for replication in other sequenced rare disease collections.
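The three-part plausibility score can be written as a small function. The sketch below follows the thresholds given in the text; the field names are hypothetical, the literature assessment is reduced to a boolean because it was subjective, and treating dominant 5′ UTR associations as unsupported by purifying selection is an assumption, since the text does not state a rule for that class.

```python
from dataclasses import dataclass


@dataclass
class Association:
    moi: str                                  # "dominant" or "recessive"
    variant_class: str                        # "high", "moderate" or "5_prime_utr"
    pli: float                                # gnomAD pLI for the gene
    z_score: float                            # gnomAD Z score for the gene
    cosegregating_relatives: int              # additional relatives in whom variants track with case status
    affected_relatives_without_variant: int   # affected relatives lacking the variants
    literature_support: bool                  # outcome of the subjective literature review


def plausibility_score(a: Association) -> int:
    """Score from 0 to 3: one point per supporting source of evidence."""
    if a.moi == "recessive":
        # Recessive associations are considered supported by default.
        selection = True
    else:
        selection = (
            (a.variant_class == "high" and a.pli > 0.9)
            or (a.variant_class == "moderate" and a.z_score > 2)
        )
    cosegregation = (
        a.cosegregating_relatives >= 3
        and a.affected_relatives_without_variant == 0
    )
    return int(selection) + int(cosegregation) + int(a.literature_support)


# Invented example: a dominant, high-impact association in a constrained gene.
candidate = Association("dominant", "high", 0.99, 0.0, 3, 0, True)
print(plausibility_score(candidate))  # 3
```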

Variants in ERG are responsible for primary lymphoedema

BeviMed identified a dominant genetic association between high-impact variants in ERG and the Specific Disease ‘Primary lymphoedema’, a group of genetic conditions caused by abnormal development of lymphatic vessels or failure of lymphatic function 17 , 18 . Three such variants were responsible for the high PPA, with locations ranging from codon 182 to 463 on the canonical Ensembl transcript ENST00000288319.12. One of the probands had two unaffected parents without the variant allele—one sequenced by the 100KGP and the other by Sanger sequencing—suggesting that the truncating heterozygous variant had appeared de novo. A participant in a fourth family who had been enrolled to the 100KGP for an unrelated condition also carried a predicted loss-of-function variant in ERG . Upon manual chart review, this participant had features associated with this unrelated condition but additional features consistent with primary lymphoedema, providing internal replication within the discovery cohort (Fig. 2a ).

figure 2

a , Pedigrees for the four probands with loss-of-function variants in the canonical transcript of ERG , ENST00000288319.12. Hom. ref., homozygous reference. b , Truncated bar chart showing the distribution of the number of reads supporting the p.S182Afs*22 alternate allele in the 100KGP. The embedded windows show the read pileups at this position in the two affected members of the family with the variant encoding p.S182Afs*22 (het., heterozygous genotype call). The reads supporting the reference allele are in blue and those supporting the variant allele are in red. c , Schematic showing the effects of each variant at the cDNA and amino acid level and on the protein product with respect to the canonical transcript. PNT, pointed domain; ETS, Erythroblast Transformation Specific DNA binding domain; AA, amino acid. d , Reverse transcription PCR amplification of ERG mRNA in HDLECs relative to HUVECs. Data are normalized to GAPDH. Statistical significance was assessed using a two-sided Student’s t test. NS, not significant ( P  = 0.39). e , Immunoblot (representative of two replicates) of HUVEC and HDLEC protein lysates identified several bands corresponding to ERG isoforms expressed at similar intensities in both cell types. f , Immunofluorescence microscopy (representative of three replicates) of HDLECs shows ERG (green) nuclear colocalization with the lymphatic endothelial cell nuclear marker PROX1 (violet) and the nuclear marker DAPI (blue). HDLEC junctions are shown using an antibody to VE-cadherin (yellow). Scale bar, 50 µm. g , En face immunofluorescence confocal microscopy (representative of five replicates) of mouse ear skin. Vessels are stained with antibodies to the lymphatic marker PROX1 (violet) and ERG (green). Scale bar, 100 µm. h , Exemplar immunofluorescence microscopy image of HEK293 cells overexpressing wild-type ERG and the p.T224Rfs*15 variant ERG . Cells were stained for ERG (green) and nuclear marker DAPI (blue). Scale bars, 20 μm. 
The brightness is optimized for print. i, Dot plot of the estimated proportion of ERG not overlapping the nuclear marker DAPI in each of a set of immunofluorescence microscopy images of HEK293 cells overexpressing different ERG cDNAs (20 replicates for the wild type (WT), 17 replicates per tested mutant). The estimated proportions were significantly higher in each of the variants compared with WT: P = 1.52 × 10⁻¹¹, 4.10 × 10⁻¹³ and 3.03 × 10⁻⁵ for each of p.S182Afs*22, p.T224Rfs*15 and p.A447Cfs*19, respectively (two-sided Student’s t tests).

The affected father of the proband with the variant encoding p.S182Afs*22 was called homozygous for the reference allele, initially suggesting a lack of cosegregation of the variant with the disease in that pedigree. However, a review of the GS read alignments for the father revealed that 2 of the 48 reads overlapping that position supported the alternative allele. Specifically, these reads contained a deletion of a single G within the central poly-G tract of the motif ‘AGCTGGGGGTGAG’. To assess whether this could be the result of erroneous sequencing, we counted the number of such reads in the 77,539 genomes in the 100KGP and found that the proband and the father were the only two with more than one such read. This indicated that these reads in the father were unlikely to be erroneous and that he was instead mosaic (Fig. 2b), consistent with the observation that his lymphoedema became clinically apparent over two decades later than his daughter’s, indicating milder disease. A further 130 samples collected through the 100KGP had a single read containing the deletion. This number was consistent with observations in the 80 other exonic loci that contain the same 13-base pair (bp) motif (mean 99.67 samples, range 4–149 samples), suggesting that, rather than being mosaic, the 130 samples contained individual sequencing errors. Furthermore, none of the participants who gave these samples had been assigned the Specific Disease ‘Primary lymphoedema’.
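The reasoning in this paragraph rests on three simple checks: the fraction of the father's reads supporting the deletion, which samples carry more than one supporting read, and whether the number of single-read samples falls inside the range seen at comparable loci. A minimal Python sketch follows. The function names are hypothetical, and the toy cohort is invented apart from the father's 2 of 48 reads and the summary figures quoted from the text.

```python
def alt_read_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that support the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def samples_with_at_least(alt_reads_by_sample: dict, min_reads: int = 2) -> list:
    """Samples with at least `min_reads` reads supporting the alternate allele."""
    return sorted(s for s, n in alt_reads_by_sample.items() if n >= min_reads)


def within_background(n_single_read_samples: int, low: int, high: int) -> bool:
    """True if a single-read sample count lies inside the background range."""
    return low <= n_single_read_samples <= high


# Invented toy cohort; only the father's count (2 of 48 reads) is from the text.
toy_counts = {"proband": 20, "father": 2, "sample_a": 1, "sample_b": 1, "sample_c": 0}

print(samples_with_at_least(toy_counts))    # ['father', 'proband']
print(round(alt_read_fraction(2, 48), 3))   # 0.042
# 130 single-read samples versus the reported range of 4-149 at 80 other loci.
print(within_background(130, 4, 149))       # True
```

An alternate-allele fraction of about 4% is far below the roughly 50% expected for a constitutional heterozygous variant, which is why the call was homozygous reference even though the supporting reads were judged genuine.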

ERG encodes a critical transcriptional regulator of blood vessel endothelial cell gene expression 19 that is essential for normal vascular development 20. However, little is known about the contribution of ERG to lymphatic development or how primary lymphoedema could arise from loss-of-function ERG variants that affect different parts of the ERG protein (Fig. 2c). Total cellular expression of ERG detected by real-time quantitative polymerase chain reaction (PCR) in purified RNA and by immunoblotting of protein extracts was the same in primary human dermal lymphatic endothelial cells (HDLECs) as in human umbilical vein endothelial cells (HUVECs) (Fig. 2d,e, respectively). Moreover, immunofluorescence microscopy of cultured HDLECs showed that ERG expression colocalized with the lymphatic endothelial cell nuclear marker PROX1 (Fig. 2f), a finding confirmed in vivo by immunostaining whole mounts of ear skin from mice at 3 weeks after birth (Fig. 2g). The positions of the p.S182Afs*22 and p.T224Rfs*15 variants suggest nonsense-mediated decay and haploinsufficiency as a possible disease mechanism. The other two variants, however, are located in the final exon of ERG and may, therefore, evade nonsense-mediated decay. We studied both types of variant in more detail to explore potential disease mechanisms. In HEK293 cells, which do not express endogenous ERG, overexpression of wild-type ERG cDNA recapitulated the nuclear expression pattern observed in the HDLEC and mouse ear skin models. However, overexpression of ERG mutant cDNAs resulted in mislocalization of ERG outside of the nucleus, in the cytosol (Fig. 2h,i and Extended Data Fig. 8), preventing it from binding to DNA and exerting its function as a transcription factor 21. Together, these data confirm high levels of ERG expression within the nuclei of the lymphatic endothelium consistent with a transcription regulatory function during lymphangiogenesis. 
They also suggest that in the primary lymphoedema cases, defective lymphangiogenesis may result from reduced ERG availability in the nucleus because of either haploinsufficiency resulting from nonsense-mediated decay or mislocalization.

Variants in PMEPA1 result in Loeys–Dietz syndrome

BeviMed identified a dominant genetic association between high-impact variants in PMEPA1 and the Specific Disease ‘Familial thoracic aortic aneurysm disease’ (FTAAD). The variant with the highest conditional probability of pathogenicity was an insertion of one cytosine within a seven-cytosine stretch in the last exon of the canonical Ensembl transcript ENST00000341744.8. This variant, which is predicted to induce a p.S209Qfs*3 frameshift, was observed in three FTAAD pedigrees of European ancestry in the 100KGP discovery cohort. We replicated the association in three additional collections of cases. First, the same variant was identified independently in eight affected members of three pedigrees of Japanese ancestry in a separate Japanese patient group. Second, a single-cytosine deletion within the same polycytosine stretch as the previous variant, and encoding p.S209Afs*61, was found in an FTAAD case enrolled in a separate collection of 2,793 participants in the 100KGP Pilot Programme. Lastly, we identified a family in Belgium wherein the affected members carried a 5-bp deletion in the same stretch of polycytosines inducing a frameshift two residues upstream of the other two variants (p.P207Qfs*3).

All pedigrees exhibited dominant inheritance of aortic aneurysm disease with incomplete penetrance and skeletal features including pectus deformity, scoliosis and arachnodactyly with complete penetrance, which cosegregated with the respective variants in genotyped participants (Fig. 3a ). To assess whether PMEPA1 families affected by FTAAD form a phenotypically distinct subgroup, we analyzed the Human Phenotype Ontology (HPO) terms assigned to the 593 FTAAD families in both programs of the 100KGP. Using a permutation-based method 22 , 23 based on the semantic similarity measure of Resnik et al. 24 , we found that the four 100KGP PMEPA1 families were significantly more similar to each other than to other FTAAD families chosen at random (P = 5.7 × 10⁻³). To characterize the PMEPA1 phenotype in greater detail, we compared the prevalence of each of the HPO terms in the minimal set of terms present in at least three of the four families with the prevalence in the other FTAAD families. We identified four HPO terms related to the musculoskeletal system that were significantly enriched (Fig. 3b ), echoing the phenotypic characteristics of the syndromic aortopathy Loeys–Dietz syndrome 25 , 26 .

figure 3

a , Pedigrees for the three probands in the 100KGP (discovery cohort) heterozygous for the frameshift insertion predicting p.S209Qfs*3 and probands from replication cohorts, including one from the 100KGP Pilot Programme heterozygous for the frameshift deletion predicting p.S209Afs*61, three of Japanese ancestry heterozygous for p.S209Qfs*3 and one Belgian pedigree heterozygous for a frameshift deletion encoding p.P207Qfs*3. All variant consequences are shown with respect to the canonical transcript of PMEPA1 , ENST00000341744.8. b , HPO terms present in at least three of the four PMEPA1 FTAAD families, excluding redundant terms within each level of frequency, alongside their frequency in four PMEPA1 FTAAD families and the other 589 unexplained FTAAD families. Terms are ordered by P values obtained by a Fisher exact test of association between the term’s presence in an FTAAD family and whether the family is one of the four PMEPA1 families. Terms were declared significant (indicated by an asterisk) or not significant (NS) by comparing their Fisher test P values and rank with a null distribution of equivalent pairs obtained by permutation (10,000 replicates). For each rank, the P value of the term on the fifth percentile was used as an upper bound for declaring an association significant, provided all terms at higher ranks were also significant. The P values for each term were as follows: ‘Dolichocephaly’, P = 2.9 × 10⁻⁴; ‘Abnormal axial skeleton morphology’, P = 6.7 × 10⁻³; ‘Striae distensae’, P = 0.013; ‘Pes planus’, P = 0.014; ‘Ascending tubular aorta aneurysm’, P = 0.62. c , Graph showing PMEPA1 and genes with high evidence (green) of association with FTAAD in PanelApp. Edges connect genes where the string-db v.11.5 27 confidence score for physical interactions between corresponding proteins was >0.6. Genes known to be associated with Loeys–Dietz syndrome are highlighted in blue. PMEPA1 is highlighted in yellow. 
d , Schematic showing the effects of each variant at the cDNA and amino acid level and on the protein product.

To understand the molecular mechanisms underlying this defect, we examined the protein–protein interactions 27 for PMEPA1 and the complete set of high-confidence genes in the ‘Thoracic aortic aneurysm or dissection’ PanelApp panel. PMEPA1 encodes a negative regulator of transforming growth factor-β (TGFβ) signaling 28 , a pathway previously implicated in multiple aortopathies, including Loeys–Dietz syndrome 29 . The genes underlying known forms of Loeys–Dietz syndrome encode part of a tightly interacting subgroup of proteins in the TGFβ pathway, in which there is a direct interaction between the proteins encoded by SMAD2 , SMAD3 and PMEPA1 (Fig. 3c ). As the candidate variants all occur in the last exon of the transcript, they are likely to evade nonsense-mediated decay 30 . However, their truncating effects are predicted to remove a PPxY interaction motif while leaving the SMAD interaction motif intact (Fig. 3d ), possibly affecting binding between PMEPA1 and SMAD2/3 and altering TGFβ signaling through a gain-of-function mechanism.

Variants in GPR156 lead to recessive congenital hearing loss

BeviMed identified a recessive genetic association between high-impact variants in GPR156 and the Specific Disease ‘Congenital hearing impairment’. Two high-impact variants in GPR156 were responsible for the strong evidence of association: a 1-bp insertion predicting p.S207Vfs*113 and a 1-bp insertion predicting p.P718Lfs*86 with respect to the canonical Ensembl transcript ENST00000464295.6. One family contained two affected siblings who were both homozygous for the p.S207Vfs*113 variant inherited from heterozygous parents. In a second family, there were also two affected siblings, in this case compound heterozygous for the same p.S207Vfs*113 variant that was maternally inherited and a different p.P718Lfs*86 variant that was paternal. Using GeneMatcher 31 , we identified a third pedigree from Saudi Arabia with biallelic truncating variants in GPR156 . This consanguineous pedigree contained four siblings with hearing impairment, all of whom were homozygous for a variant predicting p.S642Afs*162 (Fig. 4a ). The eight affected individuals in these three families all had congenital nonsyndromic bilateral sensorineural hearing loss (see Extended Data Fig. 9 for illustrative audiograms).

figure 4

a , Schematic of the three pedigrees with cases homozygous or compound heterozygous for loss-of-function variants in the canonical transcript of GPR156 , ENST00000464295.6. Blank symbols indicate individuals with an unknown genotype. b , Histograms of expression log fold changes for different sets of genes in mouse hair cells compared with surrounding cells: all mouse genes (left) and mouse genes homologous to their human counterparts in the ‘Hearing loss’ PanelApp panel, stratified by whether they had a stereocilia-related Gene Ontology (GO) term (that is, a term whose name contained ‘stereocilia’ or ‘stereocilium’ or the descendant of such a term) (right). The log fold change for Gpr156 is shown as a horizontal line. c , Maximum intensity projections of confocal Z stacks in the organ of Corti and vestibular system of a P10 wild-type mouse immunostained with GPR156 antibody (green) and counterstained with phalloidin (red). Top row, overview of the organ of Corti and vestibular system. Middle and bottom rows, magnified images of outer hair cells and inner hair cells, respectively. No stereociliary bundle staining was observed. The punctate staining observed in the organ of Corti was absent or significantly decreased in the utricle of the vestibular system. Scale bars, 10 μm (each image is representative of three replicates). d , Schematic showing the effects of each variant at the cDNA and amino acid level and on the protein product. e , Exemplar western blot taken from three replicates of GPR156–GFP using anti-GPR156 antibody in untransfected Cos7 cells (Cos7); Cos7 cells transfected with the wild-type construct (WT); and Cos7 cells transfected with the constructs containing each of the mutant alleles p.S642Afs*162 (S642), p.P718Lfs*86 (P718) and p.S207Vfs*113 (S207).

GPR156 encodes probable G protein-coupled receptor 156, which has sequence homology to the class C GABAB receptors 32 . Although previously designated as an orphan receptor, GPR156 has recently been identified as a critical regulator of stereocilia orientation on hair cells of the auditory epithelium and other mechanosensory tissues 33 . Its expression is highly restricted to hair cells in the inner ear 34 . Disruption of stereocilia is a common pathogenic mechanism underlying many human Mendelian hearing loss disorders 35 , and the overexpression of GPR156 in hair cells relative to surrounding cells was commensurate with the overexpression of the 21 genes currently implicated in hearing impairment having a Gene Ontology term relating to stereocilia (Fig. 4b ). By immunostaining of the organ of Corti and vestibular system from wild-type mice, we found that GPR156 strongly colocalizes with actin at the apical surface of the outer and inner hair cells of the organ of Corti (Fig. 4c ).

The p.S207Vfs*113 variant is located in the sixth of 10 exons of GPR156 and is therefore predicted to abolish expression through nonsense-mediated decay of the GPR156 mRNA. In contrast, the p.S642Afs*162 and p.P718Lfs*86 variants both occur within the final GPR156 exon and likely result in expression of abnormal GPR156 with an altered amino acid sequence and premature truncation of the cytoplasmic tail (Fig. 4d ). To determine the effect of the variants on protein expression, we transfected Cos7 cells, which do not express GPR156 endogenously, with constructs containing cDNAs for wild-type GPR156 or GPR156 containing each of the three mutant alleles, tagged with a green fluorescent protein (GFP) reporter. While cells transfected with wild-type sequence expressed GPR156–GFP fusion protein robustly, cells transfected with the mutant constructs either did not express the protein appreciably or exhibited markedly reduced expression, suggesting that all three of the truncated proteins are degraded (Fig. 4e ). These data suggest that the biallelic chain-truncating variants in GPR156 cause congenital hearing loss by preventing expression of GPR156 protein, thereby disrupting stereocilia formation in the auditory epithelium.

The standardization of GS within a health care system, together with powerful frameworks for genetic and phenotypic data processing and statistical analysis, promises to advance the resolution of the remaining unknown etiologies of rare diseases. We have developed a lightweight and easily deployable RDB, the Rareservoir, for genetic analysis of rare diseases using approaches such as BeviMed. In one unified analysis, we identified 260 associations, of which 241 had been published previously in a body of work spanning several decades of genetics research. Our results give an upper bound on the false discovery rate of 7.3%. In contrast, a recent analysis of 57,000 samples in the 100KGP reported 249 known and 579 previously unidentified associations 36 , giving an upper bound on the false discovery rate of 70%, which suggests that our analytical approach has a greater specificity for a given sensitivity. The associations spanned 86 disease classes across a wide range of organ systems. Interestingly, only 64% of the variants contributing substantially to the known associations were present in the table of clinically reported variants available at the time of this study. This suggests that, as cohorts grow larger, the results of statistical inference could help guide the clinical reporting process. The case sets we used in our genetic association analysis were based on the formal disease classifications used by the 100KGP. Some of the case sets, such as ‘Intellectual disability’ (5,529 probands), are particularly large and likely to be highly genetically heterogeneous, potentially limiting the power of our analyses. Careful partitioning of heterogeneous case sets using individual-level HPO terms 6 has the potential to boost power. Of the 19 previously unidentified associations, we shortlisted, replicated and validated three. These three etiologies involve genes that had not previously been implicated in any of these human diseases. 
The remaining 16 associations include further plausible hypotheses. For example, LRRC7 , which we identified to be associated with intellectual disability, encodes a brain-specific protein in postsynaptic densities 37 , and LRRC7-deficient mice exhibit a neurobehavioral phenotype 38 . USP33 , which we found to be associated with early-onset hypertension, encodes a deubiquitinating enzyme implicated in regulating the β2-adrenergic receptor 39 . These and other candidates will require replication and validation before they can be considered causative genes.
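The false discovery rate bounds quoted in this discussion follow from simple arithmetic: if every previously unpublished association is conservatively treated as a potential false discovery, the bound is the unpublished fraction of all reported associations. The short sketch below reproduces the two figures from the counts given in the text; the function name is illustrative only.

```python
def fdr_upper_bound(n_known: int, n_novel: int) -> float:
    """Upper bound on the FDR, treating every unpublished association as potentially false."""
    return n_novel / (n_known + n_novel)


# This study: 260 associations, of which 241 were previously published.
this_study = fdr_upper_bound(n_known=241, n_novel=260 - 241)

# Comparison analysis cited in the text: 249 known and 579 previously unidentified.
comparison = fdr_upper_bound(n_known=249, n_novel=579)

print(f"{this_study:.1%}")   # 7.3%
print(f"{comparison:.1%}")   # 69.9%, reported as 70% in the text
```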

The present study has several limitations. First, approximately 82% of the participants in the 100KGP are of European ancestry. While this percentage is in line with the proportion of residents in England and Wales reporting their ethnic group as white in the 2011 UK census (86%), its large magnitude constrains power to identify causative variants specific to other ancestry groups. Second, of the 269 case sets analyzed, 28 contained fewer than five probands, limiting power to identify the causes of the corresponding disease classes and highlighting the need for continued enrollment of patients with ultra-rare disorders. Third, we have only considered SNVs and indels in coding genes. The exploration of structural variants and of rare variation in noncoding genes and in regulatory elements of the genome may help identify further etiologies. Lastly, we focused our attention on monogenic models of rare disorders, even though the genetic etiologies of certain rare diseases may be polygenic. In addition, important variation in clinical presentation of monogenic disorders may be explained by polygenic effects. These limitations point toward multiple promising avenues of future research to uncover the remaining unknown genetic determinants of rare diseases.

Motivation for developing a sparse RDB

Computational approaches for discovering the etiologies of rare diseases typically depend on the analysis of a heterogeneous set of files, each of which can be very large and follow a distinct convention. Genotypes, for example, are ordinarily stored in VCFs containing data for one sample or for multiple samples. In the latter case, the data are usually distributed in files covering many different ‘chunks’ of the reference genome. Variant-level information, such as consequence predictions or pathogenicity scores, is typically encoded in strings that require extensive parsing to decode, either from within the VCFs containing the genotypes or in separate files. Modifying genotype or annotation files (for example, to incorporate newly generated data) requires rewriting files in their entirety. Phenotype data, pedigree data and the results of statistical inference are stored in a further set of files. Consequently, analyses are often burdensome to conduct and prone to error. Frameworks, such as Hail 7 and OpenCGA 8 , afford greater flexibility, but they depend on the centrally organized deployment of a distributed storage system, hindering usability and portability.

RDBs are widely used, mature technologies, well known for their speed, reliability, flexibility, structure and extensibility. In the context of rare diseases, an RDB can in principle render the modification, combination and addition of data on samples, variants, genes and other entities efficient, reliable and straightforward to implement using a single query language. Unfortunately, the performance of RDBs degrades substantially when the number of records in a table reaches several billion, and the number of genotypes in a cohort the size of the 100KGP easily surpasses this threshold. However, the MAFs of pathogenic variants with strong effects on rare disease risk are typically kept below 1/1,000 by negative selection, and the proportion of nonhomozygous reference genotypes for variants within that MAF stratum is only about 1% of the total (Extended Data Fig. 1 ). Consequently, it is possible to construct a compact RDB that includes virtually all the pathogenic variants even in a large cohort such as the 100KGP. This provides an opportunity for exploiting the benefits of a single unified RDB containing the nonhomozygous genotypes of rare variants upon which to conduct the entirety of the etiological discovery process. Furthermore, it provides a natural foundation for developing web applications for the multidisciplinary review of genetic, phenotypic, statistical and other data.

Rareservoir

The Rareservoir is an RDB schema and a complementary software package rsvr for working with rare disease data. The database stores data including rare variant genotypes, variant annotations, phenotypes, sample information and pedigrees (Extended Data Fig. 1 ), but it can be extended arbitrarily. A Rareservoir is built through a series of steps from a set of input data and parameters (Extended Data Fig. 3 ). The ‘bcftools’ program 47 extracts (‘bcftools view’) and normalizes (‘bcftools norm’) variants from either a set of single-sample genome variant call format files (gVCFs) or a merged VCF. In all steps of the procedure, variants are encoded as RSVR IDs using the ‘rsvr enc’ tool. Merged VCFs typically contain cohort-wide variant quality information in the FILTER column, which can be used to select variants for processing. However, this is not readily obtained from single gVCFs. To address this, we developed the ‘rsvr depth’ tool, which computes variant quality pass rates at all positions in the genome based on a random subsample of gVCFs. If the input is a merged VCF, an internal (that is, within-VCF) allele frequency threshold is applied with bcftools to filter out internally common variants. If the input is a set of single-sample gVCFs, internally common variants are filtered out in two steps, for computational efficiency. First, the ‘rsvr tabulate’ tool identifies a set of variants that are statistically almost certain to be common based on a random sample of gVCFs: by default, the variants for which a one-sided binomial test of the null hypothesis that the MAF = 0.01 is rejected at a significance level of 10⁻⁶. Second, all gVCFs are read sequentially, filtering out the variants identified in the previous step (using the ‘rsvr mix’ tool) and those for which the pass rates identified with ‘rsvr depth’ do not meet the threshold. 
Retained genotypes are then loaded into a temporary genotype table in the database in order to apply the final internal allele frequency filter by executing an SQL ‘DELETE’ statement. These variants are then annotated with gnomAD ‘probabilistic minor allele frequency’ (PMAF) scores 3 using the ‘rsvr pmaf’ tool. The PMAF score is calculated with respect to a given allele frequency threshold t by evaluating a binomial test (at a significance threshold of 0.05) on the observed frequency of the variant under the null hypothesis that the variant has an allele frequency of t . If in any gnomAD population, the null is rejected for t  = 0.001 and the allele count is at least two, the score is set to zero. If the null is rejected for t  = 0.0001, the score is set to one. If the null is not rejected, the score is set to two. Finally, if the variant is absent from gnomAD, the score is set to three. For the nonpseudoautosomal regions of chromosome X, only allele counts for males are used in calculations. Variants are then additionally annotated with their CADD phred scores using the ‘rsvr ann’ program and loaded into the VARIANT table. At this point, variants in the VARIANT and GENOTYPE tables that have a PMAF score of zero may be deleted because they are unlikely to be involved in rare diseases. We then annotate the retained variants with predicted transcript consequences for a given set of transcripts specified in a Gene Transfer Format file. The 100KGP Rareservoir uses Ensembl v.104 canonical transcripts with a protein-coding biotype, of which >90% are MANE (Matched Annotation from National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI)) 48 transcripts. The ‘rsvr seqfx’ program determines a set of SO terms for each interacting transcript–variant pair and encodes them as a CSQ ID, which is added to the CONSEQUENCE table. 
This table can also hold Loss-Of-Function Transcript Effect Estimator (LOFTEE) scores 10 corresponding to a transcript–variant pair. Note that, as LOFTEE scores on the Genomics England Research Environment correspond to Ensembl v.99 transcripts, we mapped Ensembl v.104 canonical transcripts to the most similar v.99 transcripts having an identical coding sequence in order to obtain the LOFTEE scores for the 100KGP Rareservoir, finding a match for >98% of transcripts. The contents of the Gene Transfer Format file are also imported into the database to create tables of transcript features (FEATURE), transcripts (TX) and genes (GENE). Optionally, VARIANT, GENOTYPE and CONSEQUENCE may be filtered for RSVR IDs that have CSQ IDs meeting particular criteria: for instance, to retain only variants with protein-coding consequences. The SAMPLE table of metadata and genetic statistics for each sample represented in the input VCF(s) must then be added to the database, including mandatory columns containing the ID, sex, family and an indicator of inclusion in the maximal unrelated set of samples in the database. The VARIANT, GENOTYPE and CONSEQUENCE tables are indexed by RSVR ID to support fast lookups by genomic location. The SAMPLE table and GENOTYPE table are indexed by sample ID, allowing fast lookups by sample. The CONSEQUENCE, TX and GENE tables are indexed by transcript and gene ID, allowing fast lookups of variants based on gene/transcript-specific consequences. If sample phenotypes have been encoded using phenotypic terms (for example, International Classification of Diseases 10 (ICD10) codes or HPO terms), terms from the relevant coding systems can be added to a generic PHENOTYPE table mapping code IDs to descriptions, and codes assigned to samples can be added to the SAMPLE_PHENOTYPE table. Disease labels may be added to the CASE_SET table. 
The majority of the compute time required for building the database is taken by reading the genotype data from the input VCFs, which may be executed in parallel over separate regions against a merged VCF or over single gVCFs. The rsvr tool, implemented in C++, executes rapidly, with ‘rsvr seqfx’ capable of assigning CSQ IDs for all Ensembl v.104 canonical transcripts to all variants (over 685 M) in gnomAD v.3.0 in under 40 min on a single core. The 100KGP Rareservoir, which is stored in an SQLite database, returns complex gene-specific queries in under 1 s. For example, (1) a table with 628 rows containing the moderate- and high-impact variants with a PMAF score ≥1 in TTN along with the corresponding consequence predictions and CADD scores takes 0.57 s; (2) a table with 1,498 rows containing, for each variant, the samples and genotypes for individuals who carry an alternate allele takes 0.61 s; and (3) a classification for each of the 77,539 participants into proband with the Specific Disease ‘Dilated cardiomyopathy’, relative of such a proband, unrelated control, or relative of a control takes 0.65 s. Specific details on implementation of the workflow, code for encoding data as SQL statements compatible with Rareservoir and the mapping between bits in the 64-bit CSQ ID and each SO term assigned by rsvr seqfx can be found in the rsvr software package (see the code availability). Software packages rsvr 1.0, bcftools 1.9 and perl 5 were used to build the 100KGP Rareservoir.
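As an illustration of the kind of gene-specific query described above, the following minimal sketch builds a toy SQLite database with a handful of the tables named in the text and looks up carriers of rare variants in one gene. Only the table names come from the text; all column names, index names and example rows are assumptions made for this sketch and do not reproduce the actual Rareservoir schema distributed with rsvr.

```python
import sqlite3

# Hypothetical, simplified schema: column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE variant     (rsvr_id INTEGER PRIMARY KEY, pmaf INTEGER, cadd REAL);
CREATE TABLE genotype    (rsvr_id INTEGER, sample_id INTEGER, gt INTEGER);
CREATE TABLE consequence (rsvr_id INTEGER, tx_id TEXT, csq_id INTEGER);
CREATE TABLE tx          (tx_id TEXT PRIMARY KEY, gene_id TEXT);
CREATE TABLE gene        (gene_id TEXT PRIMARY KEY, name TEXT);
CREATE INDEX genotype_by_variant    ON genotype (rsvr_id);
CREATE INDEX genotype_by_sample     ON genotype (sample_id);
CREATE INDEX consequence_by_variant ON consequence (rsvr_id);
CREATE INDEX consequence_by_tx      ON consequence (tx_id);
"""


def carriers_in_gene(conn, gene_name, min_pmaf=1):
    """Return (sample_id, rsvr_id, gt, cadd) for carriers of rare variants in a gene."""
    query = """
        SELECT g.sample_id, g.rsvr_id, g.gt, v.cadd
        FROM gene ge
        JOIN tx t          ON t.gene_id = ge.gene_id
        JOIN consequence c ON c.tx_id = t.tx_id
        JOIN variant v     ON v.rsvr_id = c.rsvr_id
        JOIN genotype g    ON g.rsvr_id = v.rsvr_id
        WHERE ge.name = ? AND v.pmaf >= ?
        ORDER BY g.sample_id, g.rsvr_id
    """
    return conn.execute(query, (gene_name, min_pmaf)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO gene VALUES ('G1', 'TTN')")
conn.execute("INSERT INTO tx VALUES ('T1', 'G1')")
conn.executemany("INSERT INTO variant VALUES (?, ?, ?)",
                 [(101, 2, 25.0), (102, 0, 12.0)])
conn.executemany("INSERT INTO consequence VALUES (?, ?, ?)",
                 [(101, "T1", 1), (102, "T1", 1)])
# Only non-homozygous-reference genotypes are stored (gt = alternate allele count).
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)",
                 [(101, 7, 1), (102, 7, 1), (102, 8, 2)])

rows = carriers_in_gene(conn, "TTN")
print(rows)
```

Because homozygous reference genotypes are implied by the absence of a row, the genotype table stays small enough for joins of this kind to remain fast.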

Encoding RSVR IDs

SNVs and indels may be encoded as 64-bit integers called RSVR IDs. To compute an RSVR ID for a given variant, the following expression is evaluated:

where c is the chromosome number (using 23, 24 and 25 to represent chromosomes X, Y and MT, respectively); p is the position; and | r | and | a | are the lengths of the reference and alternate alleles, respectively. A is a sequence identical to the alternate allele, a , when its length is less than 10 and otherwise consists of the first five followed by the last four elements of a . In the summation, nucleotides are assigned the values A = 0, C = 1, G = 2 and T = 3. The expression evaluates to integers that can be represented using 63 bits, so the most significant bit is set to zero when encoding as 64-bit integers. The chromosome, position, reference allele length, alternate allele length and alternate allele bases are thereby encoded, respectively, by the subsequent 5, 28, 6, 6 and 18 bits (with 2 bits per base for the alternate allele). This procedure and its inverse are implemented in the ‘rsvr enc’ and ‘rsvr dec’ programs, respectively. The reference and alternate alleles of input variants are first normalized by removing any redundant identical sequence from the starts and then from the ends. The proportion of variants in gnomAD v.3.0, weighted by allele count, that can be encoded losslessly is 99.3%, while 99.8% can be represented by a distinct RSVR ID. The full variant information corresponding to any ambiguous RSVR ID encountered may be stored in an additional table. Structural variants that can be represented by a position and length may also be encoded using distinct 64-bit RSVR IDs alongside SNVs and indels by setting the most significant bit to one and then encoding the type of structural variant using 2 bits (deletion 0, duplication 1, inversion 2, insertion 3), the chromosome using 5 bits (as done for SNVs and indels), and the start and length consecutively using 28 bits each.
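The bit layout described above can be sketched in a few lines. The field widths (5, 28, 6, 6 and 18 bits) and the nucleotide codes come from the text; the order of bases within the 18-bit field (first base in the most significant position), the rejection of alleles longer than 63 bases and the assumption that alleles are already normalized are choices made for this sketch, so the integers it produces are not guaranteed to match those of ‘rsvr enc’.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"
CHROMOSOMES = {**{str(i): i for i in range(1, 23)}, "X": 23, "Y": 24, "MT": 25}


def rsvr_encode(chrom: str, pos: int, ref: str, alt: str) -> int:
    """Pack an (already normalized) SNV or indel into a 63-bit integer."""
    c = CHROMOSOMES[chrom]
    if not 0 < pos < 2**28:
        raise ValueError("position does not fit in 28 bits")
    if len(ref) > 63 or len(alt) > 63:
        # Assumption: the text does not say how longer alleles are handled.
        raise ValueError("allele length does not fit in 6 bits")
    # Keep the whole alternate allele if shorter than 10 bases,
    # otherwise its first five and last four bases.
    kept = alt if len(alt) < 10 else alt[:5] + alt[-4:]
    bases = 0
    for i, base in enumerate(kept):
        # Assumption: the first base occupies the most significant 2 bits.
        bases |= BASE_CODE[base] << (2 * (8 - i))
    return (c << 58) | (pos << 30) | (len(ref) << 24) | (len(alt) << 18) | bases


def rsvr_decode(rsvr_id: int):
    """Unpack chromosome number, position, allele lengths and stored alternate bases."""
    c = (rsvr_id >> 58) & 0x1F
    pos = (rsvr_id >> 30) & (2**28 - 1)
    ref_len = (rsvr_id >> 24) & 0x3F
    alt_len = (rsvr_id >> 18) & 0x3F
    n_stored = min(alt_len, 9)
    kept = "".join(CODE_BASE[(rsvr_id >> (2 * (8 - i))) & 3] for i in range(n_stored))
    return c, pos, ref_len, alt_len, kept


snv = rsvr_encode("X", 1000, "A", "T")
print(rsvr_decode(snv))
```

Because chromosome and position occupy the most significant fields, sorting by the integer orders variants by genomic location, which is what makes an index on this single column suitable for positional lookups. Note that only the length of the reference allele is stored, so recovering the reference bases requires the reference genome.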

Genetic association analysis of 100KGP data

We constructed a Rareservoir in the Genomics England Research Environment containing the PASSing 49 variants in the merged VCF of 77,539 consented participants in the 100KGP Rare Diseases Programme. This Rareservoir only included variants with a PMAF > 0 according to gnomAD v.3.0, an internal MAF < 0.002 and at least one predicted consequence on a canonical transcript in Ensembl v.104. Variants with a greater MAF are unlikely to be highly penetrant for diseases eligible for inclusion in the 100KGP and are likely to have, at most, small effects on risk, making them challenging to validate. Variants with a median genotype quality <35 and SNVs with a CADD Phred score <10 were also excluded from the analyses.
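The PMAF filter applied here relies on the scoring rules given in the Rareservoir section. A minimal sketch of those rules follows, written with the standard library only. It assumes that the binomial test is one-sided in the direction of the variant being more common than the threshold t, that the rule for t = 0.0001 is also applied across populations, and that a variant with no observed alternate alleles counts as absent from gnomAD; none of these details is spelled out in the text, and the function names are illustrative rather than part of ‘rsvr pmaf’.

```python
import math


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    lower = 0.0
    for i in range(k):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


def pmaf_score(populations, alpha=0.05):
    """Score a variant from per-population (allele_count, allele_number) pairs."""
    observed = [(ac, an) for ac, an in populations if ac > 0 and an > 0]
    if not observed:
        return 3  # treated as absent from gnomAD (assumption)
    if any(ac >= 2 and binom_upper_tail(ac, an, 0.001) < alpha for ac, an in observed):
        return 0  # frequency of 0.001 rejected in some population
    if any(binom_upper_tail(ac, an, 0.0001) < alpha for ac, an in observed):
        return 1  # frequency of 0.0001 rejected in some population
    return 2      # consistent with a frequency of 0.0001 or lower


for counts in ([], [(1, 100_000)], [(30, 100_000)], [(200, 100_000)]):
    print(counts, pmaf_score(counts))
```

The same upper-tail calculation, with a null MAF of 0.01 and a significance level of 10⁻⁶, corresponds to the default test described for identifying internally common variants from a random sample of gVCFs.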

For each of the 269 rare disease classes (Extended Data Figs. 5 and 6 ), we applied the BeviMed 9 association test to rare variants extracted from the Rareservoir database in each of the 19,663 canonical transcripts belonging to a gene with a ‘protein_coding’ biotype. The analysis was carried out using R 3.6.2, making use of functionality from packages Matrix 1.2–18, dplyr 0.8.5, bit64 0.9–7, bit 1.1–14, DBI 1.1.0, RSQLite 2.1.4 and BeviMed 5.7. The case set for a given disease class and gene was constructed by selecting one case from each pedigree containing at least one person affected with the disease class. For the purposes of the association analysis, participants were labeled ‘explained’ by a given gene if they had variants in that gene classified as ‘pathogenic_variant’ or ‘likely_pathogenic_variant’ in the ‘gmc_exit_questionnaire’ table in the Genomics England Research Environment. To boost power, we used this information to reassign cases that were explained by variants in a different gene to the control group.

Using BeviMed, we performed a Bayesian comparison of a baseline model of no association and each of six association models defined by an MOI and a class of etiological variant.

1. No association (prior probability 0.99)

2. Dominant association with ‘high’-impact variants having a PMAF ≥ 2 (that is, corresponding to a target MAF < 0.01%; prior probability 0.002475)

3. Dominant association with ‘moderate’-impact variants having a PMAF ≥ 2 (prior probability 0.002475)

4. Dominant association with ‘5′ UTR’ variants having a PMAF ≥ 2 (prior probability 0.00005)

5. Recessive association with ‘high’-impact variants having a PMAF ≥ 1 (that is, corresponding to a target MAF < 0.1%; prior probability 0.002475)

6. Recessive association with ‘moderate’-impact variants having a PMAF ≥ 1 (prior probability 0.002475)

7. Recessive association with ‘5′ UTR’ variants having a PMAF ≥ 1 (prior probability 0.00005)

Thus, the overall prior probability of association was 0.01, and there was an equal prior probability of dominant and recessive inheritance. The PPA was the sum of the posterior probabilities of models 2–7. We imposed a stricter PMAF threshold under a dominant MOI than under a recessive MOI because, ceteris paribus, dominant variants are under stronger negative selection than recessive variants. The three groups of variants were selected as follows.
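The prior specification can be checked numerically, and the PPA computed, with a few lines of code. In the sketch below the models are numbered 1 to 7 in the order listed above and the prior probabilities are taken from the text. The log marginal likelihoods passed to posterior_probabilities are hypothetical inputs standing in for the quantities that BeviMed computes; the function illustrates generic Bayesian model comparison and is not BeviMed's interface.

```python
import math

# Prior probabilities of the seven models, in the order listed in the text.
PRIORS = {
    1: 0.99,      # no association
    2: 0.002475,  # dominant, high impact
    3: 0.002475,  # dominant, moderate impact
    4: 0.00005,   # dominant, 5' UTR
    5: 0.002475,  # recessive, high impact
    6: 0.002475,  # recessive, moderate impact
    7: 0.00005,   # recessive, 5' UTR
}


def posterior_probabilities(log_evidence):
    """Posterior model probabilities from log marginal likelihoods (hypothetical inputs)."""
    log_joint = {m: math.log(PRIORS[m]) + log_evidence[m] for m in PRIORS}
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {m: math.exp(v - top) for m, v in log_joint.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def ppa(posteriors):
    """Posterior probability of association: sum over models 2 to 7."""
    return sum(p for m, p in posteriors.items() if m != 1)


prior_dominant = PRIORS[2] + PRIORS[3] + PRIORS[4]
prior_recessive = PRIORS[5] + PRIORS[6] + PRIORS[7]
print(prior_dominant, prior_recessive)

# With uninformative data (equal evidence for every model) the PPA equals the prior, 0.01.
flat = posterior_probabilities({m: 0.0 for m in PRIORS})
print(ppa(flat))
```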

5′ UTR variants: those with a 5_prime_UTR_variant consequence

High-impact variants: those with any consequence amongst start_lost, stop_lost, frameshift_variant, stop_gained, splice_donor_variant or splice_acceptor_variant, excluding variants with a ‘low-confidence’ LOFTEE score 10

Moderate-impact variants: those with any consequence amongst start_lost, stop_lost, frameshift_variant, stop_gained, splice_donor_variant, splice_acceptor_variant, missense_variant or inframe_deletion
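The three definitions above can be expressed as a small classification function. The consequence terms are those listed in the text; the function name, the class labels and the boolean flag for a ‘low-confidence’ LOFTEE score are illustrative assumptions.

```python
HIGH_IMPACT = frozenset({
    "start_lost", "stop_lost", "frameshift_variant", "stop_gained",
    "splice_donor_variant", "splice_acceptor_variant",
})
# The moderate-impact class embeds the high-impact consequences.
MODERATE_IMPACT = HIGH_IMPACT | {"missense_variant", "inframe_deletion"}


def variant_classes(consequences, loftee_low_confidence=False):
    """Return the variant classes that a transcript-variant pair belongs to."""
    consequences = set(consequences)
    classes = set()
    if "5_prime_UTR_variant" in consequences:
        classes.add("5_prime_UTR")
    if consequences & HIGH_IMPACT and not loftee_low_confidence:
        classes.add("high")
    if consequences & MODERATE_IMPACT:
        classes.add("moderate")
    return classes


print(variant_classes({"stop_gained"}))
print(variant_classes({"stop_gained"}, loftee_low_confidence=True))
print(variant_classes({"missense_variant"}))
```

As written in the text, the ‘low-confidence’ LOFTEE exclusion applies only to the high-impact class, so such a variant remains in the moderate-impact class in this sketch.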

The rationale for embedding variants from the high-impact class in the moderate-impact class is that both types of variant are capable of inducing a loss of function. The prior on the probability that a modeled rare variant is pathogenic, conditional on either the association model mediated by 5′ UTR variants or the association model mediated by moderate-impact variants, was set to Beta(2,8). This encodes a prior conditional expectation that 20% of rare variants are pathogenic, which is well suited to missense and 5′ UTR variants. However, we specified a distribution with a greater mean for the high-impact models. Specifically, the prior on the probability that a modeled high-impact variant is pathogenic was set to Beta(3,1), which reflects a prior conditional expectation that 75% of rare variants are pathogenic because loss-of-function variants tend to be functionally equivalent to each other. BeviMed reports the posterior probability that each variant is pathogenic conditional on the MOI and the class of etiological variant. The methodology is described in further detail in the original BeviMed publication 9 .
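The prior expectations quoted above are the means of the two Beta distributions, since a Beta(α, β) distribution has mean α / (α + β). A short check, with an illustrative helper name:

```python
def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Moderate-impact and 5' UTR models: Beta(2, 8).
print(beta_mean(2, 8))  # 0.2
# High-impact models: Beta(3, 1).
print(beta_mean(3, 1))  # 0.75
```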

We applied the following postprocessing of BeviMed results with a PPA > 0.95.

We reran BeviMed including all samples (that is, with relatives of cases and controls). Associations for which the analysis with all samples caused the PPA to fall below 0.9 were filtered out due to conflicting evidence for the association within families.

We reran BeviMed after removing variants absent from affected relatives of the cases. Associations for which this removal caused the PPA to drop below 0.25 were filtered out because they depended on variants that were not shared by affected cases within families.

To guard against false positives due to incorrect pedigree data, population structure or cryptic relatedness, we applied the following algorithm. We obtained the distribution of the number of rare variants in the Rareservoir shared by pairs of individuals within each assigned ancestry in the 100KGP. The top percentile in each of these distributions was used to indicate potential relatedness between participants in the same population. We reran BeviMed after removing cases so as to ensure that no more than one case from any set of potentially related cases sharing a variant was included in the analysis. Associations for which this analysis caused the PPA to fall below 0.25 were filtered out.
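The three postprocessing checks reduce to a set of thresholds on the PPA obtained from each rerun. The sketch below encodes only those thresholds, taken from the text; the class and field names are hypothetical, and the reruns of BeviMed that would produce the four values are not shown.

```python
from dataclasses import dataclass


@dataclass
class AssociationPPAs:
    """PPAs for one association from the original analysis and the three reruns."""
    original: float               # one case per pedigree, as in the main analysis
    all_samples: float            # rerun including relatives of cases and controls
    shared_variants_only: float   # rerun without variants absent from affected relatives
    deduplicated_cases: float     # rerun with at most one case per potentially related set


def passes_postprocessing(ppas: AssociationPPAs) -> bool:
    """Apply the reporting thresholds described in the text."""
    return (
        ppas.original > 0.95
        and ppas.all_samples >= 0.9
        and ppas.shared_variants_only >= 0.25
        and ppas.deduplicated_cases >= 0.25
    )


kept = AssociationPPAs(0.99, 0.97, 0.80, 0.60)
dropped = AssociationPPAs(0.99, 0.97, 0.10, 0.60)
print(passes_postprocessing(kept), passes_postprocessing(dropped))  # True False
```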

To account for correlation between case sets, for each gene, we removed all but the most strongly associated disease class within each disease group before reporting the 260 associations remaining. Without the postprocessing, the number of reported associations would have been 302. Conditional on the modal model underlying each of the 260 associations, we recorded the variants with a posterior probability of pathogenicity >0.8 accounting for at least one case in the 100KGP.

PanelApp annotation

Significant associations were colored according to PanelApp 14 (Fig. 1b ) evidence levels for panel–gene relations (green for high evidence, amber for moderate evidence and red for low evidence) for panels of type ‘Rare Disease 100K’, which are organized hierarchically by Disease Sub Group and Disease Group, or of type ‘GMS Rare Disease’. Given an association between a gene and a case set (corresponding either to a Specific Disease or to a Disease Sub Group), we searched for panels that contained the gene and had the same name as the case set (ignoring case). If such a match was not found, we searched for panels that contained the gene and that belonged to a Disease Sub Group with the same name as the Disease Sub Group of the case set. When this matching rule generated multiple matches, we selected the panel(s) with the highest evidence. If multiple panels still remained, we selected the panel with the smallest number of genes. Associations for which no matching panel in PanelApp could be found were inspected manually to assess whether PanelApp contained an alternative suitable panel (marked with an asterisk in Fig. 1b ).
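The panel-matching rules described above can be sketched as follows. The Panel class and its fields are hypothetical stand-ins for PanelApp data. The sketch assumes that the tie-breaking rules apply whichever matching step produced the candidates, and that the Disease Sub Group comparison is exact, since the text mentions ignoring case only for panel names.

```python
from dataclasses import dataclass

EVIDENCE_RANK = {"green": 3, "amber": 2, "red": 1}


@dataclass
class Panel:
    name: str
    sub_group: str
    evidence: dict  # gene symbol -> "green", "amber" or "red"


def match_panel(gene, case_set_name, case_set_sub_group, panels):
    """Select the panel used to annotate a gene-case set association, or None."""
    with_gene = [p for p in panels if gene in p.evidence]
    # Step 1: panels with the same name as the case set, ignoring case.
    matches = [p for p in with_gene if p.name.lower() == case_set_name.lower()]
    # Step 2: otherwise, panels in the same Disease Sub Group as the case set.
    if not matches:
        matches = [p for p in with_gene if p.sub_group == case_set_sub_group]
    if not matches:
        return None  # left for manual inspection
    # Tie-breaks: highest evidence for the gene, then the smallest panel.
    best = max(EVIDENCE_RANK[p.evidence[gene]] for p in matches)
    matches = [p for p in matches if EVIDENCE_RANK[p.evidence[gene]] == best]
    return min(matches, key=lambda p: len(p.evidence))


panels = [
    Panel("Panel A", "Sub group 1", {"GENE1": "amber", "GENE2": "green"}),
    Panel("Panel B", "Sub group 1", {"GENE1": "green", "GENE2": "green", "GENE3": "red"}),
    Panel("Panel C", "Sub group 1", {"GENE1": "green"}),
]
print(match_panel("GENE1", "panel a", "Sub group 1", panels).name)      # Panel A
print(match_panel("GENE1", "Other disease", "Sub group 1", panels).name)  # Panel C
```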

Shortlisting previously unidentified genetic associations for validation

Several sources of independent evidence were used to shortlist significant associations for validation. For each source, a score of one was awarded if the evidence was supportive and zero otherwise. Scores were then added over the different sources and used to rank the associations. Associations for which at least three sources of evidence were supportive were taken forward for further investigation. The sources of evidence and qualifying criteria for being considered supportive are listed below. Note that here we refer to variants that had a probability of pathogenicity >0.8 conditional on the modal model as ‘probably pathogenic’.

Counting cosegregating pedigree members. The pedigrees harboring pathogenic configurations of probably pathogenic alleles were checked for cosegregation between genotype and affection status. This evidence counted as supportive for associations for which all such pedigrees demonstrated cosegregation and there were at least three additional relatives who had not been included in the association analysis but for whom there was cosegregation. Note that Binary Alignment and Map (BAM) files for the affected members of pedigrees who were called homozygous reference for probably pathogenic variants were checked for evidence of mosaicism to guard against the possibility that they were falsely portraying a lack of cosegregation.

pLI and Z scores. pLI and Z scores for depletion of missense variants were obtained from the gnomAD v.2.2.1 browser 10. pLI > 0.9 for associations in which high-impact variants were most strongly associated was counted as supportive, whilst Z scores >2 for associations in which moderate-impact variants were most strongly associated were counted as supportive.

Recessive association. Population genetic metrics of purifying selection (pLI scores and Z scores) are sensitive to depletion of high-impact variants and missense variants, respectively. They are, therefore, useful measures to corroborate dominant associations. However, these metrics have low sensitivity to identify the signatures of selection against recessive diseases because isolated pathogenic variants in heterozygous form do not lead to a reduction in reproductive fitness. To avoid disadvantaging recessive associations identified by BeviMed, they were assigned a contribution of one point to the score.

Literature review. A comprehensive literature review was undertaken to assess the gene’s role (if any) in biological processes relevant to the disease and in other diseases, together with a survey of model organisms. The evidence from this review was determined to be either supportive or not.
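The four sources of evidence listed above combine into a simple additive score. The following sketch shows one way to compute it; the field names are hypothetical, and the literature verdict is taken as a precomputed boolean because it was a manual judgment.

```python
def evidence_score(assoc):
    """Sum one point per supportive source of evidence (maximum of four)."""
    cosegregation = (assoc["all_pedigrees_cosegregate"]
                     and assoc["extra_cosegregating_relatives"] >= 3)
    if assoc["variant_class"] == "high":
        constraint = assoc["pli"] > 0.9          # depletion of high-impact variants
    elif assoc["variant_class"] == "moderate":
        constraint = assoc["missense_z"] > 2     # depletion of missense variants
    else:
        constraint = False
    recessive = assoc["moi"] == "recessive"      # point awarded to recessive associations
    literature = assoc["literature_supportive"]
    return sum(int(bool(x)) for x in (cosegregation, constraint, recessive, literature))


def shortlisted(assoc, threshold=3):
    """Associations with at least three supportive sources are taken forward."""
    return evidence_score(assoc) >= threshold
```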

ERG: primary endothelial cell culture

Single-donor primary HDLECs (Promocell) were cultured in Endothelial Cell Growth Medium MV2 (Promocell). Pooled donor HUVECs (Lonza) were grown in Endothelial Cell Growth Media-2 (Lonza). HUVECs and HDLECs were grown on 1% (vol/vol) gelatin and used between passages 3 and 5.

ERG: real-time PCR

HUVECs and HDLECs were grown to confluency in a pregelatinized six-well dish. Total RNA was isolated using the RNeasy Mini Kit (Qiagen), and 1 µg of total RNA was transcribed into cDNA using Superscript III Reverse Transcriptase (Thermo Fisher Scientific). Quantitative real-time PCR was performed using PerfeCTa SYBR Green FastMix (Quanta Biosciences) on a Bio-Rad CFX96 System. Gene expression values of ERG in HUVECs and HDLECs were normalized to GAPDH expression and compared using the ΔΔCT method. The following oligonucleotides were used: ERG, 5′-GGAGTGGGCGGTGAAAGA-3′ and 5′-AAGGATGTCGGCGTTGTAGC-3′; GAPDH, 5′-CAAGGTCATCCATGACAACTTTG-3′ and 5′-GGGCCATCCACAGTCTTCTG-3′.
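For readers unfamiliar with the ΔΔCT calculation, a minimal sketch follows. It assumes the standard formulation with 100% amplification efficiency; which cell type serves as the calibrator is not stated in the text, so the argument names are generic.

```python
def fold_change(ct_target_sample, ct_reference_sample,
                ct_target_calibrator, ct_reference_calibrator):
    """Relative expression by the delta-delta-CT method.

    delta CT  = CT(target gene) - CT(reference gene), per sample
    ddCT      = delta CT(sample) - delta CT(calibrator)
    fold      = 2 ** (-ddCT), assuming perfect doubling per cycle
    """
    delta_sample = ct_target_sample - ct_reference_sample
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    return 2 ** -(delta_sample - delta_calibrator)
```

Here the target gene would be ERG and the reference gene GAPDH; a sample whose ΔCT is one cycle lower than that of the calibrator has twice the relative expression.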

ERG: immunoblotting analysis

Immunoblotting was performed according to standard conditions. Proteins were labeled with the following primary antibodies: rabbit anti-human ERG antibody (1:1,000; ab133264; Abcam) and mouse anti-human GAPDH (1:10,000; MAB374; Millipore). Primary antibodies were detected using fluorescently labeled secondary antibodies: goat anti-rabbit IgG DyLight 680 and goat anti-mouse IgG DyLight 800 (Thermo Scientific). Detection of fluorescence intensity was performed using an Odyssey CLx imaging system (LI-COR Biosciences, Lincoln) and Odyssey v.4 software.

ERG: immunofluorescence analysis of endothelial cells and mouse tissues

Confluent cultures of HUVECs and HDLECs were fixed with 4% (wt/vol) paraformaldehyde for 15 min and permeabilized with 0.5% (vol/vol) Triton X-100 before incubation with 3% BSA (wt/vol) in phosphate-buffered saline (PBS) containing the following primary antibodies: goat anti-human PROX1 antibody (1:100; AF2727; R&D Systems), rabbit anti-human ERG antibody (1:100; ab92513; Abcam) and mouse anti-human VE-cadherin (1:100; 555661; BD Biosciences). Secondary antibody incubation was carried out in 3% BSA (wt/vol) in PBS using the following antibodies: donkey anti-goat IgG Alexa Fluor-488 (1:1,000; A-11055), donkey anti-rabbit IgG Alexa Fluor-555 (1:1,000; A-31572) and donkey anti-mouse Alexa Fluor-594 (1:1,000; A-21203). All secondary antibodies were from Thermo Fisher Scientific. Nuclei were visualized using DAPI (Sigma-Aldrich). Confocal microscopy was carried out on a Carl Zeiss LSM 780 confocal laser scanning microscope with Zen 3.2 software.

All animal experiments were conducted with ethical approval from Imperial College London under UK Home Office Project Licence PEDBB1586 in compliance with the UK Animals (Scientific Procedures) Act of 1986. Ear tissue was collected from euthanized 3-week-old male and female C57BL/6J mice and fixed in 4% (wt/vol) paraformaldehyde at room temperature for 2 h. Tissue was then washed with PBS followed by a blocking and permeabilization step using 3% (wt/vol) milk in PBS containing 0.3% (vol/vol) Triton X-100 (PBST) for 1 h at room temperature. The following primary antibodies were used for immunofluorescence staining: goat anti-human PROX1 antibody (1:100; AF2727; R&D Systems) and rabbit anti-human ERG antibody (1:100; ab92513; Abcam). Primary antibodies were incubated at 4 °C overnight in 3% (wt/vol) milk in PBST. The following day, tissues were washed three times with PBST over the course of 2 h at room temperature. Tissues were incubated with secondary antibodies at room temperature for 2 h in 3% milk (wt/vol) in PBST. Primary antibodies were detected using fluorescently labeled secondary antibodies: donkey anti-goat IgG Alexa Fluor-488 (1:400; A-11055; Thermo Fisher Scientific) and donkey anti-rabbit IgG Alexa Fluor-555 (A-31572; Thermo Fisher Scientific). Stained samples were mounted onto glass slides using Fluoromount G (Thermo Fisher Scientific). Images were acquired using a Zeiss LSM 780 confocal laser scanning microscope with Zen v.3.2 software. All confocal images represent maximum intensity projections of Z stacks of single tiles.

ERG: subcloning and overexpression in HEK293 cells

We subcloned ERG (ENST00000288319.12) from HUVECs into the mammalian expression vector pcDNA3.1 (Thermo Fisher). ERG variants were generated by site-directed mutagenesis using the QuikChange Lightning kit (Agilent) with the wild-type ERG cDNA as the template. Expression of wild-type and mutant ERG was carried out using polyethylenimine (Sigma-Aldrich) transfection reagent in HEK293 cells grown in Dulbecco’s Modified Eagle Medium (DMEM) (Thermo Fisher) with 10% (vol/vol) FBS. After 24 h, cells were fixed with 4% (wt/vol) paraformaldehyde for 15 min and permeabilized with 0.5% (vol/vol) Triton X-100 before incubation with 3% BSA (wt/vol) in PBS containing mouse monoclonal anti-ERG antibody (1:100; sc-376293; Santa Cruz Biotechnology). Secondary antibody incubation was carried out in 3% BSA (wt/vol) in PBS using donkey anti-mouse Alexa Fluor-488 (1:1,000; A-21202; Thermo Fisher). Nuclei were visualized using DAPI (Sigma-Aldrich). Confocal microscopy was carried out on a Carl Zeiss LSM 780 confocal laser scanning microscope with Zen 3.2 software.

ERG: estimation of nuclear and nonnuclear ERG in HEK293 cells

Each image was read into a pair of channel-specific 1,024 × 1,024 matrices in R v.4.2.1 using the readCzi function from the readCzi R package v.0.2.0. A pixel was declared to contain a nuclear region if the intensity in the blue channel exceeded 60% of the 95th percentile of blue intensities across all pixels above background (identified as exceeding 1.35 × 10⁻² by visual inspection of bimodal intensity histograms). A pixel was declared to contain ERG if the intensity in the green channel exceeded 30% of the 95th percentile of the green intensities within the pixels previously declared to be nuclear. To fill in intranuclear gaps, any nonnuclear pixels adjacent to at least five nuclear pixels were declared nuclear. The estimated proportion of ERG that was cytosolic in an image was set to the number of ERG pixels that did not overlap nuclear pixels divided by the number of ERG pixels.
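The pixel classification described above can be sketched in a few lines. The original analysis was carried out in R; the Python version below is only an illustration, and several details are assumptions: the quantile uses linear interpolation (the default of R's `quantile`), adjacency means the eight surrounding pixels, gap filling is a single pass, and the green threshold is computed from the nuclear pixels before gap filling.

```python
import math

BACKGROUND = 1.35e-2  # background cutoff quoted in the text


def quantile(values, p):
    """Linear-interpolation quantile (assumed to match R's default type 7)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def cytosolic_fraction(blue, green):
    """Estimate the proportion of ERG pixels lying outside nuclei.

    blue and green are equally sized 2D lists of channel intensities.
    """
    rows, cols = len(blue), len(blue[0])
    pixels = [(i, j) for i in range(rows) for j in range(cols)]

    # Nuclear: blue exceeds 60% of the 95th percentile of above-background blue.
    above = [blue[i][j] for i, j in pixels if blue[i][j] > BACKGROUND]
    blue_cut = 0.6 * quantile(above, 0.95)
    nuclear = {(i, j) for i, j in pixels if blue[i][j] > blue_cut}

    # ERG: green exceeds 30% of the 95th percentile of green within nuclear pixels.
    green_cut = 0.3 * quantile([green[i][j] for i, j in nuclear], 0.95)
    erg = {(i, j) for i, j in pixels if green[i][j] > green_cut}

    # Gap filling: nonnuclear pixels with at least five nuclear neighbours.
    def nuclear_neighbours(i, j):
        return sum((i + di, j + dj) in nuclear
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0))

    filled = nuclear | {p for p in pixels
                        if p not in nuclear and nuclear_neighbours(*p) >= 5}

    return len(erg - filled) / len(erg)
```

The sketch assumes that each image contains at least one above-background blue pixel and at least one ERG pixel; a production version would need to handle empty images explicitly.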

GPR156: western blots

We subcloned GPR156 from human brain cDNA into EGFP-N2 vector. The three mutant GPR156 constructs were generated by mutagenesis using the QuikChange kit (Stratagene) and wild-type GPR156–GFP as a template. For expression analysis, the wild-type and mutant constructs were transfected in COS7 cells grown in DMEM (Gibco) with 10% FBS. Transfections were performed with Lipofectamine 2000 reagent (Life Technologies). Cells were harvested 48 h after transfection; lysed in buffer containing 1% 3-[(3-cholamidopropyl)dimethylammonio]-1-propane sulfonate (CHAPS), 100 mM NaCl and 25 mM N-2-hydroxyethylpiperazine-N-2-ethane sulfonic acid (HEPES), pH 7.4; and clarified by centrifugation at 18,407g. Lysates (20 μg) were run on a 4–20% sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS-PAGE) gel. The membrane was blocked with 5% milk, incubated with anti-GPR156 (1:200) and developed with horseradish peroxidase (HRP)-conjugated secondary (sheep anti-rabbit) antibody (1:1,000). Comparable loading was checked by stripping and reprobing the blots with anti-GAPDH (1:500) antibody (Santa Cruz Biotechnology).

GPR156: whole-mount immunostaining of GPR156 in mouse inner ears

All animal work was approved by the University of Maryland, Baltimore Institutional Animal Care and Use Committee (IACUC 420002). Inner ears were dissected from C57BL/6J mice with a postnatal age of 10 days and fixed in 4% paraformaldehyde in PBS overnight. For whole-mount immunostaining, the cochleae were microdissected and subjected to blocking for 1 h with 10% normal goat serum in PBS containing 0.25% Triton X-100, followed by overnight incubation at 4 °C with anti-GPR156 antibody (1:200; PA5-23857; Thermo Fisher) in 3% normal goat serum with PBS. F-actin was decorated using phalloidin (1:300). Confocal images were acquired from a Zeiss LSM710 confocal microscope, and images were processed using ImageJ v.1.53t software.

The 100,000 Genomes project was approved by East of England–Cambridge Central Research Ethics Committee ref:20/EE/0035. Only participants who provided written informed consent for their data to be used for research were included in the analyses. The study at the University of Maryland was approved by the institutional review board (RAC no. 2100001), and written informed consent was obtained by clinicians at King Faisal Hospital in Saudi Arabia from the participating individuals. The study of the Japanese ancestry pedigrees bearing PMEPA1 truncating alleles was approved by the Institutional Review Board of the National Cerebral and Cardiovascular Centre (M14-020) and Sakakibara Heart Institute (16–035), and written informed consent was obtained from the participating individuals.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Genetic and phenotypic data for the 100KGP study participants are available through the Genomics England Research Environment via the application at https://www.genomicsengland.co.uk/join-a-gecip-domain. PanelApp gene panels and evidence of associations were obtained using the PanelApp application programming interface (https://panelapp.genomicsengland.co.uk/api/docs/) on 20 October 2021. CADD v.1.5 (https://cadd.gs.washington.edu/), gnomAD v.3.0 (https://gnomad.broadinstitute.org/) and Ensembl v.104 (http://may2021.archive.ensembl.org/index.html) were used for variant annotation. Source data are provided with this paper.

Code availability

The rsvr tool and Rareservoir schema are available from https://github.com/turrogroup/rsvr.

References

Boycott, K. M. et al. International cooperation to enable the diagnosis of all rare genetic diseases. Am. J. Hum. Genet 100 , 695–705 (2017).

Ferreira, C. R. The burden of rare diseases. Am. J. Med Genet A 179 , 885–892 (2019).

Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583 , 96–102 (2020).

Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597 , 527–532 (2021).

Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586 , 757–762 (2020).

Greene, D., Richardson, S. & Turro, E. Phenotype similarity regression for identifying the genetic determinants of rare diseases. Am. J. Hum. Genet 98 , 490–499 (2016).

Hail Team. Hail 0.2. https://github.com/hail-is/hail (2022).

Lopez, J. et al. HGVA: the Human Genome Variation Archive. Nucleic Acids Res. 45 , W189–W194 (2017).

Greene, D., Richardson, S. & Turro, E. A fast association test for identifying pathogenic variants involved in rare diseases. Am. J. Hum. Genet 101 , 104–114 (2017).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Rentzsch, P., Schubach, M., Shendure, J. & Kircher, M. CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13 , 31 (2021).

Eilbeck, K. et al. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6 , R44 (2005).

Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49 , D884–D891 (2021).

Martin, A. R. et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat. Genet. 51 , 1560–1565 (2019).

Korber, L., Schneider, H., Fleischer, N. & Maier-Wohlfart, S. No evidence for preferential X-chromosome inactivation as the main cause of divergent phenotypes in sisters with X-linked hypohidrotic ectodermal dysplasia. Orphanet J. Rare Dis. 16 , 98 (2021).

Kasahara, Y. et al. Hyper-IgM syndrome with putative dominant negative mutation in activation-induced cytidine deaminase. J. Allergy Clin. Immunol. 112 , 755–760 (2003).

Martin-Almedina, S., Mortimer, P. S. & Ostergaard, P. Development and physiological functions of the lymphatic system: insights from human genetic studies of primary lymphedema. Physiol. Rev. 101 , 1809–1871 (2021).

Gordon, K. et al. Update and audit of the St George’s classification algorithm of primary lymphatic anomalies: a clinical and molecular approach to diagnosis. J. Med Genet. 57 , 653–659 (2020).

Kalna, V. et al. The transcription factor ERG regulates super-enhancers associated with an endothelial-specific gene expression program. Circ. Res. 124 , 1337–1349 (2019).

Shah, A. V., Birdsey, G. M. & Randi, A. M. Regulation of endothelial homeostasis, vascular development and angiogenesis by the transcription factor ERG. Vasc. Pharm. 86 , 3–13 (2016).

Hoesel, B. et al. Sequence-function correlations and dynamics of ERG isoforms. ERG8 is the black sheep of the family. Biochim. Biophys. Acta 1863 , 205–218 (2016).

Westbury, S. K. et al. Human phenotype ontology annotation and cluster analysis to unravel genetic defects in 707 cases with unexplained bleeding and platelet disorders. Genome Med. 7 , 36 (2015).

Greene, D., Richardson, S. & Turro, E. ontologyX: a suite of R packages for working with ontological data. Bioinformatics 33 , 1104–1106 (2017).

Resnik, P. et al. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. 11 , 95–130 (1999).

Ciurica, S. et al. Arterial tortuosity. Hypertension 73 , 951–960 (2019).

Loeys, B. L. et al. A syndrome of altered cardiovascular, craniofacial, neurocognitive and skeletal development caused by mutations in TGFBR1 or TGFBR2. Nat. Genet. 37 , 275–281 (2005).

Szklarczyk, D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47 , D607–D613 (2019).

Watanabe, Y. et al. TMEPAI, a transmembrane TGF-beta-inducible protein, sequesters Smad proteins from active participation in TGF-beta signaling. Mol. Cell. 37 , 123–134 (2010).

Creamer, T. J., Bramel, E. E. & MacFarlane, E. G. Insights on the pathogenesis of aneurysm through the study of hereditary aortopathies. Genes (Basel) 12 , 183 (2021).

Thermann, R. et al. Binary specification of nonsense codons by splicing and cytoplasmic translation. EMBO J. 17 , 3484–3494 (1998).

Sobreira, N., Schiettecatte, F., Valle, D. & Hamosh, A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum. Mutat. 36 , 928–930 (2015).

Ellaithy, A., Gonzalez-Maeso, J., Logothetis, D. A. & Levitz, J. Structural and biophysical mechanisms of class C G protein-coupled receptor function. Trends Biochem. Sci. 45 , 1049–1064 (2020).

Kindt, K. S. et al. EMX2-GPR156-Gai reverses hair cell orientation in mechanosensory epithelia. Nat. Commun. 12 , 2861 (2021).

Scheffer, D. I., Shen, J., Corey, D. P. & Chen, Z. Y. Gene expression by mouse inner ear hair cells during development. J. Neurosci. 35 , 6366–6380 (2015).

Miyoshi, T. et al. Human deafness-associated variants alter the dynamics of key molecules in hair cell stereocilia F-actin cores. Hum. Genet 141 , 363–382 (2022).

Smedley, D. et al. 100,000 Genomes pilot on rare-disease diagnosis in health care - preliminary report. N. Engl. J. Med. 385 , 1868–1880 (2021).

Thalhammer, A., Trinidad, J. C., Burlingame, A. L. & Schoepfer, R. Densin-180: revised membrane topology, domain structure and phosphorylation status. J. Neurochem. 109 , 297–302 (2009).

Chong, C. H. et al. Lrrc7 mutant mice model developmental emotional dysregulation that can be alleviated by mGluR5 allosteric modulation. Transl. Psychiatry 9 , 244 (2019).

Berthouze, M., Venkataramanan, V., Li, Y. & Shenoy, S. K. The deubiquitinases USP33 and USP20 coordinate beta2 adrenergic receptor recycling and resensitization. EMBO J. 28 , 1684–1696 (2009).

Birdsey, G. M. et al. The endothelial transcription factor ERG promotes vascular stability and growth through Wnt/Beta-catenin signaling. Dev. Cell 32 , 82–96 (2015).

Motiejunaite, J., Amar, L. & Vidal-Petiot, E. Adrenergic receptors and cardiovascular effects of catecholamines. Ann. Endocrinol. (Paris) 82 , 193–197 (2021).

Munoz-Lasso, D. C., Roma-Mateo, C., Pallardo, F. V. & Gonzalez-Cabo, P. Much more than a scaffold: cytoskeletal proteins in neurological disorders. Cells 9 , 358 (2020).

Zuchero, J. B. et al. CNS myelin wrapping is driven by actin disassembly. Dev. Cell 34 , 152–167 (2015).

DeWard, A. D., Eisenmann, K. M., Matheson, S. F. & Alberts, A. S. The role of formins in human disease. Biochim. Biophys. Acta 1803 , 226–233 (2010).

Ninoyu, Y. et al. The integrity of cochlear hair cells is established and maintained through the localization of Dia1 at apical junctional complexes and stereocilia. Cell Death Dis. 11 , 536 (2020).

Geppert, M. et al. The role of Rab3A in neurotransmitter release. Nature 369 , 493–497 (1994).

Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 , 2078–2079 (2009).

Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604 , 310–315 (2022).

Genomics England Research Consortium. Variant QC for 100,000 Genomes Project merged VCF files. https://re-docs.genomicsengland.co.uk/site_qc/ (2022).

Acknowledgements

This research was made possible through access to the data and findings generated by the 100,000 Genomes Project. The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The 100,000 Genomes Project is funded by the National Institute for Health Research and National Health Service (NHS) England. The Wellcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure. The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support. GS was performed by Illumina at Illumina Laboratory Services and was overseen by Genomics England. We thank all NHS clinicians who have contributed clinical phenotype data to the 100,000 Genomes Rare Diseases Programme and all staff at Genomics England who have contributed to the sequencing, maintenance of the research environment and assembly of the standard bioinformatic files that were required for our analyses. We thank the participants of the rare diseases program who made this research possible. We are grateful to V. Keeley for providing access to paternal DNA (ERG), F. Elmslie for inviting a patient to the clinic (ERG) and T. Jaworek for technical assistance (GPR156). D.G. was supported by the Cambridge British Heart Foundation (BHF) Centre of Research Excellence (RE/18/1/34212) and the Wellcome Collaborative (219506/Z/19/Z). V.H. was supported by a Medical Research Council (MRC)/National Institute for Health and Care Research Clinical Academic Research Partnership (MR/V037617/1). G.M.B. and K. Frudd were funded by BHF (PG/17/33/32990). G.M.B. and D.P. were funded by BHF (PG/20/16/35047). E.S. was supported by the Swiss Federal National Fund for Scientific Research (CRSII5_177191/1). S.M. and P.O. were supported by the MRC (MR/P011543/1) and BHF (RG/17/7/33217). K. Freson was supported by Katholieke Universiteit (KU) Leuven Special Research Fund (BOF) (C14/19/096) and Research Foundation – Flanders (G072921N). Work at the University of Maryland, Baltimore was supported by the National Institute on Deafness and Other Communication Disorders/National Institutes of Health (R01DC016295 to Z.M.A.). M.A.-O., F.I. and K.R. were supported by the King Salman Center for Disability Research (85722). E.T. was supported by the Mindich Child Health and Development Institute, the Charles Bronfman Institute for Personalized Medicine and the Lowy Foundation USA.

Author information

Authors and affiliations

Department of Medicine, University of Cambridge, Cambridge, UK

Daniel Greene

Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Daniel Greene, Bruce D. Gelb & Ernest Turro

National Heart and Lung Institute, Imperial College London, London, UK

Daniela Pirri, Karen Frudd & Graeme M. Birdsey

University College London Institute of Ophthalmology, University College London, London, UK

Karen Frudd

Molecular and Clinical Sciences Institute, St. George’s University of London, London, UK

Ege Sackey, Sahar Mansour & Pia Ostergaard

Department of Medical Genomics, Centre for Genomic Medicine, King Faisal Specialist Hospital & Research Centre, Riyadh, Saudi Arabia

Mohammed Al-Owain

Department of Otorhinolaryngology Head and Neck Surgery, School of Medicine, University of Maryland, Baltimore, MD, USA

Arnaud P. J. Giese, Sehar Riaz, Saima Riazuddin & Zubair M. Ahmed

Department of Clinical Genomics, Centre for Genomic Medicine, King Faisal Specialist Hospital & Research Centre, Riyadh, Saudi Arabia

Khushnooda Ramzan & Faiqa Imtiaz

Department of Biochemistry and Molecular Biology, School of Medicine, University of Maryland, Baltimore, MD, USA

Sehar Riaz, Saima Riazuddin & Zubair M. Ahmed

Department of Bioscience and Genetics, National Cerebral and Cardiovascular Center, Osaka, Japan

Itaru Yamanaka, Takayuki Morisaki & Hiroko Morisaki

Center for Medical Genetics, Antwerp University Hospital/University of Antwerp, Antwerp, Belgium

Nele Boeckx & Bart L. Loeys

Department of Cardiovascular Sciences, Center for Molecular and Vascular Biology, KU Leuven, Leuven, Belgium

Chantal Thys & Kathleen Freson

Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Bruce D. Gelb

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Bruce D. Gelb & Ernest Turro

Northern Genetics Service, Newcastle upon Tyne Hospitals National Health Service Trust International Centre for Life, Newcastle upon Tyne, UK

Paul Brennan

Department of Clinical Genetics, Chapel Allerton Hospital, Leeds Teaching Hospitals National Health Service Trust, Leeds, UK

Verity Hartill

Leeds Institute of Medical Research, University of Leeds, Leeds, UK

Centre for Medical Genetics, Centre Hospitalier Universitaire de Liège, Liège, Belgium

Julie Harvengt

Department of Medical Genetics, Shinshu University School of Medicine, Nagano, Japan

Tomoki Kosho

Center for Medical Genetics, Shinshu University Hospital, Nagano, Japan

South West Thames Regional Genetics Service, St. George’s University Hospitals National Health Service Foundation Trust, London, UK

Sahar Mansour

Department of Medical Genetics, Kawasaki Medical School Hospital, Okayama, Japan

Mitsuo Masuno

Okinawa Chubu Hospital, Okinawa, Japan

Takako Ohata

Oxford University Hospitals National Health Service Foundation Trust, Oxford, UK

Helen Stewart

Ear Nose and Throat Medical Centre, Riyadh, Saudi Arabia

Khalid Taibah

Peninsula Clinical Genetics Service, Royal Devon & Exeter Hospital, Exeter, UK

Claire L. S. Turner

Division of Molecular Pathology and Department of Internal Medicine, Institute of Medical Science, The University of Tokyo, Tokyo, Japan

Takayuki Morisaki

Department of Human Genetics, Radboud University Medical Center, Nijmegen, the Netherlands

Bart L. Loeys

Department of Medical Genetics, Sakakibara Heart Institute, Tokyo, Japan

Hiroko Morisaki

School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK

Andrew Mumford

South West National Health Service Genomic Medicine Service Alliance, Bristol, UK

Department of Haematology, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK

Ernest Turro

Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Genomics England Research Consortium

Contributions

D.G. developed software, conducted analyses and cowrote the paper. G.E.R.C. provided genetic and phenotypic data and access to the Genomics England Research Environment. C.T. performed experiments and interpreted results. B.D.G. provided biological interpretation and feedback on the manuscript. K. Freson designed and supervised experiments, provided biological interpretation and contributed to writing the paper. A.M. provided clinical oversight, provided biological interpretation and contributed to writing the paper. E.T. oversaw the study and cowrote the paper. The following contributions relate to the three gene-specific vignettes. For ERG, D.P., K. Frudd and E.S. performed experiments and interpreted results. S.M. and C.L.S.T. provided additional clinical information. P.O. coordinated validation and contributed to writing the paper. G.M.B. designed and supervised experiments and contributed to writing the paper. For PMEPA1, I.Y. and N.B. conducted experiments and interpreted results. P.B., V.H., J.H., T.K., M.M. and T.O. provided clinical information. T.M. and B.L.L. oversaw clinical and experimental studies. H.M. recruited the Japanese cases, conducted experiments, interpreted and analyzed results, and oversaw genetic studies. For GPR156, H.S. provided additional clinical information for the compound heterozygous family. K.T. clinically evaluated and recruited the p.S642Afs*162 family. A.P.J.G., K.R. and S. Riaz conducted experiments and interpreted results. M.A.-O. assisted with experiments, interpreted results and contributed clinical information. S. Riazuddin, F.I. and Z.M.A. designed and supervised experiments, analyzed results and provided reagents and tools.

Corresponding author

Correspondence to Ernest Turro.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Saheli Sadanand, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Reduction in the number of genotypes stored per sample.

For 100 randomly chosen 100KGP participants belonging to each ancestry group (taken from amongst those with an inferred probability >0.9 of belonging): a, boxplots showing the distribution of the number of non-homozygous reference PASSing genotypes for variants on chromosomes 1–22 and X that meet the default Rareservoir MAF filtering criteria (that is, a PMAF score >0 using gnomAD v3.0 and internal MAF < 0.002); b, boxplots showing the distribution of the proportion of all PASSing non-homozygous reference genotypes that meet the default Rareservoir MAF filtering criteria. In both plots, the lower, centre and upper lines respectively indicate the lower quartile, median and upper quartile. Whiskers are drawn up to the most extreme points that are less than 1.5× the interquartile range away from the nearest quartile.

Extended Data Fig. 2 General schematic of the database build procedure and contents.

Variants are extracted from VCF files, filtered on internal cohort allele frequency, encoded as 64-bit RSVR IDs and loaded into a table containing the corresponding genotypes. The variants are annotated with scores reflecting their predicted deleteriousness (in this case, CADD scores) and probabilistic minor allele frequency scores (PMAF) from gnomAD. The consequences of each variant with respect to a reference set of transcripts are generated and loaded into a table. Sample information including pedigree membership and membership of a maximal set of unrelated participants is loaded into a table. The case groupings for case/control association analyses are stored in a table.

Extended Data Fig. 3 Detailed schematic of the database build procedure.

Variants may be imported to a Rareservoir from either single gVCF files or a merged VCF file, following the procedures indicated by red and blue arrows respectively.

Extended Data Fig. 4 Schematic showing the variant data in the 100KGP Main Programme Rareservoir.

The number of variant/transcript pairs, the distribution of CADD scores and a breakdown of gnomAD frequency classes is shown for each annotated SO term in the context of the structure of the ontology.

Extended Data Fig. 5 The 269 case sets, Disease Groups A–I.

The names and sizes of the case sets used for the genetic association analyses, grouped by Disease Group and coloured by type (Disease Sub Group or Specific Disease). Disease Sub Groups with only one Specific Disease were excluded to avoid repeating identical analyses. Case sets smaller than 5 are labelled ‘<5’ and shown as having size 4 to comply with 100KGP policy on limiting participant identifiability. For legibility, only Disease Groups starting with the letters A–I are shown here.

Extended Data Fig. 6 The 269 case sets, Disease Groups M–Z.

An extension of Extended Data Fig. 5 showing the case sets in Disease Groups starting with the letters M–Z.

Extended Data Fig. 7 Breakdown of cases attributable to associations with ‘Posterior segment abnormalities’ by Specific Disease.

For each gene associated with the Disease Sub Group ‘Posterior segment abnormalities’, a bar plot showing the number of cases having each of the different Specific Diseases who have an inferred pathogenic configuration of alleles in the gene. This example illustrates that sets of cases with the same etiological gene may be assigned different Specific Diseases. Consequently, pooling cases within Disease Sub Group can boost power.

Extended Data Fig. 8 Microscopy images of HEK293 cells overexpressing ERG.

Exemplar immunofluorescence microscopy images of HEK293 cells overexpressing wild type ERG (from 20 replicates) and each of the p.S182Afs*22, p.T224Rfs*15 and p.A447Cfs*19 variants of ERG (each from 17 replicates). Cells were stained for ERG (green) and nuclear marker DAPI (blue). Scale bar, 20 μm.

Extended Data Fig. 9 Illustrative audiograms for GPR156 cases.

Air and bone conduction audiograms for the two affected daughters of the family with compound heterozygous GPR156 truncating alleles.

Supplementary information

Reporting Summary

Source Data Fig. 1

Sheet 1 shows a table of associations shown in Fig. 1b annotated with BeviMed PPAs (PPA), the level of the case set in the disease label hierarchy (Level), the inferred variant class and MOI for the association, the matched PanelApp panel for the association, the method that was used to find the match (Match method, either ‘Automatic’ or ‘Manual’), the associated evidence level for the match and the notes on the consistency between the MOI listed by PanelApp for the association and the inferred MOI (MOI match comment). Sheet 2 shows a table of variants having a probability of pathogenicity >0.8 conditional on the modal model and forming a pathogenic configuration of alleles in at least one case. While these variants contributed to the reported statistical associations, they have not been individually scrutinized according to ACMG guidelines.

Source Data Fig. 2

Uncropped western blot images corresponding to Fig. 2e .

Source Data Fig. 4

Uncropped western blot images corresponding to Fig. 4e .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Greene, D., Genomics England Research Consortium., Pirri, D. et al. Genetic association analysis of 77,539 genomes reveals rare disease etiologies. Nat Med 29 , 679–688 (2023). https://doi.org/10.1038/s41591-023-02211-z

Received : 21 October 2022

Accepted : 06 January 2023

Published : 16 March 2023

Issue Date : March 2023

DOI : https://doi.org/10.1038/s41591-023-02211-z

World of Genomics: England

Last updated on 9th November 2023 by Kira Newbon.

Best known for fish and chips, football, the Royal Family, and Shakespeare, England is also a world leader when it comes to genomics research. Being a UK-based company in the genomics field, and having had the opportunity to speak to English leaders in the genomics space at our event “The Festival of Genomics and Biodata”, we know the genomics scene in England pretty well – or so we hope! The UK, and by extension England, is currently positioning itself as a “Genomics Superpower,” so read on to find out more.

The population of England

England is part of the United Kingdom and covers more than half of the island of Great Britain. To the north of England is Scotland and the Irish Sea, to the west lies Wales and the Atlantic Ocean, to the south the English Channel, and to the east the North Sea. England’s geography features rolling hillsides, with mountain ranges including the Pennines to the north.

England has been inhabited for over 800,000 years, as evidenced by stone tools and footprints found in Norfolk. The earliest sign of modern human life in north-western Europe, a jawbone found in Devon, dates back 41,000–44,000 years. England has a rich history of human habitation, with remains from the Mesolithic, Neolithic and Bronze Ages, such as Stonehenge and Avebury. In the Iron Age, Britain was inhabited by the Celtic Britons. In AD 43, the Roman conquest of Britain began; Roman rule lasted until the 5th century.

The end of Roman rule led to the Anglo-Saxon settlement of Britain, which is regarded as the origin of England; the Kingdom of England emerged in the 10th century. In 1066, the Norman conquest of England began a new dynasty. The Tudors and later the Stuart dynasty established England as a colonial power, with the English Civil War resulting in the execution of King Charles I and the establishment of a republic. The British Empire, which originated in England in the late 1500s, became the largest colonial empire in recorded history before decolonisation in the 20th century. In the 19th century, England became the epicentre of the Industrial Revolution, quickly becoming the world’s most industrialized country.

Geographic and demographic information

Summary statistics

  • Land area: 130,278 sq km
  • GDP, total: £1.86 trillion (2019)
  • GDP, per capita: £32,866 (2019)

Population statistics

  • Population size:  56,536,000 people (2021)
  • Birth rate (UK, 2020):  10 per 1,000 people
  • Death rate (UK, 2020):  10 per 1,000 people
  • Infant mortality rate (UK, 2020):  4 per 1,000 people
  • Life expectancy, male (2020 estimate): 79 years
  • Life expectancy, female (2020 estimate): 83 years
  • Ethnicities (England and Wales, 2021) : White (81.7%), Asian, Asian British (9.3%), Black, Black British, Caribbean or African (4.0%), Mixed, multiple ethnic groups (2.9%), other ethnic group (2.1%)

(Source: Office for National Statistics and World Bank )

Healthcare system

The English healthcare system provides universal healthcare via the National Health Service (NHS) and is funded primarily through general taxation and national insurance. All English residents are entitled to free public health care, including hospital, physician, and mental health care services. The government, through NHS England, oversees and allocates funds to 191 Clinical Commissioning Groups, which govern and pay for care delivery at the local level. Approximately 10.5% of the population in England holds voluntary supplemental insurance for more rapid access to elective care. The government owns hospitals and providers of NHS care, including ambulance services and mental health services, and is responsible for ensuring comprehensive coverage. The NHS provides a wide range of services, including preventive services, hospital care, physician services, mental health care, and rehabilitation.

The main providers of primary care are General Practitioners (GPs), who act as gatekeepers for secondary care. People are required to register with a local GP of their choice, but many practices are full and do not accept new patients, limiting choice. To address the shortage of doctors, there has been a shift towards larger practices using multidisciplinary teams, including specialised services, pharmacists, and social workers. Most GPs are private contractors, but a growing number of practices are employing GPs on a salaried basis. Inpatient specialist care is mostly provided in NHS hospitals, and the NHS reimburses practices for the services they deliver. After-hours care is usually contracted by Clinical Commissioning Groups (CCGs) to GP cooperatives or private companies. Publicly owned hospitals are organized either as NHS trusts or foundation trusts, and all public hospitals have contracts with local CCGs to provide services.

Public Health England (PHE) was a government agency in England responsible for protecting and improving public health. Established in 2013, it combined the functions of several health bodies. On 29 March 2021, the UK government announced that PHE would be disbanded and its functions divided among several organizations. The health protection functions were transferred to the UK Health Security Agency (UKHSA), while health improvement functions went to the Office for Health Improvement and Disparities, NHS England, and NHS Digital (which merged with NHS England on 1 February 2023). The UKHSA is an executive agency of the Department of Health and Social Care and is responsible for public health protection and infectious disease control in England. Its establishment was prompted by the COVID-19 pandemic; it was formed in 2021 and became fully operational on 1 October 2021.

Healthcare priorities

England has made significant strides in improving public health over the past few decades. Since 1990, life expectancy has increased by over five years for men and over three years for women. The biggest contributor to this rise has been the substantial reduction in cardiovascular disease-related deaths, which is largely due to a combination of declining smoking rates, healthier diets, better access to preventative medication, and improved treatments. The screening of diseases has also been a key focus, with over 21 million tests performed each year covering over 30 conditions. Additionally, vaccination efforts have been successful in reducing the number of cases of infectious diseases, with tuberculosis cases at a record low and HIV diagnoses reaching their lowest levels since 2000.

However, there are still numerous challenges that England faces in terms of public health. The improvement in infant mortality and life expectancy has stalled in recent years, and many people are spending more time in poor health. There is a high prevalence of unhealthy behaviours, such as smoking, that are leading causes of premature death, particularly among low-income and vulnerable groups. Cardiovascular disease remains a major concern, affecting over six million people and accounting for one in four deaths. Additionally, infectious diseases such as measles and sexually-transmitted infections continue to pose a threat to public health.

Despite the challenges, there are also many opportunities for England to continue improving public health. The increasing use of technology, such as online tools and wearable devices, is opening up new avenues for monitoring health, early diagnosis, and tailored advice and support. The focus on preventing cardiovascular disease has been strengthened through the National Cardiovascular Disease Prevention System Leadership Forum (CVDSLF) and local-level efforts to improve detection and management of cardiovascular risk. Moreover, NHS Digital (formerly Public Health England) manages national datasets, publishes tools and resources, and manages disease registries such as the National Cancer Registration and Analysis Service which collects cancer data in England to drive improvements in cancer care and outcomes. The service aims to diagnose 75% of cancers at stage 1 and 2 by 2028 and is working to develop a new indicator to understand the stage at which cancers are diagnosed.

Genomic medicine capabilities

The NHS has a long history in genomics, starting with the first genetic laboratory services in the 1960s and most recently with the launch of the NHS Genomic Medicine Service (GMS) in 2018. The GMS is a nationally coordinated service, locally delivered, that offers cutting-edge benefits for patients, including the use of genomics in routine clinical care. The NHS GMS is the result of a decade of investment in genomics by the UK government and the NHS. The 100,000 Genomes Project, which aimed to sequence 100,000 whole genomes of patients in the NHS, was a big part of this investment. Genomics England was later created to support this effort.

In the UK, there are approximately 300 genetic counsellors – there are approximately 7000 worldwide – and the vast majority of genetic counsellors in England practise within clinical genetics services in the NHS.

The NHS Long Term Plan in 2019 set out that through the NHS GMS, the NHS would use whole genome sequencing as part of routine care for seriously ill children with rare genetic disorders, children with cancer, and adults with rare conditions or specific cancers. The NHS GMS is made up of a consolidated national genomic laboratory network of seven genomic laboratory hubs and seven GMS alliances, a single National Genomic Test Directory, and a clinical genomic service that diagnoses and manages complex rare and inherited diseases. The NHS also has a national genomic knowledge base, a partnership with Genomics England, and a national genomics unit. The NHS has adopted next-generation sequencing panel testing, fetal exome sequencing, and rapid whole exome sequencing. The use of genomic medicine has allowed patients to access over 12 newly licensed precision medicines.

In October 2022, the NHS launched a world-first national genetic testing service that will provide rapid life-saving tests for babies and children: the Newborn Genomes Programme. The service will process DNA samples of babies and children who are seriously ill or who are born with rare diseases, such as cancer. It will benefit over 1,000 children in intensive care each year, who previously had to undergo extensive tests with results taking weeks. The service will give medical teams results within days, allowing them to kickstart lifesaving treatment plans for more than 6,000 genetic diseases. Genomics England is delivering the Generation Study as part of its collaboration with the Newborn Genomes Programme. An initial list of genes and conditions that will be included in the Generation Study was published by Genomics England in October 2023. It includes 223 individual conditions caused by genetic changes in more than 500 genes. The list is likely to be updated in response to emerging research throughout the study.

Notable projects

  • Genome UK : Government strategy that sets out to “create the most advanced genomic healthcare ecosystem in the world, where government, the NHS, research and technology communities work together to embed the latest advances in patient care.”
  • UK Biobank : Established in 2007, the UK Biobank is a biomedical database containing health-related data. This includes genomic data on half a million UK participants.
  • Our Future Health : The UK’s “largest ever health research programme”, Our Future Health aims to recruit up to 5 million adult volunteers from across the UK to collect health-related data for research and preventive medicine.
  • 100,000 Genomes Project : A British initiative which sequenced 100,000 genomes from approximately 85,000 NHS patients affected by rare diseases or cancer.
  • Newborn Genomes Programme : A joint project by NHS England and NHS Improvement (NHSE/I) and Genomics England to sequence the genomes of over 100,000 newborns to identify genetic conditions that can be addressed clinically.
  • Cancer 2.0 Initiative : This initiative is comprised of 2 programmes – the Long-Reads and Methylation Sequencing Programme and the Multi-Modal Programme to help clinicians deliver more personalised treatment for 300,000+ patients per year.
  • “Data saves lives” strategy : Strategy to make the NHS and social care more data-driven and use data to bring benefits to patients, care users, and staff on the frontline by investing in secure data environments, the latest in technology, and giving people better access to their own data.
  • Diverse Data Initiative : This initiative by Genomics England aims to reduce health inequalities and improve genomic medicine by addressing the overrepresentation of populations from ‘WEIRD’ (western, educated, industrialised, rich and democratic) backgrounds in genomic databases.
  • National Genomic Research Library (NGRL) : A comprehensive database that allows approved researchers access to genomic data, health data and samples, under joint control of NHS England and NHS Improvement (NHSE/I) and Genomics England.
  • NHS Genomics Education Programme : A four-year £20 million Genomics Education Programme launched by Health Education England (HEE) in 2014 to improve access to genomics education for NHS staff. The programme has developed many educational resources, including a Master’s in Genomics Medicine framework, which can be undertaken as continued personal and professional development (CPPD) modules, a postgraduate certificate or diploma, or a full master’s degree. 

Notable organisations and companies

  • NHS Genomic Medicine Service (GMS): The arm of the NHS that harnesses the power of genomic technology and aims to provide equitable care, create a single National Genomic Test Directory which covers the use of all genomic technologies, and create a national genomic knowledge base to provide real world data to researchers and industry.
  • Genomics England : A company owned by the Department of Health and Social Care and created to execute the 100,000 Genomes Project. It partners with the NHS to embed genomics into routine healthcare, improve diagnostics and treatment for patients, and support researchers with its large genomic database.
  • Oxford Nanopore Technologies: Founded in 2005, Oxford Nanopore has developed long-read sequencing technology and is the only company that offers real-time analysis of native DNA or RNA and can sequence fragments of any length.
  • Wellcome Sanger Institute : Established in 1992, a non-profit research organization focused on genomics research. Funded mainly by the Wellcome Trust, it was created to play a role in the Human Genome Project as a major DNA sequencing centre.
  • Institute of Cancer Research : Established in 1909, and specialises in genetic epidemiology, molecular pathology, and therapeutic development for cancer research. The ICR is most famous for identifying that the basic cause of cancer is damage to DNA.
  • Health Data Research UK (HDRUK) : An independent, registered charity which aims to “develop and apply cutting-edge approaches to clinical, biological, genomic and other multi-dimensional health data, addressing the most pressing health research.”
  • UK Health Security Agency (UKHSA) : Formed after the dissolution of PHE and is responsible for “protecting every member of the community from the impact of infectious diseases” and other health threats.
  • Centre for Improving Data Collaboration : A new business unit within NHSX (NHSX is a UK government unit which sets national policy and best practice) which aims to support the NHS and social care to enter into data-sharing partnerships to benefit patients and the public.

Notable individuals

  • Francis Crick: English scientist who played a crucial role in deciphering the helical structure of the DNA molecule.
  • Rosalind Franklin: English chemist and X-ray crystallographer who played a central role in uncovering the molecular structures of DNA, RNA and viruses.
  • Frederick Sanger: English biochemist who won two Nobel Prizes in Chemistry for his ground-breaking discoveries: the molecular structure of proteins, and the development of “Sanger sequencing”, a chain-terminating method for DNA sequencing.
  • C.H. Waddington: A British developmental biologist and geneticist who laid the foundations for systems biology, epigenetics, and evolutionary developmental biology.
  • Reginald Punnett: An English geneticist who co-founded the Journal of Genetics with William Bateson in 1910. He is probably best known for creating the Punnett square which is still used by biologists today to predict the probability of possible genotypes of offspring.
  • Dame Anne McLaren: An English scientist who was a leading figure in developmental biology and whose work helped lead to human in vitro fertilisation.
  • Adam Rutherford: English geneticist, best known for his contributions to the Guardian and popular science books such as “The Book of Humans” and “A Brief History of Everyone Who Ever Lived”.

Future genomics landscape

England already has a strong history of cutting-edge genomics research, and this trend looks set to continue in the coming years. The National Health Service (NHS) in England is currently piloting a potentially revolutionary blood test, known as Galleri, that can detect over 50 types of cancer in its early stages. The test, developed by GRAIL, can detect these cancers through a simple blood sample and will be trialled on 165,000 patients. The results of the study are expected by 2023, and if outcomes are positive, the pilot could be expanded to involve about 1 million participants in 2024 and 2025. The Galleri blood test can detect various cancers that are difficult to diagnose early, such as head and neck, ovarian, pancreatic, oesophageal and blood cancers. The test could help the NHS reach its goal of increasing the proportion of cancers detected early, which is crucial to reducing cancer mortality.

To further realise the potential of genomics data, NHS England and the government’s “Data saves lives” strategy has outlined a plan for improving the use of data in healthcare, with the goal of using genomic data in combination with other health data to drive improvements for patients. The NHS aims to develop an interoperable data infrastructure and use cutting-edge tools to maximize diagnosis and access to precision medicine. The National Genomic Research Library (NGRL) has grown to include over 110,000 clinically linked genomes, making it one of the largest collections of whole genomes in the world for cancer and rare disease research. The recent update, part of the 100,000 Genomes Project, added aggregated data of over 78,000 genomes and allows for easier research using the Genomics England Research Environment platform. Moreover, the constantly improving clinical data from NHS Digital and Public Health England complements the detailed genomic data, making it a valuable resource for disease-specific research.

In December 2022, NHS England announced a £13.5 million investment for the development of a network of secure data environments (SDEs) for health and social care data. These SDEs, which meet high standards of privacy and security, will facilitate research and analysis of health data without compromising privacy. The funding is part of a larger investment of £200 million for making health data more accessible for research and analysis. The sub-national SDEs will cover 5 million citizens each and will operate in conjunction with the national secure data environment, offering privacy-protected access to data.

Another exciting project that looks set to strengthen the genomics landscape in England is Our Future Health. Our Future Health is the “UK’s largest health research programme”, aiming to recruit 5 million people to donate their health information to improve disease prevention, detection, and treatment. It will be a secure, encrypted database that only authorized researchers who meet strict ethical and scientific criteria can access. The programme’s goal is to help future generations live in good health for longer. It is funded by the UK government alongside various research partners and is planned to run until 2025.

In December 2022, Genomics England announced it will receive £175 million in funding to support its efforts in boosting the accuracy and speed of diagnosis for cancer patients and newborns with rare genetic conditions. The funding will go towards three initiatives, including a Newborn Genomes Programme that will sequence the genomes of up to 100,000 newborns, a cancer programme (Cancer 2.0 Initiative) that will use genomic sequencing and AI to improve diagnosis, and a Diverse Data initiative to tackle health inequalities by increasing the representation of non-European ancestry in genomic research. Another part of Cancer 2.0 is Genomics England’s plan to create the world’s largest cancer research platform, which will collect and analyse vast amounts of cancer-related data and turn it into better treatments for patients. With AI analytics, the platform will merge various data formats from genomics, pathology, and radiology. Currently, Genomics England has 16,000 participants via the 100,000 Genomes Project and aims to digitize hundreds of thousands of pathology and radiology images. The platform aims to be running by the end of 2023. These initiatives aim to create the world’s most advanced genomic healthcare system, supported by the latest scientific advancements, patient engagement, workforce development, and industrial growth.

Upcoming live training

In-person training session

We are holding an in-person training session at our Canary Wharf offices on 20th November 2024:

Learn more about our in-person training

Virtual training sessions

Why? The purpose of these sessions is to train researchers to make use of the Genomics England clinical and genomic data, to assist them in their analysis.

Who? These sessions will be accessible to all users eligible to access the Research Environment / Genomics England dataset - including both Discovery Forum and Research Network members.

When? We arrange training sessions twice a month, with topic-based sessions on the second Tuesday of the month, and an Introduction session for researchers new to the RE on the third Tuesday. Please see the schedule below.

What? You can find material from previous training sessions in the links on the left. What comes next depends on you, so don't hesitate to reach out to us with suggestions on what to cover in upcoming sessions.

Where? All training sessions will be held online via Zoom and recordings of the session will be made available afterwards with all sensitive data removed.

Check this page regularly for updates on the next training session. Also, do not hesitate to get in touch to recommend a topic for upcoming sessions.

Research Environment training sessions schedule

Date | Topic
8th October | What tools and workflows should I use to fulfil an overall goal?
12th November | Running workflows on the HPC and Cloud
10th December | Introduction to the RE
14th January | Using the Research Environment for clinical diagnostic discovery
21st January | Introduction to the RE
11th February | Importing data and tools to use in the RE
18th February | Introduction to the RE
11th March | Working with R in the RE
18th March | Introduction to the RE
8th April | Working with Python in the RE
15th April | Introduction to the RE
13th May | Building cancer cohorts and survival analysis
20th May | Introduction to the RE
10th June | Building rare disease cohorts with matching controls
17th June | Introduction to the RE
8th July | Finding participants based on genotypes
22nd July | Introduction to the RE
19th August | Introduction to the RE
9th September | Getting medical records for participants
16th September | Introduction to the RE
14th October | What tools and workflows should I use to fulfil an overall goal?
21st October | Introduction to the RE
11th November | Using GEL data for publications and reports
18th November | Introduction to the RE
9th December | Running workflows on the HPC and Cloud
16th December | Introduction to the RE

Past training sessions

Date | Topic
September 2024 | Using data for publications and reports
July 2024 | Getting medical records for participants
July 2024 | An introduction to the Research Environment: live training session at GERS
June 2024 | Finding participants based on genotypes
May 2024 | Building rare disease cohorts with matching controls
April 2024 | Introduction to the RE
March 2024 | Building cancer cohorts and survival analysis
February 2024 | Importing data and tools to use in the RE
January 2024 | Using the Research Environment for clinical diagnostic discovery
December 2023 | Running workflows on the
November 2023 | What tools and workflows should I use to fulfil an overall goal?
