
February 10th 2023

Week 4 : Building foundations

Before starting to add new features or work on new projects, it's always good to take stock and to think about refactoring (improving the codebase) and looking at what foundations need to be built.

This week has been all about foundational work. Some of the time has been spent refactoring and restructuring the template code so that it can be shared between microservices. Alongside this, I've started building some tools which will allow local development environments and production environments to pull in up-to-date templates and data as part of the build and deploy process.

Another part of the work, equally foundational, was to start developing shared constants for the different pipelines and the website. This will reduce the risk of wrong or outdated constants being used. These will all be hosted in an open GitHub repository.

The final piece of foundational work has been looking at the IMGT paper on consistent numbering and beginning to develop a data structure to hold that information, along with information on the role of individual positions within the molecules (e.g. TAPBPR binding, TCR binding, peptide binding).
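As a rough illustration of what such a data structure might look like, here is a minimal Python sketch. The class names, fields and example positions are all assumptions made for this note, not the structure that is actually being developed, and the position numbers are placeholders rather than real annotations.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """Roles named in the text; the enum itself is hypothetical."""
    PEPTIDE_BINDING = "peptide binding"
    TCR_BINDING = "TCR binding"
    TAPBPR_BINDING = "TAPBPR binding"


@dataclass(frozen=True)
class Position:
    """One position in a consistently numbered molecule."""
    number: int                 # position in the shared numbering scheme
    insertion_code: str = ""    # e.g. "A" for an inserted residue, else ""
    roles: frozenset = field(default_factory=frozenset)

    @property
    def label(self) -> str:
        """Human-readable label such as "2" or "2A"."""
        return f"{self.number}{self.insertion_code}"


def positions_with_role(positions, role):
    """Return the labels of all positions annotated with the given role."""
    return [p.label for p in positions if role in p.roles]


# Placeholder numbers, chosen only to exercise the structure.
example = [
    Position(1, roles=frozenset({Role.PEPTIDE_BINDING})),
    Position(2),
    Position(2, "A", roles=frozenset({Role.PEPTIDE_BINDING, Role.TCR_BINDING})),
    Position(3, roles=frozenset({Role.TAPBPR_BINDING})),
]
```

Keeping roles as a set on each position means one position can carry several roles, and a query such as positions_with_role(example, Role.PEPTIDE_BINDING) stays a simple filter.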

It's not all been software development this week. I've also been joining in on an Open Targets project, talking through user needs with one of the PDB in Europe teams and attending a couple of fascinating seminars. It's lovely to be part of an organisation where I get to see and participate in such interesting scientific discussions.

February 3rd 2023

Week 3 : Prioritising first tasks

The PID template is still in the process of being refined but has been very useful to think through the possible places to start and understand which tasks/projects are dependent upon each other. From this analysis there are two clear first tasks which are quite foundational to work going forwards:

IMGT numbering of structures

The structures from the PDB have a variety of different residue numberings. The majority are numbered sequentially with residue 1 being the first residue after the signal peptide. Some structures however are numbered with residue 1 being the first amino acid of the signal peptide. The variety of numbering is further complicated by the MHC molecules of some species having deletions or insertions within the antigen-binding domains.

Fortunately, the IMGT has created a numbering scheme to provide an elegant solution to this problem. This particular task is to develop an automated way to apply the IMGT numbering scheme to MHC Class I structures from the PDB and to develop an automated way to test/validate the assignment.
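To make the idea concrete, here is a minimal Python sketch of alignment-based renumbering with a simple validation step. It is not the IMGT scheme's actual rules: it assumes a pairwise alignment against a reference sequence already exists, and the function names, the letter-suffix convention for insertions and the toy sequences are all assumptions for illustration.

```python
from string import ascii_uppercase


def renumber(aligned_reference: str, aligned_query: str) -> list:
    """Label each query residue with a reference-based position.

    Both strings come from a pairwise alignment and use '-' for gaps.
    Query residues aligned to a reference gap (insertions) reuse the
    previous reference number with a letter suffix: 4A, 4B, ...
    """
    if len(aligned_reference) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    ref_number = 0   # last reference position seen
    insertions = 0   # insertions seen since that position
    for ref_char, query_char in zip(aligned_reference, aligned_query):
        if ref_char != "-":
            ref_number += 1
            insertions = 0
            if query_char != "-":
                labels.append((str(ref_number), query_char))
        elif query_char != "-":
            suffix = ascii_uppercase[insertions]
            labels.append((f"{ref_number}{suffix}", query_char))
            insertions += 1
    return labels


def validate(labels, aligned_query: str) -> bool:
    """Check every residue got exactly one unique label, in order."""
    residues = [c for c in aligned_query if c != "-"]
    names = [name for name, _ in labels]
    same_residues = [aa for _, aa in labels] == residues
    return same_residues and len(set(names)) == len(names)


# Toy sequences: the query has one deletion and one insertion.
labels = renumber("GSHS-MRY", "G-HSAMRY")
print([name for name, _ in labels])
# → ['1', '3', '4', '4A', '5', '6', '7']
```

The deletion leaves a gap in the numbering (no residue 2) and the insertion becomes 4A, so residues that align with the reference keep comparable numbers. The sketch supports at most 26 consecutive insertions and does not handle signal peptides.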

Looking at ways to link IPD/IMGT allele names to data within Open Targets

There is a wealth of information in Open Targets and the GWAS Catalog which would provide disease association context for a future feature on "histo": a page for each allele, with information on polymorphism locations, motifs if known, immunopeptidome/epitopes and structural data/predictions.

Currently, Open Targets uses Ensembl IDs as search criteria and stores information about potentially associated variants keyed against HGVS descriptions. This information is also available in the IPD-IMGT/HLA database (example for HLA-A*02:01), which creates the potential for linkages. One interesting opportunity that arises from this is to create "SameAs" type mappings and make them publicly available.
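A "SameAs" type mapping could be as simple as a pair of namespaced identifiers, indexed in both directions. The sketch below is only an assumption about how that might look in Python. The class and function names are hypothetical, and the Ensembl and HGVS values are deliberate placeholders rather than real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SameAs:
    """A statement linking two identifiers from different namespaces."""
    source: str      # namespace of the first identifier
    source_id: str
    target: str      # namespace of the second identifier
    target_id: str


def build_index(mappings):
    """Index mappings in both directions so lookups work from either side."""
    index = {}
    for m in mappings:
        left = (m.source, m.source_id)
        right = (m.target, m.target_id)
        index.setdefault(left, set()).add(right)
        index.setdefault(right, set()).add(left)
    return index


# Placeholder values only; these are not real Ensembl or HGVS identifiers.
mappings = [
    SameAs("IPD-IMGT/HLA", "HLA-A*02:01", "Ensembl", "ENSG_PLACEHOLDER"),
    SameAs("IPD-IMGT/HLA", "HLA-A*02:01", "HGVS", "HGVS_PLACEHOLDER"),
]
index = build_index(mappings)
```

Looking up index[("Ensembl", "ENSG_PLACEHOLDER")] then leads back to the allele name, which is the kind of linkage described above.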

January 27th 2023

Week 2 : Getting started by writing some documents

You can start some things straight away, but when there are multiple things you could do, working out what those things are and prioritising them is crucial to avoid too much context switching. For those of us used to agile, PIDs (Project Initiation Documents) feel like things from the "waterfall past", but I think they're still relevant if they're made to reflect the user needs that are at the core of any agile methodology.

I've spent some time this week thinking about how to do this. In some ways, it feels like a luxury to start by developing a document format, but it feels foundational and worth the effort to get a format that captures the right information.

I started with some desk research, looking at a lot of different existing templates for things to reuse. The format I liked the best came from this blog post by Robert Drury. It has the right level of detail, not too little, not too much. However, I felt it didn't really include much information about the users and their needs. So I looked at how we used to frame some of this in the work I did in the charity sector with CAST.

Problem statements

The first tool that I love to use is a problem statement. These are framed from the perspective of a user and really help to understand what we are trying to solve, fix or improve.

When {who are the people affected}
Are {what is the situation}
Then {what problem arises}
This means {what are the effects of the problem}

For example:

When people or tools
Are trying to compare two MHC structures from species/loci with insertions/deletions
Then they cannot easily see which residue number is comparable
This means that comparison is hard, or scripts/tools are brittle

This is quite a clumsy example in some ways, but it serves to show the problem reasonably well.
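If problem statements end up being stored alongside PIDs, the four-part format maps neatly onto a small data structure. This is a minimal Python sketch, with the class and field names being my own assumptions rather than part of any existing template.

```python
from dataclasses import dataclass


@dataclass
class ProblemStatement:
    who: str        # who are the people affected
    situation: str  # what is the situation
    problem: str    # what problem arises
    effects: str    # what are the effects of the problem

    def render(self) -> str:
        """Render the statement in the When/Are/Then/This means format."""
        return "\n".join([
            f"When {self.who}",
            f"Are {self.situation}",
            f"Then {self.problem}",
            f"This means {self.effects}",
        ])


numbering = ProblemStatement(
    who="people or tools",
    situation=("trying to compare two MHC structures from species/loci "
               "with insertions/deletions"),
    problem="they cannot easily see which residue number is comparable",
    effects="that comparison is hard, or scripts/tools are brittle",
)
print(numbering.render())
```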

Knowledgeboards

Knowledgeboards are a great way of both organising information and thinking through what is known and what needs to be researched. There are three categories in a knowledgeboard:

What we know: Things you know for certain, and why

What we think we know: Things you need more evidence for

What we don't know: Things you need to find out

The aim of all user research is to move things from uncertainty to certainty. Only things which have evidence should be in "what we know", ideally with a link to that evidence.
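The rule that only evidenced items belong in "what we know" can be made explicit in code. Here is a minimal Python sketch of a knowledgeboard that enforces it. The class names, the promote method and the example entry are hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    statement: str
    evidence: Optional[str] = None  # link or note; needed for "what we know"


@dataclass
class Knowledgeboard:
    know: list = field(default_factory=list)
    think_we_know: list = field(default_factory=list)
    dont_know: list = field(default_factory=list)

    def promote(self, item: Item, evidence: str) -> None:
        """Move an item from "what we think we know" to "what we know"."""
        if not evidence:
            raise ValueError("items in 'what we know' need evidence")
        self.think_we_know.remove(item)
        item.evidence = evidence
        self.know.append(item)


# Hypothetical entry; the evidence string stands in for a real link.
board = Knowledgeboard()
claim = Item("Users compare structures across species")
board.think_we_know.append(claim)
board.promote(claim, evidence="notes from a user interview")
```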

Quadrants

Since working with Matt McAlister, I've always loved using quadrant charts to look at problems. I sketched out how you might map the risks of a project on a quadrant, and I think it works nicely.

Not likely, high impact
Likely, high impact
Not likely, low impact
Likely, low impact
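The same mapping can be written as a tiny function. This Python sketch assumes each risk has been scored between 0 and 1 for likelihood and impact. The scoring scale and the 0.5 threshold are arbitrary assumptions, not part of the quadrant I sketched.

```python
def quadrant(likelihood: float, impact: float, threshold: float = 0.5) -> str:
    """Place a risk in one of the four quadrants.

    likelihood and impact are scores between 0 and 1; scores at or above
    the threshold count as "Likely" and "high impact" respectively.
    """
    likely = "Likely" if likelihood >= threshold else "Not likely"
    level = "high impact" if impact >= threshold else "low impact"
    return f"{likely}, {level}"


print(quadrant(0.9, 0.2))
# → Likely, low impact
```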

Testing the prototype template

Finally, since it's always good to "design with data", I filled in a couple of PIDs for potential projects/tasks/features to see how it worked and then made a few tweaks to the template where I felt information didn't have a home or was in the wrong place.

Now to write more PIDs and to look at which things make the most sense to do first, thinking both about user needs and other things which depend upon them.

p.s. once I feel confident with the template, I'll share it

January 20th 2023

Week 1 : Hello EBI/ARISE

I started the ARISE fellowship, which runs for three years, on 16th January 2023. I'm now on the Wellcome Genome Campus in the European Bioinformatics Institute, with Open Targets as my host group and an association with the PDB in Europe group. Having been a remote worker for much of the pandemic, having a desk and being with colleagues is lovely. It's even more wonderful to be back amongst scientists for the whole week after working as an independent researcher for the last year and a half and working in the technology sector before that.

As I'm always keen to start at "the B of the bang" and ARISE is a continuation and scale-up of the work already performed, I've set myself some tasks for between meetings.

The first one is around weeknotes, both building the small software feature to display them and the practice of writing them.

The software feature is being used to test a way of developing microservices and having them be part of an overarching product. As with the structures part of the website, the weeknotes feature will be developed in Flask/Zappa and deployed as an AWS Lambda function. However, in keeping with the principle of "small pieces, loosely joined", new sections of the overall product will be individual microservices which are then composited using API Gateway. Search will be the next feature to get some upgrades and to become a microservice.
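For anyone unfamiliar with the pattern, a single microservice of this kind can be very small. The following is a minimal Flask sketch, not the actual weeknotes code: the blueprint, route, URL prefix and in-memory data are all assumptions for illustration. Zappa would wrap the module-level app object for Lambda, and API Gateway would route a path such as /weeknotes to it.

```python
from flask import Blueprint, Flask, jsonify

# Hypothetical blueprint for the weeknotes feature.
weeknotes = Blueprint("weeknotes", __name__)

# Placeholder data; a real service would load notes from storage.
NOTES = [
    {"week": 0, "title": "New beginnings"},
    {"week": 1, "title": "Hello EBI/ARISE"},
]


@weeknotes.route("/")
def list_notes():
    """Return all weeknotes as JSON."""
    return jsonify(NOTES)


def create_app() -> Flask:
    """Build the WSGI app that a deployment tool would wrap."""
    app = Flask(__name__)
    app.register_blueprint(weeknotes, url_prefix="/weeknotes")
    return app


app = create_app()
```

Because each feature is its own small app, another microservice such as search could follow the same shape and be deployed independently.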

The practice of writing them will be inspired by a couple of pieces of current reading:

Hopefully, the weeknotes will be of use to others. At the very least they will be a lightweight way of reporting continuously and of reminding me what was done when (in terms of features; experiments will obviously be in the lab notebook).

Finally, a meeting with Sameer from PDB in Europe made me think again about the need for a PID (Project Initiation Document) template which captures user-centricity and existing knowledge better and is more agile than waterfall. So that will be a task for next week.

January 13th 2023

Week 0 : New beginnings

Hello, welcome to the weeknotes for the "histo" project. First an introduction and some background.

I'm Chris, a bioinformatician/software developer/immunologist. I have been awarded an EMBL and Marie Skłodowska-Curie Actions (MSCA) ARISE fellowship to build tools for immunologists to use. The tools relate primarily to the Class I and Class II molecules of the Major Histocompatibility Complex that present antigen to T-cells.

The aim of the work is to pull together data and tools to allow immunologists, structural biologists and machine learning developers to answer some specific questions such as these below:

  • what governs the motif and conformation of peptides bound to MHC molecules and how does this change the shape of the peptide binding cleft
  • what governs the three-dimensional recognition of MHC:peptide complexes by T-cells
  • what data exists about peptide repertoires and how can this data be used to extrapolate to MHC alleles with no data
  • what is the structural variability between MHC alleles, MHC molecules of different species and between specific MHC:peptide complexes and how can this information be used predictively
  • what is the role of each position within the MHC molecule, and how can this data be used to explain biological phenomena (such as TAPBPR dependence and independence)
  • what can we understand from data about the effects of peptide processing and transcript level in the cell to better predict epitope sequences

The aim of the ARISE fellowship is to learn how to build and deliver "research infrastructures" such as "histo". These will be built using user-centric design/development methods and, in this case in particular, by "working in the open". This means that I'll be sharing the process as well as the outputs/outcomes so that as much of the learning as possible can be shared as quickly as possible. Hopefully, there will be publications involved in this process, but these "weeknotes" will provide a regular and immediate view into the project.