Radiology and ResNets: Our summer internship experience with AIMI

Stanford AIMI
Aug 30, 2022


Left to Right: Sarah Pan, Aarav Wattal, Taylor Tam

Intro to the team: The Philosophers!

We’re three high-schoolers who participated in the AIMI Center’s high school internship program. During this two-week experience, we were taught the fundamentals of machine learning (ML), introduced to professionals in the field, and presented with an ML challenge. Now you may be wondering: “Who are these people and can high schoolers really do ML?” Well, to answer the first question, we’re a few teenage computer science nerds who love ML. Aarav Wattal is a rising senior at the Quarry Lane School who enjoys playing basketball and vibing to great pop music. Sarah Pan is a rising junior at Phillips Academy who is lactose intolerant and likes reading Sally Rooney. And Taylor Tam is a rising junior at Menlo School who enjoys watching the show Mr. Robot, playing electric guitar, and may have a slight resemblance to Mabel from Gravity Falls (look it up).

Oh, and the answer to the second question is a definitive yes.

The challenge and data annotations

Our task was to develop a machine learning algorithm that could correctly identify and locate medical tubes in chest x-rays. In these images, there were three types of tubes we were interested in: endotracheal tubes, which are used to maintain a tracheal airway; central venous catheters, which deliver medicine intravenously; and chest tubes, which are used to drain excess fluid or gas.

Examples of the three device types: central venous catheter, chest tube, and endotracheal tube.

To begin tackling this challenge, we were given 2,100 chest x-rays from AIMI’s CheXpert dataset. Each of these images was unique, with varying types and quantities of medical devices (in some cases, none) and with device locations that differed from case to case. However, these images alone were not sufficient to train a model. What we needed were ‘labeled images’ or, in other words, for each image to be associated with the subset of the three devices it contained. This would allow our algorithm to learn patterns associated with the devices present and correctly extrapolate to additional data.

Using MD.ai, a platform for annotating medical images, we drew bounding boxes around devices of interest. Each bounding box also had a tag that corresponded to one of the three tubes. With the knowledge obtained from a 30-minute radiology lecture (from our amazing mentor Christian Bluethgen, a.k.a. the radiology master), we squinted hard and labeled away.

After labeling the dataset, the actual challenge loomed in front of us. Could we train a machine learning model to do the same?

What are ML models and neural networks?

Training a model is an intricate process involving plenty of math (partial derivatives and large-scale matrix operations). Humans aren’t very good at these sorts of things, but luckily, computers are. We used machine learning libraries to help us build, train, and test the model as well as to preprocess our data. But, in good conscience, we had to really understand how neural networks worked before making our own model.

The basis of any neural network is the perceptron, an artificial neuron that operates by multiplying its input by a set of “weights” (commonly referred to as parameters) and applying an activation function. Many perceptrons make up a layer, and a neural network is composed of many layers of interconnected perceptrons. A convolutional neural network (CNN), the standard model in computer vision, utilizes both convolutional and pooling layers. Convolutional layers pass a kernel (a small matrix of weights) over subsections of the input to detect local patterns. Pooling layers divide their input into regions and output only the most “interesting” value from each (for max pooling, the greatest).

An example of max pooling — the 4x4 color-coded areas are reduced to single values by taking the maximum of the four numbers in the region
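To make that concrete, here is a minimal sketch of max pooling in PyTorch (the library we describe below); the 4×4 input values are made up purely for illustration.

```python
import torch
import torch.nn as nn

# A made-up 4x4 feature map (batch of 1, 1 channel), just for illustration.
x = torch.tensor([[[[1., 3., 2., 0.],
                    [5., 6., 1., 2.],
                    [7., 2., 9., 4.],
                    [3., 1., 0., 8.]]]])

# 2x2 max pooling: each 2x2 region is replaced by its largest value.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[6., 2.],
#           [7., 9.]]]])
```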

These functions make CNNs effective and efficient for computer vision. Having more hidden layers in the neural network usually correlates with higher accuracy but also comes with the risk of overfitting: the model performs well on the training set but isn’t able to generalize to unseen data.

Our approach

Once we understood how neural networks functioned, we could return to the issues of formatting the data properly and finding the right libraries to use. To convert the data into a more model-friendly form, we used NumPy, pandas, and the MD.ai client library. For creating and training the model itself, we used PyTorch, which enabled us to transform the data into tensors, access a pre-trained ResNet-50 model, apply optimization and loss functions, and train and test the model.
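As a rough sketch of what that data pipeline looks like, here is the kind of PyTorch code involved; the CSV file name, its columns, the image size, and the batch size are hypothetical stand-ins rather than our actual settings.

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

# Illustrative preprocessing: resize the x-rays, convert them to tensors, and
# normalize with the ImageNet statistics that the pretrained ResNet expects.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

class ChestXrayDataset(Dataset):
    """Pairs each image file with its label; the CSV layout is hypothetical."""

    def __init__(self, csv_path, transform):
        self.df = pd.read_csv(csv_path)  # assumed columns: image_path, label
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        label = torch.tensor([row["label"]], dtype=torch.float32)
        return self.transform(image), label

loader = DataLoader(ChestXrayDataset("labels.csv", preprocess),
                    batch_size=16, shuffle=True)
```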

Initially, we broke down the task of creating the bounding boxes into two steps: identifying which (if any) tubes were present in a particular x-ray image and generating a bounding box around the tube’s endpoint within the chest. After brainstorming a few approaches to the first part of the problem, we decided to start with a simple binary classifier.

We soon recognized that a binary classifier could only do binary classification (a real shocker!). We needed to create three different binary classification models, each with identical architecture but trained to identify a different type of tube (one for endotracheal tubes, one for chest tubes, and one for central venous catheters). We knew that this solution was pretty inefficient, but we decided to try it out anyway.

Part 1: the binary classifier

The model we chose was ResNet-50, a convolutional neural network 50 layers deep and organized into 5 stages, each with a convolution and identity block. Using this architecture had multiple advantages. First, it enabled us to use transfer learning, which entails taking knowledge from one deep learning problem and applying it to a similar one. Since the model had already been trained on the ImageNet dataset, it could use what it had learned from those images and apply it to our chest x-ray problem. In addition, it uses identity mappings to prevent vanishing gradients, a problem that stops weights from being updated and therefore the model from learning. The combination of these two benefits, along with the model’s architecture, led us to choose it for our binary classifier.
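In PyTorch, the transfer-learning setup for one of these binary classifiers looks roughly like the sketch below; the loss function and learning rate are illustrative choices, not our exact hyperparameters.

```python
import torch
import torchvision

# Start from a ResNet-50 pretrained on ImageNet (transfer learning).
model = torchvision.models.resnet50(pretrained=True)

# Replace the final fully connected layer with a single output neuron:
# "device present" vs. "device absent" for one tube type.
model.fc = torch.nn.Linear(model.fc.in_features, 1)

# Binary cross-entropy on the raw logit, plus a standard optimizer
# (example settings only).
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```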

We assigned ourselves a medical device, so we could each create a binary classifier for our specific device. As each of us opened Colab with colossal cups of coffee in hand, we let out a large sigh and started coding.

Diagram of the ResNet-50 binary classifier architecture: five convolution stages, each reducing the dimensions of the input matrix as the model processes it, bringing it closer and closer to the output.

Another view of the ResNet-50 architecture, with more detail on each of the five convolution stages and the matrix operations that lead to the final output.

After training the models on an eighty-percent split of 1,680 images and validating them on the remaining 420, we tried them on test data. All three models were relatively successful: the endotracheal tube and chest tube classifiers each achieved 87% accuracy, and the central venous catheter classifier achieved 85%.
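For reference, an 80/20 split like that only takes a couple of lines; the stand-in dataset below exists purely to illustrate the split sizes.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset of 2,100 examples (in practice, our labeled x-rays);
# the point here is simply the split: 1,680 for training, 420 for validation.
dataset = TensorDataset(torch.arange(2100).float().unsqueeze(1),
                        torch.randint(0, 2, (2100, 1)).float())
train_set, val_set = random_split(dataset, [1680, 420],
                                  generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set))  # 1680 420
```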

Now that we had concrete results for binary classifiers, we bid our adieus and promptly ended our internship. Thanks for reading!

Just kidding.

There was a second part to the problem we had yet to address: creating bounding boxes. As we thought about ways to solve that problem, we stumbled upon another one. Running three separate models for the classification process was indefensibly inefficient, so we started thinking about a way to optimize this task.

Part 2: the multilabel classifier

After creating our three binary classifiers, we wanted to train a single model that could do everything the three binary classifiers could do, with equal if not better performance. After some masterful googling, we determined that we needed a multi-label classifier. This would allow our model to assign multiple devices to an image, which was the functionality that the binary classifiers lacked.

Again, we used transfer learning. But why?

A major benefit of transfer learning is being able to use a pre-made architecture and then make small modifications to fit a specific problem. In our case, this meant enabling our model to have multiple outputs. Instead of writing and debugging an entire architecture from scratch, we could focus on specific modifications that would better fit the model to our problem (this was maybe our biggest reason, because creating a 30+ layer architecture is our collective worst nightmare). Additionally, a big issue we faced was our relatively small dataset, a difficulty that transfer learning addressed. Since the majority of the model was pre-trained, it already recognized certain markers, like edges and contours. Using these pre-learned indicators improved accuracy and efficiency and helped prevent overfitting.

After deciding on transfer learning, our next task was to determine which pretrained model to use and what modifications we would make. After some discussion, we agreed on a pretrained ResNet-34, a network similar to ResNet-50 that, as its name suggests, is 34 layers deep. Next, we swapped the final output layer for a new one with three neurons, one for each device, effectively giving our ResNet backbone three binary classification heads. Some long nights (early mornings, rather) later, we had it figured out! Finally, we moved on to the task of data wrangling.
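The change itself is only a couple of lines; here is a rough PyTorch sketch (illustrative rather than our exact code).

```python
import torch
import torchvision

# Pretrained ResNet-34 backbone; swap the final layer for three output
# neurons, one per device (endotracheal tube, chest tube, CVC). Each output
# is an independent yes/no, which makes this multi-label rather than
# multi-class.
model = torchvision.models.resnet34(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 3)
```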

Luckily, we were able to apply our experience from previous data wrangling to this round. Following a similar process to our binary classifiers, we started by organizing the data into a dictionary that paired images with labels. This time, however, we created labels in the form of a one-by-three vector, with each entry indicating whether a given device was present.
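A minimal sketch of that label encoding follows; the device names and file name are hypothetical placeholders.

```python
import torch

DEVICES = ["endotracheal_tube", "chest_tube", "central_venous_catheter"]

def encode_labels(devices_present):
    """Turn the list of devices seen in one image into a 1x3 multi-hot vector."""
    return torch.tensor([1.0 if d in devices_present else 0.0 for d in DEVICES])

# Example: an image containing a chest tube and a central venous catheter.
labels = {"patient_001.jpg": encode_labels(["chest_tube", "central_venous_catheter"])}
print(labels["patient_001.jpg"])  # tensor([0., 1., 1.])
```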

When it came time to train, we wrote a training function, implemented binary cross-entropy loss, and wrote another function to measure model accuracy. Using 75% of our data to train and 25% to validate, we were ecstatic to report 97% accuracy!
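A simplified sketch of what that training and accuracy code can look like; the 0.5 decision threshold and per-prediction accuracy here are illustrative choices, not a faithful copy of our functions.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device="cpu"):
    """One pass over the training data (simplified)."""
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)  # binary cross-entropy per device
        loss.backward()
        optimizer.step()

def accuracy(model, loader, device="cpu"):
    """Fraction of per-device predictions (sigmoid > 0.5) matching the labels."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, targets in loader:
            preds = (torch.sigmoid(model(images.to(device))) > 0.5).float()
            correct += (preds == targets.to(device)).sum().item()
            total += targets.numel()
    return correct / total
```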

We were next given two datasets for final evaluation, one with images sourced from Stanford and one with images sourced externally. Our model performed with 97% accuracy on the first dataset and 92% accuracy on the latter set. Since the externally sourced dataset presented images taken in different conditions, or with slightly different-looking devices than those our model had been trained on, a slightly lower accuracy was expected. Seeing the discrepancy in results underscored the necessity of a diverse training dataset, which would have made the model applicable to a wider range of images. Nevertheless, we were still super happy with the accuracy we achieved on both datasets! (Team bonding moment!)

Although we were more than satisfied with our multi-label classifier, we were still missing an essential component of our project. If you hadn’t guessed already, the bounding box problem had resurfaced yet again. This time, without classification optimizations to distract us, we were stumped.

We thought about building a model that would only locate a certain type of tube. When that led to a dead end, we thought about creating a model under the assumption that each image had one of each tube. You can probably see why this idea didn’t work out. We moved on to a multi-output regression model that would have (theoretically) worked, but we ultimately chickened out. When we genuinely believed we were at the end of our luck, our other amazing mentor, Rogier, introduced us to YOLO.

This was a turning point.

Part 3: YOLO (You only look once!)

The YOLOv5 model has a unique architecture that enables it to identify and classify elements in images. Its main selling point is its extreme inference speed, reaching 130 to 150 frames per second (FPS), while rival models come in around 75 FPS. This makes it very effective for live image and video segmentation. It is composed of three main parts: the backbone, the neck, and the head. For the backbone, it uses a CSPDarknet, which partitions the feature map of the base layer into parts and then merges them through a cross-stage hierarchy, enabling more gradient flow through the network. Its neck uses a PANet (Path Aggregation Network), which boosts information propagation through the pipeline and builds the feature pyramids that help the model generalize. The model’s head applies anchor boxes to the features and produces the final output vectors of class probabilities and bounding boxes. This architecture allows the model to make predictions after only one forward propagation (You Only Look Once), which separates YOLO from its two-stage counterparts.
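To get a feel for it, a pretrained YOLOv5 model can be pulled down through torch.hub and run in a few lines; the image path below is hypothetical, and for a task like ours the stock weights would be replaced by a checkpoint fine-tuned on the tube annotations.

```python
import torch

# Load the small pretrained YOLOv5 model from the ultralytics/yolov5 repo
# via torch.hub (downloads the code and weights on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# One forward pass yields classes, confidences, and bounding boxes all at
# once: hence "You Only Look Once".
results = model("chest_xray.jpg")   # hypothetical image path
results.print()                     # summary of detections
boxes = results.xyxy[0]             # per box: [x1, y1, x2, y2, confidence, class]
```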

While YOLO seemed very good at doing its job, we weren’t (initially) very good at YOLO. For one, YOLO’s annotation format differed from the dataset objects we had used before. After figuring out a way to automate the conversions, we ran into another obstacle: YOLO expected us to list the data file paths in a “.yaml” file. Perhaps we were so well-acquainted with ordinary JPEGs that seeing such an exotic file extension sent us into minor shock. But luckily, the YAML file was easy to edit, and soon we had finished all the configuring and preprocessing.
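Concretely, YOLO expects one text file of normalized box coordinates per image plus a small .yaml file describing the dataset. The sketch below shows both, assuming a bounding box given as a top-left corner plus width and height in pixels; the file paths are hypothetical.

```python
# Convert one bounding box (top-left x, y plus width and height, in pixels)
# into YOLO format: "class x_center y_center width height", normalized to [0, 1].
def to_yolo(class_id, x, y, w, h, img_w, img_h):
    x_c = (x + w / 2) / img_w
    y_c = (y + h / 2) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w / img_w:.6f} {h / img_h:.6f}"

print(to_yolo(0, x=100, y=200, w=50, h=80, img_w=1024, img_h=1024))
# 0 0.122070 0.234375 0.048828 0.078125

# The .yaml config points YOLO at the images and names the classes
# (paths here are hypothetical).
yaml_config = """\
train: data/images/train
val: data/images/val
nc: 3
names: ['endotracheal_tube', 'chest_tube', 'central_venous_catheter']
"""
with open("tubes.yaml", "w") as f:
    f.write(yaml_config)

# Training is then launched from the YOLOv5 repository, e.g.:
#   python train.py --img 640 --batch 16 --epochs 300 --data tubes.yaml --weights yolov5s.pt
```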

YOLO Results

At first, we didn’t know how to interpret our YOLO results. There were a few automatically generated images, and those looked pretty spot-on, but the graphs that came along with those images were a little more abstract.

We tackled the confusion matrix first, which, as its name suggests, tells us how confused our model was. Ideally, a dark blue diagonal would run from corner to corner, where the predicted class matches the true class. Our confusion matrix certainly had a trace of this diagonal, but the model didn’t perform as well on certain classes as on others. For example, only 59% of endotracheal tubes were classified correctly; the rest were incorrectly identified as chest tubes or as background.

The graph of the F1 score against confidence was also helpful in our evaluation. This one took more demystifying than the confusion matrix but nonetheless proved pretty straightforward. The F1 score is the harmonic mean of precision and recall. Precision is the likelihood that a predicted positive is actually positive, and recall is the likelihood that an actual positive is classified as such. The F1 score combines these “competing” metrics into a single numerical value. In many problems, the F1 score is preferable to accuracy because it prevents an overrepresented class from dominating the overall result.
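As a quick worked example of how those quantities relate (the counts below are made up, not our model’s actual numbers):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of everything flagged positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 59 true positives, 20 false positives, 41 false negatives.
print(round(f1_score(tp=59, fp=20, fn=41), 3))  # 0.659
```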

But what wasn’t so straightforward was why YOLO generated a graph of our model’s F1 score against model confidence. Our comprehensive Googling didn’t provide comprehensive explanations either. But what we did know was that confidence corresponded to “how sure” the model is of its decision. Despite our uncertainty about the graph’s purpose, we interpreted that the model performed best when it was between 30% and 65% sure of its prediction.

Fun fact: our classifiers achieved their results in relatively few epochs. YOLO needed over 300 (something our Colab GPUs didn’t want to hear). In fact, after training YOLO, we exceeded our usage limits, and the Colab gods revoked our GPU privileges. Without anything to tinker with, we sat in complete radio silence, feeling somewhat empty that we couldn’t tweak our code. (Perhaps this is the tech equivalent of getting lost in the woods.) But in those moments of solitude, our journey had never been so evident, and we were lucky to have a chance to reflect.

Summing up!

Working at AIMI allowed us to explore a rich area of research that very few high schoolers experience, and thankful is an inadequate start in describing how we feel. As a team, we’ve shared countless moments collaborating with each other and celebrating our successes, but more importantly, we’ve experienced the importance of struggle. Through the many unglamorous errors we’ve received, the unkempt menagerie of academic-paper symbols, and the unamenable task of data wrangling, our sense of determination strengthened tenfold. We’ve learned that anything is accomplishable through the power of StackOverflow and our amazing mentors (also sufficient GPU + RAM.)

Our internship experience at AIMI was extraordinarily rewarding. We’d like to thank Alaa Youssef, Johanna Kim, and Jacqueline Thomas for organizing the program, Christian Bluethgen, Jean-Benoit Delbrouck, Karin Stacke, and Rogier van der Sluijs for leading us along the way, the amazing lecturers who informed our decisions, and the field authorities for inspiring our futures. And a huge thank you to everyone involved for giving us the tools to go even further in ML.

In the meantime, Aarav plans to shoot a few baskets. Sarah’s going to enjoy her oat milk ice cream with a copy of Normal People. Taylor has a few riffs to practice. And of course, we’re looking forward to many, many more team projects. So next time you see a Tesla using autodrive … just kidding! Anyways, expect to see more of us soon. With this amazing experience and our new understanding of ML, we’re definitely not going to stop here.

Philosophers out!

Feel free to contact Sarah Pan (span24@andover.edu), Aarav Wattal (aaravwattal@gmail.com), or Taylor Tam (taylor@taylortam.com) with any questions or comments.

Please contact aimicenter@stanford.edu for questions regarding the Stanford Center for Artificial Intelligence in Medicine and Imaging. Students interested in Stanford AIMI internships can find application information here.

Sources

CheXpert (internal training and test set): “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.”

“Depiction of the Convolution layer with a filter in convolutional neural network (CNN).” ResearchGate, Dec. 2019. Accessed 14 July 2022.

Dwivedi, Priya. “Understanding and Coding a ResNet in Keras.” Towards Data Science, Medium, 4 Jan. 2019, towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33.

Khosla, Savya. “CNN | Introduction to Pooling Layer.” GeeksforGeeks, 29 July 2021, www.geeksforgeeks.org/cnn-introduction-to-pooling-layer/.

MIMIC-CXR (external validation): MIMIC-CXR Database v2.0.0.

“ResNet-50 architecture.” ResearchGate, Jan. 2020, www.researchgate.net/figure/ResNet-50-architecture-26-shown-with-the-residual-units-the-size-of-the-filters-and_fig1_338603223.
