TensorFlow object detection API mAP score

I am using [SSD MobileNet V2 FPNLite 320x320] to train my model. I have chest X-rays to detect COVID-19. There are 1349 normal chest X-rays and 3883 COVID-19 chest X-rays. I have used different augmentations to increase my normal chest X-rays from 1349 to 2215, and the pneumonia images from 3883 to 4032.
I have trained my model for 4000 training steps, and it has a training loss of 0.21 and an evaluation loss of 0.23.

I am confused about the 0.837 mAP I have gotten. Is it a good result? I need to compare my results with some other papers, but they report accuracy near 96 or 98%. Their result is in accuracy and mine is in mAP. How am I going to compare them?


Another question is that most TensorFlow object detection API models have an mAP between 0.20 and 0.50, but mine is 0.83. So is there some issue with my model, or is it fine? Because it is detecting and classifying most of the images accurately.

Please, any expert, guide me regarding all of the issues I have asked about.

Hey @Annie ,

Sounds like an interesting project.

To answer a few of your questions quickly: your mAP of 0.837 refers to 83.7%, which is quite good for a training run. Everyone's models are different, and metrics are really dependent on the purpose of your model. It sounds like you have two classes, so 0.837 sounds about right for a two-class model.

This article by Roboflow is quite basic but is good for understanding how to interpret mAP results, specifically for your model needs.
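To make the metric less abstract, here is a toy sketch of how average precision (AP) for a single class can be computed from ranked detections; mAP is just the mean of this across classes (and, for COCO metrics, across IoU thresholds). This is a simplified illustration, not the actual COCO or TF OD API implementation.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Toy AP: detections sorted by descending confidence, each marked
    true/false positive; AP approximates the area under the resulting
    precision-recall curve.
    is_true_positive: list of booleans, one per detection, in rank order.
    num_ground_truth: total number of real objects of this class."""
    if num_ground_truth == 0:
        return 0.0
    tp = 0
    precisions = []
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            tp += 1
            precisions.append(tp / rank)  # precision at each new recall point
    return sum(precisions) / num_ground_truth

# Example: 4 detections, 3 real objects; ranks 1, 2 and 4 are correct.
ap = average_precision([True, True, False, True], num_ground_truth=3)
```

With those example numbers the precisions at the three hits are 1.0, 1.0 and 0.75, giving an AP around 0.92 for that single class.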

A good comparison point may be this EfficientDet training session on two Android figurines. The training method and model architecture are different, but it gives you an idea of what ~120 images across two classes with a certain number of steps achieves.

For the amount of data/images you have, 4000 seems like quite a low number of training steps. But this is hugely dependent on the complexity of what you're trying to train.

In regard to your training loss and evaluation loss, it's important to look at these values on a curve over time. TensorBoard is great for this, so it might be good to load TensorBoard so you can see your loss curves. That graph will look something like below:
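For reference, you can point TensorBoard at the training directory you passed to the OD API's training script; the path below is just a placeholder for your own model directory:

```shell
# model_dir is whatever directory holds your checkpoints and event files
tensorboard --logdir=models/my_ssd_mobilenet_v2_fpnlite --port=6006
```

Then open http://localhost:6006 and look at the Loss scalars for both the train and eval runs.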

That being said, your loss values appear quite low, which seems good. But it comes down to convergence.

Taking a quick glance at your data, I can see you have a few -1.00000 scores for some metrics, namely AR@100 (small) and AR@100 (medium). This indicates to me that your evaluation dataset doesn't contain any small or medium detections to evaluate. I'm not sure if all your detections are large, but if they are not, it might be worth putting some more diverse image samples (i.e. small and medium detections) into your evaluation dataset (always trying to keep it as random as possible). It would be important to train for this as well if you do.

You can learn more about the breakdown of these metrics on the COCO website (COCO - Common Objects in Context); below is a picture of their eval table.

All in all, though, it comes down to what your use for the model is. Is it a theoretical proof-of-concept model, or is the plan to use it in production? If it's about juicing the mAP score, I would throw more training steps at it and see what happens. If it's about moving into production and real-life scenarios, I would run the model over a series of real use cases. One thing that occurs to me is that with X-rays of lungs, people may have other lung issues that are neither COVID/pneumonia nor normal. It might be worth trying to take this into account in your training and evaluation process.

Just some thoughts. Good luck; it seems like an important applied object detection model no matter its final use.

Hi wwfisher. Thank you so much for such a detailed answer. You have clarified many things for me, but I still have some confusion.

As you said, 4000 steps is low and I should run some more. I first trained with 3000 steps, and my training loss was 0.20 and evaluation loss was 0.26. After I trained for 4000 steps, the training loss remained the same but the evaluation loss was 0.27. Finally I trained for 5000 steps: the training loss was 0.19, but the evaluation loss went straight to 0.37. You can see that it is going towards overfitting, and I don't know why.

And I want to ask: as I have 1349 normal chest X-rays and 3883 COVID-19 chest X-rays, should I apply data augmentation to the normal chest X-rays separately to remove the data imbalance? The reason I ask is that the [SSD MobileNet V2 FPNLite 320x320] pipeline.config file already has data augmentation settings in it. So do I need to apply augmentation separately?

And as you said: "One thing that occurs to me is that if it's X-rays of lungs, people may have other lung issues that may not be either covid/pneumonia or normal. It might be worth trying to take this into account in your training and evaluation process."

So here I wanted to ask: if my only purpose is to detect COVID-19, why should I train on other chest problems?

Ideal training lengths change wildly between datasets, and finding them takes a lot of trial and error. The usual practice is to set the training length high (many steps), monitor the training session, and stop when overfitting begins to occur.

It looks like by training for 3000, 4000 and 5000 steps and comparing your results, you're getting a good idea of where overfitting is occurring. It's important to remember that overfitting is where your model is trained too specifically on your data and is not general enough to be accurate on new data (i.e. the model becomes so tuned to the training images that it is no longer useful on anything new).

This article is quite good at explaining some of the common issues surrounding convergence. But it looks like you've found a good balance.
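The "train long, stop when the eval loss starts rising" rule described above can be sketched as a tiny stopping check; this is just an illustration of the idea, not anything built into the OD API:

```python
def should_stop(eval_losses, patience=2):
    """Return True once the evaluation loss has failed to improve for
    `patience` consecutive evaluations (a simple early-stopping rule).
    eval_losses: evaluation losses in checkpoint order."""
    best = float("inf")
    bad_evals = 0
    for loss in eval_losses:
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return True
    return False

# The numbers from this thread: eval loss 0.26 -> 0.27 -> 0.37 never
# improves after the first checkpoint, so with patience=2 this rule
# would have stopped training around the 5000-step run.
stop = should_stop([0.26, 0.27, 0.37], patience=2)
```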

Regarding pipeline augmentation, this is again up to the model trainer. Some people like to maintain full control of their augmentation and apply it manually; some don't mind letting the model/pipeline do it.

The key point is to know which augmentation methods the pipeline is using and which you are applying manually (i.e. are you flipping your images horizontally or vertically, rotating, adding distortion, warping, etc.).
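For reference, the augmentation the pipeline applies is listed in the `data_augmentation_options` blocks under `train_config` in pipeline.config; check your own file for the exact options and values it ships with, but a typical SSD fragment looks roughly like this:

```
train_config {
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
}
```

If you are also augmenting offline, make sure you aren't unknowingly doubling up on (or contradicting) what the pipeline already does.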

It's important to look at what you're trying to detect and consider what a realistic real-world application of the model might be. For instance, if I were trying to detect a dog, it would make sense to augment my training data with horizontal flips, because the dog could be facing either left or right. Rotating the image 180 degrees, though, wouldn't be that useful, because most images of dogs aren't going to be upside down. The picture below demonstrates that. Your lung scans may be different, though, depending on what the scans look like.
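One detail worth knowing if you augment manually: for object detection, a flip has to be applied to the bounding boxes as well as the pixels. A minimal NumPy sketch (using normalized `[ymin, xmin, ymax, xmax]` boxes, the convention the TF OD API uses):

```python
import numpy as np

def flip_horizontal(image, boxes):
    """Flip an HxWxC image left-right and mirror its bounding boxes.
    boxes: array of [ymin, xmin, ymax, xmax] in normalized coordinates."""
    flipped_image = image[:, ::-1, :]
    flipped_boxes = boxes.copy()
    flipped_boxes[:, 1] = 1.0 - boxes[:, 3]  # new xmin = 1 - old xmax
    flipped_boxes[:, 3] = 1.0 - boxes[:, 1]  # new xmax = 1 - old xmin
    return flipped_image, flipped_boxes

# A box hugging the left edge ends up hugging the right edge:
img = np.zeros((4, 4, 3))
boxes = np.array([[0.1, 0.0, 0.9, 0.3]])
_, out = flip_horizontal(img, boxes)
# out -> [[0.1, 0.7, 0.9, 1.0]]
```

The pipeline's built-in augmentation options handle the box updates for you; this is only a concern when you pre-generate augmented images yourself.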

I guess the key thing when augmenting images is to look at what you're trying to detect and augment it in ways that reflect real-world use. Balancing your data is a very good idea as well: as much as you want your model to detect COVID accurately, you also want it to detect normal/no-COVID accurately.
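On the balancing question: the crude version of what you described is oversampling the minority class until the classes match; augmenting only the minority class, as you proposed, achieves a similar effect with more image variety. A naive sketch (plain Python, not an OD API feature):

```python
import random

def oversample(records_by_class):
    """Duplicate randomly chosen minority-class records until every
    class has as many records as the largest class."""
    target = max(len(v) for v in records_by_class.values())
    balanced = {}
    for label, records in records_by_class.items():
        extra = [random.choice(records) for _ in range(target - len(records))]
        balanced[label] = records + extra
    return balanced

# With the counts from this thread, "normal" is padded up to 3883:
data = {"normal": list(range(1349)), "covid": list(range(3883))}
balanced = oversample(data)
```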

In regard to the question about 'images of other lung issues', this again depends on how you intend to use the model. If your use case is that images either have COVID or are normal/no COVID, then this doesn't weigh into the equation; purely detecting COVID may be enough. If, however, your use case is that all images of lungs are being analysed and they may contain any range of different forms, it might make sense to try to differentiate between similar-looking objects. An example of this is shown in the image below:

I'm not totally sure of the answer to this, and I have heard competing logic around the problem. Telling the difference between a square (COVID) and a triangle (no COVID) is fairly straightforward, but what happens when something is close to a square (a different lung infection)? Depending on where your threshold of positive identification sits, the model will have to call it either a square or a triangle (because those are the only two options it can pick from).
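In practice that "threshold of positive identification" is just a confidence cutoff applied to the raw detector scores; where you set it decides how square-like an ambiguous case must be before it is called a square. A minimal sketch:

```python
def filter_detections(detections, score_threshold=0.5):
    """Keep only detections at or above the confidence threshold.
    detections: list of (label, score) pairs from the model."""
    return [(label, score) for label, score in detections
            if score >= score_threshold]

# Hypothetical raw output: the 0.31 borderline case is dropped at 0.5
# but would be kept (possibly wrongly) at a 0.3 threshold.
raw = [("covid", 0.92), ("normal", 0.55), ("covid", 0.31)]
kept = filter_detections(raw, score_threshold=0.5)
```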

This is just something to think about for real-world applications; whether it is what you're actively pursuing is a different story, but it's still worth considering as a theoretical dilemma for your model.

This also leads into the issue of null detections and how to build them into your dataset. Another Roboflow blog article on that is here; it's worth a read even if it's not directly related to your model.

Some of this may have nothing to do with your model, but it's probably worth keeping in mind when looking at your data and trying to understand how your model gets to where it's going (if that makes sense).

Okay, and after training, should I test my model on all of my testing images, which are 624 chest X-ray images? Also, should I find a way to measure the accuracy of my model? Because TensorFlow object detection API models do not show accuracy; they just show mAP.

And also, is it okay if my model is detecting COVID-19 correctly with a score of 80%, 89%, or 70%?

And for chest X-rays, is upside-down augmentation valid, just like the example you showed for the dog? X-ray images could be upside down, right? Or sideways?

Testing your model on as many images as possible is probably a good idea. Once you've run this inference, it's good practice to inspect the test images that were incorrectly detected or not detected at all, and try to analyse why your model struggled with them.

This then gives you a guide for fine-tuning your model for better performance in the future, and also for deciding whether this model can be used successfully and safely in any sort of production setting.

In regard to what a good score is (70%, 80%, etc.), this comes down to your own judgement and the purpose of the model and its application. Is this a pre-screening test that will then be inspected with human intervention, etc.? I can't really make a judgement call on that.

I'm not very familiar with what your chest X-rays look like, but if the bounding box for detection is purely focused on an abnormality and doesn't include any orientation-specific objects (i.e. bones etc.), upside down seems like a valid augmentation. If the bounding box includes things that are orientation-specific (rib cages etc.), it may be counterproductive.

My model has a total loss of 0.24 after training, but an evaluation loss of 0.37. Is that considered overfitting?