A rather tangential comment: this paper is an example of how NOT to write an abstract. An abstract is expected to tell me what new piece of knowledge I can learn by reading further. This abstract contains only about 20% of what a real abstract should; the first half of its first sentence is almost all that's needed from it (it could also say which archs it beats). The rest of the abstract should cover the following, perhaps one sentence each:
1. Intro - a note on the overall problem domain - object detection in this case, zoomed in a bit to the DL space.
2. Related work - work so far in the domain, without criticizing it.
3. Problem statement - what is the knowledge gap in the related work this paper is talking about.
4. Solution - how did we address the gap.
5. Validation - how do we claim our solution addressed the gap it was intended to address.
This paper's abstract covers only the last part and, sporadically, a bit of 2. What I want to know from this abstract is "what is the new learning in the yolov7 arch?"
Perhaps the bigger picture here is that it points to metrics chasing as a proxy for a "research agenda" in the ML community.
Probably the most interesting trick from the paper is using the head as a soft supervisor for earlier layers of the network. The intuition is that if the earlier layers learn to imitate the higher-capacity later layers, it frees up the capacity of the later layers to better learn the residual, and it provides a denser supervisory signal.
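To make the idea concrete, here is a minimal sketch in plain Python. It follows the description in this comment, not the paper's actual label assigner: the lead head is trained against the ground truth, and an auxiliary head attached to earlier layers is trained against the lead head's soft predictions. The function names and the 0.25 auxiliary weight are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one prediction; target may be soft (in [0, 1])."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def combined_loss(lead_logits, aux_logits, hard_labels, aux_weight=0.25):
    """Hypothetical helper: lead head vs. ground truth, aux head vs. lead head.

    aux_weight is an illustrative value, not a number taken from the paper.
    """
    lead_probs = [sigmoid(z) for z in lead_logits]
    aux_probs = [sigmoid(z) for z in aux_logits]
    lead_loss = sum(bce(p, y) for p, y in zip(lead_probs, hard_labels)) / len(hard_labels)
    # Soft targets: the lead head's own predictions, treated as constants.
    aux_loss = sum(bce(p, t) for p, t in zip(aux_probs, lead_probs)) / len(lead_probs)
    return lead_loss + aux_weight * aux_loss, lead_loss, aux_loss

lead = [2.0, -1.5, 0.5]
labels = [1, 0, 1]
_, _, aux_agree = combined_loss(lead, [2.0, -1.5, 0.5], labels)
_, _, aux_disagree = combined_loss(lead, [-2.0, 1.5, -0.5], labels)
print(aux_agree < aux_disagree)  # the aux head is rewarded for imitating the lead head
```

In a real framework the soft targets would be detached from the gradient graph so that the auxiliary loss only updates the earlier layers.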
Yes, but to my surprise the "compound scaling" provides 3x more improvement in their ablation study. Also, I don't understand Table 8 in their ablation study for aux heads, specifically: why does it have different base benchmark values from Tables 6 and 7?
As someone who only got his feet wet with OpenCV like 20 years ago (so basic shape recognition, no AI involved), what reading, software, etc. would you suggest to catch up and play with current technology without being inundated by theory that I'm sure I couldn't grasp?
Go to huggingface.co and start with some of the tutorials. The operational/engineering skill sets alone are all you need to treat modern ML models like any other black box API/SDK.
If you go to the associated code, you'll see that it needs a 'backbone', 'neck' etc. What is a backbone? Questions that arise directly from the code will lead you towards good blog articles, etc. https://huggingface.co/spaces/nateraw/yolov6/blob/main/yolov...
OTOH, you could go and have a look at (for instance) the Stanford vision courses for a more 'theoretical' approach. But the code itself is often a solid guide to what's going on (the frameworks used for Deep Learning map well onto what's being discussed in blogs/lectures/papers).
Github repo mentions "teaser: Yolov7-mask" showing segmentation as well. Highly relevant to my interests. Sadly I can't easily discern any other info on this topic.
What are you using it for, if you can share? I’ve thought about training some of these and releasing the weights, but I’ve never found a reason they’d really be useful to me personally, so it never really happened.
I'm working on a computer vision pipeline that relies heavily on segmentation to detect objects in video feeds. We capture about 6 hours of video each day. So being somewhat close to real time with our processing rate is important ...
Why? For me at this point YOLO means a family of detectors that, in a single pass, propose a bounding box per pixel and filter them with some clustering algorithm. When I see YOLOfoo I know what kind of architecture to expect. A more descriptive name like YOLO-tricks instead of YOLOvX would be nice though.
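For anyone unfamiliar with the filtering step: I'm assuming the "clustering algorithm" here means greedy non-maximum suppression (NMS), which is the usual choice for this family. A rough sketch in plain Python, with boxes as (x1, y1, x2, y2) tuples and an illustrative IoU threshold of 0.5:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is dropped
```

Real implementations run this per class and in vectorized form; the loop above is only meant to show the logic.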
> What good is speed if the accuracy isn't significantly better than a coin flip?
Because distinguishing an object as belonging to one class out of a thousand with 50% accuracy doesn't mean it's a coin flip. You'd need a thousand-sided coin. Random chance in that case is 0.1%, which makes 50% way, way better.
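The arithmetic, spelled out under the assumption of uniform guessing over equally likely classes (the helper name is made up for illustration):

```python
def chance_accuracy(num_classes):
    """Expected accuracy of guessing uniformly at random among balanced classes."""
    return 1.0 / num_classes

num_classes = 1000
model_accuracy = 0.50
baseline = chance_accuracy(num_classes)

print(f"chance: {baseline:.1%}")                                  # chance: 0.1%
print(f"improvement over chance: {model_accuracy / baseline:.0f}x")  # 500x
```

A real coin flip only describes the two-class case, where chance is 50%.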
It’s more nuanced than this. It’s not “look at this image and tell me yes or no if there’s a car in it”; it’s more like “tell me where all the cars are in this image, if any.” We use this a lot, and by ramping up recall we can cover some interesting use cases.
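One common way to "ramp up recall" is to lower the confidence threshold and accept more false positives. A toy sketch, with made-up detections and a hypothetical helper, assuming each detection has already been matched against the ground truth:

```python
def precision_recall(detections, num_ground_truth, threshold):
    """detections: list of (confidence, is_true_positive) pairs."""
    kept = [tp for conf, tp in detections if conf >= threshold]
    true_pos = sum(kept)
    # Convention: precision is 1.0 when nothing is predicted.
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# Illustrative data only: 5 real objects, 6 detections.
detections = [(0.95, True), (0.90, True), (0.70, False),
              (0.60, True), (0.40, False), (0.30, True)]

for threshold in (0.8, 0.25):
    p, r = precision_recall(detections, 5, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.8: precision=1.00, recall=0.40
# threshold=0.25: precision=0.67, recall=0.80
```

Whether the extra false positives are acceptable depends on what happens downstream of the detector.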
I assure you it’s highly useful in the real, real world.
And that's why nobody actually uses it for those things, at least not yet. Don't forget that advancement is often incremental, and that in this case advancement has actually been somewhat fast. YOLOv3 came out in 2018.
In YOLOv7, YOLO and v7 don't go well together. No, not at all. YOLO normally means "You Only Live Once", and v7 means it's lived at least six times before this.
While the author likely didn't have that intention, that's what came across.
Even with YOLO meaning "You Only Look Once", YOLO and v7 do not go well together.
YAML originally stood for "Yet Another Markup Language" until somebody pointed out that it wasn't actually a markup language, so they retro-named it "YAML Ain't Markup Language".