Projection
Consider two vectors \(\mathbf{a}\) and \(\mathbf{b}\), and let \(\theta\) be the angle between them. We want to find the projection of \(\mathbf{b}\) onto \(\mathbf{a}\). The scalar component of \(\mathbf{b}\) along \(\mathbf{a}\), written \(\mathsf{comp}_{\mathbf{a}}\mathbf{b}\), satisfies the following:
\[\cos{(\theta)} = \frac{\mathsf{comp}_{\mathbf{a}}\mathbf{b}}{\lVert\mathbf{b}\rVert}, \qquad \mathsf{comp}_{\mathbf{a}}\mathbf{b} = \lVert\mathbf{b}\rVert\cos{(\theta)} = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert} \]
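As a quick check, the scalar component can be computed directly from the dot product, since \(\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\lVert\mathbf{b}\rVert\cos\theta\). A minimal sketch (the function name `comp` is illustrative):

```python
import math

def comp(b, a):
    """Scalar component of b along a: |b| cos(theta) = (a . b) / |a|."""
    dot = sum(bi * ai for bi, ai in zip(b, a))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    return dot / norm_a

# For b = (3, 4) and a = (1, 0), the component along a is 3.
```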
Paper review #1 - Human Detection using Learned Part Alphabet and Pose Dictionary
Authors: C. Yao, X. Bai, W. Liu, L.J. Latecki
In this work, the authors present a part-based pedestrian detection approach analogous to an alphabet: characters correspond to parts, while words correspond to poses, i.e., combinations of parts. Like words, poses are not random; they follow a structure, such as first the head, then the shoulders, and so on. However, the parts must first be extracted to build the alphabet.
For this purpose, the authors employ discriminative clustering to gather parts that are both representative and distinctive. This differs from other works, which usually rely on annotated parts; annotation reduces the number of usable training samples, and the annotated parts are not necessarily the most representative or distinctive ones.
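The paper's discriminative clustering has its own objective; as a simplified stand-in, plain k-means over part descriptor vectors conveys the idea of grouping similar patches into "letters" (this is not the authors' algorithm, just a sketch):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over descriptor vectors: a simplified stand-in for
    the paper's discriminative clustering of part candidates."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each descriptor to its nearest center (squared L2).
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        for i, c in enumerate(clusters):
            if c:  # Recompute each center as the mean of its cluster.
                centers[i] = tuple(sum(v) / len(c) for v in zip(*c))
    return centers, clusters
```

Each resulting cluster would then play the role of one letter of the alphabet.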
After gathering the parts, where each part cluster represents a letter of the alphabet, the next step is to generate the pose dictionary. This is done by 1) applying the part classifiers to the training images; 2) sorting the detected parts within a bounding box by their angle; and 3) storing each pose relative to an azimuthal reference, so that a pose is a sequence of cluster indices.
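The sorting-and-encoding step can be sketched as follows. Here `parts` is a hypothetical list of `(cluster_index, (x, y))` detections; the names are illustrative, not the paper's:

```python
import math

def encode_pose(parts, center):
    """Sort part detections by azimuthal angle around the bounding-box
    center and emit the sequence of cluster indices as a pose 'word'."""
    def angle(part):
        x, y = part[1]
        # Angle in [0, 2*pi) measured from the azimuthal reference.
        return math.atan2(y - center[1], x - center[0]) % (2 * math.pi)
    return tuple(idx for idx, _ in sorted(parts, key=angle))
```

The resulting tuple of cluster indices is the "word" stored in the pose dictionary.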
At test time, part detectors are applied in a sliding-window fashion, which generates a hypothesis map. Each part casts a vote for its centroid in a Hough map, which is then filtered with a mean-shift algorithm for non-maximum suppression (NMS). To finish this step, the algorithm estimates the bounding-box size of candidate pedestrians using the following equations:
\[ w(h) = \frac{ \sum_l \rho(Q_l)\, w(Q_l)\, \hat{w}_{Q_l} }{ \sum_l \rho(Q_l)\, w(Q_l) } \]
\[ h(h) = \dots \]
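Reading the width estimate as a \(\rho \cdot w\)-weighted average of the part-specific widths \(\hat{w}_{Q_l}\) (an assumption on my part; the height is analogous), this amounts to:

```python
def estimate_width(detections):
    """Estimate the bounding-box width as a weighted average of the
    canonical widths w_hat associated with the detected parts Q_l,
    weighted by confidence rho and detection width w. The tuple layout
    (rho, w, w_hat) is illustrative, not the paper's notation."""
    num = sum(rho * w * w_hat for rho, w, w_hat in detections)
    den = sum(rho * w for rho, w, _ in detections)
    return num / den
```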
The last step is the classification based on text recognition metrics. They propose the use of three metrics from the text recognition literature:
Local evidence: the total vote for hypothesis \( h \);
Interactions among parts: hypothesis verification via dictionary search (edit distance);
Global information: a root filter with components for viewpoint changes.
Experiments
The root HOG outperforms Dalal and Triggs' HOG because it is applied over stronger hypotheses, obtained from the part detectors.
Questions
1. When training a component to cover each viewpoint, do the authors use the pose dictionary to gather the training samples for that viewpoint?
2. If we split the training set by viewpoint, we have fewer training samples for each component. Does this reduce the model's predictive power?
Conclusion
Beyond other contributions such as the 3D-to-1D pose encoding, the main achievement of this work is a better way to capture part information, use it to train the part classifiers, and then generate hypotheses from the resulting Hough map.
The Drunkard's Walk - Chapter 1
I'm starting a series of summaries of the books I'm reading. I'll begin with the first chapter of The Drunkard's Walk, by Leonard Mlodinow.