CAP6411 – Fall 2023

Computer Vision Systems (3 Credit Hours)

Course Content

Welcome to the Fall 2023 semester, and thank you for choosing this class.

To start, please reply to the following questions:

  1. Have you taken CAP5415? This is a prerequisite because the goal of this class is not to go through all the computer vision topics, which are covered in CAP5415, but to hone your skills in building computer vision systems.
  2. Do you have any experience in coding and training deep learning models, including Convolutional Neural Networks (CNN), transformers, etc.?
  3. Please register a user account on the UCF Newton GPU cluster; you will need it to complete class assignments and projects. Link: — if you run into issues, you may contact

Potential Valuable Resources

Hugging Face has a large repository of valuable models and code for computer vision, language, and multimodal tasks. It may be helpful for finding code snippets, models, etc. for your assignments and team projects.


This is a graduate course that builds on CAP 5415, so you should already have a “basic knowledge” of some state-of-the-art computer vision techniques. If you don’t have the prerequisite class but have taken other graduate computer vision classes before, please reach out to me.

Class Format

  1. There are about 16 weeks of classes, meeting twice a week on Tuesdays and Thursdays, 3:00-4:15pm. Classes will be held both online and face-to-face, but face-to-face is preferred as much as possible. During the week of Oct 2-6, I will be traveling to the ICCV conference, so that week will be online. Any other weeks we have to go online due to unforeseen circumstances will be announced.
  2. Reading papers before class will be beneficial, but not required. Before each class, a paper will be posted for you to read ahead of time. In class, I will go through the paper’s technique (with slides) and the corresponding reported results.
  3. Individual assignments (70% of your grade):
    • After we go through a paper (which can take 1-3 classes), each student will present a live demo a week after the paper is taught, where we will select a live input (e.g., an image found live on Google for image classification with ResNet-50). Each student will provide a short report (1-3 pages) on what was learned and what issues arose while trying to run the code. Finally, the report should also contain some thoughts on how the model could be improved and/or deployed in the real world.
    • Each assignment will be given a score out of 100. We will then divide your total score by (the number of assignments we managed to do multiplied by 100). This ratio contributes 70% of your final grade.
    • Plagiarism: Students should not share or copy code and/or reports. The goal of these assignments is to ensure each student gets first-hand knowledge of running these state-of-the-art techniques.
  4. Team project (30% of your grade):
    • Due on Dec 4th. Each team will submit a detailed report and a demo video. Over the following two weeks (Tuesdays and Thursdays, 4 days total), each team will present their project and a live demo on the board. To ensure fairness (otherwise the team presenting last has the most time to finish the project), the live demo must be as close to the submitted video as possible. Projects will be judged according to:
      • Creativity and novelty (40%, with a special surprise for some of the most creative ideas)
      • Code clarity and correctness (30%)
      • Clarity of report (30%)
        • Motivation of the idea is clearly articulated
        • Literature review is comprehensive
        • Experiments clearly showing good performance
        • A section on what was accomplished each week and the designation of tasks
      • In addition, we will also look at the size of the project (on a scale of 0 to 1, where 1 is worthy of about 3.5 months of effort, from the start of the course to Dec 4th). This will be used as a multiplier: for example, if you score 80 on the first three criteria but your project size is graded at 0.6 because it does not fully require 3.5 months, then the final project score is 80 × 0.6 = 48.
    • We will have groups of 4-5 students for each project team.
    • Potential projects we can implement on the board:
      1. Project LLaVA:
        • The smallest model has 7B parameters, which will not fit on the board.
        • The team will have two options:
          1. Build a LLaVA 1.5B version.
          2. Call LLaVA in the cloud from the board.
      2. Project LISA: same as LLaVA, the model is not likely to fit on the board.
      3. Project SAM: same as above.
      4. Project DALLE: same as above.
    • All the above suggested projects will require some sort of UX on top of the board. Synaptics will provide some support if needed.
      • Teams can also exercise their creativity and build the coolest AI applications on the board (extra bonus can apply).
    • Plagiarism: Teams are welcome to share code, knowledge, and UX tips about the board’s software stacks, etc.
  5. For in-person students, attendance will be taken and my policy is that you should not miss more than 20% of the classes.
    • Please ensure you have the UCF Here mobile app, as attendance will be taken by scanning a QR code with the app twice.
    • We are all grown-ups, so if there are extenuating circumstances, this is not a hard rule. Please come talk to me if you are facing difficulties.
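Putting the two grading components above together, here is a minimal sketch of how the final grade could be computed. This is purely illustrative; the function and variable names are my own and not part of any official grading code:

```python
def course_grade(assignment_scores, project_quality, project_size):
    """Illustrative grade computation per the syllabus rules.

    assignment_scores: list of per-assignment scores, each out of 100
    project_quality:   weighted project score (0-100) over the three criteria
    project_size:      size multiplier in [0, 1]
    """
    # Assignments: total score divided by (number of assignments * 100),
    # contributing 70% of the final grade.
    assignment_fraction = sum(assignment_scores) / (len(assignment_scores) * 100)
    # Project: quality score scaled by the size multiplier (e.g., 80 * 0.6 = 48),
    # contributing 30% of the final grade.
    project_score = project_quality * project_size
    return 70 * assignment_fraction + 30 * (project_score / 100)
```

For example, with assignment scores of 90, 80, and 100, a project quality score of 80, and a size multiplier of 0.6, this yields 70 × 0.9 + 30 × 0.48 = 77.4.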

Student Learning Outcomes

The goal of this course is to equip students with the ability to understand state-of-the-art techniques in computer vision, replicate results reported in papers (mostly using open-sourced GitHub code), and make recommendations on how to improve them and/or deploy them in real-world settings (we will have several GPU boards, sponsored by Synaptics, on which we will attempt implementations). Note that we won’t be able to cover all the topics in computer vision, given that the goal of this course is to learn how to build computer vision systems.

Topics covered (subject to small changes):

  1. Foundational Models:
    • Convolutional Neural Networks (CNNs), especially Residual Networks (ResNets), for ImageNet classification.
    • Vision Transformers (ViTs) for ImageNet classification.
    • R-CNN models (Fast and Faster R-CNN) for object detection.
    • Panoptic segmentation (link)
    • CLIP models for zero-shot image-text classification.
  2. Self Supervised Learning (SSL):
    • A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
    • DinoV2
  3. Generative Models:
    • Dall-E and Dall-E2 models (link)
    • Stable diffusion (link)
    • Segment Anything Model (SAM)
    • LISA: Reasoning Segmentation via Large Language Model
    • LLaVA: Large Language and Vision Assistant
  4. Model distillation, Smaller is Better (if we have time):
    • One of the key challenges in building CV systems is that many models these days are too big to deploy on the edge. Many techniques have been proposed to distill large models into smaller, deployable models. I will go through some of this work if we have enough time left.
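To give a taste of the distillation topic above, here is a minimal plain-Python sketch of the temperature-softened KL-divergence term used in Hinton-style knowledge distillation, where a small student network is trained to match a large teacher’s softened output distribution. The helper names are illustrative, not from any particular library:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher temperature softens the distribution,
    # exposing the teacher's "dark knowledge" about non-target classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this term is combined with an ordinary cross-entropy loss on the ground-truth labels; the distillation term pulls the student’s predictions toward the teacher’s.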

Statement on Academic Integrity:

The UCF Golden Rule will be observed in the class. Plagiarism and Cheating of any kind on an examination, quiz, or assignment will result at least in an “F” for that assignment (and may, depending on the severity of the case, lead to an “F” for the entire course) and may be subject to appropriate referral to the Office of Student Conduct for further action. I will assume for this course that you will adhere to the academic creed of this University and will maintain the highest standards of academic integrity. In other words, don’t cheat by giving answers to others or taking them from anyone else. I will also adhere to the highest standards of academic integrity, so please do not ask me to change (or expect me to change) your grade illegitimately or to bend or break rules for one person that will not apply to everyone.