/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.





Robot Eyes/Vision General Robowaifu Technician 09/11/2019 (Wed) 01:13:09 No.97
Cameras, Lenses, Actuators, Control Systems
Unless you want to deck out your waifubot in dark glasses and a white cane, learning about vision systems is a good idea. Please post resources here.
opencv.org/
https://archive.is/7dFuu
github.com/opencv/opencv
https://archive.is/PEFzq
www.robotshop.com/en/cameras-vision-sensors.html
https://archive.is/7ESmt
>=== -patch subj
Edited last time by Chobitsu on 12/27/2024 (Fri) 17:31:13.
>>23776 what is mAP?
>>23777
The diagram indicates it's an accuracy metric used to compare such models.
>Mean Average Precision (mAP)
https://blog.paperspace.com/mean-average-precision/
Found via: https://duckduckgo.com/?q=map+machine+learning+accuracy
>>23776
Very low latency in detection is vital insofar as her autonomous safety is concerned. The ideal is human-level speed at object recognition (or even faster). We're probably getting pretty close on smol devices already, so I predict we'll reach this goal generally by the time the first real-world robowaifus begin rolling out. Thanks Anon.
>>24909
- The computers connected to the eyes (cameras) should have different ways of sharing data with other computers, e.g. just sharing body movement analysis and recognition info as a text stream, and the same for which person was detected or some emotional indicators. Sending photos and videos should be very limited: only send encrypted files, and the system should mostly not store this data. Some home server might store and process some data for fine-tuning, but it needs to receive this data encrypted. The decision about what to share should be made based on overall context coming from the general cognitive architecture >>24783
- Fast and efficient segmentation of images (FPGAs?)
- Different variants of the same image, created very fast, maybe using an FPGA, for further processing, e.g. only processing a low-res partial image of an object to keep track of it. The creation of that low-res partial image should be done by a specialized system close to the cameras.
- Using object detection models informed by context from the general cognitive architecture >>24783, or just based on awareness of what room she's in and maybe even what she's looking at. That way they can be smaller, faster and more specialized, including some models trained on data specific to the household (photos and videos of the home environment).
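A minimal sketch of the "share a text stream instead of video" idea above, assuming some made-up field names; the point is just that only a line of JSON per detection ever leaves the vision computer, never pixels:

import json
import time

def detection_event(label, confidence, box, room="unknown"):
    # Build a compact, image-free event the cognitive architecture can consume.
    return {
        "timestamp": time.time(),
        "room": room,                  # context hint: which room she is in
        "label": label,                # e.g. "person", "cup"
        "confidence": round(float(confidence), 3),
        "box": [int(v) for v in box],  # x, y, w, h in pixels; no pixel data itself
    }

if __name__ == "__main__":
    event = detection_event("person", 0.91, (120, 80, 200, 400), room="kitchen")
    print(json.dumps(event))           # this one line of text is all that leaves the vision computer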
Open file (215.89 KB 869x350 Screenshot_114.png)
Open file (326.25 KB 879x492 Screenshot_113.png)
Open file (162.74 KB 878x396 Screenshot_112.png)
>LERF optimizes a dense, multi-scale language 3D field by volume rendering CLIP embeddings along training rays, supervising these embeddings with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively in real-time. LERF enables pixel-aligned queries of the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, supporting long-tail open-vocabulary queries hierarchically across the volume. >With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance. https://www.lerf.io https://github.com/kerrj/lerf https://drive.google.com/drive/folders/1vh0mSl7v29yaGsxleadcj-LCZOE_WEWB?usp=sharing https://arxiv.org/abs/2303.09553
> Face recognition
Not tested, just looking at what's available:
https://github.com/cmusatyalab/openface
The following quotes are from Reddit, not from me...
https://github.com/ageitgey/face_recognition
> I have tried this out. It's easy to code and accurately recognizes faces. The problem is it can't even detect faces 1 feet away from the camera.
https://github.com/timesler/facenet-pytorch (FaceNet & MTCNN)
> This can detect and recognize faces at a distance, but the problem is it can't recognize unknown faces correctly. I mean for unknown faces it always tries to label it as one of the faces from the model/database encodings.
https://github.com/serengil/deepface
> I have tried VGG, ArcFace, Facenet512. The latter two gave me good results. But, the problem is I couldn't figure out how to change the detection from every 5 seconds to real-time. Also, I couldn't change the camera source. (If anyone can help me with these please do). Also, it had fps drops frequently.
https://github.com/deepinsight/insightface
> Couldn't test this yet. But in the demo YT video it shows the model incorrectly detecting a random object as a face. If someone knows how well this performs please let me know.
https://www.reddit.com/r/computervision/comments/15ycwom/face_recognition_whats_the_state_of_the_art/
This here seems to be the best: https://github.com/ZoneMinder/zoneminder
The Reddit link above has a thread and a patch for detecting faces at a distance, I think.
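For anyone who wants to try the first library above, here's a minimal sketch with the ageitgey face_recognition package; the image file names are placeholders and the 0.6 tolerance is just the library's default, not a tuned value:

import face_recognition

# Encode one known face from a reference photo (placeholder file name)
known_image = face_recognition.load_image_file("known_anon.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# Detect and encode all faces in a camera frame saved to disk (placeholder file name)
frame = face_recognition.load_image_file("camera_frame.jpg")
locations = face_recognition.face_locations(frame)
encodings = face_recognition.face_encodings(frame, locations)

for (top, right, bottom, left), encoding in zip(locations, encodings):
    match = face_recognition.compare_faces([known_encoding], encoding, tolerance=0.6)[0]
    print(("known" if match else "unknown"), "face at", (left, top, right, bottom))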
Open file (537.86 KB 877x878 LLaVA.png)
LLaVA: Large Language and Vision Assistant (https://llava-vl.github.io/)
A project to integrate vision into large language models. Though still very new and young as a concept, adding visual context to language models has tremendous potential. Notably, a waifu which can understand correlations between what she perceives in her environment and what she is told can lead to much more natural-feeling interactions. Fingers crossed for a fork that implements YOLO (https://pjreddie.com/darknet/yolo/) rather than CLIP (https://openai.com/research/clip) for better compute and memory efficiency. Getting this to run at sub-10 watts should be a goal.
Edited last time by Kiwi_ on 10/11/2023 (Wed) 18:33:59.
Open file (82.60 KB 386x290 Screenshot_158.png)
I was working on this here >>26112, using OpenCL to make video processing faster. Then I got this here recommended by YouTube: https://www.youtu.be/0Kgm_aLunAo
Github: https://github.com/jjmlovesgit/pipcounter
This is using OpenCV to count pips on dominoes, and does it much faster and better than GPT4-Vision. I wonder if it would be possible to have an LLM adjust the code depending on the use case, maybe with a library of common patterns to look out for. Ideally one would show it something new, it would detect the outer border (like the stones here), and then adjust until it can catch the details on all of the objects which are of interest. It could look out for patterns depending on some context, like e.g. a desk.
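For reference, a rough sketch of how pip counting with plain OpenCV could look; this is not the pipcounter repo's actual code, and the Hough circle parameters are guesses that would need tuning per camera:

import cv2

img = cv2.imread("dominoes.jpg")                 # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)                   # smooth speckle before circle detection

# Detect small circular blobs (pips); all parameters here are starting guesses
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=15,
                           param1=100, param2=30, minRadius=4, maxRadius=20)

count = 0 if circles is None else circles.shape[1]
print(f"Pips detected: {count}")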
>>26132 >and does it much faster and better than GPT4-Vision. Doesn't really surprise me. OpenCV is roughly the SoA in hand-written C++ code for computer vision. You have some great posts ITT Anon thanks... keep up the good work! :^)
There are several libraries and approaches that attempt to achieve generalized object detection within a context, although creating a completely automatic, context-based object detection system without predefining objects can be a complex task due to the variability of real-world scenarios. However, libraries and methodologies that have been utilized for more general object detection include:

1. YOLO (You Only Look Once): YOLO is a popular object detection system that doesn't require predefining objects in the training phase. It uses a single neural network to identify objects within an image and can detect multiple objects in real-time. However, it typically requires training on specific object categories.

2. OpenCV with Haar Cascades and HOG (Histogram of Oriented Gradients): OpenCV provides Haar cascades and HOG-based object detection methods. While not entirely context-based, they allow for object detection using predefined patterns and features. These methods can be more general but might not adapt well to various contexts without specific training or feature engineering.

3. TensorFlow Object Detection API: TensorFlow offers an object detection API that provides pre-trained models for various objects. While not entirely context-based, these models are designed to detect general objects and can be customized or fine-tuned for specific contexts.

4. Custom Object Detection Models with Transfer Learning: You could create a custom object detection model using transfer learning from a pre-trained model like Faster R-CNN, SSD, or Mask R-CNN. By fine-tuning on your own dataset, the model could adapt to specific contexts.

5. Generalized Shape Detection Algorithms: Libraries like scikit-image in Python provide various tools for general image processing and shape analysis, including contour detection, edge detection, and morphological operations. While not object-specific, they offer tools for identifying shapes within images.

Each of these methods has its advantages and limitations when it comes to general object detection. If you're looking for a more context-aware system that learns and adapts to various contexts, combining traditional computer vision methods with machine learning models trained on diverse images may be a step towards achieving a more generalized object detection system. However, creating a fully context-aware, automatic object detection system that adapts to any arbitrary context without any predefined objects is still a challenging area of research.

-----------------

In terms of computational requirements, here's a general ranking of the mentioned object detection methods based on the computational power and RAM they might typically require:

1. OpenCV with Haar Cascades and HOG:
- Computational Power Needed: Low to Moderate
- RAM Requirements: Low
- These methods are computationally less intensive compared to deep learning-based models. They can run on systems with lower computational power and memory.

2. Generalized Shape Detection Algorithms (scikit-image):
- Computational Power Needed: Low to Moderate
- RAM Requirements: Low to Moderate
- While these libraries might need slightly more computational power and RAM than Haar Cascades and HOG, they are still less demanding compared to deep learning-based models.

3. TensorFlow Object Detection API:
- Computational Power Needed: Moderate to High
- RAM Requirements: Moderate to High
- Running pre-trained models from the TensorFlow Object Detection API might require more computational power and memory compared to traditional computer vision methods due to the complexity of the deep learning models.

4. Custom Object Detection Models with Transfer Learning:
- Computational Power Needed: Moderate to High
- RAM Requirements: Moderate to High
- Training custom object detection models with transfer learning typically requires moderate to high computational power and memory, especially during the training phase.

5. YOLO (You Only Look Once):
- Computational Power Needed: High
- RAM Requirements: High
- YOLO models are relatively demanding in terms of computational power and memory. They require more powerful machines due to their deep neural network architecture and real-time processing capabilities.

The exact computational requirements and memory usage can vary based on the specific hardware, image sizes, complexity of the models, and the scale of the operations being performed. Deep learning models, in general, tend to demand more computational resources compared to traditional computer vision methods. If you're working with large datasets or real-time processing, more powerful hardware configurations would likely be necessary to achieve optimal performance.

--------

https://github.com/opencv/opencv/tree/master/data/haarcascades

--------

If your goal is to detect shapes without knowing the specific objects at first, OpenCV's contour detection methods combined with image processing techniques could be more appropriate than scikit-learn. Once shapes are identified, further analysis or categorization can be performed using traditional machine learning algorithms from scikit-learn or other methods.
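A minimal sketch of route 2 (Haar cascades), using the cascade files OpenCV ships with from the folder linked above; the image path is a placeholder and the scaleFactor/minNeighbors values are common starting points, not tuned results:

import cv2

# OpenCV bundles the haarcascades folder; this resolves to its install location
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("room.jpg")                     # placeholder image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
print(f"Faces found: {len(faces)}")
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw a box per detection
cv2.imwrite("room_annotated.jpg", img)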
>>26146 Understood. If the goal is to identify various objects within a specific context (like a desk) without predefining the objects, and the lighting conditions might vary, a more flexible approach using general computer vision techniques can be applied. This could involve methods such as contour detection, edge detection, and basic image processing techniques to identify objects within the context of a desk. You might use a more generalized version of object detection that isn't specific to particular objects but rather identifies any distinguishable shape within the context. Here's an example:

import cv2

# Read the image
image = cv2.imread('path_to_your_image.jpg')

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply thresholding (or other preprocessing) to enhance object edges;
# Otsu picks a threshold automatically, which helps with varying lighting
_, processed_image = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find contours
contours, _ = cv2.findContours(processed_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

some_minimum_area_threshold = 500  # in pixels; tune for your image size

detected_objects = []
for contour in contours:
    # Apply some conditions to filter objects based on size, shape, etc.
    # For instance, you might filter by area or aspect ratio
    area = cv2.contourArea(contour)
    if area > some_minimum_area_threshold:
        detected_objects.append(contour)

# Count and display the number of detected objects
print(f"Number of objects detected: {len(detected_objects)}")

This code applies general techniques such as contour detection to identify distinguishable shapes within the context of the desk. The process of identifying objects relies on the uniqueness of their shapes and their contrast against the background. The challenge in this approach lies in how the algorithm distinguishes objects based on their shapes and sizes. It might not identify specific objects but rather any shape that meets certain criteria (like area, aspect ratio, etc.) within the provided context (in this case, the desk). This method might detect a variety of objects but could also identify false positives or miss some objects. Fine-tuning the conditions for object identification (like area thresholds or other characteristics) can improve the accuracy of detection within the context of the desk, considering the variability in lighting and object characteristics.
Open file (346.77 KB 696x783 1698709850406174.png)
Open file (199.93 KB 767x728 1698710469395618.png)
I suppose this is a good thread to use for discussing this concept: a swarm of small drones available for a robowaifu's use for enhanced perimeter/area surveillance, etc.
>1.6B parameter model built using SigLIP, Phi-1.5 and the LLaVA training dataset. Weights are licensed under CC-BY-SA due to using the LLaVA dataset. Try it out on Hugging Face Spaces!
https://github.com/vikhyat/moondream
https://huggingface.co/spaces/vikhyatk/moondream1
https://youtu.be/oDGQrOlmC1s
>The model is released for research purposes only, commercial use is not allowed.
>circa 6GB, or 4GB quantized
>>29286 Thanks. Do you have any views on its usefulness r/n, Anon?
Open file (84.01 KB 960x720 yuina.png)
For people looking for a Kinect, I've had success finding them at electronics recycle centers. RE:PC in Seattle had a big bin. Also, I just checked and they're going for under ten dollars on eBay lol. I had also heard that the Kinect's depth camera isn't all too necessary at this point due to how good neural networks have gotten recently. Is there any merit to that?
>>29911 Unless you're using the Kinect to do some sort of 3D mapping, you can get things like pose landmark detection using AI and a standard webcam, e.g. Gulag's open-source library MediaPipe. https://mediapipe-studio.webapps.google.com/home I use some of their models for object recognition :D
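In case anyone wants a starting point, here's a minimal sketch with MediaPipe's older "solutions" Pose API and a standard webcam; the newer MediaPipe Tasks API looks a bit different, so treat this as a rough example:

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
cap = cv2.VideoCapture(0)                      # standard webcam

with mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:
    for _ in range(100):                       # just grab ~100 frames for the demo
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # MediaPipe wants RGB
        if results.pose_landmarks:
            nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
            print(f"nose at x={nose.x:.2f} y={nose.y:.2f}")             # coords normalized to 0..1
cap.release()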
>>29911 >>29915 Thanks for both the great tips, Anons! Cheers. :^)
>>29367 I think we would need workarounds if such models are not fast enough, but wow, it needs less than a second to identify common objects in a photo of a room in a home. I guess on a smaller computer it would be slower, but still. This is good enough for now, and it's just a stepping stone. Keep in mind, we don't need it as fast and general as the AI in cars. The waifus will mostly look at the same home with the same objects all the time.
>>29911 My issue is rather that I don't want to use a device which I can only get from recycling centers. Also, I want two cams which can move on their own, where I decide how far apart they are. I guess something like Kudan will be the way to go: https://www.youtube.com/@KudanLimited
A bit odd no one mentioned LiDAR. It would allow for a better sense of depth, and of objects behind her, outside ordinary vision, to avoid walking backwards into someone or elbowing them.
>>30138 but the cyberninjas wear black
>>30139 Black clothes aren't that black, especially as the dye fades over time. If you want to be picky: since it's just a secondary source of sight, you could compromise on resolution and use radar instead, just for general awareness, so she knows to carefully turn and see what is at that location.
>>30138 To add to my earlier point. I found a diy LiDAR that is supposed to cost $40 to make. https://www.instructables.com/Project-Lighthouse-360-Mini-Arduino-LiDAR/
>>30174 > I found a diy LiDAR that is supposed to cost $40 to make. I'd think that's a game-changer for the mapping need, if it's legit and reliable. Thanks, Anon! Cheers. :^)
>>30180 Considering the usual cost of LiDAR, I'm thinking this is a bit less accurate and shorter range, but it's likely still useful for this kind of application. I'm not sure why the developer made his videos private. They might still be viewable through the Archive.
>>30189
>Considering the usual cost of LiDAR, I'm thinking this is a bit less accurate and shorter range, but it's likely still useful for this kind of application.
Yeah, makes sense.
>I'm not sure why the developer made his videos private.
In my experience, that's one of the first signs that an opensource system is going closed source. They block the assets from the public b/c """reasons""".
>They might still be viewable through the Archive.
Not sure what that means.
>>30190
>signs that an opensource system is going closed source
He left up the files for making it, though. Apparently his whole YouTube channel is gone.
>Not sure what that means.
I found the URL for one video at least that was archived. The follow-up update video wasn't archived, unfortunately.
https://web.archive.org/web/20210202100801/https://www.youtube.com/watch?v=uYU534Wn4lA
I managed to find a similarly priced one (though a little more costly) that used to be available as a kit, but it appears to be a different design. The website seems to no longer exist.
https://web.archive.org/web/20211129020703/https://curiolighthouse.wixsite.com/lighthouse
Found that one from a video of some guy assembling it: https://www.youtube.com/watch?v=_aRcoI25HqE
Going down that rabbit hole from YouTube recommended vids led me to two others. This one is $44, but it's a single point instead of 360º: https://www.dfrobot.com/product-1702.html
This one is $99: https://www.dfrobot.com/product-1125.html
>>30193 Wait never mind about the curiolighthouse. It seems my browser was just not redirecting to the page properly. That site is still up.
Just found out that 3D cameras for sensing depth are called a "depth camera", "3D depth sensor" or "stereoscopic depth sensor"; sometimes terms like "binocular depth camera" appear too. They capture color (some IR too) and depth in a single system, like our vision works. Though if you used one of these premade units it would mean having only head turning, not eye turning.
>>29915 Started on the kinect lite guide because I don't want giant XBOX 360 bars on my robot's face. And just now after saying it I regret hacking it apart. It's still huge after making it half the size, the length of a smartphone. https://medium.com/robotics-weekends/how-to-turn-old-kinect-into-a-compact-usb-powered-rgbd-sensor-f23d58e10eb0
>>30877 I know this is a stupid question, but can you strip those components right out of the support frame and have them simply connected to the wires?
>>30879 Zoom in to the hole in the centre. It looks like there is a circuit board under there. If one were to take it out of the frame, it would require adding wires and attaching them back to the circuit board, I imagine.
>>30879 >>30880 I expect the physical positioning of the 3 camera components is tightly registered. Could be recalibrated I'm sure, but it would need to be done.
>>30879
>Depth Perception
From what I know, these systems work by knowing the distance between the two cameras, which is fixed in the hardware. If you want to do this yourself, then your system would need to know that distance. I think Kudan SLAM is a software doing that: >>29937 and >>10646
>Kudan Visual SLAM
>This tutorial tells you how to run a Kudan Visual SLAM (KdVisual) system using ROS 2 bags as the input containing data of a robot exploring an area
https://amrdocs.intel.com/docs/2023.1.0/dev_guide/files/kudan-slam.html
>The Camera Basics for Visual SLAM
>"Simultaneous Localization and Mapping usually refer to a robot or a moving rigid body, equipped with a specific sensor, that estimates its motion and builds a model of the surrounding environment, without a priori information [2]. If the sensor referred to here is mainly a camera, it is called Visual SLAM."
https://www.kudan.io/blog/camera-basics-visual-slam/
>... ideal frame rate ... 15 fps: for applications with robots that move at a speed of 1~2m/s
>The broader the camera's field of view, the more robust and accurate SLAM performance you can expect up to some point.
>...the larger the dynamic range is, the better the SLAM performance.
>... global shutter cameras are highly recommended for handheld, wearables, robotics, and vehicles applications.
>Baseline is the distance between the two lenses of the stereo cameras. This specification is essential for use-cases involving Stereo SLAM using stereo cameras.
>We defined Visual SLAM to use the camera as the sensor, but it can additionally fuse other sensors.
>Based on our experience, frame skip/drop, noise in images, and IR projection are typical pitfalls to watch out for.
>Color image: Greyscale images suffice for most SLAM applications
>Resolution: It may not be as important as you think
>Visual SLAM: The Basics - https://www.kudan.io/archives/433
Edit: Added the tutorial and articles about "Camera Basics" and "Visual SLAM Basics".
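A rough sketch of turning a self-built stereo pair into depth with OpenCV's block matcher; it assumes the two images are already rectified, and the baseline/focal values are placeholder calibration numbers you'd get from cv2.stereoCalibrate:

import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder rectified left image
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # placeholder rectified right image

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point x16

baseline_m = 0.06    # distance between the two lenses (the "baseline" above); from calibration
focal_px = 700.0     # focal length in pixels; also from calibration

valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = (focal_px * baseline_m) / disparity[valid]   # Z = f * B / d
if valid.any():
    print("median depth of valid pixels:", float(np.median(depth_m[valid])), "m")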
Open file (225.52 KB 1252x902 kinectxie.jpg)
>>30877 The Kinect was cheap at $12, and I scaled it to the full-sized robot head in GIMP. I can use the main camera in the middle of the aperture, and the two projector/IR camera lenses as the eye shines. It won't look like this in the final robot head, but it will be positioned in this manner.
Will Cogley came out with a snap fit eye mechanism (no screws needed). > By removing ALL fasteners and using a 100% snap-fit assembly, assembly time is cut down 6 fold! Hopefully this design will also be more accessible if you struggle to get the right parts for my projects. If you don’t want to use my new PCB design (which admittedly is a work in progress) refer to [my previous design](https://www.notion.so/Simple-Eye-Mechanism-983e6cad7059410d9cb958e8c1c5b700?pvs=21) for electronics/wiring instructions. > If you do want to use the PCB, note that its still a work-in-progress. The design works although there is an issue with some holes being undersized. In theory the attached file is fixed but I’ve yet to test it myself to be 100% sure! https://youtu.be/uzPisRAmo2s https://nilheim-mechatronics.notion.site/Snap-fit-Eye-Mechanism-b88ae87ceae24d1ca942adf34750bf87
> (eye-assembly -related : >>35165 )
> (eye-design -related >>35318, >>35338 )
>>1666 >>8817 >>26306
There seems to be some interest in display "eyes" that don't actually help the robot to see, but probably not enough for its own thread, so for now I'll just park this here.
From this thread on the dollforum: NSFW https://dollforum.com/forum/viewtopic.php?t=189110
Links in thread reproduced here, just in case:
An example of a sexdoll on reddit (NSFW): https://www.reddit.com/r/SexDolls/comments/1gvulh4/video_custom_eyes/
Same doll with a different image for emotion (NSFW): https://www.reddit.com/r/SexDolls/comments/1gxums5/kawaii/
Same doll, different display with moving tongue (NSFW): https://www.reddit.com/r/SexDolls/comments/1gxvwme/omg_thats_good/
A display entry on Amazon. Search "round tft display" as offerings change over time: https://www.amazon.com/gp/product/B0B7TFRNN1/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1
An Instructables article on the software: https://www.instructables.com/TFT-Animated-Eyes/
A tutorial video on YouTube: Master the Round TFT Display on ESP32 and GC9A01 driver with the TFT_eSPI library https://www.youtube.com/watch?v=pmCc7z_Mi8I
OP's results video: https://youtu.be/S-ktv1snsiQ
Uncanny eyes Halloween skull: https://www.instructables.com/Uncanny-Eyes-Halloween-Skull-Animatronic/
GitHub link for large eyes (used in the Halloween skull): https://github.com/dalori/ESP32-uncanny-eyes-halloween-skull
Large eyes tutorial on YouTube: https://youtu.be/G2RZFX-qwnI
>>35511 This is definitely the correct thread, Robophiliac. >pic Care to >tl;dr what we're looking at here a bit more? The one on the right certainly looks pretty suited as a static eye. Can it 'move'? What about the left one? TIA.
>>35511 >>35518 > what we're looking at here a bit more? Sorry, it's a size comparison with a semispherical doll eye; to show it's pretty much a drop-in replacement fit. If I wanted to go that route, they would fit nicely in the heads I'm getting to modify.
>>35519 Ah, got it thanks Robophiliac. Once you have an assembly together, would you mind posting clip(s) of these eyes 'in action' please? It might help all of us to understand your approach better. Cheers. :^)
>>35318 >>35338
>Most animatronic eyes use a central pivot point in the eyeballs, greatly reducing the available area for a camera.
They were designed as props, not robots. Only some mods to the InMoov design and a few others are intentionally "camera friendly", and only some also have eyelids. There is an InMoov mod for the EZ-Robot hardware I have, but that system uses a single camera, and the mod doesn't include eyelids.
Among the security cams the main concerns are size, range of focus, the ability to continuously view the signal live (preferably via wire) and the presence of microphones for possible use as "ears". Any suggestions appreciated.
>effective, highly capable (and 'sovlful') stereoscopic eye designs; including the accessory 'tissues' (lids, brows, &tc.)
As it so often seems, the biggest hurdle to finding something online is figuring out what that thing is called, so you know what to look for.
https://www.ebay.com/sch/i.html?_nkw=Mini-CCTV-Camera-Security-Micro-Audio-Wired
If you have other ideas for search terms, go ahead and add them. As you can see from the search results, many of the cameras on offer will easily fit in the area in front of the central pivot of many animatronic eye designs. Most also have RCA-type output connectors, but there are RCA-to-USB adapters available, so you could use them with OpenCV or any other system that takes a USB feed.
Now that there are cameras to choose from, beyond considerations of actual size, the presence of a microphone, power consumption and any other features the camera may have, one area hardly ever discussed is resolution vs. computing power. How high a resolution can your robot process? If you are using an SBC, will it be able to process the video signal(s) and perform other tasks at the same time? Can it walk and chew gum?
If you are using a tethered system sending video and control signals back and forth wirelessly to a more powerful computer, it may be necessary to use a low-resolution camera system (or more than one data channel) to avoid "buffering" of the data flow. We don't want the robot to walk into a wall that it saw, but didn't get the message to turn away from in time. Yes, we could install collision sensors and an automatic stop function, but then we could be getting "pauses" every time the buffering situation occurred, during various tasks. This would be very non-human-like, and annoying. So, the problem becomes: how much resolution do we want/need vs. computing power and its $ price?
One question immediately occurs: would it be possible to change the resolution on the fly by using a software "switch" to tell the processing computer to drop every other bit (pixel?), or to process only every 3rd or 4th bit? Or to go to black-and-white for most operations?
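On that last question: in software it could be as simple as a flag in the capture loop. A minimal sketch with OpenCV, where low_power_mode is a made-up switch your scheduler would flip:

import cv2

cap = cv2.VideoCapture(0)    # an RCA-to-USB adapter usually shows up as a normal capture device
low_power_mode = False       # the "software switch"; flip it from a load monitor or scheduler

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if low_power_mode:
        frame = cv2.resize(frame, None, fx=0.25, fy=0.25)   # keep roughly every 4th pixel
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # drop colour
    # ...hand `frame` to the vision pipeline here...
    cv2.imshow("eye", frame)
    if cv2.waitKey(1) & 0xFF == 27:    # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()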
Open file (57.12 KB 1083x584 termvision.jpeg)
>>35649 >One question immediately occurs; would it be possible to change the resolution on the fly by using a software "switch" to tell the processing computer to drop every other bit(pixel?), or to process only every 3rd or 4th bit? Or to go to black-and-white for most operations? Makes me think of Terminator Vision
Here's a good, albeit outdated tutorial on computer vision. https://www.societyofrobots.com/programming_computer_vision_tutorial.shtml
Just started going through this thread. Lots of options, and it depends on the scope of the project I guess. Sounds like you can do it the old way with things like depth sensors using ultrasonic or LiDAR, but then you have to program all the spatial reasoning yourself. Spatial reasoning models look like they are just taking off, though. For now, from what I've seen (like the link below), most are clipping frames, downsizing using ffmpeg, and then passing them to a vision model for image details. You could do that with a Qwen2-VL 2B and pass the result to a larger model, or fine-tune one, depending on scope again. But that doesn't give you spatial reasoning. https://www.youtube.com/watch?v=QHBr8hekCzg Hopefully over the next year open-weight models will be released, and at some point a full multi-modal model for text, audio and video reasoning will be within Nvidia Jetson range. Am I off here, or is that basically the current state?
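For the "clip frames, downsize, pass to a vision model" step, here's a rough sketch doing it with OpenCV instead of ffmpeg; the one-frame-per-second rate and 448px width are just assumptions that roughly match common VLM input sizes:

import cv2

cap = cv2.VideoCapture("clip.mp4")          # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
step = max(1, int(round(fps)))              # sample roughly 1 frame per second

i = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:
        h, w = frame.shape[:2]
        small = cv2.resize(frame, (448, max(1, int(448 * h / w))))   # downsize before the VLM
        cv2.imwrite(f"frame_{saved:04d}.jpg", small)                 # hand these to Qwen2-VL / LLaVA etc.
        saved += 1
    i += 1
cap.release()
print(f"saved {saved} frames")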
>>35649
Interesting ideas, Robophiliac. Thanks!
>variable-resolution encoding
I think some wizardry with ffmpeg or other codec systems might provide you with that 'on-the-fly' variability, Anon? Maybe have two SBCs dedicated to the vision tasks onboard? Good luck, Anon! :^)
>>35650
>pic
I lel'd a little. I've wondered at this oft-repeated trope over the years (this film was made in the '80s sometime, I think). Why would they think a robowaifu (or terminator, in this case) would want to see a text overlay on its visual field like it was playing some kind of vidya? :D
>>35649
>Mini-CCTV-Camera-Security-Micro-Audio-Wired
Thanks. Good find. In the past I also looked at such small cameras, but those were for model airplanes. I think they were analog, and it would've been a bit tricky to get the signal encoded into digital.
>would it be possible to change the resolution on the fly by using a software "switch" to tell the processing computer to drop every other bit (pixel?), or to process only every 3rd or 4th bit? Or to go to black-and-white for most operations?
This would be great. I had similar ideas, but rather for the computer next to the camera. Maybe some FPGA that can switch between different modes, idk? My vague idea was that the computer, or several small ones, would convert the picture very fast into various formats and cuts: at least several resolutions down to very low ones, maybe removing the color, also only the center or certain parts of the picture. Maybe there's also a technique to change the colors in a certain way, so that an object in a color you are looking for sticks out more. Focus, e.g. cutting out faces or objects, would require a fast adaptive system, but the other operations should be done by something very fast and energy efficient. Maybe an ASIC, I guess. Then the system downstream would not look at video data the whole time, but only analyze the lowest amount of data needed to figure out what's going on.
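A quick sketch of the "several variants of the same frame" idea on a CPU (an FPGA/ASIC would do the same transforms in hardware); the variant sizes here are arbitrary examples:

import cv2

frame = cv2.imread("frame.jpg")     # placeholder frame from the eye camera

variants = {"full": frame}
variants["half"] = cv2.pyrDown(frame)                                       # 1/2 resolution
variants["quarter"] = cv2.pyrDown(variants["half"])                         # 1/4 resolution
variants["grey"] = cv2.cvtColor(variants["quarter"], cv2.COLOR_BGR2GRAY)    # colour removed

h, w = frame.shape[:2]
variants["centre"] = frame[h // 4: 3 * h // 4, w // 4: 3 * w // 4]          # middle crop only

for name, img in variants.items():
    print(name, img.shape)          # downstream systems pick the smallest variant that suffices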
>>35660 My headcanon is that since the neural net CPUs of the Terminators were like human brains, it helped them in some way to visually see that information.
>>35651
Thanks, GreerTech!
>>35652
>Am I off here, or is that the current state basically?
I think it's a good idea to experiment with a current NVIDIA Jetson board if you can do so, Anon. As to the camera, I'd say just pick the smolest one that gets the job done & is compatible with your processing board. This is an area that is under heavy R&D, so I wouldn't worry too much about waiting until "just the perfect choice" comes out. Good luck, Barf.
>>35667
>pic
Cute. :^)
>>35680
Hehe, makes sense.
Haven't gone through all the threads yet, but here's a good repo of code and prints. It shows the frame processing for a visual LLM at about a frame per second, and shows it doing reasoning a bit. https://www.youtube.com/watch?v=0O8RHxpkcGc/&t=14m09s https://openroboticplatform.com/library https://github.com/NikodemBartnik/Machine-Learning-Robot I only have an ESP8266 and my main PC to start, but if I ever get that far, I might get a Jetson.
