/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.


Robot Vision General Robowaifu Technician 09/11/2019 (Wed) 01:13:09 No.97
Cameras, Lenses, Actuators, Control Systems

Unless you want to deck out your waifubot in dark glasses and a white cane, learning about vision systems is a good idea. Please post resources here.



Edited last time by Chobitsu on 09/11/2019 (Wed) 01:14:45.
>>10217 Paper is titled, "3D imaging from multipath temporal echoes" for anyone interested. PDF available: https://arxiv.org/abs/2011.09284
>>10630 Briefly skimming the paper, they seem to use a 3D camera to collect training data (like a Kinect, but more expensive), whilst the actual sensor for the acoustic system consists of a stock-standard Logitech PC speaker, and a microphone. Echoes are recorded, then passed into a network which reconstructs a depth map. All-in-all, it's a super simple system to replicate, and one that could easily be made for almost no cost, and very compact. Creating training data might be a pain, but given that I already have a Kinect, I'd be happy to volunteer to rig up a Kinect+Pi+mic+speaker and just walk around my local area collecting huge amounts of data. Let me know if anyone's interested. I'll probably do it myself anyway, but if there are any specific areas you'd like data for, write it down and I'll see what I can do.
>>10630 >>10631 Wow, great work Anon. Much appreciated. >Let me know if anyone's interested. I'm certainly interested to know about your results, though I don't have any personal requests ATM. >Kinect Hmm. Hasn't that product been discontinued now? I wonder what good alternatives exist for us today if any? >PDF available Always a good idea to archive a copy here on the board for access in case anything is ever pulled in the future. >
>>10632 I suggested the Kinect because it's the most common consumer level depth camera, and I already have one. Any 3D camera can be used, as long as you can get a depth map image from it. Regarding alternatives, they use an Intel Realsense D435, but that's about $300 new. Even at that price, it's still considered one of the cheaper depth cameras. Note that the 3D camera is only needed for training data collection. Once you've got your data, the system only requires a directional speaker (i.e. a standard speaker), and a microphone. In theory, you could probably use the voice speaker + ears of the robowaifu, but you'd want to tweak your training data if you did.
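For anyone wondering why a speaker and a mic can give depth at all: the paper feeds the raw multipath echoes into a network, but the physics of a single echo is just time-of-flight. Here is a minimal sketch of that simpler case only, not the paper's method. The function names, the toy chirp and the 343 m/s speed of sound (dry air at roughly 20 °C) are my own assumptions.

```python
# Sketch only: single-echo time-of-flight, NOT the multipath method from the paper.
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C


def find_echo_delay(emitted, recorded, sample_rate_hz):
    """Return the lag (in seconds) at which `recorded` best matches `emitted`."""
    best_lag, best_score = 0, float("-inf")
    max_lag = len(recorded) - len(emitted)
    for lag in range(max_lag + 1):
        # Plain cross-correlation of the emitted pulse with the recording at this lag.
        score = sum(e * recorded[lag + i] for i, e in enumerate(emitted))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate_hz


def echo_distance_m(delay_s, speed_m_s=SPEED_OF_SOUND_M_S):
    """Sound travels to the surface and back, so halve the round trip."""
    return speed_m_s * delay_s / 2.0


if __name__ == "__main__":
    sample_rate = 48_000
    chirp = [1.0, -1.0, 1.0, 1.0, -1.0]
    # Hypothetical recording: silence, then a weak copy of the chirp 480 samples in.
    recording = [0.0] * 480 + [0.3 * s for s in chirp] + [0.0] * 100
    delay = find_echo_delay(chirp, recording, sample_rate)
    print(f"delay = {delay * 1000:.1f} ms, distance = {echo_distance_m(delay):.2f} m")
```

A real room returns many overlapping echoes, which is exactly why the paper trains a network on depth-camera ground truth instead of picking one peak.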
>>10633 There are some other options: JeVois: https://youtu.be/MFGpN_Vp7mg eYs3D: https://youtu.be/NXWHYH0v638 e-con: https://youtu.be/vzXzz7VmWzo SceneScan from Nerian: https://youtu.be/mJ5UlXNguvg SP1 from Nerian: https://youtu.be/vVVjFqUkG4E Cadence: https://youtu.be/wPxi4ZYSJC0 Stereolabs ZED: https://youtu.be/7_8XLI99dno This here is software, trying to do depth perception on any camera: KudanSLAM: https://youtu.be/Pgami8jglmE
>>10646 Thanks for the work Anon.
>NIKKOR Lens Simulator https://imaging.nikon.com/lineup/lens/simulator/ An interesting utility the fine Anons on /p/ mentioned.
>related crosslink (>>10645, ...)
M-LSD: Towards Light-weight and Real-time Line Segment Detection: https://github.com/navervision/mlsd You Only Look at One Sequence https://github.com/hustvl/YOLOS >TL;DR: We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark. >Directly inherited from ViT (DeiT), YOLOS is not designed to be yet another high-performance object detector, but to unveil the versatility and transferability of Transformer from image recognition to object detection. CLIP (Contrastive Language-Image Pre-Training) https://github.com/openai/CLIP > CLIP is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and >3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision. Dynamic Vision Transformer (DVT) >We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically. https://github.com/blackfeather-wang/Dynamic-Vision-Transformer
Open file (3.04 MB 444x250 very_nice_anon.gif)
>>10759 > Towards Light-weight and Real-time Line Segment Detection < Light-weight and Real-time Well I certainly find that combination of terms very encouraging Anon. Simply b/c that's the only combination that's going to be successful for devising reasonably inexpensive, mobile, autonomous gynoid companion robots. AKA robowaifus. It's gratifying seeing a small cadre of researchers seeming to be breaking off from the standard-issue Globohomo toadies, and tackling the real-world concerns of the billions of regular people, regarding AI advancement at the personal level. >tl;dr We'll never achieve an AI/Robowaifu Renaissance if every one of us has to be beholden to the Globohomo cloud, hat in hand, begging "Please sir, may I have some more?"
>>10761 test
Found this reference image that is used by 'Ocularists' who paint glass eyes/ocular prostheses. Might help some anon who is trying to paint their robowaifu a pair of pretty peepers.

A note about eye lighting/refractive highlights: Despite what anime has taught us, I think it may be best not to paint on any lighting effects/refractive highlights (at least on non-cartoon irises). The highlights are best formed by a coating of clear epoxy resin. I say this because such refractive highlights move throughout the lens of the eye and across the surface of the sclera. So if they are painted on, they will be static...and this just looks wrong. Especially if the eye is also coated in clear resin. Because then you have the natural light reflected off the resin along with the static painted highlights. I think this is why Will Cogley leaves a concave depression inside his eye mold so that clear epoxy resin can set inside and form a lens shape, which reflects and refracts the light just like a real eye.

Some measurements (from the Wikipedia article on the human eyeball, and a scientific paper on the human iris):
- Human eyeball average height = 23.7 mm (0.93 in)
- Human eyeball average width = 24.2 mm (0.95 in). The eyeball is slightly wider than it is tall, which is why I didn't just give one diameter, but I know for the sake of simplicity that a 25.4 mm (1 inch) diameter eyeball looks correct inside a life-sized human head.
- Human eyeball average anterior to posterior diameter = 24 mm (0.94 in). This may be less important for a robowaifu since often they only use the front half of the eyeball, with the back/interior being fastened to a servohorn or pushrod in some way.
- Human eyeball average volume = 6 cubic centimeters (0.37 cu in)
- Iris size range = 10.2 to 13.0 mm in diameter, with an average size of 12 mm in diameter and a circumference of 37 mm. (From this paper: https://www.di.ubi.pt/~lfbaa/pubs/iscit2013.pdf 'Iris Surface Deformation and Normalization' by Somying Thainimit, Luis A. Alexandre and Vasco M.N. de Almeida.)
- Fully dilated pupil = 4 to 8 mm in diameter
- Fully constricted pupil = 2 to 4 mm in diameter (larger doe-eyed pupils are usually better since small ones can make the eyes look angry/scary/psycho - unless you are going for that, of course LOL).

Also, please note that if you are mixing and pouring clear epoxy resin into those half-sphere (cabochon) silicone molds, remember to:
A.) Wear an apron or some cheap clothes since that stuff sticks to everything worse than shit to a blanket.
B.) Make sure that your mixing cups and stirrer are both disposable and as clean as possible (wear latex or nitrile gloves to avoid getting the molds greasy and wash everything with warm water, soap and even an isopropyl alcohol rinse if you have that).
C.) Pour/mix the resin and hardener SLOWLY. There is plenty of time. Even with a thin layer of epoxy you've got about half an hour before it really starts to set, and I know from experience that 1 inch lenses made with the stuff take over 24 hours to set through completely. If you rush and try to beat epoxy resin like eggs in an omelette then you will just end up with thousands of tiny bubbles and a pair of cloudy looking lenses.

(Also, don't do what I did and poke the back of your resin lens at about 20 hours in just to see if it has set yet. Because you end up with a dented, ruined lens. Best to wait for about 30 hours before popping them out just to be safe.)
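If you go with the 1 inch eyeball, the iris should probably scale with it. A quick sketch of that arithmetic, using the averages quoted above; the linear scaling rule and the function names are my own assumption, not something from the paper.

```python
import math

# Average human measurements quoted in the post above (millimetres).
EYEBALL_WIDTH_MM = 24.2
IRIS_DIAMETER_MM = 12.0


def scaled_iris_diameter(eyeball_diameter_mm):
    """Scale the average iris to a different eyeball size, keeping proportions.

    Assumption: simple linear scaling against the average eyeball width.
    """
    return IRIS_DIAMETER_MM * eyeball_diameter_mm / EYEBALL_WIDTH_MM


def circumference(diameter_mm):
    """Circumference of a circle of the given diameter (pi times d)."""
    return math.pi * diameter_mm


if __name__ == "__main__":
    iris = scaled_iris_diameter(25.4)  # the 1 inch eyeball mentioned above
    print(f"iris diameter: {iris:.1f} mm, circumference: {circumference(iris):.1f} mm")
```

For the unscaled 12 mm iris, pi times d comes out at about 37.7 mm, which lines up with the roughly 37 mm circumference quoted from the paper.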
>>10991 I should also mention that even if you get the best epoxy resin on the market, it will eventually yellow with time. UV light excites the chemical bonds and turns the resin yellow from the outside, in. If your robowaifu has eyelids that move, this is a good reason to close her eyes while she is inactive. It also means that the eyeballs themselves should be made with a degree of replaceability, since you'll want to swap them out after a couple of years when they start to yellow noticeably. So maybe spend an hour or two painting in the irises but it's probably best not to try crafting a masterwork prosthesis ;D
>>10991 >>10992 Excellent information on this topic. The effort here is much appreciated Anon.
Open file (139.44 KB 1125x1219 IMG_20210409_013031.jpg)
>>10995 Thank you kindly Anon.
Open file (1.05 MB 1758x504 modulo_teaser.png)
I already mentioned there's open source software that can see heartbeats and micro facial movements called Eulerian Video Magnification in another thread, but I figure I might as well mention it here too: https://www.youtube.com/watch?v=ONZcjs1Pjmk As for hardware, I was thinking simple solid black camera eyes if I couldn't get 3-axis movement (vertical, horizontal, convergence) working with cameras without taking up too much space. I haven't thought too much about eyes other than that, except that I was thinking she should see in IR, to help in poor lighting without blinding me with LEDs on her face. I remember reading about something called a Modulo Camera that supposedly never over-exposes or something, so maybe it could just use a bigger camera for better night vision? There's also something called a "Light field camera" that keeps everything in focus, but I'm not sure how useful that is for robot vision, I just think it's neat.
>>13163 That's an interesting concept Anon, thanks. Yes, I think cameras and image analysis have very long legs yet, and we still have several orders of magnitude improvements yet to come in the future. It would be nice if our robowaifus (and not just our enemies) can take advantage of this for us. We need to really be thinking ahead in this area tbh.
It seems like CMOS is the default sensor for most CV applications due to cost. But seeing all these beautiful eye designs makes me consider carefully how those photons get processed into signal for the robowaifus. Cost aside, CCD as a technology seems better because the entire image is processed monolithically, as one crisp frame, instead of a huge array of individual pixel sensors, which I think causes noise which has to be dealt with in post image processing. CCD looks like it's still the go-to for scientific instruments today. In astrophotography everyone drools over cameras with CCD; CMOS is -ok- and fits most amateur needs, but the pros use CCD. Astrophotography / scientific: www.atik-cameras(dot)com/news/difference-between-ccd-cmos-sensors/ This article breaks it down pretty well from a strictly CV standpoint: www.adimec(dot)com/ccd-vs-cmos-image-sensors-in-machine-vision-cameras/
>>14751 That looks very cool Anon. I think you're right about CCDs being very good sensor tech. Certainly I think that if we can find ones that suit our specific mobile robowaifu design needs, then that would certainly be a great choice. Thanks for the post!
iLab Neuromorphic Vision C++ Toolkit The USC iLab is headed up by the PhD behind the Jevois cameras and systems. http://ilab.usc.edu/toolkit/
>(>>15997, ... loosely related)
>"Follow Me" eyes (crosslink): >>19037 - I somehow forgot that we had a dedicated thread for eyes.
> conversation-related (>>23398, ...)
Related: >>23405 >One thing I would like to do with a board that allows for more than one camera, would be to have a way to use this for creating a somewhat 3D model of the world. Especially be able to know the distance of an object it recognizes. This will be absolutely crucial to understand the world. >>23410 >Stereo Depth Cameras ... using triangulation >>23431 > auto-mesh generation Is this about generating meshes from 2D pictures? I just wrote somewhere that I wonder how video to 3D model would work. It's possible to use AI generated videos to feed a game engine and render an even better video. I guess the background is done using this "auto-mesh generation" then (pose estimation to bone model for characters).
>>23431 >Motion isn't req'd. when you already know the dimensions of the object, you can only use a single image when you already know the actual height then its just a matter of measuring the difference between the real height vs the image height, its how snipers have to figure out distances in their scope when they dont have a rangefinder, it would need to keep a database of dimensions for known objects otherwise it has to go into pajeetmode to emulate stereoscopic vision, its doable but it seems like too much hassle when you can just use two cameras
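The known-height trick above is the pinhole camera relation: distance = focal length (in pixels) x real height / height in the image (in pixels). A minimal sketch follows; the function name and the little lookup table of object heights are hypothetical, and it assumes a calibrated camera and an object standing upright and facing the camera.

```python
# Hypothetical database of real-world heights in metres; values are illustrative only.
KNOWN_HEIGHTS_M = {"door": 2.0, "coffee_mug": 0.10}


def distance_from_known_height(focal_length_px, real_height_m, image_height_px):
    """Pinhole model: an object of known height looks smaller the further away it is."""
    if image_height_px <= 0:
        raise ValueError("object must span at least one pixel")
    return focal_length_px * real_height_m / image_height_px


if __name__ == "__main__":
    # Assumed camera: 800 px focal length. The door spans 400 px in the frame.
    d = distance_from_known_height(800, KNOWN_HEIGHTS_M["door"], 400)
    print(f"door is about {d:.1f} m away")
```

If the object is tilted or partly hidden, the measured pixel height no longer matches the stored height and the estimate drifts, which is the hassle the post is getting at.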
>>23436 >it would need to keep a database of dimensions for known objects otherwise it has to go into pajeetmode to emulate stereoscopic vision Agreed, and that's an aspect of the 'well-calibrated' camera(s) part of the equation. For instance, when a robowaifu can remain in the relative safety of her master's home, then she can have the luxury of perfectly pre-learning basically everything in his space. This is a big win for all of her on-the-fly, object recognition/distance/volume/kinematic/mass/force/pose -estimate calculations. Including him, of course. :^) >"Master!? Have you been putting on weight again?" >=== -prose edit, fmt -add funpost spoilers
Edited last time by Chobitsu on 06/24/2023 (Sat) 23:34:03.
>>23435 >Is this about generating meshes from 2D pictures. Yes. It works far better using a combination of stereo depth cameras, and the ability to proactively transform the camera(s) around the object(s) in question. Much like a robowaifu (or a human photographer) would be able to do. The primary point being to highly-accurately model the world around her, including her own master and other humans. (For example: their own children romping about. :^)
>>23437 lol, i forgot it would already need a database anyway for those things, still calculating based on parallax is way simpler than comparing the image to known dimensions, especially if the object is rotated then you need to know the angle to get a real height to compare to
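For comparison, the parallax version with two cameras needs no object database at all: depth = focal length (px) x baseline (m) / disparity (px). A minimal sketch, assuming two parallel, rectified cameras; the names and numbers are illustrative.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Stereo triangulation for two parallel, rectified cameras.

    disparity_px is how far the same point shifts horizontally between
    the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Assumed rig: 700 px focal length, cameras 6 cm apart, point shifts 21 px.
    z = depth_from_disparity(700, 0.06, 21)
    print(f"depth is about {z:.2f} m")
```

The hard part in practice is finding the matching point in both images, but the object's rotation doesn't enter the formula, which is why this is the simpler route.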
>>23440 >parallax is way simpler This comment touches on a technical aspect of computation, and the up-front costs involved with setup. But fair enough Anon. I'm sure it's more reliable, in general, than simplistic dimension analysis, particularly in tricky lighting conditions.
>>23436 >need to keep a database of dimensions for known objects Thanks for pointing this out, but that's something I want to do anyways. Robowaifus should have a rough estimate on the traits of identifiable objects, e.g. weight and size. This can be taken from some public databases or LLMs. Then on top of that, if they see something close to an unknown object, which they can identify, they should be able to draw a conclusion about the size of the unknown one based on that.
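The "conclude the size of the unknown object from a known one next to it" idea can be sketched as a simple ratio. This only holds if both objects are at roughly the same distance from the camera, which is an assumption you'd have to check (e.g. they sit on the same shelf). The names are made up for illustration.

```python
def estimate_unknown_height(known_height_m, known_height_px, unknown_height_px):
    """Objects at the same depth share one metres-per-pixel scale."""
    if known_height_px <= 0:
        raise ValueError("reference object must span at least one pixel")
    metres_per_pixel = known_height_m / known_height_px
    return unknown_height_px * metres_per_pixel


if __name__ == "__main__":
    # A 10 cm mug spans 50 px; the unknown thing beside it spans 150 px.
    h = estimate_unknown_height(0.10, 50, 150)
    print(f"unknown object is roughly {h:.2f} m tall")
```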
Open file (97.63 KB 1202x743 yolo_nas_frontier.png)
>Deci is thrilled to announce the release of a new object detection model, YOLO-NAS - a game-changer in the world of object detection, providing superior real-time object detection capabilities and production-ready performance. Deci's mission is to provide AI teams with tools to remove development barriers and attain efficient inference performance more quickly. https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md
>>23776 what is mAP?
>>23777 The diagram indicates it's a form of accuracy to compare such models. >Mean Average Precision (mAP) https://blog.paperspace.com/mean-average-precision/ Found via: https://duckduckgo.com/?q=map+machine+learning+accuracy
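To make mAP a bit less abstract: for each class you rank the detections by confidence, trace precision against recall, and take the area under that curve (average precision); mAP is the mean over classes. Below is a minimal sketch of one common variant (all-point interpolation). Whether a detection counts as a true positive is normally decided by an IoU threshold against the ground-truth boxes; here that decision is assumed to be already made, and the function names are mine.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) for ONE class."""
    if num_ground_truth <= 0:
        raise ValueError("need at least one ground-truth object")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: {class_name: (detections, num_ground_truth)}."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    cats = ([(0.9, True), (0.8, False), (0.7, True)], 2)
    dogs = ([(0.95, True), (0.6, True)], 2)
    print(f"mAP = {mean_average_precision({'cat': cats, 'dog': dogs}):.3f}")
```

Benchmarks such as COCO additionally average over several IoU thresholds, so numbers from different charts are only comparable if they use the same protocol.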
>>23776 Very low-latency in detection is vital, insofar as her autonomous safety is concerned. The ideal is human-level speed at object recognition (or even faster). We're probably getting pretty close on smol devices already, so I predict we'll reach this goal generally by the time the first real-world robowaifus begin rolling out. Thanks Anon.
>>24909
- The computers connected to the eyes (cameras) should have different ways of sharing data with other computers, e.g. just sharing body movement analysis and recognition info as a text stream, same for the person being detected, or some emotional indicators. Sending photos and videos should be very limited, only sending encrypted files; also the system should mostly not store this data. Some home server might store and process some data for fine tuning, but needs to receive this data encrypted. The decision what to share should be made based on overall context coming from the general cognitive architecture >>24783
- Fast and efficient segmentation of images (FPGAs?)
- Different variants of the same image, created very fast, maybe using an FPGA, for further processing, e.g. only processing a low res partial image of an object to keep track of. The creation of that low res partial image should be done by a specialized system close to the cameras.
- Using object detection models based on context informed by the general cognitive architecture >>24783, or just based on awareness of what room she's in and maybe even what she's looking at. So they can be smaller, faster and more specialized, including some models which are trained on the specific training data related to the household (photos and videos of the home environment).
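A rough sketch of three of these ideas together: picking a smaller model from the room context, cutting out a low-res partial image near the camera, and sending only a text description onward. Everything here is hypothetical (model names, message fields, the plain list-of-rows image), and encryption is left out entirely; it only illustrates the data flow.

```python
import json

# Hypothetical mapping from room context to a small specialised detector.
MODELS_BY_ROOM = {"kitchen": "kitchen_objects_small", "office": "desk_objects_small"}
DEFAULT_MODEL = "general_objects"


def select_model(room):
    """Pick a specialised model if the room is known, else a general one."""
    return MODELS_BY_ROOM.get(room, DEFAULT_MODEL)


def crop_and_downsample(image, box, step):
    """image: list of pixel rows; box: (x0, y0, x1, y1); keep every `step`-th pixel."""
    x0, y0, x1, y1 = box
    return [row[x0:x1:step] for row in image[y0:y1:step]]


def detection_message(room, detections):
    """Serialise labels and boxes only; no pixel data leaves the vision computer."""
    payload = {
        "room": room,
        "model": select_model(room),
        "detections": [
            {"label": label, "confidence": round(conf, 2), "box": list(box)}
            for label, conf, box in detections
        ],
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    frame = [[r * 10 + c for c in range(8)] for r in range(8)]  # fake 8x8 image
    print(crop_and_downsample(frame, (2, 2, 6, 6), 2))
    print(detection_message("kitchen", [("cup", 0.912, (2, 2, 6, 6))]))
```

On real hardware the crop/downsample step is the part you'd push onto an FPGA or the camera board, and the JSON line is what the rest of the system sees.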
Open file (215.89 KB 869x350 Screenshot_114.png)
Open file (326.25 KB 879x492 Screenshot_113.png)
Open file (162.74 KB 878x396 Screenshot_112.png)
>LERF optimizes a dense, multi-scale language 3D field by volume rendering CLIP embeddings along training rays, supervising these embeddings with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively in real-time. LERF enables pixel-aligned queries of the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, supporting long-tail open-vocabulary queries hierarchically across the volume. >With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance. https://www.lerf.io https://github.com/kerrj/lerf https://drive.google.com/drive/folders/1vh0mSl7v29yaGsxleadcj-LCZOE_WEWB?usp=sharing https://arxiv.org/abs/2303.09553
> Face recognition
Not tested, just looking what's available: https://github.com/cmusatyalab/openface
Following quotes are from Reddit, not from me...
https://github.com/ageitgey/face_recognition
> I have tried this out. It's easy to code and accurately recognizes faces. The problem is it can't even detect faces 1 feet away from the camera.
https://github.com/timesler/facenet-pytorch (FaceNet & MTCNN)
> This can detect and recognize faces at a distance, but the problem is it can't recognize unknown faces correctly. I mean for unknown faces it always tries to label it as one of the faces from the model/ database encodings.
https://github.com/serengil/deepface
> I have tried VGG, ArcFace, Facenet512. The latter two gave me good results. But, the problem is I couldn't figure out how to change the detection from every 5 seconds to real-time. Also, I couldn't change the camera source. (If anyone can help me with these please do). Also, it had fps drops frequently.
https://github.com/deepinsight/insightface
> Couldn't test this yet. But in the demo YT video it shows the model incorrectly detecting a random object as a face. If someone knows how well this performs please let me know.
https://www.reddit.com/r/computervision/comments/15ycwom/face_recognition_whats_the_state_of_the_art/
This here seems to be the best: https://github.com/ZoneMinder/zoneminder
The Reddit link above has some thread and patch for detecting faces at a distance, I think.
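The "always labels unknown faces as someone from the database" complaint is usually what happens when the nearest match is accepted no matter how far away it is. A common workaround is a distance cutoff on the face embeddings. Minimal sketch below; the 3-number embeddings are toy values, and the 0.6 cutoff is a placeholder that would have to be tuned for whichever embedding model is actually used.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(embedding, known_faces, max_distance=0.6):
    """Return the closest known name, or 'unknown' if nothing is close enough.

    known_faces: {name: reference_embedding}. max_distance is an assumed
    placeholder, not a value taken from any of the libraries listed above.
    """
    best_name, best_dist = "unknown", float("inf")
    for name, reference in known_faces.items():
        dist = euclidean(embedding, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else "unknown"


if __name__ == "__main__":
    known = {"master": [0.0, 0.0, 0.0], "guest": [1.0, 1.0, 1.0]}
    print(identify([0.1, 0.0, 0.0], known))  # close to a stored face
    print(identify([5.0, 5.0, 5.0], known))  # far from every stored face
```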
Open file (537.86 KB 877x878 LLaVA.png)
LLaVA: Large Language and Vision Assistant (https://llava-vl.github.io/) A project to integrate vision into large language models. Though very new and young as a concept, adding visual context to language models has tremendous potential. Notably, a waifu which can understand correlations between what she perceives in her environment with what she is told can lead to much more naturally feeling interactions. Fingers crossed for a fork that implements YOLO (https://pjreddie.com/darknet/yolo/) rather than CLIP (https://openai.com/research/clip) for better compute and memory efficiency. Getting this to run at sub 10 watts should be a goal.
Edited last time by Kiwi_ on 10/11/2023 (Wed) 18:33:59.
Open file (82.60 KB 386x290 Screenshot_158.png)
I was working on this here >>26112 using OpenCL to make video processing faster. So I got this here recommended by YouTube: https://www.youtu.be/0Kgm_aLunAo Github: https://github.com/jjmlovesgit/pipcounter This is using OpenCV to count pips on dominoes, and does it much faster and better than GPT4-Vision. I wonder if it would be possible to have an LLM adjust the code depending on the use case, and maybe having a library of common patterns to look out for. Ideally one would show it something new, it would detect the outer border like the stones here and then adjust till it can catch the details on all of these objects which are of interest. It could look out for patterns depending on some context, like e.g. a desk.
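I haven't read how the linked repo counts the pips, so this is not its method, but the basic idea of counting separate blobs in a thresholded image can be sketched without any libraries as connected-component counting. The grid format (list of rows of 0/1) and the min_pixels noise filter are my own assumptions; that filter is the kind of knob an LLM or a tuning loop could adjust per use case.

```python
from collections import deque


def count_blobs(grid, min_pixels=1):
    """Count 4-connected groups of truthy cells in a binary image (list of rows)."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over this blob, counting its pixels.
            size = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size >= min_pixels:  # ignore specks smaller than the cutoff
                blobs += 1
    return blobs


if __name__ == "__main__":
    image = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1],
    ]
    print(count_blobs(image), count_blobs(image, min_pixels=2))
```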
>>26132 >and does it much faster and better than GPT4-Vision. Doesn't really surprise me. OpenCV is roughly the SoA in hand-written C++ code for computer vision. You have some great posts ITT Anon thanks... keep up the good work! :^)
There are several libraries and approaches that attempt generalized object detection within a context, although a completely automatic, context-based object detection system without predefined objects is a complex task due to the variability of real-world scenarios. Libraries and methodologies that have been used for more general object detection include:
1. YOLO (You Only Look Once): YOLO is a popular object detection system that uses a single neural network to identify multiple objects within an image in real time. It can, however, only detect the object categories it was trained on.
2. OpenCV with Haar Cascades and HOG (Histogram of Oriented Gradients): OpenCV provides Haar cascades and HOG-based object detection methods. While not entirely context-based, they allow for object detection using predefined patterns and features. These methods can be more general but might not adapt well to various contexts without specific training or feature engineering.
3. TensorFlow Object Detection API: TensorFlow offers an object detection API that provides pre-trained models for various objects. While not entirely context-based, these models are designed to detect general objects and can be customized or fine-tuned for specific contexts.
4. Custom Object Detection Models with Transfer Learning: You could create a custom object detection model using transfer learning from a pre-trained model like Faster R-CNN, SSD, or Mask R-CNN. By fine-tuning on your own dataset, the model could adapt to specific contexts.
5. Generalized Shape Detection Algorithms: Libraries like scikit-image (imported as skimage) in Python provide various tools for general image processing and shape analysis, including contour detection, edge detection, and morphological operations. While not object-specific, they offer tools for identifying shapes within images.
Each of these methods has its advantages and limitations when it comes to general object detection. If you're looking for a more context-aware system that learns and adapts to various contexts, combining traditional computer vision methods with machine learning models trained on diverse images may be a step towards a more generalized object detection system. However, a fully context-aware, automatic object detection system that adapts to any arbitrary context without any predefined objects is still a challenging area of research.
-----------------
In terms of computational requirements, here's a general ranking of the mentioned object detection methods based on the computational power and RAM they might typically require:
1. OpenCV with Haar Cascades and HOG:
- Computational Power Needed: Low to Moderate
- RAM Requirements: Low
- These methods are computationally less intensive compared to deep learning-based models. They can run on systems with lower computational power and memory.
2. Generalized Shape Detection Algorithms (scikit-image, imported as skimage):
- Computational Power Needed: Low to Moderate
- RAM Requirements: Low to Moderate
- While these libraries might need slightly more computational power and RAM than Haar Cascades and HOG, they are still less demanding compared to deep learning-based models.
3. TensorFlow Object Detection API:
- Computational Power Needed: Moderate to High
- RAM Requirements: Moderate to High
- Running pre-trained models from the TensorFlow Object Detection API might require more computational power and memory compared to traditional computer vision methods due to the complexity of the deep learning models.
4. Custom Object Detection Models with Transfer Learning:
- Computational Power Needed: Moderate to High
- RAM Requirements: Moderate to High
- Training custom object detection models with transfer learning typically requires moderate to high computational power and memory, especially during the training phase.
5. YOLO (You Only Look Once):
- Computational Power Needed: High
- RAM Requirements: High
- Full-size YOLO models are relatively demanding in terms of computational power and memory, due to their deep neural network architecture and real-time processing. Smaller "tiny" variants trade accuracy for lower requirements.
The exact computational requirements and memory usage can vary based on the specific hardware, image sizes, complexity of the models, and the scale of the operations being performed. Deep learning models, in general, tend to demand more computational resources compared to traditional computer vision methods. If you're working with large datasets or real-time processing, more powerful hardware configurations would likely be necessary to achieve optimal performance.
--------
https://github.com/opencv/opencv/tree/master/data/haarcascades
--------
If your goal is to detect shapes without knowing the specific objects at first, OpenCV's contour detection methods combined with image processing techniques could be more appropriate than scikit-learn. Once shapes are identified, further analysis or categorization can be performed using traditional machine learning algorithms from scikit-learn or other methods.
>>26146 Understood. If the goal is to identify various objects within a specific context (like a desk) without predefining the objects, and the lighting conditions might vary, a more flexible approach using general computer vision techniques can be applied. This could involve methods such as contour detection, edge detection, and basic image processing techniques to identify objects within the context of a desk. You might use a more generalized version of object detection that isn't specific to particular objects but rather identifies any distinguishable shape within the context. Here's an example:

```python
import cv2

# Read the image
image = cv2.imread('path_to_your_image.jpg')
if image is None:
    raise FileNotFoundError('Could not read path_to_your_image.jpg')

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Preprocess to enhance object edges: blur, then Otsu thresholding.
# (One possible choice; it treats bright regions as objects. Use
# cv2.THRESH_BINARY_INV for dark objects on a light desk.)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, processed_image = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Find contours (two return values in OpenCV 4.x)
contours, _ = cv2.findContours(processed_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Minimum contour area in pixels; an example value to tune per camera and scene
min_area = 500

detected_objects = []
for contour in contours:
    # Apply some conditions to filter objects based on size, shape, etc.
    # For instance, you might filter by area or aspect ratio
    area = cv2.contourArea(contour)
    if area > min_area:
        detected_objects.append(contour)

# Count and display the number of detected objects
print(f"Number of objects detected: {len(detected_objects)}")
```

This code applies general techniques such as contour detection to identify distinguishable shapes within the context of the desk. The process of identifying objects relies on the uniqueness of their shapes and their contrast against the background. The challenge in this approach lies in how the algorithm distinguishes objects based on their shapes and sizes. It might not identify specific objects but rather any shape that meets certain criteria (like area, aspect ratio, etc.) within the provided context (in this case, the desk). This method might detect a variety of objects but could also identify false positives or miss some objects. Fine-tuning the conditions for object identification (like area thresholds or other characteristics) can improve the accuracy of detection within the context of the desk, considering the variability in lighting and object characteristics.
Open file (346.77 KB 696x783 1698709850406174.png)
Open file (199.93 KB 767x728 1698710469395618.png)
I suppose this is a good thread to use for discussing this concept: a swarm of small drones available for a robowaifu's use for enhanced perimeter/area surveillance, etc.
>1.6B parameter model built using SigLIP, Phi-1.5 and the LLaVA training dataset. Weights are licensed under CC-BY-SA due to using the LLaVA dataset. Try it out on Hugging Face Spaces! https://github.com/vikhyat/moondream https://huggingface.co/spaces/vikhyatk/moondream1 https://youtu.be/oDGQrOlmC1s >The model is release for research purposes only, commercial use is not allowed. >circa 6GB or 4GB quantized
>>29286 Thanks. Do you have any views on its usefulness r/n, Anon?
Open file (84.01 KB 960x720 yuina.png)
For people looking for a Kinect, I've had success finding them at electronics recycle centers. RE:PC in Seattle had a big bin. Also, I just checked and they're going for under ten dollars on eBay lol. I had also heard that the Kinect's depth camera isn't all too necessary at this point due to how good neural networks have gotten recently. Is there any merit to that?
>>29911 Unless you're using the kinect to do some sort of 3d mapping you can get stuff like pose landmark detection using AI stuff and a standard webcam, like Gulag's open-source library Mediapipe. https://mediapipe-studio.webapps.google.com/home I use some of their models for object recognition :D
>>29911 >>29915 Thanks for both the great tips, Anons! Cheers. :^)
>>29367 I think we would need workarounds if such models are not fast enough, but wow it needs less than a second to identify common objects in a photo of a room from a home. I guess on a smaller computer it would be slower, but still. This is good enough for now, and it's just a stepping stone. Keep in mind, we don't need it as fast and general as AI in cars. The waifus will mostly look at the same home with the same objects all the time. >>29911 My issue is rather that I don't want to use a device which I can only get from recycling centers. Also, I want two cams which can move on their own, and I decide how far apart they are. I guess something like Kudan will be the way to go: https://www.youtube.com/@KudanLimited
