{"id":997138,"date":"2023-12-29T20:28:00","date_gmt":"2023-12-29T12:28:00","guid":{"rendered":"https:\/\/geetests.com\/article\/captcha-creation-meets-stable-diffusion"},"modified":"2025-09-15T15:42:18","modified_gmt":"2025-09-15T07:42:18","slug":"captcha-creation-meets-stable-diffusion","status":"publish","type":"post","link":"\/en\/article\/captcha-creation-meets-stable-diffusion","title":{"rendered":"Stable Diffusion Refines CAPTCHA, Merging Art with Security"},"content":{"rendered":"<div class=\"vgblk-rw-wrapper limit-wrapper\"><p><span class=\"ql-size-16px\">Following our initial discussion on\u00a0<\/span><a class=\"ql-size-16px\" href=\"https:\/\/blog.geetest.com\/en\/article\/aigc-bot-mitigation-exploration\" target=\"_blank\" rel=\"noopener noreferrer\">GeeTest&#8217;s foray into AI-generated content (AIGC)<\/a><span class=\"ql-size-16px\">, this article delves deeper into the technological strides and real-world applications we&#8217;ve achieved. We&#8217;re witnessing a transformative phase in text-to-image technology, where transforming text into vivid images isn&#8217;t just a simple conversion; it&#8217;s an intricate blend of text and visuals, pushing digital imagery&#8217;s boundaries. Our focus today is on the Stable Diffusion model and its pivotal role in evolving image-based CAPTCHA systems.<\/span><\/p>\n<h2><strong class=\"ql-size-28px\">What is Stable Diffusion?<\/strong><\/h2>\n<p><span class=\"ql-size-16px\">Stable Diffusion (SD) is a state-of-the-art generative AI model classified under diffusion models in deep learning. Designed to generate data closely resembling its training data, Stable Diffusion specializes in image processing. It is acclaimed for efficiently generating and modifying images, making it a standout in text-to-image technology. 
This efficiency, coupled with its open-source nature, has garnered widespread interest in the technology community.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/figure1.png\" alt=\"\"><\/span><\/p>\n<h2><strong class=\"ql-size-28px\">Underlying Technology of Stable Diffusion<\/strong><\/h2>\n<p><span class=\"ql-size-16px\">Stable Diffusion, as a latent diffusion model, revolutionizes image processing by compressing images into a significantly smaller latent space, instead of operating in the traditional high-dimensional image space. This approach boosts the model&#8217;s speed and efficiency.<\/span><\/p>\n<p><span class=\"ql-size-16px\">The capabilities of Stable Diffusion are diverse. They include text-to-image generation, image-to-image translation, and image enhancement tasks like super-resolution and colorization. It utilizes a Variational Autoencoder (VAE) comprising an encoder for compressing the image into the latent space and a decoder for reconstructing the image from this compressed form. We will showcase this later.<\/span><\/p>\n<p><span class=\"ql-size-16px\">In terms of its diffusion process, Stable Diffusion employs both forward and reverse diffusion techniques. Forward diffusion involves gradually adding noise to an image until it becomes random noise. Reverse diffusion, conversely, involves starting with this noise and iteratively removing it to create an image. All these diffusion processes occur in the latent space during training. Instead of corrupting an image with noise in the image space, Stable Diffusion corrupts the representation of the image in the latent space with latent noise. This process is faster due to the smaller size of the latent space.<\/span><\/p>\n<p><span class=\"ql-size-16px\">Conditioning plays a crucial role in how Stable Diffusion converts text into images. 
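The forward and reverse diffusion described above can be made concrete with a small NumPy sketch. This is a toy illustration on a made-up latent, not SD's actual implementation; the schedule, shapes, and the perfect-noise shortcut are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A latent is far smaller than the image it encodes (e.g. 4x64x64 vs 3x512x512),
# which is why diffusing in latent space is fast.
latent = rng.standard_normal((4, 8, 8))   # toy latent representation
T = 50                                    # number of diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

def forward_diffuse(x0, t, noise):
    # Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

noise = rng.standard_normal(latent.shape)
x_T = forward_diffuse(latent, T - 1, noise)   # the latent is now mostly noise

# Reverse diffusion inverts this. With the true noise in hand, the original
# latent comes back exactly; a real model only *estimates* the noise.
recovered = (x_T - np.sqrt(1.0 - alpha_bar[T - 1]) * noise) / np.sqrt(alpha_bar[T - 1])
print(np.allclose(recovered, latent))     # True
```

In the real model the true noise is unknown: a trained noise predictor (the UNet) estimates it at each timestep, and the conditioning discussed here is what steers that estimate.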
It involves steering the noise predictor to produce the desired outcome based on the text prompt. For example, when given prompts like &#8220;paradise,&#8221; &#8220;cosmic,&#8221; or &#8220;beach,&#8221; the model generates images that visually align with these concepts, creating scenes with elements like clear skies or vast beaches. This innovative process allows Stable Diffusion to interpret and visualize textual descriptions effectively.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/720X720.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">Here is how Stable Diffusion actually works in the text-to-image process.<\/span><\/p>\n<ol>\n<li><strong class=\"ql-size-16px\">Text Representation Generation (Text Encoder &#8211; Blue Module):<\/strong><\/li>\n<li class=\"ql-indent-1\"><span class=\"ql-size-16px\">Tokenizing and Encoding: The model tokenizes the input text prompt into a standardized sequence of tokens. Each token is then converted into a text vector using CLIP&#8217;s text encoder, creating a representation rich in image-related information.<\/span><\/li>\n<li><strong class=\"ql-size-16px\">Image Representation Refining (Image Information Creator &#8211; Pink Module)<\/strong><span class=\"ql-size-16px\">:<\/span><\/li>\n<li class=\"ql-indent-1\"><strong class=\"ql-size-16px\">Initial Noise and Refinement:<\/strong><span class=\"ql-size-16px\">\u00a0The process begins with random noise, which is refined over multiple iterations (typically 30-50 timesteps).<\/span><\/li>\n<li class=\"ql-indent-1\"><strong class=\"ql-size-16px\">UNet Process<\/strong><span class=\"ql-size-16px\">: At each timestep, the UNet network, integral to the Image Information Creator, predicts and removes noise from the image representation. 
This is guided by the text vectors and a scheduler that regulates noise removal, gradually enhancing the image quality.<\/span><\/li>\n<li><strong class=\"ql-size-16px\">Image Upscaling (Image Decoder &#8211; Yellow Module)<\/strong><span class=\"ql-size-16px\">:<\/span><\/li>\n<li class=\"ql-indent-1\"><span class=\"ql-size-16px\">Upscaling to High Resolution: After the refinement process, the Image Decoder upscales the detailed image representation into a high-resolution image that closely aligns with the text prompt.<\/span><\/li>\n<\/ol>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/figure3.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/640-2.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/640.jpeg\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">As depicted, by feeding both the initial pure noise vector and the subsequently denoised latent vector into the Image Decoder, we can discern the stark contrast in the resulting images. The sequence reveals that the pure noise vector, devoid of meaningful content, translates into an image comprised solely of noise. Conversely, the latent vector, having undergone 50 iterations of denoising, incorporates semantic information, leading to an image that effectively embodies this semantic content.<\/span><\/p>\n<h2><strong class=\"ql-size-28px\">Integrating Stable Diffusion with CAPTCHA Generation<\/strong><\/h2>\n<p><span class=\"ql-size-16px\">The adoption of the SD model for CAPTCHA generation has significantly bolstered the security of these systems. 
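Tying the three modules above together (text encoder, image information creator, image decoder), the end-to-end flow can be sketched with toy stand-ins. Every function here is a simplified assumption, not the real CLIP, UNet, or VAE:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_text(prompt):
    # 1. Text encoder: tokenize the prompt and map each token to a vector.
    #    (CLIP does this with a learned transformer; this is a toy hash.)
    tokens = prompt.lower().split()
    return np.array([hash(t) % 997 for t in tokens], dtype=float) / 997.0

def predict_noise(latent, text_emb, t):
    # 2. UNet stand-in: predict the noise in the latent, conditioned on text.
    return 0.1 * latent + 0.01 * text_emb.mean()

def decode(latent):
    # 3. Image decoder stand-in: upscale the refined latent into image space.
    return np.clip(latent.repeat(8, axis=0).repeat(8, axis=1), -1, 1)

text_emb = encode_text('a sandy beach under a clear sky')
latent = rng.standard_normal((8, 8))          # start from pure random noise
for t in range(50, 0, -1):                    # iterative refinement, 50 timesteps
    latent = latent - predict_noise(latent, text_emb, t) / t

image = decode(latent)
print(image.shape)                            # (64, 64)
```

The structure mirrors the walkthrough: the text vectors are computed once, the denoising loop consults them at every timestep, and only the final latent is decoded to full resolution.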
SD&#8217;s advanced latent diffusion techniques enable the production of complex verification images, overcoming the common vulnerabilities and inefficiencies of traditional CAPTCHAs.<\/span><\/p>\n<h3><strong class=\"ql-size-22px\">Enhanced Security Through Text Effects<\/strong><\/h3>\n<p><span class=\"ql-size-16px\">The SD model introduces sophisticated visual effects like shadow text, which is challenging for AI recognition systems yet discernible to humans. Utilizing ControlNet, SD manipulates light and shadow to create image-based CAPTCHAs with deliberately vague and distorted characters, effectively confusing automated image recognition models.<\/span><\/p>\n<p><span class=\"ql-size-16px\">For instance, Chinese characters like &#8220;\u51b0&#8221; (ice), &#8220;\u62ff&#8221; (take), and &#8220;\u94c1&#8221; (iron), crafted with shadow effects from environmental elements, remain clear to human users, while their atypical character formation stumps image recognition algorithms.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/bingnatie.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/bingnatie2.jpeg\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">Similarly, characters such as &#8220;\u66f2\u5947&#8221; (cookie), &#8220;\u9ed1\u68ee\u6797&#8221; (black forest), &#8220;\u679c\u51bb&#8221; (jelly), and &#8220;\u84dd\u8393&#8221; (blueberry) are easily distinguishable by people against noisy backdrops but are often misinterpreted by image recognition models. 
By weaving shadow art into the base image and introducing deliberate errors in shadow placement, overlap, and alignment, SD achieves a high AI recognition failure rate: 99.74% in tests involving 5,000 shadow images.\u00a0<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/51b3dc65-ac31-40d5-a127-6727a75be1d5.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">This approach not only maintains accuracy for human users but also significantly increases the difficulty for bots, enhancing CAPTCHA security beyond traditional character warping and background interference methods.<\/span><\/p>\n<h3><strong class=\"ql-size-22px\">Elevated Aesthetics and User-Friendly Experience<\/strong><\/h3>\n<p><span class=\"ql-size-16px\">SD&#8217;s advanced technology not only fortifies CAPTCHAs against automated attacks but also enhances their realism and aesthetic appeal. These CAPTCHAs, distinguished by their vibrant colors and sharp resolution, substantially improve the user experience.\u00a0<\/span><\/p>\n<p><span class=\"ql-size-16px\">GeeTest&#8217;s integration of SD below exemplifies how shadow text can be merged into engaging images, finely tuned to balance security and usability.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/9d9b8abd-538c-457c-82ed-a65be8aa7860.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">Wu Yuan, CEO of GeeTest, emphasizes the design challenge of CAPTCHAs: they must prevent bot intrusions without degrading the user experience. Adopting SD to process character-based icon CAPTCHAs has proven popular among users. 
The resulting livelier, clearer images enable users to complete verifications swiftly, reducing the time to just three seconds, significantly less than that required by traditional CAPTCHAs.<\/span><\/p>\n<h3><strong class=\"ql-size-22px\">Increased Efficiency<\/strong><\/h3>\n<p><span class=\"ql-size-16px\">SD integration has revolutionized CAPTCHA design, phasing out the need for manual image creation. Inputting a text prompt into SD quickly produces intricate validation images, greatly reducing time and labor. GeeTest&#8217;s CAPTCHA V4 introduces an automated image update system, enhancing security against brute-force attacks and improving image generation speed by 30%.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/147acf00-a9a2-426d-b45f-62f272672637.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">This integration proves highly effective, with SD surpassing traditional methods in security, efficiency, and user experience. It significantly speeds up CAPTCHA image production, boosting system responsiveness. Combining SD with Generative Adversarial Networks (GANs) yields CAPTCHAs that are resilient against advanced cracking tactics, marking a leap forward in bot detection and prevention strategies.<\/span><\/p>\n<h2><strong class=\"ql-size-28px\">Technical Advancements in CAPTCHA Creation Using the SD Model<\/strong><\/h2>\n<p><span class=\"ql-size-16px\">The SD model redefines image generation as a diffusion process that progressively eliminates noise. Beginning with random Gaussian noise, it methodically removes noise, guided by its training, until the image is noise-free, ultimately producing visuals that closely mirror textual prompts. 
However, this denoising is resource-intensive, particularly for high-resolution image production, posing challenges in the efficient allocation of computational resources and in scaling GPU utilization.<\/span><\/p>\n<p><span class=\"ql-size-16px\">To address these challenges, we&#8217;ve identified three strategic objectives:<\/span><\/p>\n<ol>\n<li><span class=\"ql-size-16px\">Model as a Service: Given the necessity of GPU usage for larger models, we must consider the higher costs associated with cloud-based resources.<\/span><\/li>\n<li><span class=\"ql-size-16px\">Cost-Effective Resource Access: Initially, to control upfront investments, we prefer a pay-as-you-go approach to GPU resources, avoiding hefty monthly fees and optimizing resource usage.<\/span><\/li>\n<li><span class=\"ql-size-16px\">Streamlined Model Service Code: The codebase for model services should be compact and designed for easy horizontal scaling.<\/span><\/li>\n<\/ol>\n<p><span class=\"ql-size-16px\">In response, we&#8217;ve developed a model service architecture utilizing Ray and Kubernetes (K8s):<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/25893e2b-13ae-43e7-af6f-a6fe3da4035a.png\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\">This framework enables the deployment of a model service with a lean codebase, substantially curtailing both memory usage and computational expenses.<\/span><\/p>\n<p><span class=\"ql-size-16px\">Additionally, for the collective management and generation of CAPTCHA image sets, we&#8217;ve crafted a suite of functional interfaces around ray.serve and the SD model&#8217;s framework. 
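As a rough illustration of what such interfaces might look like, here is a standard-library-only sketch of a prompt database feeding a batch-generation pipeline. The class, schema, and method names are hypothetical, and the model call is faked; in the real architecture it would hit the ray.serve model service:

```python
import sqlite3

class CaptchaPipeline:
    # Hypothetical interface layer: manages prompts and drives batch generation.
    def __init__(self):
        self.db = sqlite3.connect(':memory:')
        self.db.execute(
            'CREATE TABLE prompts (id INTEGER PRIMARY KEY, text TEXT, used INTEGER DEFAULT 0)')

    def add_prompt(self, text):
        self.db.execute('INSERT INTO prompts (text) VALUES (?)', (text,))

    def generate_batch(self, n):
        # Pull the n least-used prompts and send each one to the model service.
        rows = self.db.execute(
            'SELECT id, text FROM prompts ORDER BY used LIMIT ?', (n,)).fetchall()
        images = []
        for pid, text in rows:
            images.append(self._call_model_service(text))  # would be a Ray/HTTP call
            self.db.execute('UPDATE prompts SET used = used + 1 WHERE id = ?', (pid,))
        return images

    def _call_model_service(self, prompt):
        return '[image for: ' + prompt + ']'               # stand-in for SD output

pipe = CaptchaPipeline()
pipe.add_prompt('shadow text of the character for ice on a beach')
pipe.add_prompt('cookie icons scattered over a noisy background')
batch = pipe.generate_batch(2)
print(len(batch))   # 2
```

Keeping the interface this thin is part of what makes horizontal scaling straightforward: each replica of the model service stays stateless, while prompt bookkeeping lives in one shared store.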
These interfaces are dedicated to managing the prompt database and streamlining the automated image-production pipeline.<\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/237e61a8-80c2-4286-a070-78f1e0c7793d.jpeg\" alt=\"\"><\/span><\/p>\n<p><span class=\"ql-size-16px\"><img decoding=\"async\" src=\"https:\/\/geetests.com\/wp-content\/uploads\/2025\/09\/WX20231229-202141@2x.png\" alt=\"\"><\/span><\/p>\n<h2><strong class=\"ql-size-28px\">Closing Thoughts<\/strong><\/h2>\n<p><span class=\"ql-size-16px\">The open-source Stable Diffusion model, a standout in latent diffusion technology, eclipses competitors like DALL\u00b7E and Midjourney with its rapid development and versatility. Its integration across various platforms and access to numerous pre-trained models highlight its adaptability. The community&#8217;s active engagement has propelled SD to the forefront of diverse image generation.<\/span><\/p>\n<p><span class=\"ql-size-16px\">SD&#8217;s innovation extends beyond image creation to revolutionize human-computer interaction. Utilizing latent diffusion and Generative Adversarial Networks, it excels in producing complex, realistic CAPTCHA images, enhancing security and user experience. 
This advancement positions SD to bring transformative changes in digital security across industries, marking an exciting era of technological evolution.<\/span><\/div>\n<p><!-- .vgblk-rw-wrapper --><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Discover how Stable Diffusion is transforming CAPTCHA generation, enhancing security while adding an artistic dimension to user experience in our latest insightful article.<\/p>\n","protected":false},"author":7,"featured_media":996357,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[90],"tags":[],"class_list":["post-997138","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cyberwatch"],"_links":{"self":[{"href":"\/en\/wp-json\/wp\/v2\/posts\/997138","targetHints":{"allow":["GET"]}}],"collection":[{"href":"\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"\/en\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"\/en\/wp-json\/wp\/v2\/comments?post=997138"}],"version-history":[{"count":3,"href":"\/en\/wp-json\/wp\/v2\/posts\/997138\/revisions"}],"predecessor-version":[{"id":997700,"href":"\/en\/wp-json\/wp\/v2\/posts\/997138\/revisions\/997700"}],"wp:featuredmedia":[{"embeddable":true,"href":"\/en\/wp-json\/wp\/v2\/media\/996357"}],"wp:attachment":[{"href":"\/en\/wp-json\/wp\/v2\/media?parent=997138"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"\/en\/wp-json\/wp\/v2\/categories?post=997138"},{"taxonomy":"post_tag","embeddable":true,"href":"\/en\/wp-json\/wp\/v2\/tags?post=997138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}