Web_scraping_project

Author	SHA1	Message	Date
Ofure Ikheloa	c370de83d5	Refactor scraper and sender modules for improved Redis management and SSL connection handling - Introduced RedisManager class in scraper.py for centralized Redis operations including job tracking and caching. - Enhanced job scraping logic in MultiPlatformJobScraper to support multiple platforms (Ashby, Lever, Greenhouse). - Updated browser initialization and context management to ensure better resource handling. - Improved error handling and logging throughout the scraping process. - Added SSL connection parameters management in a new ssl_connection.py module for RabbitMQ connections. - Refactored sender.py to utilize RedisManager for job deduplication and improved logging mechanisms. - Enhanced CSV processing logic in sender.py with better validation and error handling. - Updated connection parameters for RabbitMQ to support SSL configurations based on environment variables.	2025-12-12 13:48:26 +01:00
Ofure Ikheloa	160efadbfb	Modifications to work with PostgreSQL and use an LLM to extract and refine data	2025-12-05 17:00:43 +01:00
Ofure Ikheloa	d7d92ba8bb	fix(job_scraper): increase timeout values for page navigation The previous timeout values were too short for slower network conditions, causing premature timeouts during job scraping. Increased wait_for_function timeout from 30s to 80s and load_state timeout from 30s to 60s to accommodate slower page loads.	2025-11-27 12:28:21 +01:00
Ofure Ikheloa	fd4e8c9c05	feat(scraper): add LLM-powered job data refinement and new scraping logic - Implement LLMJobRefiner class for processing job data with Gemini API - Add new job_scraper2.py with enhanced scraping capabilities - Remove search_keywords parameter from scraping engine - Add environment variable loading in config.py - Update main script to use new scraper and target field	2025-11-24 12:25:50 +01:00
Ofure Ikheloa	458e914d71	feat(scraping): enhance job scraping with session persistence and feedback system - Add config module for spoof data management - Implement session persistence to reuse authenticated sessions - Add feedback system to track success rates and adjust fingerprinting - Improve job link collection with pagination and scroll detection - Separate verified/unverified job listings into different folders - Enhance error handling for CAPTCHA and Cloudflare challenges	2025-11-21 16:51:26 +01:00
Ofure	1a216a1aa8	Add scraping_engine.py	2025-11-20 18:58:26 +00:00

6 Commits
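
Note on c370de83d5: the commit message says RabbitMQ connection parameters now support SSL configurations based on environment variables, but the log does not show the contents of ssl_connection.py. The following is only a minimal sketch of that idea, using the Python standard library. The function name and the variable names RABBITMQ_SSL_ENABLED, RABBITMQ_HOST, RABBITMQ_PORT and RABBITMQ_CA_FILE are assumptions made for illustration and are not taken from the repository.

```python
import os
import ssl


def rabbitmq_params_from_env(env=None):
    """Build RabbitMQ connection settings from environment variables.

    Hypothetical helper: all variable names below are illustrative
    assumptions, not the names used in the repository.
    """
    env = os.environ if env is None else env

    # Treat a few common truthy spellings as "SSL on".
    flag = env.get("RABBITMQ_SSL_ENABLED", "false").strip().lower()
    use_ssl = flag in {"1", "true", "yes"}

    # 5671 is the conventional AMQP-over-TLS port, 5672 the plain AMQP port.
    default_port = 5671 if use_ssl else 5672

    params = {
        "host": env.get("RABBITMQ_HOST", "localhost"),
        "port": int(env.get("RABBITMQ_PORT", default_port)),
        "ssl_context": None,
    }

    if use_ssl:
        # Verify the broker certificate against the given CA bundle,
        # or against the system trust store when no bundle is configured.
        ca_file = env.get("RABBITMQ_CA_FILE") or None
        params["ssl_context"] = ssl.create_default_context(cafile=ca_file)

    return params
```

A client library would receive the returned host, port and SSL context when opening its connection. How sender.py or scraper.py actually consume these settings is not shown in this log.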