Web_scraping_project

Author	SHA1	Message	Date
Ofure Ikheloa	94d87943de	Refactor environment variable handling in AshbyJobScraper and Sender classes; remove fallback values for RabbitMQ and Redis configurations.	2025-12-10 13:26:47 +01:00
Ofure Ikheloa	762846cb4a	Add AshbyJobScraper and Sender classes for job scraping and message sending; implement Redis caching and RabbitMQ integration.	2025-12-10 12:02:43 +01:00
Ofure Ikheloa	2d22fbdb92	Enhance AmazonJobScraper to support flexible location matching and extract posted dates; refine LLMJobRefiner prompts for better data extraction.	2025-12-09 12:00:57 +01:00
Ofure Ikheloa	e216db35f9	Increase max pages to scrape and extend wait time between job title scrapes; add posted date to job data extraction	2025-12-09 09:30:44 +01:00
Ofure Ikheloa	cbcffa8cd4	modify to queue failed jobs and also extract date of job posting	2025-12-09 09:12:35 +01:00
Ofure	4782f174e2	Delete browser_sessions/job_scraping_12_session.json	2025-12-05 17:49:56 +00:00
Ofure	10fa1ac633	Delete browser_sessions/job_scraping_123_session.json	2025-12-05 17:49:46 +00:00
Ofure	ba783112f5	Delete spoof_config.json	2025-12-05 17:49:30 +00:00
Ofure	9ed5641540	Delete tr.py	2025-12-05 16:50:52 +00:00
Ofure Ikheloa	370fce0514	Merge branch 'amazon_agent' of https://gitea.thejobhub.xyz/Ofure/Web_scraping_project into amazon_agent	2025-12-05 17:50:10 +01:00
Ofure Ikheloa	efa47d50ae	amazon specific built engine	2025-12-05 17:49:31 +01:00
Ofure	e49860faae	Delete linkedin_main.py	2025-12-05 16:45:12 +00:00
Ofure	0942339426	Delete job_scraper2.py	2025-12-05 16:44:52 +00:00
Ofure	7e80801f89	Delete job_scraper.py	2025-12-05 16:44:23 +00:00
Ofure	06f9820c38	Delete feedback_job_scraping_123.json	2025-12-05 16:44:08 +00:00
Ofure	fbde4d03e1	Delete feedback_job_scraping_12.json	2025-12-05 16:43:42 +00:00
Ofure	d0aabc5970	Delete .env	2025-12-05 16:43:25 +00:00
Ofure Ikheloa	672c6a0333	scraper for amazon	2025-12-05 17:25:54 +01:00
Ofure Ikheloa	224b9c3122	llm_agent now responsible for extraction.	2025-12-05 17:23:31 +01:00
Ofure Ikheloa	160efadbfb	modifications to work with postgre and use llm to extract and refine data	2025-12-05 17:00:43 +01:00
Ofure Ikheloa	4f78a845ae	refactor(llm_agent): switch from XAI to DeepSeek API and simplify job refinement - Replace XAI/Grok integration with DeepSeek's OpenAI-compatible API - Remove schema generation and caching logic - Simplify prompt structure and response parsing - Standardize database schema and markdown output format - Update config to use DEEPSEEK_API_KEY instead of XAI_API_KEY - Change default search keyword in linkedin_main.py	2025-12-01 10:25:37 +01:00
Ofure Ikheloa	d7d92ba8bb	fix(job_scraper): increase timeout values for page navigation. The previous timeout values were too short for slower network conditions, causing premature timeouts during job scraping. Increased wait_for_function timeout from 30s to 80s and load_state timeout from 30s to 60s to accommodate slower page loads.	2025-11-27 12:28:21 +01:00
Ofure Ikheloa	d025828036	feat: update LLM model and increase content size limit; refactor: update timeout values in job scraper classes; feat: add spoof config for renderers and vendors; build: update pycache files for config and modules	2025-11-24 13:47:47 +01:00
Ofure Ikheloa	fd4e8c9c05	feat(scraper): add LLM-powered job data refinement and new scraping logic - Implement LLMJobRefiner class for processing job data with Gemini API - Add new job_scraper2.py with enhanced scraping capabilities - Remove search_keywords parameter from scraping engine - Add environment variable loading in config.py - Update main script to use new scraper and target field	2025-11-24 12:25:50 +01:00
Ofure Ikheloa	7dca4c9159	refactor(job_scraper): improve page loading and typing in linkedin scraper - Change page load strategy from 'load' to 'domcontentloaded' and 'networkidle' for better performance - Make search_keywords parameter optional to handle empty searches - Update type imports to include List for better type hints - Set headless mode to true by default for production use	2025-11-23 09:27:05 +01:00
Ofure Ikheloa	458e914d71	feat(scraping): enhance job scraping with session persistence and feedback system - Add config module for spoof data management - Implement session persistence to reuse authenticated sessions - Add feedback system to track success rates and adjust fingerprinting - Improve job link collection with pagination and scroll detection - Separate verified/unverified job listings into different folders - Enhance error handling for CAPTCHA and Cloudflare challenges	2025-11-21 16:51:26 +01:00
Ofure	68495a0a54	Update README.md	2025-11-21 08:53:05 +00:00
Ofure	01d4ca8001	Add linkedin_main.py	2025-11-20 19:00:43 +00:00
Ofure	f52868edfa	Add job_scraper.py	2025-11-20 18:59:46 +00:00
Ofure	1a216a1aa8	Add scraping_engine.py	2025-11-20 18:58:26 +00:00
Ofure	28d7197378	Initial commit	2025-11-20 18:56:21 +00:00

31 Commits