31 Commits

Author SHA1 Message Date
94d87943de Refactor environment variable handling in AshbyJobScraper and Sender classes; remove fallback values for RabbitMQ and Redis configurations. 2025-12-10 13:26:47 +01:00
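The change in 94d87943de removes fallback values, so a missing RabbitMQ or Redis setting should fail at startup instead of silently pointing at a default host. A minimal sketch of that pattern follows. The variable names `RABBITMQ_HOST` and `REDIS_HOST` and the helpers `require_env` and `load_broker_config` are assumptions for illustration, not names taken from the repository.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is unset or empty."""


def require_env(name: str) -> str:
    # No default is supplied on purpose: an unset or empty value is an error.
    value = os.environ.get(name)
    if not value:
        raise MissingConfigError(f"Required environment variable {name!r} is not set")
    return value


def load_broker_config() -> dict:
    # Hypothetical variable names; the real project may use different ones.
    return {
        "rabbitmq_host": require_env("RABBITMQ_HOST"),
        "redis_host": require_env("REDIS_HOST"),
    }
```

Calling `load_broker_config()` once at process start surfaces a misconfigured deployment immediately, before any scraping or publishing begins.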
762846cb4a Add AshbyJobScraper and Sender classes for job scraping and message sending; implement Redis caching and RabbitMQ integration. 2025-12-10 12:02:43 +01:00
2d22fbdb92 Enhance AmazonJobScraper to support flexible location matching and extract posted dates; refine LLMJobRefiner prompts for better data extraction. 2025-12-09 12:00:57 +01:00
e216db35f9 Increase max pages to scrape and extend wait time between job title scrapes; add posted date to job data extraction 2025-12-09 09:30:44 +01:00
cbcffa8cd4 modify to queue failed jobs and also extract date of job posting 2025-12-09 09:12:35 +01:00
4782f174e2 Delete browser_sessions/job_scraping_12_session.json 2025-12-05 17:49:56 +00:00
10fa1ac633 Delete browser_sessions/job_scraping_123_session.json 2025-12-05 17:49:46 +00:00
ba783112f5 Delete spoof_config.json 2025-12-05 17:49:30 +00:00
9ed5641540 Delete tr.py 2025-12-05 16:50:52 +00:00
370fce0514 Merge branch 'amazon_agent' of https://gitea.thejobhub.xyz/Ofure/Web_scraping_project into amazon_agent 2025-12-05 17:50:10 +01:00
efa47d50ae amazon specific built engine 2025-12-05 17:49:31 +01:00
e49860faae Delete linkedin_main.py 2025-12-05 16:45:12 +00:00
0942339426 Delete job_scraper2.py 2025-12-05 16:44:52 +00:00
7e80801f89 Delete job_scraper.py 2025-12-05 16:44:23 +00:00
06f9820c38 Delete feedback_job_scraping_123.json 2025-12-05 16:44:08 +00:00
fbde4d03e1 Delete feedback_job_scraping_12.json 2025-12-05 16:43:42 +00:00
d0aabc5970 Delete .env 2025-12-05 16:43:25 +00:00
672c6a0333 scraper for amazon 2025-12-05 17:25:54 +01:00
224b9c3122 llm_agent now responsible for extraction. 2025-12-05 17:23:31 +01:00
160efadbfb Modifications to work with PostgreSQL and use the LLM to extract and refine data 2025-12-05 17:00:43 +01:00
4f78a845ae refactor(llm_agent): switch from XAI to DeepSeek API and simplify job refinement
- Replace XAI/Grok integration with DeepSeek's OpenAI-compatible API
- Remove schema generation and caching logic
- Simplify prompt structure and response parsing
- Standardize database schema and markdown output format
- Update config to use DEEPSEEK_API_KEY instead of XAI_API_KEY
- Change default search keyword in linkedin_main.py
2025-12-01 10:25:37 +01:00
d7d92ba8bb fix(job_scraper): increase timeout values for page navigation
The previous timeout values were too short for slower network conditions, causing premature timeouts during job scraping. Increased wait_for_function timeout from 30s to 80s and load_state timeout from 30s to 60s to accommodate slower page loads.
2025-11-27 12:28:21 +01:00
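The timeout change in d7d92ba8bb can be sketched as below. The method names `wait_for_load_state` and `wait_for_function` match Playwright's page API, which takes timeouts in milliseconds. That the scraper uses Playwright is an assumption, as are the constant names, the helper `wait_for_job_page`, and the order of the two waits.

```python
# Values from the commit message, converted to milliseconds.
LOAD_STATE_TIMEOUT_MS = 60_000        # previously 30 s
WAIT_FOR_FUNCTION_TIMEOUT_MS = 80_000  # previously 30 s


def wait_for_job_page(page, ready_expression: str, load_state: str = "domcontentloaded") -> None:
    """Block until the page reaches `load_state` and `ready_expression` is truthy.

    `page` is assumed to be a Playwright-style page object (hypothetical here);
    `ready_expression` is a JavaScript expression evaluated in the page.
    """
    page.wait_for_load_state(load_state, timeout=LOAD_STATE_TIMEOUT_MS)
    page.wait_for_function(ready_expression, timeout=WAIT_FOR_FUNCTION_TIMEOUT_MS)
```

Keeping both values as named constants makes it easier to tune them for slow networks without hunting through the scraper classes.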
d025828036 feat: update LLM model and increase content size limit
refactor: update timeout values in job scraper classes

feat: add spoof config for renderers and vendors

build: update pycache files for config and modules
2025-11-24 13:47:47 +01:00
fd4e8c9c05 feat(scraper): add LLM-powered job data refinement and new scraping logic
- Implement LLMJobRefiner class for processing job data with Gemini API
- Add new job_scraper2.py with enhanced scraping capabilities
- Remove search_keywords parameter from scraping engine
- Add environment variable loading in config.py
- Update main script to use new scraper and target field
2025-11-24 12:25:50 +01:00
7dca4c9159 refactor(job_scraper): improve page loading and typing in linkedin scraper
- Change page load strategy from 'load' to 'domcontentloaded' and 'networkidle' for better performance
- Make search_keywords parameter optional to handle empty searches
- Update type imports to include List for better type hints
- Set headless mode to true by default for production use
2025-11-23 09:27:05 +01:00
458e914d71 feat(scraping): enhance job scraping with session persistence and feedback system
- Add config module for spoof data management
- Implement session persistence to reuse authenticated sessions
- Add feedback system to track success rates and adjust fingerprinting
- Improve job link collection with pagination and scroll detection
- Separate verified/unverified job listings into different folders
- Enhance error handling for CAPTCHA and Cloudflare challenges
2025-11-21 16:51:26 +01:00
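Session persistence as described in 458e914d71 can be approximated by writing session data to a JSON file and reloading it while it is still fresh. The sketch below is one possible shape, not the repository's implementation: the file layout (`saved_at`, `cookies`), the helper names, and the age limit are all assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional


def save_session(path: Path, cookies: list, now: Optional[float] = None) -> None:
    # Store the cookies with a timestamp so stale sessions can be rejected later.
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"saved_at": time.time() if now is None else now, "cookies": cookies}
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_session(path: Path, max_age_seconds: float, now: Optional[float] = None) -> Optional[list]:
    # Return the saved cookies, or None if the file is missing, corrupt, or too old.
    if not path.exists():
        return None
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
    current = time.time() if now is None else now
    if current - payload.get("saved_at", 0) > max_age_seconds:
        return None
    return payload.get("cookies")
```

A caller would try `load_session` first and fall back to a fresh login only when it returns `None`, then call `save_session` after authenticating.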
68495a0a54 Update README.md 2025-11-21 08:53:05 +00:00
01d4ca8001 Add linkedin_main.py 2025-11-20 19:00:43 +00:00
f52868edfa Add job_scraper.py 2025-11-20 18:59:46 +00:00
1a216a1aa8 Add scraping_engine.py 2025-11-20 18:58:26 +00:00
28d7197378 Initial commit 2025-11-20 18:56:21 +00:00