
  • @xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

    As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here’s a snippet from the YAML config to illustrate how that works:

    extract:
      events:
        selector: "results[*]"
        fields:
          url: pdf_url
          title: title
          order_number: executive_order_number
    
    download:
      extensions: [".pdf"]
    
    gpt:
      prompt: |
        Analyze this Executive Order document:
        - Purpose: 1–2 sentences
        - Key provisions: 3–5 bullet points
        - Agencies involved: list
        - Revokes/amends: if any
        - Policy impact: neutral analysis
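For anyone who wants to picture what the `extract` and `download` sections do, here is a rough Python sketch. It is not the actual pipeline code: the function names (`select_items`, `extract_events`, `new_events`), the in-memory `CONFIG` dict, and the idea of tracking already-seen URLs in a set are all assumptions made for illustration, and the selector handling only covers the simple `key[*]` form shown in the config.

```python
from urllib.parse import urlparse

# Hypothetical in-memory copy of the `extract` and `download` config sections.
CONFIG = {
    "extract": {
        "events": {
            "selector": "results[*]",
            "fields": {
                "url": "pdf_url",
                "title": "title",
                "order_number": "executive_order_number",
            },
        }
    },
    "download": {"extensions": [".pdf"]},
}


def select_items(feed, selector):
    """Resolve only the simple `key[*]` selector form (an assumption)."""
    if not selector.endswith("[*]"):
        raise ValueError(f"unsupported selector: {selector}")
    return feed.get(selector[:-3], [])


def extract_events(feed, config=CONFIG):
    """Map each feed item to the configured fields, keeping allowed file types."""
    spec = config["extract"]["events"]
    extensions = tuple(config["download"]["extensions"])
    events = []
    for item in select_items(feed, spec["selector"]):
        # Output field name -> value read from the source field name.
        event = {out: item.get(src) for out, src in spec["fields"].items()}
        path = urlparse(event["url"] or "").path.lower()
        if path.endswith(extensions):
            events.append(event)
    return events


def new_events(events, seen_urls):
    """Keep only events whose URL has not been processed before."""
    return [event for event in events if event["url"] not in seen_urls]
```

Usage would look something like `new_events(extract_events(feed), seen_urls)`, where `feed` is the parsed JSON from the source and `seen_urls` is whatever store of processed URLs you keep between runs. Each remaining event would then be downloaded and passed to the model with the `gpt.prompt` text.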
    

    To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

    processing:
      extract_regex:
        - "object of cultural heritage"
        - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
        - "project(?:s)?"
        - "circumstances"
        - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
        - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
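A minimal Python sketch of how a pattern list like this could be applied before prompting. The function name `extract_blocks`, the case-insensitive matching, and the de-duplication are assumptions for illustration, not the pipeline's actual behavior. Note that the doubled backslashes in the YAML double-quoted strings become single backslashes once parsed, which is what the raw strings below contain.

```python
import re

# The same patterns as in the config, after YAML unescaping.
PATTERNS = [
    r"object of cultural heritage",
    r"address[:\s]\s*(.{10,100}?)(?=\n|$)",
    r"project(?:s)?",
    r"circumstances",
    r"client\s*:?\s*(.{10,100}?)(?=\n|$)",
    r"(?:conclusions?)\s*(.{50,300}?)(?=\n|$)",
]


def extract_blocks(text, patterns=PATTERNS):
    """Collect matched snippets in pattern order, without duplicates."""
    blocks = []
    for pattern in patterns:
        # Case-insensitive matching is an assumption, not taken from the config.
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Prefer the capture group when the pattern defines one,
            # otherwise fall back to the whole match.
            value = match.group(1) if match.re.groups else match.group(0)
            value = value.strip()
            if value and value not in blocks:
                blocks.append(value)
    return blocks
```

The returned snippets can be joined and sent to the model instead of the full document text. Patterns without a capture group (such as `circumstances`) only return the keyword itself, so in this sketch they work more as relevance markers than as content extractors.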
    

    Let me know if you’re experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!