Extracting Blocktext tags (i.e. sections of filings)

    • #202987
      Satish Sahoo

      Hi David/Others,
      I want to extract certain sections of 10-K filings. For example, I tried to extract segment information using the ‘SegmentReportingDisclosureTextBlock’ tag, hoping to get the entire section (including the tables and text).
      While I am able to pull the tag, its fact.value gives me only HTML markup rather than readable text.
      For example, I get:
      <p style=’margin-top:0pt; margin-bottom:0pt’><font style=”font-family:Times New Roman;font-size:10pt;font-weight:bold;margin-left:0px;”>NOTE 4.</font><font style=”font-family:Times New Roman;font-…

      <div style=”font-family:Times New Roman;font-size:10pt;”><div style=”line-height:120%;padding-top:18px;font-size:10pt;”><font style=”font-family:inherit;font-size:10pt;font-weight:bold;”>ACQUISITI…

      ACQUISITIONS, GOODWILL, AND ACQUIRED INTANGIBLE ASSETS<div style=”line-height:120%;padding-top:6px;text-indent:16px;font-size:10pt;”><span style=”font-family:inherit;font-size:10pt;font-style:ital…

      How can I get the entire section’s text and formatting information so that I can put it in another HTML or text file? I tried footnote.* as well, without any luck.

      Your help would be much appreciated.


    • #203004

      Hi Satish – these facts are HTML-encoded; there is no ‘plain text’ version. The data is in there, but it may be nested under several HTML tags used for formatting. You have a couple of options:

      • use a regex in your routine to remove the tags after you’ve retrieved the data (something like <.*?> should leave you with plain text, which might be tough to read if the tags are simply deleted; maybe replace each tag with a space, tab or line break instead?)
      • concatenate this string with the fact.id to create a URL that renders the fact: CONCAT( https://csuite.xbrl.us/php/dispatch.php?Task=htmlExportFact&FactID= , xxxxxx ), where xxxxxx is the fact.id. We’re using this approach in some of the spreadsheet templates posted in the XBRL Data Community.
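
Both options can be sketched in a few lines of Python. This is a minimal illustration rather than a tested recipe: the sample HTML string and the fact id 123456 are made up, and the helper names strip_tags and fact_url are hypothetical. Only the <.*?> pattern and the URL prefix come from the reply above.

```python
import html
import re

# URL prefix quoted in the reply above; the fact.id is appended to it.
FACT_URL = "https://csuite.xbrl.us/php/dispatch.php?Task=htmlExportFact&FactID="

# Non-greedy tag pattern from the reply; DOTALL lets a tag span line breaks.
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(fact_value):
    """Replace each HTML tag with a space, decode entities, collapse whitespace."""
    text = TAG.sub(" ", fact_value)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def fact_url(fact_id):
    """Build the browser URL that renders a single fact."""
    return f"{FACT_URL}{fact_id}"


# Made-up sample in the style of a TextBlock fact.value.
sample = (
    '<p style="margin-top:0pt"><font style="font-weight:bold">NOTE 4.</font>'
    " SEGMENT &amp; GEOGRAPHIC DATA</p>"
)
print(strip_tags(sample))
print(fact_url(123456))
```

A regex is a blunt tool for HTML: it discards table structure along with the tags, so the rendered URL is probably the better choice when the layout matters.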
    • #203027
      Satish Sahoo

      Hi David,
      Thanks a lot for your response. This is very helpful.
      I have another quick, related question. I see that, at least since Inline XBRL was introduced, the section files are posted separately with the filings on the EDGAR website. Is it possible to get the URLs of those files using the API? I’m not sure whether this is within the scope of the API, but if it is, that would be great. Thanks.

      • #203046

        Hi Satish – thanks for your question. As part of the process to keep our Public Filings Database current, we make exact copies of the documents submitted to the SEC, FERC and other regulators that contain XBRL (as .xml instances or .html files that have inline XBRL in them). We do not copy exhibit files (.htm files without XBRL), images, text files, etc. You can use the report.sec-url field to get the page on EDGAR where these additional files exist.

    • #203044
      Satish Sahoo

      Hi David,
      After some more digging, I could get to the files that contain the text for any particular TextBlock tag. But I realized that the fact.value doesn’t contain the entire block; it seems to be truncated. Is there a size limit on the fact.value output? If so, is there a setting that makes fact.value return the entire text block?

      As an example you can check the fact.value of the following fact id.


      • #203047

        Hi Satish – there might be a character limit if you’re trying to get the HTML from a spreadsheet. This is why we use a hyperlink in the spreadsheet to the browser view of the fact when there’s a “<\” character combination.

        If you query with curl or Python, or use an API testing tool, you should see all of the HTML (the data in the HTML we present is the same data in our database).
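
As a rough sketch of the Python route, the snippet below only builds the request and does not send it. The endpoint URL, the concept.local-name and fields parameter names, the bearer-token header and the helper name build_fact_request are all assumptions made for illustration and are not confirmed anywhere in this thread; check them against the current API documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# ASSUMPTION: endpoint and parameter names are illustrative, not confirmed in this thread.
API_URL = "https://api.xbrl.us/api/v1/fact/search"


def build_fact_request(concept, access_token, fields=("fact.id", "fact.value")):
    """Build (but do not send) a GET request for facts tagged with `concept`."""
    query = urlencode(
        {"concept.local-name": concept, "fields": ",".join(fields)},
        safe=",",  # keep the comma-separated field list readable
    )
    return Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Bearer {access_token}"},  # ASSUMPTION: bearer token auth
    )


request = build_fact_request("SegmentReportingDisclosureTextBlock", "YOUR_ACCESS_TOKEN")
print(request.full_url)
```

Passing the request to urllib.request.urlopen and parsing the body with json.load would return the raw response, with no spreadsheet cell limit in the way.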

    • #203054
      Satish Sahoo

      Thanks, David. I think you pointed me in the right direction. I guess the truncation happens when I write the JSON list returned by the API into a pandas DataFrame. So the API is still returning the entire section; it’s just the output rendering that causes the truncation. Thanks.
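
One way to sidestep display truncation altogether is to write each fact.value straight to its own HTML file. The sketch below assumes the API output has already been parsed into a list of dictionaries with fact.id and fact.value keys; the record shown is made up, and save_fact_html is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

# Made-up stand-in for the parsed API output (a list of records).
records = [
    {"fact.id": 1, "fact.value": "<div>" + "<p>Segment detail.</p>" * 200 + "</div>"},
]


def save_fact_html(record, out_dir):
    """Write one record's full fact.value to <out_dir>/fact_<id>.html."""
    path = Path(out_dir) / f"fact_{record['fact.id']}.html"
    path.write_text(record["fact.value"], encoding="utf-8")
    return path


out_dir = tempfile.mkdtemp()  # throwaway directory for the example
for record in records:
    path = save_fact_html(record, out_dir)
    # The length check confirms the value is intact before any display layer touches it.
    print(path.name, len(record["fact.value"]), "characters")
```

If the data stays in pandas, note that the printed view of a DataFrame shortens long strings (the display.max_colwidth option, 50 characters by default) while the stored cell keeps the full value; pd.set_option("display.max_colwidth", None) shows it in full.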
